The Institute of Plant Molecular Biology (IBMP) is the largest CNRS laboratory in Alsace, France. Associated with the University of Strasbourg, it brings together more than 160 researchers, doctoral candidates, and students of various nationalities to study plant development, molecular structures, and viral diseases.
Today, the production of scientific data in digital form is widespread, and new tools such as Next-Generation Sequencing (NGS) have led to explosive growth in data volume. At IBMP, around 80 TB of data is generated annually, and newer methods such as nanopore sequencing, used to determine the arrangement of nucleotides in DNA fragments, add further to data inflation. This information must also be preserved for the long term, typically up to fifteen years, to enable comparison with more recent studies, so the data must remain accessible at all times.
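As a rough sizing sketch, the two figures quoted above (about 80 TB per year, kept for up to fifteen years) can be turned into a retained-volume estimate. The `growth` parameter is a hypothetical knob added for illustration and is not an IBMP figure; the flat case simply assumes output stays at 80 TB every year.

```python
# Figures quoted in the text.
ANNUAL_OUTPUT_TB = 80   # data generated per year at IBMP
RETENTION_YEARS = 15    # typical preservation period


def retained_volume_tb(annual_tb: float, years: int, growth: float = 0.0) -> float:
    """Total volume held after `years` of retention.

    `growth` is an assumed year-over-year increase in output (0.10 = 10 %).
    """
    return sum(annual_tb * (1 + growth) ** year for year in range(years))


flat = retained_volume_tb(ANNUAL_OUTPUT_TB, RETENTION_YEARS)
print(f"Flat output: {flat:.0f} TB")  # Flat output: 1200 TB
```

Even with no growth at all, fifteen years of retention adds up to roughly 1.2 PB, which shows why capacity that scales incrementally matters for this workload.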
Furthermore, given the number and diverse origins of researchers at IBMP, each with their own file identification logic, it is crucial to rely on a truly universal method of data access that allows rapid retrieval from the database. The IT department and scientific community at IBMP took all these factors into account when considering the replacement of their RAID 6 NAS, which no longer met the heavy-duty demands of advanced sequencing methods.
IBMP underwent a comprehensive overhaul of its information system, embracing a range of IT transformations. These included the adoption of server and storage virtualization, as well as the implementation of a highly resilient architecture that is available 24/7. This solution relied on a VMware cluster backed by a software-defined storage (SDS) platform, DataCore SANsymphony, with a capacity of 200 TB.
While this system proved extremely robust, the NAS-based approach to long-term storage became increasingly outdated over time. Operational maintenance grew more complex as capacity increased, and disk reconstruction times after a failure were unreasonably long.
It was therefore imperative to find a solution that could handle increasing capacity with agility and effortlessly manage the oncoming data tsunami. After considering various options, traditional solutions were definitively ruled out, and it was determined that only object storage with S3 access could meet the requirements and budget constraints of the institute.
Following a thorough evaluation of proposals from various vendors, two solutions, including DataCore Swarm, were shortlisted. Given IBMP's excellent support relationship with DataCore, Swarm software-defined object storage emerged as the preferred choice.
- Object-based storage architecture that outperforms traditional file systems
- Excellent resilience to failures, similar to SANsymphony (for block storage)
- A simple and accessible web interface for administration and content access (S3/HTTP)
- Robust storage system with effective data protection utilizing erasure coding
- Significant reduction in power consumption and energy costs through Darkive technology
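The erasure coding mentioned in the list above can be illustrated with a minimal sketch. This is not Swarm's implementation: it uses single XOR parity, the simplest possible erasure code, where production systems typically use Reed-Solomon-style codes with several parity segments. The function names and the sample data are hypothetical.

```python
from functools import reduce
from typing import Optional


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length segments."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode(data: bytes, k: int) -> list[bytes]:
    """Split data into k equal segments and append one XOR parity segment."""
    seg_len = -(-len(data) // k)  # ceiling division
    padded = data.ljust(seg_len * k, b"\0")
    segments = [padded[i * seg_len:(i + 1) * seg_len] for i in range(k)]
    return segments + [reduce(xor_bytes, segments)]


def recover(segments: list[Optional[bytes]]) -> list[Optional[bytes]]:
    """Rebuild at most one missing segment (marked None) from the survivors."""
    missing = [i for i, seg in enumerate(segments) if seg is None]
    if len(missing) > 1:
        raise ValueError("single parity tolerates only one lost segment")
    if missing:
        survivors = [seg for seg in segments if seg is not None]
        segments[missing[0]] = reduce(xor_bytes, survivors)
    return segments


def efficiency(k: int, p: int) -> float:
    """Fraction of raw capacity holding user data with k data and p parity segments."""
    return k / (k + p)


original = b"ACGTTGCAACGTAGCT"
stored = encode(original, k=4)   # 4 data segments + 1 parity segment
stored[2] = None                 # simulate the loss of one disk or node
assert b"".join(recover(stored)[:4]) == original
```

The trade-off is between overhead and fault tolerance: a scheme with k data and p parity segments stores k / (k + p) of raw capacity as user data and survives the loss of any p segments, whereas full replication needs a complete extra copy for each failure it tolerates.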
Long-Term Data Storage with Always-on Access
Currently, Swarm is mainly used by part of the bioinformatics team at IBMP, which generates and manages the largest volumes of data through Next-Generation Sequencing (NGS). While the hardware is fully operational, the software still requires some fine-tuning to facilitate the migration of data into Swarm.
Integrating metadata during data ingestion is a critical next step for IBMP to optimize object retrieval from its extensive database. This will allow IBMP to move away from the conventional, heterogeneous naming schemes adopted by the researchers handling the data, which negatively affect search performance.
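To make the idea of metadata at ingestion concrete, here is a minimal sketch of attaching descriptive fields to an object when it is written over S3. The filename convention, the field names, and the bucket name are all hypothetical, not IBMP's actual scheme. The `client` argument is assumed to be any S3-compatible client exposing boto3's `put_object(Bucket=..., Key=..., Body=..., Metadata=...)` call.

```python
import re

# Hypothetical filename convention: <project>_<sample>_<YYYYMMDD>.<ext>
NAME_PATTERN = re.compile(
    r"(?P<project>[^_]+)_(?P<sample>[^_]+)_(?P<run_date>\d{8})"
)


def metadata_from_name(filename: str) -> dict[str, str]:
    """Derive metadata from a filename; return {} if the name does not match."""
    stem, _, extension = filename.partition(".")
    match = NAME_PATTERN.fullmatch(stem)
    if match is None:
        return {}
    return {**match.groupdict(), "format": extension}


def ingest(client, bucket: str, key: str, body: bytes) -> dict[str, str]:
    """Store an object together with its metadata and return the metadata used."""
    metadata = metadata_from_name(key)
    # S3 user metadata travels with the object as x-amz-meta-* headers.
    client.put_object(Bucket=bucket, Key=key, Body=body, Metadata=metadata)
    return metadata


# Example call (bucket name is illustrative):
#   import boto3
#   s3 = boto3.client("s3", endpoint_url="https://<s3-endpoint>")
#   ingest(s3, "ngs-archive", "rnaseq_col0_20230412.fastq.gz", data)
```

Once fields such as project, sample, and run date are stored with the object, retrieval can query those fields rather than depend on each researcher's naming habits. Names that do not match the convention yield empty metadata here, and a real pipeline would need a policy for them.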
This initiative will take time as the CNRS, the institution’s supervisory body, aims to deploy an Electronic Laboratory Notebook (ELN) with a “digital record” accompanying every scientific data ingestion sequence.
Since several laboratories share an interest in object storage, time is needed to express requirements, coordinate discussions, and share experiences within the ELN working groups.
In the meantime, the bioinformatics data stored on Swarm is already accessible to users through dedicated visualization servers (such as JBrowse for genome identification), and the complete migration to object storage will be facilitated through the ELN.
Primary data ingestion and storage of hot data will continue to be supported by SANsymphony on block storage, which reliably provides all services to IBMP users.
- Swarm object storage cluster formed with 10 Dell PowerEdge servers
- Licensed initially for 850 TB of usable capacity (out of 1.3 PB of total raw capacity)
- VMware ESXi for server virtualization
- Active Directory integration for identity management and access control
- 25 Gbps link and 10 Gbps fiber optic link
- FS switches
- iDRAC connections to monitor remote machines