Everyone knows data is growing at exponential rates. In fact, the numbers can be mind-numbing. That’s certainly the case for genomic data, where 40,000PB of storage will be needed each year by 2025. Understanding, managing and storing this massive amount of data was the topic of our SNIA Cloud Storage Technologies Initiative webcast “Moving Genomics to the Cloud: Compute and Storage Considerations.” If you missed the live presentation, it’s available on-demand along with the presentation slides.
Our live audience asked many interesting questions during the webcast, but we did not have time to answer them all. As promised, our experts, Michael McManus, Torben Kling Petersen and Christopher Davidson have answered them all here.
Q. Human genomes differ only by 1% or so, so there’s an immediate 100x improvement in terms of data compression: 2,743EB could become 27,430PB, which is 2.743M HDDs of 10TB each. We have ~200 countries for the 7.8B people, and if each country had 10 sequencing centers on average, each center would need a mere 1.4K HDDs. Is there really a big challenge here?
A. Unfortunately, the problem is not that simple. The location and size of the genetic differences vary a lot across people. Still, there are compression methods like CRAM and PetaGene that can save a lot of space. Also consider all of the sequencing for rare disease, cancer, single-cell sequencing, etc., plus sequencing for agricultural products.
Q. What’s the best compression ratio for human genome data?
A. CRAM states 30-60% compression and PetaGene cites up to 87% compression, but there are a lot of variables to consider and it depends on the use case (e.g., is this compression for archive or for within-run computing?). Lustre can compress data by roughly half (a compression ratio of 2), though this does not usually include compression of metadata. We have tested PetaGene in our lab and achieved a compression ratio of 2 without any impact on wall-clock time.
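For readers comparing those figures, “percent space saved” and “compression ratio” are two ways of quoting the same thing. Here is a minimal sketch of the conversion, using an illustrative 350GB per-genome footprint rather than a measured value:

```python
# Minimal sketch: convert "percent space saved" into a compression ratio
# and an estimated stored size. The 350 GB per-genome figure is illustrative.

def ratio_from_savings(percent_saved: float) -> float:
    """E.g. 50% space saved -> ratio 2.0; 87% saved -> ratio ~7.7."""
    return 1.0 / (1.0 - percent_saved / 100.0)

def stored_size_gb(raw_gb: float, percent_saved: float) -> float:
    return raw_gb / ratio_from_savings(percent_saved)

raw_gb = 350.0  # illustrative per-genome footprint discussed in the webcast
for saved in (30, 60, 87):
    print(f"{saved}% saved -> ratio {ratio_from_savings(saved):.1f}, "
          f"{raw_gb:.0f} GB stored as ~{stored_size_gb(raw_gb, saved):.0f} GB")
```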
Q. What is the structure of the processed genome file? Is it one large file or multiple small files, and what type of IO workload do they have?
A. The addendum at the end of this presentation covers file formats for genome files, e.g. FASTQ, BAM, VCF, etc.
Q. It’s not just capacity, it’s also about performance. Analysis of genomic data sets is very often hard on large-scale storage systems. Are there prospects for developing methods like in-memory processing, etc., to offload some of the analysis and/or ways to optimize performance of I/O in storage systems for genomic applications?
A. At Intel, we are using HPC systems with an IB or OPA fabric (or RoCE over Ethernet) with Lustre. We run in a “throughput” mode rather than focusing on individual sample processing speed: multiple samples are processed in parallel rather than sequentially on a compute node. We use a sizing methodology to rate a specific compute node configuration; for example, our benchmark on our 2nd Gen Scalable processors is 6.4 30x whole genomes per compute node per day. Benchmarks on our 3rd Gen Scalable processors are underway. This sizing methodology allows for the most efficient use of compute resources, which in turn can alleviate storage bottlenecks.
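As a rough illustration of throughput-style sizing, the sketch below turns a per-node rating into a node count. Only the 6.4 genomes/node/day figure comes from the benchmark above; the 500-genomes-per-day target is a hypothetical example:

```python
# Rough sizing sketch based on a per-node throughput rating.
# Only the 6.4 30x-genomes/node/day rating is taken from the benchmark above;
# the daily target below is a hypothetical example.
import math

GENOMES_PER_NODE_PER_DAY = 6.4  # benchmark cited for 2nd Gen Scalable processors

def nodes_needed(genomes_per_day: float) -> int:
    """Compute nodes required to sustain a given daily genome throughput."""
    return math.ceil(genomes_per_day / GENOMES_PER_NODE_PER_DAY)

# Hypothetical center processing 500 whole genomes per day
print(nodes_needed(500))  # -> 79 nodes
```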
Q. What is the typical access pattern of a 350G sequence? Is full traversal most common, or are there usually focal points or hot spots?
A. The 350GB comprises two possible input file types and two output file types. The input can be either a FASTQ file, which is an uncompressed, raw text file, or a compressed version called a uBAM (u = unaligned). The output file types are a compressed “aligned” version called a BAM file, which is the output of the alignment process, and a gVCF file, which is the output of the secondary analysis. This 350GB number is highly dependent on data retention policies, compression tools, genome coverage, etc.
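For context on the raw input, FASTQ is a plain-text format with four lines per read: a header, the bases, a “+” separator, and per-base quality scores. A minimal, illustrative reader (the file name is hypothetical) might look like this:

```python
# Minimal sketch of iterating over FASTQ records (4 text lines per read):
# @header, sequence, '+' separator, and per-base quality string.
# The file name below is hypothetical.

def read_fastq(path):
    with open(path) as fh:
        while True:
            header = fh.readline().rstrip()
            if not header:
                break              # end of file
            seq = fh.readline().rstrip()
            fh.readline()          # '+' separator line
            qual = fh.readline().rstrip()
            yield header, seq, qual

for header, seq, qual in read_fastq("sample_R1.fastq"):
    print(header, len(seq))
```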
Q. What is the size of a sequence and how many sequences are we looking at?
A. If you are asking about an actual sequence of 6 billion DNA bases (3 billion base pairs), then each base is represented by 1 byte, so you have 6GB. However, the way current “short read” sequencers work is based on the concept of coverage. This means you run the sequence multiple times, for example 30 times, which is referred to as “30x”. So, 30 times 6GB = 180GB. For my “thought experiment” I considered 7.8B sequences, one for each person on the planet at 30x coverage. This analysis uses the ~350GB number, which includes all the files mentioned above.
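Writing that back-of-the-envelope arithmetic out explicitly (the ~350GB per-sample footprint and 7.8B population are the same assumptions used above):

```python
# The back-of-the-envelope numbers from the answer above, written out.
bases = 6e9                       # ~6 billion bases (3 billion base pairs), ~1 byte each
coverage = 30                     # "30x" short-read sequencing
per_sample_raw = bases * coverage # ~180 GB of raw base data per sample
per_sample_all_files = 350e9      # ~350 GB incl. FASTQ/uBAM, BAM and gVCF outputs
population = 7.8e9                # one genome per person (thought experiment)

total_bytes = population * per_sample_all_files
print(f"raw per sample:  {per_sample_raw / 1e9:.0f} GB")     # 180 GB
print(f"all files total: {total_bytes / 1e18:.0f} EB")       # ~2,730 EB, roughly the
                                                              # exabyte figure cited in
                                                              # the question above
```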
Q. Can you please help with the IO pattern question?
A. IO patterns depend on the applications used in the pipeline. Applications like GATK baserecal and SAMtools have a lot of random IO and can benefit from the use of SSDs. On the flip side, many of the applications are sequential in nature. Another thing to consider is the amount of IO in relation to the overall pipeline, as the presence of random IO does not inherently mean there is a bottleneck.
Q. You talked about prefetching the data before compute, which needs a compressed file signature of the actual data and a way to reference it. Can you please share some details of what is used now to do this?
A. The current implementation of prefetch via workload manager (WLM) directives is based on metadata queries done using standard SQL on distributed index files in the system. This way, any metadata recorded for a specific file can be used as a search criterion. We’re also working on being able to access and process the index in large concatenated file formats such as NetCDF and others, which will extend the capability to find the right data at the right time.
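Purely as an illustrative sketch of the idea, and not the actual index implementation described above, a prefetch step driven by a standard SQL metadata query could look like this (the database, table and column names are all hypothetical):

```python
# Illustrative sketch only: querying a metadata index with standard SQL to
# decide which files to stage before the compute job starts.
# The schema (table "file_index", columns "path", "sample_id", "file_type")
# and the database file are hypothetical, not the actual implementation.
import sqlite3

def files_to_prefetch(index_db: str, sample_id: str) -> list[str]:
    conn = sqlite3.connect(index_db)
    rows = conn.execute(
        "SELECT path FROM file_index WHERE sample_id = ? AND file_type = 'BAM'",
        (sample_id,),
    ).fetchall()
    conn.close()
    return [path for (path,) in rows]

# A workload manager prologue could then stage these paths to fast storage
# before the job's compute phase begins.
print(files_to_prefetch("metadata_index.db", "SAMPLE_0001"))
```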
Q. For genomics and this quantity of data, do you see Quartz Glass as a better replacement for tape?
A. Quartz Glass is an interesting concept, but it is one of many new long-term storage technologies being researched. Back in 2012 when it was originally announced by Hitachi, I thought it would most definitely replace many storage technologies, but it has gone very quiet over the last 5+ years, so I’m wondering whether this particular technology has survived.