Brandon Kruse has built a business on storing genomes. He shares his opinions on the big data issues that scientists should start thinking about.
- Why we shouldn’t be calling it ‘big data’
- The lifespan of a genome in storage
- How much data storage costs labs
Brandon Kruse is a serial entrepreneur in the IT industry, but he’s also the founder of a company (agtc.io) that stores sequencing data for the life science community. I recently spoke to Brandon about the challenges of handling large quantities of genomic data, and the solutions.
Firstly, how did you make the switch from telecoms to genomics?
“First and foremost, I’m into IT and data, but genomics has always been interesting to me,” explains Brandon. “It’s a happy coincidence that I share my space with the HudsonAlpha Institute for Biotechnology, a genomic research institute that’s also an Illumina HiSeq X Ten customer. HudsonAlpha made the decision to invest in the X Tens because of the demand for fast genome sequencing across a variety of fields and applications. In learning about genomic data challenges, it was interesting to see how the industry as a whole is falling short in terms of storage.”
“There’s a lot of uncertainty in the industry”
Brandon goes on to illustrate the challenges in the industry: “To give you an idea of scale, a petabyte is 1,000 terabytes. HudsonAlpha produced around 200 terabytes of data last year, prior to obtaining the HiSeq X Ten. Putting the X Tens online means a lot more data, and therefore a lot more storage. With new sequencers down the road and the cost of sequencing dropping so fast, the cost of meeting storage requirements has increased drastically. Analyzing the data is HUGE business.”
And this business can be put into context: “Imagine you’re a company, 2016 comes around, and you get a standard 2% rise in your data storage budget. That’s just not enough, because data output is going to be 40-60% higher. The recent advances in whole genome sequencing have been extremely disruptive.”
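To put that mismatch into numbers, here is a rough back-of-the-envelope sketch. The 2% budget rise and the 40-60% data growth come from the interview; the starting volume, starting budget, and the assumption of a flat cost per terabyte stored are hypothetical.

```python
# Rough projection of the mismatch Brandon describes: a ~2% annual storage-budget
# rise against ~50% annual data growth. Starting figures are hypothetical; only
# the growth rates come from the interview.

def project(years=5, start_tb=200, start_budget=100_000,
            data_growth=0.5, budget_growth=0.02):
    """Print stored volume vs. budget, assuming a flat cost per TB stored."""
    tb, budget = start_tb, start_budget
    cost_per_tb = budget / tb  # normalise so that year 0 is fully funded
    for year in range(years + 1):
        cost = tb * cost_per_tb
        shortfall = max(cost - budget, 0)
        print(f"year {year}: {tb:8.0f} TB  budget ${budget:10,.0f}  "
              f"storage cost ${cost:10,.0f}  shortfall ${shortfall:10,.0f}")
        tb *= 1 + data_growth
        budget *= 1 + budget_growth

project()
```

Under these assumptions the storage bill roughly doubles every two years while the budget barely moves, which is the gap Brandon is pointing at.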
So, on this subject, what is your opinion on generating all this ‘big data’?
“That’s a great question, and let me answer it like this: there are so many organizations referring to next-gen sequencing as ‘big data’. But the processing steps that make this data useful aren’t actually big data; at those steps they’re doing ‘large data’, or ‘lotsa data’.
By this I mean they’re storing a lot of data. No analysis ‘en masse’ occurs until much later in the whole genome sequencing (WGS) process. The aggregate data is going to be hugely valuable in the future. Even Google and Amazon are jumping on board.”
So genomes could potentially be seen as a commodity in the future. But will laboratories be going bigger (e.g. HiSeq X Ten) or smaller (e.g. MiniSeq) in the future?
With current efforts to achieve denser and more accurate coverage across the full genome, it’s reasonable to assume that standard coverage depth will increase: 40x will become the minimum average depth, and that number will likely keep rising. In other words, each base in the genome is read 30 or 40 times on average. HudsonAlpha’s clinical sequencing pipeline is already running genomes at 40x, and Brandon expects that to become the industry standard. Furthermore, with expected changes in Illumina chemistry, he expects data storage demands to increase by over 50%.
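As a rough illustration of why deeper coverage drives storage demand, FASTQ volume scales roughly linearly with average depth. The 70 GB figure for a 30x genome is quoted later in this interview; the linear model itself is a simplifying assumption (read length and compression also matter).

```python
# Rough, linear scaling of FASTQ volume with average coverage depth.
# The 70 GB at 30x reference point is quoted later in this article; the
# linear model is a simplifying assumption.

REFERENCE_DEPTH = 30      # x coverage
REFERENCE_SIZE_GB = 70    # approximate FASTQ size for one genome at 30x

def fastq_size_gb(depth):
    return REFERENCE_SIZE_GB * depth / REFERENCE_DEPTH

for depth in (30, 40, 50):
    print(f"{depth}x coverage ~ {fastq_size_gb(depth):.0f} GB per genome")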
“It costs $5m a year to run the X Ten, but if you’re hosting on Amazon or a similar provider, your costs keep growing as you increase the total volume of data stored on the cloud platform.”
Okay, let’s focus on this then. What are the true data costs?
Brandon takes me through a practical example of how much a lab could be paying for genome storage, and he explains that it is highly variable. It depends on a myriad of factors, including how often you want to access the data and which file type you want to store.
Working through an example with FASTQ files (the most popular format); the cost arithmetic is sketched after this list:
- One sample of WGS at 30x coverage = 70 GB.
- Accessing the data for this single sample once per month = a monthly cost of $9.45.
- Accessing the data only once per year drops the bill dramatically, to $0.79 per month.
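The dollar figures above are from the interview; the sketch below only shows the shape of the calculation, splitting the bill into a per-GB storage charge and a per-GB retrieval (egress) charge. The rates used are hypothetical placeholders, not any provider’s actual prices.

```python
# Sketch of how a monthly bill for one WGS sample might break down.
# SAMPLE_GB comes from the interview; the per-GB rates are hypothetical
# placeholders, not the prices of any particular cloud provider.

SAMPLE_GB = 70

def monthly_cost(storage_per_gb_month, egress_per_gb, accesses_per_year):
    """Storage charge plus retrieval (egress) charge, averaged per month."""
    storage = SAMPLE_GB * storage_per_gb_month
    retrieval = SAMPLE_GB * egress_per_gb * accesses_per_year / 12
    return storage + retrieval

# Frequently accessed: keep the file in a 'hot' tier and pull it down monthly.
print(f"hot tier, monthly access: ${monthly_cost(0.03, 0.09, 12):.2f}/month")

# Rarely accessed: park it in an archival tier and touch it once a year.
print(f"cold tier, yearly access: ${monthly_cost(0.007, 0.09, 1):.2f}/month")
```

The key point is the second term: the more often you pull the file back down, the more the retrieval charge dominates, which is exactly the cost Brandon says providers push onto the user.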
“In addition to the cost of storing the data, cloud storage providers push the cost of accessing the service onto the user, and data access is what consumes most of the bandwidth. A lot of researchers don’t take this into consideration; the majority of people simply aren’t ready to handle the data storage issues.”
The fear of losing data – How valuable is it?
I ask Brandon if people are afraid of losing their data in the virtual world and he admits that they are, but they shouldn’t be.
“It costs a lot less to keep physical samples than to keep virtual data. Freeze something for five years and you can bring it out again. But sample degradation and losing the FASTQ file are distinct considerations. For us there’s a big term called ‘storage durability’: agtc.io stores sequencing data at three locations, which means we effectively expect to lose one FASTQ file every 1.2 million years.”
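A crude model makes the ‘storage durability’ idea concrete: with several independent copies, a file is only lost if every copy fails. The per-replica failure probabilities below are made-up assumptions, and the one-file-per-1.2-million-years figure is agtc.io’s own claim rather than an output of this model.

```python
# Crude durability model: a file is lost only if all replicas fail in the
# same year, with failures assumed independent and no repair in between.
# The failure probabilities are made-up; the 1.2-million-year figure in the
# article is agtc.io's claim.

def mean_years_to_loss(annual_replica_failure_prob, replicas=3):
    annual_loss_prob = annual_replica_failure_prob ** replicas
    return 1 / annual_loss_prob

for p in (0.05, 0.01):
    print(f"{p:.0%} replica failure per year, 3 copies -> "
          f"roughly {mean_years_to_loss(p):,.0f} years between file losses")
```

In practice providers also repair lost replicas, which pushes durability far higher than this simple model suggests; the point is simply that replication across locations is what turns fragile files into million-year expectations.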
And with the cost of sequencing decreasing exponentially, you can expect labs to be sequencing more and more in the future, without their budgets increasing accordingly.
Data sets are getting larger and being generated more frequently. AGTC.io provides an end-to-end platform for highly available, durable, and secure storage. Deleting and re-sequencing (a) doesn’t actually happen and (b) isn’t a real answer to the problem. AGTC.io lets you keep your data without paying the enormous overhead associated with current enterprise and cloud storage solutions.