The New Problem of Genetic Data

In Next Generation Gene Sequencing, Don’t Forget the Data…and the Answers

In the next wave of gene sequencing techniques, the focus is mostly on the inputs.  Take the new nanopore approach from a computational physicist at the University of Illinois Urbana-Champaign.  By pulsing an electric field on and off around a strand of DNA, the technique induces the DNA to expand and relax as it threads through the nanopore…just the behavior needed to read each base.  So much innovation on the front end.  What about the outputs?

In a recent press release, one industry guru urges us to spend more time thinking about what to do with the data than about how to generate it:

“[The] difficult challenge is accurately estimating what researchers are going to do with the data downstream. Collaborative research efforts, clever data mash-ups and near-constant slicing and dicing of NGS datasets are driving capacity and capability requirements in ways that are difficult to predict,” said Chris Dagdigian, principal consultant at BioTeam, an independent consulting firm that specialises in high performance IT for research. “Users today need to consider a much broader spectrum of requirements when investing in storage solutions.”

Unfortunately, one of today’s myths is that storage solutions are prepared to do the ‘near-constant slicing and dicing’ Mr. Dagdigian mentions.  Too often, the people running high performance computers (née supercomputers) simply stick a big storage system on the end and dump data into it.  Without industry-leading tools to get data back out of the storage system, the real challenge doesn’t end with the sequencing…it’s just beginning.

Is this a new problem?  Some think so.  For example, George Magklaras, senior engineer at the University of Oslo, says: “The distribution and post-processing of large data-sets is also an important issue. Initial raw data and resulting post-processing files need to be accessed (and perhaps replicated), analyzed and annotated by various scientific communities at regional, national and international levels. This is purely a technological problem for which clear answers do not exist, despite the fact that large-scale cyber infrastructures exist in other scientific fields, such as particle physics.  However, genome sequence data have slightly different requirements from particle physics data and thus the process of distributing and making sense of large data-sets for Genome Assembly and annotation requires different technological approaches at the data-network and middleware/software layers.”

New problems need new solutions.