Chris Dagdigian, co-founder and senior director of BioTeam, discussed the need to rethink data storage at Bio-IT 2021. This article originally appeared in Genetic Engineering and Biotechnology News in December 2021.
A life sciences IT expert proclaims that the age of unlimited data storage is over, and that it’s time for scientists to take data management seriously
In the biopharmaceutical industry, data is seldom well managed. More often, it is simply hoarded. This phenomenon has attracted the notice of Chris Dagdigian, co-founder and senior director of BioTeam, an IT consultancy for life sciences organizations. According to Dagdigian, life sciences organizations that accumulate data in petabyte-sized heaps may put themselves at “existential risk.” Indeed, these organizations are getting to the point where they could be, well, buried alive.
“Organizations that used to throw money at expanding storage rather than understanding or curating it have started to hit limits,” Dagdigian observes. At these organizations, scientists must stop acting as though storage is unlimited. Instead, Dagdigian says, they need to “view storage as a consumable resource, just like laboratory reagents.”
Dagdigian discussed the need to rethink storage at Bio-IT 2021, and he elaborates on his theme in this article, which presents insights shared directly with GEN. According to Dagdigian, a new social contract between research and IT requires that scientists themselves “keep a tidy home” by managing their data more efficiently. “IT can’t make data storage decisions for scientists,” he stresses.
Curate what you store
Some data isn’t worth storing to begin with. Dagdigian advises adopting edge computing—analyzing data near its creation point and saving only what’s good. “When you conduct a brief quality control (QC) test on data coming off an instrument, do a local analysis,” he suggests. “Only if the data passes the QC check should you move it into the core for analysis and full lifecycle treatment.”
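For illustration only, here is a minimal sketch of what such an edge QC gate might look like, assuming a hypothetical instrument staging directory, a mounted core storage path, and a placeholder passes_qc() check standing in for the instrument’s real quality metric:

```python
"""Hypothetical edge QC gate: run a quick local check on each new
instrument file and promote only passing data to core storage."""
import shutil
from pathlib import Path

INSTRUMENT_DIR = Path("/instrument/output")    # assumption: local staging area
CORE_DIR = Path("/mnt/core-storage/incoming")  # assumption: curated core tier

def passes_qc(path: Path) -> bool:
    """Placeholder for the instrument's real QC metric (read quality,
    signal-to-noise, etc.). Here we only require a non-trivial file size."""
    return path.stat().st_size > 1_000_000

def triage(directory: Path) -> None:
    CORE_DIR.mkdir(parents=True, exist_ok=True)
    for data_file in directory.glob("*.raw"):
        if passes_qc(data_file):
            # Passing data enters the core for analysis and lifecycle treatment.
            shutil.copy2(data_file, CORE_DIR / data_file.name)
        else:
            # Failing data is discarded at the edge and never stored centrally.
            data_file.unlink()

if __name__ == "__main__":
    triage(INSTRUMENT_DIR)
```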
A perhaps harder step for scientists—particularly those who work in data-intensive fields or who actively avoid managing their data—is to become responsible for archiving data and making tiering decisions. Such decisions aren’t appropriate for IT to make, and they can’t be made based on the age of a file or when it was last accessed.
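To make that concrete, here is a sketch of the alternative, assuming (hypothetically) that each dataset carries a small researcher-filled manifest; the storage tier follows the scientist’s declared status rather than file timestamps:

```python
import json

# Illustrative mapping from a researcher's declared status to a storage tier.
TIER_BY_STATUS = {
    "active_analysis": "hot",        # fast, expensive storage
    "published": "archive",          # cheap, durable, rarely read
    "regulatory_hold": "compliance", # immutable retention (e.g., FDA filing)
}

def choose_tier(manifest: dict) -> str:
    """Pick a tier from the scientist's declaration, not the file's age."""
    return TIER_BY_STATUS.get(manifest.get("status"), "review_needed")

# Example manifest a researcher might attach to a dataset.
example = json.loads(
    '{"dataset": "run_2021_11_03", "status": "published", '
    '"reason": "figures in accepted manuscript"}'
)
print(choose_tier(example))  # -> "archive"
```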
Scientists may counter that extra storage is cheap—less than $25 for a two-terabyte portable hard drive. But Dagdigian tells GEN, “The cost of a single storage drive is minor relative to the true, fully loaded cost of keeping your scientific data stored safely, reliably, redundantly, and securely over its full lifecycle, which may be years when publications are involved, or forever when the data is related to an FDA filing.”
Overall lifecycle costs include:
- Multiple devices to ensure redundancy.
- Technical, consumable, and operational elements for disaster recovery.
- Secondary facilities, redundant infrastructure (such as networking links), and local backup resources such as tapes.
- Hardware and software updates, repairs, and refreshes.
According to Dagdigian, scientists are surprised that the true cost of a terabyte of data stored over time can be many thousands of dollars.
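A rough back-of-envelope illustration of how those multipliers compound; every number below is a placeholder assumption, not a quoted price:

```python
# Purely illustrative estimate of fully loaded storage cost per terabyte.
raw_cost_per_tb_year = 250.0  # assumed primary storage cost ($/TB/year)
redundancy_copies = 3         # primary + backup + disaster-recovery copy
ops_overhead = 0.5            # assumed staff/facility/network overhead (50%)
retention_years = 10          # e.g., data tied to a publication or filing

fully_loaded = (raw_cost_per_tb_year * redundancy_copies
                * (1 + ops_overhead) * retention_years)
print(f"Fully loaded cost per TB over {retention_years} years: ${fully_loaded:,.0f}")
# -> roughly $11,250 under these assumptions, versus ~$25 for a bare portable drive
```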
Consider future growth
Small and mid-sized companies are turning to cloud storage providers such as Egnyte (which claims to deliver a “unified foundation” for all enterprise, scientific, or clinical data needs) and Box (which claims to provide “seamless file sharing”). “They are useful in many scenarios,” Dagdigian says. “There are solutions for handling standard business data, for ingesting data from outside parties, and maybe for managing scientific data of less than a few terabytes.”
The problem, he points out, is poor scalability and lack of compatibility with Linux. “Egnyte and Box can’t be used in most real-world data lakes or analytic environments at the petabyte or even ‘lots of terabytes’ scale,” Dagdigian elaborates. “If I was building a data-intensive biopharma from scratch, I’d draw a hard line in the sand … and say that we would have to do something else for large-scale scientific data.”
That “something else” could be the data commons (data harmonized around common pipelines) or data lakes (central repositories of raw data) that large organizations use. Either of these approaches breaks data out of traditional silos for enterprise-wide accessibility. Data still must be curated, however.
As biopharma firms develop their data management strategies, they also should realize that the dominant use case is no longer a human browsing files and folders. “We have billions of files and a high volume of data that no human will ever look at manually,” Dagdigian stresses. “The dominant consumers of data at this scale will be machines, pipelines, and software workflows.”