CHICAGO (GenomeWeb) — In 2015, entomologist and geneticist Gene Robinson and his University of Illinois at Urbana-Champaign colleagues authored a paper arguing that the term "genomical" should replace "astronomical" as an indicator of size.
The paper, which appeared in PLOS Biology, suggested that once exabase-scale genomics arrives in the early 2020s, the amount of data produced by sequencing will surpass that generated by astronomers — the inspiration for "astronomical" in the first place. By 2025, Robinson's team wrote, somewhere between 100 million and 2 billion human genomes could be sequenced, requiring 2 exabytes to 40 exabytes of storage space.
By contrast, if 2015 trends hold, people by 2025 could be uploading between 1,000 hours and 1,700 hours of video per minute to YouTube, necessitating 1 exabyte to 2 exabytes of storage space, and posting 1.2 billion tweets per day on Twitter, requiring about 1.5 petabytes of storage. The Australian Square Kilometer Array Pathfinder, the world's largest astronomy project, is projected to acquire about 25 zettabytes of images per year by 2025 and will need a single exabyte of storage space annually.
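As a rough check on the genome projection above, both ends of the range imply the same average storage per genome. The short Python sketch below works through that arithmetic; it assumes decimal (SI) units, which the article does not specify, and the function and constant names are illustrative only.

```python
# Assumption: decimal (SI) units. The article does not say which convention applies.
EXABYTE = 10**18
GIGABYTE = 10**9


def bytes_per_genome(total_exabytes: int, genomes: int) -> float:
    """Average storage per genome implied by a total size and a genome count."""
    return total_exabytes * EXABYTE / genomes


# Low end of the projection: 2 exabytes for 100 million genomes.
low = bytes_per_genome(2, 100_000_000)
# High end of the projection: 40 exabytes for 2 billion genomes.
high = bytes_per_genome(40, 2_000_000_000)

print(f"low end:  {low / GIGABYTE:.0f} GB per genome")
print(f"high end: {high / GIGABYTE:.0f} GB per genome")
```

Under that assumption, both ends work out to about 20 gigabytes per genome, so the spread in the storage estimate reflects the number of genomes sequenced rather than a different size per genome.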
"That was meant to invoke the concept that genomics truly is big data, and that's what the paper was about, showing that genomics truly ranks in the upper tier of big data sets," Robinson, who presented at the annual Intelligent Systems for Molecular Biology (ISMB) conference of the International Society for Computational Biology last month, said in a new interview.
The term "genomical" was intended for multiple audiences, according to Robinson, director of the Carl R. Woese Institute for Genomic Biology at UIUC. "It was a shorthand to get people to be thinking, 'Wow, this is where genomics is going … we want to be ready for it,'" he explained.
"Those that are closer to biology may have been tracking this, but the computing community is very large, broad, and very diverse," Robinson said. "There are segments that work in spaces that are far away from biological problems, so we wanted to make sure that the computing community, as well as other scientific communities, are aware of the coming nature of genomics."
UIUC hosts a center of excellence for the National Institutes of Health's Big Data to Knowledge (BD2K) program, and researchers there are certainly aware of Robinson's work. In fact, center codirector Saurabh Sinha was one of the authors of the PLOS Biology paper.
"[The center] embraces the genomical concept, the concept that these are really large-scale datasets and that there need to be new computational tools, not only to be able to handle datasets that are larger than we've ever had before, but also to derive knowledge from them," Robinson said.
However, three years after publication of the paper, "genomical" has not actually supplanted "astronomical" in scientific parlance, though that was not the primary goal. "I think the paper has served an important purpose in highlighting the size, the magnitude of the datasets, the computational challenges that are going to be posed as genomics moves into the next phases, including the Earth BioGenome Project," Robinson said.
Indeed, Robinson is one of the founding members of the Earth BioGenome Project, which also came about in 2015.
Robinson joined Harris Lewin of the University of California, Davis, and John Kress from the Smithsonian National Museum of Natural History to develop this global effort to sequence all known eukaryotic species on Earth. The project intends to sequence the genomes of 9,000 eukaryotic families, 150,000 to 200,000 genera, and 1.5 million to 2 million species over a 10-year period, at an estimated cost of $4.7 billion, to generate a reference genome that's at least as good as the human reference.
That ambitious effort really is just getting off the ground, though the Earth BioGenome Project has developed a high profile in some circles. Notably, Lewin presented at the World Economic Forum in Davos, Switzerland, in January.
"There is increasing interest from federal agencies and foundations throughout the world," Robinson said. "We are building an international consortium and we are getting strong expressions of interest from different parts of the world."
So far, the participants have mostly been laying the groundwork and building infrastructure for the massive amount of sequencing planned, but some of the actual science is underway. "A lot of the work is already going on because there already are multiple taxon-based consortia: plant-based genomes, vertebrates, insects," Robinson noted.
Because the BD2K center at Illinois is in its final year of operations now, at least according to the initial contract, Sinha soon will have more time to support Robinson on the Earth BioGenome Project.
"I was completely swamped by the BD2K center's responsibilities until now, so I would imagine that I will resume my discussions with Gene this semester, in the fall," he said.
UIUC's portion of BD2K is called the KnowEng Center, for which it created a computational knowledge engine to apply analytics and machine learning to secondary genomic analysis such as classification and regression. Sinha called it "knowledge-guided analysis of genomics data in the cloud."
The university built algorithms and analytics tools to assess information from multiple sources, including public databases and user-provided genomics data. "We have aggregated all of those databases into a massive knowledge network," Sinha said. "It's a heterogeneous network that represents all of that information, and then that becomes part of the input to our algorithms."
Because KnowEng is built in an Amazon Web Services environment, its "knowledge-guided analysis" is available to anyone with an AWS account. That includes the Earth BioGenome Project.
"The same functionality will be needed for the EBP as well," Sinha said. "You will have sporadic collections of prior knowledge relevant to various organisms and various species that you would like to tap into when you are looking at any one less-studied species or any one smaller collection of less-studied species."
While the EBP will need different analytics algorithms than the ones developed for BD2K, "this need for knowledge network-driven analysis of user-selected subsets of data would be the same," according to Sinha.
"We would like to exploit that software-level technology, not so much the algorithm-level technology, to help with the EBP," he said.
Robinson said that the EBP is not intended to replace or duplicate any other efforts in the genomics world.
"It's meant to coordinate and to lay out the overarching vision for the entire project, to provide a roadmap for how to reach that, to be able to deal with the need for a common standard, and, going back to the initial point of contact, to announce to the computing world that this is coming down the pike," Robinson explained.
"Part of the EBP mission is to describe the overarching vision so that the various allied groups, including the computing community — especially the computing community — are able to plan for this and then take their rightful place as partners in EBP," Robinson said. "Many of the members of the working group are some of the leaders in computational genomics, so they are well aware of the challenges" of working with such genomical datasets.
Sinha said to expect some news on the technology side of EBP in the early part of 2019.