
CHICAGO – After spending half a decade and close to $10 million to upgrade and reconfigure its data storage, the Jackson Laboratory is embarking on a new effort to develop a strategy for managing all of its data science activities.
To lead this initiative, Jackson Laboratory, or Jax, has hired Paul Flicek as its first-ever chief data science officer. Flicek will join Jax in July to create and execute an organization-wide data science strategy, as well as lead and manage relationships with Jax's data and analytics partners.
Bioinformatics is not new there, but it has been disjointed. Bar Harbor, Maine-based Jax has been around since 1929, but the 2014 opening of the Jackson Laboratory for Genomic Medicine on the campus of the University of Connecticut Health Center has accelerated the production of data.
Bar Harbor has historically focused on mouse models, including genetics. The Farmington, Connecticut, location is dedicated to human genetics.
"The ambitions here are bigger than what two sites at the Jackson Laboratory can do," Flicek said. "It's about making connections between the mammalian models and the data that exist for those and [for] human diseases and other translational aspects."
Flicek was most recently associate director of the European Molecular Biology Laboratory's European Bioinformatics Institute (EMBL-EBI), an organization with which he maintains an affiliation. He is known for leading development of the Ensembl genome browser, as well as for his involvement with projects including the Encyclopedia of DNA Elements (ENCODE), the 1000 Genomes Project, and the International Human Epigenome Consortium.
"Scientifically, this is just a really good fit for me," Flicek said of the new job. "I've been working with large-scale data, and I've been doing research on the interconnections between species to help us learn about how biology functions for basically my entire career. This is really a nice way to bring everything together."
Flicek has worked with many species in his career, including mice, and he has been involved in genomics since the turn of the century. In recent years, he has seen high-throughput data become a central aspect of all of biology, including genomics, so having a large-scale data strategy is now imperative.
"I think it's the right time for Jax to make this investment," he said.
Jax press material about Flicek's hiring referred to both a "global data science initiative" and a "comprehensive data science strategy." Flicek said that one of his first tasks will be to work with other Jax leaders to develop an "overarching strategy that various initiatives fit into."
He said that there are some pilot projects underway that he would like to convert to full-scale initiatives once he starts the job. Flicek said it was too soon to discuss specifics, though he offered some general goals.
"It's about leveraging the Jax mouse data, which is really unique in the world, and making that as connected and integrated as possible with relevant human data so it can facilitate translational research," Flicek said. "What we want data science at Jax to do is enable and accelerate biological discovery for researchers."
When he arrives at Jax at midyear, Flicek first wants to survey the bioinformatics landscape there and determine how the data science program he will be heading can better disseminate knowledge to the research community.
Flicek said that the job "gives [me] an opportunity to basically build things from the start … and have the potential to make large-scale impacts" on genomic research.
"The obvious ways to do that are to make the mouse data as accessible and as coherent as possible and integrate that with the human data to enable people to ask questions" about the data, Flicek said.
Flicek sees parallels between EMBL-EBI and Jax in that both have developed publicly accessible informatics tools. Jax is particularly noted for Mouse Genome Informatics (MGI).
Both institutions also have what Flicek labeled "'explore-the-space' research" in bioinformatics, experimental work that may or may not end up in public software and database tools. "This interchange between service-based informatics and research informatics is something that I think is also similar," he said.
However, Flicek will be responsible for a wider range of data at Jax than he was at EMBL-EBI, including imaging and related metadata. "New technologies will push new data types, as well," he said. "That's an area that I'm looking forward to learning."
Flicek gushed about the possibilities arising as the cost of whole-genome sequencing declines. "Highly accurate whole-genome sequences from nearly any species at low cost is incredibly exciting, and it allows simultaneously for the sequencing of hundreds of thousands or millions of humans, but also for other species," he said.
Flicek also said he is excited about opportunities to innovate in machine learning and artificial intelligence in bioinformatics. He named protein structure prediction software AlphaFold 2 from Google-affiliated DeepMind Technologies as an example of how AI is serving translational research today.
He expressed frustration about the challenges related to integrating the work of computational and experimental scientists. "There's communication that needs to take place as more computational scientists potentially come into biology from nonstandard or nonbiological background training," such as physics and mathematics, according to Flicek.
Data standards are important to any interoperability program, and they are often sorely lacking in bioinformatics. "Data standards are the key aspects to reproducibility in biology and insights generally," Flicek said.
But Flicek was optimistic that consensus can be found.
He has worked with Jax before because both the US lab and EMBL-EBI are members of the International Mouse Phenotyping Consortium. Jax also manages the widely used Human Phenotype Ontology because co-creator Peter Robinson is now a faculty member on the Farmington campus.
EMBL-EBI and Jax both subscribe to the FAIR principles, which hold that data should be findable, accessible, interoperable, and reusable. Both institutions also are involved with the Global Alliance for Genomics and Health (GA4GH), a coalition that offers a blueprint through its "driver projects," which test and validate standards.
Flicek said it was possible that Jax could run "driver-like" projects for new standards that support the organization's mission.
"Having data be effectively used is kind of a gift that keeps on giving," he said. "Highly used datasets can be incredibly valuable for the community in ways that are sometimes surprising to the people who originally generated them."