NEW YORK – The Broad Institute's Data Sciences Platform (DSP) keeps improving the storage, analysis, and management of ever-growing genomic datasets, enabling researchers around the world to do their work.
The platform is one of the institute's two largest operations, along with the sequencing center, known as the Genomics Platform. With some 250 employees, the DSP is structured much like a technology company and is even headquartered in the global tech hub of Kendall Square in Cambridge, Massachusetts.
"It looks pretty much like a commercial tech company," said DSP head Clare Bernard. "We have all the same roles as you would see in a commercial tech company," with a lot of software developers and computational biologists. The only difference is that the Broad does not have a large sales and marketing operation.
In a recent interview in her office, Bernard explained that the Broad is made up of a collection of programs and platforms. Programs are organized around areas of research like academic departments, while platforms are built around technological capabilities.
Bernard called herself a "typical" DSP employee in that she has a scientific background, though not necessarily in computational biology or genomics. She has a Ph.D. in particle physics from Boston University and completed her dissertation on the Atlas experiment at the Switzerland-based European Organization for Nuclear Research, better known by its French acronym, CERN.
The main similarity between CERN and the Broad is that both generate massive amounts of scientific data for the benefit of the global research community.
After moving back to the US, Bernard worked for a Boston-area software startup that supported business-to-business sales with machine learning for data integration. "I definitely fell in love with software and fell in love with tech, but I definitely missed science," she said.
Five years ago, she joined the Broad as a project manager under Chief Data Officer Anthony Philippakis. She worked her way up the DSP ladder until she landed the top job in March 2021, around the time the Broad opened the Eric and Wendy Schmidt Center, a hub for interdisciplinary research at the nexus of biology and machine learning. Philippakis, who also serves as codirector of the Schmidt Center, remains her boss.
The largest of the DSP's efforts is the Terra data platform, a cloud-based environment for large-scale analysis of omics data.
The Broad created the platform, formerly known as FireCloud, to improve data management, with assistance from Google and two University of California institutions and financial support from the US National Cancer Institute. The DSP was formed soon after, in 2015.
"The [quantity] of the data we were generating and had under management was doubling every eight months," Bernard said. "It's really hard to double the size of your data center every eight months," so the Broad made the decision to partner with Google and move its data hosting to a cloud environment.
"The institute, we were either going to have to stop sequencing or move to the cloud," Bernard said. "It was a very binary moment in time."
Terra hosts data from several large-scale projects funded by the US National Institutes of Health (NIH), including the All of Us research program; Analysis, Visualization, and Informatics Lab-Space (AnVIL); and Count Me In, a nonprofit cancer research organization that the Broad cofounded.
The Broad, of course, is known for creating or co-creating widely used genomic analysis tools and datasets, including the Genome Analysis Toolkit (GATK) and the Genome Aggregation Database (gnomAD).
With Terra, the Broad embraced the popular philosophy of allowing users to bring analytics tools to a centralized data repository rather than forcing users to download massive datasets.
Bernard said that data sharing is "critical" for the future of precision medicine. "The field of genomics has always been really good about sharing data, but we haven't been good about operationalizing that sharing," she said.
"Putting a dataset on a server and then having everybody download it to their own individual infrastructure isn't good for a variety of reasons," including security and cost. "It means genomics becomes a sport of kings where only people at very well-funded institutions can do the research."
"For Terra … we don't consider ourselves the owners of any data. We consider ourselves the custodians of the data," Bernard said. Even data created in other Broad departments does not "belong" to the DSP, and the Broad produces a whole lot of sequencing data.
Growth of functional genomics
Bernard explained that the DSP has four areas of software and capabilities, the largest being Terra. Another is a "direct-to-participant" platform for running patient-centric research studies.
A third group, the methods team, is largely made up of Ph.D.s in computational biology, machine learning, physics, and mathematics and is responsible for maintenance and ongoing development of GATK. That team also works on extraction of data from clinical records as well as from single-cell functional genomics experiments.
The DSP also has a data engineering group, responsible for data pipelines. "Our bread and butter is genomic data processing," Bernard said, but the group is increasingly handling clinical data, as well.
"Direct-to-participant studies are becoming more and more popular, particularly in rare disease and cancer," Bernard said. "One of the things that I really like about the Broad is, it's a place that takes on challenges that are not only intellectually difficult but also operationally difficult and where for-profit business models can be a little challenging."
Terra hosts several clinical decision pipelines, though Bernard said that the Broad has not yet partnered with any commercial clinical decision support companies or vendors of laboratory information systems or electronic health records to deliver information to the point of care. "We are thinking about it," she added, noting that the Broad does return results to participants of the All of Us program.
While GATK is mature, updates continue. Current development is more focused on structural and copy number variants than earlier work, which emphasized SNPs and indels, according to Bernard. Machine learning is enabling work on these newer, more difficult computational challenges.
The DSP is having to adapt as the field of genomics evolves. "In the past, a lot of what we did was genomic data. Now it's a lot more functional genomics and a lot more clinical data," she said.
Bernard said she believes everyone should have their genomes sequenced, though she acknowledged that the question of who pays for it is unresolved. "I had two kids in the last five years," she said. "I think it would have been fantastic if I had gotten carrier screening before I had my kids and had my kids sequenced as newborns."
Generative artificial intelligence has become a buzzword, largely thanks to the hype around ChatGPT, but Bernard said that "the intersection of AI and [machine learning] and biology is a really important area," one that the Broad has been thinking about for some time.
The Eric and Wendy Schmidt Center and two DSP partnerships with Microsoft have brought the Broad into those arenas.
Bernard said that AI takes much of the manual labor out of structuring clinical records to prepare data for research, and the Broad has been experimenting with GPT-4, a multimodal large language model developed by OpenAI. "These large language models are very, very good at structuring data," she said.
"Used correctly, there's huge potential to leverage these algorithms to structure data, find data, and do really interesting research."
From PCR to long-read sequencing
Danielle Perrin, senior director of administration and operations for the Broad's Genomics Platform, said that the institute has processed more than 37 million COVID-19 tests to date. At one point in 2021, the sequencing center handled about 10 percent of all PCR tests run in the US.
The Genomics Platform, a few blocks from Bernard's Kendall Square office, will cease COVID-19 PCR testing by June 30 and has already decommissioned dozens of PCR instruments, Perrin said during a tour of that building.
While the sequencing center is filled with Illumina hardware, it also has rooms dedicated to long-read Pacific Biosciences and Oxford Nanopore sequencers. The Broad is also serving as a test site for the new Ultima Genomics UG100 for single-cell RNA sequencing. Perrin said that the institute is experimenting with blended genome-exome analysis on its long-read sequencers.
While the DSP is not the only place Broad sequencing data is analyzed, Bernard said that most of the long-read data her group is currently processing is for All of Us, simply because that is where the demand is coming from.
Though the sequencing center had been analyzing a lot of SARS-CoV-2 samples, the pandemic did not cause a data explosion because the genome of SARS-CoV-2 is so much smaller than a human genome.
The Broad had been working with the Bill and Melinda Gates Foundation to study epidemic preparedness with Terra, but the rapid onset of the pandemic in early 2020 accelerated that work.
"I think the part that was more [novel] to us was the different kinds of users of the platform," such as officials from public health labs, Bernard said.