CHICAGO – The latest of Seven Bridges Genomics' efforts to diversify reference genomes is its largest and perhaps most complex to date: an attempt to represent the Brazilian population.
The Charlestown, Massachusetts-based bioinformatics company recently joined with the University of São Paulo (USP), the Associação Genomas Brasil (Brazil Genome Association), and the Google Cloud Platform to launch DNA do Brasil (DNABr), a project to build a reference genome representing the ethnically diverse Brazilian population.
Brazil has a population of more than 210 million, with no majority race. People of mixed race make up about 43 percent of the total. The country's history includes the enslavement of Africans, immigration from Europe and, to a lesser extent, from Asia, a small Indigenous population, and extensive admixture.
"It's a huge country with very different demographic histories," said DNABr researcher Tábita Hünemeier. "I think the main idea [of DNABr] is trying to unravel this enormous variation that we have in Brazil."
Yet reference genomes have tended to be very Eurocentric or Asiacentric. Moreover, European reference genomes tend to represent only small parts of Europe, typically excluding Portugal, Brazil's former colonial power, as well as other Southern European countries like Italy, from which many present-day Brazilians are descended. "There's a lot of underrepresented populations," said Hünemeier, a population and evolutionary geneticist who just started a two-year visiting professorship at the Institute of Evolutionary Biology in Barcelona, Spain.
Hünemeier said that the reference genome is an effort to understand the genetic impact of all this diversity in a large country.
"We have not just one Brazilian population. There are several Brazilian populations … that we have to study," Hünemeier said. "You have to know the demographic history to try to tackle this issue."
Seven Bridges has previously built reference genomes for populations not represented in Genome Reference Consortium Human Build 38 (GRCh38) and other common references. Not only does GRCh38 exclude people of color, it also represents an individual rather than a population, and a population-level reference is what the Brazil project is trying to produce.
The end goal is to build a more representative reference genome to improve clinical applications such as diagnostics and precision drug development, according to Serhat Tetikol, product director of computational biology at Seven Bridges. "The issue is to be able to get your raw data to an actionable set of mutations or variants or differences in the genome that can be used in any genomic-based application," Tetikol said.
"We really see this as leading an important step forward to applying precision medicine for individuals of all ancestries," added Seven Bridges CSO Brandi Davis-Dusenbery.
The USP-led DNABr project aims to sequence 15,000 whole genomes from blood samples taken from several longitudinal studies around the country. The data is being processed and stored on the Google Cloud, and Seven Bridges will apply its Graf genome analysis software to construct a reference genome that can grow as new genomes are added to the research cohort.
Graf is designed to view human reference genomes as graphs rather than linear haploid DNA sequences.
Tetikol, who leads Graf development for Seven Bridges, said that a graph analysis can be viewed as a set of branches originating from the genomic sequence and joining back together at a different point that indicates a biomarker. "This gets very complicated," he said, and it requires bioinformatics algorithms to make the raw data actionable.
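The branching structure Tetikol describes can be illustrated with a toy example. The following is a minimal sketch only, not Graf's actual data model or algorithm: the node names, sequences, and the `enumerate_haplotypes` helper are all hypothetical and chosen purely for illustration. It shows a short shared sequence that splits into three branches (a reference base, a single-base substitution, and a deletion that skips the base entirely) before rejoining at a common downstream node.

```python
from typing import Dict, List

# Toy graph reference. Every node ID and sequence below is invented
# for illustration; none of it comes from Graf or from real data.
nodes: Dict[str, str] = {
    "n1": "ACGT",  # shared upstream sequence
    "n2": "A",     # reference allele
    "n3": "G",     # alternate allele (substitution branch)
    "n4": "TTC",   # shared downstream sequence where branches rejoin
}

# Directed edges. The n1 -> n4 edge models a deletion that bypasses
# the variable base altogether.
edges: Dict[str, List[str]] = {
    "n1": ["n2", "n3", "n4"],
    "n2": ["n4"],
    "n3": ["n4"],
    "n4": [],
}


def enumerate_haplotypes(start: str) -> List[str]:
    """Spell out every sequence along a path from `start` to a sink node."""
    successors = edges[start]
    if not successors:
        return [nodes[start]]
    spelled = []
    for nxt in successors:
        for tail in enumerate_haplotypes(nxt):
            spelled.append(nodes[start] + tail)
    return spelled


haplotypes = enumerate_haplotypes("n1")
for sequence in haplotypes:
    print(sequence)
```

A linear reference would hold only one of these paths. A graph keeps all of them, so a sequencing read carrying the alternate base or the deletion can still match a path exactly. Real graphs are far too large to enumerate path by path, which is part of why, as Tetikol noted, dedicated algorithms are needed.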
Seven Bridges' offerings include the year-old Graf Population Solution, a collection of workflows, services, and graph references that facilitate large-scale population genomics studies. The product supplements the company's existing Graf Germline Variant Detection workflow and Pan-Genome Reference offerings with a next-generation sequencing analysis pipeline and graph references for five major racial groups. These are intended to provide more diverse and accurate read alignment and variant discovery than is available with standard Eurocentric reference genomes.
A graph strategy can produce "dramatic improvements" over linear analysis when looking at insertions, deletions, and structural variants, according to Davis-Dusenbery.
"Our collaborators realized that [the linear method] is indeed underperforming," Tetikol said. "It's clear, and they were looking for better alternative approaches to analyzing their sequencing data."
In a study uploaded to the bioRxiv preprint server last year, Seven Bridges reported a fourfold increase in detection of structural variants with Graf, even at the population level, using a pan-African dataset.
"Usually, structural variants discovery has to be done on an individual level, and when you try to apply it to a large population, you run into consistency issues with detection," Davis-Dusenbery said. "But when you have a representative graph in place, you can do a better job and you can extend that to a large number of samples."
The DNABr partners have sequenced and analyzed the first 3,000 genomes for their project with traditional linear analysis tools such as the Genome Analysis Toolkit (GATK).
Lygia da Veiga Pereira, a USP geneticist who is chief of the National Laboratory of Embryonic Stem Cells, known by its Portuguese acronym, LaNCE, said that DNABr has funding to analyze only the first 4,000 genomes, a step that should be complete within a month. Half of that money came from Brazil's Ministry of Health and the rest from a donation by clinical diagnostics giant Diagnósticos da América (DASA), which is headquartered near São Paulo.
Pereira said that the Ministry of Health recently approved funding for 6,000 additional genomes, though the money likely will not come in until the end of the year. "We won't be able to start further sequencing until the beginning of 2023," she said. Funding is still not available for the final 5,000 genomes of the planned 15,000.
In the meantime, DNABr will be applying Graf to the existing 3,000 sequences, plus the other 1,000 that are funded, to see how the Seven Bridges software performs against the linear analysis on this subset of the total cohort.
Tetikol said that Seven Bridges will have a first version of the reference done by midsummer, but validation will necessarily follow that step. "We need to do a measurement on how diverse [the graph reference is] and how divergent it is from the linear reference," he said.
Even the initial analysis has turned up some admixtures that are unique to Brazil. "It will be very interesting to see what's the phenotypic impact of these combinations," Pereira said. Of the graph-based approach, she added, "The hypothesis we're testing is whether that strategy allows us to find more variants than the traditional strategies" when studying an admixed population.
Hünemeier expects that the graph method will identify at least 20 percent more variants than the linear analysis.
Eventually, the Ministry of Health will take over all the data the DNABr project is generating, according to Pereira.