This story has been updated with additional detail about the Uganda Genome Resource.
CHICAGO – The proposed Three Million African Genomes (3MAG) project has not yet reached the stage of a formal plan, but bioinformatics and data-control experts are urging organizers to tread carefully when it comes to protecting a potentially massive and highly valuable store of genomic and phenotypic data.
"There's a very fundamental privacy risk. No one can make it go away [because] our genomes define us," said Nicki Tiffin, a computational biologist at the University of Cape Town in South Africa who specializes in ethical issues surrounding research and re-use of genetic data from African populations.
"I think that this has to be planned very well because this is 3 million individuals," added Segun Fatumo, a computational geneticist at the Medical Research Council/Uganda Virus Research Institute (MRC/UVRI), and the Entebbe, Uganda, campus of the London School of Hygiene & Tropical Medicine. "There has to be some discussion … about the ethics and privacy."
Tiffin was the lead consultant on a white paper the Wellcome Trust produced earlier this year for a group called the African Population Cohorts Consortium. That document spelled out a vision for developing and harnessing infrastructure for population research across Africa.
"I'm absolutely in support of aspirational targets for genomic science on the continent," Tiffin said. "But we also need to be practical about the best way to do it."
Indeed, 3MAG is purely aspirational at this point. Ambroise Wonkam, a medical geneticist at the University of Cape Town who is leading the effort to create 3MAG, confirmed that the idea is still an unrealized vision and "not yet a living project."
In an email, Wonkam said that he and his colleagues have had discussions with stakeholders including the African Society of Human Genetics, the Human Heredity and Health in Africa (H3Africa) initiative, the African Academy of Sciences, Genomics England, the International Common Disease Alliance, the International HundredK+ Cohorts Consortium, and some potential industry partners. They also have had preliminary talks with potential funders like the US National Institutes of Health and the UK-based Wellcome Trust, both of which support H3Africa.
Wonkam earlier this year estimated that the price tag for 3MAG would run about $450 million per year for a decade, including the cost of developing biorepositories and data infrastructure. Organizers are aiming to sequence about 300,000 people a year over that time.
Rina Shainski, cofounder and chair of Duality Technologies, a Newark, New Jersey-based maker of data-encryption technology that enables secure collaboration with sensitive data, said that for such an undertaking, she can envision some information like certain variants and medical conditions being stored under encryption.
An institution interested in building a cohort from the database might conduct some "data window shopping," according to Shainski, by running some basic analytics to look for correlations that might help their research.
"The problem today with that approach is that when you do data window shopping, you end up getting the data," Shainski said. "There is no such way as partially looking or partially touching the data. You either get it or not, and once you get it, the person who contributed it basically doesn't have any control on what's being done."
She said that with a pan-African dataset, it would be imperative to keep personal data under lock and key rather than ceding control of it.
In trying to replicate a study to support a hypothesis, researchers typically have to spend months signing data-sharing and collaboration agreements without even knowing if the dataset contains the associations they seek.
"That makes us very cautious about which hypotheses we follow up on, because it's just a big-time investment to figure out legally how to share the data to follow up this hypothesis," said Alexander Gusev, a quantitative geneticist at Boston's Dana-Farber Cancer Institute and Harvard Medical School.
Gusev said it would be useful if researchers could send another institution a list of the associations they are interested in and have that institution query its own database before anyone enters into a data-access agreement. The technology is not there yet, but it is moving in that direction, he said.
Duality relies on homomorphic encryption, a method of encoding data as ciphertext to allow computation without decryption. In other words, computation proceeds while data remains encrypted.
Duality and Dana-Farber demonstrated the efficacy of this approach last year in a paper published in the Proceedings of the National Academy of Sciences.
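To make the idea of computing on ciphertext concrete, here is a minimal sketch in Python. It is not Duality's method, which the article does not describe. It uses the textbook Paillier cryptosystem, which supports only addition on encrypted values, with toy-sized primes that offer no real security. The scenario (two sites pooling carrier counts for a variant) and all function names are assumptions made for illustration.

```python
import math
import secrets

# Toy Paillier cryptosystem: additively homomorphic, illustration only.
# The primes used below are far too small for real use.

def keygen(p, q):
    """Return (public modulus n, private key) for primes p and q."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    mu = pow(lam, -1, n)  # modular inverse; valid because g = n + 1
    return n, (lam, mu)

def encrypt(n, m):
    """Encrypt integer m (0 <= m < n) under public modulus n."""
    n2 = n * n
    while True:
        r = secrets.randbelow(n - 1) + 1  # random value in [1, n-1]
        if math.gcd(r, n) == 1:
            break
    return (pow(n + 1, m, n2) * pow(r, n, n2)) % n2

def decrypt(n, priv, c):
    """Recover the plaintext from ciphertext c with the private key."""
    lam, mu = priv
    u = pow(c, lam, n * n)
    return ((u - 1) // n * mu) % n

def add_encrypted(n, c1, c2):
    """Multiply ciphertexts, which adds the underlying plaintexts."""
    return (c1 * c2) % (n * n)

# Hypothetical scenario: two sites each encrypt a carrier count.
n, priv = keygen(1009, 1013)
site_a = encrypt(n, 17)
site_b = encrypt(n, 25)

# An analyst without the private key combines the two ciphertexts.
pooled = add_encrypted(n, site_a, site_b)

# Only the key holder can read the pooled total.
print(decrypt(n, priv, pooled))  # 42
```

The analyst in this sketch never sees either site's count, only the encrypted values and an encrypted sum. Richer analyses of the kind the article describes, such as association tests, would need schemes that support more than addition.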
Since 3MAG plans have not been released, it is unclear whether this type of technology might be part of the program. Still, its organizers can learn from previous large-scale genomics efforts.
Gusev noted that population studies involving Native American groups have been "problematic" because individuals did not consent to having their personal data used by biomedical researchers. Eventually, tribes got "very protective" of data, making it difficult to study these populations.
"I think we really need to get it right the first time when doing similar kinds of analysis in countries in Africa or other places where there hasn't been as strong an established collaboration between the scientific community and the local people who are actually providing the data," Gusev said. "I think this is also a case where we can't really play it by ear because there's a high risk that these populations that we're already underserving are going to opt out of the whole process if we screw it up."
In Africa, Fatumo is involved in two projects that could inform a large-scale, pan-African genomics research program.
Fatumo leads a team at MRC/UVRI that has built the Uganda Genome Resource, featuring genetic and phenotypic data generated from a rural community of about 22,000 people. A study describing the genetics of 6,400 of those people was published in Cell in 2019.
Working with Nigerian genomics research, services, and development company 54Gene, Fatumo and colleagues have sequenced the whole genomes of thousands of Ugandans, then performed genome-wide association studies of cardiometabolic traits.
When the Ugandan team published its research, it made 2,000 whole-genome sequences publicly available on the European Genome-phenome Archive. While a data-access committee controls who gets to see the full dataset, the Uganda Genome Resource includes a set of GWAS summary statistics that is available through a GWAS catalog.
"What that means is that nobody needs to ask me anything. They go there, they download [the summary dataset], and they use it for anything they want to use it for," Fatumo said.
Fatumo now is involved in a much larger project in his native Nigeria, serving as one of the leads for the Non-Communicable Diseases Genetic Heritage Study (NCD-GHS), a public-private research consortium started in early 2020 to gather genomic, clinical, and other data from 100,000 citizens that will initially be used to study the genetic basis of various noncommunicable diseases. 54Gene is funding that effort.
Fatumo noted that Nigeria — the most populous country in Africa — has about 500 languages and more than 300 ethnic groups, representing the kind of diversity that a pan-African genomics project would encounter.
With knowledge of the Uganda project as well as other genomics efforts in smaller African populations, Fatumo said that the NCD-GHS consortium has learned the importance of community engagement and transparency when it comes to explaining data usage.
Once funding and a strategic plan are in place, Fatumo recommends that 3MAG organizers look across the continent to see where others have already obtained genetic samples. Africa has more than 3,000 ethnic groups, who collectively speak at least 2,100 languages, and some are better represented than others.
"If you want to do 3 million people, you should strategically get samples from every ethnic group in Africa," Fatumo said. That works out to an average of 1,000 from each ethnicity, though it might make sense to collect more specimens from larger groups.
Since many populations and communities are isolated, 3MAG organizers would do well to develop transparent policies for community engagement, according to Fatumo.
For the Uganda Genome Resource, MRC/UVRI and 54Gene created engagement committees that included community leaders like clergy and academicians. "You engage the [leaders] so those people give you access to the community every time," Fatumo said.
When the Uganda Genome Resource project returned blood test results for cholesterol, type 2 diabetes, hypertension, and infectious diseases to participants, individuals did not receive full genetic reports, but rather explanations of conditions they might have or be predisposed to developing. The program also helped connect people with the proper care.
"That is kind of the support that they want," Fatumo said. "When they feel supported, they feel part of what you are doing."
Tiffin, who will be taking a professorship at the South African National Bioinformatics Institute at the University of the Western Cape in October, agreed that relationship-building is key to trust.
"To mitigate the privacy risks … there has to be a proper community engagement component to the processes and a relationship with the participants and the communities from which they come," she said.
"We have to be very cautious of how we report for geographically distinct and genetically distinct populations in Africa to make sure that we're not putting labels on population groups that can be damaging," Tiffin said.
She stated a personal belief that 3MAG likely will be "embedded within existing studies where there are relationships already with communities and with populations" to ensure that people understand what participation entails.
"With informed consent, we often forget the informed part," Tiffin said.
"I think it comes down to about being absolutely scrupulous in our community engagement programs and ensuring that we do the best possible science and ensuring that the benefits actually come back to the people on the continent who participated in these studies," Tiffin added.
She said that all too often, researchers from outside Africa come in and "set up a completely independent data economy or ecosystem and then harvest all the health data from their encounters with participants," but never link the data back to national and regional health services.
"They don't build the electronic health record infrastructure for the health services, and then at the end of the study, all those data die with the study," Tiffin said.
A more equitable and ethical way of approaching research would be to integrate more closely with health services rather than just piggybacking on care delivery systems. This, Tiffin said, would increase access to care and potentially improve outcomes, as well.
Tiffin also suggested that genotyping is necessary but not sufficient on its own. "We are, of course, all interested in studies of ancestry and genetic origins, etc., but those don't actually help sick people in Africa necessarily," she said.
"What we really need to do with our 3 million genomes are 3 million detailed longitudinal phenotypes. That will make the research beneficial to people on the ground," Tiffin said.
"Obviously, we want to use the Three Million [African] Genomes initiative to be able to better understand health, well-being, and disease, and understand the genomic and genetic contributors to that. But we can't do that if we don't have the other side of the coin, which is the phenotype data," she said.
Wonkam said that the next step will be to hold an in-person meeting — if the COVID-19 situation allows it — to draft a white paper outlining a formal plan for 3MAG. "There is a lot of enthusiasm around 3MAG, but we believe that the planning is key and we should take the necessary time, probably the next 12 months, to finalize consultations and set a trusted governance process for the project," Wonkam wrote.