CHICAGO – In developing the data and research infrastructure for the National Institutes of Health's All of Us study, Vanderbilt University Medical Center (VUMC) has leaned heavily on the in-house BioVU bank of longitudinal medical records and DNA samples.
Recently, a multidisciplinary team at Vanderbilt, with assistance from researchers at Marshfield Clinic in Wisconsin and the University of Oklahoma Health Sciences Center, has been developing a formula for calculating phenotype risk scores. Joshua Denny, associate professor of biomedical informatics and medicine at VUMC and the lead for the All of Us data center, was corresponding author for a paper on this subject that appeared in Science in March 2018.
He said that another paper describing the methods of developing this risk score is in the revision stage. Denny's bioinformatics lab previously developed the phenome-wide association study (PheWAS) technique for linking deidentified genomic and electronic health records data with billing codes.
Denny heads the data and research center for All of Us, the Precision Medicine Initiative's program to collect, store, and disseminate health, genetic, lifestyle, and environmental data on at least 1 million US residents. Vanderbilt is the lead organization on the data side of the program, with considerable assistance from Alphabet life sciences subsidiary Verily and the Broad Institute.
Getting to this point has been a long journey fraught with trial and error to overcome shortcomings in technology infrastructure and off-the-shelf software, but the philosophy has not changed much since BioVU collected its first sample on Feb. 22, 2007, a date that Dan Roden, Vanderbilt's senior vice president for personalized medicine and a clinical pharmacologist and bioinformatician by trade, recalled instantly.
Plus, there is a key difference between BioVU and All of Us. BioVU's agreement with patients stipulates that Vanderbilt will not return samples or ever try to reidentify individuals, similar to the UK Biobank.
All of Us, however, "started with the premise that we are going to return information," Denny said. All of Us does scrub identities like BioVU does, but the program does allow for tissue donors to be recontacted for research purposes or to return samples.
"I think fundamentally it comes down to design of the program and what expectations are," Denny said. "Programs like BioVU are designed fundamentally from day one that it's a deidentified, nonrecontactable resource and there's no linkage backward."
BioVU uses Vanderbilt's Epic Systems EHR for phenotyping purposes. "I tell people who want to use BioVU that if it's in the electronic health record, we can get at it. And if it's not, we can't," Roden said.
Roden said that BioVU was designed so there would be no "back door" to reidentifying patients. "One of the things that we include in our data use agreement is that there will be no attempt to reidentify people from the phenotypic data that they get," he said.
BioVU initially was opt-out, but the National Institutes of Health changed its rules in 2015 to require that people opt in to any NIH-funded genomic research program, and Vanderbilt complied with that directive.
"All we changed was the mechanism that people used to sign up or to not sign up," Roden said. None of the back-end technology changed, nor did the commitment to safeguard patient anonymity.
For its genotype-phenotype matching operations, VUMC actually maintains two separate images of its Epic Systems electronic health record.
One, called the Research Derivative, is an "operational, identified data store," according to Denny, that supports precision medicine and use cases including clinical trial recruitment. In such cases, researchers might need to know if specific people or kinds of patients have appointments on a given day, so this database is updated as frequently as every day.
"That is an external database outside of Epic that is maintained in essentially a nightly update," Denny said. This is distinct from BioVU.
BioVU runs on the other version, called the Synthetic Derivative. This database is walled off from reidentification technology and is updated no more than three or four times a year.
"For clinical trial recruitment, you need to be pretty up to date, but for genetic research, people don't necessarily want the data changing all the time," Denny said.
Vanderbilt will only update the Synthetic Derivative when new datasets become available. "But you don't really need to [update for] new billing codes or medication entries that came in yesterday. We don't find that you need that kind of frequency of updating for a genetic analysis," Denny said.
Roden said that "versioning" is an issue with population studies. Users simply don't want data changing every time they log in. It becomes impractical to deal with updated cohort counts or medication lists on a daily or weekly basis.
BioVU initially updated this database as often as monthly. "We didn't realize number one, the logistics of that, and, number two, the way in which things like versioning does really get in the way," Roden said.
The computational process of full-text deidentification of notes can take as much as a full day, according to Denny.
"The other problem is that you have got to get all of your dataset kind of squared off and synched at a certain time point," Denny said. "You want to have your notes and your billing codes and your meds squared off to the same date range and as well as you can, so that's something else you can't do on a continuous basis."
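The "squaring off" Denny describes can be pictured with a minimal Python sketch. It is only an illustration, not Vanderbilt's pipeline: the table names, the record layout, and the `square_off` helper are all hypothetical. The idea is to pick one cutoff date that every source has reached, then trim each source to it so the notes, billing codes, and medications in a release cover the same date range.

```python
from datetime import date

def square_off(tables, cutoff=None):
    """Trim every table to one shared cutoff date (hypothetical helper).

    If no cutoff is given, use the earliest of the per-table latest dates,
    so that every source is fully loaded up to the cutoff.
    Assumes each table is a non-empty list of dicts with a "date" key.
    """
    if cutoff is None:
        cutoff = min(max(row["date"] for row in rows) for rows in tables.values())
    trimmed = {
        name: [row for row in rows if row["date"] <= cutoff]
        for name, rows in tables.items()
    }
    return cutoff, trimmed

# Made-up records purely for illustration.
tables = {
    "notes": [
        {"id": "n1", "date": date(2018, 1, 5)},
        {"id": "n2", "date": date(2018, 3, 20)},
    ],
    "billing_codes": [
        {"id": "b1", "date": date(2018, 2, 10)},
        {"id": "b2", "date": date(2018, 3, 1)},
    ],
    "medications": [
        {"id": "m1", "date": date(2018, 1, 15)},
        {"id": "m2", "date": date(2018, 3, 25)},
    ],
}

cutoff, snapshot = square_off(tables)
print(cutoff)  # the date that all three sources have reached
```

Here billing codes lag the other two sources, so the later note and medication entries are held back until the next release. That is one reason this kind of snapshot is cut periodically rather than continuously.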
When BioVU started, Vanderbilt informaticians wanted to update the Synthetic Derivative frequently, but the institution lacked the hardware infrastructure to do so, Denny said.
"Pre-cloud, EHRs in 2007 were not really architected to do cross-panel queries at all, not to mention searching all the notes," he said. Denny would write queries that could take as long as a week to run.
Eventually, Vanderbilt got the infrastructure to handle updates as frequently as weekly. "At some point, we probably were updating it at a monthly timeframe," Denny said, "but we worried more about the issues of versioning."
When building BioVU and going through the approval process, the philosophy among developers was that researchers generally did not need to know the identity of the patients. They just needed to get access to blood samples and tissue left over from other tests to extract DNA and then have the bioinformatics team link the genetic material to the EHR, Denny said.
"We wanted to also provide things like a level of protection against fears of what could happen if someone came and wanted to subpoena someone's DNA," he explained.
Not everyone who wants to be part of BioVU even makes it into the database. Some are randomly excluded, and others do not have enough excess biological material available.
"That was one of the ways in which we engender trust and one of the agreements we made virtually with our patient population and talking through our advisory boards," Denny said.
Roden said that Vanderbilt has learned since BioVU's inception that the patient population is "not a monolith," and each person has different preferences.
"There are patients who really want to donate their DNA but really don't want to ever be recontacted. And then there are people who are willing to participate who would like to get something back but they realize that as part of this program they won't," he said. "I do think that it's important to recognize that people's attitudes toward this kind of work vary tremendously."
On the technical side, EHRs were originally designed to handle billing and evolved into record stores, but all but the most specialized systems have never been optimized for research.
"There's lots of ways you can query the data in an operational EHR, but it's not always amenable to all the stuff you want to do," Denny said.
Notably, data standards for EHRs and other components of the Synthetic Derivative were severely lacking when BioVU started. Vanderbilt has since adopted a common data model, mapping laboratory and medication entries through common vocabularies.
Electronic prescribing was not common then, but it has become the norm since the federal government started offering care providers incentives for e-prescribing in 2008 and for EHRs three years later. Vanderbilt was an early adopter, so it had to build natural-language processing tools to extract medication data from medical records for BioVU and other research uses.
This mapping is particularly important for rare diseases.
"You may still want to be able to capture that case that happened a decade ago and the medication they were on, so it's useful to take these different data sources and put them together," Denny said.
"We started with a data format that looked a lot like our EHR, but put into a relational database," he added. Then they realized that they could start standardizing items on an external taxonomy.
This became necessary when Vanderbilt abandoned its hybrid of a home-grown and McKesson EHR in favor of an Epic installation in 2017.
"We could either try to make our old data look like Epic or we can try to make Epic look like our old data. Both of those are problematic," Denny said. "Or we can move them both to a third standard and try to do things that are standardized more for the community."
Vanderbilt thus chose the Observational Medical Outcomes Partnership (OMOP) Common Data Model for both BioVU and All of Us.
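The "third standard" approach can be sketched in a few lines of Python. This is a hedged illustration, not Vanderbilt's mapping code: both source layouts, the helper names, and the concept IDs in `CONCEPT_IDS` are invented placeholders. The output field names are loosely modeled on the OMOP `drug_exposure` table, and 0 is used for unmapped drugs, following the OMOP convention for "no matching concept."

```python
from datetime import date

# Placeholder vocabulary lookup; these are NOT real OMOP concept IDs.
CONCEPT_IDS = {"warfarin": 1001, "metformin": 1002}

def to_common(person_id, drug_name, start, source_value):
    """Build one row in the shared target shape."""
    return {
        "person_id": person_id,
        "drug_concept_id": CONCEPT_IDS.get(drug_name, 0),  # 0 = unmapped
        "drug_exposure_start_date": start,
        "drug_source_value": source_value,  # keep the original text for auditing
    }

def from_legacy(rec):
    """Map a hypothetical legacy-system record (free-text med, ISO date string)."""
    name = rec["med_text"].split()[0].lower()
    return to_common(rec["pid"], name, date.fromisoformat(rec["dt"]), rec["med_text"])

def from_new(rec):
    """Map a hypothetical newer-system record (structured drug, date object)."""
    drug = rec["drug"]["name"]
    return to_common(rec["patient"], drug.lower(), rec["ordered_on"], drug)

rows = [
    from_legacy({"pid": 7, "med_text": "Warfarin 5 mg daily", "dt": "2008-04-01"}),
    from_new({"patient": 7, "drug": {"name": "Warfarin"}, "ordered_on": date(2018, 6, 2)}),
]
```

Both rows end up with the same keys and the same drug concept, so a query written once against the shared shape covers records from either system, including a case from a decade earlier.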
"Different participants are going to have different views and different ways in which they want to contribute," Denny said. "I think there are advantages to a multiplicity of models and then it's incumbent upon us to figure out how to make sure that we can actually work together to do research because really we need to take advantage of a large population."
As of mid-September, the BioVU community had 281 active projects with 167 different principal investigators, and the projects tend to be widely encompassing. Denny said, for example, that most of eMERGE counts as a single project for this purpose.
BioVU projects involved about 870 users at Vanderbilt University Medical Center alone, and Denny said he was aware of at least 250 published articles that used BioVU data. The latter number could be much higher, as Vanderbilt lacks an efficient way to track all publications that relied on BioVU, according to Denny.
"You've got to skin your knee a few times in the process of learning to walk and run," Denny said. "I feel like BioVU and the synthetic derivative has been a great playground for us to learn a lot of the lessons that we can build into things like All of Us."