Researchers from five institutions participating in the Electronic Medical Records and Genomics Network, or eMERGE, recently published a paper in BMC Medical Genomics that provides an update of the consortium's progress since it launched in 2007.
The network was launched to conduct genome-wide association studies in participants with phenotypes and environmental information derived from electronic medical records.
The first phase of the project kicked off in 2007 with funding from the National Human Genome Research Institute and the National Institute of General Medical Sciences and involved five participating sites — the University of Washington, Marshfield Clinic, Mayo Clinic, Northwestern University, and Vanderbilt University — and a coordinating center at Vanderbilt.
Now at the tail end of the first phase, the researchers report that so far they have conducted GWAS genotyping for cataract and high-density lipoprotein cholesterol, dementia, electrocardiographic QRS duration, peripheral arterial disease, and type 2 diabetes.
The authors also said that they are performing a sixth GWAS for resistant hypertension with 2,000 additional samples selected from all five sites.
Last summer, NHGRI said that it planned to provide more than $25 million for a second phase of the project — $22 million over four years to eight study investigators and $3.5 million to a single coordinating center. Applications were due last November and the anticipated start date of the awards is July 1, 2011 (BI 07/23/2010).
This week, BioInform spoke with Catherine McCarty, a senior research scientist at Marshfield's Center for Human Genetics and one of the co-authors on the paper, about the challenges of sharing data culled from multiple systems and some informatics lessons learned along the way. What follows is an edited version of the conversation.
Can you give me a broad picture of the current phase of the project?
We are still in the first phase. [The project] was initially funded with the intention of funding for four years. But the network has been so successful that they decided to reissue and ... I assume all sites are competing to be renewed and then to bring some additional sites in as well.
The RFA for the renewal called for sites that might have some more racial and ethnic diversity or perhaps a pediatric population.
The paper talks in some detail about each site's efforts to create appropriate consent forms and establish data-sharing practices. Let's talk a little bit about what the process was like as well as any challenges that came up.
I'll start with two different issues that ran in parallel. We have a couple of different levels of data sharing. One is that we all deposit defined datasets into dbGaP, and we know that other sites outside the network have applied to access those datasets.
The other is sharing within the network. We wanted to develop early on a broad mass data sharing agreement to easily share data amongst the sites and the central administrative coordinating center at Vanderbilt. That actually took a while to happen because any time a lawyer wants to make a change, [he or she] has to go back to all the other sites who have already approved it to see if they approve it. Hopefully we will have a really simple data-sharing agreement ... to share with the wider research community.
The other is an informed consent document. One of the working groups within [the] consent community consultation group came up with model consent language, and it's available on the web. [We] worked through it to try to make the language very simple and broad enough to apply to groups wanting to set up biobanks.
It's not just the data that we share. We share algorithms [that] use a combination of [things like] diagnostic codes, procedures, and lab values to identify a phenotype. We share them internally initially to see that these algorithms transfer across medical records. Once we find that they do and they validate high enough — we look for a predictive value of 95 percent ideally — then we make them more widely available to the research community.
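As a rough illustration of the kind of phenotype algorithm McCarty describes, the Python sketch below combines diagnostic codes, procedure codes, and a lab value into a single rule, then computes the positive predictive value of the rule against a manually reviewed set. The record layout, the code names, the lab name, and the threshold are all hypothetical placeholders, not eMERGE criteria; only the 95 percent target comes from the interview.

```python
from dataclasses import dataclass, field


@dataclass
class PatientRecord:
    """Hypothetical, simplified view of one patient's EMR data."""
    patient_id: str
    diagnosis_codes: set = field(default_factory=set)
    procedure_codes: set = field(default_factory=set)
    lab_values: dict = field(default_factory=dict)


# Illustrative placeholders only; these are not real eMERGE criteria.
DX_CODES = {"DX_A"}
PROC_CODES = {"PROC_A"}
LAB_NAME = "lab_a"
LAB_THRESHOLD = 6.5

# Validation target mentioned in the interview.
TARGET_PPV = 0.95


def meets_phenotype(rec: PatientRecord) -> bool:
    """Require a diagnosis code plus supporting evidence (procedure or lab)."""
    has_dx = bool(rec.diagnosis_codes & DX_CODES)
    has_proc = bool(rec.procedure_codes & PROC_CODES)
    lab = rec.lab_values.get(LAB_NAME)
    has_lab = lab is not None and lab >= LAB_THRESHOLD
    return has_dx and (has_proc or has_lab)


def positive_predictive_value(flagged_ids, confirmed_ids) -> float:
    """Fraction of algorithm-flagged patients confirmed by manual review."""
    flagged = set(flagged_ids)
    if not flagged:
        raise ValueError("no patients were flagged by the algorithm")
    return len(flagged & set(confirmed_ids)) / len(flagged)


records = [
    PatientRecord("p1", {"DX_A"}, {"PROC_A"}, {}),
    PatientRecord("p2", {"DX_A"}, set(), {"lab_a": 7.1}),
    PatientRecord("p3", {"DX_A"}, set(), {"lab_a": 5.0}),  # code only, no support
    PatientRecord("p4", set(), {"PROC_A"}, {}),            # no diagnosis code
]
flagged = [r.patient_id for r in records if meets_phenotype(r)]
ppv = positive_predictive_value(flagged, confirmed_ids={"p1"})
print(flagged, ppv, ppv >= TARGET_PPV)
```

In this toy data the rule flags two patients, chart review confirms one, and the resulting predictive value falls well short of the target, which is the signal that an algorithm needs more work before it is shared more widely.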
What are some of the challenges of making sure that these algorithms work across the different sites and the different EMRs?
Some sites will be capturing information in a coded format where others aren't. We are really finding that for almost every phenotype you need some sort of natural language processing tool to pull things out of the notes. You need more or less depending on your electronic medical record and whether or not things are collected in a coded format.
The other thing that we are trying to do through the network is [to] identify areas where we would want to push the developers (either the vendor, if it is a purchased EMR, or the development staff at sites like ours and Vanderbilt, which have internally developed EMRs) to try to capture some things in a more coded format. As an example, I do ophthalmic research, and it's notorious for not capturing things in a coded format because [we] tend to draw pictures. They may be captured electronically, but not in a way that is easy to get back out. So we are working with our information systems here to identify areas that could be captured more easily in a coded format so that it's easy to get back out again, both for clinical and research purposes.
You write in the paper that the biggest lessons learned to date have come from the informatics arena. Can you elaborate on what that means?
If you just run a query on a database where you've got an electronic medical record attached to a biobank and look for the people enrolled in the biobank to see how many of them carry a particular diagnostic code, you can really cut the number in half, ultimately, of how many people truly have that diagnosis and there are lots of reasons for that. These codes were initially set up for billing purposes, not for research, obviously, and really not even for clinical care, because with clinical information what they really need [is] most likely in the notes.
So somebody might be assessed for a certain condition. Well, that turns up as a diagnostic code when they ultimately may not have [that condition] and further tests will show that they don't. In most systems, the physicians go back and [try to] pull that diagnosis out but it doesn't usually come out, that we find. As I said, you almost always need some sort of natural language processing.
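To make the point about notes concrete, here is a deliberately minimal Python sketch of a negation-aware keyword check over free text. It is only a toy stand-in for the natural language processing tools McCarty mentions: the cue list and the function name `affirmed_mention` are assumptions made for illustration, and real clinical NLP handles far more context than a per-sentence cue match.

```python
import re

# Assumed, non-exhaustive cues suggesting that a mention is negated or
# only under evaluation rather than an affirmed diagnosis.
NON_AFFIRMING_CUES = re.compile(
    r"\b(?:no|not|denies|without|negative for|ruled out|rule out|"
    r"assessed for|evaluated for)\b",
    re.IGNORECASE,
)


def affirmed_mention(note: str, term: str) -> bool:
    """Return True if any sentence mentions `term` with no non-affirming cue."""
    for sentence in re.split(r"[.;\n]+", note):
        if term.lower() in sentence.lower() and not NON_AFFIRMING_CUES.search(sentence):
            return True
    return False


print(affirmed_mention("Cataract in the right eye, confirmed on exam.", "cataract"))
print(affirmed_mention("Patient assessed for cataract. No cataract found.", "cataract"))
```

A patient whose record carries a diagnostic code but whose notes only ever say "assessed for" or "no" would be dropped by a check like this, which is one way a code-only count can shrink once the notes are read.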
The paper also suggests that there are some challenges with developing algorithms to define phenotype from EMRs. Why is that the case?
Two things actually. Ultimately [an] EMR is really good if you can get in and get all the information that you need and that includes going in and looking at the notes.
The other challenge has nothing to do with the electronic [aspect]. It has more to do with the health system itself. [For example, in] most primary care visits, just because there is no indication that somebody has a given disease doesn't mean that they don't have it. You have to look to see that they would have had a screening test recently enough to rule out the possibility that they have whatever it is that you are looking for. So you get challenges in identifying people both with and without the disease.
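The control-selection logic described here can be sketched in a few lines of Python. The idea is that a patient without a diagnosis counts as a control only if a negative screening test is recent enough; otherwise the status is unknown. The two-year window, the function name `classify`, and the three labels are assumptions for illustration, since the interview does not give a specific window.

```python
from datetime import date, timedelta
from typing import Optional

# Assumed screening window; the appropriate value depends on the phenotype.
SCREENING_WINDOW = timedelta(days=2 * 365)


def classify(
    has_confirmed_diagnosis: bool,
    last_negative_screen: Optional[date],
    as_of: date,
) -> str:
    """Label a patient as 'case', 'control', or 'unknown'."""
    if has_confirmed_diagnosis:
        return "case"
    if last_negative_screen is not None:
        age = as_of - last_negative_screen
        # A future-dated screen is treated as invalid, hence the lower bound.
        if timedelta(0) <= age <= SCREENING_WINDOW:
            return "control"
    # Absence of a diagnosis alone is not evidence of absence of disease.
    return "unknown"


today = date(2011, 1, 1)
print(classify(True, None, today))
print(classify(False, date(2010, 6, 1), today))
print(classify(False, date(2005, 1, 1), today))
print(classify(False, None, today))
```

Keeping an explicit "unknown" group, rather than folding it into the controls, avoids diluting the control set with undiagnosed cases.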
Let's talk about the sixth GWAS that's mentioned in the paper, which uses data from 2,000 samples from across the network.
That one's being genotyped now. We hope to have some data next month. We did two cross-network phenotypes. One that we were able to do with the existing genotype data [because we had] enough cases and controls with what had already been genotyped and then one that was going to require additional samples to be genotyped to have sufficient power.
Hypothyroidism is the one that is just using available genotype data and then adding the phenotypes in, and resistant hypertension [is the other]. So [in cases] where people are on three or more meds and not responding, we need additional cases to be genotyped across the network.
For quality assurance purposes, all the controls have been genotyped, some of the cases as well, and some of them are re-genotyped just to make sure there wasn't some bias introduced by having all of the new cases done at one time and the other ones before.
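One simple check that could be run on re-genotyped samples is call concordance between the original and repeat runs. The Python sketch below is an assumption about how such a check might look, not a description of the network's actual quality-assurance pipeline; the marker identifiers and the function name `genotype_concordance` are hypothetical.

```python
def genotype_concordance(run_a: dict, run_b: dict) -> float:
    """Fraction of markers with matching calls across two genotyping runs.

    Each dict maps a marker id to a two-letter genotype call such as "AG",
    or to None for a failed call. Allele order is ignored, so "AG" == "GA".
    """
    shared = [
        m for m in run_a
        if m in run_b and run_a[m] is not None and run_b[m] is not None
    ]
    if not shared:
        raise ValueError("no markers were called in both runs")
    matches = sum(1 for m in shared if sorted(run_a[m]) == sorted(run_b[m]))
    return matches / len(shared)


original = {"m1": "AG", "m2": "CC", "m3": "TT", "m4": None}
repeat = {"m1": "GA", "m2": "CT", "m3": "TT", "m4": "AA"}
print(genotype_concordance(original, repeat))
```

Low concordance for samples run in different batches would be one hint that the timing of genotyping introduced a bias.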
Is the combined study a test run for your data sharing practices and phenotype algorithms?
It is. That’s exciting, because there are lots of groups that have biobanks and they may not consider them biobanks. Every healthcare institution that has a laboratory facility that gets tumor samples, [for example, has to keep them] for a certain amount of time. They may not necessarily have the controls but they certainly have the cases and the potential for research is enormous. It's potentially such a cost effective way to do this research.
Any challenges there?
The EMR is great but it can take some time. We are finding it really depends on the specialty as well and how far back an EMR goes. As an example, our EMR is internally developed. It's existed since the late 60s, [but] we only went fully and completely electronic [a few] years ago. Prior to that, if you want to look at, [for example,] resistant hypertension in an older person ... you are going to have to go back and look at a print record prior to a certain date. So all of this research is going to be easier for our kids because it would have been electronic all along.
Finally, in the paper you write that eMERGE is the first step in developing data-driven approaches to incorporate genomic information into routine clinical care. What are some of the other steps?
Step number one has to be that we all end with having electronic health data and that they are linked somehow. I alluded to this before when I mentioned that it can be really hard to identify controls. If someone is doing research at a cancer institute, it's really hard to identify the controls because the only information that they have is related to their cancer care and it's hard to access information from other institutions. We need ultimately to make this work, and not just on the research side but [also] for patient care. This information somehow has got to be linked for a physician to be able to use and access.
We are developing these informatics tools now on the research side but then ultimately we are going to need a lot of informatics tools on the patient care delivery side for the physicians. [For example, if] we identify an algorithm that includes age, height, weight, gender, what a person eats, the level of radon in their home, and 20 genetic markers, that’s a huge amount of information to try and process. [We] are going to need the informatics tools to use all that information and provide it back to the physician and patient in some sort of a useful, digestible format.
All of our data has to be electronic before that and I don’t know what the latest statistics are but less than half of us have our health data stored electronically. It's really striking.