NEW YORK (GenomeWeb) – Geisinger Health System recently announced that it is launching a new biomedical and translational informatics program within its research division and named Marylyn Ritchie, a professor in the Pennsylvania State University's department of biochemistry and molecular biology, as director of the new program.
"This is a critically important recruitment for Geisinger, especially given the tremendous new data resource generated through our large-scale DNA sequencing project with Regeneron Pharmaceuticals aimed at sequencing at least 100,000 Geisinger patients in the next five years," David Ledbetter, executive vice president and chief scientific officer of Geisinger Health System, said in a statement. "[She] is a pioneer in the development of new methods to leverage large-scale genetics data with electronic health data for the discovery of new genetic associations that should rapidly advance personalized medicine and improve health outcomes."
Ritchie is a statistical and computational geneticist with extensive experience in all aspects of genetic epidemiology and translational bioinformatics as it relates to human genomics. She also has expertise in conducting genome- and phenome-wide association studies, using next-generation sequencing techniques, integrating multiple omics datasets, and developing data visualization approaches.
Ritchie is the principal investigator of the Pharmacogenomics Research Network Statistical Analysis Resource, and leads genomics data coordination efforts for the Electronic Health Records and Genomics Network. She is also director for Penn State's Center for Systems Genomics. In addition to working with Geisinger, she'll retain her research laboratory at Penn State.
This week, GenomeWeb sat down with Ritchie, who began in her new role on Jan. 1, to talk about balancing her roles at both institutions as well as first steps for Geisinger's new program. What follows is an edited version of the conversation.
According to Geisinger's announcement, you'll be maintaining your research laboratory at Penn State. How will you manage the two positions?
My primary appointment will continue to be at Penn State as a professor of biochemistry and molecular biology. Over the next year Penn State has agreed to allow me to focus my administrative, teaching, and service time to develop this program at Geisinger, while still actively running my ongoing research program and laboratory at Penn State. This will entail a bit of a commute, but it's manageable.
Does that mean that your appointment at Geisinger is only for a year?
We are still working out the details of how my time will be spent starting after the initial year. But the plan is that I will continue my leadership of the collaborative program with Geisinger. The real goal is to build a group with Geisinger and Penn State but we need to figure out what that is going to look like as we progress through this first year.
What's the need for a distinct informatics program at Geisinger?
They have an enormous electronic health system with [about] 3 million patient records and they are adding a lot of genomic data to this research dataset that can be used. Currently, in a collaborative project, they are sequencing about 1,000 whole exomes a week. And so we have a dataset of about 20,000 samples or so with full whole-exome sequence and electronic health records, which is a sandbox for a statistical geneticist to play in. There are so many research questions you can ask in datasets like that. We really need to establish a group that can mine that data and make genomic discoveries that can improve healthcare and better our understanding of medicine.
Are there specific disease areas that research efforts would focus on?
To start, there are some specific phenotypes that we are working on. [For example], things related to metabolic traits and obesity. Part of that is opportunistic because those are phenotypes [for which] we just have a lot of patients in the EHR that have been sequenced, but the [data] collection and sequencing is being done broadly on anyone who enrolls in the MyCode [Community Health Initiative] project. Because we don't have a specific focus on who is being sequenced, one of the main projects that we'll do to start out is a phenome-wide association study, or PheWAS. We'll take the exome sequence data and look very broadly across the EHR at ICD-9 codes, clinical lab variables, [and use] some electronic phenotyping-defined algorithms to find phenotypes, and use those to see where … interesting signals light up in the EHR in terms of phenotypes that look interesting and then we'll use that ... to tell us where to start doing more focused follow-up studies.
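As a rough illustration of the PheWAS idea Ritchie describes, the sketch below tests one variant's carrier status against every phenotype code in turn and then corrects for the number of phenotypes tested. It is a minimal Python sketch under stated assumptions, not Geisinger's or Ritchie's pipeline: the function names (`fisher_exact_two_sided`, `phewas`), the use of Fisher's exact test, the Bonferroni threshold, and the 20-patient toy data are all hypothetical choices made for illustration. Production analyses commonly use regression models with covariates instead.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d
    denom = comb(n, col1)

    def prob(x):
        # Hypergeometric probability of x carriers among the cases.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum every table that is no more likely than the observed one.
    total = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, total)

def phewas(carriers, phenotypes, alpha=0.05):
    """Scan all phenotype codes for association with carrier status.

    carriers:   dict of patient_id -> bool (carries the variant or not)
    phenotypes: dict of code -> set of patient_ids who have that code
    Returns (code, p_value, passes_bonferroni) tuples sorted by p-value.
    """
    threshold = alpha / len(phenotypes)  # Bonferroni correction (an assumption)
    results = []
    for code, cases in phenotypes.items():
        a = sum(1 for p, is_c in carriers.items() if is_c and p in cases)
        b = sum(1 for p, is_c in carriers.items() if is_c and p not in cases)
        c = sum(1 for p, is_c in carriers.items() if not is_c and p in cases)
        d = sum(1 for p, is_c in carriers.items() if not is_c and p not in cases)
        p_value = fisher_exact_two_sided(a, b, c, d)
        results.append((code, p_value, p_value < threshold))
    return sorted(results, key=lambda r: r[1])

# Toy data, invented for illustration: patients p0..p9 carry the variant.
carriers = {f"p{i}": i < 10 for i in range(20)}
phenotypes = {
    "278.00": {f"p{i}" for i in range(9)} | {"p10"},          # enriched in carriers
    "401.9": {f"p{i}" for i in (0, 1, 2, 3, 4, 10, 11, 12, 13, 14)},  # balanced
}
results = phewas(carriers, phenotypes)
for code, p_value, significant in results:
    print(code, f"{p_value:.4g}", significant)
```

In this toy example the first code is flagged and the second is not, which mirrors the workflow in the answer above: the broad scan nominates phenotypes, and the flagged ones become candidates for more focused follow-up studies.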
What kinds of data are available in the records that Geisinger has?
On the phenotypic side, it is full electronic health records. That means all of their visits to the doctor, billing codes, procedural codes, prescriptions that were made, [and] any labs that were run, both imaging labs as well as blood and other types of tests. Because Geisinger is an integrated health system, we also have information for most of the participants who are filling their prescriptions. And so, in terms of pharmacogenomics phenotypes, you can get the second level of confidence in the data because you can see not only that patients were prescribed the medication but also that they are actually filling those prescriptions and filling them multiple times. Patient adherence to medications is a huge challenge to pharmacogenetics and so that's something that we are really excited about, that we'll be able to address that.
On the genetics side, there are some GWAS datasets on about 3,000 to 4,000 subjects. There are exome chips run on about 18,000 samples, and whole exome sequencing on about 20,000, which is increasing at a rate of 1,000 samples per week. All of those samples will also get a GWAS chip run early this year — that's something that we are working on currently. There is also a very specific targeted study... within the MyCode project ... doing some whole genome sequencing as well. I think close to 100 samples have been done.
What do you hope to have done by the end of 2015?
I hope that we will have done some recruiting. We really need to get more clinical informatics expertise and translational bioinformatics expertise. I'll be working with the folks at Geisinger and Penn State to try and identify some people that would fit in that space. In terms of the science, we really want to focus on some of the PheWAS, so doing some data mining to tell us what phenotypes would be of interest for some of our more focused hypothesis testing studies. Because of the sample size and the fact that we are generating so many samples per week, I'm really excited that we are going to be able to do some more complex modeling of some of the phenotypes of interest. I'm really interested in looking for gene-gene interactions, gene-environment interactions, and pathway and network models. That's something that is usually limited by sample size but I think that in this dataset, we are going to have the samples that we need to do some of those more sophisticated modeling approaches. I would love to see us do that for a number of phenotypes ... and really look at the common variant effects, the rare variant effects, and whether there is a burden of rare variants for certain genes, also look for the epistasis signals, the gene-environment signals, and the pathway signals, and really start to characterize the genetic architecture of some of the phenotypes. I'm really excited to do some of these more sophisticated data mining, machine learning, and modeling strategies.
Earlier, you talked about Geisinger and Penn State collaborating on putting this program together. Are you in touch with any other potential partner institutions?
There are a number of places that we are talking to. None are too far down yet to mention by name but ... a lot of places would like to partner with us to collaborate and work with this data. [We] are trying to figure out the best strategies to do that. Geisinger has made a huge investment in creating this resource and so we want to see it used and make discoveries that have an impact on genomic medicine.
That actually touches on one of my questions. What sort of funding is available for the program?
There is some internal Geisinger funding. There is some funding through their collaboration with Regeneron. And then a couple of federal initiatives that they already have funded and a couple that investigators there have been applying for. But we are currently applying for additional funding. Penn State is also enthusiastic about recruiting people to work on this project.
You'll obviously be working with a lot of data so what sort of infrastructure do you have in place for handling it? Do you foresee having to beef up your hardware?
Right now, a lot of the infrastructure for the association analyses ... [is] in the cloud. And that is actually giving us a huge ramp up space because ... the opportunity to increase that above and beyond where we are will be a lot more readily available than building it all in house. We do have some in-house high-performance computing and data storage and that's certainly been what has kept things going through the GWAS data and for all the EHR data. But once we get into these PheWAS studies with the whole exome data, we really need cloud resources for both the computing and the storage. That's where we are emphasizing our infrastructure efforts right now.
What about software?
So far, most of what we've done has been leveraging open source and some of these packages are things that my lab has actually developed for doing these types of analyses. We are starting a lot of conversations with some commercial companies to see [whether] there are some things that they can provide that are not available in the open source community that we would need [or whether] that would make things go much faster and more efficiently if we have the commercial package. But we haven't committed to any of those yet.
A lot of the analyses [on these datasets] that you'd want to do are kind of one-off so it's very hard when you don't have access to the source to make the changes that you need to kind of tweak the analysis to exactly what you want. It's very hard with some commercial packages to get it to do exactly what you want uniquely for each question that you are asking.
What sort of analysis needs might be best served by commercial packages?
Finding computationally efficient strategies for doing some of the interaction modeling, the gene-gene and gene-environment interactions, and the pathway and network effects, those are going to be really important. We can do main effects analyses very routinely and robustly with the tools that we have and those are pretty fast even for whole-genome data. Once you get into the combinatorics of looking at sets of genes and sets of variants, you really need highly parallel, efficient algorithms to search that space effectively. We have tools that will do it that are open source but if there are commercial tools that will do it more efficiently and allow us to search more of the data that would be advantageous.
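To give a sense of the combinatorics mentioned in this answer, the short Python sketch below counts how many tests an exhaustive search would need as the interaction order grows. The function name `interaction_tests` and the figure of 20,000 variants are hypothetical and do not come from the interview; the point is only the growth rate.

```python
from math import comb

def interaction_tests(n_variants: int, order: int) -> int:
    """Number of distinct variant sets of a given size (order=2 means pairs)."""
    return comb(n_variants, order)

n = 20_000  # hypothetical count of variants or genes, not a figure from the interview
for order in (1, 2, 3):
    # order 1 is a main-effects scan; orders 2 and 3 are exhaustive interaction scans
    print(order, interaction_tests(n, order))
```

Each step up in order multiplies the number of tests by several thousand at this scale, which is why main-effects analyses stay fast while interaction searches call for highly parallel algorithms.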
Anything else you'd like to add as we wrap up?
I really want to emphasize that I am viewing this as a great opportunity for both Geisinger and Penn State. I'm trying to build a bridge between the two entities so that we can do more amazing science. Geisinger has this huge clinical resource and there's so much data available and Penn State has this huge data science team and lots of people in genomics. It's a perfect combination.