NEW YORK (GenomeWeb) – Since it became fully operational last year, the Regeneron Genetics Center, a wholly owned subsidiary of Regeneron Pharmaceuticals, has sequenced more than 30,000 exomes and is embarking on large-scale targeted sequencing studies, all in the pursuit of new drug targets and pharmacogenomic markers for existing products and candidates.
Early in 2014, Tarrytown, NY-based Regeneron announced that it had launched a human genetics initiative to help guide the discovery and development of new therapeutics and had built its own genome center, the RGC.
In partnership with a number of collaborators, in particular the Geisinger Health System of Pennsylvania, Regeneron planned to sequence de-identified samples from patient volunteers and look for associations between their genotypes and phenotypes, including diseases.
Besides the Geisinger collaboration, which will involve more than 100,000 patients with a variety of phenotypes and diseases, the RGC has formed partnerships for more focused family-based studies. With Columbia University Medical Center, Regeneron plans to study familial diseases, such as inherited cardiometabolic diseases, cancer predisposition, and rare genetic diseases.
Meanwhile, a partnership with the Clinic for Special Children centers on familial forms of pediatric disorders in Amish and Mennonite populations, and a collaboration with Baylor College of Medicine focuses on the function of disease genes discovered by the Baylor Center for Mendelian Genomics. In addition, Regeneron has partnered with the National Human Genome Research Institute's Undiagnosed Diseases Program and the SickKids Foundation of the Hospital for Sick Children in Toronto.
The dual approach of population-based sampling, exemplified by the Geisinger cohort, and more focused family-based studies in specific clinical areas is "highly complementary and suits us very well," Aris Baras, the RGC's executive director, told GenomeWeb.
In addition, the company has been sequencing a limited number of samples from its ongoing clinical trial programs, with the goal of identifying pharmacogenomic markers.
A year after the initial announcement, the RGC has generated exome sequence data for more than 30,000 samples – most of them from Geisinger – and is cranking out more than 1,200 exomes per week, with plans to go up to 1,400 exomes per week later this month. In 2014 alone, the center sequenced more than 20,000 exomes.
Going forward, the center also plans a number of targeted sequencing projects, where it will sequence a couple of hundred genes in tens of thousands of individuals within a short period of time. "Generally, we view targeted sequencing as an excellent tool to really get at a priori hypotheses around certain gene sets involved in various disease processes," Baras said. "[It] allows us to interrogate some specific questions at scale very quickly."
The RGC is in the process of putting together several projects to test whether a particular gene or pathway is involved in a disease. These will involve sequencing 20,000, and possibly up to 50,000, samples within a period of three months or so. "If we waited for the exome sequence data to pile up from that, it might take us a year, or a couple of years, and importantly, it might move away our workflows from our many other projects," Baras said.
GenomeWeb recently visited the RGC and spoke with its management team (see also the accompanying Q&A). The center currently occupies two floors of renovated space in one of Regeneron's original buildings on its sprawling campus just outside of Tarrytown, about 25 miles north of New York City. Sequencing production started last July, and sample prep automation was completed in September. This fall, the center is scheduled to move into a new building across the street that is currently under construction and will offer additional space.
The RGC's current staff of more than 40 is about equally split between sequencing operations and bioinformatics and is expected to grow to 55 to 60 by the end of this year. John Overton, who came to Regeneron from the Yale Center for Genome Analysis, directs the sequencing and lab operations, while Jeffrey Reid, who joined the company from Baylor's Human Genome Sequencing Center, is in charge of genome informatics.
To store DNA samples prior to sequencing, the center has set up a biobank from Liconic Instruments, the first of its kind in the US, which has a storage capacity of 1.4 million frozen samples and currently houses more than 40,000. DNA samples arrive in racks of 96 tubes, each with a 2D barcode on the bottom that is scanned automatically to log the samples into the center's laboratory information management system. After that, a robotic arm transfers the rack, which has its own barcode, inside the freezer, where samples can be retrieved automatically as needed.
The RGC is currently equipped with 10 Illumina HiSeq 2500 sequencers, which run continuously and provide capacity for more than 60,000 exomes per year, and one Pacific Biosciences PacBio RS II to analyze difficult-to-sequence regions of the human genome.
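As a rough check on how the weekly rates quoted in this article relate to the stated annual capacity, the arithmetic can be sketched as follows. The function name is illustrative, and the sketch assumes the sequencers run all 52 weeks of the year at a constant rate, which the article does not state.

```python
# Back-of-the-envelope throughput check.
# Assumption (not from the article): a constant weekly rate over 52 weeks.
WEEKS_PER_YEAR = 52


def annual_exomes(exomes_per_week: int, weeks: int = WEEKS_PER_YEAR) -> int:
    """Scale a weekly exome rate to a yearly total."""
    return exomes_per_week * weeks


current = annual_exomes(1200)  # current rate of 1,200 exomes per week
planned = annual_exomes(1400)  # planned rate of 1,400 exomes per week

print(f"current rate: {current:,} exomes per year")
print(f"planned rate: {planned:,} exomes per year")
```

Under that assumption, the current weekly rate already works out to slightly more than 60,000 exomes per year.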
Most of what the center has done so far is exome sequencing, though it has also "dabbled in some other things," such as RNA-seq, whole-genome sequencing, and targeted sequencing, Overton said, and is preparing for the upcoming targeted sequencing projects.
With the help of Regeneron's automation core team, in collaboration with Hamilton Robotics, the RGC has fully automated its library preparation and exome capture process on several custom-built Hamilton robots.
The system has the capacity to process several hundred thousand samples per year, which ensures that sample preparation is never a bottleneck and the sequencers can run non-stop, Overton explained. Like many other high-throughput sequencing centers, Regeneron has its own Illumina engineer onsite to keep the sequencers running.
Apart from saving labor, automating the sample prep also results in data that is more uniform than data from hand-prepped samples or samples prepared in smaller batches, which Reid said has made the analysis easier.
Sheared DNA is loaded onto one of the robots, which performs all library construction steps according to in-house developed protocols and using Regeneron's own reagent kits. For the exome capture that follows, the team uses the NimbleGen SeqCap EZ HGSC VCRome kit, originally designed by Baylor scientists for clinical research.
The goal is to cover 85 percent of the bases at 20X, but the lab generally achieves that depth for at least 90 percent of the bases, Overton said.
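A coverage goal of this kind can be expressed as the fraction of bases that reach a minimum read depth. The sketch below is illustrative only: it assumes that "20X" means a per-base depth of at least 20, the function names are hypothetical, and the depth values are toy numbers rather than RGC data.

```python
from typing import Sequence


def breadth_at_depth(depths: Sequence[int], min_depth: int = 20) -> float:
    """Fraction of bases whose read depth is at least min_depth."""
    if not depths:
        raise ValueError("no bases supplied")
    covered = sum(1 for d in depths if d >= min_depth)
    return covered / len(depths)


def meets_goal(depths: Sequence[int], min_depth: int = 20,
               target_fraction: float = 0.85) -> bool:
    """True if enough bases reach min_depth to satisfy the target fraction."""
    return breadth_at_depth(depths, min_depth) >= target_fraction


# Toy per-base depths for ten positions (made up for illustration).
toy_depths = [35, 22, 20, 19, 40, 51, 8, 27, 30, 25]
print(breadth_at_depth(toy_depths), meets_goal(toy_depths))
```

In practice the depths would come from a per-base coverage report over the capture targets instead of a hand-written list.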
For the planned targeted sequencing projects, which will involve panels on the order of a couple of hundred genes specific for a certain disease area, the center will use an undisclosed target enrichment method that it has been developing with a third party.
For data storage and analysis, the RGC has opted for a completely cloud-based system, which makes it unusual among large-scale genome centers. "I like to call it the first genome center in the cloud," Reid said. Because the center was built from scratch, there was no legacy high-performance computing hardware in place. Setting up a new data center seemed cost-prohibitive, Reid said, so cloud-based informatics was the way to go.
On the infrastructure side, the RGC uses Amazon Web Services. DNAnexus is the center's platform partner, offering additional security over and above what AWS supplies and providing informatics tools to build out workflows. DNAnexus' platform also facilitates the automation of analysis steps, and allows the RGC to share data with its collaborators in a secure manner by creating data areas, Reid said.
Raw sequencing data from the Illumina machines is first transferred to a "very modest" storage buffer at the center and is then automatically pushed into the cloud via a direct connection between Regeneron and AWS East.
The upload triggers an automated series of pipeline steps for the primary and secondary data analysis using standard bioinformatic tools, resulting in annotated variant files for each sample.
Files from the data analysis integrate with the RGC's in-house LIMS, a customized Exemplar LIMS from Sapio Sciences that tracks the journey of each sample, including many quality control steps.
To find associations between genotypes and phenotypes, the Regeneron team turns to clinical data from the patients. In the case of the Geisinger collaboration, it has access to fully de-identified electronic health records of participants in the form of a database that is updated on a monthly basis. Regeneron researchers have also been collaborating closely with Geisinger's clinical informatics team on the analysis.
The data are structured and in a searchable format, so the researchers can look, for example, for patients who had aspirin prescribed, or patients who have had a body mass index over 30 within the last five years. "You can ask those kinds of questions, look at how many patients have those parameters, look at statistics around the health and the vitals and the demographics of those people, and then bring those kinds of data searches together with the genome data," Reid said.
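The two example searches mentioned above (patients prescribed aspirin, and patients with a body mass index over 30 within the last five years) can be sketched as simple filters over structured records. Everything in this sketch is hypothetical: the record layout, field names, and patients are invented for illustration and do not reflect the actual Geisinger database schema.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta
from typing import List, Set


@dataclass
class BmiReading:
    taken_on: date
    value: float


@dataclass
class PatientRecord:  # hypothetical, de-identified record layout
    patient_id: str
    medications: Set[str] = field(default_factory=set)
    bmi_readings: List[BmiReading] = field(default_factory=list)


def was_prescribed(record: PatientRecord, drug: str) -> bool:
    return drug in record.medications


def had_bmi_over(record: PatientRecord, threshold: float, since: date) -> bool:
    """True if any BMI reading on or after `since` exceeds the threshold."""
    return any(r.value > threshold and r.taken_on >= since
               for r in record.bmi_readings)


records = [
    PatientRecord("P1", {"aspirin"}, [BmiReading(date(2013, 6, 1), 31.2)]),
    PatientRecord("P2", set(), [BmiReading(date(2005, 1, 10), 33.0)]),
    PatientRecord("P3", {"aspirin"}, [BmiReading(date(2014, 3, 2), 24.5)]),
]

as_of = date(2015, 3, 1)
five_years_ago = as_of - timedelta(days=5 * 365)  # approximate window

aspirin_ids = [r.patient_id for r in records if was_prescribed(r, "aspirin")]
high_bmi_ids = [r.patient_id for r in records
                if had_bmi_over(r, 30.0, five_years_ago)]
both_ids = [pid for pid in aspirin_ids if pid in high_bmi_ids]
print(aspirin_ids, high_bmi_ids, both_ids)
```

The resulting patient identifiers are what would then be joined against the sequencing data to bring the phenotype searches together with the genotypes.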
"It's a very rich database and has phenotypes relevant to health and disease because it's actual electronic health records, not so much a research phenotype database that most other groups are working with," said Alan Shuldiner, vice president of translational genetics at the RGC. The database also contains longitudinal data, as well as information about a patient's medication history that is important for various phenotypes. "Many of them have multiple phenotypes, things like hypertension and diabetes and cardiovascular disease. One can begin to delve into the biological basis for how these diseases relate to one another, which I think is a pretty unique opportunity," he said.
To make associations between genotypes and phenotypes, the scientists for the most part apply "pretty standard statistical analyses," Baras said. One type of analysis is the 'phenotype-first' approach, where they define a phenotype that may have therapeutic potential in a certain disease area, for example extremes of blood lipids for cardiovascular disease. Another example is the study of obese patients to identify individuals who appear to be protected from the typical range of comorbidities. "You can imagine asking hundreds of these types of questions when you have millions of patients," Baras said. "You identify a cohort, and then you do standard statistical analysis to figure out whether they have certain genes enriched that protect them or cause disease."
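The article does not say which statistical tests the team uses. As one common example of a standard enrichment analysis, the sketch below computes a one-sided Fisher's exact test on a 2x2 table of carriers and non-carriers in a cohort versus everyone else. The function name and the toy counts are assumptions made for illustration.

```python
from math import comb


def fisher_enrichment_p(cohort_carriers: int, cohort_noncarriers: int,
                        other_carriers: int, other_noncarriers: int) -> float:
    """One-sided Fisher's exact p-value for carrier enrichment in the cohort.

    Sums the hypergeometric probabilities of seeing at least
    `cohort_carriers` carriers in the cohort, with all table margins fixed.
    """
    total = (cohort_carriers + cohort_noncarriers
             + other_carriers + other_noncarriers)
    carriers = cohort_carriers + other_carriers
    cohort_size = cohort_carriers + cohort_noncarriers
    upper = min(carriers, cohort_size)
    tail = sum(
        comb(carriers, k) * comb(total - carriers, cohort_size - k)
        for k in range(cohort_carriers, upper + 1)
    )
    return tail / comb(total, cohort_size)


# Toy counts: 3 of 4 cohort members carry a variant, versus 1 of 4 others.
p_value = fisher_enrichment_p(3, 1, 1, 3)
print(f"one-sided p = {p_value:.4f}")
```

With many such questions asked of the same dataset, a real analysis would also need to correct for multiple testing, which this sketch leaves out.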
The other approach is to start with a genotype and a hypothesis, for example that gene X is involved in causing, or protecting from, disease Y. "We can now leverage this dataset and identify individuals who have important mutations in that gene, for example loss-of-function mutations, and we can ask ourselves, 'What are the phenotypic consequences?'" Baras said.
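A genotype-first question of this kind can be sketched as two steps: find the individuals who carry a loss-of-function variant in the gene of interest, then compare a phenotype between carriers and everyone else. The gene names, field names, and values below are hypothetical placeholders, not RGC data or its actual variant annotation format.

```python
from statistics import mean
from typing import Dict, Iterable, Set, Tuple


def carriers_of(variant_calls: Iterable[dict], gene: str,
                consequence: str = "loss_of_function") -> Set[str]:
    """Patient IDs with at least one variant of the given type in `gene`."""
    return {c["patient_id"] for c in variant_calls
            if c["gene"] == gene and c["consequence"] == consequence}


def compare_phenotype(values: Dict[str, float],
                      carrier_ids: Set[str]) -> Tuple[float, float]:
    """Mean phenotype value among carriers and among non-carriers."""
    in_group = [v for pid, v in values.items() if pid in carrier_ids]
    out_group = [v for pid, v in values.items() if pid not in carrier_ids]
    if not in_group or not out_group:
        raise ValueError("both groups need at least one patient")
    return mean(in_group), mean(out_group)


# Toy annotated variant calls and a toy quantitative phenotype.
calls = [
    {"patient_id": "P1", "gene": "GENE_X", "consequence": "loss_of_function"},
    {"patient_id": "P2", "gene": "GENE_X", "consequence": "missense"},
    {"patient_id": "P3", "gene": "GENE_Z", "consequence": "loss_of_function"},
    {"patient_id": "P4", "gene": "GENE_X", "consequence": "loss_of_function"},
]
lab_value = {"P1": 70.0, "P2": 120.0, "P3": 130.0, "P4": 80.0}

gene_x_carriers = carriers_of(calls, "GENE_X")
carrier_mean, other_mean = compare_phenotype(lab_value, gene_x_carriers)
print(sorted(gene_x_carriers), carrier_mean, other_mean)
```

A difference in means on its own is only a starting point; a real analysis would test it formally and account for covariates.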
So far, the center has not reported any results from its collaborations, but preliminary findings are currently being confirmed and replicated and "there are some efforts now" to write these up for publication and presentations. "Most of our collaborations are structured to enable publication, the dissemination of results and findings," Baras said. He declined to comment on the ownership of intellectual property resulting from the partnerships, citing the confidentiality of the agreements.
Regeneron will not report results back to patients — its sequencing facility is not a clinical laboratory — but at least some of its collaborators, including Geisinger, are planning to validate and return certain results.