NEW YORK (GenomeWeb) – The InPSYght project is about halfway towards its goal of sequencing 10,000 whole genomes in order to study schizophrenia and bipolar disorder, as well as to increase the amount of genomic data available from individuals of African ancestry. As part of the project, 500 genomes will also include linked-read data generated using 10x Genomics' Chromium platform.
The main goals of including 10x Genomics data are to enable phasing of the genomes as well as to improve structural variant detection, Chris Whelan, a computational biologist at the Broad Institute, said in an interview. Whelan also discussed a 50-whole-genome pilot project last month during a presentation at the Advances in Genome Biology and Technology conference in Hollywood Beach, Florida.
The National Institute of Mental Health funded the InPSYght project in 2014 with a $16 million grant to researchers from the Broad Institute, the University of Southern California, and the University of Michigan. The analyzed samples are part of the Genomic Psychiatric Cohort at USC's Keck School of Medicine, which launched in 2008 and now includes a cohort of more than 40,000 individuals.
The individuals in the GPC also have detailed medical histories available, Carlos Pato, principal investigator of the project, who is now dean of the College of Medicine at SUNY Downstate, said in an interview. For the InPSYght project, one third of the sequenced samples will be from individuals with schizophrenia, one third from patients with bipolar disorder, and one third of samples will be controls, Pato said.
The project has two main goals: to investigate more deeply the genomics of psychiatric disorders and to generate genomic data that can be used for other population research studies. In addition, Pato said, the group plans to focus on sequencing genomes of understudied populations.
Sequencing will be performed on Illumina's HiSeq X Ten instruments, and 500 samples will also have linked-read data from the Chromium.
The Broad has developed an automated pipeline to prepare samples for the Chromium, Whelan said. In the pilot study, the group evaluated 50 genomes that had previously undergone PCR-free sequencing on Illumina instruments.
The Chromium data covered a median of 23 megabases of sequence that is missed in a standard whole-genome sequencing dataset, Whelan said in his presentation. However, some coverage is also lost in areas of very high and very low GC content, so the net gain over standard WGS is around 16 megabases, he said.
The key advantages of including linked-read data, Whelan said, are obtaining phasing information, identifying structural variants, and achieving better discrimination in areas that are often difficult to tackle with short-read sequencing, such as paralogous regions.
In the 50-genome pilot, he said in the presentation, the researchers identified 322 protein-coding genes in regions that were covered by linked reads but not by standard sequencing, including clinically relevant genes. For example, the STRC gene, which is associated with hearing loss, was covered with linked reads but not with Illumina sequencing alone, likely because it has a paralog. "Barcoded linked reads act as anchors to recruit reads into paralogous loci," Whelan said.
The linked reads also allow the researchers to construct haplotype phase blocks. In the pilot, the median N50 phase block was around 3.5 megabases in size, while the longest phase block was 40 megabases, Whelan said.
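For readers unfamiliar with the metric, N50 is the block length at which the blocks of that size or longer together account for at least half of the total phased sequence. The Python sketch below shows the standard calculation; the function name and the block lengths are hypothetical illustrations, not data or code from the pilot.

```python
def n50(lengths):
    """Return the N50 of a collection of block lengths.

    N50 is the length L such that blocks of length >= L together
    cover at least half of the summed length of all blocks.
    """
    if not lengths:
        raise ValueError("need at least one block length")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Hypothetical phase block lengths in base pairs (illustrative only,
# not taken from the InPSYght pilot).
example_blocks = [5_000_000, 3_500_000, 2_000_000, 1_000_000, 500_000]
example_n50 = n50(example_blocks)
```

A "median N50" as reported for the pilot would then be the median of this per-genome value taken across all sequenced genomes.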
Having phase information will be important for the larger project, he said. "If there's a structural variant that we're interested in, we'd like to be able to use the additional long-range phasing that we can get to from linked reads to figure out the SNP haplotype background that those events occur in," he said. That would then enable the researchers to "impute the more complex structural variants into the rest of our data that we didn't generate with the linked reads," he added.
Whelan's team is also looking to develop computational tools for the Chromium to better detect structural variants. Currently, they are working on tools that can run within the GATK framework. Doing that is just about making the existing methods for structural variant detection "barcode aware," he said. The linked-read data uses barcodes to place shorter Illumina reads in the context of long-range information. The GATK structural variant algorithm was designed to work with short-read data, but would have to be tweaked a bit to account for the barcodes that can place the short reads within a larger framework. Whelan explained that the goal is for the algorithms to "look for barcode overlaps and use that information to operate on haplotypes independently."
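As a rough illustration of what "looking for barcode overlaps" could mean, the Python sketch below scores how many barcodes two genomic windows share. It is an assumption-laden toy and not the GATK implementation: the function name, the choice of Jaccard similarity as the overlap measure, and the barcode strings are all hypothetical.

```python
def barcode_overlap(barcodes_a, barcodes_b):
    """Jaccard similarity between the barcode sets of two genomic windows.

    A high score suggests that reads in the two windows came from the
    same long input molecules, which is the kind of long-range signal
    a barcode-aware structural variant caller could use.
    """
    a, b = set(barcodes_a), set(barcodes_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Hypothetical barcodes observed on reads in two windows (illustrative only).
window_a = ["AAAC", "AAGT", "CCTG", "GGTA"]
window_b = ["AAGT", "CCTG", "TTAC"]
score = barcode_overlap(window_a, window_b)
```

In this toy example two of the five distinct barcodes are shared. A production caller would also need to account for barcode sequencing errors, molecule length, and coverage, none of which is modeled here.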
In addition, although only a small fraction of the total project is slated to have linked-read data as well as sequence data, the haplotype reconstruction that the linked-read data will enable will help create resources for broader diversity and ancestry studies, Whelan said.
One focus of the project is to analyze genomes from individuals of African ancestry. Currently, the vast majority of genomic data has been generated from Caucasian individuals, which can lead to incorrect variant classification and a poor understanding of variation in other populations. The African population "is tremendously understudied," Pato said. "We'll add to the overall field by focusing on African ancestry."
He said that the researchers have so far sequenced between 5,000 and 6,000 samples and expect to complete the full 10,000 genomes in the next 18 months. The data will be freely available for other research groups to analyze, Pato said. For the InPSYght project, he said, the team would likely first focus its analysis on genomic loci that have been implicated in genome-wide association studies as being relevant to schizophrenia or bipolar disorder. "We can look at a much finer, sequence-based analysis of those findings in order to define much better what's going on," he said. "We'll be able to detect more indels and other structural anomalies with the sequence data."
The sheer number of samples should also enable future research, he said. For instance, researchers studying cardiovascular disease or diabetes could use the sequence and medical data from the samples to study those diseases.