NEW YORK (GenomeWeb) – By tapping pooled, paired-end sequencing data for hundreds of individuals sequenced through the 1000 Genomes Project, a Case Western Reserve University-led team has identified hundreds of relatively common insertion or deletion variants not included in the human reference genome assembly.
As they reported in BMC Genomics earlier this month, the researchers tested the notion that they might find missing reference genome sequences using samples assessed for the first stage of the 1000 Genomes Project. Using pooled reads generated with an approach known as one end anchor (OEA) sequencing, they tracked down hundreds of so-called "missing common sequences," or micSeqs, that are present in 1 percent of the population or more.
"We pooled the samples, which was the first novel [aspect] of our method," the study's first author Yu Liu, a research associate in proteomics and bioinformatics at Case Western, told In Sequence. "Secondly, we did a lot of work to filter out false positives."
The search was based on the premise that such sequences might represent a form of indel variants that have been unappreciated in the past, given their absence from the human reference genome, Liu explained.
Rather than considering copy number variations or SNPs, which have been relatively well characterized in the past, he and his colleagues focused first on finding such micSeqs so that their potential functional effects may ultimately be unraveled.
"Basically what we tried to identify was sequence variants, which can cause phenotypic differences and can also affect risk of disease," Liu said.
With follow-up experiments that included both targeted PCR-based testing and comparative genomic analyses, the researchers verified more than three-quarters of the 309 micSeqs found in the study. And the search for new micSeqs is expected to continue, both in larger groups of individuals from the same populations considered already and in other human populations.
As ever more individuals are sequenced from different parts of the world, Liu and his co-authors noted, it has become increasingly clear that the human reference genome assembly remains incomplete, potentially missing as many as a few million bases of sequence at a stretch in a given individual.
And because high-throughput human genome sequences are routinely mapped against the reference genome, sequences omitted from the reference are often overlooked.
"As the reference genome is used in probe design for microarray technology and mapping short reads in next-generation sequencing," they explained, "this missing sequence could be a source of bias in functional genomic studies and variant analysis."
In an effort to fill in these absent pieces of the genome assembly, the team turned to OEA, an approach that hinges on paired-end read pairs in which one read maps to the reference (the "anchored" read) and its mate does not (the "orphan" read).
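As a rough illustration of how such pairs can be picked out of aligned data, the sketch below classifies a read by its SAM flag. The flag bits (0x1 paired, 0x4 read unmapped, 0x8 mate unmapped) come from the SAM format; the function name `oea_role` is hypothetical, and nothing here is taken from the study's own pipeline.

```python
from typing import Optional

# Standard SAM flag bits
PAIRED = 0x1         # read is part of a pair
UNMAPPED = 0x4       # this read did not map
MATE_UNMAPPED = 0x8  # this read's mate did not map


def oea_role(flag: int) -> Optional[str]:
    """Return "anchor", "orphan", or None for a read's SAM flag.

    Hypothetical helper: a pair counts as one-end-anchored when exactly
    one of its two reads maps to the reference.
    """
    if not flag & PAIRED:
        return None
    read_unmapped = bool(flag & UNMAPPED)
    mate_unmapped = bool(flag & MATE_UNMAPPED)
    if not read_unmapped and mate_unmapped:
        return "anchor"  # mapped read whose mate is unmapped
    if read_unmapped and not mate_unmapped:
        return "orphan"  # unmapped read whose mate is mapped
    return None          # both mapped or both unmapped
```

A flag of 73 (paired, first in pair, mate unmapped) would be labeled an anchor, and a flag of 133 (paired, second in pair, read unmapped) an orphan; pairs in which both reads map, or neither does, are ignored.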
The OEA read method has been used to find new genome sequences in the past, Liu explained, including those stemming from viral insertions.
When they set out to find relatively common missed sequences, though, he and his team decided to extend the method by applying it to the complete, pooled collection of reads generated for hundreds of individuals sequenced to shallow coverage.
While evidence for micSeqs from any one individual was not especially strong, Liu said, the OEA reads from the pooled sequence data provided a much more complete look at the sorts of non-reference sequences that are present in a group of individuals.
To that end, the team brought together orphan reads in pooled 1000 Genomes Project data for 363 individuals of European, Asian, or African ancestry who had been assessed by low-coverage whole-genome sequencing for the first phase of that project.
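A back-of-envelope sketch of why pooling helps: if each person is sequenced only lightly, a sequence carried by a small fraction of people still accumulates reads across the pool. The per-sample depth of 4x used in the note below is an assumed, illustrative value, not a figure from the study, and the function name `expected_pooled_depth` is hypothetical. The calculation also ignores zygosity and mapping effects.

```python
def expected_pooled_depth(n_individuals: int,
                          carrier_freq: float,
                          per_sample_depth: float) -> float:
    """Rough expected read depth over a non-reference sequence in a pool.

    Assumes every carrier contributes the full per-sample depth, which
    overstates the depth for heterozygous carriers.
    """
    n_carriers = n_individuals * carrier_freq
    return n_carriers * per_sample_depth
```

Under those assumptions, a sequence present in 1 percent of 363 individuals would reach a pooled depth of roughly 14.5x at an assumed 4x per sample, compared with only about 4x in any single carrier.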
When they did de novo assembly on the resulting set of OEA reads, the researchers homed in on 309 sequences spanning 100 base pairs or more that seemed to be missing from the human reference genome but were present in 1 percent or more of the 1000 Genomes Project phase I participants. Of those, 70 percent were present in 5 percent or more of the individuals.
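The two thresholds described above, a minimum length of 100 base pairs and a minimum frequency of 1 percent, can be sketched as a simple filter. The input layout (a dict mapping a contig name to its length and the set of individuals carrying it) and the function name `filter_candidates` are assumptions made for illustration; the article does not say how the team estimated per-individual presence.

```python
from typing import Dict, List, Set, Tuple


def filter_candidates(contigs: Dict[str, Tuple[int, Set[str]]],
                      n_individuals: int,
                      min_length: int = 100,
                      min_freq: float = 0.01) -> List[Tuple[str, int, float]]:
    """Keep contigs that are long enough and common enough.

    contigs maps a contig name to (length in bp, set of carrier IDs).
    Returns (name, length, frequency) for each contig that passes.
    """
    kept = []
    for name, (length, carriers) in contigs.items():
        freq = len(carriers) / n_individuals
        if length >= min_length and freq >= min_freq:
            kept.append((name, length, freq))
    return kept
```

With 363 individuals, the 1 percent cutoff works out to at least four carriers, since three carriers would give a frequency of about 0.8 percent.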
More than three-quarters of the potential micSeqs — which ranged in size from 100 bases to almost 1,500 bases — were confirmed through comparisons with primate genome sequences or sequences from a handful of additional genomes from European, Chinese, or African individuals.
For example, the Venter genome contained 80 of the candidate micSeqs, whereas other genomes on hand contained between 22 and 32 micSeqs.
Almost half of the micSeqs — around 45 percent — had homologues in the chimpanzee, gorilla, orangutan, gibbon, rhesus, and/or marmoset genomes, pointing to conservation for many of the sequences missed in the human reference.
The team also did direct PCR-based testing on samples from 38 more individuals, searching for 15 of the apparent micSeqs from their earlier analysis.
Consistent with the notion that these variants were authentic and relatively common, the researchers found that each of these 15 candidate micSeqs could be detected in one or more of the 38 individuals tested.
Liu noted that the group is interested in applying the pooled OEA read analysis approach to larger sequence sets — including collections of paired-end genome sequencing data on individuals from various human populations — as a means of understanding the evolution of micSeqs and their potential contributions to human traits and disease risk.
The researchers did not see any obvious overlap when they searched for disease-associated regions falling in and around micSeq sites, though there are hints that at least some of the newly identified micSeqs may have functional effects.
Through comparisons with RNA sequencing and chromatin immunoprecipitation sequencing profiles described for human cells in the past, for example, Liu and company identified a trio of micSeqs with suspected transcription factor binding sites.
Another 11 micSeqs appeared to show higher-than-usual expression in brain tissue, prompting interest in more extensive expression profiling on micSeqs in the brain and other tissues in the future.
Along with their interest in possible micSeq functions, the researchers hope to get a better sense of how these sequences evolved in the human lineage, the extent to which they are shared with related primates, and the micSeq variation that exists within and between various human populations.
Liu noted that the team is interested in scaling up its method to identify micSeqs present at lower than 1 percent frequency. That may involve looking at read data for all 2,500 or so individuals assessed for the main stage of the 1000 Genomes Project, for instance, or tapping sequence data generated by other consortiums such as the UK's 10,000 genomes project as it becomes available.
When they consulted the latest version of the human reference genome, released late last year, the study's authors were pleased to discover that a few dozen of the micSeqs they found have now been added to the GRCh38 genome release.
"These results … suggest that the remaining micSeqs are good candidates for inclusion in future releases of the reference human genome," they wrote.