COLD SPRING HARBOR, NY – Half a year after scientists at Washington University published the first human cancer genome, several research groups have embarked on projects that use second-generation sequencing technologies to comprehensively characterize single nucleotide variants, copy number changes, chromosomal rearrangements, and gene expression in various tumor types.
Last week at the Biology of Genomes conference at Cold Spring Harbor Laboratory, in a session devoted to cancer genomics, groups from Washington University School of Medicine, the Wellcome Trust Sanger Institute, the Broad Institute, Baylor College of Medicine, the Ontario Institute for Cancer Research, and the Johns Hopkins Kimmel Cancer Center presented initial results from next-gen cancer sequencing projects and mentioned some of the challenges they have encountered in interpreting the data.
Wash U: Additional AML Samples
Last November, scientists at Washington University School of Medicine published the cancer genome of an acute myelogenous leukemia patient — the first human cancer genome sequenced at high coverage (see In Sequence 11/11/2008). At the time, they used unpaired Illumina Genome Analyzer reads to sequence the cancer sample and compared it with a normal control from the same patient. They discovered a small number of somatic point mutations and indels, some of which occurred in genes of known cancer pathways.
The researchers have since completed sequencing a second AML sample and matched control, using paired-end 35-base and 50-base Illumina sequence reads, Elaine Mardis, co-director of the Genome Center at Washington University, reported at the conference. In order to prioritize the numerous single-base variants, they placed them into five tiers. Fewer than 10 mutations fell into the top tier, which comprises non-synonymous variants in coding regions and splice sites.
At least one of the missense mutations was also present in other AML samples, and one of the next steps will be to determine whether recurrent AML mutations correlate with the outcome of the patients and might be a useful prognostic marker, Mardis said.
The Wash U researchers have also started to analyze structural variants, using a new program called BreakDancer, but Mardis pointed out that AML cancers tend to have relatively few of those, compared to other cancer types.
Sequencing a third AML sample, she said, only took two weeks due to improvements in the throughput of the Illumina technology, and has so far yielded 11 “tier 1” mutations. Using an unbiased approach in sequencing cancers has paid off, she noted, since only one of these mutations would have been found if the researchers had only sequenced candidate cancer genes.
Over the next year or so, Wash U researchers plan to sequence approximately 150 cancers and their normal controls, including 30 to 50 AML samples, 30 to 50 breast tumors, 30 to 50 lung tumors, and, as part of the Cancer Genome Atlas project, five glioma and five ovarian cancer samples. They might also tackle other cancers, Mardis said.
Sanger: Somatic Rearrangements in Breast Cancer
Like the Wash U team, Mike Stratton and his colleagues at the Wellcome Trust Sanger Institute have used paired-end sequencing on the Illumina platform to analyze human cancer genomes.
But rather than characterizing point mutations, a project that Stratton presented last week focused on somatic genomic rearrangements. Of the more than 380 mutated cancer genes known to date, he said, most are activated by chromosomal rearrangements, for example, by translocations that create fusion genes.
Since somatic rearrangements had not been studied systematically in common adult epithelial cancers, Stratton and his colleagues decided to focus on breast cancer as an example.
For their study, they chose 24 breast cancers — 15 primary tumors and nine cell lines — which they sequenced at 1-fold haploid coverage, or 6-fold physical coverage, using Illumina paired 37-base reads and 400-base pair inserts. The project was designed to pick up rearrangements in these cancers genome-wide and with sequence-level resolution, Stratton said.
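The relationship between the two coverage figures can be checked with a little arithmetic. The sketch below assumes that sequence coverage counts only the bases read from both ends of each fragment, that physical coverage counts the full span of each fragment, and that the 400-base-pair insert size refers to the whole fragment length. The function name is illustrative and does not come from the Sanger team.

```python
def physical_coverage(sequence_coverage, read_length, insert_size):
    """Estimate physical (fragment-span) coverage from paired-end sequence coverage.

    Each fragment contributes 2 * read_length sequenced bases but spans
    insert_size bases of the genome, so the two measures differ by the
    ratio insert_size / (2 * read_length).
    """
    bases_read_per_fragment = 2 * read_length
    return sequence_coverage * insert_size / bases_read_per_fragment


# Figures quoted in the article: 1-fold sequence coverage,
# paired 37-base reads, 400-base-pair inserts.
estimate = physical_coverage(sequence_coverage=1.0, read_length=37, insert_size=400)
print(f"{estimate:.1f}-fold physical coverage")
```

Under these assumptions the estimate comes out at roughly 5.4-fold, in the same range as the 6-fold physical coverage Stratton cited.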
In total, the scientists discovered more than 2,000 rearrangements that were present in the tumors but not in germline controls, including both interchromosomal and intrachromosomal events.
Some samples had many more rearrangements than the researchers had anticipated from cytogenetic studies, Stratton noted, which had been "blind" to intrachromosomal rearrangements.
Within individual samples — both primary tumors and cell lines — they found several patterns: Some cancers showed a large number of both inter- and intrachromosomal rearrangements that were fairly evenly spread across the genome; some had clusters of such rearrangements that were often correlated with regions of copy number changes; and others had only very few rearrangements.
Some breast cancers appeared to be particularly prone to developing tandem duplications, indicating that they might have a DNA repair defect that generates a mutator phenotype. Further analysis of the sequence breakpoints revealed more about the DNA repair processes likely involved, according to Stratton.
Of the more than 2,000 rearrangements the researchers discovered, 25 created in-frame fusion genes, and 70 led to in-frame internally rearranged genes, though not all of these appear to be expressed as fusions. It remains to be seen whether these fusions recur in other cancer genomes, he said.
Broad: A Range of Cancer Types
Researchers at the Broad Institute have analyzed multiple types of cancer — including glioblastoma, ovarian cancer, leukemia, and melanoma — by second-generation sequencing, including whole-genome sequencing, exome sequencing, and transcriptome sequencing.
According to Gad Getz, who presented several projects at last week's meeting, most of these studies have been conducted with Illumina's Genome Analyzer, using paired-end reads with 400-base-pair inserts. The institute also has an "active program" involving the Applied Biosystems SOLiD system, whose longer-insert mate pairs and higher read density could be advantageous for certain analyses, such as those of copy number variants, he said.
One challenge for accurately determining somatic mutations in a tumor sample is that it usually consists of a mix of tumor and normal cells, Getz pointed out. In addition, tumor DNA is often amplified. Both a tumor's purity and its ploidy can be gauged from SNP array data prior to the experiment in order to determine the depth of sequencing required to call mutations accurately, he explained.
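To illustrate why purity and ploidy affect the required depth, here is a minimal sketch and not the Broad team's method. It assumes a simple mixture model in which normal cells carry two copies of the locus, a simple binomial sampling model for reads, and an arbitrary calling rule of at least three variant-supporting reads. All function names and thresholds are hypothetical.

```python
from math import comb


def expected_variant_fraction(purity, tumor_copies, mutated_copies=1, normal_copies=2):
    """Expected fraction of reads carrying a somatic variant.

    purity: fraction of cells in the sample that are tumor cells.
    tumor_copies: copy number of the locus in tumor cells.
    mutated_copies: how many of those copies carry the mutation.
    """
    tumor_alleles = purity * tumor_copies
    normal_alleles = (1.0 - purity) * normal_copies
    return purity * mutated_copies / (tumor_alleles + normal_alleles)


def prob_at_least(min_reads, depth, fraction):
    """Binomial probability of seeing at least min_reads variant reads at a given depth."""
    p_fewer = sum(
        comb(depth, i) * fraction**i * (1.0 - fraction) ** (depth - i)
        for i in range(min_reads)
    )
    return 1.0 - p_fewer


def min_depth(fraction, min_reads=3, power=0.95, max_depth=5000):
    """Smallest depth at which the variant is seen min_reads times with the given probability."""
    for depth in range(min_reads, max_depth + 1):
        if prob_at_least(min_reads, depth, fraction) >= power:
            return depth
    return None  # not reachable within max_depth


# Hypothetical scenarios: pure diploid tumor, 50% pure diploid, 50% pure tetraploid.
for purity, copies in [(1.0, 2), (0.5, 2), (0.5, 4)]:
    fraction = expected_variant_fraction(purity, copies)
    print(purity, copies, round(fraction, 3), min_depth(fraction))
```

In this model, lower purity or a higher tumor copy number shrinks the fraction of reads carrying the mutation, so more depth is needed to reach the same detection probability.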
In one project, the researchers sequenced a glioblastoma and normal control at high coverage using the Illumina platform and found more than 3,000 somatic mutations, among them fewer than 30 in coding regions and fewer than 30 missense mutations. These included the five mutations that had been previously discovered by capillary sequencing.
The researchers also analyzed the data for copy number variations and came up with a profile similar to that identified by Affymetrix arrays, although the accuracy of the sequencing data was higher, Getz said.
In addition, they found almost 100 intra- and interchromosomal rearrangements that were not present in the normal control.
Besides the glioblastoma and its control, the Broad scientists have thus far also analyzed a chronic lymphocytic leukemia sample and two ovarian cancers and their controls by whole-genome sequencing.
In targeted exon sequencing projects, the Broad team has analyzed either a set of several thousand genes or the entire exome of several cancers, using oligonucleotide baits for exon enrichment.
Getz mentioned that due to the uneven representation of the targets, the researchers currently require about 10-fold additional coverage for exon sequencing, compared to whole-genome sequencing. Despite this, sequencing the entire exome is currently between 25-fold and 40-fold cheaper than whole-genome sequencing, he said. So far, his team has sequenced the entire exome of a glioblastoma as well as that of an ovarian cancer and their controls.
Lastly, Getz presented some results from several transcriptome sequencing, or RNA-seq, projects involving a chronic myelogenous leukemia sample and 10 melanoma samples, in which he and his colleagues identified several fusion genes.
Getting the next-generation sequencing methods to work on formalin-fixed paraffin-embedded samples will have a "huge impact" on future studies, he said.
Baylor: Glioblastoma and Pancreatic Cancer
Like their colleagues at the Broad Institute, scientists at Baylor College of Medicine's Human Genome Sequencing Center have started to analyze cancer samples both by whole-genome shotgun sequencing and by exome sequencing, focusing on glioblastoma and pancreatic cancer.
Recently, they sequenced a glioblastoma and normal control at 30-fold coverage using 50-base fragment reads from the SOLiD platform, Baylor's David Wheeler reported at the meeting last week. One of the runs in this project yielded approximately 30 gigabases of data, he said.
The reason they did not use a mate-pair library was that the DNA was highly fragmented, he explained, adding that a technology is currently in development to sequence short fragments with paired-end reads on the SOLiD sequencer.
A new method the researchers developed, called eGenotyping, which rapidly screens sequence reads using in silico probes, allowed them to call genotypes quickly. They cross-validated variants with 454 sequencing data.
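The general idea of screening reads with in silico probes can be sketched as follows. This is not the eGenotyping implementation, whose details were not described. It assumes a probe is simply the flanking sequence around a site joined to one candidate allele, and that a read supports an allele if it contains the probe or its reverse complement exactly. All names and sequences are made up for illustration.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string made of A, C, G, T."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def count_allele_support(reads, left_flank, right_flank, alleles):
    """Count reads that exactly contain the probe for each candidate allele."""
    counts = {allele: 0 for allele in alleles}
    for allele in alleles:
        probe = left_flank + allele + right_flank
        probe_rc = reverse_complement(probe)
        for read in reads:
            # A read is counted once per allele, whichever strand it matches.
            if probe in read or probe_rc in read:
                counts[allele] += 1
    return counts


# Toy reads around a hypothetical site with reference allele A and variant allele C.
reads = [
    "TTGACCGTAGGCTA",  # forward strand, reference allele
    "GACCGTCGGCTAAC",  # forward strand, variant allele
    "TAGCCGACGGTC",    # reverse strand, variant allele
    "GGGGGGGGGGGG",    # unrelated read
]
support = count_allele_support(reads, left_flank="ACCGT", right_flank="GGCT", alleles=["A", "C"])
print(support)
```

Because each probe is a short exact string, a screen of this kind avoids aligning every read to the genome, which is one plausible reason such an approach would be fast. A real tool would also need to handle sequencing errors and repeated sequence.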
In the glioblastoma sample, they discovered almost 6,000 somatic mutations, among them more than 100 missense mutations. Seven of these were contained in known cancer genes that are involved in a number of cellular processes.
A copy number analysis showed, among other results, that the EGF receptor gene is amplified, a known feature of this particular tumor.
The researchers also sequenced the exome of pancreatic cancer samples using NimbleGen capture arrays and 454's sequencing technology and found that the target bases were fairly uniformly represented. Although the coverage, and hence sensitivity, is still low, early results revealed more than 3,000 missense or nonsense mutations, three of which are contained in the Catalogue of Somatic Mutations in Cancer database.
Wheeler and his colleagues are also working on sequencing captured exomes using the SOLiD platform, but this has been "moving more slowly than we wanted," he said.
Ontario Institute for Cancer Research: Colorectal Cancer
Researchers at the Ontario Institute for Cancer Research have also conducted a targeted cancer resequencing study, focusing on colorectal cancer, John McPherson reported at last week's meeting.
A previous genome-wide association study by a Canadian research consortium had identified 10 loci that are associated with an increased risk for colorectal cancer.
In order to identify the underlying causal alleles, the OICR researchers decided to sequence these regions, along with other genes implicated in hereditary colorectal cancer, in 40 sporadic cases and 40 controls as well as a number of probands with a family history of the cancer.
To enrich the targets — a total of about 3 megabases of sequence — they used both NimbleGen microarrays and Agilent's SureSelect solution-based method. Sequencing was performed with paired 76-base Illumina reads to a depth of about 50 reads per base in regions that could be analyzed, which totaled about three-quarters of the genomic targets, and about 90 percent of the exons.
In a total of 98 individuals, the researchers identified approximately 10,000 SNPs, of which almost half were novel. So far, they have found almost 150 SNPs in exons, with a number of them affecting stop codons.
It is too early to conclude which of these are causal alleles, according to McPherson. More sequencing will be required, and the alleles discovered will need to be genotyped in additional samples, he said.
Johns Hopkins: CNVs and Expression
Over the last few years, researchers at the Johns Hopkins Kimmel Cancer Center have been analyzing virtually all protein-coding genes in several breast, colorectal, pancreatic, and brain cancers (see In Sequence 9/12/2006 and 9/9/2008).
More recently, they have started to look at copy-number variations and gene expression using second-generation sequencing.
Victor Velculescu reported at last week's conference that the Hopkins team is now using paired-end 25-base SOLiD reads to analyze copy number variants genome-wide, generating more than 100 million mapped reads per sample at the moment. The analysis, which has a resolution of less than 1 kilobase, is "highly quantitative," he said, and provides a more accurate representation of copy number changes than microarrays or digital karyotyping.
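A generic read-depth approach to copy number, sketched below, counts mapped reads in fixed-size windows and compares tumor with normal. This is an assumption about how such an analysis can work and not the Hopkins team's pipeline. The bin size, pseudocount, and function names are all hypothetical.

```python
from math import log2


def bin_counts(positions, bin_size, chrom_length):
    """Count mapped read start positions (0-based) in consecutive fixed-size bins."""
    n_bins = (chrom_length + bin_size - 1) // bin_size
    counts = [0] * n_bins
    for pos in positions:
        counts[pos // bin_size] += 1
    return counts


def copy_number_log2_ratios(tumor_counts, normal_counts, pseudocount=0.5):
    """Per-bin log2 ratio of tumor to normal read fractions.

    Counts are divided by each sample's total so that differences in
    sequencing depth between the two libraries cancel out. The
    pseudocount avoids taking the logarithm of zero in empty bins.
    """
    tumor_total = sum(tumor_counts)
    normal_total = sum(normal_counts)
    ratios = []
    for t, n in zip(tumor_counts, normal_counts):
        tumor_fraction = (t + pseudocount) / tumor_total
        normal_fraction = (n + pseudocount) / normal_total
        ratios.append(log2(tumor_fraction / normal_fraction))
    return ratios


# Toy example: four bins, with the third bin amplified in the tumor.
normal = [100, 100, 100, 100]
tumor = [100, 100, 400, 100]
ratios = copy_number_log2_ratios(tumor, normal)
print([round(r, 2) for r in ratios])
```

With more reads per sample, each bin can be made smaller while keeping its count stable, which is consistent with the link between read density and resolution mentioned at the meeting.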
He and his team have also adapted the serial analysis of gene expression method for second-generation sequencing platforms, he said.