NEW YORK (GenomeWeb) – Researchers in Australia have evaluated the BGISEQ-500 platform for cancer whole-genome sequencing, comparing it to the Illumina HiSeq X. Although the two sequencing platforms delivered largely overlapping results, there were some differences, in particular regarding somatic variants, which can be partially explained by the platforms' different read lengths.
The study, which appeared last month in PLOS One, is the first published report that used the BGISEQ-500 for cancer genome sequencing. It follows a paper by a Chinese team, published last year in GigaScience, that used the platform to sequence a human cell line and compared the results to data from the Illumina HiSeq 2500. That study found that the BGISEQ-500 was less adept at calling indels, probably because of its shorter reads.
BGI launched the BGISEQ-500 in 2015 and registered it with the China Food and Drug Administration the following year, when it also launched a smaller version of the platform, the BGISEQ-50. The system is based on sequencing technology that was originally developed by Complete Genomics, which BGI acquired in 2013, but BGI further developed the approach (combinatorial probe-anchor synthesis and DNA nanoball technology) to achieve longer reads and other improvements. Initially, the BGISEQ-500 was only available in China. In the meantime, BGI subsidiary MGI Tech has announced two new sequencing platforms, the MGISEQ-2000 and the MGISEQ-200, that represent upgrades to the BGISEQ-500 and -50, offering longer reads and shorter run times.
For the new BGISEQ-500 evaluation study, a collaboration between BGI and the QIMR Berghofer Medical Research Institute in Brisbane, the researchers analyzed tumor and normal samples from three patients with malignant pleural mesothelioma. The Australian group extracted the DNA from the samples and sent it to BGI for whole-genome sequencing on the BGISEQ-500 and to the Kinghorn Centre for Clinical Genomics at the Garvan Institute of Medical Research in Sydney for sequencing on the HiSeq X Ten.
"We were curious to see how the data looked as the BGISEQ-500 is expected to be competitive in terms of price for whole-genome sequencing," said Nic Waddell, a researcher in the Department of Genetics and Computational Biology at QIMR Berghofer and the senior author of the study. "It would also be nice to have a good orthogonal sequencing platform to complement the Illumina platform."
Sequencing on the BGISEQ-500 was performed using 50-base paired-end reads and on the HiSeq X using 150-base paired-end reads, with similar read depths, and data from both platforms were analyzed using the same pipeline. "In general, bioinformatics pipelines can make a difference to the variants which are identified," Waddell explained, because they may use different tools or filtering strategies.
To start, the researchers compared variants identified by both sequencing platforms in the germline samples with variants from an Illumina SNP array and found that the sequencing results were more than 99 percent concordant with the array results.
Overall, the BGISEQ-500 and the HiSeq X Ten called about 3.5 million germline single nucleotide variants per patient, and 86 percent of those were identified by both platforms. However, about 1 million SNVs were unique to the HiSeq X Ten, and about 370,000 were unique to the BGISEQ-500. The two platforms also called about 230,000 germline indels per patient, of which 81.5 percent were common between them. A total of about 110,000 indels were unique to the HiSeq X Ten and about 20,000 indels were unique to the BGISEQ-500.
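To make the overlap figures above concrete, here is a minimal sketch of how concordance between two platforms' call sets could be tallied. It is not the study's pipeline: the `concordance` function, the variant tuples, and the choice to report shared calls as a percentage of all distinct calls are assumptions made for illustration, and the toy variants are invented.

```python
from typing import NamedTuple, Set, Tuple

# A variant is identified here by (chromosome, position, ref allele, alt allele).
# This representation is an assumption for the example, not the study's format.
Variant = Tuple[str, int, str, str]


class Concordance(NamedTuple):
    shared: int            # calls made by both platforms
    unique_a: int          # calls made only by platform A
    unique_b: int          # calls made only by platform B
    percent_shared: float  # shared calls as a percentage of all distinct calls


def concordance(calls_a: Set[Variant], calls_b: Set[Variant]) -> Concordance:
    """Tally shared and platform-unique calls between two call sets."""
    shared = len(calls_a & calls_b)
    unique_a = len(calls_a - calls_b)
    unique_b = len(calls_b - calls_a)
    total = shared + unique_a + unique_b
    percent = 100.0 * shared / total if total else 0.0
    return Concordance(shared, unique_a, unique_b, percent)


# Toy, made-up call sets standing in for the two platforms.
hiseq_calls = {
    ("chr1", 100, "A", "G"),
    ("chr1", 250, "C", "T"),
    ("chr2", 40, "G", "A"),
    ("chr3", 7, "T", "C"),
}
bgiseq_calls = {
    ("chr1", 100, "A", "G"),
    ("chr1", 250, "C", "T"),
    ("chr2", 40, "G", "A"),
    ("chr5", 900, "A", "T"),
}

result = concordance(hiseq_calls, bgiseq_calls)
print(result)
```

In this toy case three of five distinct calls are shared. The denominator matters: a percentage of all distinct calls will differ from a percentage of one platform's calls, so figures from different studies are only comparable when the definition is the same.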
When the researchers looked more closely at the variants that only one platform had detected, they found that many were actually present in the other platform's data at low levels, or had been called as somatic by the other platform because it missed the other allele. Following this analysis, only about 200,000 SNVs remained truly unique to the HiSeq X Ten, and only 38,000 or so to the BGISEQ-500. Similarly, about 23,000 indels remained unique to the HiSeq X Ten and 1,300 to the BGISEQ-500.
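The second-pass check described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the `OtherPlatformEvidence` record, the `classify_unique_call` function, the `min_alt_reads` threshold, and the separate "no_coverage" category are all hypothetical, and the study's actual criteria are not given here.

```python
from dataclasses import dataclass


@dataclass
class OtherPlatformEvidence:
    """Read counts at a variant site in the platform that did NOT call it."""
    alt_reads: int    # reads supporting the variant allele
    total_reads: int  # all reads covering the site


def classify_unique_call(evidence: OtherPlatformEvidence,
                         min_alt_reads: int = 1) -> str:
    """Label a platform-unique call by the evidence seen on the other platform.

    The threshold and the category names are assumptions for illustration.
    """
    if evidence.total_reads == 0:
        # The other platform had no reads here, so it could not have seen it.
        return "no_coverage"
    if evidence.alt_reads >= min_alt_reads:
        # Variant reads exist on the other platform but fell below its
        # calling criteria: present "at low levels".
        return "low_level_support"
    # Covered on the other platform with no supporting reads at all.
    return "truly_unique"


# Made-up examples of the three outcomes.
print(classify_unique_call(OtherPlatformEvidence(alt_reads=2, total_reads=30)))
print(classify_unique_call(OtherPlatformEvidence(alt_reads=0, total_reads=30)))
print(classify_unique_call(OtherPlatformEvidence(alt_reads=0, total_reads=0)))
```

Only calls in the last "truly unique" group would count toward the reduced totals; how strict the supporting-read threshold is directly changes how many calls land there.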
Concordance between the two platforms was somewhat worse for somatic SNVs and indels in the tumor samples. In total, the HiSeq X Ten and the BGISEQ-500 called about 11,000 somatic SNVs across the three patient samples, of which 72 percent were identified by both platforms. About 1,500 somatic SNVs were unique to each platform. They also called a total of about 1,300 somatic indels, but only 38 percent of those overlapped. Previous comparison studies also found stronger discordance for indels than for SNVs, the researchers noted, so this result was not completely unexpected.
Of 156 somatic mutations the two platforms detected in coding regions, 70 percent were identified by both, while 13 percent were only called from the BGISEQ-500 data and 17 percent only from the HiSeq X Ten data.
"It is unclear in our study if these small numbers of [unique germline and somatic] variants are real or artifacts," Waddell said. "More validation work or resequencing of the samples would have to be done. Some of the variants unique to a platform may be real, and they may be absent in the other platform" due to differences in read length or sampling bias, she said.
One key factor that could explain the differences, the researchers wrote, is that the BGISEQ-500 reads were just 50 base pairs long, while the HiSeq X Ten reads were 150 base pairs. Alignment errors tend to be more frequent for shorter reads, especially in AT-rich regions, and the aligner used in the study may have performed better for the longer Illumina reads. Also, the variant calling and analysis pipeline may have favored longer reads because of the way it filtered the data.
"The BGISEQ-500 sequence accuracy is shown to be comparable to the HiSeq X Ten, but the main difference likely affecting the results is the longer read length readily obtained on the HiSeq X Ten versus the BGISEQ-500," said John McPherson, a professor of biochemistry and molecular medicine at the University of California Davis School of Medicine, who was not involved in the study.
"As acknowledged in the manuscript, the shorter read length may be less effective for the tumor analysis," he said, which others have found before. "Gains through tuning of software for the shorter reads may be possible but past experience has shown that the longer read lengths capture more somatic events."
Furthermore, even though both platforms analyzed the same DNA, the different library prep protocols they require may have led to biases in sampling of different alleles, which the researchers wrote is something no platform comparison can avoid.
According to McPherson, both platforms "performed well overall" for both tumor and normal samples but their concordance "was perhaps less than desirable."
He cautioned that "while platform comparisons of emerging technologies are very important and a significant resource to the community, it must be remembered that they are often an apples-to-oranges comparison."
In addition, sequencing platforms keep evolving. Waddell said, for example, that her group is currently comparing BGISEQ-500 data generated with 100-base paired-end reads to HiSeq X Ten data, "which is showing promise."
For the time being, her team continues to use the Illumina platform for whole-genome sequencing projects, for example, for the International Cancer Genome Consortium (ICGC) Melanoma Project, a collaboration with the Melanoma Institute Australia, and for a mesothelioma project in collaboration with the National Centre for Asbestos Related Diseases.
More work on larger numbers of samples will be needed to figure out if BGISEQ-500 data could be added to ongoing projects, she said, "although we may consider starting new projects with the BGISEQ-500 platform."