UK bioinformatics startup Eagle Genomics and the John Innes Centre are collaborating on a commercial bioinformatics service for plant breeding based on JIC's TraitTag SNP discovery software.
Eagle, a Cambridge-based firm that launched two years ago with the goal of providing commercial support for open-source bioinformatics tools such as the Ensembl platform, plans to have a working prototype of the TraitTag service by July or August, Abel Ureta-Vidal, managing director of Eagle Genomics, told BioInform.
Ureta-Vidal worked on the Ensembl project at the European Bioinformatics Institute from 2001 to 2007 and founded Eagle with three other former EBI staffers to serve as "something of an intermediary between academia and industry."
While academic centers like EBI and the National Center for Biotechnology Information provide the research community with a wealth of free bioinformatics resources, Ureta-Vidal said that these centers can't offer "maintenance contracts to help industry use the software and the data."
In response, the company is taking a two-pronged approach to bridging the academic/industry bioinformatics gap. Eagle's primary business to date has involved helping clients deploy open-source bioinformatics platforms in-house or through a cloud-based infrastructure.
The other aspect of Eagle's business is "putting in place data-analysis pipelines that have been implemented in academia for some years, but never got to the stage where they were provided as a commercial service."
The collaboration with plant research center JIC is the first example of that model. TraitTag was developed by Martin Trick, associate head of the center's department of computational and systems biology, and Ian Bancroft, a project leader in the department of crop genetics, to identify SNPs associated with positive traits in massive sets of plant genome data generated with Illumina's Genome Analyzer.
Eagle plans to "repackage the software and the pipeline in a reliable, robust way as a service for companies," Ureta-Vidal said, adding that the firm will continue to work with the JIC researchers to improve the offering.
JIC's Bancroft is working as a consultant with Eagle, and the institute is entitled to royalties based on revenues from services that are sold. Additional financial terms of the agreement were not disclosed.
Bancroft said that he will work with clients on the design of their sequencing experiments to ensure that "the right calculations are done." The client will perform the sequencing either in-house or through a third party and will then hand off the data to Eagle for analysis.
In order to keep costs down, Eagle plans to perform all the analysis on Amazon's Elastic Compute Cloud. "We're a software company and we don't have a cluster, so cloud computing is very interesting," Ureta-Vidal said. "We have been able to transform the platform to distribute the work on different nodes, so we can basically call as many machines as we need to do the calculation."
The cloud-based model is much more cost-effective than investing in an in-house cluster, he said, because "we can run the pipeline only when needed, when the clients provide us with the sequence, and calculate the data and get the report and then shut down the machines."
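The idea of spreading one analysis across as many machines as needed can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Eagle's implementation: the helper name partition_jobs, the round-robin scheme, and the choice of unigene batches as the unit of work are all hypothetical, and no cloud API is called.

```python
from typing import List, Sequence


def partition_jobs(items: Sequence[str], n_nodes: int) -> List[List[str]]:
    """Split work items round-robin into one batch per node.

    Hypothetical helper: the article does not say how the work is divided.
    """
    if n_nodes < 1:
        raise ValueError("n_nodes must be at least 1")
    batches: List[List[str]] = [[] for _ in range(n_nodes)]
    for index, item in enumerate(items):
        # Item 0 goes to node 0, item 1 to node 1, and so on, wrapping around.
        batches[index % n_nodes].append(item)
    return batches


# Example: five made-up unigene identifiers spread over two nodes.
unigenes = ["ug1", "ug2", "ug3", "ug4", "ug5"]
batches = partition_jobs(unigenes, 2)
```

Each batch would then be handed to a machine that is started for the job and shut down once its part of the report is complete.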
Bancroft and Ureta-Vidal said that turnaround time would depend on the complexity of the specific project, but that the data analysis would probably take no more than a week.
While the plan is to offer the service for a "very competitive price," Ureta-Vidal said that the collaborators have not yet determined the final cost structure because they are still streamlining the process.
He added that the company has seen "some interest" from ag-bio firms in the offering and is currently in discussions with potential customers who will serve as a "test case" for the service.
No Reference Required
The JIC researchers said that TraitTag's advantage for plant-breeding applications is that it does not require a reference genome to identify SNPs of interest within large sequence data sets. As a result, it is expected to be useful for many crop species that have not yet had their genomes sequenced.
Bancroft noted that TraitTag was developed specifically for crop genomes, which are difficult to analyze due to their polyploid nature. "The big issue there is that it can be difficult to differentiate between allelic variation — so the SNPs that you really want — and variation that is just between paralogous genes, or, in the case of polyploids, homeologous genes."
TraitTag uses the MAQ software from the Wellcome Trust Sanger Institute to do the initial sequence alignment and candidate SNP identification from the Illumina GA data. But rather than align the reads to a reference genome, it uses unigenes assembled from publicly available ESTs.
For example, in a paper describing the method published last year in the Plant Biotechnology Journal, Bancroft, Trick, and colleagues explain that they were able to use a set of more than 94,000 unigenes for oilseed rape (Brassica napus) that had been assembled from around 810,000 public ESTs from several different Brassica species, which represented a total of 64 megabases of sequence.
The researchers used the Illumina platform to sequence the transcriptomes of two Brassica cultivars for the study, Tapidor and Ningyou 7, and found that they could get an average depth of coverage of 18.9-fold and 20.9-fold, respectively.
The results showed that "the majority of [Illumina GA]-derived ESTs can be aligned with sequences already in the public databases, but a substantial minority (25.9 percent in Ningyou 7 and 26.6 percent in Tapidor) cannot."
The authors speculated that these reads may fail to align because the corresponding transcripts are not in the public databases.
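For readers unfamiliar with the term, average depth of coverage is commonly estimated as the total number of aligned bases divided by the length of the reference. The Python sketch below shows that arithmetic only; the function name mean_depth and the input numbers are invented for illustration and are not taken from the paper.

```python
def mean_depth(read_count: int, read_length: int, reference_length: int) -> float:
    """Average depth = total aligned bases / reference length."""
    if reference_length <= 0:
        raise ValueError("reference_length must be positive")
    return read_count * read_length / reference_length


# Made-up inputs: one million aligned 35-base reads over a 2-megabase reference.
depth = mean_depth(read_count=1_000_000, read_length=35, reference_length=2_000_000)
```

With these invented inputs the estimate is 17.5-fold.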
Beyond the use of unigenes for a reference, Bancroft said that the "distinctive" part of TraitTag is the "downstream scripting to reanalyze and reinterpret the output of MAQ to allow you to differentiate allelic variation from these inter-homeolog polymorphisms."
The Plant Biotechnology Journal paper describes a two-step scripting process that the authors developed to first identify in the unigene sequences "positions of robust candidate sequence polymorphisms relative to the assigned [Illumina] reads from each cultivar." In the second step, they compare the two lists of SNPs to determine SNPs within and between the two cultivars.
"Basically you're comparing your lists, and if something comes up as showing more than one base at a similar position, if that shows up in all of your lines, then that is an inter-homeolog polymorphism," Bancroft said.
He added that approximately 90 percent of these SNPs are so-called hemiSNPs, "where you've got a polymorphism in one pair of homeologs but not the other. So generally what you see is either a single resolved base — a C all the time — or you see a mixture of Cs and Ts where the C would be the homeolog and T is the allelic variation."
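The list comparison Bancroft describes can be illustrated with a short Python sketch. It is a simplified reading of his description, not the TraitTag scripts themselves: the data layout (a set of observed bases per unigene position), the function names, the category labels, and the example calls are all assumptions made for the example.

```python
from typing import Dict, FrozenSet, Tuple

Site = Tuple[str, int]               # (unigene id, position); illustrative layout
Calls = Dict[Site, FrozenSet[str]]   # bases observed at each site in one cultivar


def classify_site(bases_a: FrozenSet[str], bases_b: FrozenSet[str]) -> str:
    """Label one position by comparing the bases seen in two cultivars."""
    if bases_a == bases_b:
        # The same mixture in every line points to variation between homeologs.
        return "inter-homeolog" if len(bases_a) > 1 else "monomorphic"
    if len(bases_a) == 1 and len(bases_b) == 1:
        # Each line shows a single, different base.
        return "simple SNP"
    small, large = sorted((bases_a, bases_b), key=len)
    if len(small) == 1 and small < large:
        # One line shows a single base, the other a mixture that includes it.
        return "hemiSNP"
    return "ambiguous"


def classify_sites(calls_a: Calls, calls_b: Calls) -> Dict[Site, str]:
    """Classify every site that has calls in both cultivars."""
    shared = calls_a.keys() & calls_b.keys()
    return {site: classify_site(calls_a[site], calls_b[site]) for site in sorted(shared)}


# Made-up calls for two lines, labeled after the cultivars in the study.
tapidor = {("ug1", 10): frozenset("C"), ("ug1", 25): frozenset("CT"),
           ("ug2", 7): frozenset("A"), ("ug2", 40): frozenset("G")}
ningyou = {("ug1", 10): frozenset("CT"), ("ug1", 25): frozenset("CT"),
           ("ug2", 7): frozenset("G"), ("ug2", 40): frozenset("G")}
labels = classify_sites(tapidor, ningyou)
```

In this toy input, the position that shows C in one line and a mixture of C and T in the other is labeled a hemiSNP, while the position showing the same C/T mixture in both lines is labeled an inter-homeolog polymorphism.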
Bancroft said that because TraitTag was designed to address the difficult challenge of polyploidy in SNP analysis, "it is applicable just about anywhere," including mammalian genomes.
The platform was developed to analyze Illumina data because that's what JIC is using internally for its own sequencing. Bancroft said that if the center adopts other sequencing platforms for its internal use, it will modify TraitTag accordingly.
And while the platform is primarily a SNP-discovery pipeline at this time, "there are a number of other things that are just months behind that in terms of other applications that we're developing," Bancroft said.
At first, the researchers plan to enable analysis of PCR products and transcriptome quantification as part of the service, and "we anticipate over time there will be more applications coming out from my research program that will be made publicly accessible as a service through the link with Eagle," he said.