NEW YORK (GenomeWeb) – Having recently won a US Centers for Disease Control and Prevention challenge, startup One Codex is now working to commercialize a bioinformatics platform to help researchers using next-generation sequencing better understand pathogens that pose threats to public health.
In late January, the San Francisco-based firm won a $200,000 award from the CDC for its platform for strain-typing Shiga toxin-producing Escherichia coli. While the platform remains in beta mode, CDC noted that the company succeeded in demonstrating how its platform "can rapidly identify STEC from complex clinical samples and provide meaningful information about its strain type and characteristics, even when the pathogens are present at levels too low to support assembly-based methods."
Applications for One Codex's technology span clinical diagnostics, food safety, and biosecurity, Co-Founder and CEO Nick Greenfield told GenomeWeb recently, adding that its database currently comprises about 28,000 full bacterial, viral, and fungal genomes, translating to more than 100 gigabases of reference content with additional genomes to be added in the coming weeks.
"By and large, we're building a technology and data platform for doing bioinformatics in a way that's, hopefully, a little more user-friendly and scales better toward these really large data reference sets than existing methods," Greenfield said.
The platform, also called One Codex, is an index of microbial genomic content that the company intends to expand into the largest index of bacteria, viruses, and fungi. According to Greenfield, the platform's search and indexing technology can "very quickly" classify samples against the index. For a range of uses, it offers BLAST-like functionality, but about three orders of magnitude faster, he added.
"So you can upload a file from a sequencing run… and get a result in 10, 15 minutes for an entire sequencing run," he said, adding the platform provides an automated analysis of multi-gigabyte sequencing files.
Currently, the platform comprises an index layer for storing and organizing sample data, essentially to do a first inspection of sample data for genomic classification. Included in this first-order application are strains of common pathogens such as Clostridium difficile, E. coli, and methicillin-resistant Staphylococcus aureus.
Additionally, One Codex is building a second-order application to provide strain typing, for which it won the CDC award. One Codex can currently strain type on a limited basis and is demo-ing that capability in a beta release, Greenfield said.
For strain typing, he said that the platform has "well functioning data at this point," including STEC data from the work it did for its CDC award, and others, though One Codex has not released the data yet.
"The key difference in what we're doing is rather than developing a [method] for a specific pathogen, we're developing an unbiased [method], effectively, for our entire reference libraries," Greenfield said. The One Codex platform is cloud-based, and researchers can upload their raw data through the company's web application or programmatically through an application programming interface or command line. In the beta stage, the results provided by One Codex are free.
Greenfield declined to say when the platform would be fully launched or disclose any pricing details.
Data that have been uploaded so far have come from Illumina's NGS platforms, Life Technologies' Ion Torrent instruments, and Pacific Biosciences, Greenfield said. He wasn't sure if One Codex had received any raw data from an Oxford Nanopore instrument. In all, more than 1,000 accounts have used the One Codex platform, he said.
Results returned to the researcher include the raw output of the analysis, a metagenomic classification summary, the phylogenetic tree, and some tabular results, among other things.
"The goal of the current web platform is that you can throw raw data right off the sequencing machine at it and get useful first-order results for a class of problems which are basically microbial detection problems," Greenfield said. For now, the indexing is based on information from the National Center for Biotechnology Information, though One Codex is "going after a couple of other repositories, as well," he added.
In trying to get the highest quality data, One Codex cleans the reference data so that bad data is not used, Greenfield said. Additionally, the company chooses what reference data can be partially used, "which is really the hard problem of [figuring out if] there are contaminants within particular references that need to be removed."
On the analytical, or classification, side of the process, the company has built a robust pipeline that applies kmer-based matching strategies to simulation and empirical data sets.
In its validation work with simulated sequencing data, One Codex exactly matches its simulated data with its data and computes "all manners of accuracy statistics," Greenfield said. Validation work with biological samples is done by "looking to understand the limits of detection and how specific our results are," he said.
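As a rough illustration of the kind of accuracy statistics that become possible when the true origin of every simulated read is known, consider the minimal Python sketch below. It is not One Codex's pipeline: the function name `accuracy_stats`, the labels, and the toy data are all hypothetical, and sensitivity and precision are simply two common choices of statistic.

```python
from collections import Counter


def accuracy_stats(truth, predicted, target):
    """Sensitivity and precision for one taxon over simulated reads.

    truth and predicted are equal-length lists of taxon labels;
    None in predicted means the read was left unclassified.
    (Hypothetical helper for illustration only.)
    """
    if len(truth) != len(predicted):
        raise ValueError("truth and predicted must have the same length")
    counts = Counter()
    for t, p in zip(truth, predicted):
        if t == target and p == target:
            counts["tp"] += 1  # correctly called as the target
        elif t != target and p == target:
            counts["fp"] += 1  # wrongly called as the target
        elif t == target and p != target:
            counts["fn"] += 1  # target read missed or miscalled
    tp, fp, fn = counts["tp"], counts["fp"], counts["fn"]
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return {"sensitivity": sensitivity, "precision": precision}


# Hypothetical toy data: five simulated reads with known origins.
truth = ["E. coli", "E. coli", "Salmonella", "Salmonella", "E. coli"]
predicted = ["E. coli", None, "Salmonella", "E. coli", "E. coli"]
stats = accuracy_stats(truth, predicted, "E. coli")
print(stats)
```

In this toy case, two of the three true E. coli reads are recovered and one Salmonella read is miscalled as E. coli, so both statistics come out at two-thirds.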
According to Greenfield, the approach is "quite robust to error," so if there are any ambiguous results, the company will work to clarify them. If its analysis leads to a bad reference, "it doesn't actually lead to the wrong answer; it just leads to a less specific answer."
For instance, if One Codex had a reference database of only E. coli and Salmonella, and the Salmonella reference was made up of 10 contigs of which one is E. coli, when the firm finds kmers that match the E. coli contig that was mislabeled as Salmonella, One Codex will classify it as the common parent of E. coli and Salmonella.
"So we wouldn't say it's Salmonella erroneously," Greenfield explained. "We'd say the content here, based on what we know about available reference material, is only specific to the family, rather than specific to the species. In that way the inherent look-up and results presentation handles a lot of error cases quite nicely for you."
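The fallback Greenfield describes resembles a lowest-common-ancestor lookup, which can be sketched in a few lines of Python. This is an illustrative simplification under stated assumptions, not One Codex's implementation: the three-node taxonomy, the toy contigs, the k-mer length of 4, and the function names are all hypothetical, and a read is assigned the common ancestor of every k-mer hit.

```python
# Hypothetical toy taxonomy: child -> parent.
PARENT = {
    "Escherichia coli": "Enterobacteriaceae",
    "Salmonella enterica": "Enterobacteriaceae",
    "Enterobacteriaceae": "root",
}


def lineage(taxon):
    """Return the path from a taxon up to the root."""
    path = [taxon]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def lca(a, b):
    """Lowest common ancestor of two taxa in the toy taxonomy."""
    ancestors = set(lineage(a))
    for node in lineage(b):
        if node in ancestors:
            return node
    return "root"


def kmers(seq, k):
    """Yield every overlapping substring of length k."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


def build_index(references, k):
    """Map each k-mer to the LCA of all taxa whose references contain it."""
    index = {}
    for taxon, contigs in references.items():
        for contig in contigs:
            for kmer in kmers(contig, k):
                index[kmer] = lca(index[kmer], taxon) if kmer in index else taxon
    return index


def classify(read, index, k):
    """Assign a read the LCA of all its k-mer hits, or None if no hits."""
    hits = [index[km] for km in kmers(read, k) if km in index]
    if not hits:
        return None
    result = hits[0]
    for taxon in hits[1:]:
        result = lca(result, taxon)
    return result


K = 4  # unrealistically short, chosen only to keep the toy data readable
references = {
    "Escherichia coli": ["ACGTACGGTC", "TTGACCATGA"],
    # The second contig is really E. coli, mislabeled as Salmonella.
    "Salmonella enterica": ["GGCATTCAGG", "TTGACCATGA"],
}
index = build_index(references, K)

print(classify("TTGACCATGA", index, K))  # read from the mislabeled contig
print(classify("ACGTACGG", index, K))    # read from an E. coli-only contig
```

Because the k-mers of the mislabeled contig appear under both labels, the index stores them at the shared family, Enterobacteriaceae, and a read drawn from that contig is reported at the family level rather than as Salmonella. A read drawn from sequence found only in the E. coli reference is still resolved to the species.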
As One Codex continues building out the platform, it is in discussions with various customers about their use cases and the features they'd like to see improved or added in order to extend the technology so that it has use for all its potential users, including those working in public health agencies and academic institutions, Greenfield said. One recent addition to the platform is the ability for researchers to publicly share raw data that they upload. The results generated by One Codex cannot be shared, however.
The goal for the CDC challenge was to "develop a new or innovative method to strain-type and characterize [STEC] without using culture-based methods," CDC said in announcing the challenge. Current methods can take too long to get results, while newer technologies, such as PCR-based methods, though faster, "do not help in detecting or investigating outbreaks or trends in pathogen development," the center said.
One Codex was tasked with demonstrating that its technology could identify STEC in a complex metagenomic sample, stool, chosen by the CDC because feces commonly has background E. coli, making it an especially "emblematic problem of a mixed sample" in which a technology needs high specificity in its detection, Greenfield said. He added there could be non-pathogenic E. coli in the sample and CDC was interested in methods that could differentiate pathogenic from non-pathogenic strains.
While molecular diagnostics have done well in differentiating STEC from other strains of E. coli, the challenge for NGS-based methods is "making sure you don't lose the specificity when you have so much data," Greenfield said.
One Codex first classified the metagenomic sample, which was supplied by the CDC, and found that about 99 percent of the sample did not have E. coli, but about 1 percent did. The company then strain-typed the sample to determine whether or not it was STEC.
Duncan MacCannell, chief science officer for the Office of Advanced Molecular Detection at CDC, told GenomeWeb that while the challenge was specifically for detecting STEC in stool, the expectation was that the winner's method would have applications for other sample types and other pathogens.
He added that while others have developed a cloud-based metagenomic classifier, the advantage of One Codex's platform is that it's "fairly easy to use … and it makes the idea of looking at metagenomic samples a lot less complicated."
While CDC has no agreement to use the One Codex platform, a number of the center's groups "are looking … at how the platform might work." CDC is evaluating different metagenomic approaches "and so this would be one powerful new tool in our arsenal," MacCannell said.
The CDC challenge was a "very logical extension of the platform" that One Codex is building for strain typing. Instead of showing all the bacteria, viruses, etc., in a sample, however, the new tool will show a phylogenetic tree of E. coli strains, as well as the location of specific kmers and evidence across the tree, and therefore what strain is likely present in the sample. Metadata about the individual strains are also displayed.
In founding One Codex last spring, the goal of Greenfield and Co-Founder Nik Krumm was to meet the growing demand for "powerful, but useable bioinformatics solutions," Greenfield said. As applications for NGS technology ramp up, "well-designed bioinformatics technology will only become more important," and in the coming years, researchers will need "a horizontal data and search platform for genomics in order to cope with the deluge of NGS data," he added.
In June 2014, the firm, which was incorporated as Reference Genomics but goes by One Codex, went live and went through a program created by a venture investor group called Y Combinator that helps startups build their businesses, while providing them seed funding and preparing them for later-stage financing.
One Codex received $120,000 in seed money from Y Combinator and raised additional seed funding in a financing round later in the fall. Greenfield declined to disclose the amount of the add-on funding.
While there are other companies that help researchers make sense of their NGS data, he said that One Codex doesn't necessarily see those firms, which operate on a service model, as competition.
"We don't see anyone actively building the technology and data platform first in the way that we've set out to do," Greenfield said.