Data warehousing firm Kognitio is partnering with the National Center for Genome Resources to optimize its WX2 database for the life sciences market.
Under an agreement announced this week, NCGR is using the company's WX2 database system to house data from a quickly growing fleet of Illumina Genome Analyzers that is expected to generate tens to hundreds of terabytes over the next year.
Kognitio, headquartered in Bracknell, UK, was formed through the 2005 merger of data management firm Kognitio and business software developer WhiteCross, which was founded in 1989. To date, the firm has focused on data management and analytics in telecommunications, utilities, and retail management, but it sees a promising opportunity for WX2 in the bioinformatics market.
Bioinformatics is "a very small part" of the firm's portfolio, John Thompson, CEO of Kognitio's US operations, told BioInform, "but we expect it to grow over time" because "bioinformatics and genome processing and genomic information processing is going to be a huge market."
The goal of the NCGR partnership, he said, is to help "evolve" WX2 so it "can do things for bioinformatics professionals that they can't get other places" and give the database "functional extensions that are specific to the life sciences world."
Kognitio describes WX2 as a "virtual" data warehouse appliance that runs on industry-standard hardware. The company claims that WX2 can obtain results "several times" faster than competing data warehouse appliances such as Netezza and 10 to 60 times faster than typical software-only relational databases like Oracle.
Ernest Retzel, program leader at NCGR, told BioInform that WX2 loads data 10 times faster than other systems and runs queries on the order of 10- to 100-fold faster, though he declined to specify which systems he compared with WX2.
NCGR currently has six Illumina GAs, with two more on order, Retzel said. Meanwhile, the throughput on each of those machines is advancing rapidly. The current systems can generate 20 gigabases per run, and Illumina is projecting that its technology will be able to reach 100 gigabases per run by the end of the year.
"One advantage of WX2 is that it scales well, and scales well with the addition of relatively commodity-type hardware," Retzel said.
WX2 is a relational database much like Oracle, Thompson said. "The difference is that WX2 is tailored and developed solely for analytics."
Describing the system as a "software-only massively parallel processing database," Thompson noted that it works with any x86 hardware running Linux. NCGR, for example, is using the software with Sun systems running Linux.
NCGR's software licensing and support agreement with Kognitio will be scaled as its needs change, Retzel said, adding that his team is "on a learning curve" regarding the database's bioinformatics capabilities.
Retzel said that WX2 will not entirely replace NCGR's existing relational database, from a vendor he declined to name. Some applications "that do a lot of things well" will remain in the current system, he said.
Although Kognitio does not have much experience in bioinformatics, Retzel explained that NCGR did not treat the company's offering with any more skepticism than it would other new technology. "There is always caution in exploring new technology when you are operating in a production environment," he said. He and his colleagues spoke to other WX2 users, although he admitted that "the number [of Kognitio users] in our problem area is small."
As part of the evaluation, NCGR scientists asked Kognitio to develop a proof of concept on Kognitio's own servers using real datasets, and to provide on-site support to get it running in NCGR's environment.
A disk with an NCGR database was loaded into WX2 at Kognitio's UK headquarters. "I won't say this magically happened," Retzel said, indicating that the process took over a year.
Thompson explained that the database must be indexed and various optimization procedures have to be performed in order to run queries optimally. "You have to understand the question to set up the database so it works well for you," he said.
At NCGR, scientists first thought they would need to optimize their data for WX2, but Thompson said that process wasn't obligatory.
Nevertheless, "It's not a plug-and-play," Retzel said, adding that Kognitio worked "very hard" at addressing any issues. "There has been an attention paid to our specific problem that I have not seen in any vendor I have dealt with."
"Kognitio has been incredibly supportive in both understanding our application area, our specific database issues," Retzel said. He added that the firm continues to assist his team with reviewing database query results, optimizing, loading, and querying tasks.
Breaking into a New Market
Thompson said he began to explore the bioinformatics market when he became CEO of Kognitio's US operations in 2007. He first approached researchers at the California Institute for Telecommunications and Information Technology, or Calit2, a partnership between UC San Diego and UC Irvine.
Calit2 had been developing a data repository on a Sun cluster system with a PostgreSQL database for its Cyberinfrastructure for Advanced Marine Microbial Ecology Research and Analysis, or CAMERA, database, which houses metagenomics data from Craig Venter's Sorcerer II expedition.
According to a Kognitio case study, the Calit2 researchers found that they could load the CAMERA repository in 50 minutes as opposed to 36 hours on the PostgreSQL system.
To test the system, the scientists gave Kognitio six or seven multi-parameter queries, Thompson said, asking, for example, for sequences of samples with particular characteristics from a certain water temperature range. They were able to run queries with WX2 that hadn't been possible with their previous database, Thompson said.
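To give a sense of what such a multi-parameter query might look like, here is a minimal sketch using Python's built-in sqlite3 module rather than WX2. The table names, column names, and sample values are hypothetical and were not supplied by Kognitio or Calit2; the point is only the shape of a query that filters sequences by sample characteristics and a water temperature range.

```python
import sqlite3


def build_demo_db():
    """Create a tiny in-memory database with a made-up (hypothetical) schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE samples (
            sample_id    INTEGER PRIMARY KEY,
            site         TEXT,
            water_temp_c REAL,
            depth_m      REAL
        );
        CREATE TABLE sequences (
            seq_id    INTEGER PRIMARY KEY,
            sample_id INTEGER REFERENCES samples(sample_id),
            bases     TEXT
        );
        """
    )
    # Illustrative rows only; these are not CAMERA data.
    conn.executemany(
        "INSERT INTO samples VALUES (?, ?, ?, ?)",
        [(1, "site_a", 12.5, 5.0), (2, "site_b", 24.0, 2.0), (3, "site_c", 26.5, 40.0)],
    )
    conn.executemany(
        "INSERT INTO sequences VALUES (?, ?, ?)",
        [(10, 1, "ACGT"), (11, 2, "GGCA"), (12, 2, "TTAG"), (13, 3, "CCGA")],
    )
    return conn


def sequences_in_temp_range(conn, min_temp_c, max_temp_c, max_depth_m):
    """Return sequences from samples within a temperature range and depth limit."""
    return conn.execute(
        """
        SELECT q.seq_id, q.bases
        FROM sequences AS q
        JOIN samples AS s ON s.sample_id = q.sample_id
        WHERE s.water_temp_c BETWEEN ? AND ?
          AND s.depth_m <= ?
        ORDER BY q.seq_id
        """,
        (min_temp_c, max_temp_c, max_depth_m),
    ).fetchall()


if __name__ == "__main__":
    db = build_demo_db()
    for row in sequences_in_temp_range(db, 20.0, 30.0, 10.0):
        print(row)
```

Each additional parameter adds another condition to the WHERE clause, and against a large repository the cost of the join and filters is where load and query speed start to matter.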
The CAMERA scientists remained skeptical, he said, so Kognitio gave them the database and the software to work with, and the scientists got "the exact same results we did."
Building on the CAMERA experience, he approached NCGR, which "has the sequencing side of the house figured out," he said, but was looking for help with data handling. Thompson said he was able to show the scientists that "you can rip through terabytes of data that you couldn't do before."
Attack the Data
Retzel said that in the next phase of the partnership, NCGR is "increasing the size and scope" of its WX2 license to accommodate "several new projects" that he declined to specify.
"One of my interests in their technology is the ability to write external code to access the database using their 'plug-in' facilities," Retzel said. The plug-ins allow scientists to write the routines that "attack the data" in tailored ways, Thompson said.
SQL is a non-procedural language that is difficult to use for complex routines, unlike C, Thompson said. "We can take C code and people can write interesting routines that do things to the data; we can ingest those into the database and then parallelize them across the entire server farm." If WX2 is implemented in a server farm with 400 processors, complex routines can be run across all the processors, he said.
Thompson noted that Cambridge University physicists are applying this approach to their WX2 environment with six different transformations of astronomy data, performed in sequence. The Cambridge scientists wrote the six sequential transformations and then "we wrapped them as plug-ins and we ingested them into WX2," he said. Transforms that previously took 20 days to ready data for researchers took 5 minutes on WX2, he said.
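The general pattern Thompson describes, a fixed sequence of transformations applied to every partition of the data in parallel, can be sketched roughly as follows. This is not Kognitio's plug-in API, which the article does not detail, and it is written in Python rather than C. The three transform functions are hypothetical stand-ins (the Cambridge pipeline reportedly has six), and a thread pool stands in for the processors of a server farm.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


# Hypothetical stand-ins for user-written transformation routines.
def calibrate(x):
    return x - 1.0


def scale(x):
    return x * 2.0


def clip(x):
    return max(x, 0.0)


# The transforms run in this fixed order on every value.
TRANSFORMS = [calibrate, scale, clip]


def run_pipeline(value):
    """Apply each transform in sequence to a single value."""
    return reduce(lambda acc, fn: fn(acc), TRANSFORMS, value)


def process_partition(partition):
    """Run the whole pipeline over one partition of the data."""
    return [run_pipeline(v) for v in partition]


def process_all(partitions, workers=4):
    """Process partitions concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(process_partition, partitions))


if __name__ == "__main__":
    print(process_all([[0.0, 1.0], [2.5]]))
```

Because each partition is processed independently, adding workers spreads the same routines over more of the data at once, which is the property Thompson attributes to running plug-ins across a server farm.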
Retzel said another personal interest lies in the integration of different data types. Genomic sequencing, transcriptome sequencing, and small RNAs should lead to a situation in which scientists don't just have "one view of the data" but can look at it in the context of other "supporting" information, for example, connecting expression differences and methylation patterns.
Feedback with Kognitio will continue, he said, since "problems we have, are problems other people will have."