UC Irvine Team Develops Data Compression Method


Scientists at the University of California, Irvine, have developed a new method to compress genomic data and have demonstrated in a recent paper that the approach can convert a human genome into an e-mailable attachment.

To arrive at their results, the scientists at Irvine's department of computer science and the Institute for Genomics and Bioinformatics used a series of compression techniques that enabled them to reduce James Watson's genome from around 3 gigabytes to 4 megabytes, which they say made the data "small enough to be sent as an e-mail attachment."

The study, which appeared in the January 15 print issue of Bioinformatics, was intended not only to take on genome compression issues, but also to "spark some discussion," says Xiaohui Xie, a professor in the school of information and computer sciences at UC Irvine.

"Just providing a program to compress is not sufficient, even if it is an easy-to-use, flexible program available for all types of computers," he says.

Rather, the challenge facing researchers is that public databases "do not currently allow them to submit their data compressed with our program, or actually any program other than basic gzip," he says. "Nor do those public databases provide their data compressed with our program."

Xie says he thought of publishing the compression technique as a way to "get the community talking about coming up with a standard, [and] seriously tackling the problem and making advanced compression a core part of the data infrastructure."

As the authors point out in their paper, "general-purpose" compression programs such as gzip can substantially decrease data size, but special-purpose algorithms can compress the data even further "by orders of magnitude."

The basic idea is that it is "more efficient" to store only the variations between genomes, since any two human genomes are more than 99 percent identical.
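
To make the idea concrete, here is a minimal Python sketch of storing a sequence as its differences from a reference. It is an illustration only, not the UC Irvine team's program: the function names and the toy sequences are hypothetical, and Python's standard difflib module stands in for a real genome aligner. The published method combines a series of further compression techniques that are not shown here.

```python
import difflib


def encode_variants(reference, genome):
    """Return the edits that turn `reference` into `genome`.

    Each edit is (start, end, replacement): replace reference[start:end]
    with `replacement`. Regions identical to the reference are not stored.
    """
    matcher = difflib.SequenceMatcher(None, reference, genome, autojunk=False)
    return [
        (i1, i2, genome[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


def decode_variants(reference, variants):
    """Rebuild the genome from the reference plus the stored edits."""
    parts = []
    pos = 0
    for start, end, replacement in variants:
        parts.append(reference[pos:start])  # unchanged stretch of the reference
        parts.append(replacement)           # the variant itself
        pos = end
    parts.append(reference[pos:])           # unchanged tail
    return "".join(parts)


# Toy data (hypothetical): one substitution and one short insertion.
reference = "ACGTACGTTTGACCAGTACGGATCCA"
genome = "ACGTACGTTTGTCCAGTACGGAAATCCA"

variants = encode_variants(reference, genome)
restored = decode_variants(reference, variants)
```

Only the list of edits has to be stored or transmitted, provided the recipient already holds the same reference sequence. On real genomes the saving comes from the edit list being tiny compared with the full sequence.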

Vivien Marx

Bioinformatics Notes

The Centre for Applied Genomics at Toronto's Hospital for Sick Children is now using GenoLogics' lab and data management system in the center's microarray analysis and gene expression, DNA sequencing and synthesis, cytogenomics and genome resources, and genetic and statistical analysis facilities.

Genomatix Software has joined the Illumina-Connect program to develop tools and applications for data generated with Illumina second-generation sequencing technology.

A team at Imperial College London published a comparison in The Journal of Proteome Research of GE Healthcare's DeCyder v6.5, Nonlinear Dynamics' Progenesis SameSpots v3.0 and Syngene's Dymension 3. The paper says SameSpots beat the other packages in matching accuracy.


$1.9 million
Amount that Collaborative Drug Discovery received from the Gates Foundation to create a cheminformatics database for scientists developing therapies for tuberculosis.

Funded Grants

$107,200/FY 2008
Enzyme Isoselective Inhibition: a Novel Computational Approach to Drug Design
Grantee: Amnon Albeck, Bar-Ilan University
Began: Jul. 1, 2008; Ends: Jun. 30, 2009
With this grant, Albeck aims to develop and implement a computational methodology for virtual screening and design of new covalent transition-state analog enzyme inhibitors. The long-term goal is to present a database that could be used as an information source or as a tool for drug design, as well as a mature algorithm that will be incorporated into a software package.

$230,706/FY 2008
Accelerating Biomolecular Simulations on Reconfigurable Computing Hardware
Grantee: Pratul Agarwal, Oak Ridge National Laboratory
Began: Aug. 15, 2008; Ends: Jun. 30, 2010
This grant will support Agarwal's development of biomolecular simulation software for adaptive computing platforms, including reconfigurable computing (RC) hardware and general-purpose graphics processing unit (GPGPU) hardware. PMEMD and LAMMPS will be ported to and optimized for popular RC/GPGPU devices.