Richard Casey is a project manager in the IT division of Agilent Technologies in Fort Collins, Colo. He manages enterprise software and database development projects. His background is in life sciences and information technology. He can be reached at [email protected]
One of the more daunting challenges for genomics and proteomics researchers is integrating and sharing information among the hundreds of databases, applications, laboratory information systems, gene arrays, and myriad other sources of bioinformatic data. Any methods or systems that ease the burden of integrating data from these many sources are welcome.
Over the past several years, the Extensible Markup Language (XML) has evolved to allow individuals and organizations to share XML-based data and documents over the Internet. Because it lets information be shared in a standardized way, XML has gained wide acceptance in the bioinformatics community as a method of storing and exchanging gene expression, proteomic, and annotation data. Some well-known XML-based formats that researchers use to exchange such data include the Gene Expression Markup Language (GEML), Bioinformatic Sequence Markup Language (BSML), and Genome Annotation Markup Elements (GAME). In addition, a web of public and private databases and applications supports genomic-proteomic research, including the Protein Database, SWISS-PROT, GenBank, BLAST, and FASTA. Many of these databases and applications support XML for importing, exporting, exchanging, and storing bioinformatic data.
Oracle 9i, the newest database from Oracle, supports XML in a way that could dramatically improve the exchange of bioinformatic data between individuals and organizations.
Technically, version 9i supports a new datatype called XMLType. This means that XML data can be treated like any other native datatype (e.g., character or numeric data) in the database. Entire XML documents, and sets of documents, can be stored directly in tables in 9i databases.
Table columns in turn can be defined such that they hold XMLType data, and each row or record in the table can hold an entire XML document. Once stored in 9i tables, a full set of standard, built-in SQL functions can be used to insert, update, delete, extract, and query XML data and documents, just like any other datatype.
Because XML is treated as a native datatype, developers and software engineers can write database queries using the simple, standard SQL calls with which they are already familiar. They do not need to learn a new programming language to access the data. Furthermore, if a large amount of XML data is stored in the database, indexes and other standard performance-enhancing methods can be employed to speed up queries and perform data management functions.
The ability to tune performance with indexes and similar methods is an important factor in database design, considering the large amount of genomic and proteomic data being created today. Also, queries can be run against XML documents such that only specific sections or subsets of each document are searched and retrieved, which allows for powerful data manipulation.
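The idea of storing whole XML documents in a table column and retrieving only selected parts of them with ordinary SQL can be sketched roughly as follows. This is only an illustration, not Oracle's XMLType interface: it uses Python's built-in SQLite module as a stand-in database, and the table name gene_annotation, the SQL function xml_extract, and the sample documents are all hypothetical names and values made up for the example.

```python
import sqlite3
import xml.etree.ElementTree as ET


def xml_extract(document, path):
    """Return the text of the first element matching `path`, or None.

    Hypothetical helper: it stands in for the kind of built-in XML
    function a database with native XML support would provide.
    """
    node = ET.fromstring(document).find(path)
    return node.text if node is not None else None


# In-memory SQLite database used purely as a stand-in for illustration.
conn = sqlite3.connect(":memory:")
# Expose the helper to SQL so it can be called inside queries.
conn.create_function("xml_extract", 2, xml_extract)

# Each row holds one complete XML document in the `doc` column.
conn.execute("CREATE TABLE gene_annotation (id INTEGER PRIMARY KEY, doc TEXT)")

# Illustrative sample documents (assumed layout, not a real GEML/BSML/GAME schema).
docs = [
    "<gene><name>BRCA1</name><organism>Homo sapiens</organism></gene>",
    "<gene><name>TP53</name><organism>Homo sapiens</organism></gene>",
    "<gene><name>Adh</name><organism>Drosophila melanogaster</organism></gene>",
]
conn.executemany(
    "INSERT INTO gene_annotation (doc) VALUES (?)", [(d,) for d in docs]
)

# Search on one section of each document and retrieve only another section.
rows = conn.execute(
    "SELECT xml_extract(doc, 'name') FROM gene_annotation "
    "WHERE xml_extract(doc, 'organism') = ? ORDER BY id",
    ("Homo sapiens",),
).fetchall()
human_genes = [row[0] for row in rows]
print(human_genes)
```

The query reads like any other SQL statement, which is the point made above: the XML-specific work is hidden behind a function call, and the rest is familiar SELECT and WHERE syntax.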
Operational data stores (ODS), sometimes called data integration hubs, are databases that collect, transform, and integrate data from a variety of sources and send it to data warehouses, decision support systems, and reporting tools. Bioinformatic data hubs could be built to integrate XML data derived from various source systems and deliver it to bioinformatic warehouses. In the ODS, developers could enforce “bioinformatic business rules” to ensure that only correctly integrated and properly transformed bioinformatic data winds up in the data warehouse. By acting as data integration hubs operating on standardized XML data, the data stores could perform an invaluable, integrative service for the bioinformatic community.
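As a rough sketch of what enforcing such rules in a hub might look like, the Python example below checks incoming XML records and passes along only those that satisfy every rule. The record layout, the three rules, and the names check_record and route are assumptions invented for illustration; a real hub would define its own rules and would likely validate against a published schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical rules: every record needs these fields, and the sequence
# may contain only nucleotide codes.
REQUIRED_FIELDS = ("accession", "organism", "sequence")
VALID_BASES = set("ACGTN")


def check_record(xml_text):
    """Return a list of rule violations; an empty list means the record passes."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as exc:
        return [f"not well-formed XML: {exc}"]

    problems = []
    for field in REQUIRED_FIELDS:
        node = root.find(field)
        if node is None or not (node.text or "").strip():
            problems.append(f"missing {field}")

    sequence = (root.findtext("sequence") or "").strip().upper()
    if sequence and not set(sequence) <= VALID_BASES:
        problems.append("sequence contains non-nucleotide characters")
    return problems


def route(records):
    """Split records into those bound for the warehouse and those held back."""
    accepted, rejected = [], []
    for text in records:
        problems = check_record(text)
        if problems:
            rejected.append((text, problems))
        else:
            accepted.append(text)
    return accepted, rejected


# Made-up sample input: one clean record and three that break a rule.
incoming = [
    "<record><accession>X0001</accession><organism>Homo sapiens</organism>"
    "<sequence>ACGTTGCA</sequence></record>",
    "<record><accession>X0002</accession><organism></organism>"
    "<sequence>ACGT</sequence></record>",
    "<record><accession>X0003</accession><organism>Mus musculus</organism>"
    "<sequence>ACGU</sequence></record>",
    "<record><accession>X0004</accession>",
]
accepted, rejected = route(incoming)
print(len(accepted), "accepted;", len(rejected), "rejected")
```

Keeping the reasons for each rejection alongside the record, rather than silently dropping it, gives the source system something concrete to correct before resubmitting.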
Opposite Strand is a forum for readers to express opinions and ideas about trends and issues in genomics. Submissions should be kept to 550 words and may be submitted to [email protected]