Microsoft Seeks Scientific Community's Input on Next Release of Open-Source Bioinformatics Toolkit

By Uduak Grace Thomas

As Microsoft Research gears up for the next release of its Microsoft Biology Foundation next summer, its developers are asking for the scientific community's input on what functionalities they would like to see included in version 2.0 of the open-source bioinformatics toolkit.

Simon Mercer, director of Health and Wellbeing at Microsoft Research, told BioInform that for the next version of MBF, the development team will focus on "scenario-based development," which involves asking current and potential users to submit scenarios that will help the team "better understand the needs of the scientist" and drive the development of a "rich repository" of features that will meet researchers' needs.

"We need to know what people need to do with sequences that they can't do or that they find cumbersome, because if we can learn where those pain points lie, then those are the new features that we will put in," Mercer said. "The better we understand that process, the more closely we can ensure that MBF will support all the stages in that process."

Mercer made the call for submissions at the Microsoft eScience Workshop held in Berkeley, Calif., this week. Researchers interested in submitting ideas can send them here.

He said that there isn’t a specific list of criteria that researchers need to abide by when submitting scenarios and that his team will work with groups to translate their "raw materials" into a suitable programming format.

To make that process easier and to encourage more groups to get involved, Mercer has hired an additional staff member tasked with the responsibility of "building bridges" to research groups in academia and commercial settings.

Mercer's team has already begun receiving suggestions that it is working on implementing. For example, it is currently working on a request submitted by Ricardo Vencio, a professor in the genetics department at the Universidade de Sao Paulo, Brazil.

Vencio is the principal investigator of a team that is working with genomic data from the sugarcane plant, and Mercer said that his team is working with Vencio's group to build a scenario that contains a "detailed description of the experimental processes" the researchers need to perform for that task.

The result of the collaboration, he said, will be functionalities that "fill in the gaps" in the current version of MBF by providing components like sequence data cleaning and quality assurance, quality assessment of sequence assemblies, and genome annotation.

"The purpose of MBF is to cover the biological research space from when sequence is generated … all the way through to contiguous and fully annotated DNA sequence," Mercer said. He added that the sugarcane project provides "a great scenario" because it "covers exactly that end-to-end space."

But it's more than just meeting the needs of one group. "DNA is DNA," he pointed out. "Many of the challenges … are common with any other organism you might sequence," and, as such, the functionalities that result from a single scenario can be used to address the needs of other projects.

In addition to providing a "richer picture" of researchers' needs, the scenario-based approach is meant to increase the value of the MBF toolkit to the scientific community, Mercer noted. Researchers will have access to the prewritten capabilities they need to build pipelines or applications, so they can focus on their research rather than on building tools from scratch.

Mercer's team also plans to use the scenarios to scale up MBF's ability to handle large quantities of data and ensure that the "performance of analytical algorithms meets, or exceeds, the expectations of the user … on the types of hardware researchers would typically have available."

He explained that while there is a lot that can be done to MBF to optimize the code and "tweak performance," the team has only finite resources available and therefore needs to prioritize tasks and set realistic goals that will meet scientific needs, but won't be too difficult to maintain.

Using the sugarcane-based scenario as an example, Mercer explained that a researcher working with the plant's complex genome — estimated to be approximately 10 gigabases, with ploidy at 8X to 10X — may adopt one or more sequencing technologies and generate data with different characteristics.

"If we understand the scientific challenges we can implement efficient computational solutions," he said. "For example, we might choose to represent long reads in memory in a different way [than] short reads."

MBF 1.0, released last summer, currently includes several features to improve efficiency, such as a data virtualization layer, "which takes care of the caching necessary to handle very large flat files," and .NET Parallel Extensions, which "sense how many cores and processors are in a machine and parallelizes the workload across them."

Mercer said that the team is currently "evaluating whether it is feasible to extend" the .NET Parallel Extensions approach further, to "leverage any GPUs that might be available for a further performance boost" for MBF 2.0.

There isn’t a specific deadline for submitting scenarios, but Mercer recommended that researchers turn in ideas as soon as possible so that the features can be developed and added in time for the next release. He added that the company does not intend to stop collecting scenarios with the release of MBF 2.0.

"What we would in an ideal world like is a constant stream of scenarios … and a continual and open dialogue with the research community," he said, "so that we understand what they need not only this month and next month but next year and the year after as we continue to develop MBF."

A Good Response

Microsoft launched MBF 1.0 in July at the Intelligent Systems for Molecular Biology conference (BI 07/16/2010). The company said that since its release, both universities and companies have been using MBF as the foundation to develop a range of analytical tools.

"What we are finding overall is that people tend to adopt small parts and then discover the potential and the value of the parts to add value to what they are doing," Mercer said. For example, "we have a number of groups who have adopted the file parsers but now that they realize that we have a de novo sequence assembler, they are starting to look at that to adopt that into their pipelines."

One commercial group that has adopted the infrastructure is the informatics team at Johnson & Johnson Pharmaceutical Research and Development. The group used the tool to build capabilities that let researchers integrate small- and large-molecule discovery data with its in-house biological and chemical discovery informatics platform. The group presented this work at the eScience Workshop this week.

In a statement, Jeremy Kolpak, a senior analyst at Johnson & Johnson Pharmaceutical R&D, said building on MBF's existing functionalities saved the group "a tremendous amount of time" and enabled it to focus on developing "higher-level analysis and visualization capabilities" for the platform.

On the academic side, a group at Indiana University used MBF in two of its parallel sequence alignment programs.

Judy Qiu, a professor at Indiana University and the leader of the group, told BioInform via e-mail that the programs are used to perform pair-wise alignment of sequences. She said that both programs were "developed as message passing interface applications where the final outcome is a symmetric square matrix of distances between each sequence."

"Each MPI process computes the distances between a subset of sequences with all the sequences. These partial matrices are finally combined to form a full matrix or saved as they are depending on the requirement," she explained. "Partial computation is optimized by considering the symmetry of the total pair-wise calculation, thus avoiding the calculation of distance between the same two sequences more than once."

She added that the MBF sequence alignment algorithms are used "behind the parallel MPI code to compute the alignment of two sequences."
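The decomposition Qiu describes can be illustrated with a short Python sketch. It is not the group's code, which is built on MPI and MBF's .NET libraries: the cyclic assignment of rows to workers, the `toy_distance` placeholder, and every function name here are assumptions made for illustration. Each simulated worker computes only the pairs above the diagonal for the rows it owns, so no pair is computed twice, and the partial results are then merged into the full symmetric matrix.

```python
from itertools import zip_longest


def toy_distance(a, b):
    """Placeholder distance (an assumption): number of mismatched positions.

    A real pipeline would derive the distance from an alignment instead.
    """
    return sum(1 for x, y in zip_longest(a, b) if x != y)


def partial_distances(seqs, rank, size, dist=toy_distance):
    """Compute the distances owned by one worker (one simulated MPI rank)."""
    n = len(seqs)
    out = {}
    for i in range(rank, n, size):   # cyclic row assignment (an assumption)
        for j in range(i + 1, n):    # upper triangle only: each pair once
            out[(i, j)] = dist(seqs[i], seqs[j])
    return out


def combine(partials, n):
    """Merge partial results into a full symmetric distance matrix."""
    matrix = [[0] * n for _ in range(n)]
    for part in partials:
        for (i, j), d in part.items():
            matrix[i][j] = d
            matrix[j][i] = d         # mirror across the diagonal
    return matrix


seqs = ["ACGT", "ACGA", "TCGA", "ACG"]
size = 2  # number of simulated workers
partials = [partial_distances(seqs, rank, size) for rank in range(size)]
matrix = combine(partials, len(seqs))
```

In an actual MPI program each rank would run `partial_distances` in its own process and a gather step would replace the list comprehension; the sketch runs the ranks one after another only to stay self-contained.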

So far, Qiu's group has used the Smith-Waterman and Needleman-Wunsch algorithms available in MBF and has "extended the set of the built-in similarity scoring matrices in MBF by providing the functionality to load any standard scoring matrix from a file."
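To make the idea of a file-loaded scoring matrix concrete, here is a minimal Python sketch of Needleman-Wunsch global alignment scoring driven by a matrix read from text. It does not use or reproduce MBF's API. The whitespace-separated layout with `#` comment lines follows the common convention for substitution matrix files, and the function names, the toy DNA matrix, and the linear gap penalty of -2 are assumptions chosen for illustration.

```python
import io


def load_scoring_matrix(handle):
    """Read a whitespace-separated substitution matrix into a dict.

    Lines starting with '#' are comments; the first data line lists the
    column symbols, and each following line is a row symbol plus scores.
    """
    symbols = None
    scores = {}
    for line in handle:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if symbols is None:
            symbols = fields
            continue
        row, values = fields[0], fields[1:]
        for col, value in zip(symbols, values):
            scores[(row, col)] = int(value)
    return scores


def needleman_wunsch_score(a, b, scores, gap=-2):
    """Global alignment score with a linear gap penalty (two-row DP)."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + scores[(a[i - 1], b[j - 1])]
            up = prev[j] + gap
            left = curr[j - 1] + gap
            curr.append(max(diag, up, left))
        prev = curr
    return prev[-1]


# Toy matrix standing in for a file on disk (hypothetical content).
MATRIX_TEXT = """# toy DNA matrix
   A  C  G  T
A  1 -1 -1 -1
C -1  1 -1 -1
G -1 -1  1 -1
T -1 -1 -1  1
"""
scores = load_scoring_matrix(io.StringIO(MATRIX_TEXT))
score = needleman_wunsch_score("GATTACA", "GATTACA", scores)
```

Passing an open file object instead of the `io.StringIO` wrapper would load a matrix from disk in the same way.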

The team also used MBF capabilities like the Fasta parser to perform initial parsing of sequences.
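For readers unfamiliar with the format, the parsing step amounts to something like the following Python sketch. It is a generic FASTA reader written for illustration, not MBF's parser, and the function name and sample records are hypothetical.

```python
import io


def parse_fasta(handle):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence lines may wrap; join them later
    if header is not None:
        yield header, "".join(chunks)


# Hypothetical two-record input standing in for a file on disk.
FASTA_TEXT = """>seq1 example record
ACGTAC
GTAC
>seq2
TTGA
"""
records = list(parse_fasta(io.StringIO(FASTA_TEXT)))
```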

In addition to accepting scenarios from the community, Mercer's team is also coming up with scenarios of its own to cover functionalities that may not necessarily be a priority in the researcher's mind.

For example, the team wants to ensure that programmers who want to add their code to the tool "have a seamless and easy experience" doing so. As such, Mercer's team has come up with what it terms a "developer scenario" that aims to make this process easier.

He also said that the group is looking at incorporating visualization capabilities into MBF and has considered using some Microsoft tools, such as Seadragon and Photosynth, to further develop these capabilities.

Once again, however, Mercer said that rather than making assumptions about the kinds of visualization tools the community needs, his team will defer to researchers' suggestions in order to develop "relevant" tools.
