Atul Butte Discusses Integration of Microarray Data and Standards


At A Glance

  • Atul Butte
  • Instructor in pediatrics, Harvard Medical School; assistant in endocrinology and informatics, Children’s Hospital, Boston.
  • Postdoctoral — Completed fellowship in endocrinology and informatics, Children’s Hospital
  • Honors — Howard Hughes Medical Research Training Fellowship for Medical Students
  • MD, Brown University School of Medicine

Atul Butte’s opinions on microarray data standards and integration might appear to veer from the mainstream. But, with a background as a software engineer at Microsoft and Apple, as well as an MD from Brown, a specialty in pediatric endocrinology, and a Master’s in medical informatics from MIT, Butte straddles the crossroads of computer science and biology.

Butte, the co-author of one of the first books on microarray analysis, “Microarrays for an Integrative Genomics” published by MIT Press, is currently on staff at the Children’s Hospital Informatics Program in Boston, and is an instructor at Harvard Medical School. He is also the founder of three software companies — Mango Tree Software, Xpogen, and Genstruct.

In addition to the microarray analysis book, which he co-wrote with Isaac Kohane and Alvin Kho, Butte has also authored several recent papers on the subject. These include “The use and analysis of microarray data,” published in the December 2002 issue of Nature Reviews Drug Discovery, and “Comparing expression profiles of genes with similar promoter regions,” published in the Dec. 18, 2002, issue of Bioinformatics.

You seem to be pretty busy.

I’ve given 40 talks in the last three years. I write a lot, but not enough, if you listen to my boss.

How did you get involved with microarrays?

I was working in collaboration with the Todd Golub group on microarray data [at the] Whitehead [Institute]. We were staring at the data for a long time when Isaac Kohane got the idea to combine that data with another dataset to see how well drugs work on those 60 to 70 cancers and the genes for those cancers. It’s been non-stop since then.

Tell me about your group.

We are the biggest bioinformatics group in Boston. We’ve grown to a huge number with 20 people doing microarray analysis, working with eight different core facilities in Boston and analyzing 50 to 100 microarrays a week. We aren’t a core facility, we are collaborators. It’s a thankless job, [as] we’re viewed as a core facility. We work with people, helping design the experiments.

What microarray technology are you using?

Almost exclusively Affymetrix. The majority of researchers we work with in Boston and Cambridge use Affymetrix. It’s much easier for a new lab to get started, compared to spotting arrays, where it takes 18 months to get up and running. You just buy the latest chip from a company that knows what you are doing.

Isn’t there a cost differential?

Price is very transient. You need more than just price to justify the technology. For people who are working on zebrafish, you are out of luck unless you want to make your own. For the vast majority, Affymetrix is good. The latest Affymetrix chip has the whole genome across two chips, instead of the previous five. Each data point costs less than it did a year ago. Yes, you are going to find genes that aren’t on there. So, you run it again next year when the next chip comes out. It’s a subscription model.

If you had unlimited funds, what would you do?

I would try to study diabetes and research how the insulin receptor works. That’s my area of interest. I want to study the protein components of how insulin does its signaling and use microarrays to figure out signal transduction.

Can I get you to look into your crystal ball for me and tell me what you see?

Integration. More and more people will be putting out data on websites and repositories, because the journals will make them do that. We need to come up with applications that will take advantage of these.

Two years ago, the National Heart Lung and Blood Institute began 11 programs in genomic applications. Each of the 11 is a consortium of four to 15 different sites, huge institutes that will use the state-of-the-art technology to measure tissues and put out the information in 60 days from quality assurance, well before publication of data. It’s a novel concept to generate data for the rest of the world. But it is so tough to get at the data — each has its own format, the downloading. There are 5 million pieces of information from 1,200 microarrays and not a single one of those has appeared in a publication.

What is your solution for this?

We can figure out how to get at all this data and we don’t need all these standards. It’s not rocket science to figure out the arrays that are available on the Internet. I’m a big fan of integration now, instead of setting up committees to come up with standards. That’s not required to get integration. You can do queries and find all sorts of gene-gene connections that would never be found if you looked at the data individually.

You talk about integration being easy to accomplish today. Why isn’t that happening?

There is an influx of computer scientists in bioinformatics that want to work on these standards because they don’t understand the biology. The science is driving this. It’s not rocket science, it’s just writing a few Perl scripts. If we get the microarrays [fitting into] these standards, we still don’t have the context. The rate limitation is what is in these samples. You can read the website and the paper to figure out what sample is which. We can try to code the samples but no amount of informatics can code samples completely. So much other clinical data went along with those patients that will never appear on the website.

You have a unique background, mixing computer science and biology. How did you get involved in both fields?

I was in the right place at the right time. I started off in computer science at Brown University and I was completely into computer science. I worked at Microsoft and at Apple and I can code in my sleep, no problem. When I entered medical school, I had never stepped into a biology lab before. I didn’t want to be an applications programmer, so I worked at the Howard Hughes Medical Institute. I sequenced, ran gels, ran PCR and got my hands wet. Now, I have no fear in talking to a biologist or a computer scientist and I can understand either vocabulary. After medical school, I worked six years at Boston Children’s hospital. I’ve been here on staff for more than a year now.

My ideal is to be in the middle ground, knowing the computer science and knowing the biology.

What software do you use?

I use Matlab, a generic mathematics and matrix type of environment not specifically [designed] for genomics. It’s very facile with numbers and large matrices, like Mathematica. I have also written scripts to analyze people’s data. GeneSpring and others are beautiful programs, but the hard part is that it’s very easy for biologists to tweak experiments in a way that prohibits off-the-shelf analysis. It’s very easy to exceed what is available. So you have to go to basic programs and code what you want.

What do you write your code in?


You worked at Microsoft and Apple. What are your thoughts on Microsoft in the life sciences market?

I think Microsoft is absolutely where they want to be. Steve Jobs [of Apple] has interesting vision, but Bill Gates really knows the technology. He helps with programming and knows the code.

Microsoft Research does a lot of work in data mining and knowledge discovery and those tools are applied in life sciences. But, otherwise, the market is too small. The microarray industry is on an up ramp, but they would have to build new tools. And right now, it’s not a good time to be a tools company. Many of those companies are now morphing from tools companies to discovery partners running the tools.

What is the biggest issue in microarrays now?

It’s not the sample preparation, not the chip and microfluidics; it’s not the analysis, it’s just hard to figure out what you have. If you are writing up a paper with data collected a year or a year and a half ago, and you have lists of genes and you want to come up with a story that explains, there are very few automated tools that do that. The names and what we know [are] constantly changing. I might have run an experiment three years ago and at the top of the list would have had an unknown, and a second unknown. Now we have a name, an ontology, and a position in the genome.

What is your advice to computer scientists?

Interesting findings cannot end on the computer screen. You have to go back to the biologist and learn the vocabulary. Read a couple of review articles in the biomedical domain that one is studying, that would help a lot. There are no shortages of review articles in the journals: Do the homework. That’s how to add the most value.
