November 2002: Keeping Whitehead on Edge

It's a little after 10:00 on a characteristically chilly, gray Boston morning. Jill Mesirov, CIO of the Whitehead Institute Center for Genome Research, is sitting in the back of the weekly cancer genomics meeting when someone tiptoes into the room and passes her a note. The center's sequencing informatics director is on the phone, and it's urgent. Mesirov is on her feet in a flash and across the hall in her office before anyone in the meeting even notices.

Later, she laughs about the incident. The director was looking for some data, and Mesirov couldn't help because the data wasn't finished yet. "I thought she was calling to say the system was down," Mesirov says, relieved. "It's happened before."

Spend a day with Mesirov, and it's easy to get the feeling that what she does most is put out fires. In a rare break from meetings, she sits down at her computer — she clings to her G4, one of the few Macs in the center — and scans through e-mail. "Let's look for emergencies," she says, half to herself, as she settles into her chair and scrolls through dozens of new messages.

Emergency response may be a major component of her job, but it's Mesirov's knack for preventing emergencies that makes her even more valuable to Whitehead. Her boss Eric Lander notes, "She's had world-class experience in cryptography, parallel computing, and data mining. It's a great background to have as biology is fast turning into an information science."

Mesirov joined the center almost five and a half years ago and spent tremendous effort taking the place from the way she found it — where servers crashed simply because they were left in hallways to bake under the sun streaming in — to where it is now, a mega-infrastructure that can handle 45 million lanes of sequence per year with some 40 terabytes of online storage capacity.

The scale-up over the last three years has been most notable. Mesirov's annual budget is $10 million to $15 million for all compute needs throughout the center's three programs: sequencing and genome analysis, medical and population genetics, and cancer genomics. She's used this over the past three years to bring in $6 million in new hardware — "the retail value is obviously much higher," she notes — and spends the rest on software elements and personnel.

Mesirov's budget may be generous compared to some institutions, but it doesn't touch the $22 million Sanger dropped on a new compute system this fall. "We operate on a pretty lean budget and I think that we produce a tremendous amount of important science with it," she says, conceding that "if we had a larger budget, we could do more."

Now, her task is keeping the genome center at the cutting edge, particularly in the face of challenges such as NCBI's trace repository and the anticipated departure from the Compaq Alpha line, the system that underlies all of the center's infrastructure.

Up to Speed

Like the genome facility, Mesirov herself has come a long way to get where she is today. The Philadelphia native spent 10 years at Thinking Machines, where she met Whitehead's Lander in 1987 and worked with him to implement the Smith-Waterman algorithm on the company's massively parallel computer. It was her first exposure to computational biology. "He really got me hooked," she recalls. "My whole research direction changed." A visualization of that first work with Lander, now a good friend, hangs outside Mesirov's office.

When the company went bankrupt, Mesirov headed for IBM, where she managed the bioinformatics and computational biology market for North America for two years. "That was at a time when IBM was sort of dancing around getting seriously involved in that market but had not yet made a serious financial commitment to it," she says. "It made it a little bit frustrating for me."

Mesirov, whose own way-back background is in mathematics at Penn and Brandeis — she learned high-performance computing in her six years at the Institute for Defense Analyses, a think tank delving into complex cryptology problems on a Cray — wanted to get back into research. "[Eric] approached me and asked me if I would be interested in leaving IBM and basically running bioinformatics and computational biology at the center. I didn't have to think twice about it." She came aboard in June '97, just after the sequencing center was set up and right before the major scale-up began: "It was a perfect time."

It was the perfect time for Whitehead to get Mesirov, too. The thrust was "on getting base pairs out the door," she says, but there were just three or four informatics people in the whole genome center then. She now oversees a staff of some 70 people, including the IT group that manages the center's approximately 130-processor Alpha farm.

Staffing wasn't the only thing that needed to be ramped up, says Michael Zody, chief technologist of sequencing informatics. When he started in March of 1997 at the center, then based in what was once part Genzyme warehouse, part beer distributor, there were 12 ABI 377 sequencers spewing out a mighty two gigabytes or so in sequence reads each day off slab gels. "But only 200 to 400 megabytes of that was usable data," Zody says. The center, which has grown to occupy space in two buildings on the MIT campus, now sees 30 gigabytes of data roll off its 160 ABI 3700 capillary machines daily, and the new 3730s are gearing up for even more.

Computer security was also an issue that had to be addressed in a big way. KM Peterson, manager of computer systems operations, says before Compaq's Alpha was chosen as the center's platform, there were various systems scattered around. "We used to have a lot of Linux here," he says. "It was a big target for hackers." Three or four years ago, when Peterson's IT group was so understaffed that he was just one of two people overseeing the whole system, a series of major hacks sent the center into a tailspin, bringing down the web server and erasing data.

"They ground us to a halt," Mesirov recalls. "It took us a number of weeks to recover from that." Security is still her major concern: "I do worry about the integrity and security of our data. … Our data is our lifeblood."

Peterson tracked down the hackers, who he thinks had no idea what genomics or the Whitehead was. Because the center uses MIT's network, "They're able to say, 'I broke into MIT,'" Peterson says, believing the hacks were about bragging rights rather than a targeted attempt to compromise the genome center.

That's not to say he didn't worry about security. Since then, Peterson has joined the MIT network security team to keep aware of breaches, and after temporarily solving the problem by fire-bricking the server, has installed a full firewall for the center. Linux was ousted. Peterson stepped up the ritual backup, doing a full backup every other week and an incremental one each night. He makes sure a copy of the tape is sent off-site for added security. "Murphy is alive and well and spends a lot of time here," he quips.

Brian Gilman, group leader for medical and population genetics, jokes that his early days at the center were like something out of NASA. "You put all the pieces on a table and are like, 'Okay, I've got duct tape, a tube, some toothpaste, and some bubble gum. What can I do to find SNPs with that?'"

Curse of the Edgy

But the problem with being cutting edge is in what it takes to stay in front of the technology tidal wave behind you. "Computers are like clothes, right? They go out of fashion in two years," says Mesirov, who has to keep putting her bottle of water down so she can talk with both hands. Changes in research needs and algorithms happen even faster — and she has to keep up for all three of the center's programs.

"It's very hard having this broad view, being responsible for all the different programs," Mesirov says. On a tiny scale, it's like the problem she has juggling offices in both center buildings, a few minutes' drive apart — invariably, the paper she needs is where she isn't. "I have a severe data integration problem in my head," she jokes.

A significant part of her job involves keeping an ear to the ground so she knows what's coming and can position the center to be on top of it. There's nothing better than tapping other people, she says; even in her management style, she's loath to make a decision without consulting everyone who may be affected or have an opinion on the matter.

At the local level, Mesirov is continually in meetings with Whitehead folks, figuring out what's coming, what needs to be done, and where various projects are. Not only do the others' contacts in the field help, but their diverse backgrounds give each person a unique perspective on where the cutting edge lies. Among her bioinformatics group, just one person has a biology background; the rest came out of math, physics, engineering, computer science, signal processing, and cryptology.

More broadly, Mesirov stays in touch by hitting the conference circuit each year, attending enough meetings to keep her up to date. A member of at least six advisory boards of companies and nonprofits, she says the opportunity to visit those places and see "how other people do it … really pays off. I think it's really important to keep those external connections going."

Something Mesirov doesn't credit, but clearly relies on heavily, is her ability to focus further ahead than most people when she's planning projects. At a meeting to talk about progress on GenePattern, the gene expression software that will succeed Whitehead's GeneCluster 2, Mesirov's group brainstorms the functionalities that will be needed. Where her group sees a time-saving solution by keeping a particular function behind the scenes, Mesirov sees a possible gap. "But do we ever see a time when people will want to get prototypes from the Web?" she asks.

Part of that knack comes from experience. At a lunch meeting with sequencing informatics development director Toby Bloom, Mesirov hashes out the possibility of making certain data inaccessible after a period of disuse to free up resources. "Every time we say we never want to look at that data again, guess what," she says, poking at her salad. "We always do," answers Bloom.

Permanence worries Mesirov in a field that morphs as fast as genomics. "What we do here changes all the time," she says. In her group meeting, she's convincing her colleagues to put the effort in ahead of time to enable certain modules, whether or not they'll be added later. They're concerned that it'll take too long; the software's release date has already been set. "If we don't put the hooks in now, I don't want to refactor [later]," Mesirov says.

Knowing what the best technology and algorithms will be is only half of keeping up. Even with an enviable budget, money is still an issue for Mesirov's team. The center relies on grant funding for its projects ("you always have to budget for the unexpected," Mesirov notes), and when grants don't allow buying at the cutting edge, you have to be creative to stay ahead. Sometimes that means reworking funding if grants are flexible enough to allow that, or just hunting for spare cycles on existing infrastructure. In some cases, other centers can help out, as can vendors.

Compaq was particularly generous in providing access to its BioCluster for sequencing centers looking for more power during the human genome sequence race. "Jill and Eric Lander pointed out that they were short of computing cycles to finish the [draft]," says Ty Rabe, R&D manager for high-performance technical computing, who came to HP from Compaq. "We spend a lot of time with Jill and her staff trying to keep them up and running."

The open-source community has also been a boon to the genome center's crew. When the center lets its software loose for the public, open-source aficionados often pitch in and provide interfaces to languages or standards that Mesirov's team hasn't been able to address yet. That was one thing that helped with OmniGene, a middleware solution Brian Gilman headed up over the past year to integrate data so researchers using myriad interfaces could gain access.

Mesirov remembers the mouse assembly as an example of creative computing. Whitehead didn't have a large enough machine to assemble the genome, and the assembly code didn't lend itself to parallel computing. "We didn't have time to parallelize the code, and it would've been a humongous investment to buy a bigger machine," Mesirov says. "This team really worked themselves to the bone to reduce the memory requirements of the code" — to the point where the assembly could be done on a 32-gigabyte ES45.

Another dilemma came when the center switched from BAC-based sequencing to whole-genome shotgun, which produces far more files and directories than the BAC routine the file system was optimized for. "We didn't have time to rearchitect" the system to get around the restrictions on the number of files, Mesirov says. "So we kluged various solutions to that. … They worked, they got us through the human draft, they got us through mouse. But the fact is, they have been rather costly to support and maintain."

Gathering Steam

There are any number of upcoming projects that on their own could, conceivably, take up every last moment of Mesirov's time. And like the load-sharing system where she suspects no researcher opts for the low-priority job setting, each project seems to have highest importance. If there's anything that keeps Mesirov up at night, it's figuring out how to dole out her own intellectual resources to keep everything going at full speed.

One major expected change for the center is the choice of a new system to replace the Alpha. Though HP has vowed to support the Alpha for years to come and Rabe says users can expect a new generation of Alphas next year and an upgrade some 18 months after that, Mesirov says there's no point in staying with the system any longer than necessary when it's obvious that upgrades and support will dwindle.

HP, of course, hopes that its major customer will migrate happily to HP-UX. "We're moving a lot of the features that they like from Tru64 into HP-UX," Rabe says.

"We anticipate bringing in the HP-UX system to look at as a point of comparison," says Peterson, who says no one on his team has worked with the platform before.

Mesirov says she will certainly "talk to all the usual suspects" about the post-Alpha path. "We don't want to jump into anything precipitously. … I would feel really good if by the end of the year we had some notion," she says, but adds that part of the decision-making process relies on what HP does. "I'm not sure how realistically the details of the HP roadmap will have gelled by then. I don't want to rush it."

Peterson expects to keep using the Alpha system for the next couple of years, and says the coming transition has prompted a realization at the center: a heterogeneous compute environment, like a diversified financial portfolio, could be beneficial. In the future, he may use different vendors for specific projects rather than insisting on a uniform platform for everything.

To that end, the genome center could see the return of Linux. "We're thinking about supplementing the existing farm with maybe a Linux farm," Mesirov says. Her team is currently testing to determine if a new farm would be more cost-effective than an add-on.

Peterson envisions the farm as "Blade servers — rack-based Intel servers that run Linux." Still very aware of past problems with the platform, Peterson says if a Linux farm is brought in, it will likely not be used to store vital data.

Storage is, of course, a continual sore point for Mesirov. Her rule of thumb: "As soon as you put [new] storage online, it's full." The center has some 40 terabytes of online storage, and somewhere between 50 and 100 terabytes of data archived to tape, Peterson estimates.

One of Mesirov's main projects now revolves around storage: building a better trace repository now that NCBI requires genome centers to deposit trace images. The center has always kept its traces, about five or six terabytes generated per year, in the file system. It's a clunky approach: Zody acknowledges that it can be easier to download the Whitehead's own data back from NCBI than to manage it properly at the center.

The file system was used in the first place so that the most-requested trace images would be kept on top, while less popular traces sifted down to the bottom — much the way Google displays search results. That means, though, that the genome center uses a lot more space for the trace files than it has to. The solution will be a new database that indexes the traces; the files will stay in the file system but won't move around, and searching will occur through the master database.

It's a sleeker way to manage the data, and it will free up more storage resources at the genome center. The timeline is dictated by practical issues. Toby Bloom isn't quite sure when the database will be online, but, she says, "Chimp starts in January, and it needs several terabytes. We don't have it, so this will be up and ready by January."
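For readers who want a concrete picture of the index-over-files idea described above, here is a minimal sketch. It is not the center's implementation: the article does not say which database or schema is planned, so SQLite, the table layout, and every function name and path below are assumptions made purely for illustration. The point is only that trace files stay put on disk while lookups go through a small master index.

```python
import sqlite3


def open_index(path=":memory:"):
    """Open (or create) a hypothetical master index of trace files."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS traces (
               trace_id  TEXT PRIMARY KEY,
               project   TEXT NOT NULL,
               file_path TEXT NOT NULL
           )"""
    )
    # A secondary index so per-project searches do not scan every row.
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_traces_project ON traces(project)"
    )
    return conn


def register_trace(conn, trace_id, project, file_path):
    """Record where a trace lives; the file itself is never moved."""
    conn.execute(
        "INSERT OR REPLACE INTO traces VALUES (?, ?, ?)",
        (trace_id, project, file_path),
    )
    conn.commit()


def find_trace(conn, trace_id):
    """Return the stored path for a trace, or None if it is not indexed."""
    row = conn.execute(
        "SELECT file_path FROM traces WHERE trace_id = ?", (trace_id,)
    ).fetchone()
    return row[0] if row else None


def traces_for_project(conn, project):
    """List the trace IDs recorded for one project, sorted by ID."""
    rows = conn.execute(
        "SELECT trace_id FROM traces WHERE project = ? ORDER BY trace_id",
        (project,),
    )
    return [r[0] for r in rows]
```

In a scheme like this, a search touches only the index, so the files never need to be shuffled to keep popular traces easy to find.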

But by January, of course, everything could change. Just thinking about a five-year prediction for the center makes Mesirov laugh: "The technology cycle is so much shorter than that." When pressed, she says the center will become increasingly computational with even more genome analysis needs. In the ideal world, she'd be able to count on the continuance of Moore's law. "You know, more power, more storage, more cost efficiency, and in less space. That's my dream."
