Those who were around for the founding of the nucleic acid sequence database recall the old days, and appreciate modern technology all the more.
By Adrienne J. Burke
After theoretical physicist Walter Goad died in November 2000, his wife Maxine donated his documents to the American Philosophical Society in Old City, Philadelphia. There, in a climate-controlled room that houses the science and technology archives begun by Ben Franklin in 1743, Goad’s lab notebooks, diplomas, and letters occupy six feet of shelf space. Among them is a memo on Los Alamos National Laboratory stationery dated May 9, 1980, that reads: “Monday, May 12 at 10:30 Steve Simon invites you for cake and coffee to celebrate 100,000 bases now in the DNA sequence library.”
Goad, who spent the first 15 years of his 40-year Department of Energy career at LANL developing thermonuclear weapons (one of his files is labeled “H-bomb memoranda”), later spearheaded Los Alamos’ effort to create a national repository of nucleic acid sequences. As an original member of LANL’s T-10 team — the Theoretical Biology and Biophysics Group formed by George Bell in 1974 — Goad began building the so-called Los Alamos Sequence Library that would, in October 1982, win a $2 million, five-year grant from the National Institute of General Medical Sciences, and be christened “GenBank, the Nucleic Acid Sequence Data Bank.”
Today, of course, GenBank, which is produced in collaboration with the DNA Data Bank of Japan and the EMBL Nucleotide Sequence Database in the UK, is the indispensable tool of molecular biologists worldwide. Some 40,000 users search or download GenBank’s 22 billion base pairs every day.
Compared to the yellowed paper and quill-penned notes stored at the Philosophical Society, Goad’s GenBank archive is hardly history. But his typewritten letters and ARPAnet correspondences that contain the early plans for a national nucleic acid database are poignant relics of a bygone era. The trivia contained there, as well as a few old-timers’ rusty recollections of the technologies, personalities, and politics of the day, show just how quickly genomics has grown up.
“There’s a great story in how this all got started,” says Christian Burks, who was the first “card-carrying biologist” to join Goad in February 1982. “George Bell and Walter Goad really championed the idea of doing a database. They were both nuclear physicists who had gone to LANL as part of the weapons effort, but during the ’70s they got interested in molecular biology. The fact that DNA sequencing was taking off was what interested them. … They had spent a lot of years collecting very large datasets — nuclear cross sections crucial to calculations around the weapons program — behind closed doors … thinking about what to do with data that you can’t store in notebooks.”
Not only did their efforts plant the first seeds of the Human Genome Project, but, Burks notes, they gave birth to the era of electronic scientific data publishing. “It’s totally fantastical that the database was started and now is high on the list of what biologists use day to day.”
Fran Lewitter held her first bioinformatics job from 1984 to 1987 as a GenBank staffer at Bolt, Beranek, and Newman, the Cambridge, Mass., computer consultancy that shared the first five-year NIGMS contract with LANL. As head of the Whitehead Institute’s biocomputing group, she now offers students a short history of GenBank and says new hires are incredulous when they see the printed compendium in her office from GenBank’s pre-Internet days. “It’s amazing how things have changed in the last 20 years in terms of accessibility and the kinds of information we can collect,” she says.
To be sure, the most feverish buildup of GenBank has occurred during the past decade, since sequencing centers became exponentially more prolific and the National Center for Biotechnology Information took over maintenance. At the end of LANL’s first five-year contract, the database contained just over 15 million base pairs and 14,000 entries. Over the next five years, when LANL worked with Intelligenetics, GenBank sextupled: in 1992 it held 101 million bases and 78,000 sequences. Under NCBI, GenBank has grown to 200 times that size. But few would deny Goad and his team the credit for having laid the foundations for modern biology’s most ubiquitous resource.
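For readers who want to check the arithmetic behind those growth figures, here is a rough back-of-envelope sketch. It uses only the rounded totals quoted in this article (15 million, 101 million, and 22 billion base pairs); the variable names are illustrative, and the results should not be read as official GenBank release statistics.

```python
# Rounded totals quoted in the article (base pairs); not official release figures.
bases_end_first_contract = 15_000_000    # "just over 15 million" after the first contract
bases_1992 = 101_000_000                 # total when NCBI took over in 1992
bases_today = 22_000_000_000             # "22 billion base pairs" cited for today

# Growth during the second five-year contract.
second_contract_growth = bases_1992 / bases_end_first_contract
print(f"Second contract growth: about {second_contract_growth:.1f}x")

# "200 times that size" applied to the 1992 total.
projected_from_200x = bases_1992 * 200
print(f"200 x the 1992 total: {projected_from_200x:,} bases")

# Ratio implied by the 22 billion figure.
implied_ratio = bases_today / bases_1992
print(f"Implied growth under NCBI: about {implied_ratio:.0f}x")
```

On these rounded inputs the second-contract growth comes out a little under sevenfold, and the 22 billion figure implies growth slightly above the "200 times" quoted, which is consistent with both numbers being approximations.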
One brainchild, many parents
Depending on who’s telling the story, credit for the actual idea of a national nucleic acid sequence bank does not lie exclusively with Walter Goad.
Some versions of history credit the late Margaret Dayhoff, a trailblazing computational biologist at Georgetown University and associate director of the private National Biomedical Research Foundation. Dayhoff’s Atlas of Protein Sequence and Structure was a predecessor to GenBank, and she had begun her own effort to collect DNA sequences around the same time that Goad began his.
By 1981, Dayhoff had compiled a 350-kilobase database of sequences longer than 500 nucleotides, and Goad held a 140-kilobase collection that included many shorter sequences. Word was spreading that such resources existed. In May 1980, Stanford’s Doug Brutlag sent Goad the basic repeated elements of two different cloned segments of Drosophila satellite DNAs and wrote in a letter, “I would appreciate very much if you could compare these two sequences with each other and also with themselves in order to detect internal homologies with your program.” Requesting a copy of Goad’s program in a machine-readable format, Brutlag wrote, “I have access to two types of computers here: a DEC 10 whose Fortran is somewhat archaic, and an IBM 370 with an optimizing compiler.”
The value of a national nucleic acid sequence resource was obvious, and when NIH published an RFP seeking contractors to develop and maintain one, Goad and Dayhoff went head to head for the job.
Christine Carrico, NIGMS project officer for the first five-year contract, recalls that both Goad and Dayhoff were judged competent to take on the task, but that the Los Alamos team “wrote a more responsive proposal” and “had more facilities at its disposal.” Goad had in fact considered partnering with Dayhoff early on, but an electronic note he got in 1980 from Richard Roberts of Cold Spring Harbor Laboratory hints at one reason a relationship never evolved. Wrote Roberts, “I agree with you that the present method of ‘manual data collection’ is a temporary one. I also would agree with you that a cooperative effort between you and Margaret would be the best of all possible worlds. Do you think this is a realistic possibility, given Margaret’s personality?”
Dayhoff, Carrico says, was “very highly respected, but had very definite ideas about how she wanted to do things. She was going to do it her way.” (One year after losing the contract to the LANL team, Dayhoff died at the age of 57.)
Nevertheless, Carrico attributes the master plan for a national databank to neither Goad nor Dayhoff, but says that GenBank was Elke Jordan’s brainchild. Jordan, a 30-year veteran of the NIH who retired this July from the post she had held since 1988 as deputy director of the National Human Genome Research Institute, was the NIGMS associate director for program activities in the early ’80s. She and NIGMS director Ruth Kirschstein convened a series of meetings among leading molecular biologists to talk about the need for a national DNA database, and then, when Kirschstein agreed to support the project, Jordan suggested that the young staffer in the office across the hall from hers implement it. It was Carrico’s first job after her postdoc.
In Europe, parallel efforts were under way. French researcher Richard Grantham had begun collecting sequence data, and in Heidelberg, EMBL’s Greg Hamm and Graham Cameron were doing the same. In a June 1980 letter, EMBL’s Ken Murray informed Goad that his lab had decided to establish a nucleotide sequence data library and was discussing the possibility of developing a “totally automatic nucleotide sequenator.”
Nearly a year later, George Bell wrote to bioinformatics pioneer Temple Smith that “LANL is developing a database and associated software for computer analysis of nucleic acid sequences with the expectation that a national center for such activities will be established here.” Bell offered Smith a one-year visiting staff appointment to assist users of the LANL database, computers, and software in defining and solving research problems.
Meanwhile, Howard Bilofsky names Columbia University immunochemist Elvin Kabat as one of the earliest proponents of a public nucleic acid sequence repository. As a Bolt, Beranek and Newman project manager, Bilofsky began working with Kabat in the early ’70s to manage the Kabat Immunoglobulin Protein Database, which he mailed for free to labs worldwide. Bilofsky, who now directs knowledge and information technologies and alliances for GlaxoSmithKline, calls those days his introduction to public access data and to the world of sequence data management. “It laid the foundation for my team at BBN to bid on the GenBank contract in 1982,” he says.
From @ to ATCG
In particular, it was his experience applying BBN’s statistical analysis software, Prophet System, to Kabat’s sequences in the early ’70s that Bilofsky says helped win his group favor with LANL.
Bilofsky had corresponded with Goad in 1979 about using Prophet for sequence analysis. So, when DOE rules prohibited Goad’s group from accepting an award from another government agency, BBN was poised to act as LANL’s commercial collaborator to win the NIH contract.
BBN, which had developed ARPAnet, the precursor to the Internet, and introduced the @ symbol for email addresses, had developed Prophet with funding from NIH’s division of research resources. It was extraordinarily powerful, Bilofsky recalls. “We weren’t using hierarchical or flat file systems. We had a lot of flexibility and powerful tools at our command that the rest of the world didn’t have.” Prophet is likely one of the “facilities” that Carrico says enabled LANL to win the GenBank contract.
Russell Doolittle, the UC San Diego evolutionary biologist who later served as a member of the GenBank advisory committee, recalls that the scientific community was appalled that BBN was awarded the job. “A lot of people were mortified because [BBN] didn’t know a thing about databasing of the scientific sort. The real contract should have gone to Los Alamos. BBN got the job because of these rules on government contracting,” says Doolittle.
“It was a melee at the time, but it all worked out at the end,” he concedes.
In September 1982, Elke Jordan and Carrico penned a letter to Science announcing that GenBank would be available to the public starting October 1. Goad told the press: “There are hundreds of researchers in the US and Europe sequencing DNA at a rate in excess of 500,000 bases a year. Our goal is to have all such sequences entered into the databank within three months of identification.”
As modest as it seems now, that goal turned out to be overly ambitious for Goad’s five-person staff. Burks’ description of the workflow makes it easy to see why: “We would go to the literature [and get] the sequence [from] the paper — a line of As, Ts, Cs, and Gs. We would tear it out or Xerox it and someone would sit down and type it into the database.”
Goad’s group collected and curated the data, and sent it by FTP to the BBN trio (Bilofsky, Lewitter, and technical director Wayne Rindone), who cleansed and published it. Bilofsky recalls, “Los Alamos would enter the data and periodically we would get a distribution from them that we would massage and [apply] tools for doing quality control.” Lewitter, who notes that BLAST didn’t come onto the scene until 1990, put the database onto magnetic tape and distributed it for analysis with the GCG package. But two years into the project, GenBank was backlogged 18 months. “As soon as there was this public database, people expected the data to be there, but it would take a year for the data to show up,” Burks recalls.
Minutes from one of the first exploratory meetings held in 1980 reveal how the community had underestimated its potential for generating sequence data: “Although all agreed that it would always be necessary to produce both tapes and hardcopy of the data, it was also agreed that establishment of a telecommunications computer network to interactively serve as wide a community of users as possible would speed and simplify collection and distribution.”
In 1984, Carrico reported that 2.8 million of the 3.2 million bases published since 1967 had been entered into GenBank. At the time about 120 individuals and universities received the database on magnetic tape and an average of five users per day accessed it online. “Two technological developments have increased the operating efficiency,” Carrico wrote. “LANL has switched work on the database to microcomputers, and LANL and BBN now use the DOD’s ARPAnet to transfer data between the two locations.” Operating at 25,000 baud, the entire database, with references and annotations, could be transferred in about 40 minutes.
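To put that transfer rate in perspective, the short sketch below works out how much data a 40-minute transfer at 25,000 baud would move. It assumes, for simplicity, that one baud carries one bit and that a byte is eight bits with no protocol overhead; it also assumes the database held the 2.8 million bases Carrico reported. Neither assumption comes from the original report, and the function name is purely illustrative.

```python
def payload_bytes(bits_per_second: int, minutes: int, bits_per_byte: int = 8) -> int:
    """Bytes moved in the given time, assuming no protocol overhead (a simplification)."""
    return bits_per_second * minutes * 60 // bits_per_byte


# Figures quoted in the article.
LINE_RATE_BAUD = 25_000      # treated here as 25,000 bits per second (assumption)
TRANSFER_MINUTES = 40
BASES_IN_GENBANK = 2_800_000  # bases entered as of Carrico's 1984 report

total_bytes = payload_bytes(LINE_RATE_BAUD, TRANSFER_MINUTES)
bytes_per_base = total_bytes / BASES_IN_GENBANK

print(f"Data moved in {TRANSFER_MINUTES} minutes: {total_bytes:,} bytes")
print(f"Roughly {bytes_per_base:.2f} bytes per base, including references and annotations")
```

Under these assumptions the whole database, annotations and all, comes to about 7.5 million bytes, or between two and three bytes of stored text for every base sequenced.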
Telenet and magnetic tape
From the time he had started developing the Los Alamos Sequence Library, Goad began talking with researchers in Europe about sharing data. When the GenBank award came through, the groundwork was laid. Goad told Carrico in September 1982 that at a workshop in Aspen, Colo., his team, BBN’s Rindone, and Greg Hamm had arrived at a plan for transferring data between LANL and EMBL via Telenet access to a VAX.
Because LANL’s bank of CDC 7600s used to manage the database weren’t easily Telenet accessible, Goad would put files up on the VAX for EMBL to peruse. “This will permit very close coordination of our collection efforts, even if the data itself is exchanged on magnetic tape,” he told Carrico.
“From early on there was the notion that we’d like to exchange data, and there was a formal sense from NIH that this was meant to be a global project,” Burks recalls. But the partnership between the Americans and Europeans was not without complications.
What features to include in a given entry was one point of contention. Lewitter recalls, “One time there were 10 or 15 of us who got stuck in Virginia in some motel trying to hash out with people from EMBL and Los Alamos what should be in the features table. The Los Alamos group was annotating and assigning keywords to each entry … but 50 percent of the keywords had one entry pointing to them. We had to do some work on re-indexing.”
Carrico, who acknowledges that “time tends to dull the bad memories,” says that there were plenty of disagreements in GenBank’s early days, not just among the collaborators, but also among users regarding conventions and protocols for entering data. In 1983, Carrico reported to Kirschstein that EMBL and GenBank were each applying their own standards and annotation to the data, but that the two banks were “moving toward an identical content with the frequent exchange of data tapes. … While the formats of the two banks are different, each possesses a program to reformat the data from one to the other so that does not present a problem.”
Ultimately, EMBL and GenBank worked out the kinks and by early 1984, the two teams were working to publish a hardcover compendium as a supplement to Nucleic Acids Research. The two-volume, 600-page book would cost about $75.
“The folks at EMBL deserve a lot of credit,” Bilofsky says. “GenBank has always gotten more visibility, but the folks at EMBL have done a remarkably good job and been very creative over the years, and that’s helped balance work in the US.”
Accession and succession
Another turning point came when NIH’s efforts to get the scientific journals to cooperate with GenBank paid off. Nucleic Acids Research editor Dieter Söll offered to forward Goad “clean” copy of sequences in papers accepted for publication, and eventually, all of the major journals agreed to do the same. “That was a big thing,” Lewitter says. “If you published a sequence you had to get an accession number to GenBank.”
To make data exchange more fluid, Carrico asked DARPA to permit university researchers to access GenBank through their ARPAnet nodes. “It turned out to be not as easy as it would have seemed, due to the levels of security you had to go through,” recalls Carrico, who didn’t even have a computer in her office at the time, “but the Internet then developed pretty quickly.”
Debates also ensued about whether GenBank staff should incorporate software tools into the database, or provide a bulletin board telling users where to access them. While everyone seemed to agree that more analysis capability was vital, some, such as Richard Roberts, argued that GenBank should emphasize retrieving, not processing, entries. In hindsight, Burks says that NIH’s sharp decision to restrict GenBank to working on the database and not tools was a mistake.
“I would have given more leeway to the effort to create tools, taken a more holistic approach,” Burks says. “I think it’s good to have an organic connection to the people who are using the database and creating it. Shipping it off and saying, ‘They’ll figure out how to use it,’ worked but it was also a bit awkward.”
That said, when asked how they would have acted if they knew then what they know now, Burks and Bilofsky have few regrets. Overall, Russ Doolittle calls the project “an enormous success.” The early GenBank team established something, he says, that now rivals the US Geological Survey in terms of scientific usefulness.
David Lipman, the first and only director of NCBI, which has managed GenBank since 1992, describes his ambitions as modest: to keep improving the database’s connectivity and curation. He also plans to expand its usefulness for protein family, model organism, and gene function research. Looking back on the early days, what might he have done differently? “I’m just impressed by the view that they had of the future.”
And now, says Lewitter, “people can’t understand what it would be like if they couldn’t sit down at their computer and search GenBank.”
Where the GenBank Generation Is Now
George Bell: nuclear physicist who in 1974 founded the Theoretical Biology and Biophysics group at Los Alamos where colleague and close friend Walter Goad started GenBank; a founder of the Center for Human Genome Studies in 1988. Died 2000.
Howard Bilofsky: theoretical and computational chemist who was introduced to sequence data management in 1974 through his work assisting Elvin Kabat in compiling a database of immunoglobulin protein sequences; first GenBank project manager at Bolt, Beranek, and Newman; credited with adding the “Gen” prefix to Wayne Rindone’s suggested “Bank” to coin the term GenBank; left BBN in 1990 to spend three years at EMBL establishing the European Bioinformatics Institute; joined SmithKline Beecham in 1993 to build bioinformatics strategy; now director of knowledge and information technology and alliances in GlaxoSmithKline’s R&D IT division in King of Prussia, Pa.
Fred Blattner: founder of early bioinformatics company DNAStar and University of Wisconsin assistant scientist in 1983 when appointed as third official curator of GenBank for lambda bacteriophages; currently professor of genetics and director of the Genome Center at the University of Wisconsin, and president of DNAStar.
Christian Burks: molecular biophysicist/biochemist who joined LANL’s Theoretical Division in February 1982 as a postdoc to help write the GenBank proposal to NIGMS; appointed GenBank PI in 1988 and became group leader for Theoretical Biology and Biophysics Division and program manager for Computational Biology; left LANL for CIO post at Exelixis in 1997; left Exelixis in March this year to become CSO for Affinium Pharmaceuticals in Toronto.
Christine Carrico: NIGMS project officer for first GenBank contract in 1982; wrote GenBank RFP and managed contract for first five years; left NIH in 1993; now executive officer of the American Society for Pharmacology and Experimental Therapeutics in Bethesda.
Margaret Dayhoff: research biochemist, professor at Georgetown University Medical Center and associate director of the National Biomedical Research Foundation; created the Atlas of Protein Sequence and Structure and an early online nucleic acid sequence database; lost bid to manage GenBank in 1982. Died 1983.
James Fickett: on original GenBank team, member of T-10 at LANL from 1980 to 1996; left LANL to become director of bioinformatics research at GlaxoSmithKline; left GSK in 2001 for position as global director of bioinformatics for AstraZeneca R&D Boston.
Walter Goad: nuclear physicist whose 40-year DOE career included 15 years in thermonuclear weapons development followed by a tenure in the T-10 group at LANL where he created GenBank and oversaw the staff that managed it for 10 years. Died 2000.
Elke Jordan: early proponent of GenBank at NIH; got ball rolling on GenBank RFP as associate director for program activity at NIGMS; in 1988 left NIGMS to become deputy director of NHGRI; retired in July to work part-time for the Foundation for the National Institutes of Health.
Elvin Kabat: Columbia University professor of immunochemistry who studied immunoglobulin protein sequences, first written out on pieces of taped-together paper spread on the floor, later in an electronic database; published the Kabat Immunoglobulin Protein Database; early advisor to GenBank, and one of first two official GenBank curators appointed in 1983. Died 2000.
Minoru Kanehisa: Japanese physicist who worked on Walter Goad’s Los Alamos Sequence Analysis System in 1980 and helped create GenBank; left LANL in 1983 for NCI; left US in 1987 to become professor in the Institute for Chemical Research, Kyoto University; since April 2001 has been directing the Bioinformatics Center of Kyoto University.
Fran Lewitter: Harvard genetic epidemiologist who made a career change in 1984 to bioinformatics; joined Bolt, Beranek, and Newman to work on GenBank; responsible for FTP of data from Los Alamos, distribution on magnetic tape, and later on floppy disks; experimented in 1986 with putting GenBank on CD-ROM. Left BBN in 1990; now head of Biocomputing Group at the Whitehead Institute.
Wayne Rindone: technical director of GenBank at BBN from 1982 to 1987; from 1992 to 1998 worked in Harvard’s Biological Laboratories in the Department of Cellular and Molecular Biology, as technical director of the FlyBase database; since 1998 has been a senior data programmer/analyst in the Lipper Center for Computational Genetics in the George Church laboratory in the Harvard Medical School Department of Genetics.
Richard Roberts: molecular biologist who won the Nobel Prize in Physiology or Medicine in 1993 for the discovery of split genes; in 1983, while at Cold Spring Harbor Laboratory, appointed one of first two official GenBank curators for adenovirus serotype-2; in 1992 left CSHL to become joint research director for New England Biolabs in Boston.
Temple Smith: nuclear physicist and biomedical engineer invited in 1981 by George Bell to take a one-year visiting staff position at LANL to help establish nucleic acid sequence database; left Northern Michigan University in 1982 for LANL; now director of Biomolecular Engineering Research Center at Boston University.