By Adrienne Burke
Gene Codes Forensics staffers have been consumed since September with the saddest software job they’ll ever face, and perhaps their proudest achievement.
Standing on a patch of asphalt on a temperate spring day, Howard Cash recites a funeral prayer with a dozen other people bowing their heads before 16 refrigerated tractor trailers parked on Manhattan’s East Side. Lined up side by side in two rows under a 60-foot-high white canopy and an enormous quilt of American flags, the trailers are adorned with flowers. At the end of the service, the mourners place lit lanterns before the door of each trailer. Inside are thousands upon thousands of fragments of human bodies recovered from the site of the World Trade Center.
Cash, president of the Ann Arbor, Mich., bioinformatics software company Gene Codes and its new wholly owned subsidiary Gene Codes Forensics, has attended one of these weekly ceremonies many a Friday afternoon since last autumn, when his company was recruited by New York’s Office of the Chief Medical Examiner to help solve the most horrific DNA pattern-matching problem imaginable.
Since September 11 last year, the ME’s office has been receiving remains from the World Trade Center disaster at a rate of about 100 “pieces” per day. Anything that resembles human bone or tissue, down to the size of a fingertip, is delivered by ambulance to the triage center at 30th Street and First Avenue where a DM number (Disaster Manhattan) is assigned and a sample is taken for DNA extraction.
Meanwhile, victims’ parents, siblings, and children have submitted cheek swabs, as well as personal effects of the deceased (toothbrushes, combs, razors, etc.), to the New York State Police. New York’s ultimate goal is that every bit of human remains recovered from the disaster site will be identified by DNA, matched to a relative and a personal effect, and returned to a grieving family. Even remains that are too damaged to yield enough readable short tandem repeats are being preserved in hopes that a mitochondrial DNA profile or even some identifying SNPs can be recovered.
Terror-stricken Data Collection
Faced with human remains so pulverized that they might have been ground together by mortar and pestle, and with family and other identifying data pouring in from 22 different sources, the ME’s office saw immediately that its standard methods for making crime-scene DNA identifications would be utterly inadequate. For the homicide and sex crimes cases that it is used to handling, the ME’s office conducts its own DNA extraction and employs the FBI’s CODIS (Combined DNA Index System) software for one-to-one pattern matching. Even for the crash of American Airlines Flight 587 in Queens, which came two months after the World Trade Center attacks and would otherwise have been the biggest disaster that the office ever handled, those methods would have been adequate.
To handle the unprecedented World Trade Center disaster, in which 2,824 perished, the office arranged to outsource DNA extraction and short-tandem-repeat profiling to Myriad Genetics in Salt Lake City and Bode Technology Group in Springfield, Va. And in late September, the office’s department of forensic biology, which had been using Gene Codes’ Sequencher software for mitochondrial DNA analysis, called Cash for computational help.
Department head Bob Shaler told Cash that he wanted five things from a software program: match DNA from individual remains with DNA from family members and victims’ personal effects; reunify separated pieces of individuals; track collected samples; maintain chains of custody for all submitted swabs and personal effects; and confirm the accuracy of the identifications with rigorous quality assurance tests. After all, there had been no preparation for this data-gathering project, and the donors as well as most of those collecting the data were in a state of shock at the time. Entry errors were a given.
To be sure, the computational challenge for Gene Codes would not be extraordinarily complex: write a program that detects matches within and among several fields of data.
But the significance of the program’s findings, which would trigger the release of remains to families for burial, meant that impeccably accurate output would be imperative. As Gene Codes Q/A specialist Amy Sutton notes, the demands of the project were different from creating a commercial software product. “It’s embarrassing if it crashes, but it’s acceptable,” she says. “What’s not acceptable is making the wrong match.”
Cash says that his philosophy going in was: “The bug that crashes the program is not that bad; a bug that makes it look like we made an ID when we didn’t is catastrophic.”
Indeed, Cash has approached this job, which he has called the most important moment of his professional life, as if it were a religious calling. Gene Codes Forensics has a $10 million contract with New York City, but Cash says he intends to bill only for his costs, which he estimates will fall in the range of $3 million. As a consequence, Gene Codes suffered its first unprofitable quarter in eight-and-a-half years.
A Crossword With Half the Hints
Mike Hennessey, Gene Codes’ former business development director who had coincidentally taken up residence in New York before September 11 and now occupies a desk in the department of forensic biology on Gene Codes’ behalf, offers an example of the sort of data-entry complications they were facing: “Say you came in and reported your husband lost in the World Trade Center, and, unbeknownst to you, your husband’s brother came in as well. You reported his name as Billy, and his brother reported him as William.” That hypothetical case would have created two ID numbers with two names for one missing person — a fact that would eventually create confusion when the same recovered tissue matched a personal effect linked to one name and ID number, and a DNA swab linked to another.
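The Billy-and-William problem Hennessey describes can be illustrated with a minimal Python sketch. It is not Gene Codes’ code: the nickname table, the record fields (`id`, `first`, `last`, `dob`), and the function names are all hypothetical, chosen only to show how two reports for one missing person might be flagged for a human to review.

```python
from collections import defaultdict

# Hypothetical nickname table; a real system would need a far larger one.
NICKNAMES = {"billy": "william", "bill": "william", "will": "william"}


def normalize_first_name(name):
    """Lower-case a first name and map known nicknames to a formal form."""
    key = name.strip().lower()
    return NICKNAMES.get(key, key)


def candidate_duplicates(reports):
    """Group report IDs that appear to describe the same missing person.

    Each report is a dict with the hypothetical keys 'id', 'first',
    'last', and 'dob'. Returns only groups holding more than one ID.
    """
    groups = defaultdict(list)
    for report in reports:
        key = (
            normalize_first_name(report["first"]),
            report["last"].strip().lower(),
            report["dob"],
        )
        groups[key].append(report["id"])
    return [ids for ids in groups.values() if len(ids) > 1]
```

A grouping like this only proposes candidates. Deciding that two ID numbers really belong to one person would still fall to someone working back through the paper trail.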
Imprecise record-taking caused additional confusion. Says Cash, “If someone asked, ‘What is your relationship to the victim?’ and the answer recorded was ‘Father,’ does that mean the victim is the father, or the person filing the report is the father of the victim?”
Mix-ups even occurred among personal effects. “People brought in [a toothbrush or a brush] and stapled a cover sheet to the bag, but by the time it got to the state police in Albany the cover sheet had become detached,” says Hennessey. “If you have a box of 10 bags and nine have the cover sheet stapled to them and there’s a loose cover sheet in the box, you know where it goes. But what happens if a couple of them came detached? Part of what we have to do is go back and work through the paper trail and say, OK, this one really belongs here and not there.”
Hennessey, who physically walked the entire data recovery process from Ground Zero to Albany, knows as well as anyone the opportunities for errors along that trail. “We were trying to identify where the material and the information for that material were split and then reunited,” he says. Drawing a flow chart to demonstrate, he explains: DNA extracted from a toothbrush by the New York State Police Forensic Investigation Center in Albany is sent to Myriad for short-tandem-repeat profiling. But it’s sent as an anonymous sample and only later reunited with the correct toothbrush and family identifier. Says Hennessey, “Wherever you have that happening you have potential for a mistake.”
No one is to blame; mistakes under such conditions were inevitable, he emphasizes. Just how many were there? “I will tell you that the number of cases where all the forms are filled out the same, everything is spelled correctly, no dates are transposed, and there is 100 percent concordance has got to be less than 20 percent.” Most, he says, are classified as mere “category one” problems — conflicts such as mistranscribed date of birth entries that need simply to be double-checked.
But in about 10 percent of cases, there are serious “category two” errors. For instance, a family might have turned in a victim’s toothbrush for DNA identification, but there’s only a razor with his name attached to it in the evidence locker in Albany. “That’s a problem. That’s a big problem,” Hennessey says.
He worked 22 hours over Easter weekend sorting out the personal effects for three particular cases. “It’s like doing a crossword puzzle where they only give you clues to the verticals,” he says.
Programming Under Extreme Conditions
With victims’ photos and biographies lining the walls of Cash’s Ann Arbor office as constant reminders of the gravity of their task, Gene Codes engineers toiled days, nights, weekends, and holidays through October and November to build a customized profile-matching software program. Writing code in C#, they followed an approach called Extreme Programming, or XP, in which developers work intensively in pairs: one writes, one monitors the effect on the broader application, and both test and review each line of code as they go. Cash even brought in XP programming guru Kent Beck to coach the team for three weeks.
Tom Kubit, a senior software engineer who joined the company while the project was underway, says the agile development approach is what has allowed Gene Codes to respond to weekly requests from the customer for functionality changes. This was key, Kubit says, because “this kind of application had not been attempted before. No one knew what they wanted out of it. [There had to be] a lot of flexibility in the process.” XP offered that: “You can go one direction one week and a completely different direction the other week,” he says.
Before anything is sent back to New York, Amy Sutton tests for those catastrophic bugs that haunt Cash. “My job is to ferret out all the defects in the program before it goes out the door,” she says. Aside from the more than 700 tests that are run automatically 10 to 15 times a day on the software, Sutton tests the program by mimicking a user and then, she says, doing “ridiculous things that no one in their right mind would ever try to make a software product do.” For instance, she’ll load a huge data file while trying to open up another application, or pull the plug from the socket while the computer is in the midst of accessing the database to see if it recovers gracefully.
Cash says his company has always had high standards for quality control, but “we’ve never done Q/A like this.” After creating the program to accommodate STR profiles along the same 13 loci plus gender that the CODIS program identifies, Gene Codes expanded M-FISys (pronounced “emphasis” and short for Mass Fatality Identification System) to include 16 loci plus gender.
To further ensure that matches made by the system are reliable, Cash says the minimum likelihood threshold for delivering a match is set at one times 10^10. In other words, Cash says, “We don’t want to see anything unless the likelihood of a true match is greater than one times 10^10. If we set it down to 10^9, maybe we’d find a few more pieces of this picture. But if we set it down to 10^2, you’d start putting pieces of the wrong people together.”
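The article does not say how M-FISys computes its likelihood figure. As a rough sketch of how a threshold like the one Cash describes could be applied, the Python below uses the standard product rule from forensic genetics: the expected frequency of each locus genotype is multiplied across loci, and the reciprocal is compared with the cutoff. The allele frequencies, locus labels, and function names are assumptions made for illustration.

```python
from math import prod

# "One times 10^10", the cutoff quoted by Cash.
THRESHOLD = 1e10


def genotype_frequency(genotype, allele_freqs):
    """Expected genotype frequency under Hardy-Weinberg assumptions.

    genotype is a pair of allele labels; allele_freqs maps each label
    to its (hypothetical) population frequency at this locus.
    """
    a, b = genotype
    if a == b:
        return allele_freqs[a] ** 2            # homozygous: p^2
    return 2 * allele_freqs[a] * allele_freqs[b]  # heterozygous: 2pq


def likelihood_ratio(profile, freqs_by_locus):
    """Reciprocal of the random match probability over the typed loci."""
    random_match_probability = prod(
        genotype_frequency(genotype, freqs_by_locus[locus])
        for locus, genotype in profile.items()
    )
    return 1.0 / random_match_probability


def reportable(profile, freqs_by_locus, threshold=THRESHOLD):
    """True only if the match clears the minimum likelihood threshold."""
    return likelihood_ratio(profile, freqs_by_locus) > threshold
```

If every allele had a hypothetical frequency of 0.1, each heterozygous locus would contribute a factor of 50. Five readable loci would then give about 3 x 10^8, short of the cutoff, while six would give about 1.6 x 10^10. That is one way to see why badly degraded remains with only a few readable loci may not produce a reportable match.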
From STRs to SNPs
Gene Codes delivered version 1.0 of M-FISys to the ME’s office on December 10, and then continued working at the same strenuous pace through the winter holidays and into the New Year on upgrades and new releases.
Cash, who walks through the halls of the medical examiner’s office exchanging greetings and patting backs like a long-time staffer, has flown from Detroit to New York every Friday for the past eight months to install weekly M-FISys upgrades and get feedback on the previous week’s work. In mid-May, the center was running the 23rd version of the program.
Just two weeks before the cleanup and search for remains at the WTC are to come to a halt, 1,062 individuals have been identified, some by fingerprint, dental work, or tattoo, but most by DNA ID using M-FISys. The triage center is still doing intake, sorting through bits and pieces of humans that continue to trickle in, now either from the WTC pit or from a Staten Island landfill where debris from the site is dumped and re-sifted. And the labs at Myriad, Bode, and elsewhere feed the ME’s office new batches of data every two weeks.
There are now short-tandem-repeat profiles or SNP and mitochondrial DNA data in the WTC disaster recovery database for more than 20,000 human remains. While it is likely that many of the 2,824 victims perished without leaving behind any trace, some mind-boggling matches made by the software help explain the 20,000 figure. For instance, in one case DNA from a personal effect matched up with nearly 200 separate remains.
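A one-to-many match of that kind, in which a single reference profile lines up with many separate remains, can be sketched in a few lines of Python. This is an illustration only, not the M-FISys algorithm: the data layout, the DM-style labels, and the `min_shared` parameter are assumptions, and genotypes are stored as sorted pairs of allele labels.

```python
def consistent(profile_a, profile_b, min_shared=1):
    """True if two STR profiles agree at every locus typed in both.

    Profiles map locus names to sorted allele pairs. Loci missing from
    either profile (for example, from degraded samples) are skipped,
    but at least min_shared loci must be present in both.
    """
    shared = profile_a.keys() & profile_b.keys()
    if len(shared) < min_shared:
        return False
    return all(profile_a[locus] == profile_b[locus] for locus in shared)


def remains_matching(reference, remains, min_shared=1):
    """Return the labels of all remains consistent with one reference.

    remains maps a label (hypothetically, a DM number) to a profile.
    """
    return [
        label
        for label, profile in remains.items()
        if consistent(reference, profile, min_shared)
    ]
```

Agreement on a handful of shared loci is only a first filter. A real system would also have to weigh how rare the shared genotypes are before treating fragments as pieces of one person, which is the role of the likelihood threshold Cash describes.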
By the one-year anniversary of the tragedy, the ME’s office expects to have made all the matches it can using short tandem repeats for body parts that remain unidentified. It will then move on to mitochondrial DNA analysis and, with the help of Orchid’s subsidiary GeneScreen and Celera Genomics, the ME’s office will also try a new approach to identifying remains by SNP analysis. Gene Codes programmers are now upgrading M-FISys to incorporate those additional search fields.
That such methods and technologies didn’t exist until recently, of course, raises a question: Is going to these lengths to identify human remains really for the best? Or has the task become a morbidly excessive exercise?
Sutton says she has decided that it’s an individual question, but adds, “I’m glad for the families that really do need that. And there are a lot of them out there.”