As new methods for analyzing textual data continue to emerge, a new company and an academic venture have each taken steps to tailor that growing toolkit to the varied needs of the biomedical community.
Seeking to fill a business-to-business niche, former OmniViz principals Jeffrey Saffer and Vicki Burnett recently founded SciWit, a Boulder, Colo.-based startup that offers customized information-mining solutions based on a range of methods, including natural language processing, statistical analysis, and parsing techniques.
Meanwhile, the UK’s National Center for Text Mining at the University of Manchester is taking the open access route to collaboratively create a software toolbox for text miners from publicly available tools.
Both efforts exemplify a trend that has emerged within the text-mining community in recent years: a growing awareness that there is no single out-of-the-box package for handling every text-mining task, and that the best approach is often a combination of different tools tailored to fit a particular problem.
SciWit’s Saffer told BioInform that the company takes a consulting-based approach in which it first spends time with a client and then puts together a computationally layered text-mining process customized for the client’s needs.
“What we have as a fundamental philosophy is to really understand the problem and create the tools that solve that problem,” he said. “I don’t like being boxed in to say, ‘This is a tool kit we have.’”
Jun'ichi Tsujii, NaCTeM’s scientific director, said that SciWit appears to be part of a new trend in which text-mining companies are moving toward becoming solution providers, instead of selling a fixed set of tools.
Tsujii, who also has a joint appointment at the University of Manchester and the University of Tokyo, said that this is in line with demand from life science customers, and that NaCTeM is working with several pharmaceutical companies that choose to “combine components provided by companies or provided by us or other research groups to then construct their own system.”
SciWit is targeting pharmaceutical firms for its customized offering, as well as biotech and chemical companies, consulting firms, and other information-analytics providers who might want to embed some of the company’s customized algorithms into their own solutions. Initial customers include software and service providers, Saffer said.
Saffer said that a typical SciWit project might be to help detect biomarkers from a wealth of information such as gene-expression or proteomics data. For the first phase of the company’s development, however, the emphasis is on text analytics based on quantitative linguistics.
Burnett and Saffer are both former bench scientists from Pacific Northwest National Laboratory and previously founded data-visualization company OmniViz as a PNNL spin-out. They left six months after OmniViz was acquired by life science software firm BioWisdom last year. [BioInform 02-02-07]
The firm is privately held and has seven employees, including four PhD-level scientists. The company considers its familiarity with scientific problem solving an advantage, said Burnett. “One of the reasons we can do this is because of our history of being those end-users.”
Saffer, who describes himself as “a molecular biologist who became an informaticist out of necessity,” said that as he pored over newly sequenced microbial genomes while head of the molecular biosciences department at PNNL, he felt stymied by the task of needing to understand a few thousand new genes all at once.
“I was very uncomfortable that we were still doing data analysis with the statistical analysis of one fact at a time and missing the big picture,” he said. As part of his job, he said he acquainted himself with information-mining capabilities at PNNL, which were built mainly for the intelligence community. Out of that effort grew the technology behind OmniViz, and now SciWit.
The company’s services are based on four main technologies. These include TopicalitySleuth to find topical terms most relevant for a client in a given set of documents. It delivers a list of key themes in a document, quantifying their importance. EmergenceSleuth, meanwhile, is set up to discover both emerging and disappearing concepts in texts in order to track scientific trends in the literature; ConceptSleuth quantifies complex business analytics in a document collection; and MarkerSleuth is a pattern-recognition and visualization tool for numeric data.
These products are processes more than algorithms, explained Burnett. “We don’t mean one little equation; it is an entire approach, step by step, a quantitative approach, and in most steps an automated quantitative approach,” she said.
The firm also does custom algorithm development to integrate its solutions into a client’s workflow. As Burnett explained, a customer may already have natural language-processing capabilities and might tap SciWit to create the next step in its text-mining pipeline.
Competitors in the text-mining market include Linguamatics, Temis, and Connexor, but Saffer said that SciWit’s customization approach sets it apart from these firms, which sell pre-designed components. And unlike the OmniViz product, which enables exploratory analysis, the SciWit method gives “definitive, actionable answers,” he said.
Unlike the market positioning with OmniViz, SciWit’s target customers are not individual researchers but rather other information analytics businesses. The firm is seeking customers who say, “‘We can’t find a company that sells a widget to solve this,’” said Saffer, who described the firm as mainly a B-to-B provider helping others to fine-tune their computational problem-solving in text mining. “Customization is a very important part of what we do,” he said.
One ConceptSleuth customer is Michael Orlando, who runs Denver-based consulting firm Economic Advisors. In a former capacity he had, in his words, “a computationally intensive characterization problem,” which was part of a business analytics contract. It required quantifying textual content across a number of dimensions. “We were trying to assess perceptions of firms,” said Orlando, whose background is in engineering and economics.
Typically that is done manually, he said, developing rules and breaking down a text according to certain criteria. His company had applied basic content-analysis techniques with a team of people picking up on subtle patterns and scoring English-language documents along predefined dimensions. Rather than use human scorers, his firm wanted to bring the project to “commercially viable scale,” he said.
“We asked SciWit to teach our computers how to read … and I think they did an excellent job of that,” he said. “They came up with a much more precise way of getting at what we wanted in an automated fashion.”
What he found valuable was how SciWit formally reframed the company’s challenge for automated content analysis. His company saw its own value in developing, interpreting, and delivering recommendations based on analytics to its clients, so it outsourced this project because it did not wish to scale up its own computational-development arm. “Being the one able to technically code that up on a regular basis isn’t something we saw in our market space,” he said.
In this project, explained Saffer, the quantitative linguistic approach quantified the strength of broad concepts, for example brand awareness. The framework first involves figuring out “signatures for concepts,” which is an application of one algorithm on top of an anchor vocabulary. Then a second proprietary algorithm is used to perform an actual measurement.
Meeting Pharma’s and Biotech’s Needs?
Buoying SciWit’s business model is the fact that the biopharmaceutical industry has become much more comfortable outsourcing and partnering its text mining tasks, according to William Hayes, director of Library & Literature Informatics at Biogen Idec.
But SciWit will still likely have to convince these potential customers that it offers advantages beyond anything they are capable of doing on their own.
SciWit’s approach sounds “like a lot of approaches I have already seen that are fairly common in the text mining community,” Hayes told BioInform. For example, finding hidden concepts in text for which no reference vocabulary exists is currently feasible, he said.
“You can take a corpus and extract concepts without knowing anything a priori,” he said. What is required is looking at patterns and how often they are represented. “If two or three words show up following each other fairly often, that’s a concept you want to follow.”
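The pattern counting Hayes describes can be illustrated with a short Python sketch. It is only an illustration of the general idea, not any vendor's method: the function name `frequent_ngrams`, the `min_count` threshold, and the two toy sentences are all assumptions made for this example.

```python
import re
from collections import Counter

def frequent_ngrams(docs, n_values=(2, 3), min_count=2):
    """Count two- and three-word sequences across documents.

    Sequences seen at least `min_count` times are returned as
    candidate concepts. No reference vocabulary is used.
    """
    counts = Counter()
    for doc in docs:
        # Crude tokenizer: lowercase runs of letters and digits.
        tokens = re.findall(r"[a-z0-9]+", doc.lower())
        for n in n_values:
            for i in range(len(tokens) - n + 1):
                counts[" ".join(tokens[i:i + n])] += 1
    return {gram: c for gram, c in counts.items() if c >= min_count}

# Toy corpus, invented for illustration only.
docs = [
    "Estrogen receptor signaling drives tumor growth.",
    "Loss of estrogen receptor expression changes tumor growth.",
]
candidates = frequent_ngrams(docs)
print(candidates)
```

On this toy corpus only "estrogen receptor" and "tumor growth" recur, so they are the only candidates returned. A real corpus would also need stop-word handling and a statistical test to separate meaningful phrases from common filler.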
In addition, methods for determining the rate at which literature mentions are growing are well established, he said. “The technology is out there and a whole bunch of companies have it.”
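One simple way to measure how fast mentions are growing is a year-over-year rate. The sketch below assumes per-year mention counts for a term have already been tallied; the function name `yearly_growth` and the counts are hypothetical and are not drawn from any real literature query.

```python
def yearly_growth(mentions_by_year):
    """Return the fractional change in mentions from each year to the next.

    Years whose preceding year has zero mentions are skipped, because
    the relative growth is undefined in that case.
    """
    years = sorted(mentions_by_year)
    growth = {}
    for prev, cur in zip(years, years[1:]):
        before = mentions_by_year[prev]
        if before > 0:
            growth[cur] = (mentions_by_year[cur] - before) / before
    return growth

# Hypothetical counts of abstracts mentioning one term.
counts = {2004: 10, 2005: 15, 2006: 30}
growth = yearly_growth(counts)
print(growth)
```

Here mentions rise 50 percent in 2005 and 100 percent in 2006. A sustained positive rate would flag an emerging concept and a sustained negative rate a fading one.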
What has been difficult, he said, is noise.
For example, at his former employer, AstraZeneca, Hayes ran the text-mining initiative. In one project, he and his team sought to find new concepts in the literature about estrogen-sensitive tumors. However, they only wanted to find genes and proteins related to tumors that escape estrogen-sensitivity and become untreatable. One drawback of many text-mining algorithms, he said, is that they may yield proteins that are not related to estrogen sensitivity or insensitivity but are just being discussed in this context. “The trick is filtering out the noise,” he said.
Hayes said he believes that SciWit’s B-to-B positioning is surprising due to the principals’ background with OmniViz, which was geared toward end-users, but sees promise in the company’s highly focused niche market. “The business model makes a lot of sense,” he said. “I like the concept.”
However, it might be a challenge for the firm to get in the door with pharmaceutical companies, said Hayes. Even his department has a problem of “getting in front of people [within Biogen Idec] often enough and … letting them know of alternative workflows in order to get our technology taken up,” he said.
The Open Approach
On the other side of the Atlantic, meantime, the UK’s NaCTeM is taking a different approach to the challenges of developing effective text-mining workflows: It’s providing a public repository of integrated tools.
NaCTeM director Sophia Ananiadou said that the center, which provides text-mining services free of charge to members of higher-education institutions, is committed to open-source and open-access text mining.
“With open source you have more possibilities of selecting and integrating different tools,” she said. She said she believes open access allows greater flexibility than proprietary software, and better matches the varied needs of text mining in biology.
“If tools are open access, you can mix and match” — for example, using different taggers or parsers — “and get the best output,” she said. “If you don’t allow that, you are a bit stuck in one specific solution.”
One challenge lies in the text-mining approach chosen. “Initially text mining sees text as a bag of words with no structure of sentences,” said NaCTeM’s Tsujii. The dominant technology approach to text mining is still based on that view of a text (counting the frequency of specific nouns or co-occurrences of two words), he said.
So a search for protein interactions would first identify proteins that co-occur in sentences. “Lots of software just enumerates the names of proteins without any specific interaction among them,” he said. That leads to the problematic noise that text mining can face.
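A minimal Python sketch shows why sentence-level co-occurrence produces this kind of noise. The protein names (`PROTA` and so on), the sample text, and the function `cooccurring_pairs` are placeholders invented for illustration. The sentence splitter is deliberately naive.

```python
import re
from collections import Counter
from itertools import combinations

def cooccurring_pairs(text, names):
    """Count pairs of known names that appear in the same sentence.

    The sentence is treated as an unordered bag of words, so its
    grammar (including negation) is ignored.
    """
    pairs = Counter()
    # Naive split: break after ., ! or ? followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        words = set(re.findall(r"[A-Za-z0-9]+", sentence))
        found = sorted(name for name in names if name in words)
        for a, b in combinations(found, 2):
            pairs[(a, b)] += 1
    return pairs

# Placeholder names and text, not real biology.
text = (
    "PROTA binds PROTB in the nucleus. "
    "PROTC does not interact with PROTB. "
    "PROTD was also measured."
)
names = {"PROTA", "PROTB", "PROTC", "PROTD"}
pairs = cooccurring_pairs(text, names)
print(pairs)
```

The second sentence explicitly denies an interaction, yet the pair (PROTB, PROTC) is counted exactly like the genuine one. Separating the two cases requires looking at sentence structure, which the bag-of-words view discards.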
Deep parsing or full parsing technology is a new approach that gives more semantic-oriented information, said Tsujii. Recognizing this, NaCTeM has begun collaborating with several academic groups to develop software that performs semantic or deep parsing, which uncovers implicit sentence structure, he said.
And given the special requirements of the life-science community, targeted text mining is appropriate, he said. With 18 million abstracts in Medline, scientists who obtain noisy results in information extraction cannot check those results by hand, so text-mining tools must be sophisticated, he said. There is no one solution that fits all text mining needs in the life sciences, with its vastly differing subspecialties and ontologies, he said.
The Manchester center provides services and consulting in text mining, such as concept extraction and information extraction, for the entire UK academic community and it also maintains an inventory of “best breed” software, in addition to other software tools.
One of the major remits of the center is to coordinate resource-building and develop software tools, its scientists said. Because they believe in an approach of mixing and matching text-mining tools, they are trying to build a publicly available repository for resources and tools, including privately owned tools.
Tsujii agreed, saying that the market “is huge and the problem is huge as well, so we don’t really compete with specific companies. We want to coordinate all the efforts in this field to deliver the best services to individual users.”
For its part, NaCTeM offers tools with an eye to interoperability, for which workflow software is important. Examples include the Unstructured Information Management Architecture, or UIMA, formerly associated with IBM and now an open project that runs in OASIS and Apache, and protocols such as SOAP for XML-based message exchange. “We want to be able to select the most important tools for a specific task,” said Ananiadou. Users can mix and match the tools they need.
“Academics make things freely available, so the idea of UIMA is to expose your resources to the outside world, but companies are proprietary and may not wish to share,” said Ananiadou.