Nicholas Tsinoremas is the director of the department of informatics at Scripps Florida, an expansion of the Scripps Research Institute that was launched in January 2004. The institute currently houses around 170 employees in a temporary facility, a 41,000-square-foot building located on the Jupiter campus of Florida Atlantic University. Scripps is building a second 33,000-square-foot building as its permanent headquarters that it expects to open this fall.
Tsinoremas joined Scripps from the Rosetta Inpharmatics group at Merck, where he served as director of computational genomics and genomic discovery. BioInform caught up with him recently to discuss the advantages and challenges of building a research informatics infrastructure from the ground up.
It sounds like you have a unique opportunity in setting up the informatics group at Scripps Florida, since you're starting from a blank slate.
We're trying to learn from the things of the past. We know what hasn't worked in the past, so we know what not to try.
Any specific examples of what hasn't worked in the past that you're going to avoid?
Well, for example, we all want some kind of an integrated environment, but in the past, all these systems were developed at different times, in different periods, with different scope. It made it very hard, actually, to try to have a more comprehensive view. But now we're starting with a new institute with no legacy applications, so we have a chance to not repeat the same mistakes, and to be able to plan ahead, rather than just seeing what is the immediate need.
So in this case, we know we will bring the cheminformatics and the lead discovery informatics with the genomics information together. We've been trying to do this for a long time, but then, most of the time, we stumbled on legacy applications and legacy systems. In this case, we're trying to plan ahead and make sure that we can seamlessly change the data, we can seamlessly analyze data. Well, at least we are trying.
So it sounds like, especially in terms of integrating these silos of genomic and chemical data, that you have an ideal situation in building all of this from scratch. How well is that working?
So far, in our view, it is working pretty well. We are still in the process of trying to make the integration, and I think it's so far working very well. The challenge is going to be in the future, when a lot of the data is coming in from both sides, and we see whether the informatics component will keep up. We believe that it will, but I want to be very up front.
Actually, we are building an infrastructure because we are part of the [Molecular Libraries Screening Centers Network] for NIH. So here, we have one organization, and it's part of the mentality of the organization. We have one organization that is responsible for bringing the genomic information, the biological information, and the chemical information together. Of course, there are different people who do this, but at least we adhere to the same practices, to the same infrastructure, and try to use tools that are compatible. We try to make sure that we can easily exchange data, [we] try to have standardized ontologies. Because this is something that has been part of the legacy in the past, where people are using chemical information using one ontology, and biological information using different types of terms, if you will, and then trying to actually correlate those becomes a big project by itself.
What are you doing in that case? Have you written a mapping ontology between those two domains?
Yes, we have a mapping ontology and of course we try to leverage as much as possible from the public domain. So we are not trying to reinvent the wheel, and that's something that's also the philosophy of the group. We will only do what is needed on top and above all that is available. We're definitely going to integrate it intelligently and in a very scientific way so we can get the answers that we want. But we are not going to be redeveloping existing tools.
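To make the idea of a mapping ontology concrete, here is a minimal sketch, not the group's actual system. It assumes each domain's vocabulary has already been tied to a shared concept identifier; the function name `build_crosswalk`, the terms, and the `concept:N` identifiers are all hypothetical and used only for illustration.

```python
from collections import defaultdict


def build_crosswalk(chem_to_concept, bio_to_concept):
    """Map each chemistry-side term to the biology-side terms sharing its concept."""
    # Invert the biology vocabulary: concept ID -> set of biology terms.
    concept_to_bio = defaultdict(set)
    for term, concept in bio_to_concept.items():
        concept_to_bio[concept].add(term)
    # For every chemistry term, look up the biology terms with the same concept.
    return {
        term: sorted(concept_to_bio.get(concept, ()))
        for term, concept in chem_to_concept.items()
    }


# Hypothetical vocabularies; real ones would come from public-domain ontologies.
chem_terms = {"target: EGFR kinase": "concept:1", "target: unknown": "concept:9"}
bio_terms = {
    "EGFR": "concept:1",
    "epidermal growth factor receptor": "concept:1",
    "TP53": "concept:2",
}
crosswalk = build_crosswalk(chem_terms, bio_terms)
```

In this sketch a chemistry term with no biology counterpart maps to an empty list, which makes gaps in the mapping easy to spot.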
Can you provide an overview of your group and its goals?
We have actually five different domains within the group. One is high-performance computing, which is basically all the infrastructure: the databases, the clusters, the processing, and so forth. So this is the infrastructure for everybody.
Then we have what we call the scientific software engineering group, and these are the people who are writing the large systems like LIMS, scientific management databases. So these are people who have quite a lot of experience in scientific data management and writing database applications, and database design.
And then we have three other groups of what I would call analysts. These are people who have PhDs in different domains. So one of the groups is what I would call computational biology; they have very strong molecular biology backgrounds, [they are] involved with genome annotations, [they] understand microarray analysis. And then we have another group that is the counterpart in the chemical space, the cheminformatics people. They understand how to treat chemical information. And again they have the analytical perspective of how to do structure-activity relationship analysis, how you can write predictive models for toxicology, for drug metabolism properties of the chemical compounds. So these are the people who will look at the high-throughput screening data and try to make sense and try to figure out what to do as a next step.
Then we have another group that is data mining and statistical analysis. So this is where we have new algorithm development and so on. In this group, we are very active in whole-genome association studies, so we definitely need some statistical geneticists who are coming from that group, but we also need the computational biologists, and we also need the software engineers to actually write the system of where all the genotypes are going to go. So the projects span through whatever expertise we need for a particular project to be completed.
As an example, one of the projects that we're pursuing now is a collaboration with the University of Miami, and we're trying to identify the genes that are underlying lupus, a specific subtype of lupus, the SLA. So this is an ongoing project and it involves our genotyping core as well. We are using Affymetrix chips (the 500K chips, but we're using half of the chip, 250,000) and then we're genotyping cases and controls. And of course, all this data has to be managed, we have to be able to integrate the genotypes with the phenotypes that we get from our collaborators, and then all the data has to go into a database, they have to be analyzed, they have to be mapped onto genomic information to find areas of the genome where the genes are. So this is a project that spans at least three of the areas of expertise that we have.
We are trying to do it in a comprehensive way, and we're trying to create an end-to-end system that we can use for the next idea as well.
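The integration step described here, joining genotypes with collaborator-supplied phenotypes, can be sketched roughly as follows. This is an illustrative assumption about the data shapes, not the institute's pipeline: the function `tally_by_status`, the sample and SNP identifiers, and the `AA`/`AB`/`BB` call encoding are all hypothetical.

```python
from collections import Counter, defaultdict


def tally_by_status(genotypes, phenotypes):
    """Count genotype calls per SNP, split by case/control status.

    genotypes:  {sample_id: {snp_id: call}}
    phenotypes: {sample_id: "case" or "control"}
    Samples without a usable phenotype record are skipped.
    """
    counts = defaultdict(lambda: {"case": Counter(), "control": Counter()})
    for sample_id, calls in genotypes.items():
        status = phenotypes.get(sample_id)
        if status not in ("case", "control"):
            continue  # no phenotype to join against
        for snp_id, call in calls.items():
            counts[snp_id][status][call] += 1
    return dict(counts)


# Hypothetical toy data: four genotyped samples, three with phenotypes.
genotypes = {
    "S1": {"rs1": "AA"},
    "S2": {"rs1": "AB"},
    "S3": {"rs1": "AA"},
    "S4": {"rs1": "BB"},
}
phenotypes = {"S1": "case", "S2": "control", "S3": "case"}
tallies = tally_by_status(genotypes, phenotypes)
```

Per-SNP tables of this kind are the usual input to an association test; the statistical analysis itself is outside the scope of this sketch.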
How easy is it to automate an informatics workflow like that? Or is it something that will always require hands-on manipulation from step to step?
There are certain steps that can be automated. So, for example, the QA of the data and loading the data in the database: that's something that we believe can be automated and we are in the process of doing that. The analysis part, of course, when you deal with new types of data, as in this case, initially it has to be manual. But over time, it will be automated as well, because we will gain more knowledge of how we deal with 250,000 genotypes. At this point, we are in the process of trying to understand how we are going to analyze the data. Some of the analysis will always require some modifications, but we're trying to automate as much as possible, and with new data coming in, new analysis will take some time to be automated and keep moving. There is this kind of iteration that one has to go through.
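As a rough illustration of the kind of step that lends itself to automation, the sketch below runs a simple QA check and loads passing samples into a database. It is a minimal example under stated assumptions: the per-sample call-rate check, the 0.95 default threshold, the `NC` no-call code, the table layout, and the function names are all hypothetical, and SQLite stands in for whatever database is actually used.

```python
import sqlite3

MIN_CALL_RATE = 0.95  # hypothetical QA threshold
VALID_CALLS = {"AA", "AB", "BB"}  # assumed encoding; anything else is a no-call


def call_rate(calls):
    """Fraction of a sample's genotype calls that are valid."""
    if not calls:
        return 0.0
    return sum(1 for c in calls.values() if c in VALID_CALLS) / len(calls)


def qa_and_load(conn, samples, min_call_rate=MIN_CALL_RATE):
    """Load samples that pass QA; return the IDs of samples that failed."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS genotype ("
        "sample_id TEXT, snp_id TEXT, genotype_call TEXT, "
        "PRIMARY KEY (sample_id, snp_id))"
    )
    failed = []
    for sample_id, calls in samples.items():
        if call_rate(calls) < min_call_rate:
            failed.append(sample_id)  # hold back for manual review
            continue
        conn.executemany(
            "INSERT OR REPLACE INTO genotype VALUES (?, ?, ?)",
            [(sample_id, snp, call) for snp, call in calls.items()],
        )
    conn.commit()
    return failed


# Hypothetical toy run against an in-memory database.
conn = sqlite3.connect(":memory:")
samples = {
    "S1": {"rs1": "AA", "rs2": "AB", "rs3": "BB", "rs4": "AA"},
    "S2": {"rs1": "NC", "rs2": "AB", "rs3": "NC", "rs4": "AA"},
}
failed = qa_and_load(conn, samples, min_call_rate=0.75)
```

Returning the failed sample IDs rather than silently dropping them keeps a manual review step in the loop, in line with the iteration between manual and automated work described above.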
How many people do you have working over these five areas that you mentioned?
About 10 to 12 people currently, but we just started about a year ago. For the whole of Scripps Florida, so far, we have about 170 to 180 people. But the plan is over the next three to four years to grow to 500 to 600 people.
What about the informatics group? How large is that expected to get?
Depending on the projects, I would say about 20 to 30 people over the next three to five years. My experience in the past is that sometimes when you have a lot more people, it's actually much harder to manage the projects. Plus, all the people that we have currently here have at least five to seven years' experience, and that's actually something very important. If you don't have the experience, you tend to repeat the same mistakes. So most of us have been in the space for quite a bit of time, so that's why assembling a very experienced group is very important.
Now we can hire actually more junior [people], but this is a field that never had a very formal education. It's a very young field. And of course we're trying to standardize all the processes that we are using in house. That's very important.
What are you developing on the informatics side for the molecular screening center you mentioned?
We are actually building a whole infrastructure that pretty much automates everything that comes off the robots that we have (we have a Kalypsys robot) all the way through to the data analysis part. So there, it involves scientific software engineering, it involves cheminformatics people, and so on, so we're trying to have an end-to-end system to manage plates, the database, doing QA, all of that part. And again, we're not reinventing the wheel.
Is this something that you would share with the other screening centers?
Not necessarily. We're trying to share as much as possible in terms of tools and so forth, but each center has to decide their own system. But we're trying to share as much information and as many tools.
What is the breakdown between what you have to develop in house versus what's available either through commercial tools or public domain tools?
That changes over time and it depends on the project. So, for example, for a genetics project, it's probably 40 percent we find it out there, and the rest of it is for us. For a cheminformatics project, it's probably 60 percent we find from a commercial vendor and so forth, and probably 30 to 40 percent we put on top. But this 30 to 40 percent is a lot; it's a big project, but at least we get a good start.
So that's still a lot of work.
There's still a lot of work, but, for example, we are using quite a bit of the MDL infrastructure, and we are trying to integrate this with other tools and with our processes, and of course that's a big job as well.
Do you find that commercial cheminformatics tools work well with other tools?
Well, one of the ways we selected MDL, for example, is because they tend to be closed, actually, but one of our mandates when we look at something is how open it is and how we'll be able to integrate it with other tools, because that's actually part of our evaluation. If it doesn't do that, it's going to be very hard for us to justify a tool that's not at least open to us to be able to integrate it with other tools, public domain tools or other tools.
So I guess it's safe to assume that everything you have in place already can be integrated with other tools.
Absolutely. At least we know how to do it. Otherwise, we won't even consider it.
What kind of hardware infrastructure has your scientific IT team put in place?
We have a cluster of Sun V20s. We have about 200 CPUs, about 100 nodes, and they have the new AMD single-core chips. The cluster has been operational since the beginning of the summer.
What challenges have you encountered in putting this group together?
The largest challenge is that we are a very young institute, but we also want things to happen fast. We need everything at once, or as much as possible at once. And of course that's a challenge.
How have you been able to handle that so far?
So far, by being able to have a very experienced team, being able to leverage tools that exist out there, and by being able to be smart and integrate the proper tools together. So this, I would have to say, is the largest challenge that we have currently.