Q&A: UCSD's Phillip Compeau Discusses Rosalind, a Problem-Oriented Bioinformatics Education Platform

US and Russian developers recently released an online bioinformatics education platform that guides users through a series of increasingly complex problems that address key concepts in biology and programming.

The platform, called Rosalind, after Rosalind Franklin, is currently in beta testing and includes just over 50 problems, but its developers hope to expand that to several hundred in the coming months.

The platform is the result of a collaboration between Phillip Compeau, a postdoc in the lab of Pavel Pevzner at the University of California, San Diego, and Nikolay Vyahhi, another student of Pevzner's based at St. Petersburg Academic University in Russia.

Rosalind is based on concepts from online problem-solving projects such as Project Euler, which focuses on mathematical problems, and Google Code Jam, which focuses on programming skills. It differs, however, in that its problems are designed specifically to teach users bioinformatics concepts. It's intended to appeal to biologists who want to develop programming skills as well as programmers who would like to gain some familiarity with molecular biology.

BioInform spoke to Compeau last week to learn more about the project and its goals.


What made you decide to take this problem-based approach to bioinformatics education?

My adviser is Pavel Pevzner at UC San Diego, and he's always tried to keep up with bioinformatics education projects. He helped found a bioinformatics education conference — RECOMB-BE — that's held annually. I'm interested in going into education myself, so I've been on a [Howard Hughes Medical Institute] grant working on some educational problems like writing a chapter for a textbook and a few other things.

I was thinking about how interesting it would be if you took a problem-oriented approach to teaching bioinformatics. Project Euler is [a good model, but] you really need to have a big interest in mathematics to do the Project Euler problems. … So that didn't seem like it would appeal to everybody because you need a certain amount of mathematical sophistication to progress through the website.

So I thought what if we do this for bioinformatics because there are just so many problems that it seems like it would pair well with this progressively more difficult programming exercise format. So I talked to Pavel about it [last year] and he said [he had] another student — he has a second lab in St. Petersburg — who basically had this exact same idea, and so [he put me] in touch with him.

We decided to collaborate on this, and we've been working on it for the past four or five months close to full time.

For people who are interested in working through the problems, are there any particular prerequisites? Is there a certain baseline level of education – either on the biology side or the computational side – that you would expect people to have?

We've tried to create the site so that someone could be a complete novice in biology or programming. [For example] the first problem introduces DNA. It doesn't even get to the double helix, it just talks about what goes on in the cell nucleus and it briefly describes the chemical structure of DNA. That's relatively straightforward and something that most people probably already know from high school biology. And then [for] the actual programming exercise [it presents you with] a DNA string (As, Cs, Ts, and Gs) and you're just adding up the frequency of each symbol. That's something that could be done in Word. You don't even really need to program to do it.
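To make the first exercise concrete, here is a minimal Python sketch of counting each symbol in a DNA string. The function name and the sample string are illustrative assumptions, not taken from the Rosalind site.

```python
from collections import Counter


def count_nucleotides(dna: str) -> dict[str, int]:
    """Count how often each of A, C, G and T occurs in a DNA string.

    Hypothetical helper for illustration; symbols other than A, C, G, T
    are ignored, and lowercase input is accepted.
    """
    counts = Counter(dna.upper())
    return {base: counts.get(base, 0) for base in "ACGT"}


# Illustrative input (an assumption, not a dataset from the site).
print(count_nucleotides("AGCTTTTCATTCTGAC"))
# prints {'A': 3, 'C': 4, 'G': 2, 'T': 7}
```

As the interview notes, the same tally could be done by hand or in a word processor; the code only shows how small the first programming step is.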

If you're a complete novice to programming, we suggest that you do something like Codecademy, where they'll start you off if you've never programmed before and pair you with a language like Python and you'll go through and learn some basics. We're not saying that you can learn programming just through [our] site. You'll need another resource.

If you're a programmer who wants to learn some biology, there's a biology introduction to each problem. And they're threaded together in such a way that they increase in sophistication. The problem tree is created so that there are no cycles in the tree in terms of computational ideas. So it progresses downward in terms of complexity and we've tried to write the biology in such a way that it matches that as well.
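One way to picture a problem structure with no cycles is as a prerequisite graph that can be put into a valid solving order. The sketch below uses Python's standard `graphlib` module (Python 3.9 or later). The problem names and the prerequisite map are hypothetical, and nothing here is meant to describe how Rosalind is actually implemented.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical map: problem -> set of problems that must be solved first.
prerequisites = {
    "count_nucleotides": set(),
    "transcribe_rna": {"count_nucleotides"},
    "reverse_complement": {"count_nucleotides"},
    "translate_protein": {"transcribe_rna"},
}


def problem_order(prereqs):
    """Return an order in which every problem follows its prerequisites.

    Raises ValueError if the prerequisite map contains a cycle.
    """
    try:
        return list(TopologicalSorter(prereqs).static_order())
    except CycleError as err:
        # err.args[1] holds the list of nodes that form the cycle.
        raise ValueError(f"prerequisite cycle: {err.args[1]}") from err


print(problem_order(prerequisites))
```

If a new problem were added that depended on something which in turn depended on it, `problem_order` would raise an error instead of returning an order, which is the property "no cycles" guarantees.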

What is the timeline for the full launch and what are you looking to get out of the beta phase in terms of feedback?

There are about 550 users who have solved problems, so we're letting the beta users help us out. The first thing was just finding typos on the site. I'm the only native English speaker working on the project, so it can be tough with so many people creating content [but] that's been really helpful. We're [also] trying to transition to see what the user community can do for us. The first big step there was to say, 'Well, if you solved 80 percent of the published problems, you'll be able to see problems that are close to being published, like draft problems.' So that allows us to not necessarily test all the problems ourselves, which was a big time commitment.

It's moving us toward a system where we can have very small outlines of problems that we create, and we may actually have users implement them, and then test them, and then publish them. So we need a structure where users can do more with us before we launch the alpha project but I think we're really close because the published problems that we have are stable and the system that we have for publishing them is relatively stable as well.

How would you say that Rosalind improves upon current approaches to training bioinformaticians?

The obvious answer would be that there are actually a lot of places in the world where bioinformatics education is completely non-existent, so I think this was a big push of Pavel's Russian lab, because there were no courses on bioinformatics in Russia when that lab was founded.

Just from posting to a Russian blog we've had a huge [amount of interest]. Most of our 550 active users are Russian programmers, so we've gotten huge positive feedback from them because they're not familiar with this field at all. They're completely unfamiliar with genome sequencing. So I wasn't anticipating that as such a big market for the site — [people who have] no educational access at all to bioinformatics.

I [also] think that it's a cool resource for university professors to use as sort of automated homework testing, but it's good that there are places in the world where we can have Rosalind actually help people completely learn bioinformatics, or at least the framework of it.

So it sounds like you view this as more of a complement to existing training approaches rather than something that would replace current methods of bioinformatics education, perhaps depending on where your users are based.

Yeah, it depends on the person. We've tried to create it so that it could satisfy both types of people — somebody who is going through the site and trying to learn the field on their own, as well as somebody who's taking a university course and may either use it as a supplement for the course or the teacher may use it as a homework repository. We've already had a couple of actual courses that are doing that. We have a professor environment, where you can go through and choose a subset of exercises, and you might even weight certain exercises more than others, and then you invite your students to the project and they do the assignments for completion and you can see what percentage each student has completed.
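The weighted completion figure described above can be sketched in a few lines of Python. The exercise names, weights, and function name are assumptions made for illustration; the interview does not say how the professor environment computes its percentages.

```python
def completion_percent(weights, solved):
    """Return the weighted percentage of assigned exercises a student solved.

    weights: dict mapping exercise name -> weight chosen by the instructor.
    solved:  set of exercise names the student has completed.
    """
    total = sum(weights.values())
    if total == 0:
        return 0.0  # nothing assigned, so nothing to complete
    earned = sum(w for name, w in weights.items() if name in solved)
    return 100.0 * earned / total


# Hypothetical assignment in which one exercise counts double.
assignment = {"dna": 1, "rna": 1, "revc": 2}
print(completion_percent(assignment, {"dna", "revc"}))  # prints 75.0
```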

Will Rosalind eventually have some sort of certification or other proof that participants have gotten to a certain level within the program?

We already keep track of that, in terms of experience. And we're trying to work on improving our badge system. So we're sort of ‘gamifying’ the website, where you have different levels and tiers and badges and so forth. But what we have right now is pretty beta in that respect.

Over the longer term, how do you intend to measure Rosalind's success? Is there a way of tracking how participants in the program make their way through the bioinformatics field?

You know, we haven't thought about that. Our goal in terms of more traditional education — and we talked about this at the [RECOMB] bioinformatics education conference this year — is that education is going to have to become a lot more scalable than it is at the moment, because education costs are rising so quickly. And when you have these massive online courses like Coursera and Udacity and so on, it forces the universities, in turn, to scale what they're doing and create at least a subset of their product that's scaled and cheaper.

So our hope is that maybe a site like Rosalind can alleviate some of this pressure by, first off, eliminating the need to go in and grade. If you created a course for 10,000 students, actually grading the assignments becomes a huge struggle, so it would be a help to have automated homework checking.

Secondly, we're thinking of pairing users or allowing people to collaborate through the project who have maybe similar backgrounds or had similar errors to previous problems, so they can work together. Because if you have a large course, another big hurdle is being able to provide one-on-one instruction to students who need it. So that would theoretically reduce the number of teaching assistants you need.

The way we intend to test that out is by having a trial course at UCSD in the spring for 100 students, and we'll use Rosalind as a resource for that and we'll have fewer TAs than you would have in other bioinformatics courses, where you may have multiple TAs for a small course.

Do you have a sense of how this type of training would play out in the job market?

I don't know and I think we need insight from the private sector with respect to that. If you look at the site, a lot of our questions now are very academic, and it would be nice to have contributed problems that are more practical. We're looking at creating another bank of problems now that are more practical in that it's go to this website and do this task and return what you've done, so you can see the different formatting of files and so on. And those will be introductory problems.

But the hope would be that people outside of academia appreciate the project and can help us out with that. We've got a tool where you can suggest a problem, so we're hopeful that people who enjoy the project will submit their own problems.

