CHICAGO (GenomeWeb) – The Stockholm-based Science for Life Laboratory, or SciLifeLab, built the National Genomics Infrastructure technology platform in Sweden. Now, it is looking to support bioinformatics globally.
"We're kind of like a core sequencing lab, but we work for all of Sweden's research groups," explained Philip Ewels, a bioinformatician who serves as deputy head of facility for SciLifeLab. SciLifeLab, a collaboration in molecular biosciences between Stockholm University, Karolinska Institutet, KTH Royal Institute of Technology, and Uppsala University, as well as some commercial partners, performs sample preparation, sequencing, and limited genomic analysis.
The lab has several protocols for high-throughput production, including RNA sequencing, whole-genome sequencing, and exome sequencing, according to Ewels, who came to Sweden in 2014 after holding a postdoctoral research position at the Babraham Institute at the University of Cambridge in the UK.
"For those [protocols], we have standardized pipelines," Ewels said at the 2017 Intelligent Systems for Molecular Biology-European Conference on Computational Biology (ISMB/ECCB) scientific conference in Prague last week.
"We take the data and we do a lot of [quality control] to make sure everything looks good. In whole genome, we deliver a list of variants. With RNA seq, you've got aligned reads and gene counts," Ewels said.
"Many of the groups that come to us are very new to next-generation sequencing. Hopefully, our analysis results give them a really good place to start," he continued. "It's everything you need to get started," he said.
For those looking for more analysis, SciLifeLab has a dedicated bioinformatics facility. It also offers MultiQC, an open-source tool developed by Ewels that aggregates the results of multiple bioinformatics analyses into a single quality-control report. SciLifeLab issued the first stable release of MultiQC in May. The current version, 1.1, was published July 18.
MultiQC isn't an analytics platform itself, but rather a visualization tool, Ewels explained. It picks up the results of QC checks run by any of a growing number of compatible analysis tools. MultiQC currently works with about 50 other tools, including FastQC, a quality-control application for high-throughput sequence data developed at the Babraham Institute.
"FastQC is one of the most commonly used tools in bioinformatics, but it works on a one-sample basis, so you end up with these lovely plots [on a graph]," Ewels said. "If you have 100 samples, you have 100 reports to look at, and no one can do that. MultiQC plots the same data, but all of them are together." That makes it easier to spot outliers right away.
"When you run a bioinformatics pipeline for a project, you can have tens or hundreds of samples. Maybe your pipeline has multiple steps, and every step gets log files. You have to dig through these and find out that one number that is actually interesting," Ewels said. "It becomes really time-consuming."
MultiQC, which is written in the Python programming language, gets pointed at a directory of analysis results and searches for anything it recognizes. It generates both a standalone HTML report and a directory of parsed data, according to Ewels. "It can be a really nice intermediate step that aggregates all of this information for you. If you're doing really massive projects like single-cell projects, that can be really helpful," he said.
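The "point it at a directory" step can be pictured with a short Python sketch. This is an assumption-laden illustration, not MultiQC's implementation: the `KNOWN_PATTERNS` table, the `example_aligner` entry, and the function `find_recognized_files` are made up for the example, and the real tool's search rules are not described here.

```python
import fnmatch
import os

# Hypothetical filename patterns, keyed by the tool that would write them.
KNOWN_PATTERNS = {
    "fastqc": "*_fastqc.zip",
    "example_aligner": "*.align_summary.txt",
}


def find_recognized_files(root):
    """Walk `root` recursively and group matching file paths by tool."""
    found = {tool: [] for tool in KNOWN_PATTERNS}
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            for tool, pattern in KNOWN_PATTERNS.items():
                # fnmatchcase keeps matching case-sensitive on every OS.
                if fnmatch.fnmatchcase(filename, pattern):
                    found[tool].append(os.path.join(dirpath, filename))
    for paths in found.values():
        paths.sort()
    return found
```

Files that match no pattern are simply ignored, which is what lets a tool of this kind be run over a whole project directory without configuration. MultiQC itself is run from the command line, for example as `multiqc .` in the results directory.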
MultiQC has one module for each program it supports. Each module contains a small amount of Python code specific to that program's output format, and all of the modules tie into the core code.
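A rough Python sketch of that modular design follows. It assumes a simple registry pattern and should not be read as MultiQC's actual plugin interface: the `register` decorator, the `toy_aligner` and `toy_trimmer` modules, their log formats, and `build_report` are all invented for illustration.

```python
import re

MODULES = {}  # registry: module name -> parser function


def register(name):
    """Decorator that adds a parser function to the registry."""
    def decorator(func):
        MODULES[name] = func
        return func
    return decorator


@register("toy_aligner")
def parse_toy_aligner(text):
    # Knows only about this (made-up) tool's log format.
    match = re.search(r"Mapped reads:\s*(\d+)", text)
    return {"mapped_reads": int(match.group(1))} if match else {}


@register("toy_trimmer")
def parse_toy_trimmer(text):
    match = re.search(r"Reads trimmed:\s*([\d.]+)%", text)
    return {"percent_trimmed": float(match.group(1))} if match else {}


def build_report(logs):
    """Core code: merge parsed values per sample.

    `logs` is an iterable of (module_name, sample_name, log_text).
    """
    report = {}
    for module_name, sample, text in logs:
        parser = MODULES.get(module_name)
        if parser is None:
            continue  # no module for this tool, so skip its output
        report.setdefault(sample, {}).update(parser(text))
    return report
```

In a layout like this, supporting another tool means writing one more small parser and registering it; the core code that merges results does not change.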
It's difficult to estimate the size of the MultiQC open-source community, Ewels said, though code-hosting service GitHub lists 37 contributors. More than 200 people have starred the project on GitHub.
"In my view, it's kind of a simple tool. It doesn't really do very much, but in that simplicity, it has obviously filled a niche. It's something lots of people were doing without many solutions," Ewels said. "It's just generic enough and modular enough and extensible enough."
Traffic is global, according to Ewels, though, as is to be expected with bioinformatics, most downloads come from Europe and the US. "It's been a very organic growth," he said. "It was quite unanticipated."
As the community grows, others are contributing code and modules.
"It's relying personally on me less and less, which is great. That's not sustainable, really, for one person," Ewels said. He wants to see more participants and more tools in the short term. "The more tools we can actually support here, the better."
Meanwhile, SciLifeLab has just begun a new, parallel project: a website, to be installed on local networks, that collects data each time a user runs a MultiQC report and maps longitudinal trends, since MultiQC itself only provides a snapshot at a single point in time. "It will depend 100 percent on MultiQC, but it will be separate," Ewels said of the app in development.
"We're a sequencing facility. I want to see the quality numbers over time."