Sorting Out Statistics

The outcome of many biomedical experiments comes down to two questions: 1) Is it statistically significant? and 2) Is it biomedically meaningful? The answer to the latter, of course, determines the importance to the researcher and the biomedical community, but it is closely linked to the answer to the former, at least with quantitative data. Biomedical research is becoming more quantitative, and statistical theory and practice are becoming increasingly important to researchers. With the growing prevalence of expression microarrays and other large-scale methods, biologists need powerful, robust, easy-to-use statistical software.

Since the days of mainframe computers, scientists have had a choice of several popular commercial statistics packages, but what of open source options? From packages with integrated graphical interfaces to command-line tools, researchers have several good choices for statistical analyses of varying complexity. In our experience, in fact, the primary limitation on effective statistical analysis is not the availability of open source tools but rather the researcher’s grasp of statistical theory. As a result, good software documentation is essential, not just to explain software usage but also to link to publications and texts for further reading.

Starting Out

For basic statistics, many biologists head to familiar spreadsheet applications. The best open source choice is OpenOffice.org’s Calc application, which has an interface and functionality similar to commercial packages. OpenOffice.org, created by Sun Microsystems and the open source community, is a free suite of applications for spreadsheets, text documents, presentations, and vector graphics. It runs on desktop and Unix operating systems and provides an Application Programming Interface and macro programming. Calc performs basic statistical calculations, such as the t-test and correlations, and creates charts (which can be exported as PDF, a nice feature). Installing the statistics macro extends Calc with analyses such as ANOVA and principal component analysis. Files from other spreadsheet applications can be opened in Calc. We often use spreadsheet applications for viewing tab-delimited text files, but opening text files in Calc requires specifying the format “Text CSV,” which is an annoying extra step.
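
To make concrete what a spreadsheet computes for a t-test or a correlation, here is a minimal sketch in Python, using only its standard library. Python is our choice for illustration and is not one of the tools discussed in this column. The function names welch_t and pearson_r and the sample values are hypothetical. The sketch returns only the test statistic, not a p-value.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Welch's t statistic for two independent samples (variances may differ)."""
    # statistics.variance is the sample variance (divides by n - 1).
    standard_error = sqrt(
        variance(sample_a) / len(sample_a) + variance(sample_b) / len(sample_b)
    )
    return (mean(sample_a) - mean(sample_b)) / standard_error


def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length sequences."""
    mean_x, mean_y = mean(xs), mean(ys)
    covariance_sum = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance_sum / sqrt(spread_x * spread_y)


# Made-up measurements, for illustration only.
control = [4.1, 3.9, 4.4, 4.0, 4.2]
treated = [4.8, 5.1, 4.6, 5.0, 4.9]
print("t statistic:", welch_t(control, treated))
print("correlation:", pearson_r(control, treated))
```

Turning the t statistic into a p-value requires the t distribution, which is where a spreadsheet function or a dedicated statistics package earns its keep.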

For bioinformatics programmers, it can be very helpful to integrate statistics into one’s programming environment. Like many bioinformatics groups, we prefer Perl as our programming environment for its text processing and biology tools (primarily BioPerl), so we sometimes add statistics functionality with a Perl module. Many modules are fine for basic statistics, but there aren’t large multi-purpose statistics modules, perhaps because Perl isn’t a favorite choice for serious mathematics or statistics.

All About R

For hard-core statistics, the question of open source software has a one-letter answer: R. The R Project is a comprehensive package of statistics modules patterned after the S language. Started by Robert Gentleman and Ross Ihaka, it’s a system for statistical computation and graphics. Since 1997, the R Core Team and contributors have built it into an excellent resource that runs on Unix and desktop systems. One can use R as a programming language, but it’s most readily accessible by calling one- to several-line commands to execute specific statistical analyses or generate graphics. With the extensive documentation, sample code, and sample data, the R developers have tried to make it easy for those new to R, but it requires some time investment to get started. The Windows version of R adds a GUI, but it still relies largely on the command line, so non-programmers can find it intimidating. In addition to book-like PDF documents and HTML help, R provides vignettes with alternating code and explanations. Those who prefer learning from real books can use packages (ISwR and MASS) that include all the sample data from popular R and S textbooks.

So what can R do? In short, it performs a lot of statistics and generates a lot of elegant graphics. Common statistical tools for biomedical research, like survival analysis (Kaplan-Meier curves), box-and-whisker plots, and adjustment of p-values for multiple comparisons, are easy to perform in R but are difficult, if not impossible, to find in spreadsheet applications or Web-based tools. Beyond the statistics, it would be worth learning the language for the graphics alone. To represent data in a novel way, R provides access to both low- and high-level graphing functions. (This flexibility does come at the cost of a time investment.) Publication-quality graphics can be saved as PostScript or PDF.
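
As one illustration of adjusting p-values for multiple comparisons, the following is a minimal Python sketch of the Benjamini-Hochberg false discovery rate procedure, one of several adjustment methods (R offers this one through its p.adjust function with method "BH"). Python is used here only for illustration. The function name adjust_bh and the example p-values are hypothetical, and this is a sketch of the textbook procedure rather than a copy of any package's implementation.

```python
def adjust_bh(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    # Indices of the p-values, ordered from the largest p-value to the smallest.
    order = sorted(range(m), key=lambda i: p_values[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for position, index in enumerate(order):
        rank = m - position  # rank in ascending order (1 = smallest p-value)
        candidate = p_values[index] * m / rank
        # Taking a running minimum keeps adjusted values monotone and capped at 1.
        running_min = min(running_min, candidate)
        adjusted[index] = running_min
    return adjusted


# Made-up p-values from four hypothetical tests.
raw = [0.01, 0.04, 0.03, 0.005]
for p, q in zip(raw, adjust_bh(raw)):
    print(f"raw p = {p:.3f}  adjusted p = {q:.3f}")
```

With thousands of genes on a microarray, an adjustment of this kind is what keeps a list of "significant" genes from being dominated by false positives.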

In addition to general statistical analysis, R has an add-on set of modules for the analysis of microarray data. The Bioconductor Project, which contains the majority of these tools, can process both oligonucleotide and spotted arrays, and includes the original implementations of several state-of-the-art analysis algorithms. This built-in functionality has made Bioconductor/R a popular choice for microarray analysis (and a starting point for packages like GenePattern and SNOMAD).

Even though R is designed for a command-line interface, a few GUIs exist. Elegant stand-alone packages like “the R Commander” (Rcmdr) and the “Linear Models for Microarray Data” (limma) GUI were created for basic statistics and microarray analysis, respectively, using Tcl/Tk extensions. Rcmdr is a good system for introducing users to the command line, since commands are printed as one executes them from the GUI menus. Some groups have created Web-based GUIs, but these cover only a few packages. Some progress has also been made toward connectivity between R and Perl with interfaces like RSPerl, which allows R calls from Perl and vice versa, although setting it up was problematic for us. Instead, we often design a Perl wrapper for the desired R functionality, which creates an R script on the fly and then executes the R commands on the desired data files.
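
The wrapper approach can be sketched in a few lines. The example below is in Python rather than Perl, purely for illustration, and it is not our actual wrapper. The helper names (r_string, build_r_script, run_r_script), the file name, and the column names are all hypothetical. It assumes a tab-delimited file with a header row and, if R is installed, runs the generated script with the Rscript command.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def r_string(text):
    """Quote text as an R string literal (escapes backslashes and double quotes)."""
    return '"' + text.replace("\\", "\\\\").replace('"', '\\"') + '"'


def build_r_script(data_file, column_a, column_b):
    """Return the text of an R script that runs a two-sample t-test on two columns."""
    lines = [
        f"data <- read.delim({r_string(data_file)})",
        f"result <- t.test(data[[{r_string(column_a)}]], data[[{r_string(column_b)}]])",
        "print(result)",
    ]
    return "\n".join(lines) + "\n"


def run_r_script(script_text):
    """Write the script to a temporary file and run it with Rscript, if installed."""
    rscript = shutil.which("Rscript")
    if rscript is None:
        return None  # R does not appear to be installed on this machine
    with tempfile.TemporaryDirectory() as tmp_dir:
        script_path = Path(tmp_dir) / "analysis.R"
        script_path.write_text(script_text)
        completed = subprocess.run(
            [rscript, str(script_path)], capture_output=True, text=True, check=True
        )
    return completed.stdout


# Hypothetical input file and column names; this only prints the generated script.
print(build_r_script("expression.txt", "control", "treated"))
```

Keeping script generation separate from execution makes it easy to inspect or save the R code, which helps when an analysis needs to be repeated or checked by a statistician.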

Regardless of the software chosen for the job, it won’t be a substitute for learning statistics; we still keep some good textbooks and articles on hand (and make friends with statisticians) to make sure we’re performing these statistical analyses in the most effective and powerful manner.

Fran Lewitter, PhD, is director of bioinformatics and research computing at Whitehead Institute for Biomedical Research. This column was written in collaboration with George Bell, PhD, a bioinformatics scientist in Fran’s group.