Reproducible Research in High-Throughput Biology: A Case Study

Paolo Sonego

November 30th, 2012

Overview

Introduction

What is Reproducible Research?

Why is Reproducible Research so important?

The Anil Potti Story

The Anil Potti Story - Conclusion [2]

Reproducible Research in daily practice [3]

Tools for Reproducible Research and Literate Programming

Editors and IDE

Markup languages and tools

TeX and LaTeX

LaTeX “Hello World”

\documentclass{article}
\title{Hello World}
\author{Paolo Sonego}
\date{November 2012}
\begin{document}
   \maketitle
   Hello world!
\end{document}

Sweave

The de facto standard for Reproducible Research in the R environment.

Sweave “Hello World”

Take a LaTeX source file.

Change the file extension from .tex to .Rnw.

Insert the chunk of R code you want to execute between a << >>= line and an @ sign followed by a space:

<<hello sweave >>=
print(rnorm(50))
@
##  [1] -1.02809  0.58519  0.47856 -1.15948  0.35598 -0.19351 -1.13424  0.63789
##  [9]  1.69889 -1.11168  1.01274 -1.86981 -0.03253 -1.23849  0.65434  0.51970
## [17] -0.17844  0.15562 -0.31640  0.73339  1.58082  0.28131 -0.49190 -0.53323
## [25] -0.64260  0.06578 -1.07761  0.88227 -0.33908 -0.29489  0.65853  0.49415
## [33]  0.91659 -0.20477 -0.33838  1.67765 -0.70644 -0.55799 -0.26238 -0.19877
## [41] -0.49096 -1.33861 -0.39981  0.41306  0.01421 -0.53587  0.48340  2.28866
## [49] -1.33952 -1.84252

Run Sweave to produce the TeX file and Stangle to extract the R code, then compile the result with pdflatex:

R CMD Sweave helloworld.Rnw
R CMD Stangle helloworld.Rnw
pdflatex helloworld
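Putting the steps above together, a complete helloworld.Rnw could be assembled as in the following sketch. It is only an illustration under stated assumptions: the temporary working directory is hypothetical, the chunk label hello_sweave is written without a space, and the Sweave and Stangle steps run only if R is found on the PATH. The pdflatex step from the listing above would follow unchanged.

```shell
# Work in a throwaway directory (an assumption made for this sketch).
workdir=$(mktemp -d)
cd "$workdir"

# Write the LaTeX "Hello World" with one R chunk embedded in it.
# The quoted 'EOF' keeps the shell from touching backslashes or <<.
cat > helloworld.Rnw <<'EOF'
\documentclass{article}
\title{Hello World}
\author{Paolo Sonego}
\date{November 2012}
\begin{document}
   \maketitle
   Hello world!
<<hello_sweave>>=
print(rnorm(50))
@
\end{document}
EOF

if command -v R >/dev/null 2>&1; then
    # Weave: run the chunk and write helloworld.tex.
    R CMD Sweave helloworld.Rnw
    # Tangle: extract the R code into helloworld.R.
    R CMD Stangle helloworld.Rnw
else
    echo 'R not found; skipping Sweave and Stangle' >&2
fi
```

Because the chunk draws random numbers without a fixed seed, the values printed in the woven document differ from run to run.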

Markdown

Markdown Cheat Sheet

R Markdown

It allows R chunks to be embedded in a Markdown file, just as Sweave allows them in a LaTeX file.

```{r hello_rmarkdown}
print(rnorm(50))
```

##  [1]  0.47135  0.33137  1.12405 -1.93464  0.49882 -0.40663  1.70079  1.95296
##  [9]  0.28356  0.46559 -1.60801 -0.51435 -1.14396  1.38524  1.41066 -0.51923
## [17]  0.01728 -1.15363  1.05206 -1.38199 -0.47508  1.18964 -0.35563 -0.41657
## [25] -0.70746  1.41704 -0.65796 -0.11216  0.67053 -0.11592  0.49551 -0.17050
## [33]  0.36816  0.06961 -0.32033 -0.60880 -0.72954 -0.60811 -0.02811  1.25004
## [41] -0.35045  1.64920  1.58694 -0.05738  0.68935 -0.03911  0.94360  1.18036
## [49]  0.45375 -0.04956

knitr

pandoc

A Swiss-army knife for converting a markup document into different formats. A few examples:
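Typical conversions might look like the sketch below. These are illustrative invocations rather than ones taken from the talk: the file names and the temporary directory are hypothetical, and the commands run only if pandoc is found on the PATH.

```shell
# Hypothetical input file, created here so the sketch is self-contained.
workdir=$(mktemp -d)
cd "$workdir"
printf '# Hello\n\nHello *world*!\n' > example.md

if command -v pandoc >/dev/null 2>&1; then
    # Markdown to a standalone HTML page.
    pandoc -s example.md -o example.html
    # Markdown to a standalone LaTeX source file.
    pandoc -s example.md -o example.tex
    # Markdown to a Word document.
    pandoc example.md -o example.docx
else
    echo 'pandoc not found; skipping conversions' >&2
fi
```

In each case pandoc infers the output format from the extension given to -o, so the same input can feed several targets.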

Versioning and Version Control Systems

Apache Subversion

“Enterprise-class centralized version control for the masses”

Subversion exists to be universally recognized and adopted as an open-source, centralized version control system characterized by its reliability as a safe haven for valuable data; the simplicity of its model and usage; and its ability to support the needs of a wide variety of users and projects, from individuals to large-scale enterprise operations.

svn “Hello World”

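# Create a local repository (run from /Users/paolo to match the URLs below)
svnadmin create SVNrep
# Prepare a directory with one empty file and import it
mkdir test
touch test/test.txt
svn import test file:///Users/paolo/SVNrep/test -m 'Initial import'
# Replace the unversioned directory with a working copy
rm -r test
svn checkout file:///Users/paolo/SVNrep/test
cd test
# Modify the file, then inspect and commit the change
echo 'Hello World' > test.txt
svn status
svn commit test.txt -m 'test modified'
svn update
# Compare the working copy against revision 1
svn diff -r 1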

git
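A git “Hello World”, by analogy with the svn example above, might look like the following. This is a minimal sketch: the temporary directory, the file name, the commit messages and the placeholder identity passed with -c are all assumptions made for illustration.

```shell
# Work in a throwaway directory so nothing existing is touched (assumption).
repo=$(mktemp -d)
cd "$repo"

# Create an empty repository; unlike svn, no separate server-side
# repository has to be created first.
git init -q

# Track a first, empty file and record it.
touch test.txt
git add test.txt
# The -c options set a placeholder identity for this commit only.
git -c user.name='Example User' -c user.email='user@example.com' \
    commit -q -m 'Initial import'

# Modify the file, inspect the change, then commit it.
echo 'Hello World' > test.txt
git status --short
git --no-pager diff
git -c user.name='Example User' -c user.email='user@example.com' \
    commit -q -m 'test modified' test.txt

# Show the history, newest commit first.
git --no-pager log --oneline
```

The main difference from the svn workflow is that every commit is recorded in the local repository; sharing it with collaborators is a separate, explicit step.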

Reproducible Research in High-Throughput Biology

R and Bioconductor

Bioconductor [4]

Why Bioconductor is the way forward for Reproducible Research

Summary

Case Study

“Reversal of gene expression changes in the colorectal normal-adenoma pathway by NS398 selective COX2 inhibitor”, O. Galamb et al., British Journal of Cancer (2010) 102, 765–773

Case Study - Conclusion

Slides and Example

These slides and the case study were produced with the knitr package, which converts R Markdown to Markdown (slides) and Sweave markup to PDF (case study); pandoc then generates the HTML5 slides from the Markdown. The case study is available at https://github.com/onertipaday/ItalianBioRDay2012/CaseStudy. The slides are available at https://github.com/onertipaday/ItalianBioRDay2012/Slides
and can be reproduced from R by typing the following (the knitr package must be installed in your R distribution and pandoc must be available on your system):

require("knitr")
knit("Slides.Rmd")
system("pandoc -s -S -i -t slidy --mathjax Slides.md -o Slides.html")
browseURL("Slides.html")

R Session-Info

It is always good practice to include the session info:

print(sessionInfo(), locale = FALSE)
## R version 2.15.2 (2012-10-26)
## Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] wordcloud_2.2      RColorBrewer_1.0-5 Rcpp_0.10.0        tm_0.5-7.1        
## [5] XML_3.95-0.1       knitr_0.8          dataframe_2.5      colorout_0.9-9    
## 
## loaded via a namespace (and not attached):
## [1] digest_0.5.2   evaluate_0.4.2 formatR_0.6    plyr_1.7.1     slam_0.1-26   
## [6] stringr_0.6.1  tools_2.15.2

Acknowledgements

I wish to thank Dr. Yihui Xie for developing the knitr package, which I used to produce both the slides and the example, and Dr. Vince Buffalo for the inspiring blog post The Beauty of Bioconductor.

Contacts

Paolo Sonego

email: onertipaday@gmail.com

blog: onertipaday.blogspot.com

twitter: @onertipaday

github: onertipaday


  1. NYT on the importance of reproducible research.

  2. Anil Potti page on Wikipedia.

  3. Quotes from Keith Baggerly.

  4. Bioconductor web site.

  5. T. Hastie, R. Tibshirani, Balasubramanian Narasimhan and Gil Chu (2011).
    pamr: Pam: prediction analysis for microarrays. R package version 1.54.