Additional references

"Thoughts" on language/software stability

As we explained, the programming language used in an analysis has a clear influence on the reproducibility of your analysis. It is not a characteristic of the language itself but rather a consequence of the development philosophy of the underlying community. For example C is a very stable language with a very clear specification designed by a committee (even though some compilers may not respect this norm).

On the other end of the spectrum, Python had a much more organic development based on a readability philosophy and valuing continuous improvement over backwards-compatibility. Furthermore, Python is commonly used as a wrapping language (e.g., to easily use C or FORTRAN libraries) and has its own packaging system. All these design choices tend to make reproducibility often a bit painful with Python, even though the community is slowly taking this into account. The transition from Python 2 to the not fully backwards compatible Python 3 has been a particularly painful process, not least because the two languages are so similar that is it not always easy to figure out if a given script or module is written in Python 2 or Python 3. It isn't even rare to see Python scripts that work under both Python 2 and Python 3, but produce different results due to the change in the behavior of integer division.

R, in comparison is much closer (in terms of developer community) to languages like SAS, which is heavily used in the pharmaceutical industry where statistical procedures need to be standardized and rock solid/stable. R is obviously not immune to evolutions that break old versions and hinder reproducibility/backward compatibility. Here is a relatively recent true story about this and some colleagues who worked on the statistics introductory course with R on FUN reported us several issues with a few functions (plotmeans from gplots, survfit from survival, or hclust) whose default parameters had changed over the years. It is thus probably good practice to give explicit values for all parameters (which can be cumbersome) instead of relying on default values, and to restrict your dependencies as much as possible.

This being said, the R development community is generally quite careful about stability. We (the authors of this MOOC) believe that open source (which allows to inspect how computation is done and to identify both mistakes and sources of non-reproducibility) is more important than the rock solid stability of SAS, which is proprietary software. Yet, if you really need to stay with SAS (similar solutions probably exist for other languages as well), you should know that SAS can be used within Jupyter using either the Python SASKernel or the Python SASPy package (step by step explanations about this are given here). Using such literate programming approach allied with systematic version and environment control will always help.

Controlling your software environment

As we mentioned in the video sequences, there are several solutions to control your environment:

The easy (preserve the mess) ones: CDE or ReproZip
The more demanding (encourage cleanliness) where you start with a clean environment and install only what's strictly necessary (and document it):
- The very well known Docker
- Singularity or Spack, which are more targeted toward the specific needs of high performance computing users
- Guix, Nix that are very clean (perfect?) solutions to this dependency hell and which we recommend

It may be hard to understand the difference between these different approaches and decide which one is better in your context.

Here is a webinar where some of these tools are demoed in a reproducible research context: Controling your environment (by Michael Mercier and Cristian Ruiz)

You may also want to have a look at the Popper conventions (webinar by Ivo Gimenez through google hangout) or at the presentation of Konrad Hinsen on Active Papers (http://www.activepapers.org/).

Preservation/Archiving

Ensuring software is properly archived, i.e, is safely stored so that it can be accessed in a perennial way, can be quite tricky. If you have never seen Roberto Di Cosmo presenting the Software Heritage project, this is a must see. https://www.softwareheritage.org/

For regular data, we highly recommend using https://www.zenodo.org/ whenever the data is not sensitive.

Workflows

In the video sequences, we mentioned workflow managers (original application domain in parenthesis):

Galaxy (genomics), Kepler (ecology), Taverna (bio-informatics), Pegasus (astronomy), Collective Knowledge (compiling optimization) , VisTrails (image processing)
Light-weight: dask (python), drake (R), swift (molecular biology), snakemake (like make but more expressive and in python) …
Hybrids: SOS-notebook, …

You may want to have a look at this webinar: Reproducible Science in Bio-informatics: Current Status, Solutions and Research Opportunities (by Sarah Cohen Boulakia, Yvan Le Bras and Jérôme Chopard).

Numerical and statistical issues

We have mentioned these topics in our MOOC but we could by no way cover them properly. We only suggest here a few interesting talks about this.

Publication practices

You may want to have a look at the following two webinars:

Enabling open and reproducible research at computer systems’ conferences (by Grigori Fursin). In particular, this talk discusses artifact evaluation that is becoming more and more popular.
Publication Modes Favoring Reproducible Research (by Konrad Hinsen and Nicolas Rougier). In this talk, the motivation for the ReScience journal initiative are presented.
Simine Vazire - When Should We be Skeptical of Scientific Claims?, which is discussing publication practices in social sciences and in particular HARKing (Hypothesizing After the Results are Known), p-hacking, etc.

Experimentation

Experimentation was not covered in this MOOC, although it is an essential part of science. The main reason is that practices and constraints can vary so wildly from one domain to another that it could not be properly covered in a first edition. We would be happy to gather references you consider as interesting in your domain so do not hesitate to provide us with such references by using the forum and we will update this page.

A recent talk by Lucas Nussbaum on Experimental Testbeds in Computer Science.

Tracking environment information

Getting information about your Git repository

When taking notes, it may be difficult to remember which version of the code or of a file was used. This is what version control is useful for. Here are a few useful commands that we typically insert at the top of our notebooks in shell cells

git log -1

commit 741b0088af5b40588493c23c46d6bab5d0adeb33
Author: Arnaud Legrand <arnaud.legrand@imag.fr>
Date:   Tue Sep 4 12:45:43 2018 +0200

    Fix a few typos and provide information on jupyter-git plugins.

git status -u

On branch master
Your branch is ahead of 'origin/master' by 4 commits.
  (use "git push" to publish your local commits)

Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git checkout -- <file>..." to discard changes in working directory)

	modified:   resources.org

Untracked files:
  (use "git add <file>..." to include in what will be committed)

	../../module2/ressources/replicable_article/IEEEtran.bst
	../../module2/ressources/replicable_article/IEEEtran.cls
	../../module2/ressources/replicable_article/article.bbl
	../../module2/ressources/replicable_article/article.tex
	../../module2/ressources/replicable_article/data.csv
	../../module2/ressources/replicable_article/figure.pdf
	../../module2/ressources/replicable_article/logo.png
	.#resources.org

no changes added to commit (use "git add" and/or "git commit -a")

Note: the -u indicates that git should also display the contents of new directories it did not previously know about.

Then, we often include commands at the end of our notebook indicating how to commit the results (adding the new files, committing with a clear message and pushing). E.g.,

git add resources.org;
git commit -m "Completing the section on getting Git information"
git push

[master 514fe2c1 ] Completing the section on getting Git information
 1 file changed, 61 insertions(+)
Counting objects: 25, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (20/20), done.
Writing objects: 100% (25/25), 7.31 KiB | 499.00 KiB/s, done.
Total 25 (delta 11), reused 0 (delta 0)
To ssh://app-learninglab.inria.fr:9418/learning-lab/mooc-rr-ressources.git
   6359f8c..1f8a567  master -> master

Obviously, in this case you need to save the notebook before running this cell, hence the output of this final command (with the new git hash) will not be stored in the cell. This is not really a problem and is the price to pay for running git from within the notebook itself.

Getting information about Python(3) libraries

Getting the list of installed packages and their version

This topic is discussed on StackOverflow. When using pip (the Python package installer) within a shell command, it is easy to query the version of all installed packages (note that on your system, you may have to use either pip or pip3 depending on how it is named and which versions of Python are available on your machine

Here is for example how I get this information on my machine:

pip3 freeze

asn1crypto==0.24.0
attrs==17.4.0
bcrypt==3.1.4
beautifulsoup4==4.6.0
bleach==2.1.3
...
pandas==0.22.0
pandocfilters==1.4.2
paramiko==2.4.0
patsy==0.5.0
pexpect==4.2.1
...
traitlets==4.3.2
tzlocal==1.5.1
urllib3==1.22
wcwidth==0.1.7
webencodings==0.5

In a Jupyter notebook, this can easily be done by using the %%sh magic. Here is for example what you could do and get on the Jupyter notebooks we deployed for the MOOC (note that here, you should simply use the pip command):

%%sh
pip freeze

alembic==0.9.9
asn1crypto==0.24.0
attrs==18.1.0
Automat==0.0.0
...
numpy==1.13.3
olefile==0.45.1
packaging==17.1
pamela==0.3.0
pandas==0.22.0
...
webencodings==0.5
widgetsnbextension==3.2.1
xlrd==1.1.0
zope.interface==4.5.0

In the rest of this document, I will assume the correct command is pip and I will not systematically insert the %%sh magic.

Once you know which packages are installed, you can easily get additional information about a given package and in particular check whether it was installed "locally" through pip or whether it is installed system-wide. Again, in a shell command:

pip show pandas
echo "            "
pip show statsmodels

Name: pandas
Version: 0.22.0
Summary: Powerful data structures for data analysis, time series,and statistics
Home-page: http://pandas.pydata.org
Author: None
Author-email: None
License: BSD
Location: /usr/lib/python3/dist-packages
Requires: 
            
Name: statsmodels
Version: 0.9.0
Summary: Statistical computations and models for Python
Home-page: http://www.statsmodels.org/
Author: None
Author-email: None
License: BSD License
Location: /home/alegrand/.local/lib/python3.6/site-packages
Requires: patsy, pandas

How to list imported modules?

Without resorting to pip (that will list all available packages), you may want to know which modules are loaded in a Python session as well as their version. Inspired by StackOverflow, here is a simple function that lists loaded package (that have a __version__ attribute, which is unfortunately not completely standard).

def print_imported_modules():
    import sys
    for name, val in sorted(sys.modules.items()):
        if(hasattr(val, '__version__')): 
            print(val.__name__, val.__version__)
        else
            print(val.__name__, "(unknown version)")

print("**** Package list in the beginning ****");
print_imported_modules()
print("**** Package list after loading pandas ****");
import pandas
print_imported_modules()

**** Package list in the beginning ****
**** Package list after loading pandas ****
_csv 1.0
_ctypes 1.1.0
decimal 1.70
argparse 1.1
csv 1.0
ctypes 1.1.0
cycler 0.10.0
dateutil 2.7.3
decimal 1.70
distutils 3.6.5rc1
ipaddress 1.0
json 2.0.9
logging 0.5.1.2
matplotlib 2.1.1
numpy 1.14.5
numpy.core 1.14.5
numpy.core.multiarray 3.1
numpy.core.umath b'0.4.0'
numpy.lib 1.14.5
numpy.linalg._umath_linalg b'0.1.5'
pandas 0.22.0
_libjson 1.33
platform 1.0.8
pyparsing 2.2.0
pytz 2018.5
re 2.2.1
six 1.11.0
urllib.request 3.6
zlib 1.0

Saving and restoring an environment with pip

The easiest way to go is as follows:

pip3 freeze > requirements.txt # to obtain the list of packages with their version
pip3 install -r requirements.txt # to install the previous list of packages, possibly on an other machine

If you want to have several installed Python environments, you may want to use Pipenv. I doubt it allows to track correctly FORTRAN or C dynamic libraries that are wrapped by Python though.

Installing a new package or a specific version

The Jupyter environment we deployed on our servers for the MOOC is based on the version 4.5.4 of Miniconda and Python 3.6. In this environment you should simply use the pip command (remember on your machine, you may have to use pip3).

If I query the current version of statsmodels in a shell command, here is what I will get.

pip show statsmodels

Name: statsmodels
Version: 0.8.0
Summary: Statistical computations and models for Python
Home-page: http://www.statsmodels.org/
Author: Skipper Seabold, Josef Perktold
Author-email: pystatsmodels@googlegroups.com
License: BSD License
Location: /opt/conda/lib/python3.6/site-packages
Requires: scipy, patsy, pandas

I can then easily upgrade statsmodels:

pip install --upgrade statsmodels

Then the new version should then be:

pip show statsmodels

Name: statsmodels
Version: 0.9.0
Summary: Statistical computations and models for Python
Home-page: http://www.statsmodels.org/
Author: Skipper Seabold, Josef Perktold
Author-email: pystatsmodels@googlegroups.com
License: BSD License
Location: /opt/conda/lib/python3.6/site-packages
Requires: scipy, patsy, pandas

It is even possible to install a specific (possibly much older) version, e.g.,:

pip install statsmodels==0.6.1

Getting information about R libraries

Getting the list imported modules and their version

The best way seems to be to rely on the devtools package (if this package is not installed, you should install it first by running in R the command install.packages("devtools")).

sessionInfo()
devtools::session_info()

R version 3.5.1 (2018-07-02)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux buster/sid

Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.8.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.8.0

locale:
 [1] LC_CTYPE=fr_FR.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=fr_FR.UTF-8        LC_COLLATE=fr_FR.UTF-8    
 [5] LC_MONETARY=fr_FR.UTF-8    LC_MESSAGES=fr_FR.UTF-8   
 [7] LC_PAPER=fr_FR.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=fr_FR.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
[1] compiler_3.5.1
Session info ------------------------------------------------------------------
 setting  value                       
 version  R version 3.5.1 (2018-07-02)
 system   x86_64, linux-gnu           
 ui       X11                         
 language (EN)                        
 collate  fr_FR.UTF-8                 
 tz       Europe/Paris                
 date     2018-08-01                  

Packages ----------------------------------------------------------------------
 package   * version date       source        
 base      * 3.5.1   2018-07-02 local         
 compiler    3.5.1   2018-07-02 local         
 datasets  * 3.5.1   2018-07-02 local         
 devtools    1.13.6  2018-06-27 CRAN (R 3.5.1)
 digest      0.6.15  2018-01-28 CRAN (R 3.5.0)
 graphics  * 3.5.1   2018-07-02 local         
 grDevices * 3.5.1   2018-07-02 local         
 memoise     1.1.0   2017-04-21 CRAN (R 3.5.1)
 methods   * 3.5.1   2018-07-02 local         
 stats     * 3.5.1   2018-07-02 local         
 utils     * 3.5.1   2018-07-02 local         
 withr       2.1.2   2018-03-15 CRAN (R 3.5.0)

Some actually advocate that writing a reproducible research compendium is best done by writing an R package. Those of you willing to have a clean R dependency management should thus have a look at Packrat.

Getting the list of installed packages and their version

Finally, it is good to know that there is a built-in R command (installed.packages) allowing to retrieve and list the details of all packages installed.

head(installed.packages())

Package	LibPath	Version	Priority	Depends	Imports	LinkingTo	Suggests	Enhances	License	License_is_FOSS	License_restricts_use	OS_type	MD5sum	NeedsCompilation	Built
BH	/home/alegrand/R/x86₆₄-pc-linux-gnu-library/3.5	1.66.0-1	nil	nil	nil	nil	nil	nil	BSL-1.0	nil	nil	nil	nil	no	3.5.1
Formula	/home/alegrand/R/x86₆₄-pc-linux-gnu-library/3.5	1.2-3	nil	R (>= 2.0.0), stats	nil	nil	nil	nil	GPL-2	GPL-3	nil	nil	nil	nil	no	3.5.1
Hmisc	/home/alegrand/R/x86₆₄-pc-linux-gnu-library/3.5	4.1-1	nil	lattice, survival (>= 2.40-1), Formula, ggplot2 (>= 2.2)	methods, latticeExtra, cluster, rpart, nnet, acepack, foreign,
gtable, grid, gridExtra, data.table, htmlTable (>= 1.11.0),
viridis, htmltools, base64enc	nil	chron, rms, mice, tables, knitr, ff, ffbase, plotly (>=
4.5.6)	nil	GPL (>= 2)	nil	nil	nil	nil	yes	3.5.1
Matrix	/home/alegrand/R/x86₆₄-pc-linux-gnu-library/3.5	1.2-14	recommended	R (>= 3.2.0)	methods, graphics, grid, stats, utils, lattice	nil	expm, MASS	MatrixModels, graph, SparseM, sfsmisc	GPL (>= 2)	file LICENCE	nil	nil	nil	nil	yes	3.5.1
StanHeaders	/home/alegrand/R/x86₆₄-pc-linux-gnu-library/3.5	2.17.2	nil	nil	nil	nil	RcppEigen, BH	nil	BSD₃_clause + file LICENSE	nil	nil	nil	nil	yes	3.5.1
acepack	/home/alegrand/R/x86₆₄-pc-linux-gnu-library/3.5	1.4.1	nil	nil	nil	nil	testthat	nil	MIT + file LICENSE	nil	nil	nil	nil	yes	3.5.1

Installing a new package or a specific version

This section is mostly a cut and paste from the recent post by Ian Pylvainen on this topic. It comprises a very clear explanation of how to proceed.

Installing a pre-compiled version
If you're on a Debian or a Ubuntu system, it may be difficult to access a specific version without breaking your system. So unless you are moving to the latest version available in your Linux distribution, we strongly recommend you to build from source. In this case, you'll need to make sure you have the necessary toolchain to build packages from source (e.g., gcc, FORTRAN, etc.). On Windows, this may require you to install Rtools.

If you're on Windows or OS X and looking for a package for an older version of R (R 2.1 or below), you can check the CRAN binary archive. Once you have the URL, you can install it using a command similar to the example below:
```
packageurl <- "https://cran-archive.r-project.org/bin/windows/contrib/2.13/BBmisc_1.0-58.zip"
install.packages(packageurl, repos=NULL, type="binary")
```
Using devtools
The simplest method to install the version you need is to use the install_version() function of the devtools package (obviously, you need to install devtools first, which can be done by running in R the command install.packages("devtools")). For instance:
```
require(devtools)
install_version("ggplot2", version = "0.9.1", repos = "http://cran.us.r-project.org")
```
Installing from source code
Alternatively, you may want to install an older package from source If devtools fails or if you do not want to depend on it, you can install it from source via install.packages() directed using the right URL. This URL can be obtained by browsing the CRAN Package Archive.

Once you have the URL, you can install it using a command similar to the example below:
```
packageurl <- "http://cran.r-project.org/src/contrib/Archive/ggplot2/ggplot2_0.9.1.tar.gz"
install.packages(packageurl, repos=NULL, type="source")
```
If you know the URL, you can also install from source via the command line outside of R. For instance (in bash):
```
wget http://cran.r-project.org/src/contrib/Archive/ggplot2/ggplot2_0.9.1.tar.gz
R CMD INSTALL ggplot2_0.9.1.tar.gz
```
Potential issues
There are a few potential issues that may arise with installing older versions of packages:
- You may be losing functionality or bug fixes that are only present in the newer versions of the packages.
- The older package version needed may not be compatible with the version of R you have installed. In this case, you will either need to downgrade R to a compatible version or update your R code to work with a newer version of the package.

Table of Contents

Additional references

"Thoughts" on language/software stability

Controlling your software environment

Preservation/Archiving

Workflows

Numerical and statistical issues

Publication practices

Experimentation

Tracking environment information

Getting information about your Git repository

Getting information about Python(3) libraries

Getting the list of installed packages and their version

How to list imported modules?

Saving and restoring an environment with pip

Installing a new package or a specific version

Getting information about R libraries

Getting the list imported modules and their version

Getting the list of installed packages and their version

Installing a new package or a specific version