Commit 0130fd48 authored by Arnaud Legrand

Split resources in two documents

parent 2fda5471
all: resources.html exo1.html exo2.html exo3.html all: resources_refs.html resources_environment.html exo1.html exo2.html exo3.html
NLINES=10000000 NLINES=10000000
......
<div id="content"> <div id="content">
<h1 class="title">Tracking environment information</h1>
<div id="table-of-contents"> <div id="table-of-contents">
<h2>Table of Contents</h2> <h2>Table of Contents</h2>
<div id="text-table-of-contents"> <div id="text-table-of-contents">
<ul style="margin:0 0;"> <ul style="margin:0 0;">
<li style="margin-bottom:0;"><a href="#org0d4a0e8">Additional references</a> <li style="margin-bottom:0;"><a href="#org54d219c">Getting information about your Git repository</a></li>
<li style="margin-bottom:0;"><a href="#orgd1774c3">Getting information about Python(3) libraries</a>
<ul style="margin:0 0;"> <ul style="margin:0 0;">
<li style="margin-bottom:0;"><a href="#orgc4f6c50">"Thoughts" on language/software stability</a></li> <li style="margin-bottom:0;"><a href="#org7283a87">Getting information about your system</a></li>
<li style="margin-bottom:0;"><a href="#orgabfd56a">Controlling your software environment</a></li> <li style="margin-bottom:0;"><a href="#orgfa4dc3a">Getting the list of installed packages and their version</a></li>
<li style="margin-bottom:0;"><a href="#orgf56170d">Preservation/Archiving</a></li> <li style="margin-bottom:0;"><a href="#org31cde5f">How to list imported modules?</a></li>
<li style="margin-bottom:0;"><a href="#org476ea90">Workflows</a></li> <li style="margin-bottom:0;"><a href="#orgcea179c">Saving and restoring an environment with pip</a></li>
<li style="margin-bottom:0;"><a href="#orgdacce8f">Numerical and statistical issues</a></li> <li style="margin-bottom:0;"><a href="#org849fdbb">Installing a new package or a specific version</a></li>
<li style="margin-bottom:0;"><a href="#org91673e7">Publication practices</a></li>
<li style="margin-bottom:0;"><a href="#org9f7c676">Experimentation</a></li>
</ul> </ul>
</li> </li>
<li style="margin-bottom:0;"><a href="#orgbef6611">Tracking environment information</a> <li style="margin-bottom:0;"><a href="#org4f45b1e">Getting information about R libraries</a>
<ul style="margin:0 0;"> <ul style="margin:0 0;">
<li style="margin-bottom:0;"><a href="#org9ef50b0">Getting information about your Git repository</a></li> <li style="margin-bottom:0;"><a href="#orgc583ee9">Getting the list of imported modules and their version</a></li>
<li style="margin-bottom:0;"><a href="#orgc1cd356">Getting information about Python(3) libraries</a> <li style="margin-bottom:0;"><a href="#orgdffc6a5">Getting the list of installed packages and their version</a></li>
<li style="margin-bottom:0;"><a href="#orgb52d0ce">Installing a new package or a specific version</a>
<ul style="margin:0 0;"> <ul style="margin:0 0;">
<li style="margin-bottom:0;"><a href="#org6a455a3">Getting information about your system</a></li> <li style="margin-bottom:0;"><a href="#orgaf558a0">Installing a pre-compiled version</a></li>
<li style="margin-bottom:0;"><a href="#org96c1188">Getting the list of installed packages and their version</a></li> <li style="margin-bottom:0;"><a href="#org7d8a9f0">Using devtools</a></li>
<li style="margin-bottom:0;"><a href="#org46d5140">How to list imported modules?</a></li> <li style="margin-bottom:0;"><a href="#org4509fba">Installing from source code</a></li>
<li style="margin-bottom:0;"><a href="#org1127887">Saving and restoring an environment with pip</a></li> <li style="margin-bottom:0;"><a href="#org9d64d25">Potential issues</a></li>
<li style="margin-bottom:0;"><a href="#org4522704">Installing a new package or a specific version</a></li>
</ul>
</li>
<li style="margin-bottom:0;"><a href="#orgd73af89">Getting information about R libraries</a>
<ul style="margin:0 0;">
<li style="margin-bottom:0;"><a href="#org549563f">Getting the list of imported modules and their version</a></li>
<li style="margin-bottom:0;"><a href="#orgafe4f90">Getting the list of installed packages and their version</a></li>
<li style="margin-bottom:0;"><a href="#org9e3380b">Installing a new package or a specific version</a></li>
</ul> </ul>
</li> </li>
</ul> </ul>
...@@ -39,205 +32,9 @@ ...@@ -39,205 +32,9 @@
</div> </div>
</div> </div>
<div id="outline-container-org0d4a0e8" class="outline-2"> <div id="outline-container-org54d219c" class="outline-2">
<h2 id="org0d4a0e8">Additional references</h2> <h2 id="org54d219c">Getting information about your Git repository</h2>
<div class="outline-text-2" id="text-org0d4a0e8"> <div class="outline-text-2" id="text-org54d219c">
</div>
<div id="outline-container-orgc4f6c50" class="outline-3">
<h3 id="orgc4f6c50">"Thoughts" on language/software stability</h3>
<div class="outline-text-3" id="text-orgc4f6c50">
<p>
As we explained, the programming language used in an analysis has a
clear influence on the reproducibility of your analysis. It is not a
characteristic of the language itself but rather a consequence of the
development philosophy of the underlying community. For example C is a
very stable language with a <a href="https://en.wikipedia.org/wiki/C_(programming_language)#ANSI_C_and_ISO_C">very clear specification designed by a
committee</a> (even though some compilers may not respect this norm).
</p>
<p>
On the other end of the spectrum, <a href="https://en.wikipedia.org/wiki/Python_(programming_language)">Python</a> had a much more organic
development based on a readability philosophy and valuing continuous
improvement over backwards-compatibility. Furthermore, Python is
commonly used as a wrapping language (e.g., to easily use C or FORTRAN
libraries) and has its own packaging system. All these design choices
tend to make reproducibility often a bit painful with Python, even
though the community is slowly taking this into account. The transition from Python 2 to the not fully backwards compatible Python 3 has been a particularly painful process, not least because the two languages are so similar that it is not always easy to figure out whether a given script or module is written in Python 2 or Python 3. It is not even rare to see Python scripts that run under both Python 2 and Python 3 but produce different results because the behavior of integer division changed.
</p>
<p>
<a href="https://en.wikipedia.org/wiki/R_(programming_language)">R</a>, in comparison, is much closer (in terms of developer community) to
languages like <a href="https://en.wikipedia.org/wiki/SAS_(software)">SAS</a>, which is heavily used in the pharmaceutical
industry where statistical procedures need to be standardized and rock
solid/stable. R is obviously not immune to evolutions that break old
versions and hinder reproducibility/backward compatibility. Here is a
relatively recent <a href="http://members.cbio.mines-paristech.fr/~thocking/HOCKING-reproducible-research-with-R.html">true story about this</a>. In addition, some colleagues who worked
on the <a href="https://www.fun-mooc.fr/courses/UPSUD/42001S06/session06/about">statistics introductory course with R on FUN</a> reported to us
several issues with a few functions (<code>plotmeans</code> from <code>gplots</code>,
<code>survfit</code> from <code>survival</code>, or <code>hclust</code>) whose default parameters had
changed over the years. It is thus probably good practice to give
explicit values for all parameters (which can be cumbersome) instead
of relying on default values, and to restrict your dependencies as much
as possible.
</p>
<p>
This being said, the R development community is generally quite
careful about stability. We (the authors of this MOOC) believe that open
source (which allows one to inspect how computation is done and to
identify both mistakes and sources of non-reproducibility) is more
important than the rock solid stability of SAS, which is proprietary
software. Yet, if you really need to stay with SAS (similar solutions
probably exist for other languages as well), you should know that SAS
can be used within Jupyter using either the <a href="https://sassoftware.github.io/sas_kernel/">Python SASKernel</a> or the
<a href="https://sassoftware.github.io/saspy/">Python SASPy</a> package (step by step explanations about this are given
<a href="https://app-learninglab.inria.fr/gitlab/85bc36e0a8096c618fbd5993d1cca191/mooc-rr/blob/master/documents/tuto_jupyter_windows/tuto_jupyter_windows.md">here</a>). Combining such a literate programming approach with systematic
version and environment control will always help.
</p>
</div>
</div>
<div id="outline-container-orgabfd56a" class="outline-3">
<h3 id="orgabfd56a">Controlling your software environment</h3>
<div class="outline-text-3" id="text-orgabfd56a">
<p>
As we mentioned in the video sequences, there are several solutions to
control your environment:
</p>
<ul class="org-ul">
<li style="margin-bottom:0;">The easy (preserve the mess) ones: <a href="http://www.pgbovine.net/cde.html">CDE</a> or <a href="https://vida-nyu.github.io/reprozip/">ReproZip</a></li>
<li style="margin-bottom:0;">The more demanding ones (encourage cleanliness), where you start with a
clean environment and install only what's strictly necessary (and document it):
<ul class="org-ul">
<li style="margin-bottom:0;">The very well known <a href="https://www.docker.io/">Docker</a></li>
<li style="margin-bottom:0;"><a href="https://singularity.lbl.gov/">Singularity</a> or <a href="https://spack.io/">Spack</a>, which are more targeted toward the specific
needs of high performance computing users</li>
<li style="margin-bottom:0;"><a href="https://www.gnu.org/software/guix/">Guix</a> and <a href="https://nixos.org/">Nix</a>, which are very clean (perfect?) solutions to this
dependency hell and which we recommend</li>
</ul></li>
</ul>
<p>
It may be hard to understand the difference between these different
approaches and decide which one is better in your context.
</p>
<p>
Here is a webinar where some of these tools are demoed in a
reproducible research context: <a href="https://github.com/alegrand/RR_webinars/blob/master/2_controling_your_environment/index.org">Controling your environment (by Michael
Mercier and Cristian Ruiz)</a>
</p>
<p>
You may also want to have a look at <a href="http://falsifiable.us/">the Popper conventions</a> (<a href="https://github.com/alegrand/RR_webinars/blob/master/11_popper/index.org">webinar by
Ivo Gimenez through google hangout</a>) or at the <a href="https://github.com/alegrand/RR_webinars/blob/master/7_publications/index.org">presentation of Konrad
Hinsen on Active Papers</a> (<a href="http://www.activepapers.org/">http://www.activepapers.org/</a>).
</p>
</div>
</div>
<div id="outline-container-orgf56170d" class="outline-3">
<h3 id="orgf56170d">Preservation/Archiving</h3>
<div class="outline-text-3" id="text-orgf56170d">
<p>
Ensuring that software is properly archived, i.e., safely stored so that
it can be accessed in a perennial way, can be quite tricky. If you
have never seen <a href="https://github.com/alegrand/RR_webinars/blob/master/5_archiving_software_and_data/index.org">Roberto Di Cosmo presenting the Software Heritage
project</a>, this is a must-see. <a href="https://www.softwareheritage.org/">https://www.softwareheritage.org/</a>
</p>
<p>
For regular data, we highly recommend using <a href="https://www.zenodo.org/">https://www.zenodo.org/</a>
whenever the data is not sensitive.
</p>
</div>
</div>
<div id="outline-container-org476ea90" class="outline-3">
<h3 id="org476ea90">Workflows</h3>
<div class="outline-text-3" id="text-org476ea90">
<p>
In the video sequences, we mentioned workflow managers (original application domain in parentheses):
</p>
<ul class="org-ul">
<li style="margin-bottom:0;"><a href="https://galaxyproject.org/">Galaxy</a> (genomics), <a href="https://kepler-project.org/">Kepler</a> (ecology), <a href="https://taverna.apache.org/">Taverna</a> (bio-informatics), <a href="https://pegasus.isi.edu/">Pegasus</a>
(astronomy), <a href="http://cknowledge.org/">Collective Knowledge</a> (compilation optimization),
<a href="https://www.vistrails.org">VisTrails</a> (image processing)</li>
<li style="margin-bottom:0;">Light-weight: <a href="http://dask.pydata.org/">dask</a> (python), <a href="https://ropensci.github.io/drake/">drake</a> (R), <a href="http://swift-lang.org/">swift</a> (molecular biology),
<a href="https://snakemake.readthedocs.io/">snakemake</a> (like <code>make</code> but more expressive and in <code>python</code>) &#x2026;</li>
<li style="margin-bottom:0;">Hybrids: <a href="https://vatlab.github.io/sos-docs/">SOS-notebook</a>, &#x2026;</li>
</ul>
<p>
You may want to have a look at this webinar: <a href="https://github.com/alegrand/RR_webinars/blob/master/6_reproducibility_bioinformatics/index.org">Reproducible Science in
Bio-informatics: Current Status, Solutions and Research Opportunities
(by Sarah Cohen Boulakia, Yvan Le Bras and Jérôme Chopard).</a>
</p>
</div>
</div>
<div id="outline-container-orgdacce8f" class="outline-3">
<h3 id="orgdacce8f">Numerical and statistical issues</h3>
<div class="outline-text-3" id="text-orgdacce8f">
<p>
We mentioned these topics in our MOOC but could in no way
cover them properly. Here we only suggest a few interesting talks
on them.
</p>
<ul class="org-ul">
<li style="margin-bottom:0;"><a href="https://github.com/alegrand/RR_webinars/blob/master/10_statistics_and_replication_in_HCI/index.org">In this talk, Pierre Dragicevic provides a nice illustration of the
consequences of statistical uncertainty and of how some concepts
(e.g., p-values) are commonly badly understood.</a></li>
<li style="margin-bottom:0;"><a href="https://github.com/alegrand/RR_webinars/blob/master/3_numerical_reproducibility/index.org">Nathalie Revol, Philippe Langlois and Stef Graillat present the main
challenges encountered when trying to achieve numerical
reproducibility and present recent research work on this topic.</a></li>
</ul>
</div>
</div>
<div id="outline-container-org91673e7" class="outline-3">
<h3 id="org91673e7">Publication practices</h3>
<div class="outline-text-3" id="text-org91673e7">
<p>
You may want to have a look at the following talks:
</p>
<ul class="org-ul">
<li style="margin-bottom:0;"><a href="https://github.com/alegrand/RR_webinars/blob/master/8_artifact_evaluation/index.org">Enabling open and reproducible research at computer systems’
conferences (by Grigori Fursin)</a>. In particular, this talk discusses
<i>artifact evaluation</i>, which is becoming more and more popular.</li>
<li style="margin-bottom:0;"><a href="https://github.com/alegrand/RR_webinars/blob/master/7_publications/index.org">Publication Modes Favoring Reproducible Research (by Konrad Hinsen
and Nicolas Rougier)</a>. In this talk, the motivation for the <a href="http://rescience.github.io/">ReScience
journal</a> initiative is presented.</li>
<li style="margin-bottom:0;"><a href="https://www.youtube.com/watch?v=HuJ2G8rXHMs">Simine Vazire - When Should We be Skeptical of Scientific Claims?</a>,
which discusses publication practices in the social sciences and in
particular HARKing (Hypothesizing After the Results are Known),
p-hacking, etc.</li>
</ul>
</div>
</div>
<div id="outline-container-org9f7c676" class="outline-3">
<h3 id="org9f7c676">Experimentation</h3>
<div class="outline-text-3" id="text-org9f7c676">
<p>
Experimentation was not covered in this MOOC, although it is an
essential part of science. The main reason is that practices and
constraints can vary so wildly from one domain to another that it could
not be properly covered in a first edition. We would be happy to
gather references you consider interesting in your domain, so do not
hesitate to provide us with such references through the forum, and we
will update this page.
</p>
<ul class="org-ul">
<li style="margin-bottom:0;"><a href="https://github.com/alegrand/RR_webinars/blob/master/9_experimental_testbeds/index.org">A recent talk by Lucas Nussbaum on Experimental Testbeds in Computer
Science</a>.</li>
</ul>
</div>
</div>
</div>
<div id="outline-container-orgbef6611" class="outline-2">
<h2 id="orgbef6611">Tracking environment information</h2>
<div class="outline-text-2" id="text-orgbef6611">
</div>
<div id="outline-container-org9ef50b0" class="outline-3">
<h3 id="org9ef50b0">Getting information about your Git repository</h3>
<div class="outline-text-3" id="text-org9ef50b0">
<p> <p>
When taking notes, it may be difficult to remember which version of When taking notes, it may be difficult to remember which version of
the code or of a file was used. This is what version control is useful the code or of a file was used. This is what version control is useful
...@@ -328,13 +125,13 @@ is the price to pay for running git from within the notebook itself. ...@@ -328,13 +125,13 @@ is the price to pay for running git from within the notebook itself.
</p> </p>
</div> </div>
</div> </div>
<div id="outline-container-orgc1cd356" class="outline-3"> <div id="outline-container-orgd1774c3" class="outline-2">
<h3 id="orgc1cd356">Getting information about Python(3) libraries</h3> <h2 id="orgd1774c3">Getting information about Python(3) libraries</h2>
<div class="outline-text-3" id="text-orgc1cd356"> <div class="outline-text-2" id="text-orgd1774c3">
</div> </div>
<div id="outline-container-org6a455a3" class="outline-4"> <div id="outline-container-org7283a87" class="outline-3">
<h4 id="org6a455a3">Getting information about your system</h4> <h3 id="org7283a87">Getting information about your system</h3>
<div class="outline-text-4" id="text-org6a455a3"> <div class="outline-text-3" id="text-org7283a87">
<p> <p>
This topic is discussed on <a href="https://stackoverflow.com/questions/3103178/how-to-get-the-system-info-with-python">StackOverflow</a>. This topic is discussed on <a href="https://stackoverflow.com/questions/3103178/how-to-get-the-system-info-with-python">StackOverflow</a>.
</p> </p>
...@@ -351,9 +148,9 @@ uname_result(system='Linux', node='icarus', release='4.15.0-2-amd64', version='# ...@@ -351,9 +148,9 @@ uname_result(system='Linux', node='icarus', release='4.15.0-2-amd64', version='#
</div> </div>
</div> </div>
<div id="outline-container-org96c1188" class="outline-4"> <div id="outline-container-orgfa4dc3a" class="outline-3">
<h4 id="org96c1188">Getting the list of installed packages and their version</h4> <h3 id="orgfa4dc3a">Getting the list of installed packages and their version</h3>
<div class="outline-text-4" id="text-org96c1188"> <div class="outline-text-3" id="text-orgfa4dc3a">
<p> <p>
This topic is discussed on <a href="https://stackoverflow.com/questions/20180543/how-to-check-version-of-python-modules">StackOverflow</a>. When using <code>pip</code> (the Python This topic is discussed on <a href="https://stackoverflow.com/questions/20180543/how-to-check-version-of-python-modules">StackOverflow</a>. When using <code>pip</code> (the Python
package installer) within a shell command, it is easy to query the package installer) within a shell command, it is easy to query the
...@@ -461,9 +258,9 @@ Requires: patsy, pandas ...@@ -461,9 +258,9 @@ Requires: patsy, pandas
</pre> </pre>
</div> </div>
</div> </div>
<div id="outline-container-org46d5140" class="outline-4"> <div id="outline-container-org31cde5f" class="outline-3">
<h4 id="org46d5140">How to list imported modules?</h4> <h3 id="org31cde5f">How to list imported modules?</h3>
<div class="outline-text-4" id="text-org46d5140"> <div class="outline-text-3" id="text-org31cde5f">
<p> <p>
Without resorting to pip (that will list all available packages), you Without resorting to pip (that will list all available packages), you
may want to know which modules are loaded in a Python session as well may want to know which modules are loaded in a Python session as well
...@@ -525,9 +322,9 @@ zlib 1.0 ...@@ -525,9 +322,9 @@ zlib 1.0
</div> </div>
</div> </div>
<div id="outline-container-org1127887" class="outline-4"> <div id="outline-container-orgcea179c" class="outline-3">
<h4 id="org1127887">Saving and restoring an environment with pip</h4> <h3 id="orgcea179c">Saving and restoring an environment with pip</h3>
<div class="outline-text-4" id="text-org1127887"> <div class="outline-text-3" id="text-orgcea179c">
<p> <p>
The easiest way to go is as follows: The easiest way to go is as follows:
</p> </p>
...@@ -544,9 +341,9 @@ dynamic libraries that are wrapped by Python though. ...@@ -544,9 +341,9 @@ dynamic libraries that are wrapped by Python though.
</p> </p>
</div> </div>
</div> </div>
<div id="outline-container-org4522704" class="outline-4"> <div id="outline-container-org849fdbb" class="outline-3">
<h4 id="org4522704">Installing a new package or a specific version</h4> <h3 id="org849fdbb">Installing a new package or a specific version</h3>
<div class="outline-text-4" id="text-org4522704"> <div class="outline-text-3" id="text-org849fdbb">
<p> <p>
The Jupyter environment we deployed on our servers for the MOOC is The Jupyter environment we deployed on our servers for the MOOC is
based on the version 4.5.4 of Miniconda and Python 3.6. In this based on the version 4.5.4 of Miniconda and Python 3.6. In this
...@@ -613,13 +410,13 @@ It is even possible to install a specific (possibly much older) version, e.g.,: ...@@ -613,13 +410,13 @@ It is even possible to install a specific (possibly much older) version, e.g.,:
</div> </div>
</div> </div>
</div> </div>
<div id="outline-container-orgd73af89" class="outline-3"> <div id="outline-container-org4f45b1e" class="outline-2">
<h3 id="orgd73af89">Getting information about R libraries</h3> <h2 id="org4f45b1e">Getting information about R libraries</h2>
<div class="outline-text-3" id="text-orgd73af89"> <div class="outline-text-2" id="text-org4f45b1e">
</div> </div>
<div id="outline-container-org549563f" class="outline-4"> <div id="outline-container-orgc583ee9" class="outline-3">
<h4 id="org549563f">Getting the list of imported modules and their version</h4> <h3 id="orgc583ee9">Getting the list of imported modules and their version</h3>
<div class="outline-text-4" id="text-org549563f"> <div class="outline-text-3" id="text-orgc583ee9">
<p> <p>
The best way seems to be to rely on the <code>devtools</code> package (if this The best way seems to be to rely on the <code>devtools</code> package (if this
package is not installed, you should install it first by running in <code>R</code> package is not installed, you should install it first by running in <code>R</code>
...@@ -687,9 +484,9 @@ clean R dependency management should thus have a look at <a href="https://rstudi ...@@ -687,9 +484,9 @@ clean R dependency management should thus have a look at <a href="https://rstudi
</p> </p>
</div> </div>
</div> </div>
<div id="outline-container-orgafe4f90" class="outline-4"> <div id="outline-container-orgdffc6a5" class="outline-3">
<h4 id="orgafe4f90">Getting the list of installed packages and their version</h4> <h3 id="orgdffc6a5">Getting the list of installed packages and their version</h3>
<div class="outline-text-4" id="text-orgafe4f90"> <div class="outline-text-3" id="text-orgdffc6a5">
<p> <p>
Finally, it is good to know that there is a built-in R command Finally, it is good to know that there is a built-in R command
(<code>installed.packages</code>) allowing to retrieve and list the details of all (<code>installed.packages</code>) allowing to retrieve and list the details of all
...@@ -944,9 +741,9 @@ packages installed. ...@@ -944,9 +741,9 @@ packages installed.
</div> </div>
</div> </div>
<div id="outline-container-org9e3380b" class="outline-4"> <div id="outline-container-orgb52d0ce" class="outline-3">
<h4 id="org9e3380b">Installing a new package or a specific version</h4> <h3 id="orgb52d0ce">Installing a new package or a specific version</h3>
<div class="outline-text-4" id="text-org9e3380b"> <div class="outline-text-3" id="text-orgb52d0ce">
<p> <p>
This section is mostly a cut and paste from the <a href="https://support.rstudio.com/hc/en-us/articles/219949047-Installing-older-versions-of-packages">recent post by Ian This section is mostly a cut and paste from the <a href="https://support.rstudio.com/hc/en-us/articles/219949047-Installing-older-versions-of-packages">recent post by Ian
Pylvainen</a> on this topic. It comprises a very clear explanation of how Pylvainen</a> on this topic. It comprises a very clear explanation of how
...@@ -954,9 +751,9 @@ to proceed. ...@@ -954,9 +751,9 @@ to proceed.
</p> </p>
</div> </div>
<ul class="org-ul"> <div id="outline-container-orgaf558a0" class="outline-4">
<li style="margin-bottom:0;"><a id="org1d1770e"></a>Installing a pre-compiled version<br /> <h4 id="orgaf558a0">Installing a pre-compiled version</h4>
<div class="outline-text-5" id="text-org1d1770e"> <div class="outline-text-4" id="text-orgaf558a0">
<p> <p>
If you're on a Debian or a Ubuntu system, it may be difficult to If you're on a Debian or a Ubuntu system, it may be difficult to
access a specific version without breaking your system. So unless you access a specific version without breaking your system. So unless you
...@@ -979,9 +776,10 @@ install.packages(packageurl, repos=<span style="font-weight: bold; text-decorati ...@@ -979,9 +776,10 @@ install.packages(packageurl, repos=<span style="font-weight: bold; text-decorati
</pre> </pre>
</div> </div>
</div> </div>
</li> </div>
<li style="margin-bottom:0;"><a id="org9183e3e"></a>Using devtools<br /> <div id="outline-container-org7d8a9f0" class="outline-4">
<div class="outline-text-5" id="text-org9183e3e"> <h4 id="org7d8a9f0">Using devtools</h4>
<div class="outline-text-4" id="text-org7d8a9f0">
<p> <p>
The simplest method to install the version you need is to use the The simplest method to install the version you need is to use the
<code>install_version()</code> function of the <code>devtools</code> package (obviously, you <code>install_version()</code> function of the <code>devtools</code> package (obviously, you
...@@ -995,9 +793,10 @@ install_version(<span style="font-style: italic;">"ggplot2"</span>, version = <s ...@@ -995,9 +793,10 @@ install_version(<span style="font-style: italic;">"ggplot2"</span>, version = <s
</pre> </pre>
</div> </div>
</div> </div>
</li> </div>
<li style="margin-bottom:0;"><a id="orgd8742a1"></a>Installing from source code<br /> <div id="outline-container-org4509fba" class="outline-4">
<div class="outline-text-5" id="text-orgd8742a1"> <h4 id="org4509fba">Installing from source code</h4>
<div class="outline-text-4" id="text-org4509fba">
<p> <p>
Alternatively, you may want to install an older package from source. If Alternatively, you may want to install an older package from source. If
devtools fails or if you do not want to depend on it, you can install devtools fails or if you do not want to depend on it, you can install
...@@ -1025,9 +824,10 @@ R CMD INSTALL ggplot2_0.9.1.tar.gz ...@@ -1025,9 +824,10 @@ R CMD INSTALL ggplot2_0.9.1.tar.gz
</pre> </pre>
</div> </div>
</div> </div>
</li> </div>
<li style="margin-bottom:0;"><a id="orga4f8c7a"></a>Potential issues<br /> <div id="outline-container-org9d64d25" class="outline-4">
<div class="outline-text-5" id="text-orga4f8c7a"> <h4 id="org9d64d25">Potential issues</h4>
<div class="outline-text-4" id="text-org9d64d25">
<p> <p>
There are a few potential issues that may arise with installing older There are a few potential issues that may arise with installing older
versions of packages: versions of packages:
...@@ -1041,8 +841,6 @@ to downgrade R to a compatible version or update your R code to work ...@@ -1041,8 +841,6 @@ to downgrade R to a compatible version or update your R code to work
with a newer version of the package.</li> with a newer version of the package.</li>
</ul> </ul>
</div> </div>
</li>
</ul>
</div> </div>
</div> </div>
</div> </div>
......
# -*- mode: org -*- # -*- mode: org -*-
#+TITLE: #+TITLE: Tracking environment information
#+AUTHOR: Arnaud Legrand #+AUTHOR: Arnaud Legrand
#+DATE: June, 2018 #+DATE: June, 2018
#+STARTUP: overview indent #+STARTUP: overview indent
#+OPTIONS: num:nil toc:t #+OPTIONS: num:nil toc:t
#+PROPERTY: header-args :eval never-export #+PROPERTY: header-args :eval never-export
* Additional references * Getting information about your Git repository
** "Thoughts" on language/software stability
As we explained, the programming language used in an analysis has a
clear influence on the reproducibility of your analysis. It is not a
characteristic of the language itself but rather a consequence of the
development philosophy of the underlying community. For example C is a
very stable language with a [[https://en.wikipedia.org/wiki/C_(programming_language)#ANSI_C_and_ISO_C][very clear specification designed by a
committee]] (even though some compilers may not respect this norm).
On the other end of the spectrum, [[https://en.wikipedia.org/wiki/Python_(programming_language)][Python]] had a much more organic
development based on a readability philosophy and valuing continuous
improvement over backwards-compatibility. Furthermore, Python is
commonly used as a wrapping language (e.g., to easily use C or FORTRAN
libraries) and has its own packaging system. All these design choices
tend to make reproducibility often a bit painful with Python, even
though the community is slowly taking this into account. The transition from Python 2 to the not fully backwards compatible Python 3 has been a particularly painful process, not least because the two languages are so similar that it is not always easy to figure out whether a given script or module is written in Python 2 or Python 3. It is not even rare to see Python scripts that run under both Python 2 and Python 3 but produce different results because the behavior of integer division changed.
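To make the integer division pitfall concrete, here is a minimal sketch written for Python 3. The two helper functions (=mean_py2_style= and =mean_py3_style=) are hypothetical names introduced only for this illustration. The first one uses explicit floor division to mimic what =/= did on two integers under Python 2; the second one uses the true division that =/= performs under Python 3.

```python
# Illustration only: both helpers are hypothetical and run under Python 3.
# Under Python 2, "/" applied to two ints truncated like "//" does here.


def mean_py2_style(values):
    """Mimic the Python 2 int/int behaviour with explicit floor division."""
    return sum(values) // len(values)


def mean_py3_style(values):
    """Use true division, which is what "/" means in Python 3."""
    return sum(values) / len(values)


samples = [1, 2, 2]
# Same data, same formula on paper, two different numerical results.
print("floor division:", mean_py2_style(samples))
print("true division: ", mean_py3_style(samples))
```

A script computing such a mean with a bare =/= would run without any error under both interpreters, which is why the discrepancy is easy to miss. Recording the interpreter version next to the results is a cheap safeguard.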
[[https://en.wikipedia.org/wiki/R_(programming_language)][R]], in comparison, is much closer (in terms of developer community) to
languages like [[https://en.wikipedia.org/wiki/SAS_(software)][SAS]], which is heavily used in the pharmaceutical
industry where statistical procedures need to be standardized and rock
solid/stable. R is obviously not immune to evolutions that break old
versions and hinder reproducibility/backward compatibility. Here is a
relatively recent [[http://members.cbio.mines-paristech.fr/~thocking/HOCKING-reproducible-research-with-R.html][true story about this]]. In addition, some colleagues who worked
on the [[https://www.fun-mooc.fr/courses/UPSUD/42001S06/session06/about][statistics introductory course with R on FUN]] reported to us
several issues with a few functions (=plotmeans= from =gplots=,
=survfit= from =survival=, or =hclust=) whose default parameters had
changed over the years. It is thus probably good practice to give
explicit values for all parameters (which can be cumbersome) instead
of relying on default values, and to restrict your dependencies as much
as possible.
This being said, the R development community is generally quite
careful about stability. We (the authors of this MOOC) believe that open
source (which allows one to inspect how computation is done and to
identify both mistakes and sources of non-reproducibility) is more
important than the rock solid stability of SAS, which is proprietary
software. Yet, if you really need to stay with SAS (similar solutions
probably exist for other languages as well), you should know that SAS
can be used within Jupyter using either the [[https://sassoftware.github.io/sas_kernel/][Python SASKernel]] or the
[[https://sassoftware.github.io/saspy/][Python SASPy]] package (step by step explanations about this are given
[[https://app-learninglab.inria.fr/gitlab/85bc36e0a8096c618fbd5993d1cca191/mooc-rr/blob/master/documents/tuto_jupyter_windows/tuto_jupyter_windows.md][here]]). Combining such a literate programming approach with systematic
version and environment control will always help.
** Controlling your software environment
As we mentioned in the video sequences, there are several solutions to
control your environment:
- The easy (preserve the mess) ones: [[http://www.pgbovine.net/cde.html][CDE]] or [[https://vida-nyu.github.io/reprozip/][ReproZip]]
- The more demanding (encourage cleanliness) ones, where you start with a
  clean environment and install only what's strictly necessary (and document it):
  - The very well known [[https://www.docker.io/][Docker]]
  - [[https://singularity.lbl.gov/][Singularity]] or [[https://spack.io/][Spack]], which are more targeted toward the specific
    needs of high performance computing users
  - [[https://www.gnu.org/software/guix/][Guix]] and [[https://nixos.org/][Nix]], which are very clean (perfect?) solutions to this
    dependency hell and which we recommend
It may be hard to understand the differences between these
approaches and to decide which one is better in your context.
Here is a webinar where some of these tools are demoed in a
reproducible research context: [[https://github.com/alegrand/RR_webinars/blob/master/2_controling_your_environment/index.org][Controling your environment (by Michael
Mercier and Cristian Ruiz)]]
You may also want to have a look at [[http://falsifiable.us/][the Popper conventions]] ([[https://github.com/alegrand/RR_webinars/blob/master/11_popper/index.org][webinar by
Ivo Gimenez through google hangout]]) or at the [[https://github.com/alegrand/RR_webinars/blob/master/7_publications/index.org][presentation of Konrad
Hinsen on Active Papers]] (http://www.activepapers.org/).
** Preservation/Archiving
Ensuring software is properly archived, i.e., safely stored so that
it remains accessible in the long term, can be quite tricky. If you
have never seen [[https://github.com/alegrand/RR_webinars/blob/master/5_archiving_software_and_data/index.org][Roberto Di Cosmo presenting the Software Heritage
project]] (https://www.softwareheritage.org/), this is a must-see.
For regular data, we highly recommend using https://www.zenodo.org/
whenever the data is not sensitive.
** Workflows
In the video sequences, we mentioned workflow managers (original application domain in parentheses):
- [[https://galaxyproject.org/][Galaxy]] (genomics), [[https://kepler-project.org/][Kepler]] (ecology), [[https://taverna.apache.org/][Taverna]] (bio-informatics), [[https://pegasus.isi.edu/][Pegasus]]
(astronomy), [[http://cknowledge.org/][Collective Knowledge]] (compiler optimization),
[[https://www.vistrails.org][VisTrails]] (image processing)
- Light-weight: [[http://dask.pydata.org/][dask]] (python), [[https://ropensci.github.io/drake/][drake]] (R), [[http://swift-lang.org/][swift]] (molecular biology),
[[https://snakemake.readthedocs.io/][snakemake]] (like =make= but more expressive and in =python=) ...
- Hybrids: [[https://vatlab.github.io/sos-docs/][SOS-notebook]], ...
You may want to have a look at this webinar: [[https://github.com/alegrand/RR_webinars/blob/master/6_reproducibility_bioinformatics/index.org][Reproducible Science in
Bio-informatics: Current Status, Solutions and Research Opportunities
(by Sarah Cohen Boulakia, Yvan Le Bras and Jérôme Chopard).]]
** Numerical and statistical issues
We have mentioned these topics in our MOOC but could in no way
cover them properly. Here we only suggest a few interesting talks
on the subject.
- [[https://github.com/alegrand/RR_webinars/blob/master/10_statistics_and_replication_in_HCI/index.org][In this talk, Pierre Dragicevic provides a nice illustration of the
consequences of statistical uncertainty and of how some concepts
(e.g., p-values) are commonly misunderstood.]]
- [[https://github.com/alegrand/RR_webinars/blob/master/3_numerical_reproducibility/index.org][Nathalie Revol, Philippe Langlois and Stef Graillat present the main
challenges encountered when trying to achieve numerical
reproducibility and present recent research work on this topic.]]
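To give a concrete feel for the numerical side, here is a minimal sketch (our own illustration, not taken from the talk) of why floating-point results can depend on the order of operations: addition of IEEE 754 doubles is not associative, so summing the same values in a different order may change the last bits of the result. The variable names are ours.

```python
import math

values = [0.1, 0.2, 0.3]

# Same three numbers, two different groupings of the additions.
left_to_right = (values[0] + values[1]) + values[2]
right_to_left = values[0] + (values[1] + values[2])

print(left_to_right == right_to_left)  # False with IEEE 754 doubles

# math.fsum tracks the rounding error of intermediate sums, so its
# result does not depend on the order in which the values are given.
forward = math.fsum(values)
backward = math.fsum(list(reversed(values)))
print(forward == backward)
```

Parallel reductions, compiler optimizations, or a different BLAS library can all change the summation order, which is one reason why bitwise-identical results are hard to obtain across machines.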
** Publication practices
You may want to have a look at the following talks:
- [[https://github.com/alegrand/RR_webinars/blob/master/8_artifact_evaluation/index.org][Enabling open and reproducible research at computer systems
conferences (by Grigori Fursin)]]. In particular, this talk discusses
/artifact evaluation/, which is becoming more and more popular.
- [[https://github.com/alegrand/RR_webinars/blob/master/7_publications/index.org][Publication Modes Favoring Reproducible Research (by Konrad Hinsen
and Nicolas Rougier)]]. In this talk, the motivation for the [[http://rescience.github.io/][ReScience
journal]] initiative is presented.
- [[https://www.youtube.com/watch?v=HuJ2G8rXHMs][Simine Vazire - When Should We be Skeptical of Scientific Claims?]],
which discusses publication practices in social sciences and in
particular HARKing (Hypothesizing After the Results are Known),
p-hacking, etc.
** Experimentation
Experimentation was not covered in this MOOC, although it is an
essential part of science. The main reason is that practices and
constraints can vary so wildly from one domain to another that it could
not be properly covered in a first edition. We would be happy to
gather references you consider interesting in your domain, so do not
hesitate to share them on the forum and we
will update this page.
- [[https://github.com/alegrand/RR_webinars/blob/master/9_experimental_testbeds/index.org][A recent talk by Lucas Nussbaum on Experimental Testbeds in Computer
Science]].
* Tracking environment information
** Getting information about your Git repository
When taking notes, it may be difficult to remember which version of
the code or of a file was used. This is what version control is useful
for. Here are a few useful commands that we typically insert at the
...@@ -203,8 +84,8 @@ Obviously, in this case you need to save the notebook before running
this cell, hence the output of this final command (with the new git
hash) will not be stored in the cell. This is not really a problem and
is the price to pay for running git from within the notebook itself.
** Getting information about Python(3) libraries * Getting information about Python(3) libraries
*** Getting information about your system ** Getting information about your system
This topic is discussed on [[https://stackoverflow.com/questions/3103178/how-to-get-the-system-info-with-python][StackOverflow]].
#+begin_src python :results output :exports both
import platform
...@@ -214,7 +95,7 @@ print(platform.uname())
#+RESULTS:
: uname_result(system='Linux', node='icarus', release='4.15.0-2-amd64', version='#1 SMP Debian 4.15.11-1 (2018-03-20)', machine='x86_64', processor='')
*** Getting the list of installed packages and their version ** Getting the list of installed packages and their version
This topic is discussed on [[https://stackoverflow.com/questions/20180543/how-to-check-version-of-python-modules][StackOverflow]]. When using =pip= (the Python
package installer) within a shell command, it is easy to query the
version of all installed packages (note that on your system, you may
...@@ -310,7 +191,7 @@ License: BSD License
Location: /home/alegrand/.local/lib/python3.6/site-packages
Requires: patsy, pandas
#+end_example
*** How to list imported modules? ** How to list imported modules?
Without resorting to pip (which lists all installed packages), you
may want to know which modules are loaded in a Python session as well
as their version. Inspired by [[https://stackoverflow.com/questions/4858100/how-to-list-imported-modules][StackOverflow]], here is a simple
...@@ -368,7 +249,7 @@ urllib.request 3.6
zlib 1.0
#+end_example
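For readers who want a starting point, the idea can be sketched as follows. This is our own minimal version, not necessarily the script used to produce the output above: it walks through =sys.modules= and keeps the modules that expose a =__version__= string. The function name =loaded_module_versions= is hypothetical, and many modules do not define =__version__= at all, so the listing is only partial.

```python
import sys

def loaded_module_versions():
    """Map each currently loaded module to its __version__ string, when it has one."""
    versions = {}
    # Copy the items first: reading attributes may trigger further imports.
    for name, module in list(sys.modules.items()):
        try:
            version = getattr(module, "__version__", None)
        except Exception:
            continue  # some lazily loaded modules fail on attribute access
        if isinstance(version, str):
            versions[name] = version
    return versions

versions = loaded_module_versions()
for name in sorted(versions):
    print(name, versions[name])
```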
*** Saving and restoring an environment with pip ** Saving and restoring an environment with pip
The easiest way to go is as follows:
#+begin_src shell :results output :exports both
pip3 freeze > requirements.txt # to obtain the list of packages with their version
...@@ -378,7 +259,7 @@ pip3 install -r requirements.txt # to install the previous list of packages, pos
If you want to have several installed Python environments, you may
want to use [[https://docs.pipenv.org/][Pipenv]]. I doubt it correctly tracks the FORTRAN or C
dynamic libraries that are wrapped by Python, though.
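A list similar to the one produced by =pip3 freeze= can also be obtained from within Python, which is convenient at the end of a notebook. The sketch below is our own illustration and assumes Python 3.8 or later, where =importlib.metadata= is part of the standard library (it is therefore not available in the Python 3.6 environment mentioned in this document). The function name =frozen_requirements= is hypothetical, and the output is not guaranteed to match =pip3 freeze= line for line.

```python
from importlib import metadata  # standard library since Python 3.8

def frozen_requirements():
    """Return sorted 'name==version' strings for the installed distributions."""
    lines = set()
    for dist in metadata.distributions():
        try:
            name = dist.metadata.get("Name")
            version = dist.version
        except Exception:
            continue  # skip distributions whose metadata cannot be read
        if name and version:
            lines.add("{}=={}".format(name, version))
    return sorted(lines, key=str.lower)

requirements = frozen_requirements()
print("\n".join(requirements))
```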
*** Installing a new package or a specific version ** Installing a new package or a specific version
The Jupyter environment we deployed on our servers for the MOOC is
based on version 4.5.4 of Miniconda and Python 3.6. In this
environment you should simply use the =pip= command (remember on your
...@@ -430,8 +311,8 @@ It is even possible to install a specific (possibly much older) version, e.g.,:
#+begin_src shell :results output :exports both
pip install statsmodels==0.6.1
#+end_src
** Getting information about R libraries * Getting information about R libraries
*** Getting the list imported modules and their version ** Getting the list imported modules and their version
The best way seems to be to rely on the =devtools= package (if this
package is not installed, you should install it first by running in =R=
the command =install.packages("devtools")=).
...@@ -493,7 +374,7 @@ Packages ----------------------------------------------------------------------
Some actually advocate that [[https://github.com/ropensci/rrrpkg][writing a reproducible research compendium
is best done by writing an R package]]. Those of you who want
clean R dependency management should thus have a look at [[https://rstudio.github.io/packrat/][Packrat]].
*** Getting the list of installed packages and their version ** Getting the list of installed packages and their version
Finally, it is good to know that there is a built-in R command
(=installed.packages=) that retrieves and lists the details of all
installed packages.
...@@ -514,12 +395,12 @@ head(installed.packages())
| StanHeaders | /home/alegrand/R/x86_64-pc-linux-gnu-library/3.5 | 2.17.2 | nil | nil | nil | nil | RcppEigen, BH | nil | BSD_3_clause + file LICENSE | nil | nil | nil | nil | yes | 3.5.1 | |
| acepack | /home/alegrand/R/x86_64-pc-linux-gnu-library/3.5 | 1.4.1 | nil | nil | nil | nil | testthat | nil | MIT + file LICENSE | nil | nil | nil | nil | yes | 3.5.1 | |
*** Installing a new package or a specific version ** Installing a new package or a specific version
This section is mostly a cut and paste from the [[https://support.rstudio.com/hc/en-us/articles/219949047-Installing-older-versions-of-packages][recent post by Ian
Pylvainen]] on this topic. It comprises a very clear explanation of how
to proceed.
**** Installing a pre-compiled version *** Installing a pre-compiled version
If you're on a Debian or an Ubuntu system, it may be difficult to
access a specific version without breaking your system. So unless you
are moving to the latest version available in your Linux distribution,
...@@ -536,7 +417,7 @@ similar to the example below:
packageurl <- "https://cran-archive.r-project.org/bin/windows/contrib/2.13/BBmisc_1.0-58.zip"
install.packages(packageurl, repos=NULL, type="binary")
#+end_src
**** Using devtools *** Using devtools
The simplest method to install the version you need is to use the
=install_version()= function of the =devtools= package (obviously, you
need to install =devtools= first, which can be done by running in =R= the
...@@ -546,7 +427,7 @@ command =install.packages("devtools")=). For instance:
require(devtools)
install_version("ggplot2", version = "0.9.1", repos = "http://cran.us.r-project.org")
#+end_src
**** Installing from source code *** Installing from source code
Alternatively, you may want to install an older package from source. If
devtools fails or if you do not want to depend on it, you can install
it from source via =install.packages()= directed using the right
...@@ -565,7 +446,7 @@ line outside of R. For instance (in bash):
wget http://cran.r-project.org/src/contrib/Archive/ggplot2/ggplot2_0.9.1.tar.gz
R CMD INSTALL ggplot2_0.9.1.tar.gz
#+end_src
**** Potential issues *** Potential issues
There are a few potential issues that may arise with installing older
versions of packages:
- You may be losing functionality or bug fixes that are only present
......
<div id="content">
<h1 class="title">Additional references</h1>
<div id="table-of-contents">
<h2>Table of Contents</h2>
<div id="text-table-of-contents">
<ul style="margin:0 0;">
<li style="margin-bottom:0;"><a href="#org3b8ed57">"Thoughts" on language/software stability</a></li>
<li style="margin-bottom:0;"><a href="#org1d2d532">Controlling your software environment</a></li>
<li style="margin-bottom:0;"><a href="#org50da419">Preservation/Archiving</a></li>
<li style="margin-bottom:0;"><a href="#org5d2f9e5">Workflows</a></li>
<li style="margin-bottom:0;"><a href="#orgad41259">Numerical and statistical issues</a></li>
<li style="margin-bottom:0;"><a href="#org7321a51">Publication practices</a></li>
<li style="margin-bottom:0;"><a href="#orge4adad6">Experimentation</a></li>
</ul>
</div>
</div>
<div id="outline-container-org3b8ed57" class="outline-2">
<h2 id="org3b8ed57">"Thoughts" on language/software stability</h2>
<div class="outline-text-2" id="text-org3b8ed57">
<p>
As we explained, the programming language used in an analysis has a
clear influence on the reproducibility of your analysis. It is not a
characteristic of the language itself but rather a consequence of the
development philosophy of the underlying community. For example C is a
very stable language with a <a href="https://en.wikipedia.org/wiki/C_(programming_language)#ANSI_C_and_ISO_C">very clear specification designed by a
committee</a> (even though some compilers may not respect this norm).
</p>
<p>
On the other end of the spectrum, <a href="https://en.wikipedia.org/wiki/Python_(programming_language)">Python</a> had a much more organic
development based on a readability philosophy and valuing continuous
improvement over backwards-compatibility. Furthermore, Python is
commonly used as a wrapping language (e.g., to easily use C or FORTRAN
libraries) and has its own packaging system. All these design choices
tend to make reproducibility often a bit painful with Python, even
though the community is slowly taking this into account. The transition from Python 2 to the not fully backwards compatible Python 3 has been a particularly painful process, not least because the two languages are so similar that it is not always easy to figure out if a given script or module is written in Python 2 or Python 3. It isn't even rare to see Python scripts that work under both Python 2 and Python 3, but produce different results due to the change in the behavior of integer division.
</p>
<p>
<a href="https://en.wikipedia.org/wiki/R_(programming_language)">R</a>, in comparison, is much closer (in terms of developer community) to
languages like <a href="https://en.wikipedia.org/wiki/SAS_(software)">SAS</a>, which is heavily used in the pharmaceutical
industry where statistical procedures need to be standardized and rock
solid/stable. R is obviously not immune to evolutions that break old
versions and hinder reproducibility/backward compatibility. Here is a
relatively recent <a href="http://members.cbio.mines-paristech.fr/~thocking/HOCKING-reproducible-research-with-R.html">true story about this</a>. Some colleagues who worked
on the <a href="https://www.fun-mooc.fr/courses/UPSUD/42001S06/session06/about">statistics introductory course with R on FUN</a> also reported
several issues with a few functions (<code>plotmeans</code> from <code>gplots</code>,
<code>survfit</code> from <code>survival</code>, or <code>hclust</code>) whose default parameters had
changed over the years. It is thus probably good practice to give
explicit values for all parameters (which can be cumbersome) instead
of relying on default values, and to restrict your dependencies as much
as possible.
</p>
<p>
This being said, the R development community is generally quite
careful about stability. We (the authors of this MOOC) believe that open
source (which makes it possible to inspect how computation is done and to
identify both mistakes and sources of non-reproducibility) is more
important than the rock solid stability of SAS, which is proprietary
software. Yet, if you really need to stay with SAS (similar solutions
probably exist for other languages as well), you should know that SAS
can be used within Jupyter using either the <a href="https://sassoftware.github.io/sas_kernel/">Python SASKernel</a> or the
<a href="https://sassoftware.github.io/saspy/">Python SASPy</a> package (step by step explanations about this are given
<a href="https://app-learninglab.inria.fr/gitlab/85bc36e0a8096c618fbd5993d1cca191/mooc-rr/blob/master/documents/tuto_jupyter_windows/tuto_jupyter_windows.md">here</a>). Using such a literate programming approach, combined with systematic
version and environment control, will always help.
</p>
</div>
</div>
<div id="outline-container-org1d2d532" class="outline-2">
<h2 id="org1d2d532">Controlling your software environment</h2>
<div class="outline-text-2" id="text-org1d2d532">
<p>
As we mentioned in the video sequences, there are several solutions to
control your environment:
</p>
<ul class="org-ul">
<li style="margin-bottom:0;">The easy (preserve the mess) ones: <a href="http://www.pgbovine.net/cde.html">CDE</a> or <a href="https://vida-nyu.github.io/reprozip/">ReproZip</a></li>
<li style="margin-bottom:0;">The more demanding (encourage cleanliness) ones, where you start with a
clean environment and install only what's strictly necessary (and document it):
<ul class="org-ul">
<li style="margin-bottom:0;">The very well known <a href="https://www.docker.io/">Docker</a></li>
<li style="margin-bottom:0;"><a href="https://singularity.lbl.gov/">Singularity</a> or <a href="https://spack.io/">Spack</a>, which are more targeted toward the specific
needs of high performance computing users</li>
<li style="margin-bottom:0;"><a href="https://www.gnu.org/software/guix/">Guix</a> and <a href="https://nixos.org/">Nix</a>, which are very clean (perfect?) solutions to this
dependency hell and which we recommend</li>
</ul></li>
</ul>
<p>
It may be hard to understand the differences between these
approaches and to decide which one is better in your context.
</p>
<p>
Here is a webinar where some of these tools are demoed in a
reproducible research context: <a href="https://github.com/alegrand/RR_webinars/blob/master/2_controling_your_environment/index.org">Controling your environment (by Michael
Mercier and Cristian Ruiz)</a>
</p>
<p>
You may also want to have a look at <a href="http://falsifiable.us/">the Popper conventions</a> (<a href="https://github.com/alegrand/RR_webinars/blob/master/11_popper/index.org">webinar by
Ivo Gimenez through google hangout</a>) or at the <a href="https://github.com/alegrand/RR_webinars/blob/master/7_publications/index.org">presentation of Konrad
Hinsen on Active Papers</a> (<a href="http://www.activepapers.org/">http://www.activepapers.org/</a>).
</p>
</div>
</div>
<div id="outline-container-org50da419" class="outline-2">
<h2 id="org50da419">Preservation/Archiving</h2>
<div class="outline-text-2" id="text-org50da419">
<p>
Ensuring software is properly archived, i.e., safely stored so that
it remains accessible in the long term, can be quite tricky. If you
have never seen <a href="https://github.com/alegrand/RR_webinars/blob/master/5_archiving_software_and_data/index.org">Roberto Di Cosmo presenting the Software Heritage
project</a> (<a href="https://www.softwareheritage.org/">https://www.softwareheritage.org/</a>), this is a must-see.
</p>
<p>
For regular data, we highly recommend using <a href="https://www.zenodo.org/">https://www.zenodo.org/</a>
whenever the data is not sensitive.
</p>
</div>
</div>
<div id="outline-container-org5d2f9e5" class="outline-2">
<h2 id="org5d2f9e5">Workflows</h2>
<div class="outline-text-2" id="text-org5d2f9e5">
<p>
In the video sequences, we mentioned workflow managers (original application domain in parentheses):
</p>
<ul class="org-ul">
<li style="margin-bottom:0;"><a href="https://galaxyproject.org/">Galaxy</a> (genomics), <a href="https://kepler-project.org/">Kepler</a> (ecology), <a href="https://taverna.apache.org/">Taverna</a> (bio-informatics), <a href="https://pegasus.isi.edu/">Pegasus</a>
(astronomy), <a href="http://cknowledge.org/">Collective Knowledge</a> (compiler optimization),
<a href="https://www.vistrails.org">VisTrails</a> (image processing)</li>
<li style="margin-bottom:0;">Light-weight: <a href="http://dask.pydata.org/">dask</a> (python), <a href="https://ropensci.github.io/drake/">drake</a> (R), <a href="http://swift-lang.org/">swift</a> (molecular biology),
<a href="https://snakemake.readthedocs.io/">snakemake</a> (like <code>make</code> but more expressive and in <code>python</code>) &#x2026;</li>
<li style="margin-bottom:0;">Hybrids: <a href="https://vatlab.github.io/sos-docs/">SOS-notebook</a>, &#x2026;</li>
</ul>
<p>
You may want to have a look at this webinar: <a href="https://github.com/alegrand/RR_webinars/blob/master/6_reproducibility_bioinformatics/index.org">Reproducible Science in
Bio-informatics: Current Status, Solutions and Research Opportunities
(by Sarah Cohen Boulakia, Yvan Le Bras and Jérôme Chopard).</a>
</p>
</div>
</div>
<div id="outline-container-orgad41259" class="outline-2">
<h2 id="orgad41259">Numerical and statistical issues</h2>
<div class="outline-text-2" id="text-orgad41259">
<p>
We have mentioned these topics in our MOOC but could in no way
cover them properly. Here we only suggest a few interesting talks
on the subject.
</p>
<ul class="org-ul">
<li style="margin-bottom:0;"><a href="https://github.com/alegrand/RR_webinars/blob/master/10_statistics_and_replication_in_HCI/index.org">In this talk, Pierre Dragicevic provides a nice illustration of the
consequences of statistical uncertainty and of how some concepts
(e.g., p-values) are commonly misunderstood.</a></li>
<li style="margin-bottom:0;"><a href="https://github.com/alegrand/RR_webinars/blob/master/3_numerical_reproducibility/index.org">Nathalie Revol, Philippe Langlois and Stef Graillat present the main
challenges encountered when trying to achieve numerical
reproducibility and present recent research work on this topic.</a></li>
</ul>
</div>
</div>
<div id="outline-container-org7321a51" class="outline-2">
<h2 id="org7321a51">Publication practices</h2>
<div class="outline-text-2" id="text-org7321a51">
<p>
You may want to have a look at the following talks:
</p>
<ul class="org-ul">
<li style="margin-bottom:0;"><a href="https://github.com/alegrand/RR_webinars/blob/master/8_artifact_evaluation/index.org">Enabling open and reproducible research at computer systems’
conferences (by Grigori Fursin)</a>. In particular, this talk discusses
<i>artifact evaluation</i>, which is becoming more and more popular.</li>
<li style="margin-bottom:0;"><a href="https://github.com/alegrand/RR_webinars/blob/master/7_publications/index.org">Publication Modes Favoring Reproducible Research (by Konrad Hinsen
and Nicolas Rougier)</a>. In this talk, the motivation for the <a href="http://rescience.github.io/">ReScience
journal</a> initiative is presented.</li>
<li style="margin-bottom:0;"><a href="https://www.youtube.com/watch?v=HuJ2G8rXHMs">Simine Vazire - When Should We be Skeptical of Scientific Claims?</a>,
which discusses publication practices in social sciences and in
particular HARKing (Hypothesizing After the Results are Known),
p-hacking, etc.</li>
</ul>
</div>
</div>
<div id="outline-container-orge4adad6" class="outline-2">
<h2 id="orge4adad6">Experimentation</h2>
<div class="outline-text-2" id="text-orge4adad6">
<p>
Experimentation was not covered in this MOOC, although it is an
essential part of science. The main reason is that practices and
constraints can vary so wildly from one domain to another that it could
not be properly covered in a first edition. We would be happy to
gather references you consider interesting in your domain, so do not
hesitate to share them on the forum and we
will update this page.
</p>
<ul class="org-ul">
<li style="margin-bottom:0;"><a href="https://github.com/alegrand/RR_webinars/blob/master/9_experimental_testbeds/index.org">A recent talk by Lucas Nussbaum on Experimental Testbeds in Computer
Science</a>.</li>
</ul>
</div>
</div>
</div>
# -*- mode: org -*-
#+TITLE: Additional references
#+AUTHOR: Arnaud Legrand
#+DATE: June, 2018
#+STARTUP: overview indent
#+OPTIONS: num:nil toc:t
#+PROPERTY: header-args :eval never-export
* "Thoughts" on language/software stability
As we explained, the programming language used in an analysis has a
clear influence on the reproducibility of your analysis. It is not a
characteristic of the language itself but rather a consequence of the
development philosophy of the underlying community. For example C is a
very stable language with a [[https://en.wikipedia.org/wiki/C_(programming_language)#ANSI_C_and_ISO_C][very clear specification designed by a
committee]] (even though some compilers may not respect this norm).
On the other end of the spectrum, [[https://en.wikipedia.org/wiki/Python_(programming_language)][Python]] had a much more organic
development based on a readability philosophy and valuing continuous
improvement over backwards-compatibility. Furthermore, Python is
commonly used as a wrapping language (e.g., to easily use C or FORTRAN
libraries) and has its own packaging system. All these design choices
tend to make reproducibility often a bit painful with Python, even
though the community is slowly taking this into account. The transition from Python 2 to the not fully backwards compatible Python 3 has been a particularly painful process, not least because the two languages are so similar that it is not always easy to figure out if a given script or module is written in Python 2 or Python 3. It isn't even rare to see Python scripts that work under both Python 2 and Python 3, but produce different results due to the change in the behavior of integer division.
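As a minimal illustration of the integer-division pitfall (the variable names are ours and do not come from any particular script), the following lines run without error under both Python 2 and Python 3, yet the first result differs between the two:

```python
# Meant to be run under Python 3.
total, count = 7, 2

true_division = total / count    # 3.5 in Python 3, but 3 in Python 2
floor_division = total // count  # 3 in both Python 2 and Python 3

print(true_division, floor_division)
```

Writing =//= whenever floor division is intended, or adding =from __future__ import division= at the top of scripts that must still run under Python 2, gives the same behavior in both versions.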
[[https://en.wikipedia.org/wiki/R_(programming_language)][R]], in comparison, is much closer (in terms of developer community) to
languages like [[https://en.wikipedia.org/wiki/SAS_(software)][SAS]], which is heavily used in the pharmaceutical
industry where statistical procedures need to be standardized and rock
solid/stable. R is obviously not immune to evolutions that break old
versions and hinder reproducibility/backward compatibility. Here is a
relatively recent [[http://members.cbio.mines-paristech.fr/~thocking/HOCKING-reproducible-research-with-R.html][true story about this]]. Some colleagues who worked
on the [[https://www.fun-mooc.fr/courses/UPSUD/42001S06/session06/about][statistics introductory course with R on FUN]] also reported
several issues with a few functions (=plotmeans= from =gplots=,
=survfit= from =survival=, or =hclust=) whose default parameters had
changed over the years. It is thus probably good practice to give
explicit values for all parameters (which can be cumbersome) instead
of relying on default values, and to restrict your dependencies as much
as possible.
This being said, the R development community is generally quite
careful about stability. We (the authors of this MOOC) believe that open
source (which allows one to inspect how computations are done and to
identify both mistakes and sources of non-reproducibility) is more
important than the rock solid stability of SAS, which is proprietary
software. Yet, if you really need to stay with SAS (similar solutions
probably exist for other languages as well), you should know that SAS
can be used within Jupyter using either the [[https://sassoftware.github.io/sas_kernel/][Python SASKernel]] or the
[[https://sassoftware.github.io/saspy/][Python SASPy]] package (step by step explanations about this are given
[[https://app-learninglab.inria.fr/gitlab/85bc36e0a8096c618fbd5993d1cca191/mooc-rr/blob/master/documents/tuto_jupyter_windows/tuto_jupyter_windows.md][here]]). Using such a literate programming approach, combined with
systematic version and environment control, will always help.
* Controlling your software environment
As we mentioned in the video sequences, there are several solutions to
control your environment:
- The easy (preserve the mess) ones: [[http://www.pgbovine.net/cde.html][CDE]] or [[https://vida-nyu.github.io/reprozip/][ReproZip]]
- The more demanding ones (encourage cleanliness), where you start with a
  clean environment and install only what's strictly necessary (and document it):
  - The very well known [[https://www.docker.io/][Docker]]
  - [[https://singularity.lbl.gov/][Singularity]] or [[https://spack.io/][Spack]], which are more targeted toward the specific
    needs of high performance computing users
  - [[https://www.gnu.org/software/guix/][Guix]] and [[https://nixos.org/][Nix]], which are very clean (perfect?) solutions to this
    dependency hell and which we recommend
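Whichever tool you choose, the "document it" part can start very simply. The following is a minimal sketch, assuming Python 3.8 or later (for =importlib.metadata=). The function name =describe_environment= and the layout of the report are choices made for this illustration, not a convention of any of the tools above.

```python
import json
import platform
import sys
from importlib import metadata


def describe_environment():
    """Return a dictionary describing the current Python environment.

    The keys ("python", "platform", "packages") are an arbitrary layout
    chosen for this sketch.
    """
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip distributions whose metadata lacks a name
            packages[name] = dist.version
    return {
        "python": sys.version,
        "platform": platform.platform(),
        # Sort case-insensitively by package name for stable, diff-friendly output.
        "packages": dict(sorted(packages.items(), key=lambda kv: kv[0].lower())),
    }


if __name__ == "__main__":
    # Print the report as JSON so it can be stored next to the results.
    print(json.dumps(describe_environment(), indent=2))
```

Saving this output alongside each result does not make the environment reproducible by itself, but it records what was actually installed when the computation ran.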
It may be hard to understand the difference between these different
approaches and decide which one is better in your context.
Here is a webinar where some of these tools are demoed in a
reproducible research context: [[https://github.com/alegrand/RR_webinars/blob/master/2_controling_your_environment/index.org][Controling your environment (by Michael
Mercier and Cristian Ruiz)]]
You may also want to have a look at [[http://falsifiable.us/][the Popper conventions]] ([[https://github.com/alegrand/RR_webinars/blob/master/11_popper/index.org][webinar by
Ivo Gimenez through google hangout]]) or at the [[https://github.com/alegrand/RR_webinars/blob/master/7_publications/index.org][presentation of Konrad
Hinsen on Active Papers]] (http://www.activepapers.org/).
* Preservation/Archiving
Ensuring software is properly archived, i.e., is safely stored so that
it can be accessed in a perennial way, can be quite tricky. If you
have never seen [[https://github.com/alegrand/RR_webinars/blob/master/5_archiving_software_and_data/index.org][Roberto Di Cosmo presenting the Software Heritage
project]], this is a must see. https://www.softwareheritage.org/
For regular data, we highly recommend using https://www.zenodo.org/
whenever the data is not sensitive.
* Workflows
In the video sequences, we mentioned workflow managers (original application domain in parentheses):
- [[https://galaxyproject.org/][Galaxy]] (genomics), [[https://kepler-project.org/][Kepler]] (ecology), [[https://taverna.apache.org/][Taverna]] (bio-informatics), [[https://pegasus.isi.edu/][Pegasus]]
(astronomy), [[http://cknowledge.org/][Collective Knowledge]] (compilation optimization),
[[https://www.vistrails.org][VisTrails]] (image processing)
- Light-weight: [[http://dask.pydata.org/][dask]] (python), [[https://ropensci.github.io/drake/][drake]] (R), [[http://swift-lang.org/][swift]] (molecular biology),
[[https://snakemake.readthedocs.io/][snakemake]] (like =make= but more expressive and in =python=) ...
- Hybrids: [[https://vatlab.github.io/sos-docs/][SOS-notebook]], ...
You may want to have a look at this webinar: [[https://github.com/alegrand/RR_webinars/blob/master/6_reproducibility_bioinformatics/index.org][Reproducible Science in
Bio-informatics: Current Status, Solutions and Research Opportunities
(by Sarah Cohen Boulakia, Yvan Le Bras and Jérôme Chopard).]]
* Numerical and statistical issues
We have mentioned these topics in our MOOC but we could by no means
cover them properly. We only suggest here a few interesting talks
on these topics.
- [[https://github.com/alegrand/RR_webinars/blob/master/10_statistics_and_replication_in_HCI/index.org][In this talk, Pierre Dragicevic provides a nice illustration of the
consequences of statistical uncertainty and of how some concepts
(e.g., p-values) are commonly misunderstood.]]
- [[https://github.com/alegrand/RR_webinars/blob/master/3_numerical_reproducibility/index.org][Nathalie Revol, Philippe Langlois and Stef Graillat present the main
challenges encountered when trying to achieve numerical
reproducibility and present recent research work on this topic.]]
* Publication practices
You may want to have a look at the following talks:
- [[https://github.com/alegrand/RR_webinars/blob/master/8_artifact_evaluation/index.org][Enabling open and reproducible research at computer systems’
conferences (by Grigori Fursin)]]. In particular, this talk discusses
/artifact evaluation/ that is becoming more and more popular.
- [[https://github.com/alegrand/RR_webinars/blob/master/7_publications/index.org][Publication Modes Favoring Reproducible Research (by Konrad Hinsen
and Nicolas Rougier)]]. In this talk, the motivation for the [[http://rescience.github.io/][ReScience
journal]] initiative is presented.
- [[https://www.youtube.com/watch?v=HuJ2G8rXHMs][Simine Vazire - When Should We be Skeptical of Scientific Claims?]],
which discusses publication practices in social sciences and in
particular HARKing (Hypothesizing After the Results are Known),
p-hacking, etc.
* Experimentation
Experimentation was not covered in this MOOC, although it is an
essential part of science. The main reason is that practices and
constraints can vary so wildly from one domain to another that it could
not be properly covered in a first edition. We would be happy to
gather references that you consider interesting in your domain, so do
not hesitate to share them with us on the forum and we will update
this page.
- [[https://github.com/alegrand/RR_webinars/blob/master/9_experimental_testbeds/index.org][A recent talk by Lucas Nussbaum on Experimental Testbeds in Computer
Science]].