diff --git a/module4/ressources/resources.html b/module4/ressources/resources.html index 7a20bb9b409e23159e925bf71c3cd9ccc8a1772f..329189baea71204206ab1afa2ec26f7eda6a741f 100644 --- a/module4/ressources/resources.html +++ b/module4/ressources/resources.html @@ -3,25 +3,33 @@

Table of Contents

-
-

Getting information about your Git repository

-
+
+

Additional references

+
+
+
+

"Thoughts" on language/software stability

+
+

+As we explained, the programming language used in an analysis has a +clear influence on the reproducibility of your analysis. It is not a +characteristic of the language itself but rather a consequence of the +development philosophy of the underlying community. For example, C is a +very stable language with a very clear specification designed by a +committee (even though some compilers may not respect this standard). +

+ +

+On the other end of the spectrum, Python had a much more organic +development based on a readability philosophy and has evolved with +time. Furthermore, Python is commonly used as a wrapping language +(e.g., to easily use C or FORTRAN libraries) and has its own packaging +system to make everyone's life easier. All these design choices often +make reproducibility a bit painful with Python, even though +the community is slowly taking this into account. +

+ +

+R, in comparison, is much closer (in terms of developer community) to +languages like SAS, which is heavily used in the pharmaceutical +industry where statistical procedures need to be standardized and rock +solid/stable. R is obviously not immune to evolutions that break old +versions and hinder reproducibility/backward compatibility. Here is a +relatively recent true story about this: some colleagues who worked +on the statistics introductory course with R on FUN reported to us +several issues with a few functions (plotmeans from +gplots, survfit from survival, or hclust) whose default +parameters had changed over the years. It is thus probably good +practice to explicitly indicate default values in your code (which +can be cumbersome) and to restrict your dependencies as much as +possible. +
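The advice about writing out default values applies to any language. Here is a minimal sketch in Python rather than R (a deliberate substitution, using only the standard library, Python 3.8 or later). The sample data is made up for illustration, and `statistics.quantiles` merely stands in for functions such as plotmeans or hclust; it is not one of the functions our colleagues reported.

```python
import statistics

# Hypothetical sample data, used only for illustration.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

# Implicit call: the result depends on whatever the library's
# defaults happen to be in the installed version.
implicit = statistics.quantiles(data)

# Explicit call: the defaults documented for Python 3.8+ are written out,
# so a future change of defaults cannot silently alter the result.
explicit = statistics.quantiles(data, n=4, method="exclusive")

# A different method gives different cut points on the same data,
# which is exactly the kind of change a new default would introduce.
inclusive = statistics.quantiles(data, n=4, method="inclusive")

print("implicit :", implicit)
print("explicit :", explicit)
print("inclusive:", inclusive)
```

The explicit call is more verbose, but it documents the assumption in the code itself instead of leaving it to the installed library version.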

+ +

+This being said, the R development community is generally quite +careful about stability. We (the authors of this MOOC) think open +source (which allows one to inspect how computation is done and to +identify both mistakes and sources of non-reproducibility) is more +important than the rock solid stability of SAS, which is +proprietary software. Yet, if you really need to stay with SAS (similar solutions +probably exist for other languages as well), you should know that SAS +can be used within Jupyter using either the Python SASKernel or the +Python SASPy package (step by step explanations about this are given +here). Using such a literate programming approach, combined with systematic +version control and environment control, will help anyway. +

+
+
+
+

Controlling your software environment

+
+

+As we mentioned in the video sequences, there are several solutions to +control your environment: +

+
    +
  • The easy (preserve the mess) ones: CDE or ReproZip
  • +
  • The more demanding (encourage cleanliness) ones, where you start with a +clean environment and install only what's strictly necessary (and document it): +
      +
    • The very well known Docker
    • +
    • Singularity or Spack, which are more targeted toward +high-performance computing users who have specific needs
    • +
    • Guix and Nix, which are very clean (perfect?) solutions to this +dependency hell and which we recommend
    • +
  • +
+ +

+It may be hard to understand the differences between these +approaches and to decide which one is best in your context. +

+ +

+Here is a webinar where some of these tools are demoed in a +reproducible research context: Controling your environment (by Michael +Mercier and Cristian Ruiz) +

+ +

+You may also want to have a look at the Popper conventions (webinar by +Ivo Gimenez through Google Hangout) or at the presentation of Konrad +Hinsen on Active Papers (http://www.activepapers.org/). +

+
+
+
+

Preservation/Archiving

+
+

+Ensuring software is properly archived, i.e., is safely stored so that +it can be accessed in a perennial way, can be quite tricky. If you +have never seen Roberto Di Cosmo presenting the Software Heritage +project, this is a must-see: https://www.softwareheritage.org/ +

+ +

+For regular data, we highly recommend using https://www.zenodo.org/ +whenever data is not sensitive. +

+
+
+
+

Workflows

+
+

+In the video sequences, we mentioned workflows (original domain in parentheses): +

+ + +

+You may want to have a look at this webinar: Reproducible Science in +Bio-informatics: Current Status, Solutions and Research Opportunities +(by Sarah Cohen Boulakia, Yvan Le Bras and Jérôme Chopard). +

+
+
+ + +
+

Publication practices

+
+

+You may want to have a look at the following two webinars: +

+ +
+
+
+

Experimentation

+
+

+Experimentation was not covered in this MOOC although it is an +essential part of science. The main reason is that practices and +constraints can vary so wildly from one domain to another that it could +not be properly covered in a first edition. We would be happy to +gather references you consider interesting in your domain, so do not +hesitate to provide us with such references by using the forum and we +will update this page. +

+ + +
+
+
+
+

Tracking environment information

+
+
+
+

Getting information about your Git repository

+

When taking notes, it may be difficult to remember which version of the code or of a file was used. This is what version control is useful @@ -123,13 +327,13 @@ is the price to pay for running git from within the notebook itself.
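As a complement, here is a minimal Python sketch of how a notebook cell could record which commit it was run from. The helper names `git_commit_hash` and `git_is_dirty` are hypothetical (they are not part of the MOOC material); the sketch assumes Python 3.7 or later and that the `git` command is available, and it returns `None` rather than failing when it is not.

```python
import subprocess


def _run_git(args, repo_dir="."):
    """Run a git command and return its standard output, or None on failure."""
    try:
        result = subprocess.run(
            ["git"] + list(args),
            cwd=repo_dir,
            capture_output=True,
            text=True,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        # git is missing, the directory does not exist, or it is not a repository.
        return None
    return result.stdout


def git_commit_hash(repo_dir="."):
    """Return the hash of the current commit, or None if it cannot be read."""
    out = _run_git(["rev-parse", "HEAD"], repo_dir)
    return out.strip() if out is not None else None


def git_is_dirty(repo_dir="."):
    """Return True if there are uncommitted changes, None if unknown."""
    out = _run_git(["status", "--porcelain"], repo_dir)
    return bool(out.strip()) if out is not None else None


print("commit:", git_commit_hash())
print("uncommitted changes:", git_is_dirty())
```

Recording whether the working tree has uncommitted changes matters as much as the hash itself, since a hash alone does not describe files that were modified but not committed.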

-
-

Getting information about Python(3) libraries

-
+
+

Getting information about Python(3) libraries

+
-
-

Getting the list of installed packages and their version

-
+
+

Getting the list of installed packages and their version

+

This topic is discussed on StackOverflow. When using pip (the Python package installer) within a shell command, it is easy to query the @@ -237,9 +441,9 @@ Requires: patsy, pandas
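The same information can also be obtained from within Python, without calling pip in a shell. The sketch below is one possible approach, assuming Python 3.8 or later (where `importlib.metadata` is in the standard library, which is newer than the Python 3.6 deployed for the MOOC); the function name `installed_packages` is hypothetical.

```python
from importlib.metadata import distributions


def installed_packages():
    """Return a sorted list of (name, version) pairs for installed distributions."""
    found = set()
    for dist in distributions():
        name = dist.metadata["Name"]
        if name:  # skip distributions whose metadata is broken or incomplete
            found.add((name, dist.version))
    return sorted(found, key=lambda pair: pair[0].lower())


for name, version in installed_packages():
    print(f"{name}=={version}")
```

The output uses the `name==version` form so that it can be pasted next to the results of an analysis or compared with a requirements file.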

-
-

How to list imported modules?

-
+
+

How to list imported modules?

+

Without resorting to pip (that will list all available packages), you may want to know which modules are loaded in a Python session as well @@ -300,9 +504,9 @@ zlib 1.0
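For reference, a minimal sketch of this idea using only `sys.modules` is given below. It assumes that modules expose their version through a `__version__` attribute, which is a common convention but not a guarantee, so modules without it are simply skipped; the function name `loaded_module_versions` is hypothetical.

```python
import sys


def loaded_module_versions():
    """Map each loaded top-level module to its __version__, when it has one."""
    versions = {}
    # Copy the items first: sys.modules may change while we iterate.
    for name, module in list(sys.modules.items()):
        if "." in name or name.startswith("_"):
            continue  # keep only public top-level modules
        try:
            version = getattr(module, "__version__", None)
        except Exception:
            continue  # some lazily loaded modules fail on attribute access
        if isinstance(version, str):
            versions[name] = version
    return dict(sorted(versions.items()))


for name, version in loaded_module_versions().items():
    print(name, version)
```

Only modules already imported in the current session are listed, so this should be run at the end of a notebook rather than at the beginning.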

-
-

Setting up an environment with pip

-
+
+

Setting up an environment with pip

+

The easiest way to go is as follows:

@@ -319,9 +523,9 @@ dynamic libraries that are wrapped by Python though.
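Once versions have been pinned, it can be useful to check that the current environment still matches them. The following sketch is one way to do so, assuming Python 3.8 or later and requirement lines of the simple `name==version` form; the function name `check_pins` and the package name used in the example are hypothetical, and other requirement syntaxes are ignored.

```python
from importlib.metadata import PackageNotFoundError, version


def check_pins(lines):
    """Return {name: (expected, found)} for every pin that is not satisfied.

    `found` is None when the package is not installed at all.
    """
    mismatches = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#") or "==" not in line:
            continue  # ignore blanks, comments and unpinned requirements
        name, expected = (part.strip() for part in line.split("==", 1))
        try:
            found = version(name)
        except PackageNotFoundError:
            found = None
        if found != expected:
            mismatches[name] = (expected, found)
    return mismatches


# Hypothetical pin, used only to show the shape of the result.
print(check_pins(["# pinned dependencies", "some-package-that-is-not-installed==1.0"]))
```

As noted above, such a check only covers Python packages, not the dynamic libraries they wrap.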

-
-

Installing a new package or a specific version

-
+
+

Installing a new package or a specific version

+

The Jupyter environment we deployed on our servers for the MOOC is based on the version 4.5.4 of Miniconda and Python 3.6. In this @@ -388,13 +592,13 @@ It is even possible to install a specific (possibly much older) version, e.g.,:

-
-

Getting information about R libraries

-
+
+

Getting information about R libraries

+
-
-

Getting the list imported modules and their version

-
+
+

Getting the list imported modules and their version

+

The best way seems to be to rely on the devtools package (if this package is not installed, you should install it first by running in R @@ -462,9 +666,9 @@ clean R dependency management should thus have a look at -

Getting the list of installed packages and their version

-
+
+

Getting the list of installed packages and their version

+

Finally, it is good to know that there is a built-in R command (installed.packages) allowing to retrieve and list the details of all @@ -719,9 +923,9 @@ packages installed.

-
-

Installing a new package or a specific version

-
+
+

Installing a new package or a specific version

+

This section is mostly a cut and paste from the recent post by Ian Pylvainen on this topic. It comprises a very clear explanation on how @@ -729,9 +933,9 @@ to proceed.

-
-

Installing a pre-compiled version

-
+
    +
  • Installing a pre-compiled version
    +

    If you're on a Debian or a Ubuntu system, it may be difficult to access a specific version without breaking your system. So unless you @@ -754,10 +958,9 @@ install.packages(packageurl, repos= -

    Using devtools

    -
    +
  • +
  • Using devtools
    +

    The simplest method to install the version you need is to use the install_version() function of the devtools package (obviously, you @@ -771,10 +974,9 @@ install_version("ggplot2", version =

-
-
-

Alternatively, you may want to install an older package from source

-
+ +
  • Alternatively, you may want to install an older package from source
    +

    If devtools fails or if you do not want to depend on it, you can install it from source via install.packages() directed to the right @@ -801,10 +1003,9 @@ R CMD INSTALL ggplot2_0.9.1.tar.gz

  • -
    -
    -

    Potential issues

    -
    + +
  • Potential issues
    +

    There are a few potential issues that may arise with installing older versions of packages: @@ -818,6 +1019,8 @@ to downgrade R to a compatible version or update your R code to work with a newer version of the package.

  • + +