From 0130fd48009204c3293c200c52501b8edc725c97 Mon Sep 17 00:00:00 2001
From: Arnaud Legrand
Date: Tue, 9 Oct 2018 14:47:28 +0200
Subject: [PATCH] Split resources in two documents

---
 module4/ressources/Makefile                   |   2 +-
 ...ources.html => resources_environment.html} | 330 ++++--------------
 ...esources.org => resources_environment.org} | 151 +-------
 module4/ressources/resources_refs.html        | 205 +++++++++++
 module4/ressources/resources_refs.org         | 125 +++++++
 5 files changed, 411 insertions(+), 402 deletions(-)
 rename module4/ressources/{resources.html => resources_environment.html} (63%)
 rename module4/ressources/{resources.org => resources_environment.org} (67%)
 create mode 100644 module4/ressources/resources_refs.html
 create mode 100644 module4/ressources/resources_refs.org

diff --git a/module4/ressources/Makefile b/module4/ressources/Makefile
index 0a519c5..4727054 100644
--- a/module4/ressources/Makefile
+++ b/module4/ressources/Makefile
@@ -1,4 +1,4 @@
-all: resources.html exo1.html exo2.html exo3.html
+all: resources_refs.html resources_environment.html exo1.html exo2.html exo3.html

 NLINES=10000000

diff --git a/module4/ressources/resources.html b/module4/ressources/resources_environment.html
similarity index 63%
rename from module4/ressources/resources.html
rename to module4/ressources/resources_environment.html
index b82d01c..497c7a6 100644
--- a/module4/ressources/resources.html
+++ b/module4/ressources/resources_environment.html
@@ -1,36 +1,29 @@
+

Tracking environment information

-
-

Additional references

-
-
-
-

"Thoughts" on language/software stability

-
-

-As we explained, the programming language used in an analysis has a -clear influence on the reproducibility of your analysis. It is not a -characteristic of the language itself but rather a consequence of the -development philosophy of the underlying community. For example C is a -very stable language with a very clear specification designed by a -committee (even though some compilers may not respect this norm). -

- -

-On the other end of the spectrum, Python had a much more organic -development based on a readability philosophy and valuing continuous -improvement over backwards-compatibility. Furthermore, Python is -commonly used as a wrapping language (e.g., to easily use C or FORTRAN -libraries) and has its own packaging system. All these design choices -tend to make reproducibility often a bit painful with Python, even -though the community is slowly taking this into account. The transition from Python 2 to the not fully backwards compatible Python 3 has been a particularly painful process, not least because the two languages are so similar that is it not always easy to figure out if a given script or module is written in Python 2 or Python 3. It isn't even rare to see Python scripts that work under both Python 2 and Python 3, but produce different results due to the change in the behavior of integer division. -

- -

-R, in comparison is much closer (in terms of developer community) to -languages like SAS, which is heavily used in the pharmaceutical -industry where statistical procedures need to be standardized and rock -solid/stable. R is obviously not immune to evolutions that break old -versions and hinder reproducibility/backward compatibility. Here is a -relatively recent true story about this and some colleagues who worked -on the statistics introductory course with R on FUN reported us -several issues with a few functions (plotmeans from gplots, -survfit from survival, or hclust) whose default parameters had -changed over the years. It is thus probably good practice to give -explicit values for all parameters (which can be cumbersome) instead -of relying on default values, and to restrict your dependencies as much -as possible. -

- -

-This being said, the R development community is generally quite -careful about stability. We (the authors of this MOOC) believe that open -source (which allows to inspect how computation is done and to -identify both mistakes and sources of non-reproducibility) is more -important than the rock solid stability of SAS, which is proprietary -software. Yet, if you really need to stay with SAS (similar solutions -probably exist for other languages as well), you should know that SAS -can be used within Jupyter using either the Python SASKernel or the -Python SASPy package (step by step explanations about this are given -here). Using such literate programming approach allied with systematic -version and environment control will always help. -

-
-
-
-

Controlling your software environment

-
-

-As we mentioned in the video sequences, there are several solutions to -control your environment: -

-
    -
  • The easy (preserve the mess) ones: CDE or ReproZip
  • -
  • The more demanding (encourage cleanliness) where you start with a -clean environment and install only what's strictly necessary (and document it): -
      -
    • The very well known Docker
    • -
    • Singularity or Spack, which are more targeted toward the specific -needs of high performance computing users
    • -
    • Guix, Nix that are very clean (perfect?) solutions to this -dependency hell and which we recommend
    • -
  • -
- -

-It may be hard to understand the difference between these different -approaches and decide which one is better in your context. -

- -

-Here is a webinar where some of these tools are demoed in a -reproducible research context: Controling your environment (by Michael -Mercier and Cristian Ruiz) -

- -

-You may also want to have a look at the Popper conventions (webinar by -Ivo Gimenez through google hangout) or at the presentation of Konrad -Hinsen on Active Papers (http://www.activepapers.org/). -

-
-
-
-

Preservation/Archiving

-
-

-Ensuring software is properly archived, i.e, is safely stored so that -it can be accessed in a perennial way, can be quite tricky. If you -have never seen Roberto Di Cosmo presenting the Software Heritage -project, this is a must see. https://www.softwareheritage.org/ -

- -

-For regular data, we highly recommend using https://www.zenodo.org/ -whenever the data is not sensitive. -

-
-
-
-

Workflows

-
-

-In the video sequences, we mentioned workflow managers (original application domain in parenthesis): -

- - -

-You may want to have a look at this webinar: Reproducible Science in -Bio-informatics: Current Status, Solutions and Research Opportunities -(by Sarah Cohen Boulakia, Yvan Le Bras and Jérôme Chopard). -

-
-
- - -
-

Publication practices

-
-

-You may want to have a look at the following two webinars: -

- -
-
-
-

Experimentation

-
-

-Experimentation was not covered in this MOOC, although it is an -essential part of science. The main reason is that practices and -constraints can vary so wildly from one domain to another that it could -not be properly covered in a first edition. We would be happy to -gather references you consider as interesting in your domain so do not -hesitate to provide us with such references by using the forum and we -will update this page. -

- - -
-
-
-
-

Tracking environment information

-
-
-
-

Getting information about your Git repository

-
+
+

Getting information about your Git repository

+

When taking notes, it may be difficult to remember which version of the code or of a file was used. This is what version control is useful @@ -328,13 +125,13 @@ is the price to pay for running git from within the notebook itself.

-
-

Getting information about Python(3) libraries

-
+
+

Getting information about Python(3) libraries

+
-
-

Getting information about your system

-
+
+

Getting information about your system

+

This topic is discussed on StackOverflow.

@@ -351,9 +148,9 @@ uname_result(system='Linux', node='icarus', release='4.15.0-2-amd64', version='#
-
-

Getting the list of installed packages and their version

-
+
+

Getting the list of installed packages and their version

+

This topic is discussed on StackOverflow. When using pip (the Python package installer) within a shell command, it is easy to query the @@ -461,9 +258,9 @@ Requires: patsy, pandas

-
-

How to list imported modules?

-
+
+

How to list imported modules?

+

Without resorting to pip (that will list all available packages), you may want to know which modules are loaded in a Python session as well @@ -525,9 +322,9 @@ zlib 1.0

-
-

Saving and restoring an environment with pip

-
+
+

Saving and restoring an environment with pip

+

The easiest way to go is as follows:

@@ -544,9 +341,9 @@ dynamic libraries that are wrapped by Python though.

-
-

Installing a new package or a specific version

-
+
+

Installing a new package or a specific version

+

The Jupyter environment we deployed on our servers for the MOOC is based on the version 4.5.4 of Miniconda and Python 3.6. In this @@ -613,13 +410,13 @@ It is even possible to install a specific (possibly much older) version, e.g.,:

-
-

Getting information about R libraries

-
+
+

Getting information about R libraries

+
-
-

Getting the list imported modules and their version

-
+
+

Getting the list imported modules and their version

+

The best way seems to be to rely on the devtools package (if this package is not installed, you should install it first by running in R @@ -687,9 +484,9 @@ clean R dependency management should thus have a look at -

Getting the list of installed packages and their version

-
+
+

Getting the list of installed packages and their version

+

Finally, it is good to know that there is a built-in R command (installed.packages) allowing to retrieve and list the details of all @@ -944,9 +741,9 @@ packages installed.

-
-

Installing a new package or a specific version

-
+
+

Installing a new package or a specific version

+

This section is mostly a cut and paste from the recent post by Ian Pylvainen on this topic. It comprises a very clear explanation of how @@ -954,9 +751,9 @@ to proceed.

-
    -
  • Installing a pre-compiled version
    -
    +
    +

    Installing a pre-compiled version

    +

    If you're on a Debian or a Ubuntu system, it may be difficult to access a specific version without breaking your system. So unless you @@ -979,9 +776,10 @@ install.packages(packageurl, repos=Using devtools
    -

    +
    +
    +

    Using devtools

    +

    The simplest method to install the version you need is to use the install_version() function of the devtools package (obviously, you @@ -995,9 +793,10 @@ install_version("ggplot2", version =

    -
  • -
  • Installing from source code
    -
    +
    +
    +

    Installing from source code

    +

    Alternatively, you may want to install an older package from source If devtools fails or if you do not want to depend on it, you can install @@ -1025,9 +824,10 @@ R CMD INSTALL ggplot2_0.9.1.tar.gz

    -
  • -
  • Potential issues
    -
    +
    +
    +

    Potential issues

    +

    There are a few potential issues that may arise with installing older versions of packages: @@ -1041,8 +841,6 @@ to downgrade R to a compatible version or update your R code to work with a newer version of the package.

- -
diff --git a/module4/ressources/resources.org b/module4/ressources/resources_environment.org similarity index 67% rename from module4/ressources/resources.org rename to module4/ressources/resources_environment.org index 566d77d..6e58bcd 100644 --- a/module4/ressources/resources.org +++ b/module4/ressources/resources_environment.org @@ -1,131 +1,12 @@ # -*- mode: org -*- -#+TITLE: +#+TITLE: Tracking environment information #+AUTHOR: Arnaud Legrand #+DATE: June, 2018 #+STARTUP: overview indent #+OPTIONS: num:nil toc:t #+PROPERTY: header-args :eval never-export -* Additional references -** "Thoughts" on language/software stability -As we explained, the programming language used in an analysis has a -clear influence on the reproducibility of your analysis. It is not a -characteristic of the language itself but rather a consequence of the -development philosophy of the underlying community. For example C is a -very stable language with a [[https://en.wikipedia.org/wiki/C_(programming_language)#ANSI_C_and_ISO_C][very clear specification designed by a -committee]] (even though some compilers may not respect this norm). - -On the other end of the spectrum, [[https://en.wikipedia.org/wiki/Python_(programming_language)][Python]] had a much more organic -development based on a readability philosophy and valuing continuous -improvement over backwards-compatibility. Furthermore, Python is -commonly used as a wrapping language (e.g., to easily use C or FORTRAN -libraries) and has its own packaging system. All these design choices -tend to make reproducibility often a bit painful with Python, even -though the community is slowly taking this into account. The transition from Python 2 to the not fully backwards compatible Python 3 has been a particularly painful process, not least because the two languages are so similar that is it not always easy to figure out if a given script or module is written in Python 2 or Python 3. 
It isn't even rare to see Python scripts that work under both Python 2 and Python 3, but produce different results due to the change in the behavior of integer division. - -[[https://en.wikipedia.org/wiki/R_(programming_language)][R]], in comparison is much closer (in terms of developer community) to -languages like [[https://en.wikipedia.org/wiki/SAS_(software)][SAS]], which is heavily used in the pharmaceutical -industry where statistical procedures need to be standardized and rock -solid/stable. R is obviously not immune to evolutions that break old -versions and hinder reproducibility/backward compatibility. Here is a -relatively recent [[http://members.cbio.mines-paristech.fr/~thocking/HOCKING-reproducible-research-with-R.html][true story about this]] and some colleagues who worked -on the [[https://www.fun-mooc.fr/courses/UPSUD/42001S06/session06/about][statistics introductory course with R on FUN]] reported us -several issues with a few functions (=plotmeans= from =gplots=, -=survfit= from =survival=, or =hclust=) whose default parameters had -changed over the years. It is thus probably good practice to give -explicit values for all parameters (which can be cumbersome) instead -of relying on default values, and to restrict your dependencies as much -as possible. - -This being said, the R development community is generally quite -careful about stability. We (the authors of this MOOC) believe that open -source (which allows to inspect how computation is done and to -identify both mistakes and sources of non-reproducibility) is more -important than the rock solid stability of SAS, which is proprietary -software. 
Yet, if you really need to stay with SAS (similar solutions -probably exist for other languages as well), you should know that SAS -can be used within Jupyter using either the [[https://sassoftware.github.io/sas_kernel/][Python SASKernel]] or the -[[https://sassoftware.github.io/saspy/][Python SASPy]] package (step by step explanations about this are given -[[https://app-learninglab.inria.fr/gitlab/85bc36e0a8096c618fbd5993d1cca191/mooc-rr/blob/master/documents/tuto_jupyter_windows/tuto_jupyter_windows.md][here]]). Using such literate programming approach allied with systematic -version and environment control will always help. -** Controlling your software environment -As we mentioned in the video sequences, there are several solutions to -control your environment: -- The easy (preserve the mess) ones: [[http://www.pgbovine.net/cde.html][CDE]] or [[https://vida-nyu.github.io/reprozip/][ReproZip]] -- The more demanding (encourage cleanliness) where you start with a - clean environment and install only what's strictly necessary (and document it): - - The very well known [[https://www.docker.io/][Docker]] - - [[https://singularity.lbl.gov/][Singularity]] or [[https://spack.io/][Spack]], which are more targeted toward the specific - needs of high performance computing users - - [[https://www.gnu.org/software/guix/][Guix]], [[https://nixos.org/][Nix]] that are very clean (perfect?) solutions to this - dependency hell and which we recommend - -It may be hard to understand the difference between these different -approaches and decide which one is better in your context. 
- -Here is a webinar where some of these tools are demoed in a -reproducible research context: [[https://github.com/alegrand/RR_webinars/blob/master/2_controling_your_environment/index.org][Controling your environment (by Michael -Mercier and Cristian Ruiz)]] - -You may also want to have a look at [[http://falsifiable.us/][the Popper conventions]] ([[https://github.com/alegrand/RR_webinars/blob/master/11_popper/index.org][webinar by -Ivo Gimenez through google hangout]]) or at the [[https://github.com/alegrand/RR_webinars/blob/master/7_publications/index.org][presentation of Konrad -Hinsen on Active Papers]] (http://www.activepapers.org/). -** Preservation/Archiving -Ensuring software is properly archived, i.e, is safely stored so that -it can be accessed in a perennial way, can be quite tricky. If you -have never seen [[https://github.com/alegrand/RR_webinars/blob/master/5_archiving_software_and_data/index.org][Roberto Di Cosmo presenting the Software Heritage -project]], this is a must see. https://www.softwareheritage.org/ - -For regular data, we highly recommend using https://www.zenodo.org/ -whenever the data is not sensitive. -** Workflows -In the video sequences, we mentioned workflow managers (original application domain in parenthesis): -- [[https://galaxyproject.org/][Galaxy]] (genomics), [[https://kepler-project.org/][Kepler]] (ecology), [[https://taverna.apache.org/][Taverna]] (bio-informatics), [[https://pegasus.isi.edu/][Pegasus]] - (astronomy), [[http://cknowledge.org/][Collective Knowledge]] (compiling optimization) , - [[https://www.vistrails.org][VisTrails]] (image processing) -- Light-weight: [[http://dask.pydata.org/][dask]] (python), [[https://ropensci.github.io/drake/][drake]] (R), [[http://swift-lang.org/][swift]] (molecular biology), - [[https://snakemake.readthedocs.io/][snakemake]] (like =make= but more expressive and in =python=) ... -- Hybrids: [[https://vatlab.github.io/sos-docs/][SOS-notebook]], ... 
- -You may want to have a look at this webinar: [[https://github.com/alegrand/RR_webinars/blob/master/6_reproducibility_bioinformatics/index.org][Reproducible Science in -Bio-informatics: Current Status, Solutions and Research Opportunities -(by Sarah Cohen Boulakia, Yvan Le Bras and Jérôme Chopard).]] - -** Numerical and statistical issues -We have mentioned these topics in our MOOC but we could by no way -cover them properly. We only suggest here a few interesting talks -about this. -- [[https://github.com/alegrand/RR_webinars/blob/master/10_statistics_and_replication_in_HCI/index.org][In this talk, Pierre Dragicevic provides a nice illustration of the - consequences of statistical uncertainty and of how some concepts - (e.G. p-values) are commonly badly understood.]] -- [[https://github.com/alegrand/RR_webinars/blob/master/3_numerical_reproducibility/index.org][Nathalie Revol, Philippe Langlois and Stef Graillat present the main - challenges encountered when trying to achieve numerical - reproducibility and present recent research work on this topic.]] -** Publication practices -You may want to have a look at the following two webinars: -- [[https://github.com/alegrand/RR_webinars/blob/master/8_artifact_evaluation/index.org][Enabling open and reproducible research at computer systems’ - conferences (by Grigori Fursin)]]. In particular, this talk discusses - /artifact evaluation/ that is becoming more and more popular. -- [[https://github.com/alegrand/RR_webinars/blob/master/7_publications/index.org][Publication Modes Favoring Reproducible Research (by Konrad Hinsen - and Nicolas Rougier)]]. In this talk, the motivation for the [[http://rescience.github.io/][ReScience - journal]] initiative are presented. 
-- [[https://www.youtube.com/watch?v=HuJ2G8rXHMs][Simine Vazire - When Should We be Skeptical of Scientific Claims?]], - which is discussing publication practices in social sciences and in - particular HARKing (Hypothesizing After the Results are Known), - p-hacking, etc. -** Experimentation -Experimentation was not covered in this MOOC, although it is an -essential part of science. The main reason is that practices and -constraints can vary so wildly from one domain to another that it could -not be properly covered in a first edition. We would be happy to -gather references you consider as interesting in your domain so do not -hesitate to provide us with such references by using the forum and we -will update this page. - -- [[https://github.com/alegrand/RR_webinars/blob/master/9_experimental_testbeds/index.org][A recent talk by Lucas Nussbaum on Experimental Testbeds in Computer - Science]]. -* Tracking environment information -** Getting information about your Git repository +* Getting information about your Git repository When taking notes, it may be difficult to remember which version of the code or of a file was used. This is what version control is useful for. Here are a few useful commands that we typically insert at the @@ -203,8 +84,8 @@ Obviously, in this case you need to save the notebook before running this cell, hence the output of this final command (with the new git hash) will not be stored in the cell. This is not really a problem and is the price to pay for running git from within the notebook itself. -** Getting information about Python(3) libraries -*** Getting information about your system +* Getting information about Python(3) libraries +** Getting information about your system This topic is discussed on [[https://stackoverflow.com/questions/3103178/how-to-get-the-system-info-with-python][StackOverflow]]. 
#+begin_src python :results output :exports both import platform @@ -214,7 +95,7 @@ print(platform.uname()) #+RESULTS: : uname_result(system='Linux', node='icarus', release='4.15.0-2-amd64', version='#1 SMP Debian 4.15.11-1 (2018-03-20)', machine='x86_64', processor='') -*** Getting the list of installed packages and their version +** Getting the list of installed packages and their version This topic is discussed on [[https://stackoverflow.com/questions/20180543/how-to-check-version-of-python-modules][StackOverflow]]. When using =pip= (the Python package installer) within a shell command, it is easy to query the version of all installed packages (note that on your system, you may @@ -310,7 +191,7 @@ License: BSD License Location: /home/alegrand/.local/lib/python3.6/site-packages Requires: patsy, pandas #+end_example -*** How to list imported modules? +** How to list imported modules? Without resorting to pip (that will list all available packages), you may want to know which modules are loaded in a Python session as well as their version. Inspired by [[https://stackoverflow.com/questions/4858100/how-to-list-imported-modules][StackOverflow]], here is a simple @@ -368,7 +249,7 @@ urllib.request 3.6 zlib 1.0 #+end_example -*** Saving and restoring an environment with pip +** Saving and restoring an environment with pip The easiest way to go is as follows: #+begin_src shell :results output :exports both pip3 freeze > requirements.txt # to obtain the list of packages with their version @@ -378,7 +259,7 @@ pip3 install -r requirements.txt # to install the previous list of packages, pos If you want to have several installed Python environments, you may want to use [[https://docs.pipenv.org/][Pipenv]]. I doubt it allows to track correctly FORTRAN or C dynamic libraries that are wrapped by Python though. 
-*** Installing a new package or a specific version +** Installing a new package or a specific version The Jupyter environment we deployed on our servers for the MOOC is based on the version 4.5.4 of Miniconda and Python 3.6. In this environment you should simply use the =pip= command (remember on your @@ -430,8 +311,8 @@ It is even possible to install a specific (possibly much older) version, e.g.,: #+begin_src shell :results output :exports both pip install statsmodels==0.6.1 #+end_src -** Getting information about R libraries -*** Getting the list imported modules and their version +* Getting information about R libraries +** Getting the list imported modules and their version The best way seems to be to rely on the =devtools= package (if this package is not installed, you should install it first by running in =R= the command =install.packages("devtools")=). @@ -493,7 +374,7 @@ Packages ---------------------------------------------------------------------- Some actually advocate that [[https://github.com/ropensci/rrrpkg][writing a reproducible research compendium is best done by writing an R package]]. Those of you willing to have a clean R dependency management should thus have a look at [[https://rstudio.github.io/packrat/][Packrat]]. -*** Getting the list of installed packages and their version +** Getting the list of installed packages and their version Finally, it is good to know that there is a built-in R command (=installed.packages=) allowing to retrieve and list the details of all packages installed. 
@@ -514,12 +395,12 @@ head(installed.packages()) | StanHeaders | /home/alegrand/R/x86_64-pc-linux-gnu-library/3.5 | 2.17.2 | nil | nil | nil | nil | RcppEigen, BH | nil | BSD_3_clause + file LICENSE | nil | nil | nil | nil | yes | 3.5.1 | | | acepack | /home/alegrand/R/x86_64-pc-linux-gnu-library/3.5 | 1.4.1 | nil | nil | nil | nil | testthat | nil | MIT + file LICENSE | nil | nil | nil | nil | yes | 3.5.1 | | -*** Installing a new package or a specific version +** Installing a new package or a specific version This section is mostly a cut and paste from the [[https://support.rstudio.com/hc/en-us/articles/219949047-Installing-older-versions-of-packages][recent post by Ian Pylvainen]] on this topic. It comprises a very clear explanation of how to proceed. -**** Installing a pre-compiled version +*** Installing a pre-compiled version If you're on a Debian or a Ubuntu system, it may be difficult to access a specific version without breaking your system. So unless you are moving to the latest version available in your Linux distribution, @@ -536,7 +417,7 @@ similar to the example below: packageurl <- "https://cran-archive.r-project.org/bin/windows/contrib/2.13/BBmisc_1.0-58.zip" install.packages(packageurl, repos=NULL, type="binary") #+end_src -**** Using devtools +*** Using devtools The simplest method to install the version you need is to use the =install_version()= function of the =devtools= package (obviously, you need to install =devtools= first, which can be done by running in =R= the @@ -546,7 +427,7 @@ command =install.packages("devtools")=). 
For instance: require(devtools) install_version("ggplot2", version = "0.9.1", repos = "http://cran.us.r-project.org") #+end_src -**** Installing from source code +*** Installing from source code Alternatively, you may want to install an older package from source If devtools fails or if you do not want to depend on it, you can install it from source via =install.packages()= directed using the right @@ -565,7 +446,7 @@ line outside of R. For instance (in bash): wget http://cran.r-project.org/src/contrib/Archive/ggplot2/ggplot2_0.9.1.tar.gz R CMD INSTALL ggplot2_0.9.1.tar.gz #+end_src -**** Potential issues +*** Potential issues There are a few potential issues that may arise with installing older versions of packages: - You may be losing functionality or bug fixes that are only present diff --git a/module4/ressources/resources_refs.html b/module4/ressources/resources_refs.html new file mode 100644 index 0000000..971aa65 --- /dev/null +++ b/module4/ressources/resources_refs.html @@ -0,0 +1,205 @@ +
+

Additional references

+ + +
+

"Thoughts" on language/software stability

+
+

+As we explained, the programming language used in an analysis has a
+clear influence on the reproducibility of your analysis. It is not a
+characteristic of the language itself but rather a consequence of the
+development philosophy of the underlying community. For example, C is a
+very stable language with a very clear specification designed by a
+committee (even though some compilers may not respect this standard).
+

+ +

+On the other end of the spectrum, Python had a much more organic
+development based on a readability philosophy and valuing continuous
+improvement over backwards-compatibility. Furthermore, Python is
+commonly used as a wrapping language (e.g., to easily use C or FORTRAN
+libraries) and has its own packaging system. All these design choices
+tend to make reproducibility often a bit painful with Python, even
+though the community is slowly taking this into account. The transition from Python 2 to the not fully backwards-compatible Python 3 has been a particularly painful process, not least because the two languages are so similar that it is not always easy to figure out whether a given script or module is written in Python 2 or Python 3. It isn't even rare to see Python scripts that work under both Python 2 and Python 3, but produce different results due to the change in the behavior of integer division.
+

+ +

+R, in comparison, is much closer (in terms of developer community) to
+languages like SAS, which is heavily used in the pharmaceutical
+industry where statistical procedures need to be standardized and rock
+solid/stable. R is obviously not immune to evolutions that break old
+versions and hinder reproducibility/backward compatibility. Here is a
+relatively recent true story about this, and some colleagues who worked
+on the statistics introductory course with R on FUN reported to us
+several issues with a few functions (plotmeans from gplots,
+survfit from survival, or hclust) whose default parameters had
+changed over the years. It is thus probably good practice to give
+explicit values for all parameters (which can be cumbersome) instead
+of relying on default values, and to restrict your dependencies as much
+as possible.
+

+ +

+This being said, the R development community is generally quite
+careful about stability. We (the authors of this MOOC) believe that open
+source (which allows one to inspect how computation is done and to
+identify both mistakes and sources of non-reproducibility) is more
+important than the rock solid stability of SAS, which is proprietary
+software. Yet, if you really need to stay with SAS (similar solutions
+probably exist for other languages as well), you should know that SAS
+can be used within Jupyter using either the Python SASKernel or the
+Python SASPy package (step by step explanations about this are given
+here). Using such a literate programming approach, allied with systematic
+version and environment control, will always help.
+

+
+
+
+

Controlling your software environment

+
+

+As we mentioned in the video sequences, there are several solutions to +control your environment: +

+
    +
  • The easy (preserve the mess) ones: CDE or ReproZip
  • +
  • The more demanding (encourage cleanliness) where you start with a +clean environment and install only what's strictly necessary (and document it): +
      +
    • The very well known Docker
    • +
    • Singularity or Spack, which are more targeted toward the specific +needs of high performance computing users
    • +
    • Guix, Nix that are very clean (perfect?) solutions to this +dependency hell and which we recommend
    • +
  • +
+ +

+It may be hard to understand the difference between these different +approaches and decide which one is better in your context. +

+ +

+Here is a webinar where some of these tools are demoed in a +reproducible research context: Controling your environment (by Michael +Mercier and Cristian Ruiz) +

+ +

+You may also want to have a look at the Popper conventions (webinar by +Ivo Gimenez through google hangout) or at the presentation of Konrad +Hinsen on Active Papers (http://www.activepapers.org/). +

+
+
+
+

Preservation/Archiving

+
+

+Ensuring software is properly archived, i.e., safely stored so that +it remains accessible in the long term, can be quite tricky. If you +have never seen Roberto Di Cosmo presenting the Software Heritage +project, this is a must-see. https://www.softwareheritage.org/ +

+ +

+For regular data, we highly recommend using https://www.zenodo.org/ +whenever the data is not sensitive. +

+
+
+
+

Workflows

+
+

+In the video sequences, we mentioned workflow managers (original application domain in parentheses): +

+ + +

+You may want to have a look at this webinar: Reproducible Science in +Bio-informatics: Current Status, Solutions and Research Opportunities +(by Sarah Cohen Boulakia, Yvan Le Bras and Jérôme Chopard). +

+
+
+ + +
+

Publication practices

+
+

+You may want to have a look at the following talks: +

+ +
+
+
+

Experimentation

+
+

+Experimentation was not covered in this MOOC, although it is an +essential part of science. The main reason is that practices and +constraints vary so widely from one domain to another that it could +not be properly covered in a first edition. We would be happy to +gather references you consider interesting in your domain, so do not +hesitate to share them on the forum and we +will update this page. +

+ + +
+
+
diff --git a/module4/ressources/resources_refs.org b/module4/ressources/resources_refs.org new file mode 100644 index 0000000..718e0ba --- /dev/null +++ b/module4/ressources/resources_refs.org @@ -0,0 +1,125 @@ +# -*- mode: org -*- +#+TITLE: Additional references +#+AUTHOR: Arnaud Legrand +#+DATE: June, 2018 +#+STARTUP: overview indent +#+OPTIONS: num:nil toc:t +#+PROPERTY: header-args :eval never-export + +* "Thoughts" on language/software stability +As we explained, the programming language used in an analysis has a +clear influence on the reproducibility of your analysis. It is not a +characteristic of the language itself but rather a consequence of the +development philosophy of the underlying community. For example, C is a +very stable language with a [[https://en.wikipedia.org/wiki/C_(programming_language)#ANSI_C_and_ISO_C][very clear specification designed by a +committee]] (even though some compilers may not respect this standard). + +At the other end of the spectrum, [[https://en.wikipedia.org/wiki/Python_(programming_language)][Python]] had a much more organic +development based on a readability philosophy and valuing continuous +improvement over backwards-compatibility. Furthermore, Python is +commonly used as a wrapping language (e.g., to easily use C or FORTRAN +libraries) and has its own packaging system. All these design choices +tend to make reproducibility often a bit painful with Python, even +though the community is slowly taking this into account. The transition from Python 2 to the not fully backwards compatible Python 3 has been a particularly painful process, not least because the two languages are so similar that it is not always easy to figure out if a given script or module is written in Python 2 or Python 3. It isn't even rare to see Python scripts that work under both Python 2 and Python 3, but produce different results due to the change in the behavior of integer division. 
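The integer-division pitfall mentioned above can be illustrated with a minimal sketch. It assumes a Python 3 interpreter, and the function names =mean_floor= and =mean_true= are hypothetical, chosen only for this illustration. Under Python 2, =/= between two integers behaved like =//= does here, which is why the same script can silently give different results under the two versions.

```python
def mean_floor(values):
    """Mean computed with floor division: what '/' did for two ints in Python 2."""
    return sum(values) // len(values)


def mean_true(values):
    """Mean computed with true division: what '/' does in Python 3."""
    return sum(values) / len(values)


data = [1, 2]
print(mean_floor(data))  # 1
print(mean_true(data))   # 1.5
```

Writing =//= explicitly when floor division is intended (or adding =from __future__ import division= to Python 2 code) removes the ambiguity.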
+ +[[https://en.wikipedia.org/wiki/R_(programming_language)][R]], in comparison, is much closer (in terms of developer community) to +languages like [[https://en.wikipedia.org/wiki/SAS_(software)][SAS]], which is heavily used in the pharmaceutical +industry, where statistical procedures need to be standardized and rock +solid/stable. R is obviously not immune to changes that break older +code and hinder reproducibility/backward compatibility. Here is a +relatively recent [[http://members.cbio.mines-paristech.fr/~thocking/HOCKING-reproducible-research-with-R.html][true story about this]]. In addition, some colleagues who worked +on the [[https://www.fun-mooc.fr/courses/UPSUD/42001S06/session06/about][statistics introductory course with R on FUN]] reported to us +several issues with a few functions (=plotmeans= from =gplots=, +=survfit= from =survival=, or =hclust=) whose default parameters had +changed over the years. It is thus probably good practice to give +explicit values for all parameters (which can be cumbersome) instead +of relying on default values, and to restrict your dependencies as much +as possible. + +This being said, the R development community is generally quite +careful about stability. We (the authors of this MOOC) believe that open +source (which makes it possible to inspect how computation is done and to +identify both mistakes and sources of non-reproducibility) is more +important than the rock solid stability of SAS, which is proprietary +software. Yet, if you really need to stay with SAS (similar solutions +probably exist for other languages as well), you should know that SAS +can be used within Jupyter using either the [[https://sassoftware.github.io/sas_kernel/][Python SASKernel]] or the +[[https://sassoftware.github.io/saspy/][Python SASPy]] package (step by step explanations about this are given +[[https://app-learninglab.inria.fr/gitlab/85bc36e0a8096c618fbd5993d1cca191/mooc-rr/blob/master/documents/tuto_jupyter_windows/tuto_jupyter_windows.md][here]]). 
Using such a literate programming approach, combined with systematic +version and environment control, will always help. +* Controlling your software environment +As we mentioned in the video sequences, there are several solutions to +control your environment: +- The easy (preserve the mess) ones: [[http://www.pgbovine.net/cde.html][CDE]] or [[https://vida-nyu.github.io/reprozip/][ReproZip]] +- The more demanding (encourage cleanliness) where you start with a + clean environment and install only what's strictly necessary (and document it): + - The very well known [[https://www.docker.io/][Docker]] + - [[https://singularity.lbl.gov/][Singularity]] or [[https://spack.io/][Spack]], which are more targeted toward the specific + needs of high performance computing users + - [[https://www.gnu.org/software/guix/][Guix]] and [[https://nixos.org/][Nix]], which are very clean (perfect?) solutions to this + dependency hell and which we recommend + +It may be hard to understand the differences between these +approaches and to decide which one is best in your context. + +Here is a webinar where some of these tools are demoed in a +reproducible research context: [[https://github.com/alegrand/RR_webinars/blob/master/2_controling_your_environment/index.org][Controling your environment (by Michael +Mercier and Cristian Ruiz)]] + +You may also want to have a look at [[http://falsifiable.us/][the Popper conventions]] ([[https://github.com/alegrand/RR_webinars/blob/master/11_popper/index.org][webinar by +Ivo Gimenez through Google Hangout]]) or at the [[https://github.com/alegrand/RR_webinars/blob/master/7_publications/index.org][presentation of Konrad +Hinsen on Active Papers]] (http://www.activepapers.org/). +* Preservation/Archiving +Ensuring software is properly archived, i.e., safely stored so that +it remains accessible in the long term, can be quite tricky. 
If you +have never seen [[https://github.com/alegrand/RR_webinars/blob/master/5_archiving_software_and_data/index.org][Roberto Di Cosmo presenting the Software Heritage +project]], this is a must-see. https://www.softwareheritage.org/ + +For regular data, we highly recommend using https://www.zenodo.org/ +whenever the data is not sensitive. +* Workflows +In the video sequences, we mentioned workflow managers (original application domain in parentheses): +- [[https://galaxyproject.org/][Galaxy]] (genomics), [[https://kepler-project.org/][Kepler]] (ecology), [[https://taverna.apache.org/][Taverna]] (bio-informatics), [[https://pegasus.isi.edu/][Pegasus]] + (astronomy), [[http://cknowledge.org/][Collective Knowledge]] (compiling optimization), + [[https://www.vistrails.org][VisTrails]] (image processing) +- Light-weight: [[http://dask.pydata.org/][dask]] (Python), [[https://ropensci.github.io/drake/][drake]] (R), [[http://swift-lang.org/][swift]] (molecular biology), + [[https://snakemake.readthedocs.io/][snakemake]] (like =make= but more expressive and in =python=) ... +- Hybrids: [[https://vatlab.github.io/sos-docs/][SOS-notebook]], ... + +You may want to have a look at this webinar: [[https://github.com/alegrand/RR_webinars/blob/master/6_reproducibility_bioinformatics/index.org][Reproducible Science in +Bio-informatics: Current Status, Solutions and Research Opportunities +(by Sarah Cohen Boulakia, Yvan Le Bras and Jérôme Chopard).]] + +* Numerical and statistical issues +We have mentioned these topics in our MOOC but we could in no way +cover them properly. We only suggest here a few interesting talks +about this. +- [[https://github.com/alegrand/RR_webinars/blob/master/10_statistics_and_replication_in_HCI/index.org][In this talk, Pierre Dragicevic provides a nice illustration of the + consequences of statistical uncertainty and of how some concepts + (e.g. 
p-values) are commonly misunderstood.]] +- [[https://github.com/alegrand/RR_webinars/blob/master/3_numerical_reproducibility/index.org][Nathalie Revol, Philippe Langlois and Stef Graillat present the main + challenges encountered when trying to achieve numerical + reproducibility and present recent research work on this topic.]] +* Publication practices +You may want to have a look at the following talks: +- [[https://github.com/alegrand/RR_webinars/blob/master/8_artifact_evaluation/index.org][Enabling open and reproducible research at computer systems’ + conferences (by Grigori Fursin)]]. In particular, this talk discusses + /artifact evaluation/, which is becoming more and more popular. +- [[https://github.com/alegrand/RR_webinars/blob/master/7_publications/index.org][Publication Modes Favoring Reproducible Research (by Konrad Hinsen + and Nicolas Rougier)]]. In this talk, the motivations for the [[http://rescience.github.io/][ReScience + journal]] initiative are presented. +- [[https://www.youtube.com/watch?v=HuJ2G8rXHMs][Simine Vazire - When Should We be Skeptical of Scientific Claims?]], + which discusses publication practices in social sciences and in + particular HARKing (Hypothesizing After the Results are Known), + p-hacking, etc. +* Experimentation +Experimentation was not covered in this MOOC, although it is an +essential part of science. The main reason is that practices and +constraints vary so widely from one domain to another that it could +not be properly covered in a first edition. We would be happy to +gather references you consider interesting in your domain, so do not +hesitate to share them on the forum and we +will update this page. + +- [[https://github.com/alegrand/RR_webinars/blob/master/9_experimental_testbeds/index.org][A recent talk by Lucas Nussbaum on Experimental Testbeds in Computer + Science]]. -- 2.18.1