# Mooc's logbook

Reserved for the Mooc's logbook

## Module 1, June 6, 2022

**Key notes:**

- The importance of taking notes
- The history of notebooks
- Pros and cons
- The challenges

**Exercise 01-1:**

*Which two files contain the character string "LE MOOC RECHERCHE REPRODUCTIBLE C'EST GENIAL"?*

**Answer**

- module1/exo1/aebef6b0a5.txt
- module1/exo1/f683bbad4b.txt

*Find the file module1/exo2/readme.md and, using the blame or history functions of GitLab, retrieve the commit that added the title Helloworld Python.*

*What is the commit number?*

**Answer** 505c4e26

*Who is the author of the commit?*

**Answer** Arnaud Legrand

**Quiz 1**

*Why has a European project recently used the logbooks of the Portuguese, Spanish, Dutch and English Indian Companies?*

To try to reconstruct the climate of the oceans criss-crossed by the Western navies

*What note media are illustrated in the course video "Note-taking concerns everyone" by Christophe Pouzat?*

- Notes in the margins of books and manuscripts
- Notes in field books
- Notes on cards and paper slips

*Why did Leibniz order the construction of a closet?*

To store and order notes written on paper slips

*For the curious: visit the Darwin Online website, go to the notebooks, and describe how Darwin took his notes.*

First in notebooks, then on cards and paper sheets stored in folders

**Quiz 2**

*What is the origin of the codex?*

The Egyptian production of papyrus was not large enough to meet the demand of writers

*What aspect of Eusebius's work is presented in this sequence?*

His canon tables (cross-references between the Gospel books)

*In which line should the keyword "Analysis" go in John Locke's index?*

« Aa »

**Quiz 3**

*What is a text file?*

A file made up of (stored as) UTF-8 characters

*What is a tag?*

A character, or series of characters, used to structure a document that will be invisible to the final reader

*Markdown is a markup language:*

"Light"

**Quiz 4**

*LibreOffice makes the comparison of two successive versions possible.*

True

*A wiki engine allows us to modify a single page at a time.*

True

*GitHub and GitLab let us work with binary files like images.*

True

**Quiz 5**

*What are the limitations of the search functionality of text editors?*

- They only work with text files
- They work on a single file at a time

*What is DocFetcher?*

- A cross-platform software
- A desktop search engine

*Why does it make sense to use tags and keywords?*

- To filter out overabundant information
- To quickly find relevant information

*Text files are the only files to which tags and keywords can be added.*

False

---

## Module 2, June 6, 2022

**Exercise 02-2**

*What is the average?*

14.11

*What is the minimum?*

2.8

*What is the maximum?*

23.4

*What is the median?*

14.5

*What is the standard deviation?*

4.33

**Quiz 06**

*A computational document allows you to:*

- Improve the traceability of a calculation
- Easily present your work to colleagues
- Access all the calculations underlying an analysis

*Which environment(s) are presented to you in this MOOC?*

- RStudio
- Emacs/OrgMode
- Jupyter

*Which environment is recommended if your preferred language is Python?*

Jupyter

*Which environment is recommended if your preferred language is the R language?*

RStudio

*Which environment is used daily by the three authors of this MOOC?*

Emacs/OrgMode

**Quiz 7**

*In the studies we have presented to you, what prevents, sometimes for several years, the debate on the relevance of a study?*

- Unpublished computation procedures
- Data used in the study was not released

*In the various examples presented (economics, MRI, crystallography), what are the main causes of errors?*

- Data acquisition (bias, machine calibration, etc.)
- Computation errors
- Inadequate data processing or statistics

*What are the consequences of lack of transparency?*

- It's difficult to rely on the work of others
- Articles contain less information (no details on calculations, experimental protocols, data analysis, etc.) and are therefore easier to read
- It is difficult to verify and reproduce the analyses presented in the articles
- Two articles may present results that seem to contradict each other, but are both perfectly correct, as the lack of detail prevents the exact conditions of application from being determined

**Quiz 8**

*What are the main technical causes behind the difficulties in reproducing someone else's work?*

- Lack of documentation on the choices made
- Interactive graphical software that hides computation details
- Computation errors
- Data loss (no backup, or a format that is no longer readable)

*Which solutions are mentioned?*

- Using a laboratory notebook
- Code review and continuous integration
- Using version control systems and several backup mechanisms

*What are the most legitimate/valid fears associated with the systematic disclosure of data (open data)?*

- Some information may be sensitive and its disclosure may hurt people
- My resources are limited.
If I systematically host all this data on the web page provided by my employer, I am likely to quickly exceed my quota

**Quiz 9**

*What is commonly found in a computational document?*

- Commentaries
- Code
- An overview of data
- Computational results
- Hypertext links
- Images

*What does a computational document allow?*

- Inspect the computations
- Easily re-run the computations if the original environment is available
- Document the code
- Explain why a particular computation is made based on the data analysis so far
- Use multiple languages to perform computations (even if it may require some work)

**QuizP 01**

*What does an environment like Jupyter provide in comparison to working in the Python console or running R scripts directly?*

- It provides a well-structured history of the analyses performed
- It allows you to inspect data, keep a history of this inspection, and explain the transformations you perform as you go along
- It saves intermediate results, whether textual or graphical.
- It allows you to generate documents in HTML or PDF
- It allows you to ensure that a figure is the result of the computation described in the document

*In Jupyter, what features are provided for the Python language but not available for the R language?*

The same features are available for both languages

*What allows you to be effective in an environment like Jupyter?*

- The export functions and the ability to easily re-run the code from the beginning
- Autocompletion
- Learning keyboard shortcuts
- Reading the documentation and cheat sheets

**Quiz 10**

*What should I prepare for if I use a computational document?*

- By letting my co-authors easily access and modify my code, they may break everything
- My collaborators may realize how bad I am at writing code and how often I fiddle with my results (assuming that this is the case...;)
- I will have no more excuses for not rereading and checking the code of my collaborators
- My co-authors and I will have to make sure that it works on each of our machines and it will take us time
- We'll have to install a lot of complicated stuff when our machines are not up to date

*What are the benefits of using and publishing a computational document?*

- These tools are relatively easy to learn, which allows as many people as possible to use them and better understand my work
- These tools make it possible to have in a single document (1) an overview of the data, (2) the code, (3) the computation results, and especially (4) explanations of how these three types of objects relate to each other
- This makes it possible to be transparent about how we reach a particular conclusion
- This makes it easier for others to reuse all or part of our computation procedures

*How can you make your computational document available in a sustainable way?*

- GitLab, GitHub, …
- An open archive (HAL, figshare, Zenodo)

**Quiz 11**

*What makes the three environments Jupyter, RStudio and OrgMode different?*

- Ease of installation and learning curve
- The year of
creation and the underlying language
- The ability to write documents in a specific style for submission to a journal or conference
- The underlying file format

*What justifies using one of these three environments (Jupyter, RStudio and OrgMode) rather than another?*

- The type of document (tutorial sheet, laboratory notebook, article...) that you wish to write
- The audience who will contribute to or use this document

*A computational document facilitates the collection and sharing of information on data and on a computation. But for this information to be exploitable it is important to:*

- Manage your document using a version manager
- Structure the document to make it as readable as possible
- Explain in natural language the general idea behind a computation and why a particular computation decision is taken
- Use keywords to facilitate indexing and navigation
- Think about who the information is intended for (yourself, a colleague, your advisor, ...)

## Module 3, June 7, 2022

**Exercise 03-2**

*Which year had the strongest epidemic?*

2009

*Which year had the weakest epidemic?*

2020

**Quiz 12**

*What distinguishes a replicable data analysis from a traditional analysis?*

The code for all computations is included

*What are the advantages of a replicable analysis?*

- It is easier to modify
- It is easier to verify

**Quiz 13**

*Where do the data on the incidence of influenza-like illness come from?*

From the “réseau Sentinelles”, a network of general practitioners

*In which format are the data available?*

CSV format

*What is the sampling frequency of the incidence data?*

One value per week

*Why do we advise against removing the missing data line from the downloaded data file?*

It would leave no visible trace of the manipulation

**QuizP 04**

*Where did we find the URL for downloading the data?*

In the Web browser’s download history

*How do we handle missing data?*

We remove the data points before continuing with the analysis

**QuizP 07**

*Why do we have to transform
the week labels?*

Pandas cannot interpret the format of the original data

*What's the point of checking that the distance between two consecutive weeks is seven days?*

- The check would find weeks completely absent from the dataset
- The check could have identified mistakes in the date conversion

*Which methods did we use to verify our work?*

- Visual inspection
- Code written specifically for verification

**QuizP 10**

*Why did we choose the first of August as the beginning of each annual period?*

The incidence of influenza-like illness is weakest around that date

*Why don’t our annual periods contain exactly 52 weeks?*

A year always has more than 7 × 52 = 364 days
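
As a rough illustration of the checks mentioned in QuizP 07 and QuizP 10, here is a minimal sketch using only the Python standard library (the course itself works with pandas). It assumes week labels are integers of the form `yyyyww` (ISO year followed by ISO week number); these notes do not state the actual format, so treat that as an assumption. The function names `week_start` and `find_gaps` and the sample labels are hypothetical, and `date.fromisocalendar` requires Python 3.8 or later.

```python
from datetime import date, timedelta


def week_start(label):
    """Return the Monday of the ISO week encoded as yyyyww (assumed format)."""
    year, week = divmod(label, 100)
    return date.fromisocalendar(year, week, 1)


def find_gaps(labels):
    """Return pairs of consecutive week starts that are not seven days apart."""
    starts = sorted(week_start(label) for label in labels)
    one_week = timedelta(days=7)
    return [(a, b) for a, b in zip(starts, starts[1:]) if b - a != one_week]


# Hypothetical labels, not taken from the real dataset: week 3 is missing.
labels = [202001, 202002, 202004, 202005]
for a, b in find_gaps(labels):
    print(f"gap between {a} and {b}")

# An August-to-August period spans 365 or 366 days, more than 7 * 52 = 364,
# so it cannot always be cut into exactly 52 whole weeks.
period_days = (date(2020, 8, 1) - date(2019, 8, 1)).days
print(period_days, divmod(period_days, 7))
```

A gap reported by `find_gaps` can mean either a week absent from the data or a mistake in the label conversion, which is why the check is useful for both purposes listed above.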