diff --git a/module3/ressources/analyse-syndrome-grippal-orgmode+Lisp+Python+R.org b/module3/ressources/analyse-syndrome-grippal-orgmode+Lisp+Python+R.org index 32af4a1f83169d4c207fe97019dce71afdaf2bb1..aa40045e1ad07322b51c49ae61a75f69f4c2eace 100644 --- a/module3/ressources/analyse-syndrome-grippal-orgmode+Lisp+Python+R.org +++ b/module3/ressources/analyse-syndrome-grippal-orgmode+Lisp+Python+R.org @@ -10,7 +10,7 @@ #+HTML_HEAD: #+HTML_HEAD: -#+PROPERTY: header-args :session :exports both +#+PROPERTY: header-args :session * Préface @@ -22,7 +22,7 @@ Une version plus ancienne d'Emacs devrait suffire, mais en ce cas il est prudent Bibliothèque supplémentaire: - [[https://github.com/magnars/dash.el][Dash]] 2.13.0 -#+BEGIN_SRC emacs-lisp :results output +#+BEGIN_SRC emacs-lisp :results output :exports both (require 'dash) (unless (featurep 'dash) (print "Veuillez installer Dash !")) @@ -33,13 +33,13 @@ Bibliothèque supplémentaire: ** Python 3.6 Nous utilisons le traitement de dates en format ISO 8601, qui a été implémenté en Python seulement avec la version 3.6. -#+BEGIN_SRC python :results output +#+BEGIN_SRC python :results output :exports both import sys if sys.version_info.major < 3 or sys.version_info.minor < 6: print("Veuillez utiliser Python 3.6 (ou plus) !") #+END_SRC -#+BEGIN_SRC emacs-lisp :results output +#+BEGIN_SRC emacs-lisp :results output :exports both (unless (featurep 'ob-python) (print "Veuillez activer python dans org-babel (org-babel-do-languages) !")) #+END_SRC @@ -47,7 +47,7 @@ if sys.version_info.major < 3 or sys.version_info.minor < 6: ** R 3.4 Nous n'utilisons que des fonctionnalités de base du langage R, une version antérieure devrait suffire. -#+BEGIN_SRC emacs-lisp :results output +#+BEGIN_SRC emacs-lisp :results output :exports both (unless (featurep 'ob-R) (print "Veuillez activer R dans org-babel (org-babel-do-languages) !")) #+END_SRC @@ -80,7 +80,7 @@ Pour éviter de télécharger les données plusieurs fois, nous les gardons dans #+NAME: data-buffer-name *données syndrome grippal* -#+BEGIN_SRC emacs-lisp :results silent :var url=data-url :var name=data-buffer-name +#+BEGIN_SRC emacs-lisp :results silent :var url=data-url :var name=data-buffer-name :exports both (require 'url) (with-current-buffer (get-buffer-create name) (unless buffer-read-only @@ -90,7 +90,7 @@ Pour éviter de télécharger les données plusieurs fois, nous les gardons dans #+END_SRC La prochaine étape est l'extraction des données qui nous intéressent. D'abord nous découpons le contenu du fichier en lignes, dont nous jetons la première qui ne contient qu'un commentaire. Les autres lignes sont découpées en colonnes. Pour détecter les données manquantes, nous vérifions si une ligne contient au moins un champ vide. A la fin de ce bloc, deux variables contiennent les données manquantes et les donées valables. -#+BEGIN_SRC emacs-lisp :results silent :var name=data-buffer-name +#+BEGIN_SRC emacs-lisp :results silent :var name=data-buffer-name :exports both (require 'cl) (require 'dash) (defun missing-data? (row) @@ -105,19 +105,19 @@ La prochaine étape est l'extraction des données qui nous intéressent. D'abord #+END_SRC Regardons les données manquantes... -#+BEGIN_SRC emacs-lisp +#+BEGIN_SRC emacs-lisp :exports both missing-data-lines #+END_SRC Nous ne gardons que la première (~"week"~) et la troisième (~"inc"~) colonne des données valables. Nous insérons ~hline~ comme deuxième élément de notre tableau pour indiquer à org-mode la séparation entre l'en-tête (les noms des colonnes) et les données. #+NAME: data -#+BEGIN_SRC emacs-lisp :results silent +#+BEGIN_SRC emacs-lisp :results silent :exports both (-insert-at 1 'hline (-select-columns '(0 2) valid-data-lines)) #+END_SRC Regardons les premières et les dernières lignes: -#+BEGIN_SRC emacs-lisp :results value :var data=data :colnames yes +#+BEGIN_SRC emacs-lisp :results value :var data=data :colnames yes :exports both (-concat (-take 5 data) '(hline) (-take-last 5 data)) @@ -126,7 +126,7 @@ Regardons les premières et les dernières lignes: ** Vérification Il est toujours prudent de vérifier si les données semblent crédibles. Nous savons que les semaines sont données par six chiffres (quatre pour l'année et deux pour la semaine), dont les deux premiers sont ou "19" ou "20", et que les incidences sont des nombres entiers positifs. -#+BEGIN_SRC emacs-lisp :results output :var data=data :colnames yes +#+BEGIN_SRC emacs-lisp :results output :var data=data :colnames yes :exports both (defun check-week (week) (unless (string-match-p (rx (or "19" "20") (repeat 4 digit)) week) (princ (format "Invalid week value: %s\n" week)))) @@ -146,7 +146,7 @@ Rien à signaler ! ** Conversions Pour faciliter les traitements suivants, nous remplaçons les numéros de semaine ISO par les dates qui correspondent aux lundis. A cette occasion, nous trions aussi les données par la date, et nous transformons les incidences en nombres entiers. Nous utilisons le langage Python 3 parce qu'il est un des rares à proposer la conversion de semaines ISO en dates dans sa biblithèque standard. -#+BEGIN_SRC python :results silent :var input_data=data +#+BEGIN_SRC python :results silent :var input_data=data :exports both import datetime data = [(datetime.datetime.strptime(year_and_week + ":1" , '%G%V:%u').date(), int(inc)) @@ -155,14 +155,14 @@ data.sort(key = lambda record: record[0]) #+END_SRC Regardons de nouveau les premières et les dernières lignes: -#+BEGIN_SRC python :results value +#+BEGIN_SRC python :results value :exports both str_data = [(str(date), str(inc)) for date, inc in data] [('date', 'inc'), None] + str_data[:5] + [None] + str_data[-5:] #+END_SRC ** Vérification des dates Nous faisons encore une vérification: nos dates doivent être séparées d'exactement une semaine, sauf autour du point manquant. -#+BEGIN_SRC python :results output +#+BEGIN_SRC python :results output :exports both dates = [date for date, _ in data] for date1, date2 in zip(dates[:-1], dates[1:]): if date2-date1 != datetime.timedelta(weeks=1): @@ -174,24 +174,24 @@ Nous passons au langage R pour inspecter nos données, parce que l'analyse et la Nous utilisons le mécanisme d'échange de données proposé par org-mode, ce qui nécessite un peu de code Python pour transformer les données dans le bon format. #+NAME: data-for-R -#+BEGIN_SRC python :results silent +#+BEGIN_SRC python :results silent :exports both [('date', 'inc'), None] + [(str(date), inc) for date, inc in data] #+END_SRC En R, les données arrivent sous forme d'un data frame, mais il faut encore convertir les dates, qui arrivent comme chaînes de caractères. -#+BEGIN_SRC R :results output :var data=data-for-R +#+BEGIN_SRC R :results output :var data=data-for-R :exports both data$date <- as.Date(data$date) summary(data) #+END_SRC ** Inspection Regardons enfin à quoi ressemblent nos données ! -#+BEGIN_SRC R :results output graphics :file inc-plot.png +#+BEGIN_SRC R :results output graphics :file inc-plot.png :exports both plot(data, type="l", xlab="Date", ylab="Incidence hebdomadaire") #+END_SRC Un zoom sur les dernières années montre mieux la situation des pics en hiver. Le creux des incidences se trouve en été. -#+BEGIN_SRC R :results output graphics :file inc-plot-zoom.png +#+BEGIN_SRC R :results output graphics :file inc-plot-zoom.png :exports both plot(tail(data, 200), type="l", xlab="Date", ylab="Incidence hebdomadaire") #+END_SRC @@ -201,7 +201,7 @@ plot(tail(data, 200), type="l", xlab="Date", ylab="Incidence hebdomadaire") Étant donné que le pic de l'épidémie se situe en hiver, à cheval entre deux années civiles, nous définissons la période de référence entre deux minima de l'incidence, du 1er août de l'année /N/ au 1er août de l'année /N+1/. Nous mettons l'année /N+1/ comme étiquette sur cette année décalée, car le pic de l'épidémie est toujours au début de l'année /N+1/. Comme l'incidence du syndrome grippal est très faible en été, cette modification ne risque pas de fausser nos conclusions. Voici une fonction qui calcule l'incidence annuelle en appliquant ces conventions. -#+BEGIN_SRC R :results silent +#+BEGIN_SRC R :results silent :exports both pic_annuel = function(annee) { debut = paste0(annee-1,"-08-01") fin = paste0(annee,"-08-01") @@ -211,11 +211,11 @@ pic_annuel = function(annee) { #+END_SRC Nous devons aussi faire attention aux premières et dernières années de notre jeux de données. Les données commencent en octobre 1984, ce qui ne permet pas de quantifier complètement le pic attribué à l'année 1985. Nous le supprimons donc de notre analyse. Par contre, les données se terminent après le 1er août 2018 (pour une exécution après cette date bien sûr), ce qui nous permet d'inclure cette année dans l'analyse. -#+BEGIN_SRC R :results silent +#+BEGIN_SRC R :results silent :exports both annees <- 1986:2018 #+END_SRC -#+BEGIN_SRC R :results value +#+BEGIN_SRC R :results value :exports both inc_annuelle = data.frame(annee = annees, incidence = sapply(annees, pic_annuel)) head(inc_annuelle) @@ -223,17 +223,17 @@ head(inc_annuelle) ** Inspection Voici les incidences annuelles en graphique. -#+BEGIN_SRC R :results output graphics :file annual-inc-plot.png +#+BEGIN_SRC R :results output graphics :file annual-inc-plot.png :exports both plot(inc_annuelle, type="p", xlab="Année", ylab="Incidence annuelle") #+END_SRC ** Identification des épidémies les plus fortes Une liste triée par ordre décroissant d'incidence annuelle permet de plus facilement repérer les valeurs les plus élevées: -#+BEGIN_SRC R :results output +#+BEGIN_SRC R :results output :exports both head(inc_annuelle[order(-inc_annuelle$incidence),]) #+END_SRC Enfin, un histogramme montre bien que les épidémies fortes, qui touchent environ 10% de la population française, sont assez rares: il y en eu trois au cours des 35 dernières années. -#+BEGIN_SRC R :results output graphics :file annual-inc-hist.png +#+BEGIN_SRC R :results output graphics :file annual-inc-hist.png :exports both hist(inc_annuelle$incidence, breaks=10, xlab="Incidence annuelle", ylab="Nb d'observations", main="") #+END_SRC diff --git a/module3/ressources/analyse-syndrome-grippal-orgmode+R.org b/module3/ressources/analyse-syndrome-grippal-orgmode+R.org index d2fc753e3d9ffc909fd48abf1a807cfba82c21e0..a7818fe7e6aa08c37a6c97b5394470f1549419e8 100644 --- a/module3/ressources/analyse-syndrome-grippal-orgmode+R.org +++ b/module3/ressources/analyse-syndrome-grippal-orgmode+R.org @@ -10,7 +10,7 @@ #+HTML_HEAD: #+HTML_HEAD: -#+PROPERTY: header-args :session :exports both +#+PROPERTY: header-args :session * Préface @@ -21,18 +21,18 @@ Une version plus ancienne d'Emacs devrait suffire, mais en ce cas il est prudent ** R 3.4 Nous n'utilisons que des fonctionnalités de base du langage R, une version antérieure devrait suffire. -#+BEGIN_SRC emacs-lisp :results output +#+BEGIN_SRC emacs-lisp :results output :exports both (unless (featurep 'ob-R) (print "Veuillez activer R dans org-babel (org-babel-do-languages) !")) #+END_SRC Pour la manipulation des dates au format ISO-8601, nous avons besoin du paquet ~parsedate~. -#+BEGIN_SRC R :results output +#+BEGIN_SRC R :results output :exports both library(parsedate) #+END_SRC Enfin, nous allons demander à R d'écrire les nombres décimaux en français — c'est-à-dire avec une virgule entre parties entière et décimale — et d'utiliser un nombre de colonnes plus large que celui utilisé par défaut. -#+BEGIN_SRC R :results silent +#+BEGIN_SRC R :results silent :exports both options(OutDec=",") options(width=150) #+END_SRC @@ -61,25 +61,25 @@ Voici l'explication des colonnes donnée sur [[https://ns.sentiweb.fr/incidence/ ** Téléchargement La première ligne du fichier CSV est un commentaire, que nous ignorons en précisant ~skip=1~. -#+BEGIN_SRC R :session :results silent :var url=data-url +#+BEGIN_SRC R :session :results silent :var url=data-url :exports both data = read.csv(trimws(url), skip=1) #+END_SRC Regardons ce que nous avons obtenu ! -#+BEGIN_SRC R :results output +#+BEGIN_SRC R :results output :exports both head(data) tail(data) #+END_SRC ** Vérification Il est toujours prudent de vérifier si les données semblent crédibles. Commençons par regarder s'il y a des points manquants dans ce jeu de données: -#+BEGIN_SRC R :results value +#+BEGIN_SRC R :results value :exports both na_records = apply(data, 1, function(x) any(is.na(x))) data[na_records,] #+END_SRC Voyons aussi comment R a interpreté nos données: -#+BEGIN_SRC R :results output +#+BEGIN_SRC R :results output :exports both class(data$week) class(data$inc) #+END_SRC @@ -88,7 +88,7 @@ Ce sont des entiers, tout va bien ! ** Conversions Pour faciliter les traitements suivants, nous remplaçons les numéros de semaine ISO par les dates qui correspondent aux lundis. D'abord, une petite fonction qui fait le travail: -#+BEGIN_SRC R :results silent +#+BEGIN_SRC R :results silent :exports both convert_week = function(w) { ws = paste(w) iso = paste0(substring(ws,1,4), "-W", substring(ws,5,6)) @@ -97,34 +97,34 @@ convert_week = function(w) { #+END_SRC Nous appliquons cette fonction à tous les points, créant une nouvelle colonne `date` dans notre jeu de données: -#+BEGIN_SRC R :results output +#+BEGIN_SRC R :results output :exports both data$date = as.Date(convert_week(data$week)) #+END_SRC Vérifions qu'elle est de classe `Date`: -#+BEGIN_SRC R :results output +#+BEGIN_SRC R :results output :exports both class(data$date) #+END_SRC Les points sont dans l'ordre chronologique inverse, il est donc utile de les trier: -#+BEGIN_SRC R :results output +#+BEGIN_SRC R :results output :exports both data = data[order(data$date),] #+END_SRC ** Vérification des dates Nous faisons encore une vérification: nos dates doivent être séparées d'exactement une semaine. -#+BEGIN_SRC R :results value +#+BEGIN_SRC R :results value :exports both all(diff(data$date) == 7) #+END_SRC ** Inspection Regardons enfin à quoi ressemblent nos données ! -#+BEGIN_SRC R :results output graphics :file inc-plot.png +#+BEGIN_SRC R :results output graphics :file inc-plot.png :exports both plot(data$date, data$inc, type="l", xlab="Date", ylab="Incidence hebdomadaire") #+END_SRC Un zoom sur les dernières années montre mieux la situation des pics en hiver. Le creux des incidences se trouve en été. -#+BEGIN_SRC R :results output graphics :file inc-plot-zoom.png +#+BEGIN_SRC R :results output graphics :file inc-plot-zoom.png :exports both with(tail(data, 200), plot(date, inc, type="l", xlab="Date", ylab="Incidence hebdomadaire")) #+END_SRC @@ -134,7 +134,7 @@ with(tail(data, 200), plot(date, inc, type="l", xlab="Date", ylab="Incidence heb Étant donné que le pic de l'épidémie se situe en hiver, à cheval entre deux années civiles, nous définissons la période de référence entre deux minima de l'incidence, du 1er août de l'année /N/ au 1er août de l'année /N+1/. Nous mettons l'année /N+1/ comme étiquette sur cette année décalée, car le pic de l'épidémie est toujours au début de l'année /N+1/. Comme l'incidence du syndrome grippal est très faible en été, cette modification ne risque pas de fausser nos conclusions. Voici une fonction qui calcule l'incidence annuelle en appliquant ces conventions. -#+BEGIN_SRC R :results silent +#+BEGIN_SRC R :results silent :exports both pic_annuel = function(annee) { debut = paste0(annee-1,"-08-01") fin = paste0(annee,"-08-01") @@ -144,12 +144,12 @@ pic_annuel = function(annee) { #+END_SRC Nous devons aussi faire attention aux premières et dernières années de notre jeux de données. Les données commencent en octobre 1984, ce qui ne permet pas de quantifier complètement le pic attribué à l'année 1985. Nous le supprimons donc de notre analyse. Par contre, pour une exécution en octobre 2018, les données se terminent après le 1er août 2018, ce qui nous permet d'inclure cette année. -#+BEGIN_SRC R :results silent +#+BEGIN_SRC R :results silent :exports both annees <- 1986:2018 #+END_SRC Nous créons un nouveau jeu de données pour l'incidence annuelle, en applicant la fonction `pic_annuel` à chaque année: -#+BEGIN_SRC R :results value +#+BEGIN_SRC R :results value :exports both inc_annuelle = data.frame(annee = annees, incidence = sapply(annees, pic_annuel)) head(inc_annuelle) @@ -157,17 +157,17 @@ head(inc_annuelle) ** Inspection Voici les incidences annuelles en graphique. -#+BEGIN_SRC R :results output graphics :file annual-inc-plot.png +#+BEGIN_SRC R :results output graphics :file annual-inc-plot.png :exports both plot(inc_annuelle, type="p", xlab="Année", ylab="Incidence annuelle") #+END_SRC ** Identification des épidémies les plus fortes Une liste triée par ordre décroissant d'incidence annuelle permet de plus facilement repérer les valeurs les plus élevées: -#+BEGIN_SRC R :results output +#+BEGIN_SRC R :results output :exports both head(inc_annuelle[order(-inc_annuelle$incidence),]) #+END_SRC Enfin, un histogramme montre bien que les épidémies fortes, qui touchent environ 10% de la population française, sont assez rares: il y en eu trois au cours des 35 dernières années. -#+BEGIN_SRC R :results output graphics :file annual-inc-hist.png +#+BEGIN_SRC R :results output graphics :file annual-inc-hist.png :exports both hist(inc_annuelle$incidence, breaks=10, xlab="Incidence annuelle", ylab="Nb d'observations", main="") #+END_SRC diff --git a/module3/ressources/influenza-like-illness-analysis-orgmode+Lisp+Python+R.org b/module3/ressources/influenza-like-illness-analysis-orgmode+Lisp+Python+R.org index 51165faf4f4bc9e2dc92c670cc94ce237d970346..910773c9bd92a7da0ab4c1c62c0e2dec0b87d1ab 100644 --- a/module3/ressources/influenza-like-illness-analysis-orgmode+Lisp+Python+R.org +++ b/module3/ressources/influenza-like-illness-analysis-orgmode+Lisp+Python+R.org @@ -10,7 +10,7 @@ #+HTML_HEAD: #+HTML_HEAD: -#+PROPERTY: header-args :session :exports both +#+PROPERTY: header-args :session * Foreword @@ -20,13 +20,13 @@ For running this analysis, you need the following software: Older versions may suffice. For Emacs versions older than 26, org-mode must be updated to version 9.x. ** Python 3.6 or higher We use the ISO 8601 date format, which has been added to Python's standard library with version 3.6. -#+BEGIN_SRC python :results output +#+BEGIN_SRC python :results output :exports both import sys if sys.version_info.major < 3 or sys.version_info.minor < 6: print("Please use Python 3.6 (or higher)!") #+END_SRC -#+BEGIN_SRC emacs-lisp :results output +#+BEGIN_SRC emacs-lisp :results output :exports both (unless (featurep 'ob-python) (print "Please activate python in org-babel (org-babel-do-languages)!")) #+END_SRC @@ -34,7 +34,7 @@ if sys.version_info.major < 3 or sys.version_info.minor < 6: ** R 3.4 We use only basic R functionality, so a earlier version might be OK, but we did not test this. -#+BEGIN_SRC emacs-lisp :results output +#+BEGIN_SRC emacs-lisp :results output :exports both (unless (featurep 'ob-R) (print "Please activate R in org-babel (org-babel-do-languages)!")) #+END_SRC @@ -66,7 +66,7 @@ To avoid downloading the data multiple times, we keep it in a buffer, which is a #+NAME: data-buffer-name *influenza-like-illness data* -#+BEGIN_SRC emacs-lisp :results silent :var url=data-url :var name=data-buffer-name +#+BEGIN_SRC emacs-lisp :results silent :var url=data-url :var name=data-buffer-name :exports both (require 'url) (with-current-buffer (get-buffer-create name) (unless buffer-read-only @@ -76,7 +76,7 @@ To avoid downloading the data multiple times, we keep it in a buffer, which is a #+END_SRC The next step is the extraction of the data we need. We first split the file into lines, of which we discard the first one that contains a comment. We then split the remaining lines into columns. To identify missing data values, we check if a line contains at least one empty field. At the end of this code block two variables contain the missing data and the valid data, respectively. -#+BEGIN_SRC emacs-lisp :results silent :var name=data-buffer-name +#+BEGIN_SRC emacs-lisp :results silent :var name=data-buffer-name :exports both (require 'cl) (require 'dash) (defun missing-data? (row) @@ -91,19 +91,19 @@ The next step is the extraction of the data we need. We first split the file int #+END_SRC Let's look at the missing data... -#+BEGIN_SRC emacs-lisp +#+BEGIN_SRC emacs-lisp :exports both missing-data-lines #+END_SRC There are only two columns that we will need for our analysis: the first (~"week"~) and the third (~"inc"~). We insert ~hline~ as the second element of the table to tell org-mode to separate the header from the data. #+NAME: data -#+BEGIN_SRC emacs-lisp :results silent +#+BEGIN_SRC emacs-lisp :results silent :exports both (-insert-at 1 'hline (-select-columns '(0 2) valid-data-lines)) #+END_SRC Let's look at the first and last lines: -#+BEGIN_SRC emacs-lisp :results value :var data=data :colnames yes +#+BEGIN_SRC emacs-lisp :results value :var data=data :colnames yes :exports both (-concat (-take 5 data) '(hline) (-take-last 5 data)) @@ -112,7 +112,7 @@ Let's look at the first and last lines: ** Verification It is always prudent to verify if the data looks credible. A simple fact we can check for is that weeks are given as six-digit integers (four for the year, two for the week), that the first two are "19" or "20", and that the incidence values are positive integers. -#+BEGIN_SRC emacs-lisp :results output :var data=data :colnames yes +#+BEGIN_SRC emacs-lisp :results output :var data=data :colnames yes :exports both (defun check-week (week) (unless (string-match-p (rx (or "19" "20") (repeat 4 digit)) week) (princ (format "Invalid week value: %s\n" week)))) @@ -132,7 +132,7 @@ Everything seems to be OK ! ** Conversions In order to facilitate the subsequent treatment, we replace the ISO week numbers by the dates of each week's Monday. This is also a good occasion to sort the lines by increasing data, and to convert the incidences from strings to integers. We use the Python 3 language which is one of the few to have ISO week handling in its standard library. -#+BEGIN_SRC python :results silent :var input_data=data +#+BEGIN_SRC python :results silent :var input_data=data :exports both import datetime data = [(datetime.datetime.strptime(year_and_week + ":1" , '%G%V:%u').date(), int(inc)) @@ -141,14 +141,14 @@ data.sort(key = lambda record: record[0]) #+END_SRC We'll look again at the first and last lines: -#+BEGIN_SRC python :results value +#+BEGIN_SRC python :results value :exports both str_data = [(str(date), str(inc)) for date, inc in data] [('date', 'inc'), None] + str_data[:5] + [None] + str_data[-5:] #+END_SRC ** Date verification We do one more verification: our dates must be separated by exactly one week, except around the missing data point. -#+BEGIN_SRC python :results output +#+BEGIN_SRC python :results output :exports both dates = [date for date, _ in data] for date1, date2 in zip(dates[:-1], dates[1:]): if date2-date1 != datetime.timedelta(weeks=1): @@ -160,24 +160,24 @@ We switch to R for data inspection and analysis, because the code is more concis Org-mode's data exchange mechanism requires some Python code for transforming the data to the right format. #+NAME: data-for-R -#+BEGIN_SRC python :results silent +#+BEGIN_SRC python :results silent :exports both [('date', 'inc'), None] + [(str(date), inc) for date, inc in converted_data] #+END_SRC In R, the dataset arrives as a data frame, which is fine. But the dates arrive as strings and must be converted. -#+BEGIN_SRC R :results output :var data=data-for-R +#+BEGIN_SRC R :results output :var data=data-for-R :exports both data$date <- as.Date(data$date) summary(data) #+END_SRC ** Inspection Finally we can look at a plot of our data! -#+BEGIN_SRC R :results output graphics :file inc-plot.png +#+BEGIN_SRC R :results output graphics :file inc-plot.png :exports both plot(data, type="l", xlab="Date", ylab="Weekly incidence") #+END_SRC A zoom on the last few years makes the peaks in winter stand out more clearly. -#+BEGIN_SRC R :results output graphics :file inc-plot-zoom.png +#+BEGIN_SRC R :results output graphics :file inc-plot-zoom.png :exports both plot(tail(data, 200), type="l", xlab="Date", ylab="Weekly incidence") #+END_SRC @@ -187,7 +187,7 @@ plot(tail(data, 200), type="l", xlab="Date", ylab="Weekly incidence") Since the peaks of the epidemic happen in winter, near the transition between calendar years, we define the reference period for the annual incidence from August 1st of year /N/ to August 1st of year /N+1/. We label this period as year /N+1/ because the peak is always located in year /N+1/. The very low incidence in summer ensures that the arbitrariness of the choice of reference period has no impact on our conclusions. This R function computes the annual incidence as defined above: -#+BEGIN_SRC R :results silent +#+BEGIN_SRC R :results silent :exports both yearly_peak = function(year) { debut = paste0(year-1,"-08-01") fin = paste0(year,"-08-01") @@ -197,12 +197,12 @@ yearly_peak = function(year) { #+END_SRC We must also be careful with the first and last years of the dataset. The data start in October 1984, meaning that we don't have all the data for the peak attributed to the year 1985. We therefore exclude it from the analysis. For the same reason, we define 2018 as the final year. We can increase this value to 2019 only when all data up to July 2019 is available. -#+BEGIN_SRC R :results silent +#+BEGIN_SRC R :results silent :exports both years <- 1986:2018 #+END_SRC We make a new data frame for the annual incidence, applying the function ~yearly_peak~ to each year: -#+BEGIN_SRC R :results value +#+BEGIN_SRC R :results value :exports both annnual_inc = data.frame(year = years, incidence = sapply(years, yearly_peak)) head(annnual_inc) @@ -210,17 +210,17 @@ head(annnual_inc) ** Inspection A plot of the annual incidence: -#+BEGIN_SRC R :results output graphics :file annual-inc-plot.png +#+BEGIN_SRC R :results output graphics :file annual-inc-plot.png :exports both plot(annnual_inc, type="p", xlab="Année", ylab="Annual incidence") #+END_SRC ** Identification of the strongest epidemics A list sorted by decreasing annual incidence makes it easy to find the most important ones: -#+BEGIN_SRC R :results output +#+BEGIN_SRC R :results output :exports both head(annnual_inc[order(-annnual_inc$incidence),]) #+END_SRC Finally, a histogram clearly shows the few very strong epidemics, which affect about 10% of the French population, but are rare: there were three of them in the course of 35 years. The typical epidemic affects only half as many people. -#+BEGIN_SRC R :results output graphics :file annual-inc-hist.png +#+BEGIN_SRC R :results output graphics :file annual-inc-hist.png :exports both hist(annnual_inc$incidence, breaks=10, xlab="Annual incidence", ylab="Number of observations", main="") #+END_SRC diff --git a/module3/ressources/influenza-like-illness-analysis-orgmode+R.org b/module3/ressources/influenza-like-illness-analysis-orgmode+R.org index 9adac912f6acfe587f940622a0e8890359a55d15..f84b97f8acec24dbb974ca820010b8c02e6e94b5 100644 --- a/module3/ressources/influenza-like-illness-analysis-orgmode+R.org +++ b/module3/ressources/influenza-like-illness-analysis-orgmode+R.org @@ -10,7 +10,7 @@ #+HTML_HEAD: #+HTML_HEAD: -#+PROPERTY: header-args :session :exports both +#+PROPERTY: header-args :session * Foreword @@ -21,13 +21,13 @@ Older versions may suffice. For Emacs versions older than 26, org-mode must be u ** R 3.4 We use only basic R functionality, so a earlier version might be OK, but we did not test this. -#+BEGIN_SRC emacs-lisp :results output +#+BEGIN_SRC emacs-lisp :results output :exports both (unless (featurep 'ob-R) (print "Please activate R in org-babel (org-babel-do-languages)!")) #+END_SRC For handling dates in ISO-8601 format, we need the library ~parsedate~. -#+BEGIN_SRC R :results output +#+BEGIN_SRC R :results output :exports both library(parsedate) #+END_SRC @@ -55,25 +55,25 @@ This is the documentation of the data from [[https://ns.sentiweb.fr/incidence/cs ** Download The first line of the CSV file is a comment, which we ignore with ~skip=1~. -#+BEGIN_SRC R :session :results silent :var url=data-url +#+BEGIN_SRC R :session :results silent :var url=data-url :exports both data = read.csv(trimws(url), skip=1) #+END_SRC Let's see what we got! -#+BEGIN_SRC R :results output +#+BEGIN_SRC R :results output :exports both head(data) tail(data) #+END_SRC ** Verification Are there missing data points? -#+BEGIN_SRC R :results value +#+BEGIN_SRC R :results value :exports both na_records = apply(data, 1, function(x) any(is.na(x))) data[na_records,] #+END_SRC The two relevant columns for us are `week` and `inc`. Let's verify their classes: -#+BEGIN_SRC R :results output +#+BEGIN_SRC R :results output :exports both class(data$week) class(data$inc) #+END_SRC @@ -82,7 +82,7 @@ Integers, fine! ** Conversions In order to facilitate the subsequent treatment, we replace the ISO week numbers by the dates of each week's Monday. This function does it for one value: -#+BEGIN_SRC R :results silent +#+BEGIN_SRC R :results silent :exports both convert_week = function(w) { ws = paste(w) iso = paste0(substring(ws,1,4), "-W", substring(ws,5,6)) @@ -91,34 +91,34 @@ convert_week = function(w) { #+END_SRC We apply it to all points, creating a new column ~date~ in our data frame: -#+BEGIN_SRC R :results output +#+BEGIN_SRC R :results output :exports both data$date = as.Date(convert_week(data$week)) #+END_SRC Let's check that is has class ~Date~: -#+BEGIN_SRC R :results output +#+BEGIN_SRC R :results output :exports both class(data$date) #+END_SRC The points are in inverse chronological order, so it's preferable to sort them: -#+BEGIN_SRC R :results output +#+BEGIN_SRC R :results output :exports both data = data[order(data$date),] #+END_SRC ** Verification of the dates Another check: our dates should be separated by exactly seven days: -#+BEGIN_SRC R :results value +#+BEGIN_SRC R :results value :exports both all(diff(data$date) == 7) #+END_SRC ** Inspection Finally we can look at a plot of our data! -#+BEGIN_SRC R :results output graphics :file inc-plot.png +#+BEGIN_SRC R :results output graphics :file inc-plot.png :exports both plot(data$date, data$inc, type="l", xlab="Date", ylab="Incidence hebdomadaire") #+END_SRC A zoom on the last few years makes the peaks in winter stand out more clearly. -#+BEGIN_SRC R :results output graphics :file inc-plot-zoom.png +#+BEGIN_SRC R :results output graphics :file inc-plot-zoom.png :exports both with(tail(data, 200), plot(date, inc, type="l", xlab="Date", ylab="Incidence hebdomadaire")) #+END_SRC @@ -128,7 +128,7 @@ with(tail(data, 200), plot(date, inc, type="l", xlab="Date", ylab="Incidence heb Since the peaks of the epidemic happen in winter, near the transition between calendar years, we define the reference period for the annual incidence from August 1st of year /N/ to August 1st of year /N+1/. We label this period as year /N+1/ because the peak is always located in year /N+1/. The very low incidence in summer ensures that the arbitrariness of the choice of reference period has no impact on our conclusions. This R function computes the annual incidence as defined above: -#+BEGIN_SRC R :results silent +#+BEGIN_SRC R :results silent :exports both yearly_peak = function(year) { debut = paste0(year-1,"-08-01") fin = paste0(year,"-08-01") @@ -138,12 +138,12 @@ yearly_peak = function(year) { #+END_SRC We must also be careful with the first and last years of the dataset. The data start in October 1984, meaning that we don't have all the data for the peak attributed to the year 1985. We therefore exclude it from the analysis. For the same reason, we define 2018 as the final year. We can increase this value to 2019 only when all data up to July 2019 is available. -#+BEGIN_SRC R :results silent +#+BEGIN_SRC R :results silent :exports both years <- 1986:2018 #+END_SRC We make a new data frame for the annual incidence, applying the function ~yearly_peak~ to each year: -#+BEGIN_SRC R :results value +#+BEGIN_SRC R :results value :exports both annnual_inc = data.frame(year = years, incidence = sapply(years, yearly_peak)) head(annnual_inc) @@ -151,17 +151,17 @@ head(annnual_inc) ** Inspection A plot of the annual incidence: -#+BEGIN_SRC R :results output graphics :file annual-inc-plot.png +#+BEGIN_SRC R :results output graphics :file annual-inc-plot.png :exports both plot(annnual_inc, type="p", xlab="Année", ylab="Annual incidence") #+END_SRC ** Identification of the strongest epidemics A list sorted by decreasing annual incidence makes it easy to find the most important ones: -#+BEGIN_SRC R :results output +#+BEGIN_SRC R :results output :exports both head(annnual_inc[order(-annnual_inc$incidence),]) #+END_SRC Finally, a histogram clearly shows the few very strong epidemics, which affect about 10% of the French population, but are rare: there were three of them in the course of 35 years. The typical epidemic affects only half as many people. -#+BEGIN_SRC R :results output graphics :file annual-inc-hist.png +#+BEGIN_SRC R :results output graphics :file annual-inc-hist.png :exports both hist(annnual_inc$incidence, breaks=10, xlab="Annual incidence", ylab="Number of observations", main="") #+END_SRC diff --git a/module3/ressources/influenza-like-illness-analysis-orgmode.org b/module3/ressources/influenza-like-illness-analysis-orgmode.org index 9ecafaf46df9393d37f1160252eafa45932aa2e4..9119b572debdee863c341ea9eb470ab46d9886aa 100644 --- a/module3/ressources/influenza-like-illness-analysis-orgmode.org +++ b/module3/ressources/influenza-like-illness-analysis-orgmode.org @@ -10,7 +10,7 @@ #+HTML_HEAD: #+HTML_HEAD: -#+PROPERTY: header-args :session :exports both +#+PROPERTY: header-args :session * Foreword @@ -20,7 +20,7 @@ For running this analysis, you need the following software: Older versions may suffice. For Emacs versions older than 26, org-mode must be updated to version 9.x. ** Python 3.6 or higher We use the ISO 8601 date format, which has been added to Python's standard library with version 3.6. -#+BEGIN_SRC python :results output +#+BEGIN_SRC python :results output :exports both import sys if sys.version_info.major < 3 or sys.version_info.minor < 6: print("Please use Python 3.6 (or higher)!") @@ -28,7 +28,7 @@ if sys.version_info.major < 3 or sys.version_info.minor < 6: #+RESULTS: -#+BEGIN_SRC emacs-lisp :results output +#+BEGIN_SRC emacs-lisp :results output :exports both (unless (featurep 'ob-python) (print "Please activate python in org-babel (org-babel-do-languages)!")) #+END_SRC @@ -38,7 +38,7 @@ if sys.version_info.major < 3 or sys.version_info.minor < 6: ** R 3.4 We use only basic R functionality, so a earlier version might be OK, but we did not test this. -#+BEGIN_SRC emacs-lisp :results output +#+BEGIN_SRC emacs-lisp :results output :exports both (unless (featurep 'ob-R) (print "Please activate R in org-babel (org-babel-do-languages)!")) #+END_SRC @@ -71,7 +71,7 @@ The [[https://en.wikipedia.org/wiki/ISO_8601][ISO-8601]] format is popular in Eu ** Download After downloading the raw data, we extract the part we are interested in. We first split the file into lines, of which we discard the first one that contains a comment. We then split the remaining lines into columns. -#+BEGIN_SRC python :results silent :var data_url=data-url +#+BEGIN_SRC python :results silent :var data_url=data-url :exports both from urllib.request import urlopen data = urlopen(data_url).read() @@ -81,7 +81,7 @@ table = [line.split(',') for line in data_lines] #+END_SRC Let's have a look at what we have so far: -#+BEGIN_SRC python :results value +#+BEGIN_SRC python :results value :exports both table[:5] #+END_SRC @@ -96,7 +96,7 @@ table[:5] Unfortunately there are many ways to indicate the absence of a data value in a dataset. Here we check for a common one: empty fields. For completeness, we should also look for non-numerical data in numerical columns. We don't do this here, but checks in later processing steps would catch such anomalies. We make a new dataset without the lines that contain empty fields. We print those lines to preserve a trace of their contents. -#+BEGIN_SRC python :results output +#+BEGIN_SRC python :results output :exports both valid_table = [] for row in table: missing = any([column == '' for column in row]) @@ -111,7 +111,7 @@ for row in table: ** Extraction of the required columns There are only two columns that we will need for our analysis: the first (~"week"~) and the third (~"inc"~). We check the names in the header to be sure we pick the right data. We make a new table containing just the two columns required, without the header. -#+BEGIN_SRC python :results silent +#+BEGIN_SRC python :results silent :exports both week = [row[0] for row in valid_table] assert week[0] == 'week' del week[0] @@ -122,7 +122,7 @@ data = list(zip(week, inc)) #+END_SRC Let's look at the first and last lines. We insert ~None~ to indicate to org-mode the separation between the three parts of the table: header, first lines, last lines. -#+BEGIN_SRC python :results value +#+BEGIN_SRC python :results value :exports both [('week', 'inc'), None] + data[:5] + [None] + data[-5:] #+END_SRC @@ -143,7 +143,7 @@ Let's look at the first and last lines. We insert ~None~ to indicate to org-mode ** Verification It is always prudent to verify if the data looks credible. A simple fact we can check for is that weeks are given as six-digit integers (four for the year, two for the week), and that the incidence values are positive integers. -#+BEGIN_SRC python :results output +#+BEGIN_SRC python :results output :exports both for week, inc in data: if len(week) != 6 or not week.isdigit(): print("Suspicious value in column 'week': ", (week, inc)) @@ -158,7 +158,7 @@ No problem - fine! ** Date conversion In order to facilitate the subsequent treatment, we replace the ISO week numbers by the dates of each week's Monday. This is also a good occasion to sort the lines by increasing data, and to convert the incidences from strings to integers. -#+BEGIN_SRC python :results silent +#+BEGIN_SRC python :results silent :exports both import datetime converted_data = [(datetime.datetime.strptime(year_and_week + ":1" , '%G%V:%u').date(), int(inc)) @@ -167,7 +167,7 @@ converted_data.sort(key = lambda record: record[0]) #+END_SRC We'll look again at the first and last lines: -#+BEGIN_SRC python :results value +#+BEGIN_SRC python :results value :exports both str_data = [(str(date), str(inc)) for date, inc in converted_data] [('date', 'inc'), None] + str_data[:5] + [None] + str_data[-5:] #+END_SRC @@ -189,7 +189,7 @@ str_data = [(str(date), str(inc)) for date, inc in converted_data] ** Date verification We do one more verification: our dates must be separated by exactly one week, except around the missing data point. -#+BEGIN_SRC python :results output +#+BEGIN_SRC python :results output :exports both dates = [date for date, _ in converted_data] for date1, date2 in zip(dates[:-1], dates[1:]): if date2-date1 != datetime.timedelta(weeks=1): @@ -204,12 +204,12 @@ We switch to R for data inspection and analysis, because the code is more concis Org-mode's data exchange mechanism requires some Python code for transforming the data to the right format. #+NAME: data-for-R -#+BEGIN_SRC python :results silent +#+BEGIN_SRC python :results silent :exports both [('date', 'inc'), None] + [(str(date), inc) for date, inc in converted_data] #+END_SRC In R, the dataset arrives as a data frame, which is fine. But the dates arrive as strings and must be converted. -#+BEGIN_SRC R :results output :var data=data-for-R +#+BEGIN_SRC R :results output :var data=data-for-R :exports both data$date <- as.Date(data$date) summary(data) #+END_SRC @@ -225,7 +225,7 @@ summary(data) ** Inspection Finally we can look at a plot of our data! -#+BEGIN_SRC R :results output graphics :file inc-plot.png +#+BEGIN_SRC R :results output graphics :file inc-plot.png :exports both plot(data, type="l", xlab="Date", ylab="Weekly incidence") #+END_SRC @@ -233,7 +233,7 @@ plot(data, type="l", xlab="Date", ylab="Weekly incidence") [[file:inc-plot.png]] A zoom on the last few years makes the peaks in winter stand out more clearly. -#+BEGIN_SRC R :results output graphics :file inc-plot-zoom.png +#+BEGIN_SRC R :results output graphics :file inc-plot-zoom.png :exports both plot(tail(data, 200), type="l", xlab="Date", ylab="Weekly incidence") #+END_SRC @@ -246,7 +246,7 @@ plot(tail(data, 200), type="l", xlab="Date", ylab="Weekly incidence") Since the peaks of the epidemic happen in winter, near the transition between calendar years, we define the reference period for the annual incidence from August 1st of year /N/ to August 1st of year /N+1/. We label this period as year /N+1/ because the peak is always located in year /N+1/. The very low incidence in summer ensures that the arbitrariness of the choice of reference period has no impact on our conclusions. This R function computes the annual incidence as defined above: -#+BEGIN_SRC R :results silent +#+BEGIN_SRC R :results silent :exports both yearly_peak = function(year) { debut = paste0(year-1,"-08-01") fin = paste0(year,"-08-01") @@ -256,12 +256,12 @@ yearly_peak = function(year) { #+END_SRC We must also be careful with the first and last years of the dataset. The data start in October 1984, meaning that we don't have all the data for the peak attributed to the year 1985. We therefore exclude it from the analysis. For the same reason, we define 2018 as the final year. We can increase this value to 2019 only when all data up to July 2019 is available. -#+BEGIN_SRC R :results silent +#+BEGIN_SRC R :results silent :exports both years <- 1986:2018 #+END_SRC We make a new data frame for the annual incidence, applying the function ~yearly_peak~ to each year: -#+BEGIN_SRC R :results value +#+BEGIN_SRC R :results value :exports both annnual_inc = data.frame(year = years, incidence = sapply(years, yearly_peak)) head(annnual_inc) @@ -277,7 +277,7 @@ head(annnual_inc) ** Inspection A plot of the annual incidence: -#+BEGIN_SRC R :results output graphics :file annual-inc-plot.png +#+BEGIN_SRC R :results output graphics :file annual-inc-plot.png :exports both plot(annnual_inc, type="p", xlab="Année", ylab="Annual incidence") #+END_SRC @@ -286,7 +286,7 @@ plot(annnual_inc, type="p", xlab="Année", ylab="Annual incidence") ** Identification of the strongest epidemics A list sorted by decreasing annual incidence makes it easy to find the most important ones: -#+BEGIN_SRC R :results output +#+BEGIN_SRC R :results output :exports both head(annnual_inc[order(-annnual_inc$incidence),]) #+END_SRC @@ -300,7 +300,7 @@ head(annnual_inc[order(-annnual_inc$incidence),]) : 14 1999 3897443 Finally, a histogram clearly shows the few very strong epidemics, which affect about 10% of the French population, but are rare: there were three of them in the course of 35 years. The typical epidemic affects only half as many people. -#+BEGIN_SRC R :results output graphics :file annual-inc-hist.png +#+BEGIN_SRC R :results output graphics :file annual-inc-hist.png :exports both hist(annnual_inc$incidence, breaks=10, xlab="Annual incidence", ylab="Number of observations", main="") #+END_SRC