Commit b748e4ca authored by Konrad Hinsen's avatar Konrad Hinsen

Ça marche, alors faisons pareil pour les autres versions

parent 1da1ab19
...@@ -10,7 +10,7 @@ ...@@ -10,7 +10,7 @@
#+HTML_HEAD: <script type="text/javascript" src="http://www.pirilampo.org/styles/lib/js/jquery.stickytableheaders.js"></script> #+HTML_HEAD: <script type="text/javascript" src="http://www.pirilampo.org/styles/lib/js/jquery.stickytableheaders.js"></script>
#+HTML_HEAD: <script type="text/javascript" src="http://www.pirilampo.org/styles/readtheorg/js/readtheorg.js"></script> #+HTML_HEAD: <script type="text/javascript" src="http://www.pirilampo.org/styles/readtheorg/js/readtheorg.js"></script>
#+PROPERTY: header-args :session :exports both #+PROPERTY: header-args :session
* Préface * Préface
...@@ -21,18 +21,18 @@ Une version plus ancienne d'Emacs devrait suffire, mais en ce cas il est prudent ...@@ -21,18 +21,18 @@ Une version plus ancienne d'Emacs devrait suffire, mais en ce cas il est prudent
** R 3.4 ** R 3.4
Nous n'utilisons que des fonctionnalités de base du langage R, une version antérieure devrait suffire. Nous n'utilisons que des fonctionnalités de base du langage R, une version antérieure devrait suffire.
#+BEGIN_SRC emacs-lisp :results output #+BEGIN_SRC emacs-lisp :results output :exports both
(unless (featurep 'ob-R) (unless (featurep 'ob-R)
(print "Veuillez activer R dans org-babel (org-babel-do-languages) !")) (print "Veuillez activer R dans org-babel (org-babel-do-languages) !"))
#+END_SRC #+END_SRC
Pour la manipulation des dates au format ISO-8601, nous avons besoin du paquet ~parsedate~. Pour la manipulation des dates au format ISO-8601, nous avons besoin du paquet ~parsedate~.
#+BEGIN_SRC R :results output #+BEGIN_SRC R :results output :exports both
library(parsedate) library(parsedate)
#+END_SRC #+END_SRC
Enfin, nous allons demander à R d'écrire les nombres décimaux en français — c'est-à-dire avec une virgule entre parties entière et décimale — et d'utiliser un nombre de colonnes plus large que celui utilisé par défaut. Enfin, nous allons demander à R d'écrire les nombres décimaux en français — c'est-à-dire avec une virgule entre parties entière et décimale — et d'utiliser un nombre de colonnes plus large que celui utilisé par défaut.
#+BEGIN_SRC R :results silent #+BEGIN_SRC R :results silent :exports both
options(OutDec=",") options(OutDec=",")
options(width=150) options(width=150)
#+END_SRC #+END_SRC
...@@ -61,25 +61,25 @@ Voici l'explication des colonnes donnée sur [[https://ns.sentiweb.fr/incidence/ ...@@ -61,25 +61,25 @@ Voici l'explication des colonnes donnée sur [[https://ns.sentiweb.fr/incidence/
** Téléchargement ** Téléchargement
La première ligne du fichier CSV est un commentaire, que nous ignorons en précisant ~skip=1~. La première ligne du fichier CSV est un commentaire, que nous ignorons en précisant ~skip=1~.
#+BEGIN_SRC R :session :results silent :var url=data-url #+BEGIN_SRC R :session :results silent :var url=data-url :exports both
data = read.csv(trimws(url), skip=1) data = read.csv(trimws(url), skip=1)
#+END_SRC #+END_SRC
Regardons ce que nous avons obtenu ! Regardons ce que nous avons obtenu !
#+BEGIN_SRC R :results output #+BEGIN_SRC R :results output :exports both
head(data) head(data)
tail(data) tail(data)
#+END_SRC #+END_SRC
** Vérification ** Vérification
Il est toujours prudent de vérifier si les données semblent crédibles. Commençons par regarder s'il y a des points manquants dans ce jeu de données: Il est toujours prudent de vérifier si les données semblent crédibles. Commençons par regarder s'il y a des points manquants dans ce jeu de données:
#+BEGIN_SRC R :results value #+BEGIN_SRC R :results value :exports both
na_records = apply(data, 1, function(x) any(is.na(x))) na_records = apply(data, 1, function(x) any(is.na(x)))
data[na_records,] data[na_records,]
#+END_SRC #+END_SRC
Voyons aussi comment R a interpreté nos données: Voyons aussi comment R a interpreté nos données:
#+BEGIN_SRC R :results output #+BEGIN_SRC R :results output :exports both
class(data$week) class(data$week)
class(data$inc) class(data$inc)
#+END_SRC #+END_SRC
...@@ -88,7 +88,7 @@ Ce sont des entiers, tout va bien ! ...@@ -88,7 +88,7 @@ Ce sont des entiers, tout va bien !
** Conversions ** Conversions
Pour faciliter les traitements suivants, nous remplaçons les numéros de semaine ISO par les dates qui correspondent aux lundis. D'abord, une petite fonction qui fait le travail: Pour faciliter les traitements suivants, nous remplaçons les numéros de semaine ISO par les dates qui correspondent aux lundis. D'abord, une petite fonction qui fait le travail:
#+BEGIN_SRC R :results silent #+BEGIN_SRC R :results silent :exports both
convert_week = function(w) { convert_week = function(w) {
ws = paste(w) ws = paste(w)
iso = paste0(substring(ws,1,4), "-W", substring(ws,5,6)) iso = paste0(substring(ws,1,4), "-W", substring(ws,5,6))
...@@ -97,34 +97,34 @@ convert_week = function(w) { ...@@ -97,34 +97,34 @@ convert_week = function(w) {
#+END_SRC #+END_SRC
Nous appliquons cette fonction à tous les points, créant une nouvelle colonne `date` dans notre jeu de données: Nous appliquons cette fonction à tous les points, créant une nouvelle colonne `date` dans notre jeu de données:
#+BEGIN_SRC R :results output #+BEGIN_SRC R :results output :exports both
data$date = as.Date(convert_week(data$week)) data$date = as.Date(convert_week(data$week))
#+END_SRC #+END_SRC
Vérifions qu'elle est de classe `Date`: Vérifions qu'elle est de classe `Date`:
#+BEGIN_SRC R :results output #+BEGIN_SRC R :results output :exports both
class(data$date) class(data$date)
#+END_SRC #+END_SRC
Les points sont dans l'ordre chronologique inverse, il est donc utile de les trier: Les points sont dans l'ordre chronologique inverse, il est donc utile de les trier:
#+BEGIN_SRC R :results output #+BEGIN_SRC R :results output :exports both
data = data[order(data$date),] data = data[order(data$date),]
#+END_SRC #+END_SRC
** Vérification des dates ** Vérification des dates
Nous faisons encore une vérification: nos dates doivent être séparées d'exactement une semaine. Nous faisons encore une vérification: nos dates doivent être séparées d'exactement une semaine.
#+BEGIN_SRC R :results value #+BEGIN_SRC R :results value :exports both
all(diff(data$date) == 7) all(diff(data$date) == 7)
#+END_SRC #+END_SRC
** Inspection ** Inspection
Regardons enfin à quoi ressemblent nos données ! Regardons enfin à quoi ressemblent nos données !
#+BEGIN_SRC R :results output graphics :file inc-plot.png #+BEGIN_SRC R :results output graphics :file inc-plot.png :exports both
plot(data$date, data$inc, type="l", xlab="Date", ylab="Incidence hebdomadaire") plot(data$date, data$inc, type="l", xlab="Date", ylab="Incidence hebdomadaire")
#+END_SRC #+END_SRC
Un zoom sur les dernières années montre mieux la situation des pics en hiver. Le creux des incidences se trouve en été. Un zoom sur les dernières années montre mieux la situation des pics en hiver. Le creux des incidences se trouve en été.
#+BEGIN_SRC R :results output graphics :file inc-plot-zoom.png #+BEGIN_SRC R :results output graphics :file inc-plot-zoom.png :exports both
with(tail(data, 200), plot(date, inc, type="l", xlab="Date", ylab="Incidence hebdomadaire")) with(tail(data, 200), plot(date, inc, type="l", xlab="Date", ylab="Incidence hebdomadaire"))
#+END_SRC #+END_SRC
...@@ -134,7 +134,7 @@ with(tail(data, 200), plot(date, inc, type="l", xlab="Date", ylab="Incidence heb ...@@ -134,7 +134,7 @@ with(tail(data, 200), plot(date, inc, type="l", xlab="Date", ylab="Incidence heb
Étant donné que le pic de l'épidémie se situe en hiver, à cheval entre deux années civiles, nous définissons la période de référence entre deux minima de l'incidence, du 1er août de l'année /N/ au 1er août de l'année /N+1/. Nous mettons l'année /N+1/ comme étiquette sur cette année décalée, car le pic de l'épidémie est toujours au début de l'année /N+1/. Comme l'incidence du syndrome grippal est très faible en été, cette modification ne risque pas de fausser nos conclusions. Étant donné que le pic de l'épidémie se situe en hiver, à cheval entre deux années civiles, nous définissons la période de référence entre deux minima de l'incidence, du 1er août de l'année /N/ au 1er août de l'année /N+1/. Nous mettons l'année /N+1/ comme étiquette sur cette année décalée, car le pic de l'épidémie est toujours au début de l'année /N+1/. Comme l'incidence du syndrome grippal est très faible en été, cette modification ne risque pas de fausser nos conclusions.
Voici une fonction qui calcule l'incidence annuelle en appliquant ces conventions. Voici une fonction qui calcule l'incidence annuelle en appliquant ces conventions.
#+BEGIN_SRC R :results silent #+BEGIN_SRC R :results silent :exports both
pic_annuel = function(annee) { pic_annuel = function(annee) {
debut = paste0(annee-1,"-08-01") debut = paste0(annee-1,"-08-01")
fin = paste0(annee,"-08-01") fin = paste0(annee,"-08-01")
...@@ -144,12 +144,12 @@ pic_annuel = function(annee) { ...@@ -144,12 +144,12 @@ pic_annuel = function(annee) {
#+END_SRC #+END_SRC
Nous devons aussi faire attention aux premières et dernières années de notre jeux de données. Les données commencent en octobre 1984, ce qui ne permet pas de quantifier complètement le pic attribué à l'année 1985. Nous le supprimons donc de notre analyse. Par contre, pour une exécution en octobre 2018, les données se terminent après le 1er août 2018, ce qui nous permet d'inclure cette année. Nous devons aussi faire attention aux premières et dernières années de notre jeux de données. Les données commencent en octobre 1984, ce qui ne permet pas de quantifier complètement le pic attribué à l'année 1985. Nous le supprimons donc de notre analyse. Par contre, pour une exécution en octobre 2018, les données se terminent après le 1er août 2018, ce qui nous permet d'inclure cette année.
#+BEGIN_SRC R :results silent #+BEGIN_SRC R :results silent :exports both
annees <- 1986:2018 annees <- 1986:2018
#+END_SRC #+END_SRC
Nous créons un nouveau jeu de données pour l'incidence annuelle, en applicant la fonction `pic_annuel` à chaque année: Nous créons un nouveau jeu de données pour l'incidence annuelle, en applicant la fonction `pic_annuel` à chaque année:
#+BEGIN_SRC R :results value #+BEGIN_SRC R :results value :exports both
inc_annuelle = data.frame(annee = annees, inc_annuelle = data.frame(annee = annees,
incidence = sapply(annees, pic_annuel)) incidence = sapply(annees, pic_annuel))
head(inc_annuelle) head(inc_annuelle)
...@@ -157,17 +157,17 @@ head(inc_annuelle) ...@@ -157,17 +157,17 @@ head(inc_annuelle)
** Inspection ** Inspection
Voici les incidences annuelles en graphique. Voici les incidences annuelles en graphique.
#+BEGIN_SRC R :results output graphics :file annual-inc-plot.png #+BEGIN_SRC R :results output graphics :file annual-inc-plot.png :exports both
plot(inc_annuelle, type="p", xlab="Année", ylab="Incidence annuelle") plot(inc_annuelle, type="p", xlab="Année", ylab="Incidence annuelle")
#+END_SRC #+END_SRC
** Identification des épidémies les plus fortes ** Identification des épidémies les plus fortes
Une liste triée par ordre décroissant d'incidence annuelle permet de plus facilement repérer les valeurs les plus élevées: Une liste triée par ordre décroissant d'incidence annuelle permet de plus facilement repérer les valeurs les plus élevées:
#+BEGIN_SRC R :results output #+BEGIN_SRC R :results output :exports both
head(inc_annuelle[order(-inc_annuelle$incidence),]) head(inc_annuelle[order(-inc_annuelle$incidence),])
#+END_SRC #+END_SRC
Enfin, un histogramme montre bien que les épidémies fortes, qui touchent environ 10% de la population française, sont assez rares: il y en eu trois au cours des 35 dernières années. Enfin, un histogramme montre bien que les épidémies fortes, qui touchent environ 10% de la population française, sont assez rares: il y en eu trois au cours des 35 dernières années.
#+BEGIN_SRC R :results output graphics :file annual-inc-hist.png #+BEGIN_SRC R :results output graphics :file annual-inc-hist.png :exports both
hist(inc_annuelle$incidence, breaks=10, xlab="Incidence annuelle", ylab="Nb d'observations", main="") hist(inc_annuelle$incidence, breaks=10, xlab="Incidence annuelle", ylab="Nb d'observations", main="")
#+END_SRC #+END_SRC
...@@ -10,7 +10,7 @@ ...@@ -10,7 +10,7 @@
#+HTML_HEAD: <script type="text/javascript" src="http://www.pirilampo.org/styles/lib/js/jquery.stickytableheaders.js"></script> #+HTML_HEAD: <script type="text/javascript" src="http://www.pirilampo.org/styles/lib/js/jquery.stickytableheaders.js"></script>
#+HTML_HEAD: <script type="text/javascript" src="http://www.pirilampo.org/styles/readtheorg/js/readtheorg.js"></script> #+HTML_HEAD: <script type="text/javascript" src="http://www.pirilampo.org/styles/readtheorg/js/readtheorg.js"></script>
#+PROPERTY: header-args :session :exports both #+PROPERTY: header-args :session
* Foreword * Foreword
...@@ -20,13 +20,13 @@ For running this analysis, you need the following software: ...@@ -20,13 +20,13 @@ For running this analysis, you need the following software:
Older versions may suffice. For Emacs versions older than 26, org-mode must be updated to version 9.x. Older versions may suffice. For Emacs versions older than 26, org-mode must be updated to version 9.x.
** Python 3.6 or higher ** Python 3.6 or higher
We use the ISO 8601 date format, which has been added to Python's standard library with version 3.6. We use the ISO 8601 date format, which has been added to Python's standard library with version 3.6.
#+BEGIN_SRC python :results output #+BEGIN_SRC python :results output :exports both
import sys import sys
if sys.version_info.major < 3 or sys.version_info.minor < 6: if sys.version_info.major < 3 or sys.version_info.minor < 6:
print("Please use Python 3.6 (or higher)!") print("Please use Python 3.6 (or higher)!")
#+END_SRC #+END_SRC
#+BEGIN_SRC emacs-lisp :results output #+BEGIN_SRC emacs-lisp :results output :exports both
(unless (featurep 'ob-python) (unless (featurep 'ob-python)
(print "Please activate python in org-babel (org-babel-do-languages)!")) (print "Please activate python in org-babel (org-babel-do-languages)!"))
#+END_SRC #+END_SRC
...@@ -34,7 +34,7 @@ if sys.version_info.major < 3 or sys.version_info.minor < 6: ...@@ -34,7 +34,7 @@ if sys.version_info.major < 3 or sys.version_info.minor < 6:
** R 3.4 ** R 3.4
We use only basic R functionality, so a earlier version might be OK, but we did not test this. We use only basic R functionality, so a earlier version might be OK, but we did not test this.
#+BEGIN_SRC emacs-lisp :results output #+BEGIN_SRC emacs-lisp :results output :exports both
(unless (featurep 'ob-R) (unless (featurep 'ob-R)
(print "Please activate R in org-babel (org-babel-do-languages)!")) (print "Please activate R in org-babel (org-babel-do-languages)!"))
#+END_SRC #+END_SRC
...@@ -66,7 +66,7 @@ To avoid downloading the data multiple times, we keep it in a buffer, which is a ...@@ -66,7 +66,7 @@ To avoid downloading the data multiple times, we keep it in a buffer, which is a
#+NAME: data-buffer-name #+NAME: data-buffer-name
*influenza-like-illness data* *influenza-like-illness data*
#+BEGIN_SRC emacs-lisp :results silent :var url=data-url :var name=data-buffer-name #+BEGIN_SRC emacs-lisp :results silent :var url=data-url :var name=data-buffer-name :exports both
(require 'url) (require 'url)
(with-current-buffer (get-buffer-create name) (with-current-buffer (get-buffer-create name)
(unless buffer-read-only (unless buffer-read-only
...@@ -76,7 +76,7 @@ To avoid downloading the data multiple times, we keep it in a buffer, which is a ...@@ -76,7 +76,7 @@ To avoid downloading the data multiple times, we keep it in a buffer, which is a
#+END_SRC #+END_SRC
The next step is the extraction of the data we need. We first split the file into lines, of which we discard the first one that contains a comment. We then split the remaining lines into columns. To identify missing data values, we check if a line contains at least one empty field. At the end of this code block two variables contain the missing data and the valid data, respectively. The next step is the extraction of the data we need. We first split the file into lines, of which we discard the first one that contains a comment. We then split the remaining lines into columns. To identify missing data values, we check if a line contains at least one empty field. At the end of this code block two variables contain the missing data and the valid data, respectively.
#+BEGIN_SRC emacs-lisp :results silent :var name=data-buffer-name #+BEGIN_SRC emacs-lisp :results silent :var name=data-buffer-name :exports both
(require 'cl) (require 'cl)
(require 'dash) (require 'dash)
(defun missing-data? (row) (defun missing-data? (row)
...@@ -91,19 +91,19 @@ The next step is the extraction of the data we need. We first split the file int ...@@ -91,19 +91,19 @@ The next step is the extraction of the data we need. We first split the file int
#+END_SRC #+END_SRC
Let's look at the missing data... Let's look at the missing data...
#+BEGIN_SRC emacs-lisp #+BEGIN_SRC emacs-lisp :exports both
missing-data-lines missing-data-lines
#+END_SRC #+END_SRC
There are only two columns that we will need for our analysis: the first (~"week"~) and the third (~"inc"~). We insert ~hline~ as the second element of the table to tell org-mode to separate the header from the data. There are only two columns that we will need for our analysis: the first (~"week"~) and the third (~"inc"~). We insert ~hline~ as the second element of the table to tell org-mode to separate the header from the data.
#+NAME: data #+NAME: data
#+BEGIN_SRC emacs-lisp :results silent #+BEGIN_SRC emacs-lisp :results silent :exports both
(-insert-at 1 'hline (-insert-at 1 'hline
(-select-columns '(0 2) valid-data-lines)) (-select-columns '(0 2) valid-data-lines))
#+END_SRC #+END_SRC
Let's look at the first and last lines: Let's look at the first and last lines:
#+BEGIN_SRC emacs-lisp :results value :var data=data :colnames yes #+BEGIN_SRC emacs-lisp :results value :var data=data :colnames yes :exports both
(-concat (-take 5 data) (-concat (-take 5 data)
'(hline) '(hline)
(-take-last 5 data)) (-take-last 5 data))
...@@ -112,7 +112,7 @@ Let's look at the first and last lines: ...@@ -112,7 +112,7 @@ Let's look at the first and last lines:
** Verification ** Verification
It is always prudent to verify if the data looks credible. A simple fact we can check for is that weeks are given as six-digit integers (four for the year, two for the week), that the first two are "19" or "20", and that the incidence values are positive integers. It is always prudent to verify if the data looks credible. A simple fact we can check for is that weeks are given as six-digit integers (four for the year, two for the week), that the first two are "19" or "20", and that the incidence values are positive integers.
#+BEGIN_SRC emacs-lisp :results output :var data=data :colnames yes #+BEGIN_SRC emacs-lisp :results output :var data=data :colnames yes :exports both
(defun check-week (week) (defun check-week (week)
(unless (string-match-p (rx (or "19" "20") (repeat 4 digit)) week) (unless (string-match-p (rx (or "19" "20") (repeat 4 digit)) week)
(princ (format "Invalid week value: %s\n" week)))) (princ (format "Invalid week value: %s\n" week))))
...@@ -132,7 +132,7 @@ Everything seems to be OK ! ...@@ -132,7 +132,7 @@ Everything seems to be OK !
** Conversions ** Conversions
In order to facilitate the subsequent treatment, we replace the ISO week numbers by the dates of each week's Monday. This is also a good occasion to sort the lines by increasing data, and to convert the incidences from strings to integers. We use the Python 3 language which is one of the few to have ISO week handling in its standard library. In order to facilitate the subsequent treatment, we replace the ISO week numbers by the dates of each week's Monday. This is also a good occasion to sort the lines by increasing data, and to convert the incidences from strings to integers. We use the Python 3 language which is one of the few to have ISO week handling in its standard library.
#+BEGIN_SRC python :results silent :var input_data=data #+BEGIN_SRC python :results silent :var input_data=data :exports both
import datetime import datetime
data = [(datetime.datetime.strptime(year_and_week + ":1" , '%G%V:%u').date(), data = [(datetime.datetime.strptime(year_and_week + ":1" , '%G%V:%u').date(),
int(inc)) int(inc))
...@@ -141,14 +141,14 @@ data.sort(key = lambda record: record[0]) ...@@ -141,14 +141,14 @@ data.sort(key = lambda record: record[0])
#+END_SRC #+END_SRC
We'll look again at the first and last lines: We'll look again at the first and last lines:
#+BEGIN_SRC python :results value #+BEGIN_SRC python :results value :exports both
str_data = [(str(date), str(inc)) for date, inc in data] str_data = [(str(date), str(inc)) for date, inc in data]
[('date', 'inc'), None] + str_data[:5] + [None] + str_data[-5:] [('date', 'inc'), None] + str_data[:5] + [None] + str_data[-5:]
#+END_SRC #+END_SRC
** Date verification ** Date verification
We do one more verification: our dates must be separated by exactly one week, except around the missing data point. We do one more verification: our dates must be separated by exactly one week, except around the missing data point.
#+BEGIN_SRC python :results output #+BEGIN_SRC python :results output :exports both
dates = [date for date, _ in data] dates = [date for date, _ in data]
for date1, date2 in zip(dates[:-1], dates[1:]): for date1, date2 in zip(dates[:-1], dates[1:]):
if date2-date1 != datetime.timedelta(weeks=1): if date2-date1 != datetime.timedelta(weeks=1):
...@@ -160,24 +160,24 @@ We switch to R for data inspection and analysis, because the code is more concis ...@@ -160,24 +160,24 @@ We switch to R for data inspection and analysis, because the code is more concis
Org-mode's data exchange mechanism requires some Python code for transforming the data to the right format. Org-mode's data exchange mechanism requires some Python code for transforming the data to the right format.
#+NAME: data-for-R #+NAME: data-for-R
#+BEGIN_SRC python :results silent #+BEGIN_SRC python :results silent :exports both
[('date', 'inc'), None] + [(str(date), inc) for date, inc in converted_data] [('date', 'inc'), None] + [(str(date), inc) for date, inc in converted_data]
#+END_SRC #+END_SRC
In R, the dataset arrives as a data frame, which is fine. But the dates arrive as strings and must be converted. In R, the dataset arrives as a data frame, which is fine. But the dates arrive as strings and must be converted.
#+BEGIN_SRC R :results output :var data=data-for-R #+BEGIN_SRC R :results output :var data=data-for-R :exports both
data$date <- as.Date(data$date) data$date <- as.Date(data$date)
summary(data) summary(data)
#+END_SRC #+END_SRC
** Inspection ** Inspection
Finally we can look at a plot of our data! Finally we can look at a plot of our data!
#+BEGIN_SRC R :results output graphics :file inc-plot.png #+BEGIN_SRC R :results output graphics :file inc-plot.png :exports both
plot(data, type="l", xlab="Date", ylab="Weekly incidence") plot(data, type="l", xlab="Date", ylab="Weekly incidence")
#+END_SRC #+END_SRC
A zoom on the last few years makes the peaks in winter stand out more clearly. A zoom on the last few years makes the peaks in winter stand out more clearly.
#+BEGIN_SRC R :results output graphics :file inc-plot-zoom.png #+BEGIN_SRC R :results output graphics :file inc-plot-zoom.png :exports both
plot(tail(data, 200), type="l", xlab="Date", ylab="Weekly incidence") plot(tail(data, 200), type="l", xlab="Date", ylab="Weekly incidence")
#+END_SRC #+END_SRC
...@@ -187,7 +187,7 @@ plot(tail(data, 200), type="l", xlab="Date", ylab="Weekly incidence") ...@@ -187,7 +187,7 @@ plot(tail(data, 200), type="l", xlab="Date", ylab="Weekly incidence")
Since the peaks of the epidemic happen in winter, near the transition between calendar years, we define the reference period for the annual incidence from August 1st of year /N/ to August 1st of year /N+1/. We label this period as year /N+1/ because the peak is always located in year /N+1/. The very low incidence in summer ensures that the arbitrariness of the choice of reference period has no impact on our conclusions. Since the peaks of the epidemic happen in winter, near the transition between calendar years, we define the reference period for the annual incidence from August 1st of year /N/ to August 1st of year /N+1/. We label this period as year /N+1/ because the peak is always located in year /N+1/. The very low incidence in summer ensures that the arbitrariness of the choice of reference period has no impact on our conclusions.
This R function computes the annual incidence as defined above: This R function computes the annual incidence as defined above:
#+BEGIN_SRC R :results silent #+BEGIN_SRC R :results silent :exports both
yearly_peak = function(year) { yearly_peak = function(year) {
debut = paste0(year-1,"-08-01") debut = paste0(year-1,"-08-01")
fin = paste0(year,"-08-01") fin = paste0(year,"-08-01")
...@@ -197,12 +197,12 @@ yearly_peak = function(year) { ...@@ -197,12 +197,12 @@ yearly_peak = function(year) {
#+END_SRC #+END_SRC
We must also be careful with the first and last years of the dataset. The data start in October 1984, meaning that we don't have all the data for the peak attributed to the year 1985. We therefore exclude it from the analysis. For the same reason, we define 2018 as the final year. We can increase this value to 2019 only when all data up to July 2019 is available. We must also be careful with the first and last years of the dataset. The data start in October 1984, meaning that we don't have all the data for the peak attributed to the year 1985. We therefore exclude it from the analysis. For the same reason, we define 2018 as the final year. We can increase this value to 2019 only when all data up to July 2019 is available.
#+BEGIN_SRC R :results silent #+BEGIN_SRC R :results silent :exports both
years <- 1986:2018 years <- 1986:2018
#+END_SRC #+END_SRC
We make a new data frame for the annual incidence, applying the function ~yearly_peak~ to each year: We make a new data frame for the annual incidence, applying the function ~yearly_peak~ to each year:
#+BEGIN_SRC R :results value #+BEGIN_SRC R :results value :exports both
annnual_inc = data.frame(year = years, annnual_inc = data.frame(year = years,
incidence = sapply(years, yearly_peak)) incidence = sapply(years, yearly_peak))
head(annnual_inc) head(annnual_inc)
...@@ -210,17 +210,17 @@ head(annnual_inc) ...@@ -210,17 +210,17 @@ head(annnual_inc)
** Inspection ** Inspection
A plot of the annual incidence: A plot of the annual incidence:
#+BEGIN_SRC R :results output graphics :file annual-inc-plot.png #+BEGIN_SRC R :results output graphics :file annual-inc-plot.png :exports both
plot(annnual_inc, type="p", xlab="Année", ylab="Annual incidence") plot(annnual_inc, type="p", xlab="Année", ylab="Annual incidence")
#+END_SRC #+END_SRC
** Identification of the strongest epidemics ** Identification of the strongest epidemics
A list sorted by decreasing annual incidence makes it easy to find the most important ones: A list sorted by decreasing annual incidence makes it easy to find the most important ones:
#+BEGIN_SRC R :results output #+BEGIN_SRC R :results output :exports both
head(annnual_inc[order(-annnual_inc$incidence),]) head(annnual_inc[order(-annnual_inc$incidence),])
#+END_SRC #+END_SRC
Finally, a histogram clearly shows the few very strong epidemics, which affect about 10% of the French population, but are rare: there were three of them in the course of 35 years. The typical epidemic affects only half as many people. Finally, a histogram clearly shows the few very strong epidemics, which affect about 10% of the French population, but are rare: there were three of them in the course of 35 years. The typical epidemic affects only half as many people.
#+BEGIN_SRC R :results output graphics :file annual-inc-hist.png #+BEGIN_SRC R :results output graphics :file annual-inc-hist.png :exports both
hist(annnual_inc$incidence, breaks=10, xlab="Annual incidence", ylab="Number of observations", main="") hist(annnual_inc$incidence, breaks=10, xlab="Annual incidence", ylab="Number of observations", main="")
#+END_SRC #+END_SRC
...@@ -10,7 +10,7 @@ ...@@ -10,7 +10,7 @@
#+HTML_HEAD: <script type="text/javascript" src="http://www.pirilampo.org/styles/lib/js/jquery.stickytableheaders.js"></script> #+HTML_HEAD: <script type="text/javascript" src="http://www.pirilampo.org/styles/lib/js/jquery.stickytableheaders.js"></script>
#+HTML_HEAD: <script type="text/javascript" src="http://www.pirilampo.org/styles/readtheorg/js/readtheorg.js"></script> #+HTML_HEAD: <script type="text/javascript" src="http://www.pirilampo.org/styles/readtheorg/js/readtheorg.js"></script>
#+PROPERTY: header-args :session :exports both #+PROPERTY: header-args :session
* Foreword * Foreword
...@@ -21,13 +21,13 @@ Older versions may suffice. For Emacs versions older than 26, org-mode must be u ...@@ -21,13 +21,13 @@ Older versions may suffice. For Emacs versions older than 26, org-mode must be u
** R 3.4 ** R 3.4
We use only basic R functionality, so a earlier version might be OK, but we did not test this. We use only basic R functionality, so a earlier version might be OK, but we did not test this.
#+BEGIN_SRC emacs-lisp :results output #+BEGIN_SRC emacs-lisp :results output :exports both
(unless (featurep 'ob-R) (unless (featurep 'ob-R)
(print "Please activate R in org-babel (org-babel-do-languages)!")) (print "Please activate R in org-babel (org-babel-do-languages)!"))
#+END_SRC #+END_SRC
For handling dates in ISO-8601 format, we need the library ~parsedate~. For handling dates in ISO-8601 format, we need the library ~parsedate~.
#+BEGIN_SRC R :results output #+BEGIN_SRC R :results output :exports both
library(parsedate) library(parsedate)
#+END_SRC #+END_SRC
...@@ -55,25 +55,25 @@ This is the documentation of the data from [[https://ns.sentiweb.fr/incidence/cs ...@@ -55,25 +55,25 @@ This is the documentation of the data from [[https://ns.sentiweb.fr/incidence/cs
** Download ** Download
The first line of the CSV file is a comment, which we ignore with ~skip=1~. The first line of the CSV file is a comment, which we ignore with ~skip=1~.
#+BEGIN_SRC R :session :results silent :var url=data-url #+BEGIN_SRC R :session :results silent :var url=data-url :exports both
data = read.csv(trimws(url), skip=1) data = read.csv(trimws(url), skip=1)
#+END_SRC #+END_SRC
Let's see what we got! Let's see what we got!
#+BEGIN_SRC R :results output #+BEGIN_SRC R :results output :exports both
head(data) head(data)
tail(data) tail(data)
#+END_SRC #+END_SRC
** Verification ** Verification
Are there missing data points? Are there missing data points?
#+BEGIN_SRC R :results value #+BEGIN_SRC R :results value :exports both
na_records = apply(data, 1, function(x) any(is.na(x))) na_records = apply(data, 1, function(x) any(is.na(x)))
data[na_records,] data[na_records,]
#+END_SRC #+END_SRC
The two relevant columns for us are `week` and `inc`. Let's verify their classes: The two relevant columns for us are `week` and `inc`. Let's verify their classes:
#+BEGIN_SRC R :results output #+BEGIN_SRC R :results output :exports both
class(data$week) class(data$week)
class(data$inc) class(data$inc)
#+END_SRC #+END_SRC
...@@ -82,7 +82,7 @@ Integers, fine! ...@@ -82,7 +82,7 @@ Integers, fine!
** Conversions ** Conversions
In order to facilitate the subsequent treatment, we replace the ISO week numbers by the dates of each week's Monday. This function does it for one value: In order to facilitate the subsequent treatment, we replace the ISO week numbers by the dates of each week's Monday. This function does it for one value:
#+BEGIN_SRC R :results silent #+BEGIN_SRC R :results silent :exports both
convert_week = function(w) { convert_week = function(w) {
ws = paste(w) ws = paste(w)
iso = paste0(substring(ws,1,4), "-W", substring(ws,5,6)) iso = paste0(substring(ws,1,4), "-W", substring(ws,5,6))
...@@ -91,34 +91,34 @@ convert_week = function(w) { ...@@ -91,34 +91,34 @@ convert_week = function(w) {
#+END_SRC #+END_SRC
We apply it to all points, creating a new column ~date~ in our data frame: We apply it to all points, creating a new column ~date~ in our data frame:
#+BEGIN_SRC R :results output #+BEGIN_SRC R :results output :exports both
data$date = as.Date(convert_week(data$week)) data$date = as.Date(convert_week(data$week))
#+END_SRC #+END_SRC
Let's check that is has class ~Date~: Let's check that is has class ~Date~:
#+BEGIN_SRC R :results output #+BEGIN_SRC R :results output :exports both
class(data$date) class(data$date)
#+END_SRC #+END_SRC
The points are in inverse chronological order, so it's preferable to sort them: The points are in inverse chronological order, so it's preferable to sort them:
#+BEGIN_SRC R :results output #+BEGIN_SRC R :results output :exports both
data = data[order(data$date),] data = data[order(data$date),]
#+END_SRC #+END_SRC
** Verification of the dates ** Verification of the dates
Another check: our dates should be separated by exactly seven days: Another check: our dates should be separated by exactly seven days:
#+BEGIN_SRC R :results value #+BEGIN_SRC R :results value :exports both
all(diff(data$date) == 7) all(diff(data$date) == 7)
#+END_SRC #+END_SRC
** Inspection ** Inspection
Finally we can look at a plot of our data! Finally we can look at a plot of our data!
#+BEGIN_SRC R :results output graphics :file inc-plot.png #+BEGIN_SRC R :results output graphics :file inc-plot.png :exports both
plot(data$date, data$inc, type="l", xlab="Date", ylab="Incidence hebdomadaire") plot(data$date, data$inc, type="l", xlab="Date", ylab="Incidence hebdomadaire")
#+END_SRC #+END_SRC
A zoom on the last few years makes the peaks in winter stand out more clearly. A zoom on the last few years makes the peaks in winter stand out more clearly.
#+BEGIN_SRC R :results output graphics :file inc-plot-zoom.png #+BEGIN_SRC R :results output graphics :file inc-plot-zoom.png :exports both
with(tail(data, 200), plot(date, inc, type="l", xlab="Date", ylab="Incidence hebdomadaire")) with(tail(data, 200), plot(date, inc, type="l", xlab="Date", ylab="Incidence hebdomadaire"))
#+END_SRC #+END_SRC
...@@ -128,7 +128,7 @@ with(tail(data, 200), plot(date, inc, type="l", xlab="Date", ylab="Incidence heb ...@@ -128,7 +128,7 @@ with(tail(data, 200), plot(date, inc, type="l", xlab="Date", ylab="Incidence heb
Since the peaks of the epidemic happen in winter, near the transition between calendar years, we define the reference period for the annual incidence from August 1st of year /N/ to August 1st of year /N+1/. We label this period as year /N+1/ because the peak is always located in year /N+1/. The very low incidence in summer ensures that the arbitrariness of the choice of reference period has no impact on our conclusions. Since the peaks of the epidemic happen in winter, near the transition between calendar years, we define the reference period for the annual incidence from August 1st of year /N/ to August 1st of year /N+1/. We label this period as year /N+1/ because the peak is always located in year /N+1/. The very low incidence in summer ensures that the arbitrariness of the choice of reference period has no impact on our conclusions.
This R function computes the annual incidence as defined above: This R function computes the annual incidence as defined above:
#+BEGIN_SRC R :results silent #+BEGIN_SRC R :results silent :exports both
yearly_peak = function(year) { yearly_peak = function(year) {
debut = paste0(year-1,"-08-01") debut = paste0(year-1,"-08-01")
fin = paste0(year,"-08-01") fin = paste0(year,"-08-01")
...@@ -138,12 +138,12 @@ yearly_peak = function(year) { ...@@ -138,12 +138,12 @@ yearly_peak = function(year) {
#+END_SRC #+END_SRC
We must also be careful with the first and last years of the dataset. The data start in October 1984, meaning that we don't have all the data for the peak attributed to the year 1985. We therefore exclude it from the analysis. For the same reason, we define 2018 as the final year. We can increase this value to 2019 only when all data up to July 2019 is available. We must also be careful with the first and last years of the dataset. The data start in October 1984, meaning that we don't have all the data for the peak attributed to the year 1985. We therefore exclude it from the analysis. For the same reason, we define 2018 as the final year. We can increase this value to 2019 only when all data up to July 2019 is available.
#+BEGIN_SRC R :results silent #+BEGIN_SRC R :results silent :exports both
years <- 1986:2018 years <- 1986:2018
#+END_SRC #+END_SRC
We make a new data frame for the annual incidence, applying the function ~yearly_peak~ to each year: We make a new data frame for the annual incidence, applying the function ~yearly_peak~ to each year:
#+BEGIN_SRC R :results value #+BEGIN_SRC R :results value :exports both
annnual_inc = data.frame(year = years, annnual_inc = data.frame(year = years,
incidence = sapply(years, yearly_peak)) incidence = sapply(years, yearly_peak))
head(annnual_inc) head(annnual_inc)
...@@ -151,17 +151,17 @@ head(annnual_inc) ...@@ -151,17 +151,17 @@ head(annnual_inc)
** Inspection ** Inspection
A plot of the annual incidence: A plot of the annual incidence:
#+BEGIN_SRC R :results output graphics :file annual-inc-plot.png #+BEGIN_SRC R :results output graphics :file annual-inc-plot.png :exports both
plot(annnual_inc, type="p", xlab="Année", ylab="Annual incidence") plot(annnual_inc, type="p", xlab="Année", ylab="Annual incidence")
#+END_SRC #+END_SRC
** Identification of the strongest epidemics ** Identification of the strongest epidemics
A list sorted by decreasing annual incidence makes it easy to find the most important ones: A list sorted by decreasing annual incidence makes it easy to find the most important ones:
#+BEGIN_SRC R :results output #+BEGIN_SRC R :results output :exports both
head(annnual_inc[order(-annnual_inc$incidence),]) head(annnual_inc[order(-annnual_inc$incidence),])
#+END_SRC #+END_SRC
Finally, a histogram clearly shows the few very strong epidemics, which affect about 10% of the French population, but are rare: there were three of them in the course of 35 years. The typical epidemic affects only half as many people. Finally, a histogram clearly shows the few very strong epidemics, which affect about 10% of the French population, but are rare: there were three of them in the course of 35 years. The typical epidemic affects only half as many people.
#+BEGIN_SRC R :results output graphics :file annual-inc-hist.png #+BEGIN_SRC R :results output graphics :file annual-inc-hist.png :exports both
hist(annnual_inc$incidence, breaks=10, xlab="Annual incidence", ylab="Number of observations", main="") hist(annnual_inc$incidence, breaks=10, xlab="Annual incidence", ylab="Number of observations", main="")
#+END_SRC #+END_SRC
...@@ -10,7 +10,7 @@ ...@@ -10,7 +10,7 @@
#+HTML_HEAD: <script type="text/javascript" src="http://www.pirilampo.org/styles/lib/js/jquery.stickytableheaders.js"></script> #+HTML_HEAD: <script type="text/javascript" src="http://www.pirilampo.org/styles/lib/js/jquery.stickytableheaders.js"></script>
#+HTML_HEAD: <script type="text/javascript" src="http://www.pirilampo.org/styles/readtheorg/js/readtheorg.js"></script> #+HTML_HEAD: <script type="text/javascript" src="http://www.pirilampo.org/styles/readtheorg/js/readtheorg.js"></script>
#+PROPERTY: header-args :session :exports both #+PROPERTY: header-args :session
* Foreword * Foreword
...@@ -20,7 +20,7 @@ For running this analysis, you need the following software: ...@@ -20,7 +20,7 @@ For running this analysis, you need the following software:
Older versions may suffice. For Emacs versions older than 26, org-mode must be updated to version 9.x. Older versions may suffice. For Emacs versions older than 26, org-mode must be updated to version 9.x.
** Python 3.6 or higher ** Python 3.6 or higher
We use the ISO 8601 date format, which has been added to Python's standard library with version 3.6. We use the ISO 8601 date format, which has been added to Python's standard library with version 3.6.
#+BEGIN_SRC python :results output #+BEGIN_SRC python :results output :exports both
import sys import sys
if sys.version_info.major < 3 or sys.version_info.minor < 6: if sys.version_info.major < 3 or sys.version_info.minor < 6:
print("Please use Python 3.6 (or higher)!") print("Please use Python 3.6 (or higher)!")
...@@ -28,7 +28,7 @@ if sys.version_info.major < 3 or sys.version_info.minor < 6: ...@@ -28,7 +28,7 @@ if sys.version_info.major < 3 or sys.version_info.minor < 6:
#+RESULTS: #+RESULTS:
#+BEGIN_SRC emacs-lisp :results output #+BEGIN_SRC emacs-lisp :results output :exports both
(unless (featurep 'ob-python) (unless (featurep 'ob-python)
(print "Please activate python in org-babel (org-babel-do-languages)!")) (print "Please activate python in org-babel (org-babel-do-languages)!"))
#+END_SRC #+END_SRC
...@@ -38,7 +38,7 @@ if sys.version_info.major < 3 or sys.version_info.minor < 6: ...@@ -38,7 +38,7 @@ if sys.version_info.major < 3 or sys.version_info.minor < 6:
** R 3.4 ** R 3.4
We use only basic R functionality, so a earlier version might be OK, but we did not test this. We use only basic R functionality, so a earlier version might be OK, but we did not test this.
#+BEGIN_SRC emacs-lisp :results output #+BEGIN_SRC emacs-lisp :results output :exports both
(unless (featurep 'ob-R) (unless (featurep 'ob-R)
(print "Please activate R in org-babel (org-babel-do-languages)!")) (print "Please activate R in org-babel (org-babel-do-languages)!"))
#+END_SRC #+END_SRC
...@@ -71,7 +71,7 @@ The [[https://en.wikipedia.org/wiki/ISO_8601][ISO-8601]] format is popular in Eu ...@@ -71,7 +71,7 @@ The [[https://en.wikipedia.org/wiki/ISO_8601][ISO-8601]] format is popular in Eu
** Download ** Download
After downloading the raw data, we extract the part we are interested in. We first split the file into lines, of which we discard the first one that contains a comment. We then split the remaining lines into columns. After downloading the raw data, we extract the part we are interested in. We first split the file into lines, of which we discard the first one that contains a comment. We then split the remaining lines into columns.
#+BEGIN_SRC python :results silent :var data_url=data-url #+BEGIN_SRC python :results silent :var data_url=data-url :exports both
from urllib.request import urlopen from urllib.request import urlopen
data = urlopen(data_url).read() data = urlopen(data_url).read()
...@@ -81,7 +81,7 @@ table = [line.split(',') for line in data_lines] ...@@ -81,7 +81,7 @@ table = [line.split(',') for line in data_lines]
#+END_SRC #+END_SRC
Let's have a look at what we have so far: Let's have a look at what we have so far:
#+BEGIN_SRC python :results value #+BEGIN_SRC python :results value :exports both
table[:5] table[:5]
#+END_SRC #+END_SRC
...@@ -96,7 +96,7 @@ table[:5] ...@@ -96,7 +96,7 @@ table[:5]
Unfortunately there are many ways to indicate the absence of a data value in a dataset. Here we check for a common one: empty fields. For completeness, we should also look for non-numerical data in numerical columns. We don't do this here, but checks in later processing steps would catch such anomalies. Unfortunately there are many ways to indicate the absence of a data value in a dataset. Here we check for a common one: empty fields. For completeness, we should also look for non-numerical data in numerical columns. We don't do this here, but checks in later processing steps would catch such anomalies.
We make a new dataset without the lines that contain empty fields. We print those lines to preserve a trace of their contents. We make a new dataset without the lines that contain empty fields. We print those lines to preserve a trace of their contents.
#+BEGIN_SRC python :results output #+BEGIN_SRC python :results output :exports both
valid_table = [] valid_table = []
for row in table: for row in table:
missing = any([column == '' for column in row]) missing = any([column == '' for column in row])
...@@ -111,7 +111,7 @@ for row in table: ...@@ -111,7 +111,7 @@ for row in table:
** Extraction of the required columns ** Extraction of the required columns
There are only two columns that we will need for our analysis: the first (~"week"~) and the third (~"inc"~). We check the names in the header to be sure we pick the right data. We make a new table containing just the two columns required, without the header. There are only two columns that we will need for our analysis: the first (~"week"~) and the third (~"inc"~). We check the names in the header to be sure we pick the right data. We make a new table containing just the two columns required, without the header.
#+BEGIN_SRC python :results silent #+BEGIN_SRC python :results silent :exports both
week = [row[0] for row in valid_table] week = [row[0] for row in valid_table]
assert week[0] == 'week' assert week[0] == 'week'
del week[0] del week[0]
...@@ -122,7 +122,7 @@ data = list(zip(week, inc)) ...@@ -122,7 +122,7 @@ data = list(zip(week, inc))
#+END_SRC #+END_SRC
Let's look at the first and last lines. We insert ~None~ to indicate to org-mode the separation between the three parts of the table: header, first lines, last lines. Let's look at the first and last lines. We insert ~None~ to indicate to org-mode the separation between the three parts of the table: header, first lines, last lines.
#+BEGIN_SRC python :results value #+BEGIN_SRC python :results value :exports both
[('week', 'inc'), None] + data[:5] + [None] + data[-5:] [('week', 'inc'), None] + data[:5] + [None] + data[-5:]
#+END_SRC #+END_SRC
...@@ -143,7 +143,7 @@ Let's look at the first and last lines. We insert ~None~ to indicate to org-mode ...@@ -143,7 +143,7 @@ Let's look at the first and last lines. We insert ~None~ to indicate to org-mode
** Verification ** Verification
It is always prudent to verify if the data looks credible. A simple fact we can check for is that weeks are given as six-digit integers (four for the year, two for the week), and that the incidence values are positive integers. It is always prudent to verify if the data looks credible. A simple fact we can check for is that weeks are given as six-digit integers (four for the year, two for the week), and that the incidence values are positive integers.
#+BEGIN_SRC python :results output #+BEGIN_SRC python :results output :exports both
for week, inc in data: for week, inc in data:
if len(week) != 6 or not week.isdigit(): if len(week) != 6 or not week.isdigit():
print("Suspicious value in column 'week': ", (week, inc)) print("Suspicious value in column 'week': ", (week, inc))
...@@ -158,7 +158,7 @@ No problem - fine! ...@@ -158,7 +158,7 @@ No problem - fine!
** Date conversion ** Date conversion
In order to facilitate the subsequent treatment, we replace the ISO week numbers by the dates of each week's Monday. This is also a good occasion to sort the lines by increasing data, and to convert the incidences from strings to integers. In order to facilitate the subsequent treatment, we replace the ISO week numbers by the dates of each week's Monday. This is also a good occasion to sort the lines by increasing data, and to convert the incidences from strings to integers.
#+BEGIN_SRC python :results silent #+BEGIN_SRC python :results silent :exports both
import datetime import datetime
converted_data = [(datetime.datetime.strptime(year_and_week + ":1" , '%G%V:%u').date(), converted_data = [(datetime.datetime.strptime(year_and_week + ":1" , '%G%V:%u').date(),
int(inc)) int(inc))
...@@ -167,7 +167,7 @@ converted_data.sort(key = lambda record: record[0]) ...@@ -167,7 +167,7 @@ converted_data.sort(key = lambda record: record[0])
#+END_SRC #+END_SRC
We'll look again at the first and last lines: We'll look again at the first and last lines:
#+BEGIN_SRC python :results value #+BEGIN_SRC python :results value :exports both
str_data = [(str(date), str(inc)) for date, inc in converted_data] str_data = [(str(date), str(inc)) for date, inc in converted_data]
[('date', 'inc'), None] + str_data[:5] + [None] + str_data[-5:] [('date', 'inc'), None] + str_data[:5] + [None] + str_data[-5:]
#+END_SRC #+END_SRC
...@@ -189,7 +189,7 @@ str_data = [(str(date), str(inc)) for date, inc in converted_data] ...@@ -189,7 +189,7 @@ str_data = [(str(date), str(inc)) for date, inc in converted_data]
** Date verification ** Date verification
We do one more verification: our dates must be separated by exactly one week, except around the missing data point. We do one more verification: our dates must be separated by exactly one week, except around the missing data point.
#+BEGIN_SRC python :results output #+BEGIN_SRC python :results output :exports both
dates = [date for date, _ in converted_data] dates = [date for date, _ in converted_data]
for date1, date2 in zip(dates[:-1], dates[1:]): for date1, date2 in zip(dates[:-1], dates[1:]):
if date2-date1 != datetime.timedelta(weeks=1): if date2-date1 != datetime.timedelta(weeks=1):
...@@ -204,12 +204,12 @@ We switch to R for data inspection and analysis, because the code is more concis ...@@ -204,12 +204,12 @@ We switch to R for data inspection and analysis, because the code is more concis
Org-mode's data exchange mechanism requires some Python code for transforming the data to the right format. Org-mode's data exchange mechanism requires some Python code for transforming the data to the right format.
#+NAME: data-for-R #+NAME: data-for-R
#+BEGIN_SRC python :results silent #+BEGIN_SRC python :results silent :exports both
[('date', 'inc'), None] + [(str(date), inc) for date, inc in converted_data] [('date', 'inc'), None] + [(str(date), inc) for date, inc in converted_data]
#+END_SRC #+END_SRC
In R, the dataset arrives as a data frame, which is fine. But the dates arrive as strings and must be converted. In R, the dataset arrives as a data frame, which is fine. But the dates arrive as strings and must be converted.
#+BEGIN_SRC R :results output :var data=data-for-R #+BEGIN_SRC R :results output :var data=data-for-R :exports both
data$date <- as.Date(data$date) data$date <- as.Date(data$date)
summary(data) summary(data)
#+END_SRC #+END_SRC
...@@ -225,7 +225,7 @@ summary(data) ...@@ -225,7 +225,7 @@ summary(data)
** Inspection ** Inspection
Finally we can look at a plot of our data! Finally we can look at a plot of our data!
#+BEGIN_SRC R :results output graphics :file inc-plot.png #+BEGIN_SRC R :results output graphics :file inc-plot.png :exports both
plot(data, type="l", xlab="Date", ylab="Weekly incidence") plot(data, type="l", xlab="Date", ylab="Weekly incidence")
#+END_SRC #+END_SRC
...@@ -233,7 +233,7 @@ plot(data, type="l", xlab="Date", ylab="Weekly incidence") ...@@ -233,7 +233,7 @@ plot(data, type="l", xlab="Date", ylab="Weekly incidence")
[[file:inc-plot.png]] [[file:inc-plot.png]]
A zoom on the last few years makes the peaks in winter stand out more clearly. A zoom on the last few years makes the peaks in winter stand out more clearly.
#+BEGIN_SRC R :results output graphics :file inc-plot-zoom.png #+BEGIN_SRC R :results output graphics :file inc-plot-zoom.png :exports both
plot(tail(data, 200), type="l", xlab="Date", ylab="Weekly incidence") plot(tail(data, 200), type="l", xlab="Date", ylab="Weekly incidence")
#+END_SRC #+END_SRC
...@@ -246,7 +246,7 @@ plot(tail(data, 200), type="l", xlab="Date", ylab="Weekly incidence") ...@@ -246,7 +246,7 @@ plot(tail(data, 200), type="l", xlab="Date", ylab="Weekly incidence")
Since the peaks of the epidemic happen in winter, near the transition between calendar years, we define the reference period for the annual incidence from August 1st of year /N/ to August 1st of year /N+1/. We label this period as year /N+1/ because the peak is always located in year /N+1/. The very low incidence in summer ensures that the arbitrariness of the choice of reference period has no impact on our conclusions. Since the peaks of the epidemic happen in winter, near the transition between calendar years, we define the reference period for the annual incidence from August 1st of year /N/ to August 1st of year /N+1/. We label this period as year /N+1/ because the peak is always located in year /N+1/. The very low incidence in summer ensures that the arbitrariness of the choice of reference period has no impact on our conclusions.
This R function computes the annual incidence as defined above: This R function computes the annual incidence as defined above:
#+BEGIN_SRC R :results silent #+BEGIN_SRC R :results silent :exports both
yearly_peak = function(year) { yearly_peak = function(year) {
debut = paste0(year-1,"-08-01") debut = paste0(year-1,"-08-01")
fin = paste0(year,"-08-01") fin = paste0(year,"-08-01")
...@@ -256,12 +256,12 @@ yearly_peak = function(year) { ...@@ -256,12 +256,12 @@ yearly_peak = function(year) {
#+END_SRC #+END_SRC
We must also be careful with the first and last years of the dataset. The data start in October 1984, meaning that we don't have all the data for the peak attributed to the year 1985. We therefore exclude it from the analysis. For the same reason, we define 2018 as the final year. We can increase this value to 2019 only when all data up to July 2019 is available. We must also be careful with the first and last years of the dataset. The data start in October 1984, meaning that we don't have all the data for the peak attributed to the year 1985. We therefore exclude it from the analysis. For the same reason, we define 2018 as the final year. We can increase this value to 2019 only when all data up to July 2019 is available.
#+BEGIN_SRC R :results silent #+BEGIN_SRC R :results silent :exports both
years <- 1986:2018 years <- 1986:2018
#+END_SRC #+END_SRC
We make a new data frame for the annual incidence, applying the function ~yearly_peak~ to each year: We make a new data frame for the annual incidence, applying the function ~yearly_peak~ to each year:
#+BEGIN_SRC R :results value #+BEGIN_SRC R :results value :exports both
annnual_inc = data.frame(year = years, annnual_inc = data.frame(year = years,
incidence = sapply(years, yearly_peak)) incidence = sapply(years, yearly_peak))
head(annnual_inc) head(annnual_inc)
...@@ -277,7 +277,7 @@ head(annnual_inc) ...@@ -277,7 +277,7 @@ head(annnual_inc)
** Inspection ** Inspection
A plot of the annual incidence: A plot of the annual incidence:
#+BEGIN_SRC R :results output graphics :file annual-inc-plot.png #+BEGIN_SRC R :results output graphics :file annual-inc-plot.png :exports both
plot(annnual_inc, type="p", xlab="Année", ylab="Annual incidence") plot(annnual_inc, type="p", xlab="Année", ylab="Annual incidence")
#+END_SRC #+END_SRC
...@@ -286,7 +286,7 @@ plot(annnual_inc, type="p", xlab="Année", ylab="Annual incidence") ...@@ -286,7 +286,7 @@ plot(annnual_inc, type="p", xlab="Année", ylab="Annual incidence")
** Identification of the strongest epidemics ** Identification of the strongest epidemics
A list sorted by decreasing annual incidence makes it easy to find the most important ones: A list sorted by decreasing annual incidence makes it easy to find the most important ones:
#+BEGIN_SRC R :results output #+BEGIN_SRC R :results output :exports both
head(annnual_inc[order(-annnual_inc$incidence),]) head(annnual_inc[order(-annnual_inc$incidence),])
#+END_SRC #+END_SRC
...@@ -300,7 +300,7 @@ head(annnual_inc[order(-annnual_inc$incidence),]) ...@@ -300,7 +300,7 @@ head(annnual_inc[order(-annnual_inc$incidence),])
: 14 1999 3897443 : 14 1999 3897443
Finally, a histogram clearly shows the few very strong epidemics, which affect about 10% of the French population, but are rare: there were three of them in the course of 35 years. The typical epidemic affects only half as many people. Finally, a histogram clearly shows the few very strong epidemics, which affect about 10% of the French population, but are rare: there were three of them in the course of 35 years. The typical epidemic affects only half as many people.
#+BEGIN_SRC R :results output graphics :file annual-inc-hist.png #+BEGIN_SRC R :results output graphics :file annual-inc-hist.png :exports both
hist(annnual_inc$incidence, breaks=10, xlab="Annual incidence", ylab="Number of observations", main="") hist(annnual_inc$incidence, breaks=10, xlab="Annual incidence", ylab="Number of observations", main="")
#+END_SRC #+END_SRC
......
Markdown is supported
0% or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment