#+TITLE: Analysis of the incidence of chickenpox
#+LANGUAGE: en
#+OPTIONS: *:nil num:1 toc:t

#+PROPERTY: header-args  :session  :exports both

* Foreword

For running this analysis, you need the following software:

** Emacs 25 or higher
Older versions may suffice. For Emacs versions older than 26, org-mode must be updated to version 9.x.

** R 4.0
We use only basic R functionality, so a earlier version might be OK, but we did not test this.

#+BEGIN_SRC emacs-lisp :results output
(unless (featurep 'ob-R)
  (print "Please activate R in org-babel (org-babel-do-languages)!"))
#+END_SRC

* Data preprocessing

The data on the incidence of influenza-like illness are available from
the Web site of the [[http://www.sentiweb.fr/][Réseau Sentinelles]]. We download them as a file in
CSV format, in which each line corresponds to a week in the
observation period. Only the complete dataset, starting in 1984 and
ending with a recent week, is available for download. The URL is:
#+NAME: data-url
http://www.sentiweb.fr/datasets/incidence-PAY-7.csv

This is the documentation of the data from [[https://ns.sentiweb.fr/incidence/csv-schema-v1.json][the download site]]:

| Column name  | Description                                                                                                               |
|--------------+---------------------------------------------------------------------------------------------------------------------------|
| ~week~       | ISO8601 Yearweek number as numeric (year*100 + week nubmer)                                                               |
| ~geo_insee~  | Identifier of the geographic area, from INSEE https://www.insee.fr                                                        |
| ~geo_name~   | Geographic label of the area, corresponding to INSEE code. This label is not an id and is only provided for human reading |
| ~indicator~  | Unique identifier of the indicator, see metadata document https://www.sentiweb.fr/meta.json                               |
| ~inc~        | Estimated incidence value for the time step, in the geographic level                                                      |
| ~inc_low~    | Lower bound of the estimated incidence 95% Confidence Interval                                                            |
| ~inc_up~     | Upper bound of the estimated incidence 95% Confidence Interval                                                            |
| ~inc100~     | Estimated rate incidence per 100,000 inhabitants                                                                          |
| ~inc100_low~ | Lower bound of the estimated incidence 95% Confidence Interval                                                            |
| ~inc100_up~  | Upper bound of the estimated rate incidence 95% Confidence Interval                                                       |

The [[https://en.wikipedia.org/wiki/ISO_8601][ISO-8601]] format is popular in Europe, but less so in North
America. This may explain why few software packages handle this
format. The Python language does it since version 3.6. We therefore
use Python for the pre-processing phase, which has the advantage of
not requiring any additional library. (Note: we will explain in module
4 why it is desirable for reproducibility to use as few external
libraries as possible.)

** Download
We first check if the data is available in a data file. Otherwise, we
download it into the datafile
#+NAME: data-file
data.csv

#+BEGIN_SRC R :results output :var data_url=data-url :var data_file=data-file
f <- trimws(data_file)
if(!file.exists(f)) {
  download.file(trimws(data_url), f)
}
#+END_SRC

#+RESULTS:

After that we read the csv. Note that the first line is skipped, since
it contains a comment. And also, we convert empty strings to NA for
easier processing later.

#+BEGIN_SRC R :results output :var data_file=data-file
data <- read.csv(trimws(data_file), skip = 1, na.strings=c("", "NA"))
head(data)
#+END_SRC

#+RESULTS:
#+begin_example

    week indicator  inc inc_low inc_up inc100 inc100_low inc100_up geo_insee
1 202017         7  113       0    346      0          0         0        FR
2 202016         7  770      77   1463      1          0         2        FR
3 202015         7 1918     675   3161      3          1         5        FR
4 202014         7 3879    2227   5531      6          3         9        FR
5 202013         7 7326    5236   9416     11          8        14        FR
6 202012         7 8123    5790  10456     12          8        16        FR
  geo_name
1   France
2   France
3   France
4   France
5   France
6   France
#+end_example

** Checking for missing data
Unfortunately there are many ways to indicate the absence of a data
value in a dataset. In the previous second we converted empty strings
to NA, now we will look for NA values. For completeness, we should
also look for non-numerical data in numerical columns. We don't do
this here, but checks in later processing steps would catch such
anomalies.

#+BEGIN_SRC R :results output
sum(is.na(data))
#+END_SRC

#+RESULTS:
: [1] 0

Luckily there seems to be no missing data.

** Extraction of the required columns
There are only two columns that we will need for our analysis: the
first (~"week"~) and the third (~"inc"~). We check the names in the
header to be sure we pick the right data. We make a new table
containing just the two columns required, without the header.

#+BEGIN_SRC R :results output
library(dplyr)
data <- select(data, week, inc)
head(data)
tail(data)
#+END_SRC

#+RESULTS:
#+begin_example

Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union

    week  inc
1 202017  113
2 202016  770
3 202015 1918
4 202014 3879
5 202013 7326
6 202012 8123

       week   inc
1529 199102 16277
1530 199101 15565
1531 199052 19375
1532 199051 19080
1533 199050 11079
1534 199049  1143
#+end_example

** Verification
It is always prudent to verify if the data looks credible. A simple
fact we can check for is that weeks are given as six-digit integers
(four for the year, two for the week), and that the incidence values
are positive integers.

#+BEGIN_SRC R :results output
is.numeric(data$week)
is.numeric(data$inc)
sum(nchar(data$week) != 6) == 0
#+END_SRC

#+RESULTS:
: [1] TRUE
: 
: [1] TRUE
: 
: [1] TRUE

No problem - fine!

** Date conversion
In order to facilitate the subsequent treatment, we replace the ISO
week numbers by the dates of each week's Monday. This is also a good
occasion to sort the lines by increasing data, and to convert the
incidences from strings to integers.

#+BEGIN_SRC R :results output
library(parsedate)
convert_week = function(w) {
  iso <- paste0(substring(w, 1, 4), "-W", substring(w, 5, 6))
  as.character(parse_iso_8601(iso))
}
data$date <- as.Date(convert_week(data$week))
class(data$date)
#+END_SRC

#+RESULTS:
: 
: [1] "Date"

#+BEGIN_SRC R :results output
data <- select(data, date, inc)
data <- data[order(data$date),]
head(data)
tail(data)
#+END_SRC

#+RESULTS:
#+begin_example

           date   inc
1534 1990-12-03  1143
1533 1990-12-10 11079
1532 1990-12-17 19080
1531 1990-12-24 19375
1530 1990-12-31 15565
1529 1991-01-07 16277

        date  inc
6 2020-03-16 8123
5 2020-03-23 7326
4 2020-03-30 3879
3 2020-04-06 1918
2 2020-04-13  770
1 2020-04-20  113
#+end_example

** Date verification
We do one more verification: our dates must be separated by exactly
one week.

#+BEGIN_SRC R :results output
for(i in 2:nrow(data)) {
  if(difftime(data$date[i], data$date[i-1], units="days") != 7) {
    print(sprintf("The difference between %s and %s is greater than 7", data$date[i], data$date[i-1]));
  }
}
#+END_SRC

#+RESULTS:

All okay!

** Inspection
Finally we can look at a plot of our data!
#+BEGIN_SRC R :results output graphics :file inc-plot.png
plot(data, type="l", xlab="Date", ylab="Weekly incidence")
#+END_SRC

#+RESULTS:
[[file:inc-plot.png]]

A zoom on the last few years makes the peaks in winter stand out more
clearly.
#+BEGIN_SRC R :results output graphics :file inc-plot-zoom.png
plot(tail(data, 200), type="l", xlab="Date", ylab="Weekly incidence")
#+END_SRC

#+RESULTS:
[[file:inc-plot-zoom.png]]

* Study of the annual incidence

** Computation of the annual incidence
Since the peaks of the epidemic happen in winter, slightly after the
transition between calendar years, we define the reference period for
the annual incidence from September 1st of year /N/ to September 1st
of year /N+1/. We label this period as year /N+1/ because the peak is
always located in year /N+1/. The very low incidence in summer ensures
that the arbitrariness of the choice of reference period has no impact
on our conclusions.

This R function computes the annual incidence as defined above:
#+BEGIN_SRC R :results silent
yearly_peak <- function(year) {
  start <- paste0(year-1,"-09-01")
  end <- paste0(year,"-09-01")
  weeks <- data$date > start & data$date <= end
  sum(data$inc[weeks], na.rm=TRUE)
}
#+END_SRC

We must also be careful with the first and last years of the
dataset. The data start in December 1990, meaning that we don't have
all the data for the peak attributed to the year 1990. We therefore
exclude it from the analysis. For the same reason, we define 2019 as
the final year. We can increase this value to 2020 only when all data
up to August 2020 is available.
#+BEGIN_SRC R :results silent
years <- 1991:2019
#+END_SRC

We make a new data frame for the annual incidence, applying the
function ~yearly_peak~ to each year:
#+BEGIN_SRC R :results value
annnual_inc <- data.frame(year = years,
                         incidence = sapply(years, yearly_peak))
head(annnual_inc)
#+END_SRC

#+RESULTS:
| 1991 | 553895 |
| 1992 | 834935 |
| 1993 | 642921 |
| 1994 | 662750 |
| 1995 | 651333 |
| 1996 | 564994 |

** Inspection
A plot of the annual incidence:
#+BEGIN_SRC R :results output graphics :file annual-inc-plot.png
plot(annnual_inc, type="p", xlab="Year", ylab="Annual incidence")
#+END_SRC

#+RESULTS:
[[file:annual-inc-plot.png]]

** Identification of the strongest epidemics
A list sorted by decreasing annual incidence makes it easy to find the
most important ones:
#+BEGIN_SRC R :results value
head(annnual_inc[order(-annnual_inc$incidence),])
#+END_SRC

#+RESULTS:
| 2009 | 841233 |
| 1992 | 834935 |
| 2010 | 834077 |
| 2016 | 779816 |
| 2004 | 778914 |
| 2003 | 760765 |

A list sorted by increasing annual incidence makes it easy to find the
years with the least incidence:
#+BEGIN_SRC R :results value
head(annnual_inc[order(annnual_inc$incidence),])
#+END_SRC

#+RESULTS:
| 2002 | 515343 |
| 2018 | 539765 |
| 2017 | 552906 |
| 1991 | 553895 |
| 1996 | 564994 |
| 2019 | 584116 |

Finally, a histogram clearly shows the few very strong epidemics,
which affect about 10% of the French population, but are rare: there
were three of them in the course of 35 years. The typical epidemic
affects only half as many people.
#+BEGIN_SRC R :results output graphics :file annual-inc-hist.png
hist(annnual_inc$incidence, breaks=10, xlab="Annual incidence", ylab="Number of observations", main="")
#+END_SRC

#+RESULTS:
[[file:annual-inc-hist.png]]