#+TITLE: Analysis of the incidence of chickenpox
#+AUTHOR: Thomas Rushton
#+DATE: 2024-04-24
#+LANGUAGE: en
#+HTML_HEAD:
#+PROPERTY: header-args :session
#+OPTIONS: ^:{}
* Data acquisition
Let's start by getting the data on chickenpox incidence from [[https://www.sentiweb.fr/][Réseau Sentinelles]].
#+NAME: data-url
https://www.sentiweb.fr/datasets/incidence-PAY-7.csv
#+begin_src python :results output :var data_url=data-url
import os
import urllib.request

data_file = "chickenpox.csv"
if not os.path.exists(data_file):
    urllib.request.urlretrieve(data_url, data_file)
#+end_src
#+RESULTS:
** Format the data
Discard the first line (which is a comment) and split the remaining
lines into columns.
#+begin_src python :results silent :exports both
data = open(data_file, 'rb').read()
lines = data.decode('latin-1').strip().split('\n')
data_lines = lines[1:]
table = [line.split(',') for line in data_lines]
# Could use the csv library instead...
#+end_src
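As the comment says, the ~csv~ module would be the more robust choice, since it copes with quoting and escaping that a bare ~split(',')~ does not. A minimal sketch of the same parsing step, shown here on a small hypothetical sample in the same layout as the Sentinelles file:
#+begin_src python :results silent
import csv

# Hypothetical sample mimicking the Sentinelles CSV layout
sample = """# comment line, as in the real file
week,indicator,inc
202416,7,19330
202415,7,24807"""

lines = sample.strip().split('\n')[1:]   # discard the comment line
table = [row for row in csv.reader(lines)]
#+end_src
In the session above, the same effect would come from passing ~lines~ (after dropping the comment line) to ~csv.reader~.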
Let's take a gander at the data:
#+begin_src python :results value :exports both
table[:5]
#+end_src
#+RESULTS:
| week | indicator | inc | inc_low | inc_up | inc100 | inc100_low | inc100_up | geo_insee | geo_name |
| 202416 | 7 | 19330 | 13879 | 24781 | 29 | 21 | 37 | FR | France |
| 202415 | 7 | 24807 | 17183 | 32431 | 37 | 26 | 48 | FR | France |
| 202414 | 7 | 16181 | 12544 | 19818 | 24 | 19 | 29 | FR | France |
| 202413 | 7 | 18322 | 14206 | 22438 | 27 | 21 | 33 | FR | France |
Alright, looks good 👍
** Sanitise the data
But the data may not be without its issues. Let's check for the
obvious case of missing/empty data:
#+begin_src python :results output :exports both
valid_table = []
for row in table:
    missing = any([column == '' for column in row])
    if missing:
        print(row)
    else:
        valid_table.append(row)
#+end_src
#+RESULTS:
This is kind of grim, but it does the job.
** Extract required columns
For a temporal analysis, we just need the =week= and =inc= columns:
#+begin_src python :results silent :exports both
week = [row[0] for row in valid_table]
assert week[0] == 'week'
del week[0]
inc = [row[2] for row in valid_table]
assert inc[0] == 'inc'
del inc[0]
data = list(zip(week, inc))
#+end_src
Let's check what we have, using ~None~ to tell Org where to put
separators in the resulting table (which contains the first five and
last five weeks' data):
#+begin_src python :results value :exports both
[('week', 'incidence'), None] + data[:5] + [None] + data[-5:]
#+end_src
#+RESULTS:
| week | incidence |
|--------+-----------|
| 202416 | 19330 |
| 202415 | 24807 |
| 202414 | 16181 |
| 202413 | 18322 |
| 202412 | 12818 |
|--------+-----------|
| 199101 | 15565 |
| 199052 | 19375 |
| 199051 | 19080 |
| 199050 | 11079 |
| 199049 | 1143 |
** Convert dates
Dates are represented in ISO 8601 week format (YYYYWW), so let's parse
those. Entries are sorted chronologically, but in reverse, so we'll
fix that here too.
#+begin_src python :results silent :exports both
import datetime
converted_data = [(datetime.datetime.strptime(year_and_week + ":1", '%G%V:%u').date(),
                   int(inc))
                  for year_and_week, inc in data]
converted_data.sort(key = lambda record: record[0])
#+end_src
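It's worth noting why the format string uses ~%G~ (ISO week-numbering year) rather than ~%Y~: the two differ at year boundaries. For instance, ISO week 1 of 2020 starts on Monday 30 December 2019:
#+begin_src python :results value :exports both
import datetime
# ISO week 1 of 2020 begins in calendar year 2019
str(datetime.datetime.strptime("202001:1", '%G%V:%u').date())
#+end_src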
Let's check again:
#+begin_src python :results value :exports both
data_as_str = [(str(date), str(inc)) for date, inc in converted_data]
[('date', 'incidence'), None] + data_as_str[:5] + [None] + data_as_str[-5:]
#+end_src
#+RESULTS:
| date | incidence |
|------------+-----------|
| 1990-12-03 | 1143 |
| 1990-12-10 | 11079 |
| 1990-12-17 | 19080 |
| 1990-12-24 | 19375 |
| 1990-12-31 | 15565 |
|------------+-----------|
| 2024-03-18 | 12818 |
| 2024-03-25 | 18322 |
| 2024-04-01 | 16181 |
| 2024-04-08 | 24807 |
| 2024-04-15 | 19330 |
** Visual inspection
So, now we can take a look at incidence over time. (The 'flu notebook
switches to R here, but we're going to stick with Python.)
#+begin_src python :results value file :var filename="./incidence.png" :exports both
import matplotlib.pyplot as plt
plt.clf()
date,incidence = zip(*converted_data)
plt.figure(figsize=(7.5,5))
plt.plot(date,incidence)
# plt.tight_layout()
plt.savefig(filename)
filename
#+end_src
#+RESULTS:
[[file:./incidence.png]]
And we can zoom in on a period of, say, five years:
#+begin_src python :results value file :var filename="./incidence-zoom.png" :exports both
plt.clf()
start = 15
years = 5
date,incidence = zip(*converted_data[52*start:52*(start+years)])
plt.plot(date,incidence)
# plt.tight_layout()
plt.savefig(filename)
filename
#+end_src
#+RESULTS:
[[file:./incidence-zoom.png]]
It looks like incidence peaks in the spring, with lowest numbers
around September.
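That seasonal impression can also be checked numerically by averaging incidence over each ISO week number across all years. A sketch of the grouping step, shown on a tiny hypothetical stand-in for ~converted_data~ (in the session, ~converted_data~ itself would be used):
#+begin_src python :results silent
import datetime
from collections import defaultdict

# Hypothetical stand-in for converted_data: two winter and two
# late-summer readings
converted_data_sample = [(datetime.date(2000, 1, 3), 100),
                         (datetime.date(2000, 9, 4), 10),
                         (datetime.date(2001, 1, 1), 120),
                         (datetime.date(2001, 9, 3), 14)]

by_week = defaultdict(list)
for date, inc in converted_data_sample:
    by_week[date.isocalendar()[1]].append(inc)
mean_by_week = {week: sum(v) / len(v) for week, v in by_week.items()}
#+end_src
A plot of ~mean_by_week~ against week number would make the spring peak and September trough explicit.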
* Study of annual incidence
** Compute annual incidence
So, let's calculate the incidence for each year. We'll define this
as the sum of the incidence reports from the beginning of September in
year /N-1/ to the end of August in year /N/.
#+begin_src python :results silent :exports both
def annual_incidence(year):
    start = datetime.datetime.strptime(f"{year-1}-09-01", '%Y-%m-%d').date()
    end = datetime.datetime.strptime(f"{year}-09-01", '%Y-%m-%d').date()
    weeks = [d for d in converted_data if d[0] > start and d[0] <= end]
    return sum([w[1] for w in weeks])
#+end_src
That was quick and dirty, and not especially pretty, but it works.
Now we can define the years we're interested in. These are the years
for which we have a full year's worth of data:
#+begin_src python :results silent :exports both
years = range(1991, 2024)
#+end_src
NB. The second argument to ~range~ is exclusive.
Now we can perform a list comprehension to get the incidence for each
year:
#+begin_src python :results value :exports both
incidence_per_year = [(y,annual_incidence(y)) for y in years]
head, *tail = incidence_per_year
head
#+end_src
#+RESULTS:
| 1991 | 553895 |
** Visual inspection
Now we can plot incidence against year.
#+begin_src python :results value file :var filename="./annual_incidence.png" :exports both
plt.clf()
years,incidences = zip(*incidence_per_year)
plt.bar(years,incidences)
plt.ylabel("Annual incidence")
plt.savefig(filename)
filename
#+end_src
#+RESULTS:
[[file:./annual_incidence.png]]
Eyeballing the plots, it looks as though 2003-04 had the largest
spikes in weekly cases, while 2009-10 featured longer spells of
consistently high case numbers.
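That impression could be made precise by pulling the single week with the highest reported incidence out of ~converted_data~ with ~max~. A sketch, again on a hypothetical stand-in:
#+begin_src python :results silent
import datetime

# Hypothetical stand-in for converted_data
sample = [(datetime.date(2003, 5, 12), 900),
          (datetime.date(2009, 4, 6), 700)]

peak_date, peak_inc = max(sample, key=lambda record: record[1])
#+end_src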
Anyway, let's get the strongest and weakest epidemics:
#+begin_src python :results value :exports both
key = lambda y: y[1]
strongest = max(incidence_per_year, key=key)
weakest = min(incidence_per_year, key=key)
(strongest,weakest)
#+end_src
#+RESULTS:
| 2009 | 841233 |
| 2020 | 221183 |
#+begin_quote
Parents: if you like your kids chickenpox-free, just keep your fingers crossed for a global pandemic
#+end_quote