#+TITLE: Analysis of the incidence of chickenpox
#+AUTHOR: Thomas Rushton
#+DATE: 2024-04-24
#+LANGUAGE: en
#+HTML_HEAD:
#+HTML_HEAD:
#+HTML_HEAD:
#+HTML_HEAD:
#+HTML_HEAD:
#+HTML_HEAD:
#+PROPERTY: header-args :session
#+OPTIONS: ^:{}
* Data acquisition
Let's start by getting the data on chickenpox incidence from [[https://www.sentiweb.fr/][Rèseau Sentinelles]].
#+NAME: data-url
https://www.sentiweb.fr/datasets/incidence-PAY-7.csv
#+begin_src python :results output :var data_url=data-url
data_file = "chickenpox.csv"
import os
import urllib.request
if not os.path.exists(data_file):
urllib.request.urlretrieve(data_url, data_file)
#+end_src
#+RESULTS:
** Format the data
Discard the first line (which is a comment) and split the remaining
lines into columns.
#+begin_src python :results silent :exports both
data = open(data_file, 'rb').read()
lines = data.decode('latin-1').strip().split('\n')
data_lines = lines[1:]
table = [line.split(',') for line in data_lines]
# Could use the csv library instead...
#+end_src
Let's take a gander at the data:
#+begin_src python :results value :exports both
table[:5]
#+end_src
#+RESULTS:
| week | indicator | inc | inc_low | inc_up | inc100 | inc100_low | inc100_up | geo_insee | geo_name |
| 202416 | 7 | 19330 | 13879 | 24781 | 29 | 21 | 37 | FR | France |
| 202415 | 7 | 24807 | 17183 | 32431 | 37 | 26 | 48 | FR | France |
| 202414 | 7 | 16181 | 12544 | 19818 | 24 | 19 | 29 | FR | France |
| 202413 | 7 | 18322 | 14206 | 22438 | 27 | 21 | 33 | FR | France |
Alright, looks good 👍
** Sanitise the data
But there may be problems with the data. Let's check for the obvious
case of missing/empty data:
#+begin_src python :results output :exports both
valid_table = []
for row in table:
missing = any([column == '' for column in row])
if missing:
print(row)
else:
valid_table.append(row)
#+end_src
#+RESULTS:
This is kind of grim, but it does the job.
** Extract required columns
For a temporal analysis, we just need the =week= and =inc= columns:
#+begin_src python :results silent :exports both
week = [row[0] for row in valid_table]
assert week[0] == 'week'
del week[0]
inc = [row[2] for row in valid_table]
assert inc[0] == 'inc'
del inc[0]
data = list(zip(week, inc))
#+end_src
Let's check what we have, using ~None~ to tell Org where to put
separators in the resulting table (which contains the first five and
last five weeks' data):
#+begin_src python :results value :exports both
[('week', 'inc'), None] + data[:5] + [None] + data[-5:]
#+end_src
#+RESULTS:
| week | inc |
|--------+-------|
| 202416 | 19330 |
| 202415 | 24807 |
| 202414 | 16181 |
| 202413 | 18322 |
| 202412 | 12818 |
|--------+-------|
| 199101 | 15565 |
| 199052 | 19375 |
| 199051 | 19080 |
| 199050 | 11079 |
| 199049 | 1143 |
** Convert dates
Dates are represented in ISO 8601 format (YYYYWW) so let's parse
those. It should already be sorted chronologically, but let's make
sure of that too.
#+begin_src python :results silent :exports both
import datetime
converted_data = [(datetime.datetime.strptime(year_and_week + ":1", '%G%V:%u').date(),
int(inc))
for year_and_week, inc in data]
converted_data.sort(key = lambda record: record[0])
#+end_src
Let's check again:
#+begin_src python :results value :exports both
data_as_str = [(str(date), str(inc)) for date, inc in converted_data]
[('date', 'inc'), None] + data_as_str[:5] + [None] + data_as_str[-5:]
#+end_src
#+RESULTS:
| date | inc |
|------------+-------|
| 1990-12-03 | 1143 |
| 1990-12-10 | 11079 |
| 1990-12-17 | 19080 |
| 1990-12-24 | 19375 |
| 1990-12-31 | 15565 |
|------------+-------|
| 2024-03-18 | 12818 |
| 2024-03-25 | 18322 |
| 2024-04-01 | 16181 |
| 2024-04-08 | 24807 |
| 2024-04-15 | 19330 |
** Visual inspection
So, now we can take a look at incidence over time. (The 'flu notebook
switches to R here, but we're going to stick with python.)
#+begin_src python :results output file :var filename="./incidence.png" :exports both
import matplotlib.pyplot as plt
plt.clf()
date,incidence = zip(*converted_data)
plt.plot(date,incidence)
plt.tight_layout()
plt.savefig(filename)
print(filename)
#+end_src
#+RESULTS:
[[file:./incidence.png]]
And we can zoom in on a period of, say, five years:
#+begin_src python :results output file :var filename="./incidence-zoom.png" :exports both
plt.clf()
start = 10
years = 5
date,incidence = zip(*converted_data[52*start:52*(start+years)])
plt.plot(date,incidence)
plt.tight_layout()
plt.savefig(filename)
print(filename)
#+end_src
#+RESULTS:
[[file:./incidence-zoom.png]]
It looks like incidence peaks in the spring, with lowest numbers
around September.