#+TITLE: Analysis of the incidence of chickenpox #+AUTHOR: Thomas Rushton #+DATE: 2024-04-24 #+LANGUAGE: en #+HTML_HEAD: #+HTML_HEAD: #+HTML_HEAD: #+HTML_HEAD: #+HTML_HEAD: #+HTML_HEAD: #+PROPERTY: header-args :session #+OPTIONS: ^:{} * Data acquisition Let's start by getting the data on chickenpox incidence from [[https://www.sentiweb.fr/][Rèseau Sentinelles]]. #+NAME: data-url https://www.sentiweb.fr/datasets/incidence-PAY-7.csv #+begin_src python :results output :var data_url=data-url data_file = "chickenpox.csv" import os import urllib.request if not os.path.exists(data_file): urllib.request.urlretrieve(data_url, data_file) #+end_src #+RESULTS: ** Format the data Discard the first line (which is a comment) and split the remaining lines into columns. #+begin_src python :results silent :exports both data = open(data_file, 'rb').read() lines = data.decode('latin-1').strip().split('\n') data_lines = lines[1:] table = [line.split(',') for line in data_lines] # Could use the csv library instead... #+end_src Let's take a gander at the data: #+begin_src python :results value :exports both table[:5] #+end_src #+RESULTS: | week | indicator | inc | inc_low | inc_up | inc100 | inc100_low | inc100_up | geo_insee | geo_name | | 202416 | 7 | 19330 | 13879 | 24781 | 29 | 21 | 37 | FR | France | | 202415 | 7 | 24807 | 17183 | 32431 | 37 | 26 | 48 | FR | France | | 202414 | 7 | 16181 | 12544 | 19818 | 24 | 19 | 29 | FR | France | | 202413 | 7 | 18322 | 14206 | 22438 | 27 | 21 | 33 | FR | France | Alright, looks good 👍 ** Sanitise the data But there may be problems with the data. Let's check for the obvious case of missing/empty data: #+begin_src python :results output :exports both valid_table = [] for row in table: missing = any([column == '' for column in row]) if missing: print(row) else: valid_table.append(row) #+end_src #+RESULTS: This is kind of grim, but it does the job. ** Extract required columns For a temporal analysis, we just need the =week= and =inc= columns: #+begin_src python :results silent :exports both week = [row[0] for row in valid_table] assert week[0] == 'week' del week[0] inc = [row[2] for row in valid_table] assert inc[0] == 'inc' del inc[0] data = list(zip(week, inc)) #+end_src Let's check what we have, using ~None~ to tell Org where to put separators in the resulting table (which contains the first five and last five weeks' data): #+begin_src python :results value :exports both [('week', 'inc'), None] + data[:5] + [None] + data[-5:] #+end_src #+RESULTS: | week | inc | |--------+-------| | 202416 | 19330 | | 202415 | 24807 | | 202414 | 16181 | | 202413 | 18322 | | 202412 | 12818 | |--------+-------| | 199101 | 15565 | | 199052 | 19375 | | 199051 | 19080 | | 199050 | 11079 | | 199049 | 1143 | ** Convert dates Dates are represented in ISO 8601 format (YYYYWW) so let's parse those. It should already be sorted chronologically, but let's make sure of that too. #+begin_src python :results silent :exports both import datetime converted_data = [(datetime.datetime.strptime(year_and_week + ":1", '%G%V:%u').date(), int(inc)) for year_and_week, inc in data] converted_data.sort(key = lambda record: record[0]) #+end_src Let's check again: #+begin_src python :results value :exports both data_as_str = [(str(date), str(inc)) for date, inc in converted_data] [('date', 'inc'), None] + data_as_str[:5] + [None] + data_as_str[-5:] #+end_src #+RESULTS: | date | inc | |------------+-------| | 1990-12-03 | 1143 | | 1990-12-10 | 11079 | | 1990-12-17 | 19080 | | 1990-12-24 | 19375 | | 1990-12-31 | 15565 | |------------+-------| | 2024-03-18 | 12818 | | 2024-03-25 | 18322 | | 2024-04-01 | 16181 | | 2024-04-08 | 24807 | | 2024-04-15 | 19330 | ** Visual inspection So, now we can take a look at incidence over time. (The 'flu notebook switches to R here, but we're going to stick with python.) #+begin_src python :results output file :var filename="./incidence.png" :exports both import matplotlib.pyplot as plt plt.clf() date,incidence = zip(*converted_data) plt.plot(date,incidence) plt.tight_layout() plt.savefig(filename) print(filename) #+end_src #+RESULTS: [[file:./incidence.png]] And we can zoom in on a period of, say, five years: #+begin_src python :results output file :var filename="./incidence-zoom.png" :exports both plt.clf() start = 10 years = 5 date,incidence = zip(*converted_data[52*start:52*(start+years)]) plt.plot(date,incidence) plt.tight_layout() plt.savefig(filename) print(filename) #+end_src #+RESULTS: [[file:./incidence-zoom.png]] It looks like incidence peaks in the spring, with lowest numbers around September.