#+TITLE: Analysis of the incidence of chickenpox
#+AUTHOR: Thomas Rushton
#+DATE: 2024-04-24
#+LANGUAGE: en
#+PROPERTY: header-args :session
#+OPTIONS: ^:{}

* Data acquisition

Let's start by getting the data on chickenpox incidence from [[https://www.sentiweb.fr/][Réseau Sentinelles]].

#+NAME: data-url
https://www.sentiweb.fr/datasets/incidence-PAY-7.csv

#+begin_src python :results output :var data_url=data-url
import os
import urllib.request

data_file = "chickenpox.csv"
if not os.path.exists(data_file):
    urllib.request.urlretrieve(data_url, data_file)
#+end_src

#+RESULTS:

** Format the data

Discard the first line (which is a comment) and split the remaining lines into columns.

#+begin_src python :results silent :exports both
data = open(data_file, 'rb').read()
lines = data.decode('latin-1').strip().split('\n')
data_lines = lines[1:]
table = [line.split(',') for line in data_lines]
# Could use the csv library instead...
#+end_src

Let's take a gander at the data:

#+begin_src python :results value :exports both
table[:5]
#+end_src

#+RESULTS:
| week   | indicator |   inc | inc_low | inc_up | inc100 | inc100_low | inc100_up | geo_insee | geo_name |
| 202416 |         7 | 19330 |   13879 |  24781 |     29 |         21 |        37 | FR        | France   |
| 202415 |         7 | 24807 |   17183 |  32431 |     37 |         26 |        48 | FR        | France   |
| 202414 |         7 | 16181 |   12544 |  19818 |     24 |         19 |        29 | FR        | France   |
| 202413 |         7 | 18322 |   14206 |  22438 |     27 |         21 |        33 | FR        | France   |

Alright, looks good 👍

** Sanitise the data

But the data may not be without its issues. Let's check for the obvious case of missing/empty data:

#+begin_src python :results output :exports both
valid_table = []
for row in table:
    missing = any([column == '' for column in row])
    if missing:
        print(row)
    else:
        valid_table.append(row)
#+end_src

#+RESULTS:

This is kind of grim, but it does the job.
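As the comment above suggests, the =csv= module could handle both the splitting and the empty-cell filtering for us. Here's a minimal sketch of that alternative, run against a few invented rows rather than the real file (the column values below are made up for illustration):

#+begin_src python
import csv
import io

# Simulate the file layout described above: one comment line, then a
# header row, then data rows (one of them with a missing value).
raw = io.StringIO(
    "# comment line\n"
    "week,indicator,inc\n"
    "202416,7,19330\n"
    "202415,7,\n"
)

reader = csv.reader(raw)
next(reader)                       # skip the leading comment line
rows = list(reader)

# Keep only rows with no empty cells, as in the loop above.
valid = [row for row in rows if all(col != '' for col in row)]

print(valid)  # [['week', 'indicator', 'inc'], ['202416', '7', '19330']]
#+end_src

Same result as the hand-rolled version, and =csv.reader= would also cope with quoted fields should the dataset ever grow any.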
** Extract required columns

For a temporal analysis, we just need the =week= and =inc= columns:

#+begin_src python :results silent :exports both
week = [row[0] for row in valid_table]
assert week[0] == 'week'
del week[0]

inc = [row[2] for row in valid_table]
assert inc[0] == 'inc'
del inc[0]

data = list(zip(week, inc))
#+end_src

Let's check what we have, using ~None~ to tell Org where to put separators in the resulting table (which contains the first five and last five weeks' data):

#+begin_src python :results value :exports both
[('week', 'incidence'), None] + data[:5] + [None] + data[-5:]
#+end_src

#+RESULTS:
| week   | incidence |
|--------+-----------|
| 202416 |     19330 |
| 202415 |     24807 |
| 202414 |     16181 |
| 202413 |     18322 |
| 202412 |     12818 |
|--------+-----------|
| 199101 |     15565 |
| 199052 |     19375 |
| 199051 |     19080 |
| 199050 |     11079 |
| 199049 |      1143 |

** Convert dates

Dates are given as an ISO 8601 year and week number (YYYYWW), so let's parse those. Entries are sorted chronologically, but in reverse, so we'll fix that here too.

#+begin_src python :results silent :exports both
import datetime

converted_data = [(datetime.datetime.strptime(year_and_week + ":1", '%G%V:%u').date(), int(inc))
                  for year_and_week, inc in data]
converted_data.sort(key=lambda record: record[0])
#+end_src

Let's check again:

#+begin_src python :results value :exports both
data_as_str = [(str(date), str(inc)) for date, inc in converted_data]
[('date', 'incidence'), None] + data_as_str[:5] + [None] + data_as_str[-5:]
#+end_src

#+RESULTS:
| date       | incidence |
|------------+-----------|
| 1990-12-03 |      1143 |
| 1990-12-10 |     11079 |
| 1990-12-17 |     19080 |
| 1990-12-24 |     19375 |
| 1990-12-31 |     15565 |
|------------+-----------|
| 2024-03-18 |     12818 |
| 2024-03-25 |     18322 |
| 2024-04-01 |     16181 |
| 2024-04-08 |     24807 |
| 2024-04-15 |     19330 |

** Visual inspection

So, now we can take a look at incidence over time. (The 'flu notebook switches to R here, but we're going to stick with Python.)
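Before plotting, it's worth convincing ourselves that the ~%G%V:%u~ parsing used above does the right thing at year boundaries, since ISO weeks don't line up with calendar years. A small sketch (the =week_to_date= helper is just for illustration; the two dates it prints are standard ISO 8601 facts):

#+begin_src python
import datetime

def week_to_date(year_and_week):
    # Monday of the given ISO week, same parsing as in the notebook.
    return datetime.datetime.strptime(year_and_week + ":1", '%G%V:%u').date()

# ISO week 1 of 2020 starts in calendar year 2019:
print(week_to_date("202001"))  # 2019-12-30

# 2020 is a "long" ISO year, so week 53 exists:
print(week_to_date("202053"))  # 2020-12-28
#+end_src

So a week labelled =202001= can carry December dates, which is exactly why we parse with ~%G~ (ISO year) rather than ~%Y~ (calendar year).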
#+begin_src python :results value file :var filename="./incidence.png" :exports both
import matplotlib.pyplot as plt

plt.clf()
date, incidence = zip(*converted_data)
plt.figure(figsize=(7.5, 5))
plt.plot(date, incidence)
# plt.tight_layout()
plt.savefig(filename)
filename
#+end_src

#+RESULTS:
[[file:./incidence.png]]

And we can zoom in on a period of, say, five years:

#+begin_src python :results value file :var filename="./incidence-zoom.png" :exports both
plt.clf()
start = 15
years = 5
date, incidence = zip(*converted_data[52*start:52*(start+years)])
plt.plot(date, incidence)
# plt.tight_layout()
plt.savefig(filename)
filename
#+end_src

#+RESULTS:
[[file:./incidence-zoom.png]]

It looks like incidence peaks in the spring, with the lowest numbers around September.

* Study of annual incidence

** Compute annual incidence

So, let's calculate the incidence for each year. We'll define this as the sum of the incidence reports from the beginning of September in year /N-1/ to the end of August in year /N/.

#+begin_src python :results silent :exports both
def annual_incidence(year):
    start = datetime.datetime.strptime(f"{year-1}-09-01", '%Y-%m-%d').date()
    end = datetime.datetime.strptime(f"{year}-09-01", '%Y-%m-%d').date()
    weeks = [d for d in converted_data if d[0] > start and d[0] <= end]
    return sum([w[1] for w in weeks])
#+end_src

That was quick and dirty, and not much fun to write, but it works. Now we can define the years we're interested in. These are the years for which we have a full year's worth of data:

#+begin_src python :results silent :exports both
years = range(1991, 2024)
#+end_src

NB. The second argument to ~range~ is exclusive.

Now we can use a list comprehension to get the incidence for each year:

#+begin_src python :results value :exports both
incidence_per_year = [(y, annual_incidence(y)) for y in years]
head, *tail = incidence_per_year
head
#+end_src

#+RESULTS:
| 1991 | 553895 |

** Visual inspection

Now we can plot incidence against year.
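Before we do, a quick toy check of the season-boundary logic in =annual_incidence=: because of the strict ~> start~ comparison, a week dated exactly 1 September counts towards the season that is ending, not the one that is starting. The three-week dataset below is invented purely for illustration:

#+begin_src python
import datetime

# Hypothetical data: (week start date, incidence count).
toy_data = [
    (datetime.date(2000, 9, 1), 10),   # exactly on the boundary -> season 2000
    (datetime.date(2000, 9, 4), 20),   # first week of season 2001
    (datetime.date(2001, 8, 27), 30),  # late in season 2001
]

def toy_annual_incidence(year, data):
    # Same boundary conditions as annual_incidence above.
    start = datetime.date(year - 1, 9, 1)
    end = datetime.date(year, 9, 1)
    return sum(inc for date, inc in data if start < date <= end)

print(toy_annual_incidence(2000, toy_data))  # 10
print(toy_annual_incidence(2001, toy_data))  # 50
#+end_src

Each week lands in exactly one season, which is all we need for the sums to be well-defined.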
#+begin_src python :results value file :var filename="./annual_incidence.png" :exports both
plt.clf()
years, incidences = zip(*incidence_per_year)
plt.bar(years, incidences)
plt.ylabel("Annual incidence")
plt.savefig(filename)
filename
#+end_src

#+RESULTS:
[[file:./annual_incidence.png]]

Eyeballing the plots, it looks like although 2003-04 had the greatest spikes in terms of weekly cases, 2009-10 featured longer spells of consistently high numbers of cases.

Anyway, let's get the strongest and weakest epidemics:

#+begin_src python :results value :exports both
key = lambda y: y[1]
strongest = max(incidence_per_year, key=key)
weakest = min(incidence_per_year, key=key)
(strongest, weakest)
#+end_src

#+RESULTS:
| 2009 | 841233 |
| 2020 | 221183 |

#+begin_quote
Parents: if you like your kids chickenpox-free, just keep your fingers crossed for a global pandemic.
#+end_quote