# Incidence of influenza-like illness in France

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import isoweek

The data on the incidence of influenza-like illness are available from the Web site of the [Réseau Sentinelles](http://www.sentiweb.fr/). We download them as a file in CSV format, in which each line corresponds to a week in the observation period. Only the complete dataset, starting in 1984 and ending with a recent week, is available for download.

In [None]:
data_url = "http://www.sentiweb.fr/datasets/incidence-PAY-3.csv"

This is the documentation of the data from [the download site](https://ns.sentiweb.fr/incidence/csv-schema-v1.json):

| Column name | Description |
|--------------|---------------------------------------------------------------------------------------------------------------------------|
| `week` | ISO8601 Yearweek number as numeric (year times 100 + week number) |
| `indicator` | Unique identifier of the indicator, see metadata document https://www.sentiweb.fr/meta.json |
| `inc` | Estimated incidence value for the time step, in the geographic level |
| `inc_low` | Lower bound of the estimated incidence 95% Confidence Interval |
| `inc_up` | Upper bound of the estimated incidence 95% Confidence Interval |
| `inc100` | Estimated rate incidence per 100,000 inhabitants |
| `inc100_low` | Lower bound of the estimated rate incidence 95% Confidence Interval |
| `inc100_up` | Upper bound of the estimated rate incidence 95% Confidence Interval |
| `geo_insee` | Identifier of the geographic area, from INSEE https://www.insee.fr |
| `geo_name` | Geographic label of the area, corresponding to INSEE code. This label is not an id and is only provided for human reading |
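
The `week` encoding described in the first row of the table can be unpacked with integer arithmetic. The following is a minimal sketch; the helper name `split_yearweek` is hypothetical and not part of the dataset documentation.

```python
def split_yearweek(yearweek):
    """Split a numeric ISO year-week (year * 100 + week) into (year, week)."""
    year, week = divmod(int(yearweek), 100)
    # ISO 8601 weeks are numbered from 1 to 53.
    if not 1 <= week <= 53:
        raise ValueError(f"invalid ISO week number in {yearweek!r}")
    return year, week


year, week = split_yearweek(198919)
print(year, week)  # 1989 19
```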

The first line of the CSV file is a comment, which we ignore with `skiprows=1`.

In [None]:
raw_data = pd.read_csv(data_url, skiprows=1)
raw_data
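
To see what `skiprows=1` does without downloading anything, here is a small sketch on an in-memory file. The comment line and the data rows are made up for illustration; only the column names are taken from the table above.

```python
import io

import pandas as pd

# Made-up miniature file: one comment line, a header row, two data rows.
toy_csv = io.StringIO(
    "# placeholder comment line (not the real one)\n"
    "week,indicator,inc\n"
    "198918,3,100\n"
    "198917,3,90\n"
)

# Skipping one row makes pandas treat the second line as the header.
toy = pd.read_csv(toy_csv, skiprows=1)
print(list(toy.columns))
```

Without `skiprows=1`, pandas would try to use the comment line as the header row.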

Are there missing data points? Yes, week 19 of year 1989 does not have any observed values.

In [None]:
raw_data[raw_data.isnull().any(axis=1)]

We delete this point, which has no significant consequence for our rather simple analysis.

In [None]:
data = raw_data.dropna().copy()
data
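
The two steps above (finding incomplete rows, then dropping them) can be tried on a toy table. This is only a sketch: the incidence values are invented, and only the week numbers echo the gap mentioned in the text.

```python
import pandas as pd

# Invented values; None becomes NaN in a float column.
toy = pd.DataFrame({
    "week": [198918, 198919, 198920],
    "inc": [100.0, None, 120.0],
})

# Rows with at least one missing value.
incomplete = toy[toy.isnull().any(axis=1)]

# Copy of the table without those rows.
clean = toy.dropna().copy()

print(list(incomplete["week"]))
print(list(clean["week"]))
```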

Our dataset uses an uncommon encoding; the week number is attached
to the year number, leaving the impression of a six-digit integer.
That is how Pandas interprets it.

A second problem is that Pandas does not know about week numbers.
It needs to be given the dates of the beginning and end of the week.
We use the library `isoweek` for that.

Since the conversion is a bit lengthy, we write a small Python 
function for doing it. Then we apply it to all points in our dataset. 
The results go into a new column 'period'.

In [None]:
def convert_week(year_and_week_int):
    # Split the six-digit integer into its year and week parts.
    year_and_week_str = str(year_and_week_int)
    year = int(year_and_week_str[:4])
    week = int(year_and_week_str[4:])
    w = isoweek.Week(year, week)
    return pd.Period(w.day(0), 'W')

data['period'] = [convert_week(yw) for yw in data['week']]
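
As a cross-check that does not depend on `isoweek`, the Monday of an ISO week can also be computed with the standard library. This is a sketch of an alternative, not the method used in this notebook; the helper name `week_start` is hypothetical, and `datetime.date.fromisocalendar` requires Python 3.8 or later.

```python
import datetime


def week_start(year_and_week_int):
    """Return the Monday of the ISO week encoded as year * 100 + week."""
    year, week = divmod(int(year_and_week_int), 100)
    # Day 1 of an ISO week is Monday.
    return datetime.date.fromisocalendar(year, week, 1)


monday = week_start(198919)
print(monday)  # 1989-05-08
```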

There are two more small changes to make.

First, we define the observation periods as the new index of
our dataset. That turns it into a time series, which will be
convenient later on.

Second, we sort the points chronologically.

In [None]:
sorted_data = data.set_index('period').sort_index()

We check the consistency of the data. Between the end of a period and
the beginning of the next one, the difference should be zero, or very small.
We tolerate an error of one second.

This is OK except for one pair of consecutive periods between which
a whole week is missing.

We recognize the dates: it's the week without observations that we
have deleted earlier!

In [None]:
periods = sorted_data.index
for p1, p2 in zip(periods[:-1], periods[1:]):
    delta = p2.to_timestamp() - p1.end_time
    if delta > pd.Timedelta('1s'):
        print(p1, p2)
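
The same check can be wrapped in a function and tried on a handful of weekly periods with a deliberate hole. This is a sketch: `find_gaps` is a hypothetical helper, and the dates are arbitrary rather than taken from the dataset.

```python
import pandas as pd


def find_gaps(periods, tolerance=pd.Timedelta('1s')):
    """Return the pairs of consecutive periods separated by more than `tolerance`."""
    gaps = []
    for p1, p2 in zip(periods[:-1], periods[1:]):
        # Time between the end of one period and the start of the next.
        delta = p2.to_timestamp() - p1.end_time
        if delta > tolerance:
            gaps.append((p1, p2))
    return gaps


# Three weekly periods; the week starting 2020-01-20 is deliberately left out.
toy_periods = [pd.Period(day, 'W')
               for day in ('2020-01-06', '2020-01-13', '2020-01-27')]
gaps = find_gaps(toy_periods)
print(gaps)
```

Adjacent weekly periods are separated by far less than the one-second tolerance, so only the pair around the missing week is reported.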

A first look at the data!

In [None]:
sorted_data['inc'].plot()

A zoom on the last few years shows more clearly that the peaks are situated in winter.

In [None]:
sorted_data['inc'][-200:].plot()

## Study of the annual incidence

Since the peaks of the epidemic happen in winter, near the transition
between calendar years, we define the reference period for the annual
incidence from August 1st of year $N$ to August 1st of year $N+1$. We
label this period as year $N+1$ because the peak is always located in
year $N+1$. The very low incidence in summer ensures that the arbitrariness
of the choice of reference period has no impact on our conclusions.

Our task is a bit complicated by the fact that a year does not have an
integer number of weeks. Therefore we modify our reference period a bit:
instead of August 1st, we use the first day of the week containing August 1st.

A final detail: the dataset starts in October 1984, so the first peak is
incomplete. We start the analysis with the first full peak.

In [None]:
first_august_week = [pd.Period(pd.Timestamp(y, 8, 1), 'W')
                     for y in range(1985,
                                    sorted_data.index[-1].year)]
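
The idea of "the first day of the week containing August 1st" can be illustrated with plain dates. This is a sketch that assumes weeks start on Monday, as ISO weeks do; the helper name `august_week_start` is hypothetical.

```python
import datetime


def august_week_start(year):
    """Return the Monday of the week that contains August 1st of `year`."""
    august_first = datetime.date(year, 8, 1)
    # weekday() is 0 for Monday, so this steps back to the start of the week.
    return august_first - datetime.timedelta(days=august_first.weekday())


start = august_week_start(1985)
print(start)  # 1985-07-29
```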

Starting from this list of weeks that contain August 1st, we obtain intervals of approximately one year as the periods between two adjacent weeks in this list. We compute the sums of weekly incidences for all these periods.

We also check that each period contains 52 weeks, give or take one, as a safeguard against potential mistakes in our code.

In [None]:
year = []
yearly_incidence = []
for week1, week2 in zip(first_august_week[:-1],
                        first_august_week[1:]):
    one_year = sorted_data['inc'][week1:week2-1]
    assert abs(len(one_year)-52) < 2
    yearly_incidence.append(one_year.sum())
    year.append(week2.year)
yearly_incidence = pd.Series(data=yearly_incidence, index=year)
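
The tolerance in the assertion can be motivated by counting the weeks between two consecutive reference dates. This is a sketch using plain dates and Monday-based weeks; both helper names are hypothetical. A complete reference period comes out at 52 or 53 weeks, so a count of 51 in the real data would correspond to a period from which one week was dropped, such as the one containing the deleted observation.

```python
import datetime


def august_week_start(year):
    """Return the Monday of the week that contains August 1st of `year`."""
    august_first = datetime.date(year, 8, 1)
    return august_first - datetime.timedelta(days=august_first.weekday())


def weeks_in_reference_year(year):
    """Number of weeks in the reference period labelled `year`."""
    span = august_week_start(year) - august_week_start(year - 1)
    return span.days // 7


counts = {y: weeks_in_reference_year(y) for y in range(1986, 2021)}
print(sorted(set(counts.values())))
```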

And here are the annual incidences.

In [None]:
yearly_incidence.plot(style='*')

A sorted list makes it easier to find the highest values (at the end).

In [None]:
yearly_incidence.sort_values()

Finally, a histogram clearly shows the few very strong epidemics, which affect about 10% of the French population,
but are rare: there were three of them in the course of 35 years. The typical epidemic affects only half as many people.

In [None]:
yearly_incidence.hist(xrot=20)