diff --git a/module3/exo1/influenza-like-illness-analysis.org b/module3/exo1/influenza-like-illness-analysis.org
index 6c8b47ad2eefaa2efae0fcda6640ec8b078e7c32..0102ba8e57bddbea17eb487623dec3510952ff44 100644
--- a/module3/exo1/influenza-like-illness-analysis.org
+++ b/module3/exo1/influenza-like-illness-analysis.org
@@ -45,6 +45,9 @@ The data on the incidence of influenza-like illness are available from the Web s
 #+NAME: data-url
 http://www.sentiweb.fr/datasets/incidence-PAY-3.csv
 
+#+NAME: data-csv
+~/org/incidence-PAY-3.csv
+
 This is the documentation of the data from [[https://ns.sentiweb.fr/incidence/csv-schema-v1.json][the download site]]:
 
 | Column name | Description |
@@ -65,10 +68,12 @@ The [[https://en.wikipedia.org/wiki/ISO_8601][ISO-8601]] format is popular in Eu
 ** Download
 After downloading the raw data, we extract the part we are interested in. We first split the file into lines, of which we discard the first one that contains a comment. We then split the remaining lines into columns.
 
-#+BEGIN_SRC python :results silent :var data_url=data-url
+#+BEGIN_SRC python :results silent :var data_csv=data-csv
 from urllib.request import urlopen
-
-data = urlopen(data_url).read()
+import csv
+#data = urlopen(data_url).read()
+with open(data_csv) as csv_file:
+    data = csv.DictReader(csv_file)
 lines = data.decode('latin-1').strip().split('\n')
 data_lines = lines[1:]
 table = [line.split(',') for line in data_lines]
@@ -79,6 +84,8 @@ Let's have a look at what we have so far:
 table[:5]
 #+END_SRC
 
+#+RESULTS:
+
 ** Checking for missing data
 Unfortunately there are many ways to indicate the absence of a data value in a dataset. Here we check for a common one: empty fields. For completeness, we should also look for non-numerical data in numerical columns. We don't do this here, but checks in later processing steps would catch such anomalies.
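
The changed Download block assigns a `csv.DictReader` to `data` and then calls `data.decode('latin-1')`, which a reader object does not provide. Python's `open` also does not expand the `~` in `~/org/incidence-PAY-3.csv`. A minimal sketch of reading the local copy while leaving the later split steps unchanged, assuming the file has the same layout as the downloaded one (one leading comment line, comma-separated fields, latin-1 encoding); the helper name `load_table` is hypothetical and not part of the patch:

```python
import os


def load_table(data_csv):
    """Read a local copy of the incidence CSV and split it into rows of fields."""
    # open() does not expand "~", so resolve the home directory explicitly.
    path = os.path.expanduser(data_csv)
    # Read raw bytes so that the decode step matches what urlopen().read() returned.
    with open(path, 'rb') as csv_file:
        data = csv_file.read()
    lines = data.decode('latin-1').strip().split('\n')
    # Assumption: the first line is a comment, as described in the text.
    data_lines = lines[1:]
    return [line.split(',') for line in data_lines]
```

Inside the Org source block this would be used as `table = load_table(data_csv)`, so that the following `table[:5]` block works as before.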