diff --git a/module3/exo1/influenza-like-illness-analysis.org b/module3/exo1/influenza-like-illness-analysis.org
index 6c8b47ad2eefaa2efae0fcda6640ec8b078e7c32..fcac274cd160ee85a1b8c668ff1a037923493e94 100644
--- a/module3/exo1/influenza-like-illness-analysis.org
+++ b/module3/exo1/influenza-like-illness-analysis.org
@@ -45,6 +45,9 @@ The data on the incidence of influenza-like illness are available from the Web s
 #+NAME: data-url
 http://www.sentiweb.fr/datasets/incidence-PAY-3.csv
 
+#+NAME: file-name
+influenzaincidence.csv
+
 This is the documentation of the data from [[https://ns.sentiweb.fr/incidence/csv-schema-v1.json][the download site]]:
 
 | Column name | Description |
@@ -65,17 +68,49 @@ The [[https://en.wikipedia.org/wiki/ISO_8601][ISO-8601]] format is popular in Eu
 ** Download
 After downloading the raw data, we extract the part we are interested in. We first split the file into lines, of which we discard the first one that contains a comment. We then split the remaining lines into columns.
-#+BEGIN_SRC python :results silent :var data_url=data-url
+#+BEGIN_SRC python :results output :var data_url=data-url :var file_name=file-name
 from urllib.request import urlopen
+import os.path
+import shutil
+
+
+def downloadFile():
+    # Stream the download straight to disk with shutil.copyfileobj, which
+    # avoids loading the whole file into memory. See
+    # https://docs.python.org/dev/library/shutil.html#shutil.copyfileobj
+    with urlopen(data_url) as response, open(file_name, 'wb') as out_file:
+        shutil.copyfileobj(response, out_file)
+    print("File downloaded!")
+
+
+def loadData():
+    # Download the file only if no local copy is available yet.
+    if os.path.isfile(file_name):
+        print("File exists!")
+    else:
+        print("File not available locally... Trying to download it:")
+        downloadFile()
+    with open(file_name, 'rb') as in_file:
+        return in_file.read()
+
+
+data = loadData()
-data = urlopen(data_url).read()
 lines = data.decode('latin-1').strip().split('\n')
 data_lines = lines[1:]
 table = [line.split(',') for line in data_lines]
 #+END_SRC
 
+#+RESULTS:
+: File not available locally... Trying to download it:
+: File downloaded!
+
 Let's have a look at what we have so far:
 #+BEGIN_SRC python :results value
-table[:5]
+table[:2]
 #+END_SRC
 
+#+RESULTS:
+| week   | indicator | inc   | inc_low | inc_up | inc100 | inc100_low | inc100_up | geo_insee | geo_name |
+| 202108 | 3         | 27492 | 22140   | 32844  | 42     | 34         | 50        | FR        | France   |
+
 ** Checking for missing data
 Unfortunately there are many ways to indicate the absence of a data value in a dataset. Here we check for a common one: empty fields. For completeness, we should also look for non-numerical data in numerical columns. We don't do this here, but checks in later processing steps would catch such anomalies.
@@ -93,6 +128,9 @@ for row in table:
         valid_table.append(row)
 #+END_SRC
 
+#+RESULTS:
+: ['198919', '3', '0', '', '', '0', '', '', 'FR', 'France']
+
 ** Extraction of the required columns
 There are only two columns that we will need for our analysis: the first (~"week"~) and the third (~"inc"~). We check the names in the header to be sure we pick the right data. We make a new table containing just the two columns required, without the header.
 #+BEGIN_SRC python :results silent