Reading data from input file

parent 845a7d14
@@ -45,6 +45,9 @@ The data on the incidence of influenza-like illness are available from the Web s
#+NAME: data-url
http://www.sentiweb.fr/datasets/incidence-PAY-3.csv
#+NAME: file-name
influenzaincidence.csv
This is the documentation of the data from [[https://ns.sentiweb.fr/incidence/csv-schema-v1.json][the download site]]:
| Column name | Description |
@@ -65,20 +68,56 @@ The [[https://en.wikipedia.org/wiki/ISO_8601][ISO-8601]] format is popular in Eu
** Download
After downloading the raw data, we extract the part we are interested in. We first split the file into lines and discard the first one, which contains a comment. We then split the remaining lines into columns.
-#+BEGIN_SRC python :results silent :var data_url=data-url
+#+BEGIN_SRC python :results output :var data_url=data-url :var file_name=file-name
 from urllib.request import urlopen
+import shutil
+import os.path
+
+def downloadFile():
+    result = urlopen(data_url)  # request the file from the server
+    out_file = open(file_name, 'wb')  # local copy, named by the file-name variable
+    # copyfileobj streams in chunks, which suits large files.
+    # See https://docs.python.org/dev/library/shutil.html#shutil.copyfileobj
+    shutil.copyfileobj(result, out_file)
+    out_file.close()  # close the file once the download is complete
+    print("File downloaded!")
+
+def loadData():
+    if os.path.isfile(file_name):
+        print("File Exists!")
+    else:
+        print("File not available locally... Trying to download it:")
+        downloadFile()
+    file = open(file_name, "rb")  # read the local copy
+    data = file.read()
+    file.close()
+    return data
+
+data = loadData()
-data = urlopen(data_url).read()
 lines = data.decode('latin-1').strip().split('\n')
 data_lines = lines[1:]
 table = [line.split(',') for line in data_lines]
 #+END_SRC
#+RESULTS:
: File not available locally... Trying to download it:
: File downloaded!
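The download-once-then-reuse logic in this hunk can also be written with context managers, so that both the network response and the local file are closed even if an error occurs. The following is only a sketch of that idea, not the code of this commit: the function name `fetch_cached` and its `opener` parameter are hypothetical, introduced here so the function can be exercised without network access.

```python
import os.path
import shutil
from urllib.request import urlopen


def fetch_cached(url, path, opener=urlopen):
    """Return the bytes stored at `path`, downloading from `url` first if the file is missing.

    `fetch_cached` and `opener` are illustrative names, not part of the notebook.
    """
    if not os.path.isfile(path):
        # Stream the response to disk in chunks; both handles are closed on exit.
        with opener(url) as response, open(path, "wb") as out_file:
            shutil.copyfileobj(response, out_file)
    # Always read from the local copy, so later runs do not depend on the server.
    with open(path, "rb") as in_file:
        return in_file.read()
```

In the notebook this would be called as `data = fetch_cached(data_url, file_name)`. One caveat of any existence-based cache, including this one: an interrupted download leaves a partial file that later runs will treat as valid.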
Let's have a look at what we have so far:
 #+BEGIN_SRC python :results value
-table[:5]
+table[:2]
 #+END_SRC
#+RESULTS:
| week | indicator | inc | inc_low | inc_up | inc100 | inc100_low | inc100_up | geo_insee | geo_name |
| 202108 | 3 | 27492 | 22140 | 32844 | 42 | 34 | 50 | FR | France |
** Checking for missing data
Unfortunately there are many ways to indicate the absence of a data value in a dataset. Here we check for a common one: empty fields. For completeness, we should also look for non-numerical data in numerical columns. We don't do this here, but checks in later processing steps would catch such anomalies.
@@ -93,6 +132,9 @@ for row in table:
valid_table.append(row)
#+END_SRC
#+RESULTS:
: ['198919', '3', '0', '', '', '0', '', '', 'FR', 'France']
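The text above mentions two kinds of check: empty fields, which the notebook performs, and non-numerical data in numerical columns, which it defers. A minimal sketch combining both might look as follows. The helper names `is_number` and `partition_rows` are hypothetical, the choice of columns 2 to 7 as numerical is an assumption based on the header shown earlier, and the third sample row is made up for illustration.

```python
def is_number(text):
    """Return True if `text` parses as a float (note: this also accepts 'nan' and 'inf')."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def partition_rows(rows, numeric_columns):
    """Split data rows (header excluded) into (valid, rejected) lists."""
    valid, rejected = [], []
    for row in rows:
        has_empty_field = any(field == "" for field in row)
        has_bad_number = any(not is_number(row[i]) for i in numeric_columns)
        if has_empty_field or has_bad_number:
            rejected.append(row)
        else:
            valid.append(row)
    return valid, rejected


sample = [
    ['202108', '3', '27492', '22140', '32844', '42', '34', '50', 'FR', 'France'],
    ['198919', '3', '0', '', '', '0', '', '', 'FR', 'France'],
    ['199001', '3', 'n/a', '1', '2', '3', '4', '5', 'FR', 'France'],  # made-up row
]
# Assumption: columns 2..7 (inc, inc_low, ..., inc100_up) hold numbers.
valid, rejected = partition_rows(sample, numeric_columns=range(2, 8))
for row in rejected:
    print(row)
```

Returning the rejected rows rather than silently dropping them keeps the anomalies visible, in the same spirit as the printed row in the results above.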
** Extraction of the required columns
There are only two columns that we will need for our analysis: the first (~"week"~) and the third (~"inc"~). We check the names in the header to be sure we pick the right data. We make a new table containing just the two columns required, without the header.
#+BEGIN_SRC python :results silent
......