This is the documentation of the data from [[https://ns.sentiweb.fr/incidence/csv-schema-v1.json][the download site]]:
| Column name | Description |
...
...
@@ -65,20 +68,56 @@ The [[https://en.wikipedia.org/wiki/ISO_8601][ISO-8601]] format is popular in Eu
** Download
After downloading the raw data, we extract the part we are interested in. We first split the file into lines and discard the first one, which contains a comment. We then split the remaining lines into columns.
def downloadFile():
    # Request the file and stream it to file_name (influenzaincidence.csv).
    # shutil.copyfileobj copies in chunks, so it also works for large files.
    # See https://docs.python.org/dev/library/shutil.html#shutil.copyfileobj
    with urlopen(data_url) as response, open(file_name, 'wb') as out_file:
        shutil.copyfileobj(response, out_file)
    print("File downloaded!")
    # The response is exhausted after the copy, so return the data from the saved file.
    with open(file_name, 'rb') as saved_file:
        return saved_file.read()

def loadData():
    if os.path.isfile(file_name):
        print("File Exists!")
    else:
        print("File not available locally... Trying to download it:")
        downloadFile()
    file = open(file_name, "rb")  # open the local copy of the file
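The splitting step described at the start of this section can be sketched as follows. This is only an illustration under stated assumptions: the sample bytes, the column name ~indicator~, the helper name ~split_into_table~, the comma separator and the UTF-8 encoding are all hypothetical and should be checked against the actual file.

```python
# Hypothetical stand-in for the downloaded file contents (not real data).
sample_bytes = (
    b"# comment line describing the file\n"
    b"week,indicator,inc\n"
    b"202001,3,1500\n"
    b"202002,3,\n"
)

def split_into_table(raw_bytes, encoding="utf-8", separator=","):
    """Decode the raw bytes, drop the leading comment line,
    and split every remaining line into columns."""
    lines = raw_bytes.decode(encoding).strip().splitlines()
    # lines[0] is the comment line, so the table starts at lines[1].
    return [line.split(separator) for line in lines[1:]]

table = split_into_table(sample_bytes)
for row in table:
    print(row)
```

Using ~splitlines()~ rather than splitting on a literal newline keeps the sketch independent of the line-ending convention of the file.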
Unfortunately, there are many ways to indicate a missing value in a dataset. Here we check for a common one: empty fields. For completeness, we should also look for non-numerical data in numerical columns. We don't do this here, but checks in later processing steps would catch such anomalies.
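A minimal sketch of the empty-field check, assuming the table is a list of rows and each row is a list of strings. The rows and the helper name ~has_empty_field~ are hypothetical and only serve the illustration.

```python
# Hypothetical rows for illustration only (not real surveillance data).
rows = [
    ["202001", "3", "1500"],
    ["202002", "3", ""],
    ["202003", "3", "1720"],
]

def has_empty_field(row):
    """Return True if any field in the row is empty or only whitespace."""
    return any(field.strip() == "" for field in row)

# Separate the rows with missing values from the complete ones.
incomplete_rows = [row for row in rows if has_empty_field(row)]
valid_rows = [row for row in rows if not has_empty_field(row)]

for row in incomplete_rows:
    print("Missing value in row:", row)
```

Printing the incomplete rows instead of dropping them silently makes it possible to inspect what is missing before deciding how to handle it.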
There are only two columns that we will need for our analysis: the first (~"week"~) and the third (~"inc"~). We check the names in the header to be sure we pick the right data. We make a new table containing just the two columns required, without the header.
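One possible way to write the header check and the column selection is shown below. It assumes the table is a list of rows whose first row is the header. The sample values, the middle column name ~indicator~ and the helper name ~select_week_and_inc~ are hypothetical.

```python
# Hypothetical table: a header row followed by data rows (not real data).
table = [
    ["week", "indicator", "inc"],
    ["202001", "3", "1500"],
    ["202003", "3", "1720"],
]

def select_week_and_inc(table):
    """Check that columns 1 and 3 are named "week" and "inc",
    then return only those two columns, without the header."""
    header = table[0]
    if header[0] != "week" or header[2] != "inc":
        raise ValueError(f"Unexpected header: {header}")
    return [[row[0], row[2]] for row in table[1:]]

week_inc = select_week_and_inc(table)
for week, inc in week_inc:
    print(week, inc)
```

Raising an error on an unexpected header stops the analysis early if the column layout of the file ever changes.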