Exercise 3 Part 2

Commit 0be13543 authored Sep 16, 2020 by Jamal KHAN
5 changed files
--- a/module3/exo2/chickenpox_incidence.csv
+++ b/module3/exo2/chickenpox_incidence.csv
--- a/module3/exo2/chickepox_timeseries.png
+++ b/module3/exo2/chickepox_timeseries.png
--- a/module3/exo2/chickepox_timeseries_short.png
+++ b/module3/exo2/chickepox_timeseries_short.png
--- a/module3/exo2/chickepox_timeseries_yearly.png
+++ b/module3/exo2/chickepox_timeseries_yearly.png
--- a/module3/exo2/exercice_python_en.org
+++ b/module3/exo2/exercice_python_en.org
-#+TITLE:  Your title
+#+TITLE: Analysis of the incidence of chickenpox
-#+AUTHOR: Your name
+#+AUTHOR: Jamal KHAN
-#+DATE:   Today's date
+#+DATE: 2020-09-16
 #+LANGUAGE: en
-# #+PROPERTY: header-args :eval never-export
+# #+PROPERTY: header-args :session *python* :exports both
 #+HTML_HEAD: <link rel="stylesheet" type="text/css" href="http://www.pirilampo.org/styles/readtheorg/css/htmlize.css"/>
 #+HTML_HEAD: <link rel="stylesheet" type="text/css" href="http://www.pirilampo.org/styles/readtheorg/css/readtheorg.css"/>
@@ -11,84 +11,106 @@
 #+HTML_HEAD: <script type="text/javascript" src="http://www.pirilampo.org/styles/lib/js/jquery.stickytableheaders.js"></script>
 #+HTML_HEAD: <script type="text/javascript" src="http://www.pirilampo.org/styles/readtheorg/js/readtheorg.js"></script>
-* Some explanations
+* Data download
+#+NAME: data-url
+https://www.sentiweb.fr/datasets/incidence-PAY-7.csv
-This is an org-mode document with code examples in R.  Once opened in
+#+BEGIN_SRC python :session *python* :results output :var data_url=data-url
-Emacs, this document can easily be exported to HTML, PDF, and Office
+data_file = 'chickenpox_incidence.csv'
-formats. For more information on org-mode, see
-https://orgmode.org/guide/.
-When you type the shortcut =C-c C-e h o=, this document will be
+import datetime
-exported as HTML. All the code in it will be re-executed, and the
+from urllib.request import urlretrieve
-results will be retrieved and included into the exported document. If
+import os
-you do not want to re-execute all code each time, you can delete the #
-and the space before ~#+PROPERTY:~ in the header of this document.
-Like we showed in the video, Python code is included as follows (and
+if not os.path.exists(data_file):
-is exxecuted by typing ~C-c C-c~):
+    urlretrieve(data_url, data_file)
-#+begin_src python :results output :exports both
+print(f'Data is retrieved at {datetime.datetime.utcnow()} UTC')
-print("Hello world!")
+#+END_SRC
-#+end_src
 #+RESULTS:
-: Hello world!
+: 
+: Data is retrieved at 2020-09-16 22:07:31.650075 UTC
-And now the same but in an Python session. With a session, Python's
+Now we extract the relevant part of the data. According to the file format, the week is in column 0 and the incidence is in column 2. We take Monday as the first day of the week, hence the '%W' format code in Python/pandas.
-state, i.e. the values of all the variables, remains persistent from
-one code block to the next. The code is still executed using ~C-c
-C-c~.
-#+begin_src python :results output :session :exports both
+#+BEGIN_SRC python :session *python* :results output :exports both
-import numpy
+import pandas as pd
-x=numpy.linspace(-15,15)
+data = pd.read_csv(data_file, skiprows=2, header=None)
-print(x)
+data = data.loc[:, [0, 2]].rename(columns={0:'Datetime', 2:'Incidence'})
-#+end_src
+data.Datetime = pd.to_datetime(data.Datetime*10+1, format='%Y%W%w')
+data = data.sort_values(by='Datetime')
+data = data.set_index('Datetime')
+data.describe()
+#+END_SRC
 #+RESULTS:
-#+begin_example
+:           Incidence
-[-15.         -14.3877551  -13.7755102  -13.16326531 -12.55102041
+: count   1554.000000
- -11.93877551 -11.32653061 -10.71428571 -10.10204082  -9.48979592
+: mean   12647.119691
-  -8.87755102  -8.26530612  -7.65306122  -7.04081633  -6.42857143
+: std     6657.542827
-  -5.81632653  -5.20408163  -4.59183673  -3.97959184  -3.36734694
+: min      161.000000
-  -2.75510204  -2.14285714  -1.53061224  -0.91836735  -0.30612245
+: 25%     7326.750000
-   0.30612245   0.91836735   1.53061224   2.14285714   2.75510204
+: 50%    12627.000000
-   3.36734694   3.97959184   4.59183673   5.20408163   5.81632653
+: 75%    17155.000000
-   6.42857143   7.04081633   7.65306122   8.26530612   8.87755102
+: max    36298.000000
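The week identifiers above are parsed with `%W`, which counts weeks from the first Monday of the calendar year. If the Sentiweb identifiers are ISO 8601 week numbers (an assumption worth checking against the dataset documentation), the two conventions can differ by one week. Here is a minimal standard-library sketch comparing them; `parse_week` is a hypothetical helper, not part of the original analysis:

```python
from datetime import datetime


def parse_week(yyyyww: int, iso: bool = True) -> datetime:
    """Return the Monday of the week encoded as yyyyww.

    iso=True uses ISO 8601 week numbering (%G%V%u, Python 3.6+);
    iso=False uses the first-Monday convention (%Y%W%w).
    """
    if iso:
        # %u is the ISO weekday, where 1 means Monday
        return datetime.strptime(f"{yyyyww}1", "%G%V%u")
    # %w is the weekday with 0 = Sunday, so 1 means Monday
    return datetime.strptime(f"{yyyyww}1", "%Y%W%w")


print(parse_week(202037, iso=True).date())   # 2020-09-07
print(parse_week(202037, iso=False).date())  # 2020-09-14
```

For 2020 the two readings are a week apart, so the choice of format code shifts every date in that year.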
-   9.48979592  10.10204082  10.71428571  11.32653061  11.93877551
-  12.55102041  13.16326531  13.7755102   14.3877551   15.        ]
+Now we check for missing data. Pandas reads missing entries as NaN, so we count the NaN values in the dataframe.
-#+end_example
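+#+BEGIN_SRC python :session *python* :results output :exports both
+# DataFrame has no is_na() method; isna() flags NaN cells, sum() counts them per column
+print(data.isna().sum())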
-Finally, an example for graphical output:
+#+END_SRC
-#+begin_src python :results output file :session :var matplot_lib_filename="./cosxsx.png" :exports results
+#+RESULTS:
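Counting NaN values only detects missing entries inside existing rows; a week that is absent from the file altogether would go unnoticed. One way to check for that, sketched here with the standard library on made-up dates (`find_gaps` is a hypothetical helper), is to verify that consecutive observations are exactly seven days apart:

```python
from datetime import date, timedelta


def find_gaps(dates, step=timedelta(days=7)):
    """Return (previous, current) pairs whose spacing differs from `step`."""
    ordered = sorted(dates)
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a != step]


# Illustrative dates only: the week of 1991-01-21 is deliberately left out.
weeks = [date(1991, 1, 7), date(1991, 1, 14), date(1991, 1, 28)]
print(find_gaps(weeks))
```

The same helper should also accept the pandas index used in this document, for example `find_gaps(data.index)`, since pandas timestamps support the same subtraction and comparison.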
+The data looks fine. Next, we plot a time series of the chickenpox incidence.
+#+BEGIN_SRC python :session *python* :results output file :var ts_plot="chickepox_timeseries.png" :exports both
+import matplotlib.pyplot as plt
+fig, ax = plt.subplots(figsize=(6, 3))
+data.plot(ax=ax)
+plt.savefig(ts_plot)
+print(ts_plot)
+#+END_SRC
+#+RESULTS:
+[[file:chickepox_timeseries.png]]
+The data starts at the beginning of 1991 and ends at the beginning of 2020. The month-to-month evolution is hard to see in such a long time series, so we plot a shorter period.
+#+BEGIN_SRC python :session *python* :results output file :var ts_plot="chickepox_timeseries_short.png" :exports both
+import matplotlib.pyplot as plt
+fig, ax = plt.subplots(figsize=(6, 3))
+data['1991-01-01':'1994-01-01'].plot(ax=ax)
+plt.savefig(ts_plot)
+print(ts_plot)
+#+END_SRC
+#+RESULTS:
+[[file:chickepox_timeseries_short.png]]
+The annual dip appears to fall in November, so I group the data into yearly periods starting in November.
+#+BEGIN_SRC python :session *python* :results output file :var ts_plot="chickepox_timeseries_yearly.png" :exports both
+data_yearly = data.groupby(data.index.shift(8, freq='m').year).sum()
+data_yearly = data_yearly.iloc[1:-2]
 import matplotlib.pyplot as plt
+fig, ax = plt.subplots(figsize=(6, 3))
+data_yearly.plot(ax=ax)
+plt.savefig(ts_plot)
+print(data_yearly.sort_values(by='Incidence'))
+print(ts_plot)
+#+END_SRC
-plt.figure(figsize=(10,5))
+#+RESULTS:
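To make the start of the yearly period explicit, each date can be labelled with the year in which its November-to-October period ends, and the incidence summed per label. The sketch below uses the standard library and invented numbers; `season_year` and its `start_month` parameter are hypothetical names, not part of the original analysis:

```python
from collections import defaultdict
from datetime import date


def season_year(day: date, start_month: int = 11) -> int:
    """Label a date with the year in which its 12-month period ends.

    With start_month=11, November 1990 through October 1991 map to 1991.
    """
    if start_month > 1 and day.month >= start_month:
        return day.year + 1
    return day.year


# Invented (date, incidence) pairs for illustration only.
observations = [
    (date(1990, 12, 3), 100),
    (date(1991, 10, 28), 200),
    (date(1991, 11, 4), 300),
]

totals = defaultdict(int)
for day, incidence in observations:
    totals[season_year(day)] += incidence

print(dict(totals))  # {1991: 300, 1992: 300}
```

With pandas, the same labels could probably be used directly as a grouping key, for example `data.groupby(data.index.map(season_year)).sum()`, which keeps the period boundary visible in one place.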
-plt.plot(x,numpy.cos(x)/x)
-plt.tight_layout()
-plt.savefig(matplot_lib_filename)
+Plot of the aggregated values. 
-print(matplot_lib_filename)
+#+BEGIN_SRC python :session *python* :results output file :var ts_plot="chickepox_timeseries_yearly.png" :exports both
-#+end_src
+import matplotlib.pyplot as plt
+fig, ax = plt.subplots(figsize=(6, 3))
+data_yearly.plot(ax=ax)
+plt.savefig(ts_plot)
+print(ts_plot)
+#+END_SRC
 #+RESULTS:
-[[file:./cosxsx.png]]
+[[file:chickepox_timeseries_yearly.png]]
-Note the parameter ~:exports results~, which indicates that the code
-will not appear in the exported document. We recommend that in the
-context of this MOOC, you always leave this parameter setting as
-~:exports both~, because we want your analyses to be perfectly
-transparent and reproducible.
-Watch out: the figure generated by the code block is /not/ stored in
-the org document. It's a plain file, here named ~cosxsx.png~. You have
-to commit it explicitly if you want your analysis to be legible and
-understandable on GitLab.
-Finally, don't forget that we provide in the resource section of this
-MOOC a configuration with a few keyboard shortcuts that allow you to
-quickly create code blocks in Python by typing ~<p~, ~<P~ or ~<PP~
-followed by ~Tab~.
-Now it's your turn! You can delete all this information and replace it
-by your computational document.