exo4

Commit 7fbc3cae authored Mar 03, 2021 by Corentin Ambroise
4 changed files
--- a/exo4/coronavirus.politologue.com-pays-2021-03-03.csv
+++ b/exo4/coronavirus.politologue.com-pays-2021-03-03.csv
--- a/exo4/exercice_python_fr.html
+++ b/exo4/exercice_python_fr.html
+<?xml version="1.0" encoding="utf-8"?>
+<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
+"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
+<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
+<head>
+<!-- 2021-03-03 Mer 15:39 -->
+<meta http-equiv="Content-Type" content="text/html;charset=utf-8" />
+<meta name="viewport" content="width=device-width, initial-scale=1" />
+<title>Worldwide covid evolution in February 2021</title>
+<meta name="generator" content="Org mode" />
+<meta name="author" content="Corentin Ambroise" />
+<style type="text/css">
+ <!--/*--><![CDATA[/*><!--*/
+  .title  { text-align: center;
+             margin-bottom: .2em; }
+  .subtitle { text-align: center;
+              font-size: medium;
+              font-weight: bold;
+              margin-top:0; }
+  .todo   { font-family: monospace; color: red; }
+  .done   { font-family: monospace; color: green; }
+  .priority { font-family: monospace; color: orange; }
+  .tag    { background-color: #eee; font-family: monospace;
+            padding: 2px; font-size: 80%; font-weight: normal; }
+  .timestamp { color: #bebebe; }
+  .timestamp-kwd { color: #5f9ea0; }
+  .org-right  { margin-left: auto; margin-right: 0px;  text-align: right; }
+  .org-left   { margin-left: 0px;  margin-right: auto; text-align: left; }
+  .org-center { margin-left: auto; margin-right: auto; text-align: center; }
+  .underline { text-decoration: underline; }
+  #postamble p, #preamble p { font-size: 90%; margin: .2em; }
+  p.verse { margin-left: 3%; }
+  pre {
+    border: 1px solid #ccc;
+    box-shadow: 3px 3px 3px #eee;
+    padding: 8pt;
+    font-family: monospace;
+    overflow: auto;
+    margin: 1.2em;
+  }
+  pre.src {
+    position: relative;
+    overflow: visible;
+    padding-top: 1.2em;
+  }
+  pre.src:before {
+    display: none;
+    position: absolute;
+    background-color: white;
+    top: -10px;
+    right: 10px;
+    padding: 3px;
+    border: 1px solid black;
+  }
+  pre.src:hover:before { display: inline;}
+  /* Languages per Org manual */
+  pre.src-asymptote:before { content: 'Asymptote'; }
+  pre.src-awk:before { content: 'Awk'; }
+  pre.src-C:before { content: 'C'; }
+  /* pre.src-C++ doesn't work in CSS */
+  pre.src-clojure:before { content: 'Clojure'; }
+  pre.src-css:before { content: 'CSS'; }
+  pre.src-D:before { content: 'D'; }
+  pre.src-ditaa:before { content: 'ditaa'; }
+  pre.src-dot:before { content: 'Graphviz'; }
+  pre.src-calc:before { content: 'Emacs Calc'; }
+  pre.src-emacs-lisp:before { content: 'Emacs Lisp'; }
+  pre.src-fortran:before { content: 'Fortran'; }
+  pre.src-gnuplot:before { content: 'gnuplot'; }
+  pre.src-haskell:before { content: 'Haskell'; }
+  pre.src-hledger:before { content: 'hledger'; }
+  pre.src-java:before { content: 'Java'; }
+  pre.src-js:before { content: 'Javascript'; }
+  pre.src-latex:before { content: 'LaTeX'; }
+  pre.src-ledger:before { content: 'Ledger'; }
+  pre.src-lisp:before { content: 'Lisp'; }
+  pre.src-lilypond:before { content: 'Lilypond'; }
+  pre.src-lua:before { content: 'Lua'; }
+  pre.src-matlab:before { content: 'MATLAB'; }
+  pre.src-mscgen:before { content: 'Mscgen'; }
+  pre.src-ocaml:before { content: 'Objective Caml'; }
+  pre.src-octave:before { content: 'Octave'; }
+  pre.src-org:before { content: 'Org mode'; }
+  pre.src-oz:before { content: 'OZ'; }
+  pre.src-plantuml:before { content: 'Plantuml'; }
+  pre.src-processing:before { content: 'Processing.js'; }
+  pre.src-python:before { content: 'Python'; }
+  pre.src-R:before { content: 'R'; }
+  pre.src-ruby:before { content: 'Ruby'; }
+  pre.src-sass:before { content: 'Sass'; }
+  pre.src-scheme:before { content: 'Scheme'; }
+  pre.src-screen:before { content: 'Gnu Screen'; }
+  pre.src-sed:before { content: 'Sed'; }
+  pre.src-sh:before { content: 'shell'; }
+  pre.src-sql:before { content: 'SQL'; }
+  pre.src-sqlite:before { content: 'SQLite'; }
+  /* additional languages in org.el's org-babel-load-languages alist */
+  pre.src-forth:before { content: 'Forth'; }
+  pre.src-io:before { content: 'IO'; }
+  pre.src-J:before { content: 'J'; }
+  pre.src-makefile:before { content: 'Makefile'; }
+  pre.src-maxima:before { content: 'Maxima'; }
+  pre.src-perl:before { content: 'Perl'; }
+  pre.src-picolisp:before { content: 'Pico Lisp'; }
+  pre.src-scala:before { content: 'Scala'; }
+  pre.src-shell:before { content: 'Shell Script'; }
+  pre.src-ebnf2ps:before { content: 'ebfn2ps'; }
+  /* additional language identifiers per "defun org-babel-execute"
+       in ob-*.el */
+  pre.src-cpp:before  { content: 'C++'; }
+  pre.src-abc:before  { content: 'ABC'; }
+  pre.src-coq:before  { content: 'Coq'; }
+  pre.src-groovy:before  { content: 'Groovy'; }
+  /* additional language identifiers from org-babel-shell-names in
+     ob-shell.el: ob-shell is the only babel language using a lambda to put
+     the execution function name together. */
+  pre.src-bash:before  { content: 'bash'; }
+  pre.src-csh:before  { content: 'csh'; }
+  pre.src-ash:before  { content: 'ash'; }
+  pre.src-dash:before  { content: 'dash'; }
+  pre.src-ksh:before  { content: 'ksh'; }
+  pre.src-mksh:before  { content: 'mksh'; }
+  pre.src-posh:before  { content: 'posh'; }
+  /* Additional Emacs modes also supported by the LaTeX listings package */
+  pre.src-ada:before { content: 'Ada'; }
+  pre.src-asm:before { content: 'Assembler'; }
+  pre.src-caml:before { content: 'Caml'; }
+  pre.src-delphi:before { content: 'Delphi'; }
+  pre.src-html:before { content: 'HTML'; }
+  pre.src-idl:before { content: 'IDL'; }
+  pre.src-mercury:before { content: 'Mercury'; }
+  pre.src-metapost:before { content: 'MetaPost'; }
+  pre.src-modula-2:before { content: 'Modula-2'; }
+  pre.src-pascal:before { content: 'Pascal'; }
+  pre.src-ps:before { content: 'PostScript'; }
+  pre.src-prolog:before { content: 'Prolog'; }
+  pre.src-simula:before { content: 'Simula'; }
+  pre.src-tcl:before { content: 'tcl'; }
+  pre.src-tex:before { content: 'TeX'; }
+  pre.src-plain-tex:before { content: 'Plain TeX'; }
+  pre.src-verilog:before { content: 'Verilog'; }
+  pre.src-vhdl:before { content: 'VHDL'; }
+  pre.src-xml:before { content: 'XML'; }
+  pre.src-nxml:before { content: 'XML'; }
+  /* add a generic configuration mode; LaTeX export needs an additional
+     (add-to-list 'org-latex-listings-langs '(conf " ")) in .emacs */
+  pre.src-conf:before { content: 'Configuration File'; }
+  table { border-collapse:collapse; }
+  caption.t-above { caption-side: top; }
+  caption.t-bottom { caption-side: bottom; }
+  td, th { vertical-align:top;  }
+  th.org-right  { text-align: center;  }
+  th.org-left   { text-align: center;   }
+  th.org-center { text-align: center; }
+  td.org-right  { text-align: right;  }
+  td.org-left   { text-align: left;   }
+  td.org-center { text-align: center; }
+  dt { font-weight: bold; }
+  .footpara { display: inline; }
+  .footdef  { margin-bottom: 1em; }
+  .figure { padding: 1em; }
+  .figure p { text-align: center; }
+  .equation-container {
+    display: table;
+    text-align: center;
+    width: 100%;
+  }
+  .equation {
+    vertical-align: middle;
+  }
+  .equation-label {
+    display: table-cell;
+    text-align: right;
+    vertical-align: middle;
+  }
+  .inlinetask {
+    padding: 10px;
+    border: 2px solid gray;
+    margin: 10px;
+    background: #ffffcc;
+  }
+  #org-div-home-and-up
+   { text-align: right; font-size: 70%; white-space: nowrap; }
+  textarea { overflow-x: auto; }
+  .linenr { font-size: smaller }
+  .code-highlighted { background-color: #ffff00; }
+  .org-info-js_info-navigation { border-style: none; }
+  #org-info-js_console-label
+    { font-size: 10px; font-weight: bold; white-space: nowrap; }
+  .org-info-js_search-highlight
+    { background-color: #ffff00; color: #000000; font-weight: bold; }
+  .org-svg { width: 90%; }
+  /*]]>*/-->
+</style>
+<script type="text/javascript">
+/*
+@licstart  The following is the entire license notice for the
+JavaScript code in this tag.
+Copyright (C) 2012-2020 Free Software Foundation, Inc.
+The JavaScript code in this tag is free software: you can
+redistribute it and/or modify it under the terms of the GNU
+General Public License (GNU GPL) as published by the Free Software
+Foundation, either version 3 of the License, or (at your option)
+any later version.  The code is distributed WITHOUT ANY WARRANTY;
+without even the implied warranty of MERCHANTABILITY or FITNESS
+FOR A PARTICULAR PURPOSE.  See the GNU GPL for more details.
+As additional permission under GNU GPL version 3 section 7, you
+may distribute non-source (e.g., minimized or compacted) forms of
+that code without the copy of the GNU GPL normally required by
+section 4, provided you include this license notice and a URL
+through which recipients can access the Corresponding Source.
+@licend  The above is the entire license notice
+for the JavaScript code in this tag.
+*/
+<!--/*--><![CDATA[/*><!--*/
+ function CodeHighlightOn(elem, id)
+ {
+   var target = document.getElementById(id);
+   if(null != target) {
+     elem.cacheClassElem = elem.className;
+     elem.cacheClassTarget = target.className;
+     target.className = "code-highlighted";
+     elem.className   = "code-highlighted";
+   }
+ }
+ function CodeHighlightOff(elem, id)
+ {
+   var target = document.getElementById(id);
+   if(elem.cacheClassElem)
+     elem.className = elem.cacheClassElem;
+   if(elem.cacheClassTarget)
+     target.className = elem.cacheClassTarget;
+ }
+/*]]>*///-->
+</script>
+</head>
+<body>
+<div id="content">
+<h1 class="title">Worldwide covid evolution in February 2021</h1>
+<div id="table-of-contents">
+<h2>Table of Contents</h2>
+<div id="text-table-of-contents">
+<ul>
+<li><a href="#org3e6a389">1. Dataset</a></li>
+<li><a href="#orgcf424dd">2. Statistics</a></li>
+<li><a href="#org9103eb8">3. Plotting</a></li>
+</ul>
+</div>
+</div>
+<div id="outline-container-org3e6a389" class="outline-2">
+<h2 id="org3e6a389"><span class="section-number-2">1</span> Dataset</h2>
+<div class="outline-text-2" id="text-1">
+<p>
+We introduce <a href="https://www.data.gouv.fr/fr/datasets/coronavirus-covid19-evolution-par-pays-et-dans-le-monde-maj-quotidienne/">this dataset</a>, downloaded on 03/03/2021. We
+chose the per-country daily dataset, which requires some
+preprocessing but allows a more fine-grained statistical analysis.
+</p>
+<div class="org-src-container">
+<pre class="src src-python"><span style="color: #a020f0;">import</span> pandas <span style="color: #a020f0;">as</span> pd
+<span style="color: #a0522d;">data</span> = pd.read_csv(<span style="color: #8b2252;">'./coronavirus.politologue.com-pays-2021-03-03.csv'</span>, skiprows=7, sep=<span style="color: #8b2252;">';'</span>)
+data.head()
+</pre>
+</div>
+<pre class="example">
+         Date                 Pays  ...  TauxGuerison  TauxInfection
+0  2021-03-03              Andorre  ...         96.27           2.72
+1  2021-03-03  Émirats Arabes Unis  ...         96.78           2.90
+2  2021-03-03          Afghanistan  ...         88.50           7.11
+3  2021-03-03   Antigua-et-Barbuda  ...         39.92          58.26
+4  2021-03-03              Albanie  ...         65.40          32.91
+[5 rows x 8 columns]
+</pre>
+<p>
+Let's see how big the data is, and the date range it covers.
+</p>
+<div class="org-src-container">
+<pre class="src src-python"><span style="color: #a020f0;">print</span>(data.shape)
+<span style="color: #a0522d;">data</span>[<span style="color: #8b2252;">'Date'</span>] = pd.to_datetime(data[<span style="color: #8b2252;">'Date'</span>])
+<span style="color: #a020f0;">print</span>(<span style="color: #483d8b;">min</span>(data[<span style="color: #8b2252;">'Date'</span>]))
+<span style="color: #a020f0;">print</span>(<span style="color: #483d8b;">max</span>(data[<span style="color: #8b2252;">'Date'</span>]))
+</pre>
+</div>
+<pre class="example">
+(6293, 8)
+2021-02-01 00:00:00
+2021-03-03 00:00:00
+</pre>
+<p>
+It is a fairly small dataset, so the computations should be
+fast. Let's look at the columns.
+</p>
+<div class="org-src-container">
+<pre class="src src-python"><span style="color: #a020f0;">print</span>(data.columns)
+</pre>
+</div>
+<pre class="example">
+Index(['Date', 'Pays', 'Infections', 'Deces', 'Guerisons', 'TauxDeces',
+       'TauxGuerison', 'TauxInfection'],
+      dtype='object')
+</pre>
+<p>
+So we have a multivariate time series for each country, covering
+several daily metrics. Looking at TauxGuerison (recovery rate) or
+TauxDeces (death rate) could give us a sense of the quality of each
+country's medical care. The three rates always sum to roughly 100%:
+</p>
+<div class="org-src-container">
+<pre class="src src-python"><span style="color: #a0522d;">rate_columns</span> = data.columns[-3:]
+<span style="color: #a020f0;">print</span>(data[rate_columns].<span style="color: #483d8b;">sum</span>(1).unique())
+</pre>
+</div>
+<pre class="example">
+[100.    99.99  99.99 100.01 100.   100.   100.01 100.01  99.99]
+</pre>
+</div>
+</div>
+<div id="outline-container-orgcf424dd" class="outline-2">
+<h2 id="orgcf424dd"><span class="section-number-2">2</span> Statistics</h2>
+<div class="outline-text-2" id="text-2">
+<p>
+We want to compute statistics over February, per country, so we
+start by aggregating the data per country. First, we compute the
+average of each rate metric for each country.
+</p>
+<div class="org-src-container">
+<pre class="src src-python"><span style="color: #a0522d;">count_columns</span> = data.columns[2:-3]
+<span style="color: #a0522d;">data_grouped</span> = data.groupby(<span style="color: #8b2252;">'Pays'</span>)
+<span style="color: #a0522d;">mean_rates_per_country</span> = data_grouped[rate_columns].mean()
+mean_rates_per_country.head()
+</pre>
+</div>
+<pre class="example">
+                TauxDeces  TauxGuerison  TauxInfection
+Pays                                                  
+Afghanistan      4.370968     87.556452       8.071935
+Afrique du Sud   3.217419     93.025806       3.757097
+Albanie          1.689032     62.334839      35.976774
+Algérie          2.656774     68.720645      28.625161
+Allemagne        2.785484     91.093226       6.121290
+</pre>
+<p>
+Let's see which countries have the highest death rate over
+the month of February. We expect them to be poorer countries, with
+fewer resources to treat their patients.
+</p>
+<div class="org-src-container">
+<pre class="src src-python"><span style="color: #a020f0;">print</span>(mean_rates_per_country.sort_values(<span style="color: #8b2252;">'TauxDeces'</span>, ascending=<span style="color: #008b8b;">False</span>).head(10))
+</pre>
+</div>
+<pre class="example">
+             TauxDeces  TauxGuerison  TauxInfection
+Pays                                               
+Yémen        28.496452     65.724194       5.779677
+Mexique       8.748710     77.767742      13.484194
+Syrie         6.580968     59.050000      34.369677
+Soudan        6.193226     74.704839      19.101613
+Égypte        5.763226     77.599355      16.636774
+Équateur      5.721935     84.903548       9.375806
+Chine         5.163226     94.075484       0.760645
+Bolivie       4.720968     75.581613      19.697419
+Afghanistan   4.370968     87.556452       8.071935
+Libéria       4.273226     91.757419       3.965806
+</pre>
+<p>
+Indeed, some of these countries can be described as poor. Yemen seems
+to be hit extremely hard by the epidemic: on average, about 28% of its
+recorded infections ended in death. Yemen is a very poor country, but
+let's inspect this number, which is very high compared to the other countries.
+</p>
+<div class="org-src-container">
+<pre class="src src-python">data_grouped.mean()[count_columns].loc[<span style="color: #8b2252;">'Y&#233;men'</span>]
+</pre>
+</div>
+<pre class="example">
+Infections    2178.806452
+Deces          620.516129
+Guerisons     1430.709677
+Name: Yémen, dtype: float64
+</pre>
+<p>
+Now let's compare with the median across countries for each metric.
+</p>
+<div class="org-src-container">
+<pre class="src src-python">data_grouped.mean()[count_columns].median(0)
+</pre>
+</div>
+<pre class="example">
+Infections    50333.935484
+Deces           613.387097
+Guerisons     23364.000000
+dtype: float64
+</pre>
+<p>
+Yemen has about as many deaths as the median country, while
+reporting far fewer infections. This could be due either to a lack of
+testing in the country or to very poor medical care conditions. Both
+are consistent with the growing poverty of the country, aggravated by war.
+</p>
+</div>
+</div>
+<div id="outline-container-org9103eb8" class="outline-2">
+<h2 id="org9103eb8"><span class="section-number-2">3</span> Plotting</h2>
+<div class="outline-text-2" id="text-3">
+<p>
+Now we can plot many things. For instance, we can inspect a country of
+interest and see how it evolves over the month of
+February. Let's see how the US was affected.
+</p>
+<div class="org-src-container">
+<pre class="src src-python"><span style="color: #a020f0;">import</span> matplotlib.pyplot <span style="color: #a020f0;">as</span> plt
+<span style="color: #a0522d;">country_data</span> = data[data[<span style="color: #8b2252;">'Pays'</span>] == <span style="color: #8b2252;">'&#201;tats-Unis'</span>]
+country_data.plot(<span style="color: #8b2252;">'Date'</span>, [<span style="color: #8b2252;">'Infections'</span>, <span style="color: #8b2252;">'Deces'</span>, <span style="color: #8b2252;">'Guerisons'</span>])
+plt.savefig(matplot_lib_filename)
+matplot_lib_filename
+</pre>
+</div>
+<div class="figure">
+<p><img src="file:///var/folders/87/c7x20gt17rjfzcgh427wbtpr0000gn/T/babel-QilKUh/figuregXPsVe.png" alt="figuregXPsVe.png" />
+</p>
+</div>
+<p>
+We can't see much on this type of plot: for most countries, the
+metrics are on very different scales, and the data covers only one
+month, which is short for epidemic data. Also, the counts are
+cumulative, so the curves only show running totals. Let's look quickly at
+the number of new cases per day for the US.
+</p>
+<div class="org-src-container">
+<pre class="src src-python"># The rows are stored newest first, so sort chronologically before differencing.
+<span style="color: #a0522d;">country_data</span> = country_data.sort_values(<span style="color: #8b2252;">'Date'</span>).copy()
+# New cases per day: difference between consecutive cumulative counts (NaN on the first day).
+country_data[<span style="color: #8b2252;">'NewInfections'</span>] = country_data[<span style="color: #8b2252;">'Infections'</span>].diff()
+country_data.plot(<span style="color: #8b2252;">'Date'</span>, <span style="color: #8b2252;">'NewInfections'</span>)
+plt.savefig(matplot_lib_filename)
+matplot_lib_filename
+</pre>
+</div>
+<div class="figure">
+<p><img src="file:///var/folders/87/c7x20gt17rjfzcgh427wbtpr0000gn/T/babel-QilKUh/figureKiPXSr.png" alt="figureKiPXSr.png" />
+</p>
+</div>
+<p>
+With the rows in chronological order, the number of new cases per day
+in the US declines over February overall, which suggests that the
+epidemic was slowing there during this period.
+</p>
+<p>
+Let's try other visualisations. For instance, we can plot the distribution
+across countries of the mean of each rate metric.
+</p>
+<div class="org-src-container">
+<pre class="src src-python">mean_rates_per_country.hist(rate_columns, bins=20)
+plt.savefig(matplot_lib_filename)
+matplot_lib_filename
+</pre>
+</div>
+<p>
+<img src="file:///var/folders/87/c7x20gt17rjfzcgh427wbtpr0000gn/T/babel-QilKUh/figure6Z7VPf.png" alt="figure6Z7VPf.png" />
+This plot shows that overall, February was not the worst month for the
+world: most countries show a high recovery rate and a low death
+rate, suggesting that medical services were not too
+overwhelmed. TauxInfection is not very meaningful, because it only
+shows the proportion of people neither recovered nor dead.
+</p>
+</div>
+</div>
+</div>
+<div id="postamble" class="status">
+<p class="author">Author: Corentin Ambroise</p>
+<p class="date">Created: 2021-03-03 Mer 15:39</p>
+<p class="validation"><a href="http://validator.w3.org/check?uri=referer">Validate</a></p>
+</div>
+</body>
+</html>
--- a/exo4/exercice_python_fr.org
+++ b/exo4/exercice_python_fr.org
+#+TITLE: Worldwide covid evolution in February 2021
+* Dataset
+We introduce [[https://www.data.gouv.fr/fr/datasets/coronavirus-covid19-evolution-par-pays-et-dans-le-monde-maj-quotidienne/][this dataset]], downloaded on 03/03/2021. We
+chose the per-country daily dataset, which requires some
+preprocessing but allows a more fine-grained statistical analysis.
+#+begin_src python :results value :session :exports both
+import pandas as pd
+data = pd.read_csv('./coronavirus.politologue.com-pays-2021-03-03.csv', skiprows=7, sep=';')
+data.head()
+#+end_src
+#+RESULTS:
+:          Date                 Pays  ...  TauxGuerison  TauxInfection
+: 0  2021-03-03              Andorre  ...         96.27           2.72
+: 1  2021-03-03  Émirats Arabes Unis  ...         96.78           2.90
+: 2  2021-03-03          Afghanistan  ...         88.50           7.11
+: 3  2021-03-03   Antigua-et-Barbuda  ...         39.92          58.26
+: 4  2021-03-03              Albanie  ...         65.40          32.91
+: 
+: [5 rows x 8 columns]
+Let's see how big the data is, and the date range it covers.
+#+begin_src python :results output :session :exports both
+print(data.shape)
+data['Date'] = pd.to_datetime(data['Date'])
+print(min(data['Date']))
+print(max(data['Date']))
+#+end_src
+#+RESULTS:
+: (6293, 8)
+: 2021-02-01 00:00:00
+: 2021-03-03 00:00:00
+It is a fairly small dataset, so the computations should be
+fast. Let's look at the columns.
+#+begin_src python :results output :session :exports both
+print(data.columns)
+#+end_src
+#+RESULTS:
+: Index(['Date', 'Pays', 'Infections', 'Deces', 'Guerisons', 'TauxDeces',
+:        'TauxGuerison', 'TauxInfection'],
+:       dtype='object')
+So we have a multivariate time series for each country, covering
+several daily metrics. Looking at TauxGuerison (recovery rate) or
+TauxDeces (death rate) could give us a sense of the quality of each
+country's medical care. The three rates always sum to roughly 100%:
+#+begin_src python :results output :session :exports both
+rate_columns = data.columns[-3:]
+print(data[rate_columns].sum(1).unique())
+#+end_src
+#+RESULTS:
+: [100.    99.99  99.99 100.01 100.   100.   100.01 100.01  99.99]
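The rate columns appear to express each count as a percentage of the cumulative infections, with TauxInfection as the remainder. This is an assumption inferred from the figures in this document, not a definition quoted from the dataset. A minimal sketch on made-up counts (the `toy` frame and its values are hypothetical):

```python
import pandas as pd

# Made-up counts for two hypothetical countries; not taken from the CSV.
toy = pd.DataFrame({
    'Infections': [2000, 50000],
    'Deces': [500, 600],
    'Guerisons': [1300, 23000],
}, index=['A', 'B'])

# Assumed definitions: each rate is a share of the cumulative infections, in percent.
rates = pd.DataFrame({
    'TauxDeces': 100 * toy['Deces'] / toy['Infections'],
    'TauxGuerison': 100 * toy['Guerisons'] / toy['Infections'],
})
# People neither recovered nor dead make up the remainder.
rates['TauxInfection'] = 100 - rates['TauxDeces'] - rates['TauxGuerison']

print(rates.round(2))
print(rates.sum(axis=1))
```

Under these assumed definitions the three rates sum to 100 by construction, so the 99.99 and 100.01 values printed above would presumably come from each rate being rounded to two decimals in the CSV.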
+* Statistics
+We want to compute statistics over February, per country, so we
+start by aggregating the data per country. First, we compute the
+average of each rate metric for each country.
+#+begin_src python :results value :session :exports both
+count_columns = data.columns[2:-3]
+data_grouped = data.groupby('Pays')
+mean_rates_per_country = data_grouped[rate_columns].mean()
+mean_rates_per_country.head()
+#+end_src
+#+RESULTS:
+:                 TauxDeces  TauxGuerison  TauxInfection
+: Pays                                                  
+: Afghanistan      4.370968     87.556452       8.071935
+: Afrique du Sud   3.217419     93.025806       3.757097
+: Albanie          1.689032     62.334839      35.976774
+: Algérie          2.656774     68.720645      28.625161
+: Allemagne        2.785484     91.093226       6.121290
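The date range printed earlier runs from 2021-02-01 to 2021-03-03, so these averages also include the first three days of March. A minimal sketch of how the rows could be restricted to February before grouping, on made-up rows (the `toy` frame is hypothetical; only the column names follow the dataset):

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    'Date': pd.to_datetime(['2021-02-01', '2021-02-28', '2021-03-01', '2021-03-03']),
    'Pays': ['A', 'A', 'A', 'A'],
    'TauxDeces': [1.0, 3.0, 10.0, 10.0],
})

# Keep only the rows whose date falls in February 2021.
in_february = (toy['Date'] >= '2021-02-01') & (toy['Date'] < '2021-03-01')
february = toy[in_february]

# Per-country mean computed on February rows only.
february_mean = february.groupby('Pays')['TauxDeces'].mean()
print(february_mean)
```

Whether the March rows change the ranking in practice is not something this sketch can tell; it only shows the filtering step.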
+Let's see which countries have the highest death rate over
+the month of February. We expect them to be poorer countries, with
+fewer resources to treat their patients.
+#+begin_src python :results output :session :exports both
+print(mean_rates_per_country.sort_values('TauxDeces', ascending=False).head(10))
+#+end_src
+#+RESULTS:
+#+begin_example
+             TauxDeces  TauxGuerison  TauxInfection
+Pays                                               
+Yémen        28.496452     65.724194       5.779677
+Mexique       8.748710     77.767742      13.484194
+Syrie         6.580968     59.050000      34.369677
+Soudan        6.193226     74.704839      19.101613
+Égypte        5.763226     77.599355      16.636774
+Équateur      5.721935     84.903548       9.375806
+Chine         5.163226     94.075484       0.760645
+Bolivie       4.720968     75.581613      19.697419
+Afghanistan   4.370968     87.556452       8.071935
+Libéria       4.273226     91.757419       3.965806
+#+end_example
+Indeed, some of these countries can be described as poor. Yemen seems
+to be hit extremely hard by the epidemic: on average, about 28% of its
+recorded infections ended in death. Yemen is a very poor country, but
+let's inspect this number, which is very high compared to the other countries.
+#+begin_src python :results value :session :exports both
+data_grouped.mean()[count_columns].loc['Yémen']
+#+end_src
+#+RESULTS:
+: Infections    2178.806452
+: Deces          620.516129
+: Guerisons     1430.709677
+: Name: Yémen, dtype: float64
+Now let's compare with the median across countries for each metric.
+#+begin_src python :results value :session :exports both
+data_grouped.mean()[count_columns].median(0)
+#+end_src
+#+RESULTS:
+: Infections    50333.935484
+: Deces           613.387097
+: Guerisons     23364.000000
+: dtype: float64
+Yemen has about as many deaths as the median country, while
+reporting far fewer infections. This could be due either to a lack of
+testing in the country or to very poor medical care conditions. Both
+are consistent with the growing poverty of the country, aggravated by war.
+* Plotting
+Now we can plot many things. For instance, we can inspect a country of
+interest and see how it evolves over the month of
+February. Let's see how the US was affected.
+#+begin_src python :results file :session :var matplot_lib_filename=(org-babel-temp-file "figure" ".png") :exports both
+import matplotlib.pyplot as plt
+country_data = data[data['Pays'] == 'États-Unis']
+country_data.plot('Date', ['Infections', 'Deces', 'Guerisons'])
+plt.savefig(matplot_lib_filename)
+matplot_lib_filename
+#+end_src
+#+RESULTS:
+[[file:/var/folders/87/c7x20gt17rjfzcgh427wbtpr0000gn/T/babel-QilKUh/figureJY8iHB.png]]
+We can't see much on this type of plot: for most countries, the
+metrics are on very different scales, and the data covers only one
+month, which is short for epidemic data. Also, the counts are
+cumulative, so the curves only show running totals. Let's look quickly at
+the number of new cases per day for the US.
+#+begin_src python :results file :session :var matplot_lib_filename=(org-babel-temp-file "figure" ".png") :exports both
+# The rows are stored newest first, so sort chronologically before differencing.
+country_data = country_data.sort_values('Date').copy()
+# New cases per day: difference between consecutive cumulative counts (NaN on the first day).
+country_data['NewInfections'] = country_data['Infections'].diff()
+country_data.plot('Date', 'NewInfections')
+plt.savefig(matplot_lib_filename)
+matplot_lib_filename
+#+end_src
+#+RESULTS:
+[[file:/var/folders/87/c7x20gt17rjfzcgh427wbtpr0000gn/T/babel-QilKUh/figurevihCAv.png]]
+With the rows in chronological order, the number of new cases per day
+in the US declines over February overall, which suggests that the
+epidemic was slowing there during this period.
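The direction of a new-cases curve depends on the row order used when differencing the cumulative counts, and `data.head()` shows the CSV rows starting at the latest date. A minimal sketch on made-up numbers (the `toy` frame and its values are hypothetical) of how the sign flips when the rows are newest first:

```python
import pandas as pd

# Made-up cumulative counts, stored newest first like the rows shown by data.head().
toy = pd.DataFrame({
    'Date': pd.to_datetime(['2021-02-04', '2021-02-03', '2021-02-02', '2021-02-01']),
    'Infections': [1600, 1300, 1100, 1000],
})

# Differencing in storage order yields negative "new cases".
naive = toy['Infections'].diff()

# Sorting chronologically first yields the actual daily increases.
chronological = toy.sort_values('Date').reset_index(drop=True)
chronological['NewInfections'] = chronological['Infections'].diff()

print(naive.tolist())
print(chronological)
```

The first value is NaN in both cases because there is no previous row to subtract; pandas plotting skips it, so no placeholder value is needed.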
+Let's try other visualisations. For instance, we can plot the distribution
+across countries of the mean of each rate metric.
+#+begin_src python :results file :session :var matplot_lib_filename=(org-babel-temp-file "figure" ".png") :exports both
+mean_rates_per_country.hist(rate_columns, bins=20)
+plt.savefig(matplot_lib_filename)
+matplot_lib_filename
+#+end_src
+#+RESULTS:
+[[file:/var/folders/87/c7x20gt17rjfzcgh427wbtpr0000gn/T/babel-QilKUh/figurefBoBMc.png]]
+This plot shows that overall, February was not the worst month for the
+world: most countries show a high recovery rate and a low death
+rate, suggesting that medical services were not too
+overwhelmed. TauxInfection is not very meaningful, because it only
+shows the proportion of people neither recovered nor dead.
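A reading such as "most countries show a high recovery rate" can also be backed by a couple of summary numbers next to the histogram. A minimal sketch on made-up per-country means (the `toy_rates` frame and the 80% threshold are hypothetical choices for illustration):

```python
import pandas as pd

# Made-up mean rates for four hypothetical countries.
toy_rates = pd.DataFrame({
    'TauxGuerison': [95.0, 90.0, 85.0, 60.0],
    'TauxDeces': [1.0, 2.0, 3.0, 28.0],
}, index=['A', 'B', 'C', 'D'])

# Share of countries whose mean recovery rate exceeds the chosen threshold.
share_high_recovery = (toy_rates['TauxGuerison'] > 80).mean()

# The median is less sensitive than the mean to a single extreme country.
median_death_rate = toy_rates['TauxDeces'].median()

print(share_high_recovery, median_death_rate)
```

The same two lines could be applied to `mean_rates_per_country`, with a threshold chosen to suit the question being asked.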
--- a/exo4/exercice_python_fr.org~
+++ b/exo4/exercice_python_fr.org~
+#+TITLE: Worldwide covid evolution in February 2021
+* Dataset
+We introduce [[https://www.data.gouv.fr/fr/datasets/coronavirus-covid19-evolution-par-pays-et-dans-le-monde-maj-quotidienne/][this dataset]], downloaded on 03/03/2021. We
+chose the per-country daily dataset, which requires some
+preprocessing but allows a more fine-grained statistical analysis.
+#+begin_src python :results value :session :exports both
+import pandas as pd
+data = pd.read_csv('./coronavirus.politologue.com-pays-2021-03-03.csv', skiprows=7, sep=';')
+data.head()
+#+end_src
+#+RESULTS:
+:          Date                 Pays  ...  TauxGuerison  TauxInfection
+: 0  2021-03-03              Andorre  ...         96.27           2.72
+: 1  2021-03-03  Émirats Arabes Unis  ...         96.78           2.90
+: 2  2021-03-03          Afghanistan  ...         88.50           7.11
+: 3  2021-03-03   Antigua-et-Barbuda  ...         39.92          58.26
+: 4  2021-03-03              Albanie  ...         65.40          32.91
+: 
+: [5 rows x 8 columns]
+Let's see how big the data is, and the date range it covers.
+#+begin_src python :results output :session :exports both
+print(data.shape)
+data['Date'] = pd.to_datetime(data['Date'])
+print(min(data['Date']))
+print(max(data['Date']))
+#+end_src
+It is a fairly small dataset, so the computations should be
+fast. Let's look at the columns.
+#+begin_src python :results output :session :exports both
+print(data.columns)
+#+end_src
+#+RESULTS:
+: Index(['Date', 'Pays', 'Infections', 'Deces', 'Guerisons', 'TauxDeces',
+:        'TauxGuerison', 'TauxInfection'],
+:       dtype='object')
+So we have a multivariate time series for each country, covering
+several daily metrics. Looking at TauxGuerison (recovery rate) or
+TauxDeces (death rate) could give us a sense of the quality of each
+country's medical care. The three rates always sum to roughly 100%:
+#+begin_src python :results output :session :exports both
+rate_columns = data.columns[-3:]
+print(data[rate_columns].sum(1).unique())
+#+end_src
+#+RESULTS:
+: [100.    99.99  99.99 100.01 100.   100.   100.01 100.01  99.99]
+* Statistics
+We want to compute statistics over February, per country, so we
+start by aggregating the data per country. First, we compute the
+average of each rate metric for each country.
+#+begin_src python :results value :session :exports both
+count_columns = data.columns[2:-3]
+data_grouped = data.groupby('Pays')
+mean_rates_per_country = data_grouped[rate_columns].mean()
+mean_rates_per_country.head()
+#+end_src
+#+RESULTS:
+:                 TauxDeces  TauxGuerison  TauxInfection
+: Pays                                                  
+: Afghanistan      4.370968     87.556452       8.071935
+: Afrique du Sud   3.217419     93.025806       3.757097
+: Albanie          1.689032     62.334839      35.976774
+: Algérie          2.656774     68.720645      28.625161
+: Allemagne        2.785484     91.093226       6.121290
+Let's see which countries have the highest death rate over
+the month of February. We expect them to be poorer countries, with
+fewer resources to treat their patients.
+#+begin_src python :results output :session :exports both
+print(mean_rates_per_country.sort_values('TauxDeces', ascending=False).head(10))
+#+end_src
+#+RESULTS:
+#+begin_example
+             TauxDeces  TauxGuerison  TauxInfection
+Pays                                               
+Yémen        28.496452     65.724194       5.779677
+Mexique       8.748710     77.767742      13.484194
+Syrie         6.580968     59.050000      34.369677
+Soudan        6.193226     74.704839      19.101613
+Égypte        5.763226     77.599355      16.636774
+Équateur      5.721935     84.903548       9.375806
+Chine         5.163226     94.075484       0.760645
+Bolivie       4.720968     75.581613      19.697419
+Afghanistan   4.370968     87.556452       8.071935
+Libéria       4.273226     91.757419       3.965806
+#+end_example
+Indeed, some of these countries can be described as poor. Yemen seems
+to be hit extremely hard by the epidemic: on average, about 28% of its
+recorded infections ended in death. Yemen is a very poor country, but
+let's inspect this number, which is very high compared to the other countries.
+#+begin_src python :results value :session :exports both
+data_grouped.mean()[count_columns].loc['Yémen']
+#+end_src
+#+RESULTS:
+: Infections    2178.806452
+: Deces          620.516129
+: Guerisons     1430.709677
+: Name: Yémen, dtype: float64
+Now let's compare with the median across countries for each metric.
+#+begin_src python :results value :session :exports both
+data_grouped.mean()[count_columns].median(0)
+#+end_src
+#+RESULTS:
+: Infections    50333.935484
+: Deces           613.387097
+: Guerisons     23364.000000
+: dtype: float64
+Yemen has about as many deaths as the median country, while
+reporting far fewer infections. This could be due either to a lack of
+testing in the country or to very poor medical care conditions. Both
+are consistent with the growing poverty of the country, aggravated by war.
+* Plotting
+Now we can plot many things. For instance, we can inspect a country of
+interest and see how it evolves over the month of
+February. Let's see how the US was affected.
+#+begin_src python :results file :session :var matplot_lib_filename=(org-babel-temp-file "figure" ".png") :exports both
+import matplotlib.pyplot as plt
+us_data = data[data['Pays'] == 'États-Unis']
+us_data.plot('Date', ['Infections', 'Deces', 'Guerisons'])
+plt.savefig(matplot_lib_filename)
+matplot_lib_filename
+#+end_src
+#+RESULTS:
+[[file:/var/folders/87/c7x20gt17rjfzcgh427wbtpr0000gn/T/babel-fXehm0/figurelXQ45J.png]]