Commit 7fbc3cae authored by Corentin Ambroise

exo4

parent 67fa5272
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
<head>
<!-- 2021-03-03 Mer 15:39 -->
<meta http-equiv="Content-Type" content="text/html;charset=utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<title>Worldwide covid evolution in February 2021</title>
<meta name="generator" content="Org mode" />
<meta name="author" content="Corentin Ambroise" />
<style type="text/css">
<!--/*--><![CDATA[/*><!--*/
.title { text-align: center;
margin-bottom: .2em; }
.subtitle { text-align: center;
font-size: medium;
font-weight: bold;
margin-top:0; }
.todo { font-family: monospace; color: red; }
.done { font-family: monospace; color: green; }
.priority { font-family: monospace; color: orange; }
.tag { background-color: #eee; font-family: monospace;
padding: 2px; font-size: 80%; font-weight: normal; }
.timestamp { color: #bebebe; }
.timestamp-kwd { color: #5f9ea0; }
.org-right { margin-left: auto; margin-right: 0px; text-align: right; }
.org-left { margin-left: 0px; margin-right: auto; text-align: left; }
.org-center { margin-left: auto; margin-right: auto; text-align: center; }
.underline { text-decoration: underline; }
#postamble p, #preamble p { font-size: 90%; margin: .2em; }
p.verse { margin-left: 3%; }
pre {
border: 1px solid #ccc;
box-shadow: 3px 3px 3px #eee;
padding: 8pt;
font-family: monospace;
overflow: auto;
margin: 1.2em;
}
pre.src {
position: relative;
overflow: visible;
padding-top: 1.2em;
}
pre.src:before {
display: none;
position: absolute;
background-color: white;
top: -10px;
right: 10px;
padding: 3px;
border: 1px solid black;
}
pre.src:hover:before { display: inline;}
/* Languages per Org manual */
pre.src-asymptote:before { content: 'Asymptote'; }
pre.src-awk:before { content: 'Awk'; }
pre.src-C:before { content: 'C'; }
/* pre.src-C++ doesn't work in CSS */
pre.src-clojure:before { content: 'Clojure'; }
pre.src-css:before { content: 'CSS'; }
pre.src-D:before { content: 'D'; }
pre.src-ditaa:before { content: 'ditaa'; }
pre.src-dot:before { content: 'Graphviz'; }
pre.src-calc:before { content: 'Emacs Calc'; }
pre.src-emacs-lisp:before { content: 'Emacs Lisp'; }
pre.src-fortran:before { content: 'Fortran'; }
pre.src-gnuplot:before { content: 'gnuplot'; }
pre.src-haskell:before { content: 'Haskell'; }
pre.src-hledger:before { content: 'hledger'; }
pre.src-java:before { content: 'Java'; }
pre.src-js:before { content: 'Javascript'; }
pre.src-latex:before { content: 'LaTeX'; }
pre.src-ledger:before { content: 'Ledger'; }
pre.src-lisp:before { content: 'Lisp'; }
pre.src-lilypond:before { content: 'Lilypond'; }
pre.src-lua:before { content: 'Lua'; }
pre.src-matlab:before { content: 'MATLAB'; }
pre.src-mscgen:before { content: 'Mscgen'; }
pre.src-ocaml:before { content: 'Objective Caml'; }
pre.src-octave:before { content: 'Octave'; }
pre.src-org:before { content: 'Org mode'; }
pre.src-oz:before { content: 'OZ'; }
pre.src-plantuml:before { content: 'Plantuml'; }
pre.src-processing:before { content: 'Processing.js'; }
pre.src-python:before { content: 'Python'; }
pre.src-R:before { content: 'R'; }
pre.src-ruby:before { content: 'Ruby'; }
pre.src-sass:before { content: 'Sass'; }
pre.src-scheme:before { content: 'Scheme'; }
pre.src-screen:before { content: 'Gnu Screen'; }
pre.src-sed:before { content: 'Sed'; }
pre.src-sh:before { content: 'shell'; }
pre.src-sql:before { content: 'SQL'; }
pre.src-sqlite:before { content: 'SQLite'; }
/* additional languages in org.el's org-babel-load-languages alist */
pre.src-forth:before { content: 'Forth'; }
pre.src-io:before { content: 'IO'; }
pre.src-J:before { content: 'J'; }
pre.src-makefile:before { content: 'Makefile'; }
pre.src-maxima:before { content: 'Maxima'; }
pre.src-perl:before { content: 'Perl'; }
pre.src-picolisp:before { content: 'Pico Lisp'; }
pre.src-scala:before { content: 'Scala'; }
pre.src-shell:before { content: 'Shell Script'; }
pre.src-ebnf2ps:before { content: 'ebfn2ps'; }
/* additional language identifiers per "defun org-babel-execute"
in ob-*.el */
pre.src-cpp:before { content: 'C++'; }
pre.src-abc:before { content: 'ABC'; }
pre.src-coq:before { content: 'Coq'; }
pre.src-groovy:before { content: 'Groovy'; }
/* additional language identifiers from org-babel-shell-names in
ob-shell.el: ob-shell is the only babel language using a lambda to put
the execution function name together. */
pre.src-bash:before { content: 'bash'; }
pre.src-csh:before { content: 'csh'; }
pre.src-ash:before { content: 'ash'; }
pre.src-dash:before { content: 'dash'; }
pre.src-ksh:before { content: 'ksh'; }
pre.src-mksh:before { content: 'mksh'; }
pre.src-posh:before { content: 'posh'; }
/* Additional Emacs modes also supported by the LaTeX listings package */
pre.src-ada:before { content: 'Ada'; }
pre.src-asm:before { content: 'Assembler'; }
pre.src-caml:before { content: 'Caml'; }
pre.src-delphi:before { content: 'Delphi'; }
pre.src-html:before { content: 'HTML'; }
pre.src-idl:before { content: 'IDL'; }
pre.src-mercury:before { content: 'Mercury'; }
pre.src-metapost:before { content: 'MetaPost'; }
pre.src-modula-2:before { content: 'Modula-2'; }
pre.src-pascal:before { content: 'Pascal'; }
pre.src-ps:before { content: 'PostScript'; }
pre.src-prolog:before { content: 'Prolog'; }
pre.src-simula:before { content: 'Simula'; }
pre.src-tcl:before { content: 'tcl'; }
pre.src-tex:before { content: 'TeX'; }
pre.src-plain-tex:before { content: 'Plain TeX'; }
pre.src-verilog:before { content: 'Verilog'; }
pre.src-vhdl:before { content: 'VHDL'; }
pre.src-xml:before { content: 'XML'; }
pre.src-nxml:before { content: 'XML'; }
/* add a generic configuration mode; LaTeX export needs an additional
(add-to-list 'org-latex-listings-langs '(conf " ")) in .emacs */
pre.src-conf:before { content: 'Configuration File'; }
table { border-collapse:collapse; }
caption.t-above { caption-side: top; }
caption.t-bottom { caption-side: bottom; }
td, th { vertical-align:top; }
th.org-right { text-align: center; }
th.org-left { text-align: center; }
th.org-center { text-align: center; }
td.org-right { text-align: right; }
td.org-left { text-align: left; }
td.org-center { text-align: center; }
dt { font-weight: bold; }
.footpara { display: inline; }
.footdef { margin-bottom: 1em; }
.figure { padding: 1em; }
.figure p { text-align: center; }
.equation-container {
display: table;
text-align: center;
width: 100%;
}
.equation {
vertical-align: middle;
}
.equation-label {
display: table-cell;
text-align: right;
vertical-align: middle;
}
.inlinetask {
padding: 10px;
border: 2px solid gray;
margin: 10px;
background: #ffffcc;
}
#org-div-home-and-up
{ text-align: right; font-size: 70%; white-space: nowrap; }
textarea { overflow-x: auto; }
.linenr { font-size: smaller }
.code-highlighted { background-color: #ffff00; }
.org-info-js_info-navigation { border-style: none; }
#org-info-js_console-label
{ font-size: 10px; font-weight: bold; white-space: nowrap; }
.org-info-js_search-highlight
{ background-color: #ffff00; color: #000000; font-weight: bold; }
.org-svg { width: 90%; }
/*]]>*/-->
</style>
<script type="text/javascript">
/*
@licstart The following is the entire license notice for the
JavaScript code in this tag.
Copyright (C) 2012-2020 Free Software Foundation, Inc.
The JavaScript code in this tag is free software: you can
redistribute it and/or modify it under the terms of the GNU
General Public License (GNU GPL) as published by the Free Software
Foundation, either version 3 of the License, or (at your option)
any later version. The code is distributed WITHOUT ANY WARRANTY;
without even the implied warranty of MERCHANTABILITY or FITNESS
FOR A PARTICULAR PURPOSE. See the GNU GPL for more details.
As additional permission under GNU GPL version 3 section 7, you
may distribute non-source (e.g., minimized or compacted) forms of
that code without the copy of the GNU GPL normally required by
section 4, provided you include this license notice and a URL
through which recipients can access the Corresponding Source.
@licend The above is the entire license notice
for the JavaScript code in this tag.
*/
<!--/*--><![CDATA[/*><!--*/
function CodeHighlightOn(elem, id)
{
var target = document.getElementById(id);
if(null != target) {
elem.cacheClassElem = elem.className;
elem.cacheClassTarget = target.className;
target.className = "code-highlighted";
elem.className = "code-highlighted";
}
}
function CodeHighlightOff(elem, id)
{
var target = document.getElementById(id);
if(elem.cacheClassElem)
elem.className = elem.cacheClassElem;
if(elem.cacheClassTarget)
target.className = elem.cacheClassTarget;
}
/*]]>*///-->
</script>
</head>
<body>
<div id="content">
<h1 class="title">Worldwide covid evolution in February 2021</h1>
<div id="table-of-contents">
<h2>Table of Contents</h2>
<div id="text-table-of-contents">
<ul>
<li><a href="#org3e6a389">1. Dataset</a></li>
<li><a href="#orgcf424dd">2. Statistics</a></li>
<li><a href="#org9103eb8">3. Plotting</a></li>
</ul>
</div>
</div>
<div id="outline-container-org3e6a389" class="outline-2">
<h2 id="org3e6a389"><span class="section-number-2">1</span> Dataset</h2>
<div class="outline-text-2" id="text-1">
<p>
We introduce here <a href="https://www.data.gouv.fr/fr/datasets/coronavirus-covid19-evolution-par-pays-et-dans-le-monde-maj-quotidienne/">this dataset</a>, downloaded on 03/03/2021. We
chose the per-country daily dataset, which requires some
preprocessing but allows a more fine-grained statistical analysis.
</p>
<div class="org-src-container">
<pre class="src src-python"><span style="color: #a020f0;">import</span> pandas <span style="color: #a020f0;">as</span> pd
<span style="color: #a0522d;">data</span> = pd.read_csv(<span style="color: #8b2252;">'./coronavirus.politologue.com-pays-2021-03-03.csv'</span>, skiprows=7, sep=<span style="color: #8b2252;">';'</span>)
data.head()
</pre>
</div>
<pre class="example">
Date Pays ... TauxGuerison TauxInfection
0 2021-03-03 Andorre ... 96.27 2.72
1 2021-03-03 Émirats Arabes Unis ... 96.78 2.90
2 2021-03-03 Afghanistan ... 88.50 7.11
3 2021-03-03 Antigua-et-Barbuda ... 39.92 58.26
4 2021-03-03 Albanie ... 65.40 32.91
[5 rows x 8 columns]
</pre>
<p>
Let's see how big the data is, and the date range it covers.
</p>
<div class="org-src-container">
<pre class="src src-python"><span style="color: #a020f0;">print</span>(data.shape)
<span style="color: #a0522d;">data</span>[<span style="color: #8b2252;">'Date'</span>] = pd.to_datetime(data[<span style="color: #8b2252;">'Date'</span>])
<span style="color: #a020f0;">print</span>(<span style="color: #483d8b;">min</span>(data[<span style="color: #8b2252;">'Date'</span>]))
<span style="color: #a020f0;">print</span>(<span style="color: #483d8b;">max</span>(data[<span style="color: #8b2252;">'Date'</span>]))
</pre>
</div>
<pre class="example">
(6293, 8)
2021-02-01 00:00:00
2021-03-03 00:00:00
</pre>
<p>
It is a fairly small dataset, so the computations should be
fast. Let's look at the columns.
</p>
<div class="org-src-container">
<pre class="src src-python"><span style="color: #a020f0;">print</span>(data.columns)
</pre>
</div>
<pre class="example">
Index(['Date', 'Pays', 'Infections', 'Deces', 'Guerisons', 'TauxDeces',
'TauxGuerison', 'TauxInfection'],
dtype='object')
</pre>
<p>
Interesting. We have a multivariate time series for each country,
covering several daily metrics. Looking at TauxGuerison (recovery rate) or
TauxDeces (death rate) could give us a sense of the quality of each country's
medical care. The three rates always sum to roughly 100%:
</p>
<div class="org-src-container">
<pre class="src src-python"><span style="color: #a0522d;">rate_columns</span> = data.columns[-3:]
<span style="color: #a020f0;">print</span>(data[rate_columns].<span style="color: #483d8b;">sum</span>(1).unique())
</pre>
</div>
<pre class="example">
[100. 99.99 99.99 100.01 100. 100. 100.01 100.01 99.99]
</pre>
</div>
</div>
<div id="outline-container-orgcf424dd" class="outline-2">
<h2 id="orgcf424dd"><span class="section-number-2">2</span> Statistics</h2>
<div class="outline-text-2" id="text-2">
<p>
We want to compute statistics over February, per country, so we can
start by aggregating the data per country. First, we compute each
country's average value for each of the rate metrics.
</p>
<div class="org-src-container">
<pre class="src src-python"><span style="color: #a0522d;">count_columns</span> = data.columns[2:-3]
<span style="color: #a0522d;">data_grouped</span> = data.groupby(<span style="color: #8b2252;">'Pays'</span>)
<span style="color: #a0522d;">mean_rates_per_country</span> = data_grouped[rate_columns].mean()
mean_rates_per_country.head()
</pre>
</div>
<pre class="example">
TauxDeces TauxGuerison TauxInfection
Pays
Afghanistan 4.370968 87.556452 8.071935
Afrique du Sud 3.217419 93.025806 3.757097
Albanie 1.689032 62.334839 35.976774
Algérie 2.656774 68.720645 28.625161
Allemagne 2.785484 91.093226 6.121290
</pre>
<p>
Let's see which countries have the highest death rate over
the month of February. We expect them to be poor countries, with
fewer means to treat their patients.
</p>
<div class="org-src-container">
<pre class="src src-python"><span style="color: #a020f0;">print</span>(mean_rates_per_country.sort_values(<span style="color: #8b2252;">'TauxDeces'</span>, ascending=<span style="color: #008b8b;">False</span>).head(10))
</pre>
</div>
<pre class="example">
TauxDeces TauxGuerison TauxInfection
Pays
Yémen 28.496452 65.724194 5.779677
Mexique 8.748710 77.767742 13.484194
Syrie 6.580968 59.050000 34.369677
Soudan 6.193226 74.704839 19.101613
Égypte 5.763226 77.599355 16.636774
Équateur 5.721935 84.903548 9.375806
Chine 5.163226 94.075484 0.760645
Bolivie 4.720968 75.581613 19.697419
Afghanistan 4.370968 87.556452 8.071935
Libéria 4.273226 91.757419 3.965806
</pre>
<p>
Indeed, some of these countries can be described as poor. Yemen seems
extremely hard hit by the epidemic: its death rate averages about 28% over the
period, so more than a quarter of its recorded infections ended in death. Yemen is a very poor country, but let's inspect this number, which seems very
high compared to the other countries.
</p>
<div class="org-src-container">
<pre class="src src-python">data_grouped.mean()[count_columns].loc[<span style="color: #8b2252;">'Y&#233;men'</span>]
</pre>
</div>
<pre class="example">
Infections 2178.806452
Deces 620.516129
Guerisons 1430.709677
Name: Yémen, dtype: float64
</pre>
<p>
Now let's compare these counts with the median across countries for each metric.
</p>
<div class="org-src-container">
<pre class="src src-python">data_grouped.mean()[count_columns].median(0)
</pre>
</div>
<pre class="example">
Infections 50333.935484
Deces 613.387097
Guerisons 23364.000000
dtype: float64
</pre>
<p>
Yemen has about as many deaths as the median country, while
having far fewer recorded infections. This can be due either to a lack of
testing in the country or to very poor medical care conditions. Either
explanation is consistent with the growing poverty of the country, aggravated by war.
</p>
</div>
</div>
<div id="outline-container-org9103eb8" class="outline-2">
<h2 id="org9103eb8"><span class="section-number-2">3</span> Plotting</h2>
<div class="outline-text-2" id="text-3">
<p>
Now we can plot many things. We can, for instance, inspect a country of
interest and see how it behaves over the month of
February. Let's see how the US was impacted.
</p>
<div class="org-src-container">
<pre class="src src-python"><span style="color: #a020f0;">import</span> matplotlib.pyplot <span style="color: #a020f0;">as</span> plt
<span style="color: #a0522d;">country_data</span> = data[data[<span style="color: #8b2252;">'Pays'</span>] == <span style="color: #8b2252;">'&#201;tats-Unis'</span>]
country_data.plot(<span style="color: #8b2252;">'Date'</span>, [<span style="color: #8b2252;">'Infections'</span>, <span style="color: #8b2252;">'Deces'</span>, <span style="color: #8b2252;">'Guerisons'</span>])
plt.savefig(matplot_lib_filename)
matplot_lib_filename
</pre>
</div>
<div class="figure">
<p><img src="file:///var/folders/87/c7x20gt17rjfzcgh427wbtpr0000gn/T/babel-QilKUh/figuregXPsVe.png" alt="figuregXPsVe.png" />
</p>
</div>
<p>
We can't see much on this type of plot because, for most countries, the
metrics are on different scales, and this data only covers
one month, which is short for epidemic data. Also, this plot
only shows the evolution of the total number of infected people. Let's look quickly at
the number of new cases per day for the US. (We add the
</p>
<div class="org-src-container">
<pre class="src src-python"><span style="color: #a020f0;">import</span> numpy <span style="color: #a020f0;">as</span> np
<span style="color: #a0522d;">country_data</span>[<span style="color: #8b2252;">'NewInfections'</span>] = np.array([0] + (country_data[<span style="color: #8b2252;">'Infections'</span>].values[1:] - country_data[<span style="color: #8b2252;">'Infections'</span>].values[:-1]).tolist()) + 117903
country_data.plot(<span style="color: #8b2252;">'Date'</span>, <span style="color: #8b2252;">'NewInfections'</span>)
plt.savefig(matplot_lib_filename)
matplot_lib_filename
</pre>
</div>
<div class="figure">
<p><img src="file:///var/folders/87/c7x20gt17rjfzcgh427wbtpr0000gn/T/babel-QilKUh/figureKiPXSr.png" alt="figureKiPXSr.png" />
</p>
</div>
<p>
This shows that the number of new cases has grown every day during
February in the US, which indicates that the epidemic is not slowing
there.
</p>
<p>
Let's try other visualisations. We can, for instance, plot the distribution across countries of the mean value of each rate
metric.
</p>
<div class="org-src-container">
<pre class="src src-python">mean_rates_per_country.hist(rate_columns, bins=20)
plt.savefig(matplot_lib_filename)
matplot_lib_filename
</pre>
</div>
<p>
<img src="file:///var/folders/87/c7x20gt17rjfzcgh427wbtpr0000gn/T/babel-QilKUh/figure6Z7VPf.png" alt="figure6Z7VPf.png" />
This plot shows that, overall, February was not the worst month for the
world: most countries show a high recovery rate and a low death
rate, suggesting that medical services were not too
overwhelmed. TauxInfection is not very meaningful, because it only
shows the proportion of people who have neither recovered nor died.
</p>
</div>
</div>
</div>
<div id="postamble" class="status">
<p class="author">Author: Corentin Ambroise</p>
<p class="date">Created: 2021-03-03 Mer 15:39</p>
<p class="validation"><a href="http://validator.w3.org/check?uri=referer">Validate</a></p>
</div>
</body>
</html>
#+TITLE: Worldwide covid evolution in February 2021
* Dataset
We introduce here [[https://www.data.gouv.fr/fr/datasets/coronavirus-covid19-evolution-par-pays-et-dans-le-monde-maj-quotidienne/][this dataset]], downloaded on 03/03/2021. We
chose the per-country daily dataset, which requires some
preprocessing but allows a more fine-grained statistical analysis.
#+begin_src python :results value :session :exports both
import pandas as pd
data = pd.read_csv('./coronavirus.politologue.com-pays-2021-03-03.csv', skiprows=7, sep=';')
data.head()
#+end_src
#+RESULTS:
: Date Pays ... TauxGuerison TauxInfection
: 0 2021-03-03 Andorre ... 96.27 2.72
: 1 2021-03-03 Émirats Arabes Unis ... 96.78 2.90
: 2 2021-03-03 Afghanistan ... 88.50 7.11
: 3 2021-03-03 Antigua-et-Barbuda ... 39.92 58.26
: 4 2021-03-03 Albanie ... 65.40 32.91
:
: [5 rows x 8 columns]
Let's see how big the data is, and the date range it covers.
#+begin_src python :results output :session :exports both
print(data.shape)
data['Date'] = pd.to_datetime(data['Date'])
print(min(data['Date']))
print(max(data['Date']))
#+end_src
#+RESULTS:
: (6293, 8)
: 2021-02-01 00:00:00
: 2021-03-03 00:00:00
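As a quick sanity check on these numbers, the sketch below counts the days in the covered range (which runs three days into March) and divides the row count by it. It assumes, without checking against the file, that every country has exactly one row per day; under that assumption the division should come out as a whole number of countries.
#+begin_src python :results output :exports both
import pandas as pd

# Values copied from the output above.
n_rows = 6293
first_day = pd.Timestamp("2021-02-01")
last_day = pd.Timestamp("2021-03-03")

# Inclusive number of calendar days in the covered range.
n_days = len(pd.date_range(first_day, last_day, freq="D"))

# Assumption: one row per country per day.
rows_per_day = n_rows / n_days
print(n_days, rows_per_day)
#+end_src
On the real data, something like =data.groupby('Pays').size()= would show whether the one-row-per-day assumption actually holds.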
It is a fairly small dataset, so the computations should be
fast. Let's look at the columns.
#+begin_src python :results output :session :exports both
print(data.columns)
#+end_src
#+RESULTS:
: Index(['Date', 'Pays', 'Infections', 'Deces', 'Guerisons', 'TauxDeces',
: 'TauxGuerison', 'TauxInfection'],
: dtype='object')
Interesting. We have a multivariate time series for each country,
covering several daily metrics. Looking at TauxGuerison (recovery rate) or
TauxDeces (death rate) could give us a sense of the quality of each country's
medical care. The three rates always sum to roughly 100%:
#+begin_src python :results output :session :exports both
rate_columns = data.columns[-3:]
print(data[rate_columns].sum(1).unique())
#+end_src
#+RESULTS:
: [100. 99.99 99.99 100.01 100. 100. 100.01 100.01 99.99]
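The small deviations from 100 look like rounding, since each rate appears to be stored with two decimals. A minimal sketch of a tolerance-based check follows; the rows are made up for illustration and are not taken from the dataset, and the tolerance of 0.02 is an assumption chosen to cover three rounding errors of at most 0.005 each.
#+begin_src python :results output :exports both
import numpy as np
import pandas as pd

# Hypothetical rows, invented for illustration (not taken from the dataset).
toy_rates = pd.DataFrame({
    "TauxDeces": [1.00, 2.50, 0.33],
    "TauxGuerison": [96.27, 90.00, 33.33],
    "TauxInfection": [2.72, 7.50, 66.33],
})

# Sum the three rates row by row.
row_sums = toy_rates.sum(axis=1)

# Each rate is rounded to two decimals, so allow a small absolute tolerance.
all_close = bool(np.isclose(row_sums, 100.0, atol=0.02).all())
print(all_close)
#+end_src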
* Statistics
We want to compute statistics over February, per country, so we can
start by aggregating the data per country. First, we compute each
country's average value for each of the rate metrics.
#+begin_src python :results value :session :exports both
count_columns = data.columns[2:-3]
data_grouped = data.groupby('Pays')
mean_rates_per_country = data_grouped[rate_columns].mean()
mean_rates_per_country.head()
#+end_src
#+RESULTS:
: TauxDeces TauxGuerison TauxInfection
: Pays
: Afghanistan 4.370968 87.556452 8.071935
: Afrique du Sud 3.217419 93.025806 3.757097
: Albanie 1.689032 62.334839 35.976774
: Algérie 2.656774 68.720645 28.625161
: Allemagne 2.785484 91.093226 6.121290
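To make the aggregation step concrete, here is a small sketch of the same groupby-then-mean pattern on a toy table. The country labels and values are hypothetical; only the column names are borrowed from the dataset.
#+begin_src python :results output :exports both
import pandas as pd

# Hypothetical two-country, two-day table (values invented for illustration).
toy = pd.DataFrame({
    "Pays": ["A", "A", "B", "B"],
    "TauxDeces": [2.0, 4.0, 1.0, 1.0],
    "TauxGuerison": [90.0, 92.0, 50.0, 60.0],
})

# One row per country, each cell being the mean over that country's days.
toy_means = toy.groupby("Pays")[["TauxDeces", "TauxGuerison"]].mean()
print(toy_means)
#+end_src
Note that this averages the daily rates directly, which is not the same thing as recomputing a rate from averaged counts.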
Let's see which countries have the highest death rate over
the month of February. We expect them to be poor countries, with
fewer means to treat their patients.
#+begin_src python :results output :session :exports both
print(mean_rates_per_country.sort_values('TauxDeces', ascending=False).head(10))
#+end_src
#+RESULTS:
#+begin_example
TauxDeces TauxGuerison TauxInfection
Pays
Yémen 28.496452 65.724194 5.779677
Mexique 8.748710 77.767742 13.484194
Syrie 6.580968 59.050000 34.369677
Soudan 6.193226 74.704839 19.101613
Égypte 5.763226 77.599355 16.636774
Équateur 5.721935 84.903548 9.375806
Chine 5.163226 94.075484 0.760645
Bolivie 4.720968 75.581613 19.697419
Afghanistan 4.370968 87.556452 8.071935
Libéria 4.273226 91.757419 3.965806
#+end_example
Indeed, some of these countries can be described as poor. Yemen seems
extremely hard hit by the epidemic: its death rate averages about 28% over the
period, so more than a quarter of its recorded infections ended in death. Yemen is a very poor country, but let's inspect this number, which seems very
high compared to the other countries.
#+begin_src python :results value :session :exports both
data_grouped.mean()[count_columns].loc['Yémen']
#+End_src
#+RESULTS:
: Infections 2178.806452
: Deces 620.516129
: Guerisons 1430.709677
: Name: Yémen, dtype: float64
Now let's compare these counts with the median across countries for each metric.
#+begin_src python :results value :session :exports both
data_grouped.mean()[count_columns].median(0)
#+end_src
#+RESULTS:
: Infections 50333.935484
: Deces 613.387097
: Guerisons 23364.000000
: dtype: float64
Yemen has about as many deaths as the median country, while
having far fewer recorded infections. This can be due either to a lack of
testing in the country or to very poor medical care conditions. Either
explanation is consistent with the growing poverty of the country, aggravated by war.
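The counts can also be tied back to the rate table. Assuming TauxDeces is the ratio Deces / Infections (a guess about how the source computes it, not something the dataset documents here), dividing Yemen's mean counts should land close to its mean rates reported earlier (about 28.50, 65.72 and 5.78). The match can only be approximate, because a ratio of means differs from a mean of daily ratios.
#+begin_src python :results output :exports both
# Mean counts for Yemen, copied from the output above.
infections = 2178.806452
deaths = 620.516129
recoveries = 1430.709677

# Assumed definitions: each rate is a share of the recorded infections.
death_share = 100 * deaths / infections
recovery_share = 100 * recoveries / infections
active_share = 100 - death_share - recovery_share

print(round(death_share, 2), round(recovery_share, 2), round(active_share, 2))
#+end_src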
* Plotting
Now we can plot many things. We can, for instance, inspect a country of
interest and see how it behaves over the month of
February. Let's see how the US was impacted.
#+begin_src python :results file :session :var matplot_lib_filename=(org-babel-temp-file "figure" ".png") :exports both
import matplotlib.pyplot as plt
country_data = data[data['Pays'] == 'États-Unis']
country_data.plot('Date', ['Infections', 'Deces', 'Guerisons'])
plt.savefig(matplot_lib_filename)
matplot_lib_filename
#+end_src
#+RESULTS:
[[file:/var/folders/87/c7x20gt17rjfzcgh427wbtpr0000gn/T/babel-QilKUh/figureJY8iHB.png]]
We can't see much on this type of plot because, for most countries, the
metrics are on different scales, and this data only covers
one month, which is short for epidemic data. Also, this plot
only shows the evolution of the total number of infected people. Let's look quickly at
the number of new cases per day for the US. (We add the
#+begin_src python :results file :session :var matplot_lib_filename=(org-babel-temp-file "figure" ".png") :exports both
import numpy as np
country_data['NewInfections'] = np.array([0] + (country_data['Infections'].values[1:] - country_data['Infections'].values[:-1]).tolist()) + 117903
country_data.plot('Date', 'NewInfections')
plt.savefig(matplot_lib_filename)
matplot_lib_filename
#+end_src
#+RESULTS:
[[file:/var/folders/87/c7x20gt17rjfzcgh427wbtpr0000gn/T/babel-QilKUh/figurevihCAv.png]]
This shows that the number of new cases has grown every day during
February in the US, which indicates that the epidemic is not slowing
there.
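Another way to get daily new cases is pandas' =Series.diff=. The sketch below uses made-up numbers, not the dataset. It sorts by date first: the rows shown by =data.head()= carry the latest date, which suggests (as an assumption) that the file is ordered newest first, and differencing rows in that order would flip the sign of the result. The first day has no previous value, so it comes out as NaN.
#+begin_src python :results output :exports both
import pandas as pd

# Hypothetical cumulative counts, listed newest first (values invented).
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2021-02-04", "2021-02-03", "2021-02-02", "2021-02-01"]),
    "Infections": [1600, 1350, 1150, 1000],
})

# Sort oldest first so that each difference is "today minus yesterday".
toy_sorted = toy.sort_values("Date").reset_index(drop=True)

# diff() leaves NaN on the first row, which has no previous day.
toy_sorted["NewInfections"] = toy_sorted["Infections"].diff()
print(toy_sorted)
#+end_src
With the rows in chronological order, a rising NewInfections curve means more new cases each day, and a falling one means fewer.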
Let's try other visualisations. We can, for instance, plot the distribution across countries of the mean value of each rate
metric.
#+begin_src python :results file :session :var matplot_lib_filename=(org-babel-temp-file "figure" ".png") :exports both
mean_rates_per_country.hist(rate_columns, bins=20)
plt.savefig(matplot_lib_filename)
matplot_lib_filename
#+end_src
#+RESULTS:
[[file:/var/folders/87/c7x20gt17rjfzcgh427wbtpr0000gn/T/babel-QilKUh/figurefBoBMc.png]]
This plot shows that, overall, February was not the worst month for the
world: most countries show a high recovery rate and a low death
rate, suggesting that medical services were not too
overwhelmed. TauxInfection is not very meaningful, because it only
shows the proportion of people who have neither recovered nor died.