Commit 7fbc3cae authored by Corentin Ambroise

exo4

parent 67fa5272
#+TITLE: Worldwide covid evolution in February 2021
* Dataset
We introduce here [[https://www.data.gouv.fr/fr/datasets/coronavirus-covid19-evolution-par-pays-et-dans-le-monde-maj-quotidienne/][this dataset]], downloaded on 03/03/2021. We
chose to study the per-country daily dataset, which requires some
preprocessing work but allows a more fine-grained statistical analysis.
#+begin_src python :results value :session :exports both
import pandas as pd
data = pd.read_csv('./coronavirus.politologue.com-pays-2021-03-03.csv', skiprows=7, sep=';')
data.head()
#+end_src
#+RESULTS:
: Date Pays ... TauxGuerison TauxInfection
: 0 2021-03-03 Andorre ... 96.27 2.72
: 1 2021-03-03 Émirats Arabes Unis ... 96.78 2.90
: 2 2021-03-03 Afghanistan ... 88.50 7.11
: 3 2021-03-03 Antigua-et-Barbuda ... 39.92 58.26
: 4 2021-03-03 Albanie ... 65.40 32.91
:
: [5 rows x 8 columns]
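As a minimal sketch of what =skiprows=7= and =sep=';'= do, the snippet
below parses a small made-up text whose layout is only assumed to
resemble the real file: seven preamble lines followed by a
semicolon-separated table. The two data rows reuse values shown
above; the preamble content and the reduced column set are hypothetical.

```python
import io
import pandas as pd

# Hypothetical file content: 7 preamble lines, then a ';'-separated table.
preamble = "".join(f"# preamble line {i}\n" for i in range(1, 8))
table = (
    "Date;Pays;TauxGuerison;TauxInfection\n"
    "2021-03-03;Andorre;96.27;2.72\n"
    "2021-03-03;Albanie;65.40;32.91\n"
)

# skiprows=7 drops the preamble, so the 8th line becomes the header row.
sample = pd.read_csv(io.StringIO(preamble + table), skiprows=7, sep=';')
print(sample)
```

Without =skiprows=, pandas would try to use the first preamble line as
the header, and without =sep=';'= each row would be read as a single column.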
Let's see how big the data is, and the date range it covers.
#+begin_src python :results output :session :exports both
print(data.shape)
data['Date'] = pd.to_datetime(data['Date'])
print(min(data['Date']))
print(max(data['Date']))
#+end_src
#+RESULTS:
: (6293, 8)
: 2021-02-01 00:00:00
: 2021-03-03 00:00:00
It is a fairly small dataset, covering 1 February to 3 March 2021, so
the computations should be fast. Let's look at the columns.
#+begin_src python :results output :session :exports both
print(data.columns)
#+end_src
#+RESULTS:
: Index(['Date', 'Pays', 'Infections', 'Deces', 'Guerisons', 'TauxDeces',
: 'TauxGuerison', 'TauxInfection'],
: dtype='object')
Interesting. So we have a multivariate time series for each country,
with several daily metrics. Looking at TauxGuerison or
TauxDeces could give us a rough sense of the quality of each country's
medical care. The three rates always sum to roughly 100%:
#+begin_src python :results output :session :exports both
rate_columns = data.columns[-3:]
print(data[rate_columns].sum(1).unique())
#+end_src
#+RESULTS:
: [100. 99.99 99.99 100.01 100. 100. 100.01 100.01 99.99]
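One plausible reading of this result, stated here as an assumption
rather than as the dataset's documented definition, is that each rate
is a percentage of the cumulative number of infections, rounded to two
decimals. The sketch below rebuilds the three rates from made-up
counts for two hypothetical countries and checks that they sum to 100
up to rounding.

```python
import pandas as pd

# Made-up cumulative counts for two hypothetical countries.
toy = pd.DataFrame({
    'Pays': ['A', 'B'],
    'Infections': [1000, 400],
    'Deces': [30, 8],
    'Guerisons': [900, 300],
})

# Assumed definitions: each rate is a percentage of cumulative infections.
toy['TauxDeces'] = (100 * toy['Deces'] / toy['Infections']).round(2)
toy['TauxGuerison'] = (100 * toy['Guerisons'] / toy['Infections']).round(2)

# Active cases are those that are neither dead nor recovered.
active = toy['Infections'] - toy['Deces'] - toy['Guerisons']
toy['TauxInfection'] = (100 * active / toy['Infections']).round(2)

rate_sum = toy[['TauxDeces', 'TauxGuerison', 'TauxInfection']].sum(axis=1)
print(rate_sum)
```

Under this assumption, sums such as 99.99 or 100.01 would simply come
from rounding each rate separately.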
* Statistics
We want to compute statistics over February, per country, so we
start by aggregating the data per country. First, we compute the
average of each rate for each country.
#+begin_src python :results value :session :exports both
count_columns = data.columns[2:-3]
data_grouped = data.groupby('Pays')
mean_rates_per_country = data_grouped[rate_columns].mean()
mean_rates_per_country.head()
#+end_src
#+RESULTS:
: TauxDeces TauxGuerison TauxInfection
: Pays
: Afghanistan 4.370968 87.556452 8.071935
: Afrique du Sud 3.217419 93.025806 3.757097
: Albanie 1.689032 62.334839 35.976774
: Algérie 2.656774 68.720645 28.625161
: Allemagne 2.785484 91.093226 6.121290
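To make the aggregation step concrete, here is a small sketch on
made-up data with two hypothetical countries and two days. It selects
the rate columns by name, which is an illustrative alternative to
slicing =data.columns= by position, and then averages them per country.

```python
import pandas as pd

# Made-up daily rates for two hypothetical countries over two days.
toy = pd.DataFrame({
    'Date': pd.to_datetime(['2021-02-01', '2021-02-02',
                            '2021-02-01', '2021-02-02']),
    'Pays': ['A', 'A', 'B', 'B'],
    'TauxDeces': [2.0, 4.0, 1.0, 1.0],
    'TauxGuerison': [90.0, 92.0, 80.0, 82.0],
})

# Group the rows by country, keep only the rate columns, average over days.
rate_cols = ['TauxDeces', 'TauxGuerison']
toy_means = toy.groupby('Pays')[rate_cols].mean()
print(toy_means)
```

Each row of the result is one country, and each value is the mean of
that country's daily rates over the period.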
Let's see which countries have the highest death rate over
the month of February. We expect them to be poor countries, which
have fewer means to treat their patients.
#+begin_src python :results output :session :exports both
print(mean_rates_per_country.sort_values('TauxDeces', ascending=False).head(10))
#+end_src
#+RESULTS:
#+begin_example
TauxDeces TauxGuerison TauxInfection
Pays
Yémen 28.496452 65.724194 5.779677
Mexique 8.748710 77.767742 13.484194
Syrie 6.580968 59.050000 34.369677
Soudan 6.193226 74.704839 19.101613
Égypte 5.763226 77.599355 16.636774
Équateur 5.721935 84.903548 9.375806
Chine 5.163226 94.075484 0.760645
Bolivie 4.720968 75.581613 19.697419
Afghanistan 4.370968 87.556452 8.071935
Libéria 4.273226 91.757419 3.965806
#+end_example
Indeed, some of these countries can be described as poor. Yemen seems
extremely hard hit by the epidemic: on average over the month, close to
30% of its recorded infected people had died. Yemen is a very poor
country, but let's inspect this number, which is very
high compared with the other countries.
#+begin_src python :results value :session :exports both
data_grouped[count_columns].mean().loc['Yémen']
#+end_src
#+RESULTS:
: Infections 2178.806452
: Deces 620.516129
: Guerisons 1430.709677
: Name: Yémen, dtype: float64
Now let's compare with the median country for each metric.
#+begin_src python :results value :session :exports both
data_grouped[count_columns].mean().median(0)
#+end_src
#+RESULTS:
: Infections 50333.935484
: Deces 613.387097
: Guerisons 23364.000000
: dtype: float64
Yemen has about as many deaths as the median country, while
having far fewer recorded infections. This can be due either to a lack of
testing in the country or to very poor medical care conditions. It
is consistent with the growing poverty of the country, aggravated by war.
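The comparison can be made explicit with a little arithmetic on the
averages printed above. This is only a rough sketch: the median number
of deaths and the median number of infections need not come from the
same country, so the second ratio is indicative at best.

```python
# Monthly averages copied from the two results above.
yemen = {'Infections': 2178.806452, 'Deces': 620.516129}
median = {'Infections': 50333.935484, 'Deces': 613.387097}

# Deaths as a percentage of recorded infections.
yemen_ratio = 100 * yemen['Deces'] / yemen['Infections']
median_ratio = 100 * median['Deces'] / median['Infections']

print(f"Yemen: {yemen_ratio:.1f}%  median country: {median_ratio:.1f}%")
```

The gap is more than an order of magnitude, which is what one would
expect if only the most severe cases are tested and recorded.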
* Plotting
Now we can plot many things. For instance, we can inspect a country of
interest and see how it behaves over the month of
February. Let's see how the US was impacted.
#+begin_src python :results file :session :var matplot_lib_filename=(org-babel-temp-file "figure" ".png") :exports both
import matplotlib.pyplot as plt
country_data = data[data['Pays'] == 'États-Unis']
country_data.plot('Date', ['Infections', 'Deces', 'Guerisons'])
plt.savefig(matplot_lib_filename)
matplot_lib_filename
#+end_src
#+RESULTS:
[[file:/var/folders/87/c7x20gt17rjfzcgh427wbtpr0000gn/T/babel-QilKUh/figureJY8iHB.png]]
We can't see much on this type of plot because, for most countries,
the metrics are on different scales, and the data only covers one
month, which is short for epidemic data. Also, these columns are
cumulative counts, so the curves can only go up. Let's look quickly at
the number of new cases per day for the US. The file lists the most
recent date first, so we sort the rows chronologically before taking
the difference between consecutive days.
#+begin_src python :results file :session :var matplot_lib_filename=(org-babel-temp-file "figure" ".png") :exports both
# Rows are listed newest first: sort by date, and copy to avoid
# modifying a slice of the original dataframe.
country_data = country_data.sort_values('Date').copy()
# Daily new cases = difference between consecutive cumulative counts.
country_data['NewInfections'] = country_data['Infections'].diff()
country_data.plot('Date', 'NewInfections')
plt.savefig(matplot_lib_filename)
matplot_lib_filename
#+end_src
#+RESULTS:
[[file:/var/folders/87/c7x20gt17rjfzcgh427wbtpr0000gn/T/babel-QilKUh/figurevihCAv.png]]
With the rows in chronological order, a rising curve means that the
number of new cases per day is growing and the epidemic is not slowing
in the US; a falling curve means the opposite. The first day has no
previous value, so its difference is left undefined.
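One thing to keep in mind when differencing cumulative counts is the
order of the rows: the output of =data.head()= shows the most recent
date first. The sketch below uses made-up numbers to show how the
sign of the daily differences depends on that order.

```python
import pandas as pd

# Made-up cumulative counts, listed newest first like the source file.
toy = pd.DataFrame({
    'Date': pd.to_datetime(['2021-02-04', '2021-02-03',
                            '2021-02-02', '2021-02-01']),
    'Infections': [160, 150, 130, 100],
})

# Differencing in file order subtracts a later day from an earlier one.
in_file_order = toy['Infections'].diff()

# Sorting chronologically first gives the actual new cases per day.
chronological = toy.sort_values('Date')['Infections'].diff()

print(in_file_order.tolist())
print(chronological.tolist())
```

In file order the differences come out negative, so any offset added
to make them positive would also flip the direction of the curve.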
Let's try other visualisations. For instance, we can plot the
distribution across countries of the mean of each rate.
#+begin_src python :results file :session :var matplot_lib_filename=(org-babel-temp-file "figure" ".png") :exports both
mean_rates_per_country.hist(rate_columns, bins=20)
plt.savefig(matplot_lib_filename)
matplot_lib_filename
#+end_src
#+RESULTS:
[[file:/var/folders/87/c7x20gt17rjfzcgh427wbtpr0000gn/T/babel-QilKUh/figurefBoBMc.png]]
This plot shows that, overall, February was not the worst month for the
world: most countries show a high recovery rate and a low death
rate, suggesting that the medical services were not too
overwhelmed. TauxInfection is not very meaningful, because it only
shows the proportion of people who have neither recovered nor died.