#+TITLE: Worldwide covid evolution in February 2021

* Dataset

We introduce here [[https://www.data.gouv.fr/fr/datasets/coronavirus-covid19-evolution-par-pays-et-dans-le-monde-maj-quotidienne/][this dataset]], retrieved on 03/03/2021. We chose the per-country daily dataset, which requires some preprocessing but allows a more fine-grained statistical analysis.

#+begin_src python :results value :session :exports both
import pandas as pd
# Skip the first 7 lines of the file; fields are separated by ';'
data = pd.read_csv('./coronavirus.politologue.com-pays-2021-03-03.csv', skiprows=7, sep=';')
data.head()
#+end_src

#+RESULTS:
:          Date                 Pays  ...  TauxGuerison  TauxInfection
: 0  2021-03-03              Andorre  ...         96.27           2.72
: 1  2021-03-03  Émirats Arabes Unis  ...         96.78           2.90
: 2  2021-03-03          Afghanistan  ...         88.50           7.11
: 3  2021-03-03   Antigua-et-Barbuda  ...         39.92          58.26
: 4  2021-03-03              Albanie  ...         65.40          32.91
: 
: [5 rows x 8 columns]

Let's see how big the data is and which date range it covers.

#+begin_src python :results output :session :exports both
print(data.shape)
# Parse the dates so that min/max compare timestamps rather than strings
data['Date'] = pd.to_datetime(data['Date'])
print(data['Date'].min())
print(data['Date'].max())
#+end_src

#+RESULTS:
: (6293, 8)
: 2021-02-01 00:00:00
: 2021-03-03 00:00:00

The dataset is fairly small, so the computations should be fast. Let's look at the columns.

#+begin_src python :results output :session :exports both
print(data.columns)
#+end_src

#+RESULTS:
: Index(['Date', 'Pays', 'Infections', 'Deces', 'Guerisons', 'TauxDeces',
:        'TauxGuerison', 'TauxInfection'],
:       dtype='object')

Interesting. We have a multivariate time series for each country, covering several daily metrics. Looking at TauxGuerison or TauxDeces could give us a sense of the quality of each country's medical care. The three rates always sum to roughly 100%:

#+begin_src python :results output :session :exports both
# The last three columns are the rate columns
rate_columns = data.columns[-3:]
print(data[rate_columns].sum(axis=1).unique())
#+end_src

#+RESULTS:
: [100.    99.99  99.99 100.01 100.   100.
:  100.01 100.01  99.99]

* Statistics

We want to compute statistics over February, per country, so we start by aggregating the data per country. First, we compute the per-country average of each rate metric.

#+begin_src python :results value :session :exports both
# Count columns: Infections, Deces, Guerisons
count_columns = data.columns[2:-3]
data_grouped = data.groupby('Pays')
# Average each rate over the whole period, per country
mean_rates_per_country = data_grouped[rate_columns].mean()
mean_rates_per_country.head()
#+end_src

#+RESULTS:
:                 TauxDeces  TauxGuerison  TauxInfection
: Pays                                                  
: Afghanistan      4.370968     87.556452       8.071935
: Afrique du Sud   3.217419     93.025806       3.757097
: Albanie          1.689032     62.334839      35.976774
: Algérie          2.656774     68.720645      28.625161
: Allemagne        2.785484     91.093226       6.121290

Let's see which countries had the highest death rate over the month of February. We expect them to be poor countries, which have fewer resources to treat their patients.

#+begin_src python :results output :session :exports both
print(mean_rates_per_country.sort_values('TauxDeces', ascending=False).head(10))
#+end_src

#+RESULTS:
#+begin_example
             TauxDeces  TauxGuerison  TauxInfection
Pays                                               
Yémen        28.496452     65.724194       5.779677
Mexique       8.748710     77.767742      13.484194
Syrie         6.580968     59.050000      34.369677
Soudan        6.193226     74.704839      19.101613
Égypte        5.763226     77.599355      16.636774
Équateur      5.721935     84.903548       9.375806
Chine         5.163226     94.075484       0.760645
Bolivie       4.720968     75.581613      19.697419
Afghanistan   4.370968     87.556452       8.071935
Libéria       4.273226     91.757419       3.965806
#+end_example

Indeed, some of these countries can be described as poor. Yemen appears to be hit extremely hard by the epidemic: it seems that nearly 30% of its infected people died in February. Yemen is a very poor country, but let's inspect this number, which seems very high compared to the other countries.
#+begin_src python :results value :session :exports both
# Select the count columns before averaging, so only those columns are aggregated
data_grouped[count_columns].mean().loc['Yémen']
#+end_src

#+RESULTS:
: Infections    2178.806452
: Deces          620.516129
: Guerisons     1430.709677
: Name: Yémen, dtype: float64

Now let's compare with the median country for each metric.

#+begin_src python :results value :session :exports both
data_grouped[count_columns].mean().median(axis=0)
#+end_src

#+RESULTS:
: Infections    50333.935484
: Deces           613.387097
: Guerisons     23364.000000
: dtype: float64

Yemen seems to have as many deaths as the median country, while having far fewer infections. This can be due either to a lack of testing in the country or to very poor medical care conditions. This highlights the growing poverty of the country, aggravated by war.

* Plotting

Now we can plot many things. For instance, we can inspect a country of interest and see how it evolves over the month of February. Let's see how the US was impacted.

#+begin_src python :results file :session :var matplot_lib_filename=(org-babel-temp-file "figure" ".png") :exports both
import matplotlib.pyplot as plt
# Keep only the rows for the United States
country_data = data[data['Pays'] == 'États-Unis']
country_data.plot('Date', ['Infections', 'Deces', 'Guerisons'])
plt.savefig(matplot_lib_filename)
matplot_lib_filename
#+end_src

#+RESULTS:
[[file:/var/folders/87/c7x20gt17rjfzcgh427wbtpr0000gn/T/babel-QilKUh/figureJY8iHB.png]]

We can't see much on this type of plot: for most countries the metrics are on different scales, and the data only covers one month, which is short for epidemic data. Also, this data only shows the evolution of the number of infected people. Let's look quickly at the number of new cases per day for the US.
(We add the

#+begin_src python :results file :session :var matplot_lib_filename=(org-babel-temp-file "figure" ".png") :exports both
import numpy as np
# Work on a copy so the new column is not assigned on a slice of data
country_data = country_data.copy()
# Difference between consecutive rows, padded with a leading 0;
# note that 117903 is then added to every element, not only the first
country_data['NewInfections'] = np.concatenate(([0], np.diff(country_data['Infections'].values))) + 117903
country_data.plot('Date', 'NewInfections')
plt.savefig(matplot_lib_filename)
matplot_lib_filename
#+end_src

#+RESULTS:
[[file:/var/folders/87/c7x20gt17rjfzcgh427wbtpr0000gn/T/babel-QilKUh/figurevihCAv.png]]

On this plot, the number of new cases grows every day during February in the US, which would indicate that the epidemic is not slowing there. Note that this reading depends on the rows being in chronological order, since the difference is taken between consecutive rows.

So let's try other visualisations. For instance, we can plot the distribution of the per-country mean of each rate metric.

#+begin_src python :results file :session :var matplot_lib_filename=(org-babel-temp-file "figure" ".png") :exports both
# One histogram per rate column, over the per-country means
mean_rates_per_country.hist(column=list(rate_columns), bins=20)
plt.savefig(matplot_lib_filename)
matplot_lib_filename
#+end_src

#+RESULTS:
[[file:/var/folders/87/c7x20gt17rjfzcgh427wbtpr0000gn/T/babel-QilKUh/figurefBoBMc.png]]

This plot shows that, overall, February was not the worst month for the world: most countries show a high recovery rate and a low death rate, meaning that medical services were not too overwhelmed. TauxInfection is not very meaningful, because it only shows the proportion of people who have neither recovered nor died.
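
The reading of TauxInfection as the share of people who have neither recovered nor died can be illustrated with a small sketch. It assumes, without confirmation from the dataset's documentation, that each rate is a percentage of the Infections count. The toy frame and the rates_from_counts helper are hypothetical, and the numbers are made up for illustration.

```python
import pandas as pd

# Hypothetical counts for one country on two days (NOT taken from the dataset)
toy = pd.DataFrame({
    "Infections": [1000, 1200],
    "Deces": [50, 60],
    "Guerisons": [800, 1020],
})


def rates_from_counts(counts: pd.DataFrame) -> pd.DataFrame:
    """Derive percentage rates from counts.

    ASSUMPTION: every rate is expressed as a share of Infections.
    """
    rates = pd.DataFrame(index=counts.index)
    rates["TauxDeces"] = 100 * counts["Deces"] / counts["Infections"]
    rates["TauxGuerison"] = 100 * counts["Guerisons"] / counts["Infections"]
    # Assumed definition: cases that are neither dead nor recovered
    active = counts["Infections"] - counts["Deces"] - counts["Guerisons"]
    rates["TauxInfection"] = 100 * active / counts["Infections"]
    return rates


toy_rates = rates_from_counts(toy)
print(toy_rates.round(2))
# Under this assumption the three rates sum to 100 on every row
print(toy_rates.sum(axis=1))
```

Under this assumption the three rates sum to 100 by construction, which is consistent with the check made on the rate columns in the Dataset section. It also shows why TauxInfection carries little information of its own: it is fully determined by the other two rates.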