Worldwide covid evolution in February 2021
1 Dataset
We first introduce the dataset, downloaded from coronavirus.politologue.com on 3 March 2021 (03/03/2021). We chose the per-country daily dataset, which requires some preprocessing but allows a more fine-grained statistical analysis.
import pandas as pd

data = pd.read_csv('./coronavirus.politologue.com-pays-2021-03-03.csv', skiprows=7, sep=';')
data.head()

         Date                 Pays  ...  TauxGuerison  TauxInfection
0  2021-03-03              Andorre  ...         96.27           2.72
1  2021-03-03  Émirats Arabes Unis  ...         96.78           2.90
2  2021-03-03          Afghanistan  ...         88.50           7.11
3  2021-03-03   Antigua-et-Barbuda  ...         39.92          58.26
4  2021-03-03              Albanie  ...         65.40          32.91

[5 rows x 8 columns]
Let's see how big the data is, and the date range it covers.
print(data.shape)
data['Date'] = pd.to_datetime(data['Date'])
print(data['Date'].min())
print(data['Date'].max())

(6293, 8)
2021-02-01 00:00:00
2021-03-03 00:00:00
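The dataset is small (6293 rows, 8 columns), so computations should be fast. It covers 1 February to 3 March 2021, so the "February" statistics in the rest of this analysis also include the first three days of March. Let's look at the columns.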
print(data.columns)
Index(['Date', 'Pays', 'Infections', 'Deces', 'Guerisons', 'TauxDeces', 'TauxGuerison', 'TauxInfection'], dtype='object')
Interesting. We have a multivariate time series for each country, covering several daily metrics: counts of infections (Infections), deaths (Deces) and recoveries (Guerisons), plus three rates. Looking at TauxGuerison (recovery rate) or TauxDeces (death rate) could give us a sense of the quality of each country's medical care. The three rates always sum to roughly 100%:
rate_columns = data.columns[-3:]
print(data[rate_columns].sum(axis=1).unique())
[100. 99.99 99.99 100.01 100. 100. 100.01 100.01 99.99]
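The near-100 sums are what one would expect if the three rates are shares of the recorded infections. The source file does not spell out the definitions, so the sketch below assumes that TauxDeces and TauxGuerison are deaths and recoveries as a percentage of infections, and that TauxInfection is the remaining share (cases neither recovered nor dead). The counts are made-up toy values, not rows from the dataset. Under these assumptions, rounding each rate to two decimals is enough to explain totals such as 99.99 or 100.01.

```python
import pandas as pd

# Toy counts (made up for illustration, not taken from the dataset).
toy = pd.DataFrame({
    'Infections': [3000, 7000, 12345],
    'Deces': [100, 333, 678],
    'Guerisons': [2500, 6000, 9876],
})

# Assumed definitions: each rate is a percentage of recorded infections.
toy['TauxDeces'] = (100 * toy['Deces'] / toy['Infections']).round(2)
toy['TauxGuerison'] = (100 * toy['Guerisons'] / toy['Infections']).round(2)
active = toy['Infections'] - toy['Deces'] - toy['Guerisons']
toy['TauxInfection'] = (100 * active / toy['Infections']).round(2)

# The rounded rates add up to 100 within a few hundredths.
rate_sum = toy[['TauxDeces', 'TauxGuerison', 'TauxInfection']].sum(axis=1)
print(rate_sum.round(2).tolist())
```

If these definitions hold for the real file, TauxInfection can be read as the share of cases still active on a given day.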
2 Statistics
We want to compute statistics over the period, per country, so we start by grouping the data by country. First, we compute the mean of each rate for each country.
count_columns = data.columns[2:-3]
data_grouped = data.groupby('Pays')
mean_rates_per_country = data_grouped[rate_columns].mean()
mean_rates_per_country.head()

                TauxDeces  TauxGuerison  TauxInfection
Pays
Afghanistan      4.370968     87.556452       8.071935
Afrique du Sud   3.217419     93.025806       3.757097
Albanie          1.689032     62.334839      35.976774
Algérie          2.656774     68.720645      28.625161
Allemagne        2.785484     91.093226       6.121290
Let's see which countries had the highest death rates over the period. We expect them to be poorer countries, with fewer resources to treat their patients.
print(mean_rates_per_country.sort_values('TauxDeces', ascending=False).head(10))
             TauxDeces  TauxGuerison  TauxInfection
Pays
Yémen        28.496452     65.724194       5.779677
Mexique       8.748710     77.767742      13.484194
Syrie         6.580968     59.050000      34.369677
Soudan        6.193226     74.704839      19.101613
Égypte        5.763226     77.599355      16.636774
Équateur      5.721935     84.903548       9.375806
Chine         5.163226     94.075484       0.760645
Bolivie       4.720968     75.581613      19.697419
Afghanistan   4.370968     87.556452       8.071935
Libéria       4.273226     91.757419       3.965806
Indeed, some of these countries can be described as poor. Yemen appears extremely hard hit: on average over the period, its death rate is about 28%, meaning that deaths amount to more than a quarter of its recorded infections. Yemen is a very poor country, but this number is so much higher than the others that it deserves a closer look.
data_grouped[count_columns].mean().loc['Yémen']

Infections    2178.806452
Deces          620.516129
Guerisons     1430.709677
Name: Yémen, dtype: float64
Now let's compare with the median country for each metric.
data_grouped[count_columns].mean().median(axis=0)

Infections    50333.935484
Deces           613.387097
Guerisons     23364.000000
dtype: float64
Yemen has about as many deaths as the median country, while recording far fewer infections. This can be due to a lack of testing in the country, to very poor medical care conditions, or to both. It is consistent with the poverty of the country, aggravated by war.
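The comparison can be made concrete with a little arithmetic on the figures printed above. The sketch below only reuses those numbers. Note that the median deaths and the median infections do not necessarily come from the same country, so the second ratio is an indicative reference rather than the death rate of any real country.

```python
# Monthly means quoted above for Yemen, and the medians across countries.
yemen_deaths, yemen_infections = 620.516129, 2178.806452
median_deaths, median_infections = 613.387097, 50333.935484

# Deaths per recorded infection, in percent.
yemen_ratio = 100 * yemen_deaths / yemen_infections
median_ratio = 100 * median_deaths / median_infections

print(f"Yemen: {yemen_ratio:.1f}% of recorded cases")
print(f"Median-based reference: {median_ratio:.1f}% of recorded cases")
```

The gap (roughly 28% against roughly 1%) is large enough that under-testing is a plausible part of the explanation: if many mild cases are never recorded, the denominator shrinks and the ratio is inflated.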
3 Plotting
Now we can plot many things. For instance, we can pick a country of interest and see how it behaves over the period. Let's see how the US was affected.
import matplotlib.pyplot as plt

country_data = data[data['Pays'] == 'États-Unis']
country_data.plot('Date', ['Infections', 'Deces', 'Guerisons'])
# matplot_lib_filename: output image path, not defined in this excerpt
plt.savefig(matplot_lib_filename)
matplot_lib_filename
We can't see much on this type of plot: for most countries the metrics are on different scales, and one month is a short window for epidemic data. Also, the Infections column is a running total, so the plot only shows the cumulative number of infected people. Let's look quickly at the number of new cases per day for the US. (We add the
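import numpy as np

country_data = country_data.copy()  # avoid writing into a slice of `data`
infections = country_data['Infections'].values
# Row-to-row difference in stored order (first row gets 0); 117903 is added to every value.
country_data['NewInfections'] = np.array([0] + (infections[1:] - infections[:-1]).tolist()) + 117903
country_data.plot('Date', 'NewInfections')
plt.savefig(matplot_lib_filename)
matplot_lib_filename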
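The resulting curve rises throughout February, which at first sight suggests that the epidemic is not slowing in the US. This reading is fragile, however: data.head() shows the most recent date first, so the row-to-row difference appears to be taken backwards in time, which flips its sign, and the constant 117903 shifts every value, not only the first. The daily new cases should be recomputed in chronological order before drawing a conclusion.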
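One way to compute daily new cases without depending on the row order is to sort by date first and then difference consecutive days. The sketch below is a minimal illustration on made-up cumulative counts, stored most recent first as the source file appears to be; it is not the computation used for the plot above.

```python
import pandas as pd

# Toy cumulative counts (made up), stored most recent first.
toy = pd.DataFrame({
    'Date': pd.to_datetime(['2021-02-04', '2021-02-03', '2021-02-02', '2021-02-01']),
    'Infections': [1000, 900, 750, 500],
})

# Sort chronologically, then take the difference between consecutive days.
toy = toy.sort_values('Date').reset_index(drop=True)
toy['NewInfections'] = toy['Infections'].diff()

# The first day has no previous day, so its value is NaN rather than a guess.
print(toy)
```

Leaving the first day as NaN keeps it off the plot instead of anchoring the curve to an arbitrary value.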
So let's try other visualisations. For instance, we can plot the distribution across countries of the mean value of each rate.
mean_rates_per_country.hist(column=list(rate_columns), bins=20)
plt.savefig(matplot_lib_filename)
matplot_lib_filename
This plot shows that, overall, February was not the worst month for the
world: most countries show a high recovery rate and a low death rate,
which suggests that medical services were not too overwhelmed.
TauxInfection is less meaningful, because it only shows the proportion
of recorded cases that are neither recovered nor dead.
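A histogram reading such as "most countries show a high recovery rate" can be backed by a simple count. The sketch below uses only the first five rows of mean_rates_per_country shown earlier, so its result says nothing about the full set of countries, and the two thresholds are hypothetical values chosen for illustration.

```python
import pandas as pd

# First five rows of mean_rates_per_country, copied from the output shown earlier.
sample = pd.DataFrame(
    {
        'TauxDeces': [4.370968, 3.217419, 1.689032, 2.656774, 2.785484],
        'TauxGuerison': [87.556452, 93.025806, 62.334839, 68.720645, 91.093226],
    },
    index=['Afghanistan', 'Afrique du Sud', 'Albanie', 'Algérie', 'Allemagne'],
)

# Hypothetical thresholds, chosen only for illustration.
high_recovery = 80.0
low_death = 5.0

# The mean of a boolean Series is the share of rows where the condition holds.
share_high_recovery = (sample['TauxGuerison'] > high_recovery).mean()
share_low_death = (sample['TauxDeces'] < low_death).mean()
print(share_high_recovery, share_low_death)
```

Applying the same two lines to the full mean_rates_per_country table would give the shares for all countries.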