The goal here is to reproduce graphs similar to those published by the South China Morning Post (SCMP) on its "The Coronavirus Pandemic" page, which show, for different countries, the cumulative number of people with coronavirus disease 2019 (i.e. the total number of cases since the beginning of the epidemic).
## Data preprocessing
The data that we use initially are compiled by the [Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE)](https://systems.jhu.edu/) and are made available on [GitHub](https://github.com/CSSEGISandData/COVID-19). More specifically, we focus on the file `time_series_covid19_confirmed_global.csv` (time series in [CSV](https://fr.wikipedia.org/wiki/Comma-separated_values) format), which is stored under the `csse_covid_19_data` directory of that repository.
The time series table that we use is for the global confirmed cases. Australia, Canada and China are reported at the province/state level. Dependencies of the Netherlands, the UK, France and Denmark are listed under the province/state level. The US and other countries are at the country level.
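The layout described above can be illustrated on a tiny inline sample. This is only a minimal sketch, assuming the file is read with base R's `read.csv()`; the three sample rows and their counts are invented for illustration and are not taken from the JHU data. With the default `check.names = TRUE`, the header `Country/Region` becomes `Country.Region`, which is the column name used in the rest of this document.

```r
# Hypothetical three-row extract mimicking the layout of the JHU file
# (the counts are invented for illustration).
sample_csv <- "Province/State,Country/Region,Lat,Long,1/22/20,1/23/20
Hubei,China,30.97,112.27,10,20
,France,46.22,2.21,0,1
,US,40.0,-100.0,1,1"

# check.names = TRUE (the default) turns "Country/Region" into "Country.Region"
# and prefixes column names that start with a digit with "X".
sample_data <- read.csv(text = sample_csv, check.names = TRUE,
                        stringsAsFactors = FALSE)

names(sample_data)
```

Rows reported at the country level have an empty `Province.State` field in this sketch, while province-level rows carry the province name.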
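```{r}
# Flag every row that contains at least one missing value
na_records <- apply(full.data, 1, function(x) any(is.na(x)))
# Number of rows with at least one missing value
sum(na_records)
# Uncomment to display the affected rows
# full.data[na_records, ]
```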
There are `r sum(na_records)` rows with at least one missing value in this dataset.
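Note that counting rows with a missing value is not the same as counting missing cells. The following sketch shows the difference on a toy data frame (hypothetical values, unrelated to the JHU data), using only base R.

```r
# Toy data frame with missing values (hypothetical, for illustration only)
toy <- data.frame(
  region = c("a", "b", "c"),
  day1   = c(1, NA, 3),
  day2   = c(NA, NA, 6)
)

# Number of rows containing at least one NA (what the apply() call counts)
rows_with_na <- sum(!complete.cases(toy))

# Total number of missing cells
cells_with_na <- sum(is.na(toy))
```

Here two rows are incomplete, but three cells are missing.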
## Formatting data
For this analysis, we need to perform several transformations. First, we keep the raw data in the variable **full.data**. Then, using dplyr verbs, we remove the latitude and longitude columns because they are not useful to us. Finally, we keep only the countries that we want to analyze in this project (see the list in the variable **countries**). For France, the United Kingdom and the Netherlands, we are not interested in the overseas dependencies, so we filter to keep only the metropolitan territories.
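The transformations described above could be sketched as follows on a toy table. This is an illustration, not the code used in the analysis: the names `toy_raw`, `toy_countries`, `mainland_only` and `toy_data` are hypothetical, the values are invented, and the sketch assumes that the metropolitan row of a country is the one whose `Province.State` field is empty (if the field is read as `NA` instead, `is.na(Province.State)` would be needed).

```r
library(dplyr)

# Toy table mimicking the layout of the raw data (values are invented)
toy_raw <- data.frame(
  Province.State = c("", "Martinique", "", "Hubei", ""),
  Country.Region = c("France", "France", "Italy", "China", "Brazil"),
  Lat = c(46.2, 14.6, 41.9, 31.0, -14.2),
  Long = c(2.2, -61.0, 12.6, 112.3, -51.9),
  X1.22.20 = c(0, 0, 0, 10, 0),
  stringsAsFactors = FALSE
)

# Hypothetical shortened selection, not the full list used in the analysis
toy_countries <- c("France", "China")
# Countries for which only the row without a province/state is kept
mainland_only <- c("France", "United Kingdom", "Netherlands")

toy_data <- toy_raw %>%
  # Latitude and longitude are not needed
  select(-Lat, -Long) %>%
  # Keep only the countries of interest
  filter(Country.Region %in% toy_countries) %>%
  # For the listed countries, drop the rows that describe a dependency
  filter(!(Country.Region %in% mainland_only) | Province.State == "")
```

In this sketch the row for Martinique is dropped while the province-level row for China is kept.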
```{r}
# Define the countries we want to subset for the analysis
countries <- c("Belgium", "China", "Hong Kong", "France", "Germany",
```
In this study, China is a special case: we need to treat Hong Kong separately from the rest of the country. We therefore deliberately modify the variable **Country.Region** for the Hong Kong region, replacing "China" with "China, Hong Kong".
Then we need to combine all the remaining provinces of China into a single record, so we sum the counts of these provinces into one row.
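```{r, fig.dim=c(10,6)}
# Rename country for the specific case of Hong Kong
# Summarize the information for all the provinces of China, except for Hong Kong
# Drop the first column, then sum the counts of all rows sharing a country name
# (funs() is deprecated in recent dplyr versions, so sum is passed directly)
data <- data[, 2:ncol(data)] %>%
  group_by(Country.Region) %>%
  summarise_all(sum)
```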
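The two steps described above (relabelling Hong Kong, then summing the remaining provinces of China) can be sketched on a toy table as follows. The table `toy_china` and its counts are hypothetical, and `across()` assumes dplyr 1.0 or later; this is one possible way to write it, not necessarily the code used in the analysis.

```r
library(dplyr)

# Toy table with province-level rows for China (values are invented)
toy_china <- data.frame(
  Province.State = c("Hubei", "Beijing", "Hong Kong", ""),
  Country.Region = c("China", "China", "China", "France"),
  X1.22.20 = c(10, 2, 1, 0),
  X1.23.20 = c(20, 3, 2, 1),
  stringsAsFactors = FALSE
)

toy_summed <- toy_china %>%
  # Give Hong Kong its own label so it is not merged with the rest of China
  mutate(Country.Region = if_else(Province.State == "Hong Kong",
                                  "China, Hong Kong", Country.Region)) %>%
  # Drop the province column, then add up the rows of each country
  select(-Province.State) %>%
  group_by(Country.Region) %>%
  summarise(across(everything(), sum))
```

The result has one row per label: "China" (sum of Hubei and Beijing), "China, Hong Kong" and "France".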
### Inspection
We can now inspect our data by plotting it.
The first plot shows the cumulative number of cases for all the countries that we have decided to observe. Here we can clearly see that the US has a much higher number of cases than the other countries. However, we will have to wait until the end of the epidemic to make a plot that takes the population of each country into account, for example by normalizing per 1000 inhabitants.
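As an illustration of the normalization mentioned above, the number of cases per 1000 inhabitants is the cumulative count divided by the population and multiplied by 1000. The sketch below uses two fictitious countries with invented counts and populations; the column names `Population` and `Cases.per.1000` are hypothetical.

```r
# Hypothetical cumulative counts and populations (invented numbers)
toy_cases <- data.frame(
  Country.Region = c("A", "B"),
  Nb.cases = c(5000, 5000),
  Population = c(1e6, 1e7)
)

# Cases per 1000 inhabitants
toy_cases$Cases.per.1000 <- toy_cases$Nb.cases / toy_cases$Population * 1000
```

Both fictitious countries have the same raw count, but country A has ten times more cases per inhabitant than country B.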
The graphs show the date on the x-axis and the cumulative number of cases on that date on the y-axis. The first graph uses a linear scale and the second a logarithmic scale.
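Such a plot needs one row per country and date. A minimal sketch of that reshaping is given below, assuming that the date columns are named like `X1.22.20` (as produced by `read.csv()`) and using `pivot_longer()` from tidyr, which may differ from the approach actually used to build the plotted table. The table `toy_wide` and its values are hypothetical.

```r
library(dplyr)
library(tidyr)

# Toy wide table: one row per country, one column per date (invented values)
toy_wide <- data.frame(
  Country.Region = c("A", "B"),
  X1.22.20 = c(1, 0),
  X1.23.20 = c(2, 5),
  stringsAsFactors = FALSE
)

toy_long <- toy_wide %>%
  # One row per (country, date) pair
  pivot_longer(cols = starts_with("X"),
               names_to = "Date", values_to = "Nb.cases") %>%
  # Strip the leading "X" and parse month.day.year into a Date
  mutate(Date = as.Date(sub("^X", "", Date), format = "%m.%d.%y"))
```

The resulting columns `Country.Region`, `Date` and `Nb.cases` match the names used in the plotting code.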
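```{r, fig.dim=c(10,6)}
# Multiple line plot on a linear scale: one line per country
# (grouping by country avoids joining all points into a single line)
ggplot(mini.data, aes(x = Date, y = Nb.cases)) +
  geom_line(aes(color = Country.Region, group = Country.Region))
```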
The logarithmic scale is better suited to comparing the progression of the disease between countries, because it makes it easier to see whether the disease is spreading rapidly or not.
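A plot on a logarithmic scale could be obtained with `scale_y_log10()` from ggplot2, as in the following sketch. The data frame `toy_growth` is synthetic (two fictitious countries growing at different constant rates) and only serves to show the effect of the scale; on a logarithmic axis, exponential growth appears as a straight line whose slope reflects the growth rate.

```r
library(ggplot2)

# Synthetic data: country A doubles every day, country B grows by 30% per day
toy_growth <- data.frame(
  Date = rep(seq(as.Date("2020-03-01"), by = "day", length.out = 10),
             times = 2),
  Country.Region = rep(c("A", "B"), each = 10),
  Nb.cases = c(10 * 2^(0:9), 10 * 1.3^(0:9))
)

# Same kind of line plot as before, with a log10 y-axis
p_log <- ggplot(toy_growth, aes(x = Date, y = Nb.cases,
                                color = Country.Region)) +
  geom_line() +
  scale_y_log10() +
  labs(x = "Date", y = "Cumulative cases (log scale)")
```

Note that a count of zero has no finite logarithm, so dates on which a country has no case yet cannot be drawn on this scale.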