Update exercice_fr.Rmd

Commit 663a1954 authored Jul 21, 2020 by 79e802aa17c93ff182cdea377fb6b01e

Showing with 150 additions and 17 deletions

exercice_fr.Rmd module3/exo3/exercice_fr.Rmd +150 -17

--- a/module3/exo3/exercice_fr.Rmd
+++ b/module3/exo3/exercice_fr.Rmd
 ---
-title: "Votre titre"
+title: "Autour du SARS-CoV-2 (Covid-19)"
-author: "Votre nom"
+author: "Franck Bonardi"
-date: "La date du jour"
+output:
-output: html_document
+  pdf_document:
+    toc: true
 ---
 ```{r setup, include=FALSE}
-knitr::opts_chunk$set(echo = TRUE)
+knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)
+```
+## Subject
+The goal here is to reproduce graphs similar to those published by the South China Morning Post (SCMP) on its page The Coronavirus Pandemic, which show, for different countries, the cumulative number of people with coronavirus disease 2019 (i.e. the total number of cases since the beginning of the epidemic).
+## Data preprocessing
+The data that we use initially are compiled by the [Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE)](https://systems.jhu.edu/) and made available on [GitHub](https://github.com/CSSEGISandData/COVID-19). We focus on the file `time_series_covid19_confirmed_global.csv` (time series in [csv](https://fr.wikipedia.org/wiki/Comma-separated_values) format), available at: https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv
+```{r}
+# Load libraries
+library(dplyr)
+library(tidyr)
+library(ggplot2)
+library(scales)
+library(directlabels)
+library(magrittr)
+```
+```{r}
+data_url = "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv"
+```
+The time series table that we use is for the global confirmed cases. Australia, Canada and China are reported at the province/state level. Dependencies of the Netherlands, the UK, France and Denmark are listed under the province/state level. The US and other countries are at the country level.
+Quick documentation of the data:
+| Column name      | Description                                 |
+|------------------|---------------------------------------------|
+| `Province/State` | State or province name, if provided         |
+| `Country/Region` | Name of the country or region               |
+| `Lat`            | Latitude of the country, state or province  |
+| `Long`           | Longitude of the country, state or province |
+Each remaining column holds the cumulative count for one day, from January 22nd, 2020 up to the date of download.
+### Download
+Read the CSV file and convert the date column names to **mm/dd/yy** format.
+```{r}
+full.data = read.csv(data_url, stringsAsFactors = F)
+names(full.data)[-1:-4] <- format(as.Date(names(full.data)[-1:-4], 
+                                          format = "X%m.%d.%y"), "%m/%d/%y")
+```
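The renaming step above can be tried offline on a small table. By default `read.csv` rewrites headers into syntactically valid names, so a header such as `1/22/20` arrives as `X1.22.20`, which is why the format string starts with `X`. This is only a sketch: the two-row CSV and the country names in it are invented for the illustration and are not JHU data.

```r
# Toy CSV with the same header layout as the JHU file (values are invented)
csv_text <- "Province/State,Country/Region,Lat,Long,1/22/20,1/23/20
,Atlantis,0,0,1,3
North,Utopia,10,20,0,2"

toy <- read.csv(text = csv_text, stringsAsFactors = FALSE)
names(toy)  # the date headers now start with "X" and use dots

# Parse the rewritten names as dates, then print them back as mm/dd/yy
parsed <- as.Date(names(toy)[-(1:4)], format = "X%m.%d.%y")
names(toy)[-(1:4)] <- format(parsed, "%m/%d/%y")
names(toy)
```

Passing `check.names = FALSE` to `read.csv` would keep the original headers instead; the approach shown here matches the one used in the chunk above.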
+## Exploring data
+Let's have a look at what we got:
+```{r}
+head(full.data[,1:10])
+tail(full.data[,1:10])
+```
+Are there missing data points?
+```{r}
+na_records = apply(full.data, 1, function (x) any(is.na(x)))
+sum(na_records)
+# full.data[na_records,]
 ```
-## Some explanations
+There are `r sum(na_records)` rows with at least one missing value in this dataset.
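Counting rows that contain a missing value is not the same as counting missing cells. The base-R sketch below shows the difference on an invented three-row data frame; the names `toy`, `rows_with_na` and `cells_with_na` are hypothetical.

```r
# Toy data frame with two missing cells in the same row (values are invented)
toy <- data.frame(
  Country = c("A", "B", "C"),
  d1 = c(1, NA, 3),
  d2 = c(2, NA, 5)
)

rows_with_na  <- sum(!complete.cases(toy))  # rows holding at least one NA
cells_with_na <- sum(is.na(toy))            # individual missing cells

rows_with_na
cells_with_na
```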
-This is an R Markdown document that you can easily export to HTML, PDF, and MS Word formats. For more details on R Markdown, see <http://rmarkdown.rstudio.com>.
+## Formatting data
-When you click the **Knit** button, this document will be compiled in order to re-execute the R code and include the results in a final document. As we showed you in the video, R code is included as follows:
+For this analysis, we need to perform several transformations. We keep the raw data in the variable **full.data**; then, using dplyr verbs, we remove the latitude and longitude columns because they are not useful to us. We then keep only the countries that we want to analyze in this project (see the list in the variable **countries**). For France, the United Kingdom and the Netherlands, we are not interested in the dependencies, so we filter to keep only the metropolitan territories.
-```{r cars}
+```{r}
-summary(cars)
+# Define the countries we want to subset for the analysis
+countries <- c("Belgium", "China", "Hong Kong", "France", "Germany", 
+               "Iran", "Italy", "Japan", "Korea, South", "Netherlands", 
+               "Portugal", "Spain", "United Kingdom", "US")
+# Select desired countries for the analysis and remove latitude and longitude information
+# and apply a custom filter for the countries
+data <- full.data %>% 
+  select(-starts_with("Lat"), -starts_with("Long")) %>% 
+  filter(Country.Region %in% countries) %>%
+  filter(!(Country.Region %in% c("France", "United Kingdom", "Netherlands") 
+           & Province.State != ""))
 ```
-Figures can also be included easily. For example:
+In this study, China is a special case: we need to separate Hong Kong from the rest of the country. We therefore modify the variable **Country.Region** for the Hong Kong rows, replacing "China" with "China, Hong-Kong".
+We then need to combine all the remaining provinces of China into a single piece of information, so we sum them into a single row.
+```{r,  fig.dim=c(10,6)}
+# Rename country for the specific case of Hong Kong
+data$Country.Region[which(data$Province.State == "Hong Kong" 
+                          & data$Country.Region == "China")] <- "China, Hong-Kong"
+data$Country.Region <- as.factor(data$Country.Region)
+data$Province.State <- as.factor(data$Province.State)
-```{r pressure, echo=FALSE}
+# Summarize the information for all the provinces of China, except for Hong Kong
-plot(pressure)
+data <- data[,2:ncol(data)] %>% 
+     group_by(Country.Region) %>% 
+     summarise_all(sum)  # funs() is deprecated since dplyr 0.8.0; pass the function directly
 ```
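The relabel-then-sum logic can be checked on a toy table. The sketch below uses base R's `aggregate` rather than the dplyr pipeline of the chunk above, so that it runs without extra packages; the countries `Xland` and `Yland`, the province names and all counts are invented.

```r
# Toy table: one country split into provinces, one reported as a whole
toy <- data.frame(
  Province.State = c("P1", "P2", "Special", ""),
  Country.Region = c("Xland", "Xland", "Xland", "Yland"),
  `01/22/20` = c(1, 2, 10, 0),
  `01/23/20` = c(3, 4, 11, 5),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Give the special region its own label so it is not merged with the rest
is_special <- toy$Province.State == "Special" & toy$Country.Region == "Xland"
toy$Country.Region[is_special] <- "Xland, Special"

# Sum every date column within each country label
date_cols <- setdiff(names(toy), c("Province.State", "Country.Region"))
by_country <- aggregate(toy[date_cols],
                        by = list(Country.Region = toy$Country.Region),
                        FUN = sum)
by_country
```

Relabelling before aggregating is what keeps the special region out of the country total; reversing the two steps would fold it into the sum.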
-You will notice the `echo = FALSE` parameter, which indicates that the code must not appear in the final version of the document. In the context of this MOOC, we recommend that you do not use this parameter, because the goal is for your data analyses to be fully transparent so that they are reproducible.
+### Inspection
-Since the results are not stored in Rmd files, to make it easier for other people to review your analyses, you should generate an HTML or PDF file and commit it.
+Finally, we can look at our data by plotting.
+The first plot shows the cumulative number of cases for each of the countries that we have decided to observe. The US clearly has a much higher number of cases than the other countries. A plot that takes the population of each country into account, for example by normalizing per 1000 inhabitants, will however have to wait for the end of the epidemic.
+```{r,  fig.dim=c(10,6)}
+mini.data <- gather(data, "Date", "Nb.cases", 2:ncol(data))
+mini.data$Date <- as.Date(mini.data$Date, format = "%m/%d/%y")  # two-digit year: %y, not %Y
+class(mini.data$Date)
+last.data <- mini.data %>%
+   group_by(Country.Region) %>%
+   summarise_all(max)  # summarise_each() and funs() are deprecated in dplyr
+ggplot(last.data, aes(x = reorder(Country.Region, Nb.cases), y = Nb.cases)) +
+  geom_col( aes(fill = Country.Region)) +
+  coord_flip()
+```
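The reshaping from one column per day to one row per country and day, followed by the date conversion, can be sketched on a toy table. This version uses base R only, as a stand-in for `tidyr::gather`; the table and the names `wide`, `long`, `good` and `bad` are invented for the illustration. It also shows why a two-digit year must be parsed with `%y`.

```r
# Toy wide table: one row per country, one column per day (values invented)
wide <- data.frame(
  Country.Region = c("A", "B"),
  `01/22/20` = c(1, 0),
  `01/23/20` = c(3, 2),
  check.names = FALSE,
  stringsAsFactors = FALSE
)
date_cols <- names(wide)[-1]

# One row per (country, date) pair: the long layout that ggplot2 expects
long <- data.frame(
  Country.Region = rep(wide$Country.Region, times = length(date_cols)),
  Date = rep(date_cols, each = nrow(wide)),
  Nb.cases = unlist(wide[date_cols], use.names = FALSE),
  stringsAsFactors = FALSE
)

# Two-digit years need %y; %Y reads "20" as the literal year 20
good <- as.Date(long$Date, format = "%m/%d/%y")
bad  <- as.Date(long$Date, format = "%m/%d/%Y")
as.integer(format(good[1], "%Y"))
as.integer(format(bad[1], "%Y"))

long$Date <- good
long
```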
-Now it's your turn! You can delete all of this information and replace it with your computational document.
+Another plot shows the number of cases accumulated over time, stacked over all the countries chosen at the start of this study.
+```{r,  fig.dim=c(10,6)}
+# Stacked bar chart of the cumulative number of cases across the selected countries
+ggplot(mini.data, aes(x=Date, y=Nb.cases, by= Country.Region)) +
+  geom_bar(stat="identity", fill="steelblue")+
+  theme_minimal()
+```
+Finally, two graphs with the date on the x-axis and the cumulative number of cases on that date on the y-axis. The first uses a linear scale and the second a logarithmic scale.
+```{r,  fig.dim=c(10,6)}
+# Multiple line plot
+ggplot(mini.data, aes(x = Date, y = Nb.cases)) + 
+  geom_line(aes(color = Country.Region)) +  # one line per country; a constant group would join all countries into one path
+  scale_x_date(breaks = "week", labels=date_format("%y/%m/%d"))+
+  theme(axis.text.x = element_text(angle = 45, hjust = 1))
+# Multiple line plot with log-scale
+ggplot(mini.data, aes(x = Date, y = log(Nb.cases+1))) + 
+  geom_line(aes(color = Country.Region)) +
+  scale_x_date(breaks = "week", labels=date_format("%y/%m/%d"))+
+  geom_dl(aes(label = Country.Region, colour=Country.Region), 
+          method = list(dl.combine("last.points"), 
+                        cex=0.8, rot = 0, vjust=-0.3, hjust=0.6)) +
+  theme(axis.text.x = element_text(angle = 45, hjust = 1))
+```
+The logarithmic scale is better suited to comparing the progression of the disease between countries: it makes it easier to see whether the disease is spreading rapidly or not.
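One way to see why the logarithmic scale helps: if cases grow exponentially, the logarithm of the count is a straight line in time, and its slope is the growth rate. The sketch below uses two synthetic series (not real data) with assumed doubling times of 3 and 6 days and recovers those doubling times from the slopes.

```r
# Synthetic exponential series (not real data): cases double every
# 3 days in "fast" and every 6 days in "slow"
days <- 0:30
fast <- 10 * 2^(days / 3)
slow <- 10 * 2^(days / 6)

# On a log scale each series is a straight line; its slope is the growth rate
slope_fast <- coef(lm(log(fast) ~ days))[["days"]]
slope_slow <- coef(lm(log(slow) ~ days))[["days"]]

# Doubling time recovered from the slope: log(2) / slope
log(2) / slope_fast
log(2) / slope_slow
```

On a linear scale the slow series would look almost flat next to the fast one, whereas on the log scale the two differ only in slope, which makes them directly comparable.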
+```{r}
+sessionInfo()
+```