---
title: "On SARS-CoV-2 (Covid-19)"
author: "Franck Bonardi"
output:
  pdf_document:
    toc: true
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)
```

## Subject

The goal here is to reproduce graphs similar to those on the South China Morning Post (SCMP) page "The Coronavirus Pandemic". These graphs show, for different countries, the cumulative number of people with coronavirus disease 2019, i.e. the total number of cases since the beginning of the epidemic.

## Data preprocessing

The data that we use initially are compiled by the [Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE)](https://systems.jhu.edu/) and are made available on [GitHub](https://github.com/CSSEGISandData/COVID-19). We focus on the file `time_series_covid19_confirmed_global.csv` (time series in [csv](https://fr.wikipedia.org/wiki/Comma-separated_values) format), available at the address:

<https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv>

```{r}
# Load libraries
library(dplyr)
library(tidyr)
library(ggplot2)
library(scales)
library(directlabels)
library(magrittr)
```

```{r}
data_url = "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv"
```

The time series table that we use covers the global confirmed cases. Australia, Canada and China are reported at the province/state level. Dependencies of the Netherlands, the UK, France and Denmark are listed under the province/state level. The US and other countries are reported at the country level.

Here is a quick description of the data:

| Column name      | Description                             |
|------------------|-----------------------------------------|
| `Province/State` | If filled in, state or province name    |
| `Country/Region` | Name of the country or region           |
| `Lat`            | Latitude of country, state or province  |
| `Long`           | Longitude of country, state or province |

Each of the remaining columns represents one day, from January 22nd 2020 until now.

### Download

We read the CSV file and convert the date held in each column name to **mm/dd/yy** format.

```{r}
full.data = read.csv(data_url, stringsAsFactors = FALSE)
# read.csv() turns a header such as "1/22/20" into "X1.22.20";
# parse these names as dates and rewrite them as mm/dd/yy
names(full.data)[-1:-4] <- format(as.Date(names(full.data)[-1:-4], format = "X%m.%d.%y"), "%m/%d/%y")
```

## Exploring data

Let's have a look at what we got:

```{r}
head(full.data[,1:10])
tail(full.data[,1:10])
```

Are there missing data points?

```{r}
# Flag every row that contains at least one NA
na_records = apply(full.data, 1, function (x) any(is.na(x)))
sum(na_records)
# full.data[na_records,]
```

There are `r sum(na_records)` rows with at least one missing value in this dataset.

## Formatting data

For this analysis, we need to perform several transformations. First of all, we keep the raw data in the variable **full.data**. Using the dplyr verbs, we then remove the latitude and longitude columns because they are not useful to us. Next, we keep only the countries that we want to analyze in this project (see the list in the variable **countries**). For France, the United Kingdom and the Netherlands, we are not interested in the overseas territories and dependencies, so we filter to keep only the metropolitan territories, i.e. the rows whose province/state field is empty.

```{r}
# Define the countries we want to subset for the analysis
countries <- c("Belgium", "China", "Hong Kong", "France", "Germany", "Iran", "Italy", "Japan", "Korea, South", "Netherlands", "Portugal", "Spain", "United Kingdom", "US")

# Select the desired countries, remove the latitude and longitude columns,
# and drop the dependencies of France, the United Kingdom and the Netherlands
data <- full.data %>%
  select(-starts_with("Lat"), -starts_with("Long")) %>%
  filter(Country.Region %in% countries) %>%
  filter(!(Country.Region %in% c("France", "United Kingdom", "Netherlands") & Province.State != ""))
```

In this study, China is a special case. We need to isolate Hong Kong from the rest of the country. This is why we deliberately modify the variable **Country.Region** for the Hong Kong row, replacing "China" with "China, Hong Kong". Then we need to bring all the remaining provinces of China together: we sum their counts so that they end up in a single row.

```{r}
# Rename the country for the specific case of Hong Kong
data$Country.Region[which(data$Province.State == "Hong Kong" & data$Country.Region == "China")] <- "China, Hong Kong"

data$Country.Region <- as.factor(data$Country.Region)
data$Province.State <- as.factor(data$Province.State)

# Drop Province.State, then sum the daily counts per country; this merges
# all the provinces of China (except Hong Kong) into one row
data <- data[,2:ncol(data)] %>%
  group_by(Country.Region) %>%
  summarise_all(sum)
```

### Inspection

Finally, we can look at our data by plotting. The first plot shows the cumulative number of cases for all the countries that we have decided to observe. Here we can clearly see that the US has a very high number of cases compared to other countries. It will however be necessary to wait for the end of the epidemic to make a plot that takes into account the population of each country, for example by normalizing per 1000 inhabitants.

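As a rough illustration of the per-capita normalization mentioned above, the following sketch divides cumulative counts by population. The data frames `toy.cases` and `toy.population`, the country names and all the figures are invented placeholders, not values from the JHU dataset; real population data would have to come from an external source. The plots that follow this sketch still use the raw counts.

```{r}
library(dplyr)

# Hypothetical cumulative counts (placeholder values only)
toy.cases <- data.frame(
  Country.Region = c("Country A", "Country B"),
  Nb.cases = c(50000, 20000),
  stringsAsFactors = FALSE
)

# Hypothetical populations (placeholder values only)
toy.population <- data.frame(
  Country.Region = c("Country A", "Country B"),
  Population = c(100e6, 10e6),
  stringsAsFactors = FALSE
)

# Join on the country name, then express the cases per 1000 inhabitants
toy.normalised <- toy.cases %>%
  inner_join(toy.population, by = "Country.Region") %>%
  mutate(Cases.per.1000 = Nb.cases / Population * 1000)

toy.normalised
```

With these placeholder numbers, Country B has fewer cases in absolute terms but four times as many per 1000 inhabitants, which is the kind of reversal that a raw-count bar chart hides.
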
```{r, fig.dim=c(10,6)}
# Reshape from wide (one column per day) to long (one row per country and day)
mini.data <- gather(data, "Date", "Nb.cases", 2:ncol(data))
# The column names were written as mm/dd/yy, so the year has two digits (%y)
mini.data$Date <- as.Date(mini.data$Date, format = "%m/%d/%y")
class(mini.data$Date)

# Counts are cumulative, so the maximum per country is the latest total
last.data <- mini.data %>%
  group_by(Country.Region) %>%
  summarise_all(max)

ggplot(last.data, aes(x = reorder(Country.Region, Nb.cases), y = Nb.cases)) +
  geom_col(aes(fill = Country.Region)) +
  coord_flip()
```

Another plot shows the cumulative number of cases over time, summed over all the countries chosen at the start of this study.

```{r, fig.dim=c(10,6)}
# Stacked bar chart of the cumulative number of cases across the selected countries
ggplot(mini.data, aes(x = Date, y = Nb.cases, group = Country.Region)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  theme_minimal()
```

Finally, we draw a graph with the date on the x-axis and the cumulative number of cases on that date on the y-axis. The first graph uses a linear scale and the second a logarithmic scale.

```{r, fig.dim=c(10,6)}
# Multiple line plot, one line per country
ggplot(mini.data, aes(x = Date, y = Nb.cases)) +
  geom_line(aes(color = Country.Region)) +
  scale_x_date(date_breaks = "1 week", date_labels = "%y/%m/%d") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# Multiple line plot with log-scale (natural log; the +1 avoids log(0))
ggplot(mini.data, aes(x = Date, y = log(Nb.cases + 1))) +
  geom_line(aes(color = Country.Region)) +
  scale_x_date(date_breaks = "1 week", date_labels = "%y/%m/%d") +
  geom_dl(aes(label = Country.Region, colour = Country.Region), method = list(dl.combine("last.points"), cex = 0.8, rot = 0, vjust = -0.3, hjust = 0.6)) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
```

The logarithmic scale is better suited to comparing the progression of the disease between countries: it makes it easier to see whether the disease is spreading rapidly or not.

```{r}
sessionInfo()
```
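
To make the remark about the logarithmic scale more concrete: a series that grows exponentially becomes a straight line once its logarithm is taken, and the slope of that line gives the growth rate. The sketch below uses a synthetic series, not real data, built to double every 3 days. The names `toy.days`, `toy.count`, `growth.rate` and `doubling.time` are illustrative and do not appear in the analysis above.

```{r}
# Synthetic series: 100 cases on day 0, doubling every 3 days (not real data)
toy.days <- 0:20
toy.count <- 100 * 2^(toy.days / 3)

# In log space the series is linear, so a linear fit recovers the growth rate
fit <- lm(log(toy.count) ~ toy.days)
growth.rate <- unname(coef(fit)["toy.days"])

# Doubling time implied by the fitted slope
doubling.time <- log(2) / growth.rate
doubling.time
```

Applied to a real country series, the same fit would only be meaningful over a period where the curve on the log-scale plot looks roughly straight.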