Commit 89fa46f7 authored by Val

exomd4

parent 46d783f7
---
title: "MOOC_COVID_Analysis"
author: "VB"
date: "2023-05-24"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
# Analysis of COVID-19 incidence using a local copy of the data
The goal here is to reproduce graphs similar to those of the [South China Morning Post](https://www.scmp.com/) (SCMP), on the page [The Coronavirus Pandemic](https://www.scmp.com/coronavirus?src=homepage_covid_widget), which show, for several countries, the cumulative number of people affected by coronavirus disease 2019 (that is, the total number of cases since the beginning of the epidemic).
The data we will use at first are compiled by the [Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE)](https://systems.jhu.edu/) and made available on [GitHub](https://github.com/CSSEGISandData/COVID-19). We will focus in particular on the file `time_series_covid19_confirmed_global.csv` (time series in csv format), available at: <https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv>.
You will start by downloading the data to create a graph showing the evolution of the cumulative number of cases over time for the following countries: Belgium (*Belgium*), China - all provinces except Hong Kong (*China*), Hong Kong (*China, Hong-Kong*), metropolitan France (*France*), Germany (*Germany*), Iran (*Iran*), Italy (*Italy*), Japan (*Japan*), South Korea (*Korea, South*), the Netherlands without its overseas territories (*Netherlands*), Portugal (*Portugal*), Spain (*Spain*), the United Kingdom without its overseas territories (*United Kingdom*), and the United States (*US*).
The name in parentheses is the "country" name as it appears in the file `time_series_covid19_confirmed_global.csv`. The data for China are reported by province, and we separated Hong Kong not to take sides on the differences between that province and the Chinese state, but because that is how the data appear on the SCMP site. The data for France, the Netherlands and the United Kingdom exclude overseas territories and other "colonial remnants".
You will then make a graph with the date on the x-axis and the cumulative number of cases at that date on the y-axis. We suggest you make two versions of this graph, one with a linear scale and one with a logarithmic scale.
**The rest of this RMarkdown file is in English. Documentation for some of the functions is given as comments in the code.**
## Installing and loading required packages
In this analysis, I will use three packages: **tidyverse** to format the data and **ggplot2**/**ggrepel** for the graphical representation. The next R lines detect whether these packages are installed and install the missing ones.
```{r}
list.of.packages <- c("ggplot2", "tidyverse", "ggrepel")
new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[,"Package"])]
if(length(new.packages)) install.packages(new.packages)
```
## Downloading and loading the data in R
The next chunk of code loads the data in R. If no local copy of the csv file is present, it is downloaded first.
```{r}
data_url <- "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv"
data_file <- "time_series_covid19_confirmed_global.csv"
if (!file.exists(data_file)) {
  download.file(data_url, data_file, method = "auto")
}
data <- read.csv(data_file, sep = ",") # sep = "," handles the comma-separated csv
```
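Since the upstream file is updated daily, the local copy freezes the analysis on one particular snapshot. As a small optional sketch (reusing the same file name as above), one could document which snapshot the analysis was run on:

```r
# Sketch: report when the local snapshot of the (daily-updated) csv was retrieved.
data_file <- "time_series_covid19_confirmed_global.csv"
if (file.exists(data_file)) {
  cat("Local copy last modified:", format(file.mtime(data_file)), "\n")
}
```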
I then check that the file loaded correctly, since the csv is comma-separated.
```{r}
head(colnames(data)) # Allows us to see that the columns were well separated
head(data$Country.Region)
```
It is correctly loaded, but the data frame will need some manipulation before I can use it with ggplot, the package I will use for the graphical representation.
## Building the data frame to show the number of cases over time
Since I want to exclude all "colonial territories" except China's regions, I will **create a data frame keeping only the rows with an empty `Province.State`**. The only purpose of this data frame is to simplify the selection of the countries I want for the analysis.
```{r}
data_noProvince<-data[(data$Province.State==""),]
```
I will **check that the selection is correct** by comparing the numbers of rows in the two data frames and verifying that France appears only once.
```{r}
nrow(data_noProvince)
nrow(data)
head(data_noProvince[data_noProvince$Country.Region=="France",c(1,2)])
```
Now that the separation is complete, **I will select the countries I want**. China and Hong Kong are exceptions: I will need to add up the numbers of all the regions of China except Hong Kong, since Hong Kong will have its own category.
```{r}
list_of_countries <- c("France","Belgium","Germany","Iran","Italy","Japan","Netherlands","Portugal","Spain","United Kingdom","US","Korea, South")
dt_countries<-data_noProvince[data_noProvince$Country.Region%in%list_of_countries,] # Here, I use the prepared data frame without the non-metropolitan territories
china_data<-data[data$Country.Region=="China"&data$Province.State!="Hong Kong",] #I get all the numbers associated with China, except Hong-Kong
hong_kong<-data[data$Country.Region=="China"&data$Province.State=="Hong Kong",]
hong_kong$Country.Region<-"Hong Kong" #I isolated the numbers associated with Hong-Kong
```
**For China, I will add up the counts of all provinces for each day**, which gives me a single row for the final data frame.
```{r}
# colSums() sums each daily column over all Chinese provinces; t() transposes the
# resulting vector back into a single row (for tidyverse formatting); cbind() then
# glues the two metadata columns back onto that row.
china<-cbind(china_data[1,c(1,2)],
             t(data.frame(list(colSums(china_data[,-c(1,2)])))))
```
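For reference, the same per-day summation over provinces can also be written with base R's `aggregate()`, which groups and sums in one call. A sketch on a toy table with the same shape as the csv (two metadata columns followed by daily count columns, with made-up values):

```r
# Toy data mimicking two provinces of one country plus one other country.
toy <- data.frame(Province.State = c("Hubei", "Beijing", ""),
                  Country.Region = c("China", "China", "France"),
                  X1.22.20 = c(444, 14, 0),
                  X1.23.20 = c(444, 22, 1))
# Sum the daily count columns by country (metadata columns dropped first).
by_country <- aggregate(toy[, -c(1, 2)],
                        by = list(Country.Region = toy$Country.Region),
                        FUN = sum)
by_country
```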
Finally, I will **assemble the three data frames** (China, Hong Kong, and the other countries). I will use the country names as row names, for formatting purposes.
```{r}
dt_countries<-rbind(china, hong_kong,dt_countries) # rbind() is like cbind(), but binds rows instead of columns
colnames(dt_countries)<-colnames(data)
rownames(dt_countries)<-dt_countries$Country.Region
```
To allow the graphical representation, **I will remove the unnecessary variables** (`Province.State`, `Country.Region`, `Lat` and `Long`; the country names are kept as row names). **I also transpose the data frame so that the countries become the variables and the dates the observations**. Again, this is mostly for formatting.
```{r}
dt_countries_onlydata<-as.data.frame(t(dt_countries[,c(5:ncol(dt_countries))]))
dt_countries_onlydata$date<-rownames(dt_countries_onlydata)
```
Using tidyverse, I will prepare the data for processing by ggplot.
```{r}
library(tidyverse)
df<-dt_countries_onlydata %>%
  gather(key="Country",value="Infected",-date)
```
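Note that `gather()` still works but is superseded in recent tidyr; the same wide-to-long reshaping can be written with `pivot_longer()`. A sketch on a toy wide table shaped like `dt_countries_onlydata` (one column per country plus a `date` column, with made-up values):

```r
library(tidyr)
# Toy wide table: one row per date, one column per country.
wide <- data.frame(date = c("X1.22.20", "X1.23.20"),
                   France = c(0, 2),
                   Italy  = c(1, 3))
long <- pivot_longer(wide, cols = -date,
                     names_to = "Country", values_to = "Infected")
long
```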
We can compare the two data frames to see the different layouts
```{r}
head(dt_countries_onlydata)
head(df)
```
I will finally convert the obscure date format to one that R can recognize
```{r}
df$date<-as.Date(str_sub(df$date,2,-1),format="%m.%d.%y")# str_sub is used to remove the "X" before the date
```
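A quick way to convince oneself that the format string is right is to convert a single known value: the first date column name, `X1.22.20`, should map to January 22, 2020.

```r
# Sanity check of the "%m.%d.%y" format string on the first date column name
# (base R's sub() strips the leading "X", like str_sub() above).
d <- as.Date(sub("^X", "", "X1.22.20"), format = "%m.%d.%y")
d # 2020-01-22
```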
Now the data are ready to be plotted with ggplot.
## Graphical representation
```{r}
library(ggplot2)
library(ggrepel) # package used to add the labels shown below
ggplot(df, aes(x = date, y = Infected)) +
geom_line(aes(color = Country, group = Country))
```
I will add a label to identify the most infected country
```{r}
label = if_else(df$date == max(df$date), as.character(df$Country), NA_character_)
ggplot(df, aes(x = date, y = Infected)) +
geom_line(aes(color = Country, group = Country))+
geom_label_repel(aes(label = label,color=Country),
nudge_x = 0, nudge_y=0, max.overlaps=1, direction = "y",
na.rm = TRUE,show.legend=F)
```
It is not easy to see the other countries: ggrepel warns that there is no room to place their labels. Let's mask the US and see how the graph looks now.
```{r}
zoomed_df<-df[df$Country!="US",]
label = if_else(zoomed_df$date == max(zoomed_df$date), as.character(zoomed_df$Country), NA_character_)
ggplot(zoomed_df, aes(x = date, y = Infected)) +
geom_line(aes(color = Country, group = Country))+
geom_label_repel(aes(label = label,color=Country),
nudge_x = 200, nudge_y=50, max.overlaps=20, direction = "y", force=10,
na.rm = TRUE,show.legend=F)+ theme(legend.position = "none")
```
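The assignment also asks for a logarithmic-scale version of this graph. With ggplot2 this can be done without transforming the data, using `scale_y_log10()`, which keeps the axis labels in case counts. A minimal sketch on a toy stand-in for `df` (in the document, `df` itself would be used):

```r
library(ggplot2)
# Toy stand-in for df: same columns (date, Country, Infected), made-up values.
toy_df <- data.frame(
  date = rep(as.Date(c("2020-01-22", "2020-02-22")), 2),
  Country = rep(c("France", "Italy"), each = 2),
  Infected = c(10, 1000, 20, 2000)
)
p <- ggplot(toy_df, aes(x = date, y = Infected)) +
  geom_line(aes(color = Country, group = Country)) +
  scale_y_log10() # log-scaled axis, tick labels stay in original units
p
```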
Something looks wrong with the number of cases from China. Let's check the total number of cases in the world, and then I'll come back to it.
## Total number of cases over time
I will use the data frame imported from the website. First, let's change its name.
```{r}
dtpart2<-data
```
I need the total number of cases over time, so for each date column I will add up the counts of all countries. The first four columns (`Province.State`, `Country.Region`, `Lat`, `Long`) are metadata, not counts, so they must be excluded from the sum.
```{r}
total_cases <- colSums(dtpart2[, -c(1:4)]) # colSums() adds up all countries for each date column
```
Now, I will produce the data frame to give to ggplot. I also convert the dates to a proper format, using the same code as above.
```{r}
df2 <- data.frame(date = names(total_cases), total_cases = total_cases)
df2$date <- as.Date(str_sub(df2$date, 2, -1), format = "%m.%d.%y")
```
Let's do the graphical representation
```{r}
ggplot(df2, aes(x = date, y = total_cases)) +
geom_line(color="blue")
```
Let's see how it looks after a logarithmic transformation
```{r}
ggplot(df2, aes(x = date, y = log(total_cases))) +
geom_line(color="blue")
```
---
title: "Risk Analysis of the Space Shuttle: Pre-Challenger Prediction of Failure"
author: "Arnaud Legrand"
date: "25 October 2018"
output: pdf_document
---
In this document we reperform some of the analysis provided in *Risk Analysis of the Space Shuttle: Pre-Challenger Prediction of Failure* by *Siddhartha R. Dalal, Edward B. Fowlkes, Bruce Hoadley* published in *Journal of the American Statistical Association*, Vol. 84, No. 408 (Dec., 1989), pp. 945-957 and available at <http://www.jstor.org/stable/2290069>.
On the fourth page of this article, they indicate that the maximum likelihood estimates of the logistic regression using only temperature are: $\hat{\alpha}=5.085$ and $\hat{\beta}=-0.1156$ and their asymptotic standard errors are $s_{\hat{\alpha}}=3.052$ and $s_{\hat{\beta}}=0.047$. The Goodness of fit indicated for this model was $G^2=18.086$ with 21 degrees of freedom. Our goal is to reproduce the computation behind these values and the Figure 4 of this article, possibly in a nicer looking way.
# Technical information on the computer on which the analysis is run
We will be using the R language with the ggplot2 library.
```{r}
library(ggplot2)
sessionInfo()
```
Here are the available libraries
```{r}
devtools::session_info()
```
# Loading and inspecting data
Let's start by reading data:
```{r}
data = read.csv("https://app-learninglab.inria.fr/moocrr/gitlab/moocrr-session3/moocrr-reproducibility-study/raw/master/data/shuttle.csv?inline=false",header=T)
data
```
We know from our previous experience with this data set that filtering the data is a really bad idea. We will therefore keep all observations.
Let's visually inspect how temperature affects malfunction:
```{r}
plot(data=data, Malfunction/Count ~ Temperature, ylim=c(0,1))
```
# Logistic regression
Let's assume O-rings independently fail with the same probability which solely depends on temperature. A logistic regression should allow us to estimate the influence of temperature.
```{r}
logistic_reg = glm(data=data, Malfunction/Count ~ Temperature, weights=Count,
family=binomial(link='logit'))
summary(logistic_reg)
```
The maximum likelihood estimates of the intercept and of Temperature are thus $\hat{\alpha}=5.0849$ and $\hat{\beta}=-0.1156$ and their standard errors are $s_{\hat{\alpha}} = 3.052$ and $s_{\hat{\beta}} = 0.04702$. The Residual deviance corresponds to the Goodness of fit $G^2=18.086$ with 21 degrees of freedom. **I have therefore managed to replicate the results of the Dalal *et al.* article**.
# Predicting failure probability
The temperature when launching the shuttle was 31°F. Let's try to estimate the failure probability at such a temperature using our model:
```{r}
tempv = seq(from=30, to=90, by = .5)
rmv <- predict(logistic_reg,list(Temperature=tempv),type="response")
plot(tempv,rmv,type="l",ylim=c(0,1))
points(data=data, Malfunction/Count ~ Temperature)
```
This figure is very similar to Figure 4 of Dalal *et al.* **I have managed to replicate Figure 4 of the Dalal *et al.* article.**
# Confidence on the prediction
Let's try to plot confidence intervals with ggplot2.
```{r, fig.height=3.3}
ggplot(data, aes(y=Malfunction/Count, x=Temperature)) + geom_point(alpha=.2, size = 2, color="blue") +
geom_smooth(method = "glm", method.args = list(family = "binomial"), fullrange=T) +
xlim(30,90) + ylim(0,1) + theme_bw()
```
Mmmh, I get a warning from ggplot2 indicating *"non-integer #successes in a binomial glm!"*. This seems fishy. Furthermore, this confidence region seems huge... It seems strange to me that the uncertainty grows so large for higher temperatures. And compared to my previous call to glm, I haven't indicated the weights which account for the fact that each ratio Malfunction/Count corresponds to Count observations (if someone knows how to do this...). There must be something wrong.
So let's provide the "raw" data to ggplot2.
```{r}
data_flat=data.frame()
for(i in 1:nrow(data)) {
temperature = data[i,"Temperature"];
malfunction = data[i,"Malfunction"];
d = data.frame(Temperature=temperature,Malfunction=rep(0,times = data[i,"Count"]))
if(malfunction>0) {
d[1:malfunction, "Malfunction"]=1;
}
data_flat=rbind(data_flat,d)
}
dim(data_flat)
str(data_flat)
```
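The explicit loop works, but it grows the data frame row by row. An equivalent vectorized construction in base R (a sketch, assuming the same `Temperature`/`Malfunction`/`Count` columns; shown here on a toy two-flight table rather than the real data):

```r
# Toy aggregated data: 6 O-rings per flight, with the given number of malfunctions.
toy <- data.frame(Temperature = c(66, 70), Malfunction = c(0, 1), Count = c(6, 6))
# One row per O-ring: repeat the temperature Count times, and expand each
# Malfunction/Count pair into Malfunction ones followed by (Count - Malfunction) zeros.
data_flat2 <- data.frame(
  Temperature = rep(toy$Temperature, toy$Count),
  Malfunction = unlist(mapply(function(m, n) c(rep(1, m), rep(0, n - m)),
                              toy$Malfunction, toy$Count, SIMPLIFY = FALSE))
)
nrow(data_flat2) # 12 rows, one per O-ring
```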
Let's check whether I obtain the same regression or not:
```{r}
logistic_reg_flat = glm(data=data_flat, Malfunction ~ Temperature, family=binomial(link='logit'))
summary(logistic_reg_flat)
```
Perfect. The estimates and the standard errors are the same, although the Residual deviance is different since the distance is now measured with respect to each 0/1 measurement and not to the ratios. Let's plot the regression for *data_flat* along with the ratios (*data*).
```{r, fig.height=3.3}
ggplot(data=data_flat, aes(y=Malfunction, x=Temperature)) +
geom_smooth(method = "glm", method.args = list(family = "binomial"), fullrange=T) +
geom_point(data=data, aes(y=Malfunction/Count, x=Temperature),alpha=.2, size = 2, color="blue") +
geom_point(alpha=.5, size = .5) +
xlim(30,90) + ylim(0,1) + theme_bw()
```
This confidence interval seems much more reasonable (in accordance with the data) than the previous one. Let's check whether it corresponds to the prediction obtained by calling predict directly. The prediction can be obtained either on the response scale or through the link function.
Here is the "direct" (response) version I used in my very first plot:
```{r}
pred = predict(logistic_reg_flat,list(Temperature=30),type="response",se.fit = T)
pred
```
The estimated failure probability at 30°F is thus $0.834$. However, the `se.fit` value is hard to use directly: I obviously cannot simply add $\pm 2\,se.fit$ to the fit to compute a confidence interval, since the result could fall outside $[0,1]$.
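As a sanity check, the response-scale value can be recomputed by hand from the rounded coefficient estimates reported above, using `plogis()` (the logistic CDF, i.e. the inverse logit):

```r
# eta = alpha + beta * t at t = 30°F, using the rounded published estimates;
# small discrepancies with predict() are due to that rounding.
eta <- 5.0849 - 0.1156 * 30
plogis(eta) # close to the 0.834 returned by predict()
```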
Here is the "link" version:
```{r}
pred_link = predict(logistic_reg_flat,list(Temperature=30),type="link",se.fit = T)
pred_link
logistic_reg$family$linkinv(pred_link$fit)
```
I recover $0.834$ for the estimated failure probability at 30°F. Going through the *linkinv* function, we can now make use of `se.fit`:
```{r}
critval = 1.96
logistic_reg$family$linkinv(c(pred_link$fit-critval*pred_link$se.fit,
pred_link$fit+critval*pred_link$se.fit))
```
The 95% confidence interval for our estimate is thus $[0.163, 0.992]$, which is what ggplot2 just plotted. This seems coherent.
**I am now rather confident that I have managed to correctly compute and plot the uncertainty of my prediction.** Let's be honest, it took me a while. My first attempts were plainly wrong (I did not know how to do this, so I trusted ggplot2, which I was misusing) and did not use the correct statistical method. I also feel confident now because this has been informally validated by colleagues. Still, it would be interesting to collect the other kinds of plots and values that you obtained, that differ from these, and that you would probably have kept if you had not had a reference to compare to. Please provide us with as many versions as you can.