--- title: "Chalenger" author: "Pierre Lacoste" date: "14/12/2021" output: html_document --- ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE) ``` ## Information technique pour lesquelles l'analyse est executé: Librairie `ggplot2` ```{r} library(ggplot2) sessionInfo() ``` ## Chargement et inspection des données ```{r} data = read.csv("https://app-learninglab.inria.fr/moocrr/gitlab/moocrr-session3/moocrr-reproducibility-study/raw/master/data/shuttle.csv?inline=false") data ``` Trion les données même si nous savons que ce n'est pas une bonne idée. On commence par inspecter les données visuellement. ```{r} plot(data=data, Malfunction/Count ~ Temperature, ylim=c(0,1)) ``` ## Regression logistique On assume que le seul parmètre qui influence le risque de défaillance est la température. On peut donc faire une regression logistique pour estimer l'effet de la température sur le risque de défaillance. ```{r} logistic_reg = glm(data=data, Malfunction/Count ~ Temperature, weights=Count, family=binomial(link='logit')) summary(logistic_reg) ``` The maximum likelyhood estimator of the intercept and of Temperature are thus $\alpha = 5.0849$ and $\beta = −0.1156$ and their standard errors are $s_\alpha = 3.052$ and $s_\beta = 0.04702$. The Residual deviance corresponds to the Goodness of fit G2 = 18.086 with 21 degrees of freedom. I have therefore managed to replicate the results of the Dalal et al. article. ## Prediction de la probabilité de defaillance: ```{r} # shuttle=shuttle[shuttle$r!=0,] tempv = seq(from=30, to=90, by = .5) rmv <- predict(logistic_reg,list(Temperature=tempv),type="response") plot(tempv,rmv,type="l",ylim=c(0,1)) points(data=data, Malfunction/Count ~ Temperature) ``` Figure similaire à la figure 4 de l'article. ## Intervale de confiance de la prédiction ```{r} ggplot(data, aes(y=Malfunction/Count, x=Temperature)) + geom_point(alpha=.2, size = 2, color="blue") + geom_smooth(method = "glm", method.args = list(family = "binomial"), fullrange=T) + xlim(30,90) + ylim(0,1) + theme_bw() ``` Message d'erreur de ggplot. On donne les donnée brute à ggplot ```{r} data_flat=data.frame() for(i in 1:nrow(data)) { temperature = data[i,"Temperature"]; malfunction = data[i,"Malfunction"]; d = data.frame(Temperature=temperature,Malfunction=rep(0,times = data[i,"Count"])) if(malfunction>0) { d[1:malfunction, "Malfunction"]=1; } data_flat=rbind(data_flat,d) } dim(data_flat) ``` ```{r} str(data_flat) ``` ```{r} logistic_reg_flat = glm(data=data_flat, Malfunction ~ Temperature, family=binomial(link='logit')) summary(logistic_reg) ``` ```{r} ggplot(data=data_flat, aes(y=Malfunction, x=Temperature)) + geom_smooth(method = "glm", method.args = list(family = "binomial"), fullrange=T) + geom_point(data=data, aes(y=Malfunction/Count, x=Temperature),alpha=.2, size = 2, color="blue") + geom_point(alpha=.5, size = .5) + xlim(30,90) + ylim(0,1) + theme_bw() ``` ```{r} pred = predict(logistic_reg_flat,list(Temperature=30),type="response",se.fit = T) pred ``` ```{r} pred_link = predict(logistic_reg_flat,list(Temperature=30),type="link",se.fit = T) pred_link ``` ```{r} logistic_reg$family$linkinv(pred_link$fit) ``` ```{r} critval = 1.96 logistic_reg$family$linkinv(c(pred_link$fit-critval*pred_link$se.fit, pred_link$fit+critval*pred_link$se.fit)) ```