From 6fda66c2ae3c635ea7dee8eaa3ab95325fe144e8 Mon Sep 17 00:00:00 2001 From: Lisa Fourtune Date: Wed, 12 Jan 2022 16:10:09 +0100 Subject: [PATCH] =?UTF-8?q?Modification=20du=20fichier=20Rmd=20fourni=20po?= =?UTF-8?q?ur=20calculer=20de=20nouveau=20les=20param=C3=A8tres=20du=20glm?= =?UTF-8?q?.?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- module4/MonAnalyse.Rmd | 96 ++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 96 insertions(+) create mode 100644 module4/MonAnalyse.Rmd diff --git a/module4/MonAnalyse.Rmd b/module4/MonAnalyse.Rmd new file mode 100644 index 0000000..aa1bf10 --- /dev/null +++ b/module4/MonAnalyse.Rmd @@ -0,0 +1,96 @@ +--- +title: 'Risk Analysis of the Space Shuttle: Pre-Challenger Prediction of Failure' +author: "Lisa Ftn, based on Arnaud Legrand's work" +date: "12/01/2022" +output: + html_document: + df_print: paged +--- + +In this document we reperform some of the analysis provided in +*Risk Analysis of the Space Shuttle: Pre-Challenger Prediction of Failure* by *Siddhartha R. Dalal, Edward B. Fowlkes, Bruce Hoadley* published in *Journal of the American Statistical Association*, Vol. 84, No. 408 (Dec., 1989), pp. 945-957 and available [here](http://www.jstor.org/stable/2290069). + +On the fourth page of this article, they indicate that the maximum likelihood estimates of the logistic regression using only temperature are: $\hat{\alpha}=5.085$ and $\hat{\beta}=-0.1156$ and their asymptotic standard errors are $s_{\hat{\alpha}}=3.052$ and $s_{\hat{\beta}}=0.047$. The Goodness of fit indicated for this model was $G^2=18.086$ with 21 degrees of freedom. Our goal is to reproduce the computation behind these values and the Figure 4 of this article, possibly in a nicer looking way. + +# Technical information on the computer on which the analysis is run +We will be using the R language using the ggplot2 library. +```{r} +library(ggplot2) +sessionInfo() +``` + +Here are the available libraries +```{r} +devtools::session_info() +``` + + +# Loading and inspecting data +Let's start by reading data: +```{r} +data = read.csv("https://app-learninglab.inria.fr/moocrr/gitlab/moocrr-session3/moocrr-reproducibility-study/raw/master/data/shuttle.csv?inline=false",header=T) +data +``` + +Let's visually inspect how temperature affects malfunction: +```{r} +plot(data=data, Malfunction/Count ~ Temperature, ylim=c(0,1)) +``` + +# Logistic regression + +I'll only work on the raw data, because I find it more elegant than working on ratios. +Let's provide the "raw" data to R. +```{r} +data_flat=data.frame() +for(i in 1:nrow(data)) { + temperature = data[i,"Temperature"]; + malfunction = data[i,"Malfunction"]; + d = data.frame(Temperature=temperature,Malfunction=rep(0,times = data[i,"Count"])) + if(malfunction>0) { + d[1:malfunction, "Malfunction"]=1; + } + data_flat=rbind(data_flat,d) +} +dim(data_flat) +str(data_flat) +``` + +Computing regression parameters. +```{r} +logistic_reg_flat = glm(data=data_flat, Malfunction ~ Temperature, family=binomial(link='logit')) +summary(logistic_reg_flat) +``` +Perfect. The estimates and the standard errors are the same as in the 1989 paper. + +```{r, fig.height=3.3} +ggplot(data=data_flat, aes(y=Malfunction, x=Temperature)) + + geom_smooth(method = "glm", method.args = list(family = "binomial"), fullrange=T) + + geom_point(alpha=.5, size = .5) + + xlim(30,90) + ylim(0,1) + theme_bw() +``` + +Let's check whether this confidence interval corresponds to the prediction obtained when calling directly predict. Obtaining the prediction can be done directly or through the link function. + +Here is the "direct" (response) version I used in my very first plot: +```{r} +pred = predict(logistic_reg_flat,list(Temperature=30),type="response",se.fit = T) +pred +``` +The estimated Failure probability for 30° is thus $0.834$. However, the $se.fit$ value seems pretty hard to use as I can obviously not simply add $\pm 2 se.fit$ to $fit$ to compute a confidence interval. + +Here is the "link" version: +```{r} +pred_link = predict(logistic_reg_flat,list(Temperature=30), type="link", se.fit = T) +pred_link +logistic_reg_flat$family$linkinv(pred_link$fit) +``` +I recover $0.834$ for the estimated Failure probability at 30°. But now, going through the *linkinv* function, we can use $se.fit$: +```{r} +critval = 1.96 +logistic_reg_flat$family$linkinv(c(pred_link$fit-critval*pred_link$se.fit, + pred_link$fit+critval*pred_link$se.fit)) +``` +The 95% confidence interval for our estimation is thus [0.163,0.992]. This is what ggplot2 just plotted me. This seems coherent. + +**I am now rather confident that I have managed to correctly compute and plot the uncertainty of my prediction.** -- 2.18.1