diff --git a/module4/MonAnalyse.Rmd b/module4/MonAnalyse.Rmd new file mode 100644 index 0000000000000000000000000000000000000000..aa1bf10330ca9f23b79dbedb0ebe8ce96b5d482b --- /dev/null +++ b/module4/MonAnalyse.Rmd @@ -0,0 +1,96 @@ +--- +title: 'Risk Analysis of the Space Shuttle: Pre-Challenger Prediction of Failure' +author: "Lisa Ftn, based on Arnaud Legrand's work" +date: "12/01/2022" +output: + html_document: + df_print: paged +--- + +In this document we reperform some of the analysis provided in +*Risk Analysis of the Space Shuttle: Pre-Challenger Prediction of Failure* by *Siddhartha R. Dalal, Edward B. Fowlkes, Bruce Hoadley* published in *Journal of the American Statistical Association*, Vol. 84, No. 408 (Dec., 1989), pp. 945-957 and available [here](http://www.jstor.org/stable/2290069). + +On the fourth page of this article, they indicate that the maximum likelihood estimates of the logistic regression using only temperature are: $\hat{\alpha}=5.085$ and $\hat{\beta}=-0.1156$ and their asymptotic standard errors are $s_{\hat{\alpha}}=3.052$ and $s_{\hat{\beta}}=0.047$. The Goodness of fit indicated for this model was $G^2=18.086$ with 21 degrees of freedom. Our goal is to reproduce the computation behind these values and the Figure 4 of this article, possibly in a nicer looking way. + +# Technical information on the computer on which the analysis is run +We will be using the R language using the ggplot2 library. +```{r} +library(ggplot2) +sessionInfo() +``` + +Here are the available libraries +```{r} +devtools::session_info() +``` + + +# Loading and inspecting data +Let's start by reading data: +```{r} +data = read.csv("https://app-learninglab.inria.fr/moocrr/gitlab/moocrr-session3/moocrr-reproducibility-study/raw/master/data/shuttle.csv?inline=false",header=T) +data +``` + +Let's visually inspect how temperature affects malfunction: +```{r} +plot(data=data, Malfunction/Count ~ Temperature, ylim=c(0,1)) +``` + +# Logistic regression + +I'll only work on the raw data, because I find it more elegant than working on ratios. +Let's provide the "raw" data to R. +```{r} +data_flat=data.frame() +for(i in 1:nrow(data)) { + temperature = data[i,"Temperature"]; + malfunction = data[i,"Malfunction"]; + d = data.frame(Temperature=temperature,Malfunction=rep(0,times = data[i,"Count"])) + if(malfunction>0) { + d[1:malfunction, "Malfunction"]=1; + } + data_flat=rbind(data_flat,d) +} +dim(data_flat) +str(data_flat) +``` + +Computing regression parameters. +```{r} +logistic_reg_flat = glm(data=data_flat, Malfunction ~ Temperature, family=binomial(link='logit')) +summary(logistic_reg_flat) +``` +Perfect. The estimates and the standard errors are the same as in the 1989 paper. + +```{r, fig.height=3.3} +ggplot(data=data_flat, aes(y=Malfunction, x=Temperature)) + + geom_smooth(method = "glm", method.args = list(family = "binomial"), fullrange=T) + + geom_point(alpha=.5, size = .5) + + xlim(30,90) + ylim(0,1) + theme_bw() +``` + +Let's check whether this confidence interval corresponds to the prediction obtained when calling directly predict. Obtaining the prediction can be done directly or through the link function. + +Here is the "direct" (response) version I used in my very first plot: +```{r} +pred = predict(logistic_reg_flat,list(Temperature=30),type="response",se.fit = T) +pred +``` +The estimated Failure probability for 30° is thus $0.834$. However, the $se.fit$ value seems pretty hard to use as I can obviously not simply add $\pm 2 se.fit$ to $fit$ to compute a confidence interval. + +Here is the "link" version: +```{r} +pred_link = predict(logistic_reg_flat,list(Temperature=30), type="link", se.fit = T) +pred_link +logistic_reg_flat$family$linkinv(pred_link$fit) +``` +I recover $0.834$ for the estimated Failure probability at 30°. But now, going through the *linkinv* function, we can use $se.fit$: +```{r} +critval = 1.96 +logistic_reg_flat$family$linkinv(c(pred_link$fit-critval*pred_link$se.fit, + pred_link$fit+critval*pred_link$se.fit)) +``` +The 95% confidence interval for our estimation is thus [0.163,0.992]. This is what ggplot2 just plotted me. This seems coherent. + +**I am now rather confident that I have managed to correctly compute and plot the uncertainty of my prediction.**