Modification du fichier Rmd fourni pour calculer de nouveau les paramètres du glm.

6fda66c2 · Lisa Fourtune · dfdf45ed · 6fda66c2
Commit 6fda66c2 authored Jan 12, 2022 by Lisa Fourtune
Hide whitespace changes
Inline Side-by-side

Showing with 96 additions and 0 deletions

MonAnalyse.Rmd module4/MonAnalyse.Rmd +96 -0

No files found.
--- a/module4/MonAnalyse.Rmd
+++ b/module4/MonAnalyse.Rmd
+---
+title: 'Risk Analysis of the Space Shuttle: Pre-Challenger Prediction of Failure'
+author: "Lisa Ftn, based on Arnaud Legrand's work"
+date: "12/01/2022"
+output:
+  html_document:
+    df_print: paged
+---
+In this document we reperform some of the analysis provided in
+*Risk Analysis of the Space Shuttle: Pre-Challenger Prediction of Failure* by *Siddhartha R. Dalal, Edward B. Fowlkes, Bruce Hoadley* published in *Journal of the American Statistical Association*, Vol. 84, No. 408 (Dec., 1989), pp. 945-957 and available [here](http://www.jstor.org/stable/2290069).
+On the fourth page of this article, they indicate that the maximum likelihood estimates of the logistic regression using only temperature are: $\hat{\alpha}=5.085$ and $\hat{\beta}=-0.1156$ and their asymptotic standard errors are $s_{\hat{\alpha}}=3.052$ and $s_{\hat{\beta}}=0.047$. The Goodness of fit indicated for this model was $G^2=18.086$ with 21 degrees of freedom. Our goal is to reproduce the computation behind these values and the Figure 4 of this article, possibly in a nicer looking way.
+# Technical information on the computer on which the analysis is run
+We will be using the R language using the ggplot2 library.
+```{r}
+library(ggplot2)
+sessionInfo()
+```
+Here are the available libraries
+```{r}
+devtools::session_info()
+```
+# Loading and inspecting data
+Let's start by reading data:
+```{r}
+data = read.csv("https://app-learninglab.inria.fr/moocrr/gitlab/moocrr-session3/moocrr-reproducibility-study/raw/master/data/shuttle.csv?inline=false",header=T)
+data
+```
+Let's visually inspect how temperature affects malfunction:
+```{r}
+plot(data=data, Malfunction/Count ~ Temperature, ylim=c(0,1))
+```
+# Logistic regression
+I'll only work on the raw data, because I find it more elegant than working on ratios.
+Let's provide the "raw" data to R.
+```{r}
+data_flat=data.frame()
+for(i in 1:nrow(data)) {
+  temperature = data[i,"Temperature"];
+  malfunction = data[i,"Malfunction"];
+  d = data.frame(Temperature=temperature,Malfunction=rep(0,times = data[i,"Count"]))
+  if(malfunction>0) {
+      d[1:malfunction, "Malfunction"]=1;
+  }
+  data_flat=rbind(data_flat,d)
+}
+dim(data_flat)
+str(data_flat)
+```
+Computing regression parameters.
+```{r}
+logistic_reg_flat = glm(data=data_flat, Malfunction ~ Temperature, family=binomial(link='logit'))
+summary(logistic_reg_flat)
+```
+Perfect. The estimates and the standard errors are the same as in the 1989 paper.
+```{r, fig.height=3.3}
+ggplot(data=data_flat, aes(y=Malfunction, x=Temperature)) +
+  geom_smooth(method = "glm", method.args = list(family = "binomial"), fullrange=T) +
+  geom_point(alpha=.5, size = .5) +
+  xlim(30,90) + ylim(0,1) + theme_bw()
+```
+Let's check whether this confidence interval corresponds to the prediction obtained when calling directly predict. Obtaining the prediction can be done directly or through the link function.
+Here is the "direct" (response) version I used in my very first plot:
+```{r}
+pred = predict(logistic_reg_flat,list(Temperature=30),type="response",se.fit = T)
+pred
+```
+The estimated Failure probability for 30° is thus $0.834$. However, the $se.fit$ value seems pretty hard to use as I can obviously not simply add $\pm 2 se.fit$ to $fit$ to compute a confidence interval.
+Here is the "link" version:
+```{r}
+pred_link = predict(logistic_reg_flat,list(Temperature=30), type="link", se.fit = T)
+pred_link
+logistic_reg_flat$family$linkinv(pred_link$fit)
+```
+I recover $0.834$ for the estimated Failure probability at 30°. But now, going through the *linkinv* function, we can use $se.fit$:
+```{r}
+critval = 1.96
+logistic_reg_flat$family$linkinv(c(pred_link$fit-critval*pred_link$se.fit,
+                              pred_link$fit+critval*pred_link$se.fit))
+```
+The 95% confidence interval for our estimation is thus [0.163,0.992]. This is what ggplot2 just plotted me. This seems coherent.
+**I am now rather confident that I have managed to correctly compute and plot the uncertainty of my prediction.**