From 6fda66c2ae3c635ea7dee8eaa3ab95325fe144e8 Mon Sep 17 00:00:00 2001
From: Lisa Fourtune <lfourtun@ad.envt.fr>
Date: Wed, 12 Jan 2022 16:10:09 +0100
Subject: [PATCH] =?UTF-8?q?Modification=20du=20fichier=20Rmd=20fourni=20po?=
 =?UTF-8?q?ur=20calculer=20de=20nouveau=20les=20param=C3=A8tres=20du=20glm?=
 =?UTF-8?q?.?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 module4/MonAnalyse.Rmd | 96 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 96 insertions(+)
 create mode 100644 module4/MonAnalyse.Rmd

diff --git a/module4/MonAnalyse.Rmd b/module4/MonAnalyse.Rmd
new file mode 100644
index 0000000..aa1bf10
--- /dev/null
+++ b/module4/MonAnalyse.Rmd
@@ -0,0 +1,96 @@
+---
+title: 'Risk Analysis of the Space Shuttle: Pre-Challenger Prediction of Failure'
+author: "Lisa Ftn, based on Arnaud Legrand's work"
+date: "12/01/2022"
+output:
+  html_document:
+    df_print: paged
+---
+
+In this document we reperform some of the analysis provided in
+*Risk Analysis of the Space Shuttle: Pre-Challenger Prediction of Failure* by *Siddhartha R. Dalal, Edward B. Fowlkes, Bruce Hoadley* published in *Journal of the American Statistical Association*, Vol. 84, No. 408 (Dec., 1989), pp. 945-957 and available [here](http://www.jstor.org/stable/2290069).
+
+On the fourth page of this article, they indicate that the maximum likelihood estimates of the logistic regression using only temperature are: $\hat{\alpha}=5.085$ and $\hat{\beta}=-0.1156$ and their asymptotic standard errors are $s_{\hat{\alpha}}=3.052$ and $s_{\hat{\beta}}=0.047$. The Goodness of fit indicated for this model was $G^2=18.086$ with 21 degrees of freedom. Our goal is to reproduce the computation behind these values and the Figure 4 of this article, possibly in a nicer looking way.
+
+# Technical information on the computer on which the analysis is run
+We will be using the R language using the ggplot2 library.
+```{r}
+library(ggplot2)
+sessionInfo()
+```
+
+Here are the available libraries
+```{r}
+devtools::session_info()
+```
+
+
+# Loading and inspecting data
+Let's start by reading data:
+```{r}
+data = read.csv("https://app-learninglab.inria.fr/moocrr/gitlab/moocrr-session3/moocrr-reproducibility-study/raw/master/data/shuttle.csv?inline=false",header=T)
+data
+```
+
+Let's visually inspect how temperature affects malfunction:
+```{r}
+plot(data=data, Malfunction/Count ~ Temperature, ylim=c(0,1))
+```
+
+# Logistic regression
+
+I'll only work on the raw data, because I find it more elegant than working on ratios.
+Let's provide the "raw" data to R.
+```{r}
+data_flat=data.frame()
+for(i in 1:nrow(data)) {
+  temperature = data[i,"Temperature"];
+  malfunction = data[i,"Malfunction"];
+  d = data.frame(Temperature=temperature,Malfunction=rep(0,times = data[i,"Count"]))
+  if(malfunction>0) {
+      d[1:malfunction, "Malfunction"]=1;
+  }
+  data_flat=rbind(data_flat,d)
+}
+dim(data_flat)
+str(data_flat)
+```
+
+Computing regression parameters.
+```{r}
+logistic_reg_flat = glm(data=data_flat, Malfunction ~ Temperature, family=binomial(link='logit'))
+summary(logistic_reg_flat)
+```
+Perfect. The estimates and the standard errors are the same as in the 1989 paper.
+
+```{r, fig.height=3.3}
+ggplot(data=data_flat, aes(y=Malfunction, x=Temperature)) +
+  geom_smooth(method = "glm", method.args = list(family = "binomial"), fullrange=T) +
+  geom_point(alpha=.5, size = .5) +
+  xlim(30,90) + ylim(0,1) + theme_bw()
+```
+
+Let's check whether this confidence interval corresponds to the prediction obtained when calling directly predict. Obtaining the prediction can be done directly or through the link function.
+
+Here is the "direct" (response) version I used in my very first plot:
+```{r}
+pred = predict(logistic_reg_flat,list(Temperature=30),type="response",se.fit = T)
+pred
+```
+The estimated Failure probability for 30° is thus $0.834$. However, the $se.fit$ value seems pretty hard to use as I can obviously not simply add $\pm 2 se.fit$ to $fit$ to compute a confidence interval.
+
+Here is the "link" version:
+```{r}
+pred_link = predict(logistic_reg_flat,list(Temperature=30), type="link", se.fit = T)
+pred_link
+logistic_reg_flat$family$linkinv(pred_link$fit)
+```
+I recover $0.834$ for the estimated Failure probability at 30°. But now, going through the *linkinv* function, we can use $se.fit$:
+```{r}
+critval = 1.96
+logistic_reg_flat$family$linkinv(c(pred_link$fit-critval*pred_link$se.fit,
+                              pred_link$fit+critval*pred_link$se.fit))
+```
+The 95% confidence interval for our estimation is thus [0.163,0.992]. This is what ggplot2 just plotted me. This seems coherent.
+
+**I am now rather confident that I have managed to correctly compute and plot the uncertainty of my prediction.** 
-- 
2.18.1