Replace challenger.Rmd

parent f88d413a
--- ---
title: "Risk Analysis of the Space Shuttle: Pre-Challenger Prediction of Failure" title: "Risk Analysis of the Space Shuttle: Pre-Challenger Prediction of Failure"
author: "Arnaud Legrand" author: "Tegegne"
date: "25 October 2018 - New Execution 2020-06-11" date: "June 16, 2020"
output: pdf_document output: html_document
--- ---
In this document we reperform some of the analysis provided in
*Risk Analysis of the Space Shuttle: Pre-Challenger Prediction of Failure* by *Siddhartha R. Dalal, Edward B. Fowlkes, Bruce Hoadley* published in *Journal of the American Statistical Association*, Vol. 84, No. 408 (Dec., 1989), pp. 945-957 and available at http://www.jstor.org/stable/2290069.
On the fourth page of this article, they indicate that the maximum likelihood estimates of the logistic regression using only temperature are: $\hat{\alpha}=5.085$ and $\hat{\beta}=-0.1156$ and their asymptotic standard errors are $s_{\hat{\alpha}}=3.052$ and $s_{\hat{\beta}}=0.047$. The Goodness of fit indicated for this model was $G^2=18.086$ with 21 degrees of freedom. Our goal is to reproduce the computation behind these values and the Figure 4 of this article, possibly in a nicer looking way.
# Technical information on the computer on which the analysis is run
We will be using the R language using the ggplot2 library.
```{r} ```{r}
library(ggplot2) library(ggplot2)
sessionInfo() sessionInfo()
...@@ -24,12 +17,8 @@ devtools::session_info() ...@@ -24,12 +17,8 @@ devtools::session_info()
# Loading and inspecting data # Loading and inspecting data
Let's start by reading data:
```{r} ```{r}
#data = read.csv("https://app-learninglab.inria.fr/moocrr/gitlab/moocrr-session3/moocrr-reproducibility-study/tree/master/data/shuttle.csv",header=T) data <- read.csv("~/R/data.csv")
# Link to the dataset modified
data <- read.csv("https://app-learninglab.inria.fr/moocrr/gitlab/moocrr-session3/moocrr-reproducibility-study/raw/master/data/shuttle.csv")
data data
``` ```
...@@ -45,8 +34,7 @@ plot(data=data, Malfunction/Count ~ Temperature, ylim=c(0,1)) ...@@ -45,8 +34,7 @@ plot(data=data, Malfunction/Count ~ Temperature, ylim=c(0,1))
Let's assume O-rings independently fail with the same probability which solely depends on temperature. A logistic regression should allow us to estimate the influence of temperature. Let's assume O-rings independently fail with the same probability which solely depends on temperature. A logistic regression should allow us to estimate the influence of temperature.
```{r} ```{r}
logistic_reg = glm(data=data, Malfunction/Count ~ Temperature, weights=Count, logistic_reg = glm(data=data, Malfunction/Count ~ Temperature, weights=Count, family=binomial(link='logit'))
family=binomial(link='logit'))
summary(logistic_reg) summary(logistic_reg)
``` ```
...@@ -108,6 +96,7 @@ ggplot(data=data_flat, aes(y=Malfunction, x=Temperature)) + ...@@ -108,6 +96,7 @@ ggplot(data=data_flat, aes(y=Malfunction, x=Temperature)) +
This confidence interval seems much more reasonable (in accordance with the data) than the previous one. Let's check whether it corresponds to the prediction obtained when calling directly predict. Obtaining the prediction can be done directly or through the link function. This confidence interval seems much more reasonable (in accordance with the data) than the previous one. Let's check whether it corresponds to the prediction obtained when calling directly predict. Obtaining the prediction can be done directly or through the link function.
Here is the "direct" (response) version I used in my very first plot: Here is the "direct" (response) version I used in my very first plot:
```{r} ```{r}
pred = predict(logistic_reg_flat,list(Temperature=30),type="response",se.fit = T) pred = predict(logistic_reg_flat,list(Temperature=30),type="response",se.fit = T)
...@@ -115,18 +104,21 @@ pred ...@@ -115,18 +104,21 @@ pred
``` ```
The estimated Failure probability for 30° is thus $0.834$. However, the $se.fit$ value seems pretty hard to use as I can obviously not simply add $\pm 2 se.fit$ to $fit$ to compute a confidence interval. The estimated Failure probability for 30° is thus $0.834$. However, the $se.fit$ value seems pretty hard to use as I can obviously not simply add $\pm 2 se.fit$ to $fit$ to compute a confidence interval.
Here is the "link" version: Here is the "link" version:
```{r} ```{r}
pred_link = predict(logistic_reg_flat,list(Temperature=30),type="link",se.fit = T) pred_link = predict(logistic_reg_flat,list(Temperature=30),type="link",se.fit = T)
pred_link pred_link
logistic_reg$family$linkinv(pred_link$fit) logistic_reg$family$linkinv(pred_link$fit)
``` ```
I recover $0.834$ for the estimated Failure probability at 30°. But now, going through the *linkinv* function, we can use $se.fit$: I recover $0.834$ for the estimated Failure probability at 30°. But now, going through the *linkinv* function, we can use $se.fit$:
```{r} ```{r}
critval = 1.96 critval = 1.96
logistic_reg$family$linkinv(c(pred_link$fit-critval*pred_link$se.fit, logistic_reg$family$linkinv(c(pred_link$fit-critval*pred_link$se.fit,
pred_link$fit+critval*pred_link$se.fit)) pred_link$fit+critval*pred_link$se.fit))
``` ```
The 95% confidence interval for our estimation is thus [0.163,0.992]. This is what ggplot2 just plotted me. This seems coherent. The 95% confidence interval for our estimation is thus [0.163,0.992]. This is what ggplot2 just plotted me. This seems coherent.
**I am now rather confident that I have managed to correctly compute and plot the uncertainty of my prediction.** Let's be honnest, it took me a while. My first attempts were plainly wrong (I didn't know how to do this so I trusted ggplot2, which I was misusing) and did not use the correct statistical method. I also feel confident now because this has been somehow validated by other colleagues but it will be interesting that you collect other kind of plots values that you obtained, that differ and that you would probably have kept if you didn't have a reference to compare to. Please provide us with as many versions as you can. **I am now rather confident that I have managed to correctly compute and plot the uncertainty of my prediction.** Let's be honnest, it took me a while. My first attempts were plainly wrong (I didn't know how to do this so I trusted ggplot2, which I was misusing) and did not use the correct statistical method. I also feel confident now because this has been somehow validated by other colleagues but it will be interesting that you collect other kind of plots values that you obtained, that differ and that you would probably have kept if you didn't have a reference to compare to. Please provide us with as many versions as you can.
Markdown is supported
0% or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment