Initial import with a tentative replication of Dalal et al.

Commit 27db58c7 authored Sep 23, 2018 by Arnaud Legrand
Showing with 121 additions and 0 deletions

README.md +31 -0
challenger.pdf +0 -0
results.org +20 -0
src/Python3/challenger.ipynb +0 -0
src/R/challenger.Rmd +70 -0

--- a/README.md
+++ b/README.md
+In this project, we gather attempts to reproduce the Challenger
+study. In particular, we try to redo part of the analysis
+presented in *Risk Analysis of the Space Shuttle: Pre-Challenger
+Prediction of Failure* by *Siddhartha R. Dalal, Edward B. Fowlkes,
+Bruce Hoadley*, published in *Journal of the American Statistical
+Association*, Vol. 84, No. 408 (Dec., 1989), pp. 945-957 and available
+[here](https://studies2.hec.fr/jahia/webdav/site/hec/shared/sites/czellarv/acces_anonyme/OringJASA_1989.pdf)
+(see also [the official JASA
+webpage](http://www.jstor.org/stable/2290069)).
+On the fourth page of this article, the authors report that the maximum
+likelihood estimates of the logistic regression using only temperature
+are $\hat{\alpha}=5.085$ and $\hat{\beta}=-0.1156$, and that their
+asymptotic standard errors are $s_{\hat{\alpha}}=3.052$ and
+$s_{\hat{\beta}}=0.047$. The goodness of fit reported for this model
+is $G^2=18.086$ with 21 degrees of freedom. Our goal is to reproduce
+the computation behind these values and Figure 4 of this article,
+possibly in a nicer-looking way.
+[**Here is our successful replication of the Dalal et al. results using
+R**](challenger.pdf).
+In case it helps, we provide two implementations of this case
+study, but we encourage you to **reimplement them yourself** using both
+your favourite language and another language you do not know yet.
+- A [Jupyter Python3 notebook](src/Python3/challenger.ipynb)
+- An [Rmarkdown document](src/R/challenger.Rmd)
+Then **update the [meta-study result table available
+here](results.org) with your own results**.
--- a/challenger.pdf
+++ b/challenger.pdf
--- a/results.org
+++ b/results.org
+Update the following table with your own results by indicating in each
+column:
+- Language: R, Python3, Julia, Perl, C, ...
+- Language version:
+- Main libraries: please indicate the versions of all the loaded libraries
+- Operating System: Linux, Mac OS X, Windows, Android, ... along with its version
+- $\hat{\alpha}$ and $\hat{\beta}$: Identical, Similar, Different, Non
+  functional (expected values are $5.085$ and $-0.1156$)
+- $s_{\hat{\alpha}}$ and $s_{\hat{\beta}}$: Identical, Similar, Different, Non
+  functional (expected values are $3.052$ and $0.047$)
+- $G^2$ and degrees of freedom: Identical, Similar, Different, Non
+  functional (expected values are $18.086$ and $21$)
+- Figure: Similar, Different, Non functional
+- Confidence region: Similar, Different, Non functional
+| Language | Language version | Main libraries                                                | Operating System            | $\hat{\alpha}$ and $\hat{\beta}$ | $s_{\hat{\alpha}}$ and $s_{\hat{\beta}}$ | $G^2$ and degrees of freedom | Figure    | Confidence Region | Link to the document                  |
+|----------+------------------+---------------------------------------------------------------+-----------------------------+----------------------------------+------------------------------------------+------------------------------+-----------+-------------------+---------------------------------------|
+| R        | 3.5.1            | ggplot2 3.0.0                                                 | Debian GNU/Linux buster/sid | Identical                        | Identical                                | Identical                    | Identical | Identical         | [[file:src/R/challenger.Rmd]]         |
+| Python   | 3.6.5rc1         | statsmodels 0.9.0 numpy 1.14.5 pandas 0.22.0 matplotlib 2.1.1 | Linux Debian 4.15.11-1      | Identical                        | *Different*                              | *Non Functional*             | Identical | *Non Functional*  | [[file:src/Python3/challenger.ipynb]] |
--- a/src/Python3/challenger.ipynb
+++ b/src/Python3/challenger.ipynb
--- a/src/R/challenger.Rmd
+++ b/src/R/challenger.Rmd
+---
+title: "Risk Analysis of the Space Shuttle: Pre-Challenger Prediction of Failure"
+author: "Arnaud Legrand"
+date: "23 September 2018"
+output: pdf_document
+---
+In this document we redo part of the analysis presented in
+*Risk Analysis of the Space Shuttle: Pre-Challenger Prediction of Failure* by *Siddhartha R. Dalal, Edward B. Fowlkes, Bruce Hoadley*, published in *Journal of the American Statistical Association*, Vol. 84, No. 408 (Dec., 1989), pp. 945-957 and available at http://www.jstor.org/stable/2290069.
+On the fourth page of this article, the authors report that the maximum likelihood estimates of the logistic regression using only temperature are $\hat{\alpha}=5.085$ and $\hat{\beta}=-0.1156$, and that their asymptotic standard errors are $s_{\hat{\alpha}}=3.052$ and $s_{\hat{\beta}}=0.047$. The goodness of fit reported for this model is $G^2=18.086$ with 21 degrees of freedom. Our goal is to reproduce the computation behind these values and Figure 4 of this article, possibly in a nicer-looking way.
+# Technical information on the computer on which the analysis is run
+We will use the R language with the ggplot2 library.
+```{r}
+library(ggplot2)
+sessionInfo()
+```
+Here are the loaded packages and their versions:
+```{r}
+devtools::session_info()
+```
+# Loading and inspecting data
+Let's start by reading the data:
+```{r}
+data = read.csv("../../data/shuttle.csv", header=TRUE)
+data
+```
+We know from our previous experience with this data set that filtering the data is a really bad idea. We will therefore process the whole data set as is.
+Let's visually inspect how temperature affects malfunction:
+```{r}
+plot(data=data, Malfunction/Count ~ Temperature, ylim=c(0,1))
+```
+# Logistic regression
+Let's assume that O-rings fail independently, with the same probability, which depends solely on temperature. A logistic regression should allow us to estimate the influence of temperature.
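+As a reminder, and in our own notation (the symbols $t$ for the temperature and $p(t)$ for the per-O-ring failure probability are assumptions made here for illustration; only $\alpha$ and $\beta$ come from the article), this model can be written as:
+$$\log\frac{p(t)}{1-p(t)} = \alpha + \beta\, t, \qquad \mathrm{Malfunction} \sim \mathrm{Binomial}\big(\mathrm{Count},\, p(t)\big)$$
+In the `glm` call below, giving the proportion `Malfunction/Count` as the response together with `weights=Count` is the usual way to express this binomial model in R.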
+```{r}
+logistic_reg = glm(data=data, Malfunction/Count ~ Temperature, weights=Count, 
+                   family=binomial(link='logit'))
+summary(logistic_reg)
+```
+The maximum likelihood estimates of the intercept and of the Temperature coefficient are thus $\hat{\alpha}=5.0849$ and $\hat{\beta}=-0.1156$, and their standard errors are $s_{\hat{\alpha}} = 3.052$ and $s_{\hat{\beta}} = 0.04702$. The residual deviance corresponds to the goodness of fit $G^2=18.086$ with 21 degrees of freedom. **I have therefore managed to replicate the results of the Dalal et al. article**.
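+Rather than reading these values off the printed summary, they can also be extracted from the fitted object and compared with the published ones. The chunk below is only a sketch: it assumes the `logistic_reg` object fitted above, and the names `published` and `ours` are introduced here for illustration.
+```{r}
+# Values published by Dalal et al. (1989), as quoted at the top of this document
+published = c(alpha=5.085, beta=-0.1156, s_alpha=3.052, s_beta=0.047,
+              G2=18.086, df=21)
+# Coefficient table of the fit: one row per term, columns "Estimate", "Std. Error", ...
+est = coef(summary(logistic_reg))
+ours = c(alpha   = est["(Intercept)", "Estimate"],
+         beta    = est["Temperature", "Estimate"],
+         s_alpha = est["(Intercept)", "Std. Error"],
+         s_beta  = est["Temperature", "Std. Error"],
+         G2      = deviance(logistic_reg),      # residual deviance
+         df      = df.residual(logistic_reg))   # residual degrees of freedom
+# Side-by-side comparison with the absolute difference
+round(cbind(published, ours, abs_diff=abs(published - ours)), 4)
+```
+Differences of the order of the rounding used in the article would suggest "Identical" in the meta-study table; anything larger deserves a closer look.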
+# Predicting failure probability
+The temperature at the time of the launch was 31°F. Let's try to
+estimate the failure probability at this temperature using our model:
+```{r}
+# Predict the failure probability on a temperature grid from 30°F to 90°F
+tempv = seq(from=30, to=90, by=.5)
+rmv <- predict(logistic_reg, list(Temperature=tempv), type="response")
+# Fitted curve, with the observed malfunction proportions overlaid
+plot(tempv, rmv, type="l", ylim=c(0,1))
+points(data=data, Malfunction/Count ~ Temperature)
+```
+This figure is very similar to Figure 4 of Dalal et al. **I have managed to replicate Figure 4 of the Dalal et al. article.**
+Let's try to plot confidence intervals, although I am not sure exactly how they are computed.
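+As a rough check by hand, plugging the rounded published estimates into the logistic formula (rather than using the fitted object, so the last digits may differ slightly) gives, for a single O-ring at 31°F:
+$$\hat{p}(31) = \frac{1}{1+e^{-(5.085 - 0.1156 \times 31)}} = \frac{1}{1+e^{-1.5014}} \approx 0.82$$
+This is consistent with the left end of the curve above, keeping in mind that 31°F lies far outside the range of observed temperatures.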
+```{r}
+# weight=Count makes geom_smooth fit the same weighted binomial model as logistic_reg
+ggplot(data, aes(y=Malfunction/Count, x=Temperature, weight=Count)) + geom_point(alpha=.2, size = 2) +
+  geom_smooth(method = "glm", method.args = list(family = "binomial"), fullrange=TRUE) +
+  xlim(30,90) + ylim(0,1) + theme_bw()
+```
+No confidence region was given in the original article. **Let's hope this confidence region estimation is correct.**
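+One standard way to obtain such a region explicitly is to compute a Wald-type interval on the logit scale and map it back to probabilities. The chunk below is a sketch under that assumption; it may or may not match what `geom_smooth` does internally. It relies on the `logistic_reg` and `data` objects defined above, and the `newdata` and `pred` names are introduced here for illustration.
+```{r}
+# Temperature grid on which the band is evaluated
+newdata = data.frame(Temperature=seq(from=30, to=90, by=.5))
+# Linear predictor (logit scale) and its standard error
+pred = predict(logistic_reg, newdata, type="link", se.fit=TRUE)
+z = qnorm(0.975)   # normal quantile for a 95% two-sided interval
+# Map the fit and both bounds back to probabilities with the inverse logit
+newdata$fit   = plogis(pred$fit)
+newdata$lower = plogis(pred$fit - z * pred$se.fit)
+newdata$upper = plogis(pred$fit + z * pred$se.fit)
+ggplot(newdata, aes(x=Temperature)) +
+  geom_ribbon(aes(ymin=lower, ymax=upper), alpha=.2) +
+  geom_line(aes(y=fit)) +
+  geom_point(data=data, aes(y=Malfunction/Count), alpha=.2, size=2) +
+  ylim(0,1) + theme_bw()
+```
+Building the interval on the logit scale keeps both bounds between 0 and 1, which a symmetric interval computed directly on the probability scale would not guarantee.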