"**I think I have managed to correctly compute and plot the uncertainty of my prediction.** Although the shaded area seems very similar to [the one obtained by with R](https://app-learninglab.inria.fr/gitlab/moocrr-session1/moocrr-reproducibility-study/raw/5c9dbef11b4d7638b7ddf2ea71026e7bf00fcfb0/challenger.pdf), I can spot a few differences (e.g., the blue point for temperature 63 is outside)... Could this be a numerical error ? Or a difference in the statistical method ? It is not clear which one is \"right\"."
"**I think I have managed to correctly compute and plot the uncertainty of my prediction.** Although the shaded area seems very similar to [the one obtained by with R](https://app-learninglab.inria.fr/moocrr/gitlab/moocrr-session3/moocrr-reproducibility-study/tree/master/challenger.pdf), I can spot a few differences (e.g., the blue point for temperature 63 is outside)... Could this be a numerical error ? Or a difference in the statistical method ? It is not clear which one is \"right\"."
"**I think I have managed to correctly compute and plot the uncertainty of my prediction.** Although the shaded area seems very similar to [the one obtained by with R](https://app-learninglab.inria.fr/gitlab/moocrr-session1/moocrr-reproducibility-study/raw/5c9dbef11b4d7638b7ddf2ea71026e7bf00fcfb0/challenger.pdf), I can spot a few differences (e.g., the blue point for temperature 63 is outside)... Could this be a numerical error ? Or a difference in the statistical method ? It is not clear which one is \"right\"."
"**I think I have managed to correctly compute and plot the uncertainty of my prediction.** Although the shaded area seems very similar to [the one obtained by with R](https://app-learninglab.inria.fr/moocrr/gitlab/moocrr-session3/moocrr-reproducibility-study/tree/master/challenger.pdf), I can spot a few differences (e.g., the blue point for temperature 63 is outside)... Could this be a numerical error ? Or a difference in the statistical method ? It is not clear which one is \"right\"."
In this document we reperform some of the analysis provided in
*Risk Analysis of the Space Shuttle: Pre-Challenger Prediction of Failure* by *Siddhartha R. Dalal, Edward B. Fowlkes, Bruce Hoadley* published in *Journal of the American Statistical Association*, Vol. 84, No. 408 (Dec., 1989), pp. 945-957 and available at http://www.jstor.org/stable/2290069.
In this document we reperform some of the analysis provided in
*Risk Analysis of the Space Shuttle: Pre-Challenger Prediction of Failure* by *Siddhartha R. Dalal, Edward B. Fowlkes, Bruce Hoadley* published in *Journal of the American Statistical Association*, Vol. 84, No. 408 (Dec., 1989), pp. 945-957 and available at http://www.jstor.org/stable/2290069.
On the fourth page of this article, they indicate that the maximum likelihood estimates of the logistic regression using only temperature are: $\hat{\alpha}=5.085$ and $\hat{\beta}=-0.1156$ and their asymptotic standard errors are $s_{\hat{\alpha}}=3.052$ and $s_{\hat{\beta}}=0.047$. The Goodness of fit indicated for this model was $G^2=18.086$ with 21 degrees of freedom. Our goal is to reproduce the computation behind these values and the Figure 4 of this article, possibly in a nicer looking way.
...
...
@@ -26,7 +26,7 @@ devtools::session_info()
# Loading and inspecting data
Let's start by reading data:
```{r}
data = read.csv("https://app-learninglab.inria.fr/gitlab/moocrr-session1/moocrr-reproducibility-study/raw/master/data/shuttle.csv",header=T)
data = read.csv("https://app-learninglab.inria.fr/moocrr/gitlab/moocrr-session3/moocrr-reproducibility-study/tree/master/data/shuttle.csv",header=T)
Let's assume O-rings independently fail with the same probability which solely depends on temperature. A logistic regression should allow us to estimate the influence of temperature.
The maximum likelyhood estimator of the intercept and of Temperature are thus $\hat{\alpha}=5.0849$ and $\hat{\beta}=-0.1156$ and their standard errors are $s_{\hat{\alpha}} = 3.052$ and $s_{\hat{\beta}} = 0.04702$. The Residual deviance corresponds to the Goodness of fit $G^2=18.086$ with 21 degrees of freedom. **I have therefore managed to replicate the results of the Dalal *et al.* article**.
# Predicting failure probability
The temperature when launching the shuttle was 31°F. Let's try to
The temperature when launching the shuttle was 31°F. Let's try to
estimate the failure probability for such temperature using our model.:
Perfect. The estimates and the standard errors are the same although the Residual deviance is difference since the distance is now measured with respect to each 0/1 measurement and not to ratios. Let's use plot the regression for *data_flat* along with the ratios (*data*).
: Fehler in stats::model.frame(formula = Malfunction/Count - Temperature, :
: Fehler in stats::model.frame(formula = Malfunction/Count - Temperature, :
: Objekt 'Malfunction' nicht gefunden
:
:
: Fehler in summary(logistic_reg) : Objekt 'logistic_reg' nicht gefunden
The maximum likelyhood estimatator of the intercept and of Temperature are thus \hat{\alpha}=? and \hat{\beta}=? and their standard errors are s_{\hat{\alpha}}=? and s_{\hat{\beta}}=?. The Residual deviance corresponds to the Goodness of fit G^2 = ? with ? degrees of freedom. Since some function does not operate as in the example given by Arnaud Legrand: *I have therefore _not yet_ managed to replicate the results of the Dalal /et.al./ article*.
...
...
@@ -150,7 +150,7 @@ The maximum likelyhood estimatator of the intercept and of Temperature are thus
* Predicting failure probability
The temperature when launching the shuttle was 31°F. Let'strytoestimatethefailureprobabilityforsuchtemperatureusingourmodel:
@@ -175,7 +175,7 @@ ggplot(data, aes(y=Malfunction/Count, x=Temperature)) + geom_point(alpha=.2, siz
Apparently I don'thavethewarningArnaudLegrandmentionsfrom~ggplot2~indicating/"non-integer #successes in a binomial glm!"/.Thisseemsfishyforhim,butnotforme.But:yes,thisconfidenceregionseemshuge...Itseemsstrangetomethattheuncertaintygrowssolargeforhighertemperatures.Andcomparedtomypreviouscallto~glm~,Ihaven't indicated the weight which accounts for the fact that each ration Malfunction/Count corresponds to ~Count~ observations (if someone knows how to do this ...) There must be something wrong.
So let'sprovidethe"raw"datato~ggplot2~.
So let'sprovidethe"raw"datato~ggplot2~.
#+begin_srcR:resultsoutput:session*R*:exportsboth
data_flat=data.frame()
for(iin1:nrow(data)){
...
...
@@ -191,7 +191,7 @@ dim(data_flat)
#+end_src
#+RESULTS:
:
:
:[1]1382
#+begin_srcR:resultsoutput:session*R*:exportsboth
...
...
@@ -210,20 +210,20 @@ summary(logistic_reg)
#+end_src
#+RESULTS:
: Fehler in stats::model.frame(formula = Malfunction - Temperature, data = data_flat, :
: Fehler in stats::model.frame(formula = Malfunction - Temperature, data = data_flat, :
: Objekt 'Malfunction' nicht gefunden
:
:
: Fehler in summary(logistic_reg) : Objekt 'logistic_reg' nicht gefunden
So, again these objects are not here (same error as above, probably). So, for Arnaud Legrand this is perfect because he sees a result. For me it is not. The estimates and the standard errors for him are the same although the Residual deviance is difference since the distance is now measured with respect to each 0/1 measurement and not to rations. Let'suseplottheregressionfor/~data_flat~/alongwiththeratios(/~data~/).
Thisconfidenceintervalseemsmuchmorereasonable(inaccordancewiththedata)thanthepreviousone.Let's check whether it corresponds to the prediction obtained when calling directly predict. Obtaining the prediction can be done directly or through the link function.
Thisconfidenceintervalseemsmuchmorereasonable(inaccordancewiththedata)thanthepreviousone.Let's check whether it corresponds to the prediction obtained when calling directly predict. Obtaining the prediction can be done directly or through the link function.
Here is the "direct" (response) version I used in my very first plot:
#+begin_src R :results output :session *R* :exports both
...
...
@@ -232,9 +232,9 @@ pred
#+end_src
#+RESULTS:
: Fehler in predict(logistic_reg_flat, list(Temperature = 30), type = "response", :
: Fehler in predict(logistic_reg_flat, list(Temperature = 30), type = "response", :
: Objekt 'logistic_reg_flat' nicht gefunden
:
:
: Fehler: Objekt 'pred' nicht gefunden
Again, in my version of this document I cannot find the above defined object anymore. So, I cannot replicate what Arnaud Legrand wrote: The estimated Failure probability for 30° is thus ??. However the /se.fit/ value seems pretty hard to use as I can obviously not simply add \pm2 /se.fit/ to /fit/ to compute a confidence interval.
...
...
@@ -246,9 +246,9 @@ pred.link
#+end_src
#+RESULTS:
: Fehler in predict(logistic_reg_flat, list(Temperature = 39), type = "link", :
: Fehler in predict(logistic_reg_flat, list(Temperature = 39), type = "link", :
: Objekt 'logistic_reg_flat' nicht gefunden
:
:
: Fehler: Objekt 'pred.link' nicht gefunden
#+begin_src R :results output :session *R* :exports both
The 95% confidence interval for our estimation is thus [??,??]. This is what ~ggplot2~ just plotted me. This seems coherent.
*I am now _not yet_ rather confident that I have managed to correctly compute and plot uncertainty of my prediction.* Let'sbehonnest,ittookmeawile.Myfirstattemptswereplainlywrong(Ididn't know how to do this so I trusted ~ggplot2~, which I was misusing) and did not use the correct statistical method. I also feel confident now becuase this has been somehow validated by other colleagues but it will be interesting that you collect other kind of plot values that you obtained ,that differ and that you would probably have kept if you didn'thaveareferencetocompareto.Please,provideuswithasmanyversionsasyoucan.
*I am now _not yet_ rather confident that I have managed to correctly compute and plot uncertainty of my prediction.* Let'sbehonnest,ittookmeawile.Myfirstattemptswereplainlywrong(Ididn't know how to do this so I trusted ~ggplot2~, which I was misusing) and did not use the correct statistical method. I also feel confident now becuase this has been somehow validated by other colleagues but it will be interesting that you collect other kind of plot values that you obtained ,that differ and that you would probably have kept if you didn'thaveareferencetocompareto.Please,provideuswithasmanyversionsasyoucan.
So,I'm disappointed because some error in R or in my config leads to the fact that some objects are forgotten between blocks. I will try to export the whole document in pdf to see if that changes. Unfortunately, it doesn't.RightnowIdonothavethetimetofigureoutbymyselfhowtochangethis.SoIwillonlyuploadthisdocumentandhopeitstillcontributestothedatabaseinsomeway.