Commit 8a5d424d authored by Helene31's avatar Helene31
parents 4afaffe3 527f6951
......@@ -49,4 +49,8 @@ Non functional (expected values are $`5.085`$ and $`-0.1156`$)
| -------- | ---------------- | ------------------------------------------------------------- | ------- | ---------------------------- | ------------- | ---------- | ---------- | --------- | ---------- | ----------------------------------------------------------------- | ----------- |
| R | 3.5.1 | ggplot2 3.0.0 | RStudio | Debian GNU/Linux buster/sid | Identical | Identical | Identical | Identical | Identical | [Rmd](src/R/challenger.Rmd), [pdf](src/R/challenger_debian_alegrand.pdf) | A. Legrand |
| Python | 3.6.4 | statsmodels 0.9.0 numpy 1.13.3 pandas 0.22.0 matplotlib 2.2.2 | Jupyter | Linux Ubuntu 4.4.0-116-generic | Identical | Identical | Identical | Identical | Similar | [ipynb](src/Python3/challenger.ipynb), [pdf](src/Python3/challenger_ubuntuMOOC_alegrand.pdf) | A. Legrand |
| R | 3.5.1 | ggplot2 3.0.0 | RStudio | Windows >= 8 x64 (build 9200) | Identical | Identical | Identical | Identical | Similar | [Rmd](https://app-learninglab.inria.fr/moocrr/gitlab/8517fa92e97b3a318e653caefbfde6b5/mooc-rr/blob/master/module4/MOOC_exercice_module4.Rmd), [Pdf](https://app-learninglab.inria.fr/moocrr/gitlab/8517fa92e97b3a318e653caefbfde6b5/mooc-rr/blob/master/module4/MOOC_exercice_module4.pdf) | M. Saubin |
| Python | 3.6.4 | statsmodels 0.9.0 numpy 1.15.2 pandas 0.22.0 matplotlib 2.2.3 | Jupyter | Linux Ubuntu 4.4.0-164-generic | Identical | Identical | Identical | Identical | Similar | [ipynb](module4/challenger_Python_ipynb.ipynb), [pdf](module4/challenger_Python_ipynb.pdf) | 2992438755465b7fe3afd7856bde0599 |
| R | 3.4.4 | ggplot2_3.3.0 | RStudio | Linux Mint 19 | Identical | Identical | Identical | Identical | Identical | [Rmd](https://app-learninglab.inria.fr/moocrr/gitlab/b2c48a7ab4afbff5f4d26650b09eb6b4/mooc-rr/blob/master/module4/challenger_reexecuted.Rmd), [html](https://app-learninglab.inria.fr/moocrr/gitlab/b2c48a7ab4afbff5f4d26650b09eb6b4/mooc-rr/blob/master/module4/challenger_reexecuted.html) | b2c48a7ab4afbff5f4d26650b09eb6b4 |
| Python | 3.6.4 | statsmodels 0.9.0 numpy 1.15.2 pandas 0.22.0 matplotlib 2.2.3 | Jupyter | Linux Ubuntu 4.4.0-164-generic | Identical | Identical | Identical | Identical | Similar | [ipynb](https://app-learninglab.inria.fr/moocrr/gitlab/34ea1ee296fc8711adf020d9cc2cb571/mooc-rr/blob/master/module4/challenger.ipynb), [pdf](https://app-learninglab.inria.fr/moocrr/gitlab/34ea1ee296fc8711adf020d9cc2cb571/mooc-rr/blob/master/module4/challenger.pdf) | 34ea1ee296fc8711adf020d9cc2cb571 |
| Matlab | 9.6.0.1072779 (R2019a) | | Matlab Live Script | Windows 10.0.18362 | Identical | Identical | Non Functional | Similar | Did not succeed | [mlx](https://app-learninglab.inria.fr/moocrr/gitlab/34ea1ee296fc8711adf020d9cc2cb571/mooc-rr/blob/master/module4/challenger.mlx), [pdf](https://app-learninglab.inria.fr/moocrr/gitlab/34ea1ee296fc8711adf020d9cc2cb571/mooc-rr/blob/master/module4/challenger_matlab.pdf) | 34ea1ee296fc8711adf020d9cc2cb571 |
\ No newline at end of file
......@@ -434,7 +434,7 @@
}
],
"source": [
"data = pd.read_csv(\"https://app-learninglab.inria.fr/gitlab/moocrr-session1/moocrr-reproducibility-study/raw/master/data/shuttle.csv\")\n",
"data = pd.read_csv(\"https://app-learninglab.inria.fr/moocrr/gitlab/moocrr-session3/moocrr-reproducibility-study/blob/master/data/shuttle.csv\")\n",
"data"
]
},
......@@ -751,7 +751,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"**I think I have managed to correctly compute and plot the uncertainty of my prediction.** Although the shaded area seems very similar to [the one obtained with R](https://app-learninglab.inria.fr/gitlab/moocrr-session1/moocrr-reproducibility-study/raw/5c9dbef11b4d7638b7ddf2ea71026e7bf00fcfb0/challenger.pdf), I can spot a few differences (e.g., the blue point for temperature 63 is outside)... Could this be a numerical error? Or a difference in the statistical method? It is not clear which one is \"right\"."
"**I think I have managed to correctly compute and plot the uncertainty of my prediction.** Although the shaded area seems very similar to [the one obtained with R](https://app-learninglab.inria.fr/moocrr/gitlab/moocrr-session3/moocrr-reproducibility-study/tree/master/challenger.pdf), I can spot a few differences (e.g., the blue point for temperature 63 is outside)... Could this be a numerical error? Or a difference in the statistical method? It is not clear which one is \"right\"."
]
}
],
......
......@@ -424,7 +424,7 @@
}
],
"source": [
"data = pd.read_csv(\"https://app-learninglab.inria.fr/gitlab/moocrr-session1/moocrr-reproducibility-study/raw/master/data/shuttle.csv\")\n",
"data = pd.read_csv(\"https://app-learninglab.inria.fr/moocrr/gitlab/moocrr-session3/moocrr-reproducibility-study/blob/master/data/data/shuttle.csv\")\n",
"data"
]
},
......@@ -833,7 +833,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"**I think I have managed to correctly compute and plot the uncertainty of my prediction.** Although the shaded area seems very similar to [the one obtained with R](https://app-learninglab.inria.fr/gitlab/moocrr-session1/moocrr-reproducibility-study/raw/5c9dbef11b4d7638b7ddf2ea71026e7bf00fcfb0/challenger.pdf), I can spot a few differences (e.g., the blue point for temperature 63 is outside)... Could this be a numerical error? Or a difference in the statistical method? It is not clear which one is \"right\"."
"**I think I have managed to correctly compute and plot the uncertainty of my prediction.** Although the shaded area seems very similar to [the one obtained with R](https://app-learninglab.inria.fr/moocrr/gitlab/moocrr-session3/moocrr-reproducibility-study/tree/master/challenger.pdf), I can spot a few differences (e.g., the blue point for temperature 63 is outside)... Could this be a numerical error? Or a difference in the statistical method? It is not clear which one is \"right\"."
]
}
],
......
......@@ -5,12 +5,12 @@
* Risk Analysis of the Space Shuttle: Pre-Challenger Prediction of Failure
In this document we reperform some of the analysis provided in
/Risk Analysis of the Space Shuttle: Pre-Challenger Prediction of
Failure/ by /Siddhartha R. Dalal, Edward B. Fowlkes, Bruce Hoadley/
published in /Journal of the American Statistical Association/, Vol. 84,
No. 408 (Dec., 1989), pp. 945-957 and available at
http://www.jstor.org/stable/2290069.
On the fourth page of this article, they indicate that the maximum
likelihood estimates of the logistic regression using only temperature
......@@ -30,7 +30,7 @@ and numpy library.
def print_imported_modules():
import sys
for name, val in sorted(sys.modules.items()):
if(hasattr(val, '__version__')):
print(val.__name__, val.__version__)
# else:
# print(val.__name__, "(unknown version)")
......@@ -55,7 +55,7 @@ print_imported_modules()
Let's start by reading data.
#+begin_src python :results output :session :exports both
data = pd.read_csv("https://app-learninglab.inria.fr/gitlab/moocrr-session1/moocrr-reproducibility-study/raw/master/data/shuttle.csv")
data = pd.read_csv("https://app-learninglab.inria.fr/moocrr/gitlab/moocrr-session3/moocrr-reproducibility-study/tree/master/data/shuttle.csv")
print(data)
#+end_src
......@@ -87,7 +87,7 @@ import statsmodels.api as sm
data["Success"]=data.Count-data.Malfunction
data["Intercept"]=1
logmodel=sm.GLM(data['Frequency'], data[['Intercept','Temperature']],
family=sm.families.Binomial(sm.families.links.logit)).fit()
print(logmodel.summary())
......@@ -95,7 +95,7 @@ print(logmodel.summary())
The maximum likelihood estimator of the intercept and of Temperature
are thus *$\hat{\alpha}$ = 5.0850* and *$\hat{\beta}$ = -0.1156*. This *corresponds*
to the values from the article of Dalal /et al./ The standard errors are
/$s_{\hat{\alpha}}$ = 7.477/ and /$s_{\hat{\beta}}$ = 0.115/, which is *different* from
the *3.052* and *0.04702* reported by Dalal /et al./ The deviance is
/3.01444/ with *21* degrees of freedom. I cannot find any value similar
......@@ -107,7 +107,7 @@ same throughout all experiments, it does not change the estimates of
the fit but it does influence the variance estimates).
#+begin_src python :results output :session :exports both
logmodel=sm.GLM(data['Frequency'], data[['Intercept','Temperature']],
family=sm.families.Binomial(sm.families.links.logit),
var_weights=data['Count']).fit()
......@@ -128,7 +128,7 @@ The temperature when launching the shuttle was 31°F. Let's try to
estimate the failure probability for such temperature using our model:
#+begin_src python :results output :session :exports both
data_pred = pd.DataFrame({'Temperature': np.linspace(start=30, stop=90, num=121),
'Intercept': 1})
data_pred['Frequency'] = logmodel.predict(data_pred)
print(data_pred.head())
......@@ -157,7 +157,7 @@ and plot the curve:
def logit_inv(x):
return(np.exp(x)/(np.exp(x)+1))
data_pred['Prob']=logit_inv(data_pred['Temperature'] * logmodel.params['Temperature'] +
logmodel.params['Intercept'])
print(data_pred.head())
#+end_src
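As a quick sanity check, the headline prediction can be recovered from the published coefficients alone. This is a minimal standalone sketch, using only $\hat{\alpha}=5.085$ and $\hat{\beta}=-0.1156$ reported by Dalal /et al./, independent of the fitted model above:

```python
import numpy as np

def logit_inv(x):
    # inverse logit: map log-odds back to a probability in (0, 1)
    return np.exp(x) / (np.exp(x) + 1)

alpha, beta = 5.085, -0.1156  # point estimates reported by Dalal et al.
p_30 = logit_inv(alpha + beta * 30)
print(round(p_30, 3))  # failure probability at 30°F, about 0.834
```

Evaluating at 31°F, the temperature of the actual launch, gives roughly 0.82.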
......@@ -195,7 +195,7 @@ matplot_lib_filename
**I think I have managed to correctly compute and plot the uncertainty
of my prediction.** Although the shaded area seems very similar to
[the one obtained with
R](https://app-learninglab.inria.fr/gitlab/moocrr-session1/moocrr-reproducibility-study/raw/5c9dbef11b4d7638b7ddf2ea71026e7bf00fcfb0/challenger.pdf),
R](https://app-learninglab.inria.fr/moocrr/gitlab/moocrr-session3/moocrr-reproducibility-study/tree/master/challenger.pdf),
I can spot a few differences (e.g., the blue point for temperature
63 is outside)... Could this be a numerical error? Or a difference
in the statistical method? It is not clear which one is "right".
......@@ -5,8 +5,8 @@ date: "25 October 2018"
output: pdf_document
---
In this document we reperform some of the analysis provided in
*Risk Analysis of the Space Shuttle: Pre-Challenger Prediction of Failure* by *Siddhartha R. Dalal, Edward B. Fowlkes, Bruce Hoadley* published in *Journal of the American Statistical Association*, Vol. 84, No. 408 (Dec., 1989), pp. 945-957 and available at http://www.jstor.org/stable/2290069.
On the fourth page of this article, they indicate that the maximum likelihood estimates of the logistic regression using only temperature are: $\hat{\alpha}=5.085$ and $\hat{\beta}=-0.1156$ and their asymptotic standard errors are $s_{\hat{\alpha}}=3.052$ and $s_{\hat{\beta}}=0.047$. The Goodness of fit indicated for this model was $G^2=18.086$ with 21 degrees of freedom. Our goal is to reproduce the computation behind these values and the Figure 4 of this article, possibly in a nicer looking way.
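For reference, the model behind these estimates is a binomial logistic regression of the malfunction probability on temperature $t$, with the logit link:

$$\mathrm{logit}\,\hat{p}(t) = \hat{\alpha} + \hat{\beta}\,t, \qquad \hat{p}(t) = \frac{e^{\hat{\alpha}+\hat{\beta}t}}{1+e^{\hat{\alpha}+\hat{\beta}t}}.$$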
......@@ -26,7 +26,7 @@ devtools::session_info()
# Loading and inspecting data
Let's start by reading data:
```{r}
data = read.csv("https://app-learninglab.inria.fr/gitlab/moocrr-session1/moocrr-reproducibility-study/raw/master/data/shuttle.csv",header=T)
data = read.csv("https://app-learninglab.inria.fr/moocrr/gitlab/moocrr-session3/moocrr-reproducibility-study/tree/master/data/shuttle.csv",header=T)
data
```
......@@ -42,7 +42,7 @@ plot(data=data, Malfunction/Count ~ Temperature, ylim=c(0,1))
Let's assume O-rings independently fail with the same probability which solely depends on temperature. A logistic regression should allow us to estimate the influence of temperature.
```{r}
logistic_reg = glm(data=data, Malfunction/Count ~ Temperature, weights=Count,
family=binomial(link='logit'))
summary(logistic_reg)
```
......@@ -50,10 +50,10 @@ summary(logistic_reg)
The maximum likelihood estimator of the intercept and of Temperature are thus $\hat{\alpha}=5.0849$ and $\hat{\beta}=-0.1156$ and their standard errors are $s_{\hat{\alpha}} = 3.052$ and $s_{\hat{\beta}} = 0.04702$. The Residual deviance corresponds to the Goodness of fit $G^2=18.086$ with 21 degrees of freedom. **I have therefore managed to replicate the results of the Dalal *et al.* article**.
# Predicting failure probability
The temperature when launching the shuttle was 31°F. Let's try to
estimate the failure probability for such temperature using our model:
```{r}
# shuttle=shuttle[shuttle$r!=0,]
tempv = seq(from=30, to=90, by = .5)
rmv <- predict(logistic_reg,list(Temperature=tempv),type="response")
plot(tempv,rmv,type="l",ylim=c(0,1))
......@@ -65,7 +65,7 @@ This figure is very similar to the Figure 4 of Dalal et al. **I have managed to
# Confidence on the prediction
Let's try to plot confidence intervals with ggplot2.
```{r, fig.height=3.3}
ggplot(data, aes(y=Malfunction/Count, x=Temperature)) + geom_point(alpha=.2, size = 2, color="blue") +
geom_smooth(method = "glm", method.args = list(family = "binomial"), fullrange=T) +
xlim(30,90) + ylim(0,1) + theme_bw()
```
......@@ -96,10 +96,10 @@ summary(logistic_reg)
Perfect. The estimates and the standard errors are the same although the Residual deviance is different since the distance is now measured with respect to each 0/1 measurement and not to ratios. Let's plot the regression for *data_flat* along with the ratios (*data*).
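The construction of *data_flat* is not shown in this excerpt; here is a minimal Python sketch of the idea (column names assumed to match the R data frame), expanding each flight's counts into one 0/1 row per O-ring:

```python
import pandas as pd

# Toy rows in the shape of the shuttle data: 6 O-rings per flight.
data = pd.DataFrame({"Temperature": [66, 70], "Count": [6, 6], "Malfunction": [0, 1]})

rows = []
for _, r in data.iterrows():
    # one row per O-ring: 1 for each malfunction, 0 for the rest
    rows += [{"Temperature": r["Temperature"], "Malfunction": 1}] * int(r["Malfunction"])
    rows += [{"Temperature": r["Temperature"], "Malfunction": 0}] * int(r["Count"] - r["Malfunction"])

data_flat = pd.DataFrame(rows)
print(len(data_flat))                  # 12 rows: 6 + 6 O-rings
print(data_flat["Malfunction"].sum())  # total of 1 malfunction preserved
```

Fitting a Bernoulli logistic regression on such rows gives the same coefficient estimates as the weighted ratio fit, which is the point being made here.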
```{r, fig.height=3.3}
ggplot(data=data_flat, aes(y=Malfunction, x=Temperature)) +
geom_smooth(method = "glm", method.args = list(family = "binomial"), fullrange=T) +
geom_point(data=data, aes(y=Malfunction/Count, x=Temperature),alpha=.2, size = 2, color="blue") +
geom_point(alpha=.5, size = .5) +
xlim(30,90) + ylim(0,1) + theme_bw()
```
......@@ -121,7 +121,7 @@ logistic_reg$family$linkinv(pred_link$fit)
I recover $0.834$ for the estimated Failure probability at 30°. But now, going through the *linkinv* function, we can use $se.fit$:
```{r}
critval = 1.96
logistic_reg$family$linkinv(c(pred_link$fit-critval*pred_link$se.fit,
pred_link$fit+critval*pred_link$se.fit))
```
The 95% confidence interval for our estimation is thus [0.163,0.992]. This is what ggplot2 just plotted. This seems coherent.
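This interval can also be checked by hand. A Python sketch of the same computation on the link (log-odds) scale, where the standard error of about 1.66 is back-computed from the interval reported here rather than taken from R's `se.fit` (an assumption, not a value from the source):

```python
import numpy as np

def logit(p):
    # log-odds of a probability
    return np.log(p / (1 - p))

def logit_inv(x):
    # inverse logit: back to the probability scale
    return np.exp(x) / (np.exp(x) + 1)

critval = 1.96      # 95% normal quantile
fit = logit(0.834)  # point estimate on the link scale
se = 1.66           # assumed link-scale standard error (back-computed)
lo, hi = logit_inv(fit - critval * se), logit_inv(fit + critval * se)
print(round(lo, 3), round(hi, 3))  # about 0.163 and 0.992
```

Building the interval on the link scale and transforming at the end keeps both bounds inside (0, 1), which a naive interval on the probability scale would not.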
......