Commit 7312310c authored by Laurence Farhi's avatar Laurence Farhi

Repository error: session1 -> session3

parent 630ad462
@@ -434,7 +434,7 @@
    }
   ],
   "source": [
-    "data = pd.read_csv(\"https://app-learninglab.inria.fr/gitlab/moocrr-session1/moocrr-reproducibility-study/raw/master/data/shuttle.csv\")\n",
+    "data = pd.read_csv(\"https://app-learninglab.inria.fr/moocrr/gitlab/moocrr-session3/moocrr-reproducibility-study/blob/master/data/shuttle.csv\")\n",
    "data"
   ]
  },
@@ -751,7 +751,7 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "**I think I have managed to correctly compute and plot the uncertainty of my prediction.** Although the shaded area seems very similar to [the one obtained with R](https://app-learninglab.inria.fr/gitlab/moocrr-session1/moocrr-reproducibility-study/raw/5c9dbef11b4d7638b7ddf2ea71026e7bf00fcfb0/challenger.pdf), I can spot a few differences (e.g., the blue point for temperature 63 is outside)... Could this be a numerical error? Or a difference in the statistical method? It is not clear which one is \"right\"."
+    "**I think I have managed to correctly compute and plot the uncertainty of my prediction.** Although the shaded area seems very similar to [the one obtained with R](https://app-learninglab.inria.fr/moocrr/gitlab/moocrr-session3/moocrr-reproducibility-study/tree/master/challenger.pdf), I can spot a few differences (e.g., the blue point for temperature 63 is outside)... Could this be a numerical error? Or a difference in the statistical method? It is not clear which one is \"right\"."
   ]
  }
 ],
...
@@ -424,7 +424,7 @@
    }
   ],
   "source": [
-    "data = pd.read_csv(\"https://app-learninglab.inria.fr/gitlab/moocrr-session1/moocrr-reproducibility-study/raw/master/data/shuttle.csv\")\n",
+    "data = pd.read_csv(\"https://app-learninglab.inria.fr/moocrr/gitlab/moocrr-session3/moocrr-reproducibility-study/blob/master/data/data/shuttle.csv\")\n",
    "data"
   ]
  },
@@ -833,7 +833,7 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "**I think I have managed to correctly compute and plot the uncertainty of my prediction.** Although the shaded area seems very similar to [the one obtained with R](https://app-learninglab.inria.fr/gitlab/moocrr-session1/moocrr-reproducibility-study/raw/5c9dbef11b4d7638b7ddf2ea71026e7bf00fcfb0/challenger.pdf), I can spot a few differences (e.g., the blue point for temperature 63 is outside)... Could this be a numerical error? Or a difference in the statistical method? It is not clear which one is \"right\"."
+    "**I think I have managed to correctly compute and plot the uncertainty of my prediction.** Although the shaded area seems very similar to [the one obtained with R](https://app-learninglab.inria.fr/moocrr/gitlab/moocrr-session3/moocrr-reproducibility-study/tree/master/challenger.pdf), I can spot a few differences (e.g., the blue point for temperature 63 is outside)... Could this be a numerical error? Or a difference in the statistical method? It is not clear which one is \"right\"."
   ]
  }
 ],
...
@@ -5,12 +5,12 @@
 * Risk Analysis of the Space Shuttle: Pre-Challenger Prediction of Failure

 In this document we reperform some of the analysis provided in
 /Risk Analysis of the Space Shuttle: Pre-Challenger Prediction of
 Failure/ by /Siddhartha R. Dalal, Edward B. Fowlkes, Bruce Hoadley/
 published in /Journal of the American Statistical Association/, Vol. 84,
 No. 408 (Dec., 1989), pp. 945-957 and available at
 http://www.jstor.org/stable/2290069.

 On the fourth page of this article, they indicate that the maximum
 likelihood estimates of the logistic regression using only temperature
@@ -30,7 +30,7 @@ and numpy library.
 def print_imported_modules():
     import sys
     for name, val in sorted(sys.modules.items()):
         if(hasattr(val, '__version__')):
             print(val.__name__, val.__version__)
         # else:
         #     print(val.__name__, "(unknown version)")
@@ -55,7 +55,7 @@ print_imported_modules()
 Let's start by reading data.

 #+begin_src python :results output :session :exports both
-data = pd.read_csv("https://app-learninglab.inria.fr/gitlab/moocrr-session1/moocrr-reproducibility-study/raw/master/data/shuttle.csv")
+data = pd.read_csv("https://app-learninglab.inria.fr/moocrr/gitlab/moocrr-session3/moocrr-reproducibility-study/tree/master/data/shuttle.csv")
 print(data)
 #+end_src
@@ -87,7 +87,7 @@ import statsmodels.api as sm
 data["Success"]=data.Count-data.Malfunction
 data["Intercept"]=1

 logmodel=sm.GLM(data['Frequency'], data[['Intercept','Temperature']],
                 family=sm.families.Binomial(sm.families.links.logit)).fit()

 print(logmodel.summary())
@@ -95,7 +95,7 @@ print(logmodel.summary())
 The maximum likelihood estimator of the intercept and of Temperature
 are thus *$\hat{\alpha}$ = 5.0850* and *$\hat{\beta}$ = -0.1156*. This *corresponds*
 to the values from the article of Dalal /et al./ The standard errors are
 /$s_{\hat{\alpha}}$ = 7.477/ and /$s_{\hat{\beta}}$ = 0.115/, which is *different* from
 the *3.052* and *0.04702* reported by Dalal /et al./ The deviance is
 /3.01444/ with *21* degrees of freedom. I cannot find any value similar
@@ -107,7 +107,7 @@ same throughout all experiments, it does not change the estimates of
 the fit but it does influence the variance estimates).

 #+begin_src python :results output :session :exports both
 logmodel=sm.GLM(data['Frequency'], data[['Intercept','Temperature']],
                 family=sm.families.Binomial(sm.families.links.logit),
                 var_weights=data['Count']).fit()
@@ -128,7 +128,7 @@ The temperature when launching the shuttle was 31°F. Let's try to
 estimate the failure probability for such temperature using our model:

 #+begin_src python :results output :session :exports both
 data_pred = pd.DataFrame({'Temperature': np.linspace(start=30, stop=90, num=121),
                           'Intercept': 1})
 data_pred['Frequency'] = logmodel.predict(data_pred)
 print(data_pred.head())
@@ -157,7 +157,7 @@ and plot the curve:
 def logit_inv(x):
     return(np.exp(x)/(np.exp(x)+1))

 data_pred['Prob']=logit_inv(data_pred['Temperature'] * logmodel.params['Temperature'] +
                             logmodel.params['Intercept'])
 print(data_pred.head())
 #+end_src
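The inverse-logit step above can be checked by hand. A minimal sketch, using the estimates $\hat{\alpha}$ = 5.0849 and $\hat{\beta}$ = -0.1156 reported in this document (the hard-coded coefficients stand in for `logmodel.params`):

```python
import numpy as np

# Inverse logit (sigmoid), mirroring logit_inv defined above.
def logit_inv(x):
    return np.exp(x) / (np.exp(x) + 1)

# Coefficients reported in the text: alpha-hat = 5.0849, beta-hat = -0.1156.
alpha_hat, beta_hat = 5.0849, -0.1156

# Estimated O-ring failure probability at 30 degrees F.
p_30 = logit_inv(alpha_hat + beta_hat * 30)
print(round(p_30, 3))  # -> 0.834
```

This reproduces the 0.834 failure-probability estimate quoted for 30°F further down in the document.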
@@ -195,7 +195,7 @@ matplot_lib_filename
 **I think I have managed to correctly compute and plot the uncertainty
 of my prediction.** Although the shaded area seems very similar to
-[the one obtained with R](https://app-learninglab.inria.fr/gitlab/moocrr-session1/moocrr-reproducibility-study/raw/5c9dbef11b4d7638b7ddf2ea71026e7bf00fcfb0/challenger.pdf),
+[the one obtained with R](https://app-learninglab.inria.fr/moocrr/gitlab/moocrr-session3/moocrr-reproducibility-study/tree/master/challenger.pdf),
 I can spot a few differences (e.g., the blue point for temperature
 63 is outside)... Could this be a numerical error? Or a difference
 in the statistical method? It is not clear which one is "right".
@@ -5,8 +5,8 @@ date: "25 October 2018"
 output: pdf_document
 ---

 In this document we reperform some of the analysis provided in
 *Risk Analysis of the Space Shuttle: Pre-Challenger Prediction of Failure* by *Siddhartha R. Dalal, Edward B. Fowlkes, Bruce Hoadley* published in *Journal of the American Statistical Association*, Vol. 84, No. 408 (Dec., 1989), pp. 945-957 and available at http://www.jstor.org/stable/2290069.

 On the fourth page of this article, they indicate that the maximum likelihood estimates of the logistic regression using only temperature are: $\hat{\alpha}=5.085$ and $\hat{\beta}=-0.1156$ and their asymptotic standard errors are $s_{\hat{\alpha}}=3.052$ and $s_{\hat{\beta}}=0.047$. The Goodness of fit indicated for this model was $G^2=18.086$ with 21 degrees of freedom. Our goal is to reproduce the computation behind these values and the Figure 4 of this article, possibly in a nicer looking way.
@@ -26,7 +26,7 @@ devtools::session_info()
 # Loading and inspecting data

 Let's start by reading data:
 ```{r}
-data = read.csv("https://app-learninglab.inria.fr/gitlab/moocrr-session1/moocrr-reproducibility-study/raw/master/data/shuttle.csv",header=T)
+data = read.csv("https://app-learninglab.inria.fr/moocrr/gitlab/moocrr-session3/moocrr-reproducibility-study/tree/master/data/shuttle.csv",header=T)
 data
 ```
@@ -42,7 +42,7 @@ plot(data=data, Malfunction/Count ~ Temperature, ylim=c(0,1))
 Let's assume O-rings independently fail with the same probability which solely depends on temperature. A logistic regression should allow us to estimate the influence of temperature.
 ```{r}
 logistic_reg = glm(data=data, Malfunction/Count ~ Temperature, weights=Count,
                    family=binomial(link='logit'))
 summary(logistic_reg)
 ```
@@ -50,10 +50,10 @@ summary(logistic_reg)
 The maximum likelihood estimator of the intercept and of Temperature are thus $\hat{\alpha}=5.0849$ and $\hat{\beta}=-0.1156$ and their standard errors are $s_{\hat{\alpha}} = 3.052$ and $s_{\hat{\beta}} = 0.04702$. The Residual deviance corresponds to the Goodness of fit $G^2=18.086$ with 21 degrees of freedom. **I have therefore managed to replicate the results of the Dalal *et al.* article**.

 # Predicting failure probability

 The temperature when launching the shuttle was 31°F. Let's try to
 estimate the failure probability for such temperature using our model:
 ```{r}
 # shuttle=shuttle[shuttle$r!=0,]
 tempv = seq(from=30, to=90, by = .5)
 rmv <- predict(logistic_reg,list(Temperature=tempv),type="response")
 plot(tempv,rmv,type="l",ylim=c(0,1))
@@ -65,7 +65,7 @@ This figure is very similar to the Figure 4 of Dalal et al. **I have managed to
 # Confidence on the prediction

 Let's try to plot confidence intervals with ggplot2.
 ```{r, fig.height=3.3}
 ggplot(data, aes(y=Malfunction/Count, x=Temperature)) + geom_point(alpha=.2, size = 2, color="blue") +
   geom_smooth(method = "glm", method.args = list(family = "binomial"), fullrange=T) +
   xlim(30,90) + ylim(0,1) + theme_bw()
 ```
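The R snippets above predict the failure probability over a 30–90°F grid. The same logistic curve can be sketched directly in numpy from the coefficients reported in this document ($\hat{\alpha}$ = 5.0849, $\hat{\beta}$ = -0.1156); this reuses the published estimates rather than refitting the model:

```python
import numpy as np

# Coefficients reported in the text (not a fresh glm fit).
alpha_hat, beta_hat = 5.0849, -0.1156

# Same temperature grid as the R snippet: 30 to 90 degrees F in 0.5 steps.
tempv = np.arange(30.0, 90.5, 0.5)
rmv = 1.0 / (1.0 + np.exp(-(alpha_hat + beta_hat * tempv)))

# The estimated failure probability decreases monotonically with temperature.
print(round(rmv[0], 3), round(rmv[-1], 3))  # -> 0.834 0.005
```

With a negative slope the curve is strictly decreasing, which is why a 31°F launch sits on the dangerous high-probability end of the range.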
@@ -96,10 +96,10 @@ summary(logistic_reg)
 Perfect. The estimates and the standard errors are the same, although the Residual deviance is different since the distance is now measured with respect to each 0/1 measurement and not to ratios. Let's plot the regression for *data_flat* along with the ratios (*data*).
 ```{r, fig.height=3.3}
 ggplot(data=data_flat, aes(y=Malfunction, x=Temperature)) +
   geom_smooth(method = "glm", method.args = list(family = "binomial"), fullrange=T) +
   geom_point(data=data, aes(y=Malfunction/Count, x=Temperature),alpha=.2, size = 2, color="blue") +
   geom_point(alpha=.5, size = .5) +
   xlim(30,90) + ylim(0,1) + theme_bw()
 ```
@@ -121,7 +121,7 @@ logistic_reg$family$linkinv(pred_link$fit)
 I recover $0.834$ for the estimated Failure probability at 30°. But now, going through the *linkinv* function, we can use $se.fit$:
 ```{r}
 critval = 1.96
 logistic_reg$family$linkinv(c(pred_link$fit-critval*pred_link$se.fit,
                               pred_link$fit+critval*pred_link$se.fit))
 ```
 The 95% confidence interval for our estimation is thus [0.163,0.992]. This is what ggplot2 just plotted for me. This seems coherent.
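The interval above can be reproduced by hand on the link scale. A sketch assuming fit ≈ 1.617 (the logit of 0.834 quoted above) and a hypothetical standard error se ≈ 1.66 chosen to be consistent with the [0.163, 0.992] interval; the actual `se.fit` value is not shown in this excerpt:

```python
import numpy as np

def linkinv(x):
    # Inverse of the logit link, as in logistic_reg$family$linkinv.
    return 1.0 / (1.0 + np.exp(-x))

# fit ~ logit(0.834) on the link scale; se = 1.66 is a hypothetical value
# consistent with the interval quoted in the text (true se.fit not shown here).
fit, se, critval = 1.617, 1.66, 1.96
lower = linkinv(fit - critval * se)
upper = linkinv(fit + critval * se)
print(round(lower, 3), round(upper, 3))  # -> 0.163 0.992
```

Note that the interval is computed on the link scale and only then mapped through the inverse link, which is what keeps both bounds inside [0, 1].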
...