Commit 8a5d424d authored by Helene31
parents 4afaffe3 527f6951
...@@ -49,4 +49,8 @@ Non functional (expected values are $`5.085`$ and $`-0.1156`$)
| -------- | ---------------- | ------------------------------------------------------------- | ------- | ---------------------------- | ------------- | ---------- | ---------- | --------- | ---------- | ----------------------------------------------------------------- | ----------- |
| R | 3.5.1 | ggplot2 3.0.0 | RStudio | Debian GNU/Linux buster/sid | Identical | Identical | Identical | Identical | Identical | [Rmd](src/R/challenger.Rmd), [pdf](src/R/challenger_debian_alegrand.pdf) | A. Legrand |
| Python | 3.6.4 | statsmodels 0.9.0 numpy 1.13.3 pandas 0.22.0 matplotlib 2.2.2 | Jupyter | Linux Ubuntu 4.4.0-116-generic | Identical | Identical | Identical | Identical | Similar | [ipynb](src/Python3/challenger.ipynb), [pdf](src/Python3/challenger_ubuntuMOOC_alegrand.pdf) | A. Legrand |
| R | 3.5.1 | ggplot2 3.0.0 | RStudio | Windows >= 8 x64 (build 9200) | Identical | Identical | Identical | Identical | Similar | [Rmd](https://app-learninglab.inria.fr/moocrr/gitlab/8517fa92e97b3a318e653caefbfde6b5/mooc-rr/blob/master/module4/MOOC_exercice_module4.Rmd), [pdf](https://app-learninglab.inria.fr/moocrr/gitlab/8517fa92e97b3a318e653caefbfde6b5/mooc-rr/blob/master/module4/MOOC_exercice_module4.pdf) | M. Saubin |
| Python | 3.6.4 | statsmodels 0.9.0 numpy 1.15.2 pandas 0.22.0 matplotlib 2.2.3 | Jupyter | Linux Ubuntu 4.4.0-164-generic | Identical | Identical | Identical | Identical | Similar | [ipynb](module4/challenger_Python_ipynb.ipynb), [pdf](module4/challenger_Python_ipynb.pdf) | 2992438755465b7fe3afd7856bde0599 |
| R | 3.4.4 | ggplot2_3.3.0 | RStudio | Linux Mint 19 | Identical | Identical | Identical | Identical | Identical | [Rmd](https://app-learninglab.inria.fr/moocrr/gitlab/b2c48a7ab4afbff5f4d26650b09eb6b4/mooc-rr/blob/master/module4/challenger_reexecuted.Rmd), [html](https://app-learninglab.inria.fr/moocrr/gitlab/b2c48a7ab4afbff5f4d26650b09eb6b4/mooc-rr/blob/master/module4/challenger_reexecuted.html) | b2c48a7ab4afbff5f4d26650b09eb6b4 |
| Python | 3.6.4 | statsmodels 0.9.0 numpy 1.15.2 pandas 0.22.0 matplotlib 2.2.3 | Jupyter | Linux Ubuntu 4.4.0-164-generic | Identical | Identical | Identical | Identical | Similar | [ipynb](https://app-learninglab.inria.fr/moocrr/gitlab/34ea1ee296fc8711adf020d9cc2cb571/mooc-rr/blob/master/module4/challenger.ipynb), [pdf](https://app-learninglab.inria.fr/moocrr/gitlab/34ea1ee296fc8711adf020d9cc2cb571/mooc-rr/blob/master/module4/challenger.pdf) | 34ea1ee296fc8711adf020d9cc2cb571 |
| Matlab | 9.6.0.1072779 (R2019a) | | Matlab Live Script | Windows 10.0.18362 | Identical | Identical | Non Functional | Similar | Did not succeed | [mlx](https://app-learninglab.inria.fr/moocrr/gitlab/34ea1ee296fc8711adf020d9cc2cb571/mooc-rr/blob/master/module4/challenger.mlx), [pdf](https://app-learninglab.inria.fr/moocrr/gitlab/34ea1ee296fc8711adf020d9cc2cb571/mooc-rr/blob/master/module4/challenger_matlab.pdf) | 34ea1ee296fc8711adf020d9cc2cb571 |
\ No newline at end of file
...@@ -434,7 +434,7 @@
}
],
"source": [
"data = pd.read_csv(\"https://app-learninglab.inria.fr/moocrr/gitlab/moocrr-session3/moocrr-reproducibility-study/blob/master/data/shuttle.csv\")\n",
"data"
]
},
...@@ -751,7 +751,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"**I think I have managed to correctly compute and plot the uncertainty of my prediction.** Although the shaded area seems very similar to [the one obtained with R](https://app-learninglab.inria.fr/moocrr/gitlab/moocrr-session3/moocrr-reproducibility-study/tree/master/challenger.pdf), I can spot a few differences (e.g., the blue point for temperature 63 is outside)... Could this be a numerical error? Or a difference in the statistical method? It is not clear which one is \"right\"."
]
}
],
...
...@@ -424,7 +424,7 @@
}
],
"source": [
"data = pd.read_csv(\"https://app-learninglab.inria.fr/moocrr/gitlab/moocrr-session3/moocrr-reproducibility-study/blob/master/data/data/shuttle.csv\")\n",
"data"
]
},
...@@ -833,7 +833,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"**I think I have managed to correctly compute and plot the uncertainty of my prediction.** Although the shaded area seems very similar to [the one obtained with R](https://app-learninglab.inria.fr/moocrr/gitlab/moocrr-session3/moocrr-reproducibility-study/tree/master/challenger.pdf), I can spot a few differences (e.g., the blue point for temperature 63 is outside)... Could this be a numerical error? Or a difference in the statistical method? It is not clear which one is \"right\"."
]
}
],
...
...@@ -5,12 +5,12 @@
* Risk Analysis of the Space Shuttle: Pre-Challenger Prediction of Failure
In this document we reperform some of the analysis provided in
/Risk Analysis of the Space Shuttle: Pre-Challenger Prediction of
Failure/ by /Siddhartha R. Dalal, Edward B. Fowlkes, Bruce Hoadley/
published in /Journal of the American Statistical Association/, Vol. 84,
No. 408 (Dec., 1989), pp. 945-957 and available at
http://www.jstor.org/stable/2290069.
On the fourth page of this article, they indicate that the maximum
likelihood estimates of the logistic regression using only temperature
...@@ -30,7 +30,7 @@ and numpy library.
def print_imported_modules():
    import sys
    for name, val in sorted(sys.modules.items()):
        if hasattr(val, '__version__'):
            print(val.__name__, val.__version__)
#        else:
#            print(val.__name__, "(unknown version)")
...@@ -55,7 +55,7 @@ print_imported_modules()
Let's start by reading the data.
#+begin_src python :results output :session :exports both
data = pd.read_csv("https://app-learninglab.inria.fr/moocrr/gitlab/moocrr-session3/moocrr-reproducibility-study/tree/master/data/shuttle.csv")
print(data)
#+end_src
...@@ -87,7 +87,7 @@ import statsmodels.api as sm
data["Success"] = data.Count - data.Malfunction
data["Intercept"] = 1
logmodel = sm.GLM(data['Frequency'], data[['Intercept','Temperature']],
                  family=sm.families.Binomial(sm.families.links.logit)).fit()
print(logmodel.summary())
...@@ -95,7 +95,7 @@ print(logmodel.summary())
The maximum likelihood estimators of the intercept and of Temperature
are thus *$\hat{\alpha}$ = 5.0850* and *$\hat{\beta}$ = -0.1156*. This *corresponds*
to the values from the article of Dalal /et al./ The standard errors are
/$s_{\hat{\alpha}}$ = 7.477/ and /$s_{\hat{\beta}}$ = 0.115/, which is *different* from
the *3.052* and *0.04702* reported by Dalal /et al./ The deviance is
/3.01444/ with *21* degrees of freedom. I cannot find any value similar
...@@ -107,7 +107,7 @@ same throughout all experiments, it does not change the estimates of
the fit but it does influence the variance estimates).
#+begin_src python :results output :session :exports both
logmodel = sm.GLM(data['Frequency'], data[['Intercept','Temperature']],
                  family=sm.families.Binomial(sm.families.links.logit),
                  var_weights=data['Count']).fit()
...@@ -128,7 +128,7 @@ The temperature when launching the shuttle was 31°F. Let's try to
estimate the failure probability for such temperature using our model:
#+begin_src python :results output :session :exports both
data_pred = pd.DataFrame({'Temperature': np.linspace(start=30, stop=90, num=121),
                          'Intercept': 1})
data_pred['Frequency'] = logmodel.predict(data_pred)
print(data_pred.head())
...@@ -157,7 +157,7 @@ and plot the curve:
def logit_inv(x):
    return(np.exp(x)/(np.exp(x)+1))
data_pred['Prob'] = logit_inv(data_pred['Temperature'] * logmodel.params['Temperature'] +
                              logmodel.params['Intercept'])
print(data_pred.head())
#+end_src
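As a sanity check, the inverse-logit transform used above can be exercised on its own (a standalone sketch: ~logit_inv~ is redefined here so the snippet does not depend on the notebook session):

```python
import numpy as np

def logit_inv(x):
    # numerically equivalent to 1/(1 + exp(-x)) for moderate x
    return np.exp(x) / (np.exp(x) + 1)

# The inverse logit maps the linear predictor back to a probability:
# it equals 0.5 at 0, is monotonically increasing, and stays in (0, 1).
xs = np.linspace(-10, 10, 201)
ps = logit_inv(xs)
print(ps.min() > 0 and ps.max() < 1)          # prints True
print(abs(logit_inv(0.0) - 0.5) < 1e-12)      # prints True
```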
...@@ -195,7 +195,7 @@ matplot_lib_filename
**I think I have managed to correctly compute and plot the uncertainty
of my prediction.** Although the shaded area seems very similar to
[the one obtained with
R](https://app-learninglab.inria.fr/moocrr/gitlab/moocrr-session3/moocrr-reproducibility-study/tree/master/challenger.pdf),
I can spot a few differences (e.g., the blue point for temperature
63 is outside)... Could this be a numerical error? Or a difference
in the statistical method? It is not clear which one is "right".
...@@ -5,8 +5,8 @@ date: "25 October 2018"
output: pdf_document
---
In this document we reperform some of the analysis provided in
*Risk Analysis of the Space Shuttle: Pre-Challenger Prediction of Failure* by *Siddhartha R. Dalal, Edward B. Fowlkes, Bruce Hoadley* published in *Journal of the American Statistical Association*, Vol. 84, No. 408 (Dec., 1989), pp. 945-957 and available at http://www.jstor.org/stable/2290069.
On the fourth page of this article, they indicate that the maximum likelihood estimates of the logistic regression using only temperature are: $\hat{\alpha}=5.085$ and $\hat{\beta}=-0.1156$ and their asymptotic standard errors are $s_{\hat{\alpha}}=3.052$ and $s_{\hat{\beta}}=0.047$. The Goodness of fit indicated for this model was $G^2=18.086$ with 21 degrees of freedom. Our goal is to reproduce the computation behind these values and the Figure 4 of this article, possibly in a nicer looking way.
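The prediction implied by these published estimates can be checked with a few lines of Python (a sketch using only the values $\hat{\alpha}=5.085$ and $\hat{\beta}=-0.1156$ quoted above; `failure_probability` is a helper name made up for this sketch):

```python
import math

# Published maximum likelihood estimates (Dalal et al., 1989)
alpha_hat = 5.085    # intercept
beta_hat = -0.1156   # temperature coefficient

def failure_probability(temperature):
    """Inverse logit of the linear predictor alpha + beta * t."""
    eta = alpha_hat + beta_hat * temperature
    return 1.0 / (1.0 + math.exp(-eta))

# At the 31°F launch temperature the model predicts a high per-O-ring
# failure probability; at a typical 70°F launch it is an order of
# magnitude lower.
print(round(failure_probability(31), 3))  # → 0.818
print(round(failure_probability(70), 3))  # → 0.047
```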
...@@ -26,7 +26,7 @@ devtools::session_info()
# Loading and inspecting data
Let's start by reading the data:
```{r}
data = read.csv("https://app-learninglab.inria.fr/moocrr/gitlab/moocrr-session3/moocrr-reproducibility-study/tree/master/data/shuttle.csv", header=T)
data
```
...@@ -42,7 +42,7 @@ plot(data=data, Malfunction/Count ~ Temperature, ylim=c(0,1))
Let's assume O-rings independently fail with the same probability which solely depends on temperature. A logistic regression should allow us to estimate the influence of temperature.
```{r}
logistic_reg = glm(data=data, Malfunction/Count ~ Temperature, weights=Count,
                   family=binomial(link='logit'))
summary(logistic_reg)
```
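For intuition about what `glm` with `weights=Count` computes, here is a minimal Python sketch of the underlying weighted Newton/IRLS fit. The data below are synthetic and purely illustrative (not the shuttle data), and `fit_weighted_logistic` is a name invented for this sketch:

```python
import numpy as np

def fit_weighted_logistic(X, y, w, n_iter=25):
    """Newton-Raphson (IRLS) for a binomial GLM with logit link.

    X: design matrix (n, p) including an intercept column
    y: observed success proportions in [0, 1]
    w: number of trials behind each proportion (the `weights=Count` role)
    """
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = X @ beta
        mu = 1.0 / (1.0 + np.exp(-eta))      # inverse logit
        W = w * mu * (1.0 - mu)              # working weights
        grad = X.T @ (w * (y - mu))          # score vector
        info = X.T @ (X * W[:, None])        # observed information
        beta = beta + np.linalg.solve(info, grad)
    return beta

# Illustrative data: failure ratios dropping as temperature rises.
temp = np.array([50., 55., 60., 65., 70., 75., 80.])
ratio = np.array([0.9, 0.7, 0.5, 0.35, 0.2, 0.1, 0.05])
count = np.full(7, 6.0)                      # trials per observation
X = np.column_stack([np.ones_like(temp), temp])
beta = fit_weighted_logistic(X, ratio, count)
print(beta)  # [intercept, temperature slope]; the slope is negative
```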
...@@ -50,10 +50,10 @@ summary(logistic_reg)
The maximum likelihood estimators of the intercept and of Temperature are thus $\hat{\alpha}=5.0849$ and $\hat{\beta}=-0.1156$ and their standard errors are $s_{\hat{\alpha}} = 3.052$ and $s_{\hat{\beta}} = 0.04702$. The Residual deviance corresponds to the Goodness of fit $G^2=18.086$ with 21 degrees of freedom. **I have therefore managed to replicate the results of the Dalal *et al.* article**.
# Predicting failure probability
The temperature when launching the shuttle was 31°F. Let's try to
estimate the failure probability for such temperature using our model:
```{r}
# shuttle=shuttle[shuttle$r!=0,]
tempv = seq(from=30, to=90, by=.5)
rmv <- predict(logistic_reg, list(Temperature=tempv), type="response")
plot(tempv, rmv, type="l", ylim=c(0,1))
...@@ -65,7 +65,7 @@ This figure is very similar to the Figure 4 of Dalal et al. **I have managed to
# Confidence on the prediction
Let's try to plot confidence intervals with ggplot2.
```{r, fig.height=3.3}
ggplot(data, aes(y=Malfunction/Count, x=Temperature)) + geom_point(alpha=.2, size=2, color="blue") +
    geom_smooth(method = "glm", method.args = list(family = "binomial"), fullrange=T) +
    xlim(30,90) + ylim(0,1) + theme_bw()
```
...@@ -96,10 +96,10 @@ summary(logistic_reg)
Perfect. The estimates and the standard errors are the same, although the Residual deviance is different since the distance is now measured with respect to each 0/1 measurement and not to ratios. Let's now plot the regression for *data_flat* along with the ratios (*data*).
```{r, fig.height=3.3}
ggplot(data=data_flat, aes(y=Malfunction, x=Temperature)) +
    geom_smooth(method = "glm", method.args = list(family = "binomial"), fullrange=T) +
    geom_point(data=data, aes(y=Malfunction/Count, x=Temperature), alpha=.2, size=2, color="blue") +
    geom_point(alpha=.5, size=.5) +
    xlim(30,90) + ylim(0,1) + theme_bw()
```
...@@ -121,7 +121,7 @@ logistic_reg$family$linkinv(pred_link$fit)
I recover $0.834$ for the estimated failure probability at 30°. But now, going through the *linkinv* function, we can use $se.fit$:
```{r}
critval = 1.96
logistic_reg$family$linkinv(c(pred_link$fit - critval*pred_link$se.fit,
                              pred_link$fit + critval*pred_link$se.fit))
```
The 95% confidence interval for our estimation is thus [0.163, 0.992]. This is what ggplot2 just plotted for me. This seems coherent.
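The same link-scale interval construction can be sketched in Python. The `fit` and `se` values below are hypothetical, chosen only to be roughly consistent with the interval reported above, not extracted from the fitted model:

```python
import math

def linkinv(eta):
    """Inverse logit: map the link (log-odds) scale back to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))

# Hypothetical prediction on the link scale with its standard error;
# the 1.96 critical value gives a 95% interval.
fit, se = 1.61, 1.66
critval = 1.96
lower = linkinv(fit - critval * se)
upper = linkinv(fit + critval * se)
print(lower < linkinv(fit) < upper)   # prints True: interval brackets the estimate
print(0.0 < lower and upper < 1.0)    # prints True: endpoints stay in (0, 1)
```

Building the interval on the link scale and then transforming keeps both endpoints inside (0, 1), which a symmetric interval built directly on the response scale would not guarantee.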
...
...@@ -30,23 +30,23 @@ BLAS: /usr/lib64/R/lib/libRblas.so
LAPACK: /usr/lib64/R/lib/libRlapack.so
locale:
 [1] LC_CTYPE=de_DE.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=de_DE.UTF-8        LC_COLLATE=de_DE.UTF-8
 [5] LC_MONETARY=de_DE.UTF-8    LC_MESSAGES=de_DE.UTF-8
 [7] LC_PAPER=de_DE.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base
other attached packages:
[1] ggplot2_3.0.0
loaded via a namespace (and not attached):
 [1] colorspace_1.3-2 scales_1.0.0     compiler_3.5.1   plyr_1.8.4
 [5] lazyeval_0.2.1   withr_2.1.2      pillar_1.3.0     gtable_0.2.0
 [9] tibble_1.4.2     crayon_1.3.4     Rcpp_0.12.18     grid_3.5.1
[13] rlang_0.2.2      munsell_0.5.0
#+end_example
...@@ -58,16 +58,16 @@ devtools::session_info()
#+RESULTS:
#+begin_example
Session info-------------------------------------------------------------------
 setting  value
 version  R version 3.5.1 (2018-07-02)
 system   x86_64, linux-gnu
 ui       X11
 language (EN)
 collate  de_DE.UTF-8
 tz       Europe/Berlin
Packages-----------------------------------------------------------------------
 package    * version date       source
 colorspace   1.3.2   2016-12-14 CRAN (R 3.5.1)
 crayon       1.3.4   2017-09-16 CRAN (R 3.5.1)
 devtools     1.6.1   2014-10-07 CRAN (R 3.5.1)
...@@ -88,7 +88,7 @@ Packages-----------------------------------------------------------------------
* Loading and inspecting data
Let's start by reading the data:
#+begin_src R :results output :session *R* :exports both
data = read.csv("https://app-learninglab.inria.fr/moocrr/gitlab/moocrr-session3/moocrr-reproducibility-study/tree/master/data/shuttle.csv")
data
#+end_src
...@@ -124,7 +124,7 @@ data
We know from our previous experience on this data set that filtering data is a really bad idea. We will therefore process it as such.
Let's visually inspect how temperature affects malfunction:
#+begin_src R :results output graphics :file (org-babel-temp-file "figure" ".png") :exports both :width 600 :height 400 :session *R*
plot(data=data, Malfunction/Count - Temperature, ylim = c(0,1))
#+end_src
...@@ -140,9 +140,9 @@ summary(logistic_reg)
#+end_src
#+RESULTS:
: Error in stats::model.frame(formula = Malfunction/Count - Temperature, :
:   object 'Malfunction' not found
:
: Error in summary(logistic_reg) : object 'logistic_reg' not found
The maximum likelihood estimator of the intercept and of Temperature are thus \hat{\alpha}=? and \hat{\beta}=? and their standard errors are s_{\hat{\alpha}}=? and s_{\hat{\beta}}=?. The Residual deviance corresponds to the Goodness of fit G^2 = ? with ? degrees of freedom. The error above suggests the model formula was typed with ~-~ where Arnaud Legrand's example uses a tilde, so R evaluates it as arithmetic instead of capturing it as a formula and cannot find ~Malfunction~: *I have therefore _not yet_ managed to replicate the results of the Dalal /et al./ article*.
...@@ -150,7 +150,7 @@ The maximum likelihood estimator of the intercept and of Temperature are thus
* Predicting failure probability
The temperature when launching the shuttle was 31°F. Let's try to estimate the failure probability for such temperature using our model:
#+begin_src R :results output graphics :file (org-babel-temp-file "figure" ".png") :exports both :width 600 :height 400 :session *R*
# shuttle=shuttle[shuttle$r!=0,]
tempv = seq(from=30, to=90, by=.5)
rmv <- predict(logistic_reg, list(Temperature=tempv), type="response")
...@@ -166,7 +166,7 @@ For the error mentioned above I have not been able to plot this. This figure is
* Confidence on the prediction
Let's try to plot confidence intervals with ~ggplot2~.
#+begin_src R :results output graphics :file (org-babel-temp-file "figure" ".png") :exports both :width 600 :height 400 :session *R*
ggplot(data, aes(y=Malfunction/Count, x=Temperature)) + geom_point(alpha=.2, size=2, color="blue") + geom_smooth(method = "glm", method.args = list(family = "binomial"), fullrange=T) + xlim(30,90) + ylim(0,1) + theme_bw()
#+end_src
...@@ -175,7 +175,7 @@ ggplot(data, aes(y=Malfunction/Count, x=Temperature)) + geom_point(alpha=.2, siz
Apparently I don't get the warning from ~ggplot2~ that Arnaud Legrand mentions, indicating /"non-integer #successes in a binomial glm!"/. This seemed fishy to him, but it does not even appear for me. But yes, this confidence region seems huge... It seems strange to me that the uncertainty grows so large for higher temperatures. And compared to my previous call to ~glm~, I haven't indicated the weight which accounts for the fact that each ratio Malfunction/Count corresponds to ~Count~ observations (if someone knows how to do this ...). There must be something wrong.
So let's provide the "raw" data to ~ggplot2~.
#+begin_src R :results output :session *R* :exports both
data_flat = data.frame()
for(i in 1:nrow(data)) {
...@@ -191,7 +191,7 @@ dim(data_flat)
#+end_src
#+RESULTS:
:
: [1] 138   2
#+begin_src R :results output :session *R* :exports both
...@@ -210,20 +210,20 @@ summary(logistic_reg)
#+end_src
#+RESULTS:
: Error in stats::model.frame(formula = Malfunction - Temperature, data = data_flat, :
:   object 'Malfunction' not found
:
: Error in summary(logistic_reg) : object 'logistic_reg' not found
So, again, these objects are not here (same error as above, probably). For Arnaud Legrand this is perfect because he sees a result; for me it is not. For him the estimates and the standard errors are the same, although the Residual deviance is different since the distance is now measured with respect to each 0/1 measurement and not to ratios. Let's now plot the regression for /~data_flat~/ along with the ratios (/~data~/).
#+begin_src R :results output graphics :file (org-babel-temp-file "figure" ".png") :exports both :width 600 :height 400 :session *R*
ggplot(data=data_flat, aes(y=Malfunction, x=Temperature)) +
  geom_smooth(method = "glm", method.args = list(family = "binomial"), fullrange=T) +
  geom_point(data=data, aes(y=Malfunction/Count, x=Temperature), alpha=.2, size = 2, color="blue") +
  geom_point(alpha=.5, size=.5) + xlim(30,90) + ylim(0,1) + theme_bw()
#+end_src
#+RESULTS:
[[file:/tmp/babel-24411vWV/figure24411YLE.png]]
This confidence interval seems much more reasonable (in accordance with the data) than the previous one. Let's check whether it corresponds to the prediction obtained when calling ~predict~ directly. Obtaining the prediction can be done directly or through the link function.
Here is the "direct" (response) version I used in my very first plot:
#+begin_src R :results output :session *R* :exports both
...@@ -232,9 +232,9 @@ pred
#+end_src
#+RESULTS:
: Error in predict(logistic_reg_flat, list(Temperature = 30), type = "response", :
: object 'logistic_reg_flat' not found
:
: Error: object 'pred' not found
Again, in my version of this document I cannot find the object defined above. So I cannot replicate what Arnaud Legrand wrote: the estimated failure probability for 30° is thus ??. However, the /se.fit/ value seems pretty hard to use, as I obviously cannot simply add \pm2 /se.fit/ to /fit/ to compute a confidence interval.
...@@ -246,9 +246,9 @@ pred.link
#+end_src
#+RESULTS:
: Error in predict(logistic_reg_flat, list(Temperature = 39), type = "link", :
: object 'logistic_reg_flat' not found
:
: Error: object 'pred.link' not found
#+begin_src R :results output :session *R* :exports both
...@@ -265,11 +265,11 @@ logistic_reg$family$linkinv(c(pred_link$fit-critval*pred_link$se.fit, pred_link$
#+end_src
#+RESULTS:
:
: Error: object 'logistic_reg' not found
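Since the diff truncates the block above, here is the full computation as I understand it should read (a sketch, assuming ~logistic_reg_flat~ survives in the session):

#+begin_src R :results output :session *R* :exports both
## Sketch: 95% confidence interval on the link (logit) scale, then
## mapped back to a probability with the inverse link function
critval = qnorm(0.975)  # roughly 1.96
pred_link = predict(logistic_reg_flat, list(Temperature=30),
                    type="link", se.fit=TRUE)
logistic_reg_flat$family$linkinv(c(pred_link$fit - critval*pred_link$se.fit,
                                   pred_link$fit + critval*pred_link$se.fit))
#+end_src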
The 95% confidence interval for our estimation is thus [??,??]. This is what ~ggplot2~ just plotted for me. This seems coherent.
*I am now _not yet_ rather confident that I have managed to correctly compute and plot the uncertainty of my prediction.* Let's be honest, it took me a while. My first attempts were plainly wrong (I didn't know how to do this, so I trusted ~ggplot2~, which I was misusing) and did not use the correct statistical method. I also feel confident now because this has somehow been validated by other colleagues, but it will be interesting to collect the other plots and values that you obtained, that differ, and that you would probably have kept if you didn't have a reference to compare to. Please provide us with as many versions as you can.
So, I'm disappointed because some error in R or in my configuration causes objects to be forgotten between blocks. I will try to export the whole document to pdf to see if that changes anything. Unfortunately, it doesn't. Right now I do not have the time to figure out by myself how to fix this, so I will only upload this document and hope it still contributes to the database in some way.
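A hypothesis worth testing: if some blocks are evaluated in a different (or in no) ~:session~, objects defined earlier are invisible to them. A single file-level property near the top of the Org file (a sketch, untested on my setup) would force every R block into the same session:

#+begin_src org
#+PROPERTY: header-args:R :session *R*
#+end_src

After adding it, the property line must be activated with ~C-c C-c~ (or the file reopened) before re-evaluating the blocks.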
...@@ -7,12 +7,12 @@
* Risk Analysis of the Space Shuttle: Pre-Challenger Prediction of Failure
In this document we reperform some of the analysis provided in
/Risk Analysis of the Space Shuttle: Pre-Challenger Prediction of
Failure/ by /Siddhartha R. Dalal, Edward B. Fowlkes, Bruce Hoadley/
published in /Journal of the American Statistical Association/, Vol. 84,
No. 408 (Dec., 1989), pp. 945-957 and available at
http://www.jstor.org/stable/2290069.
On the fourth page of this article, they indicate that the maximum
likelihood estimates of the logistic regression using only temperature
...@@ -41,22 +41,22 @@ Running under: Windows 7 x64 (build 7601) Service Pack 1
Matrix products: default
locale:
[1] LC_COLLATE=French_France.1252 LC_CTYPE=French_France.1252
[3] LC_MONETARY=French_France.1252 LC_NUMERIC=C
[5] LC_TIME=French_France.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] ggplot2_3.1.0
loaded via a namespace (and not attached):
[1] Rcpp_1.0.0 withr_2.1.2 crayon_1.3.4 dplyr_0.7.8
[5] assertthat_0.2.0 grid_3.5.1 plyr_1.8.4 R6_2.3.0
[9] gtable_0.2.0 magrittr_1.5 scales_1.0.0 pillar_1.3.0
[13] rlang_0.3.0.1 lazyeval_0.2.1 bindrcpp_0.2.2 glue_1.3.0
[17] purrr_0.2.5 munsell_0.5.0 compiler_3.5.1 pkgconfig_2.0.2
[21] colorspace_1.3-2 tidyselect_0.2.5 bindr_0.1.1 tibble_1.4.2
#+end_example
...@@ -68,19 +68,19 @@ devtools::session_info()
#+RESULTS:
#+begin_example
- Session info ---------------------------------------------------------------
setting value
version R version 3.5.1 (2018-07-02)
os Windows 7 x64 SP 1
system i386, mingw32
ui RTerm
language (EN)
collate French_France.1252
ctype French_France.1252
tz Europe/Paris
date 2018-12-10
- Packages -------------------------------------------------------------------
package * version date lib source
assertthat 0.2.0 2017-04-11 [1] CRAN (R 3.5.1)
backports 1.1.2 2017-12-13 [1] CRAN (R 3.5.0)
base64enc 0.1-3 2015-07-28 [1] CRAN (R 3.5.0)
...@@ -130,7 +130,7 @@ devtools::session_info()
* Loading and inspecting data
Let's start by reading data:
#+begin_src R :results output :session *R* :exports both
data = read.csv("https://app-learninglab.inria.fr/moocrr/gitlab/moocrr-session3/moocrr-reproducibility-study/tree/master/data/shuttle.csv")
data
#+end_src
...@@ -163,10 +163,10 @@ data
#+end_example
We know from our previous experience on this data set that filtering
data is a really bad idea. We will therefore process it as such.
Let's visually inspect how temperature affects malfunction:
#+begin_src R :results output graphics :file (org-babel-temp-file "figure" ".png") :exports both :width 600 :height 400 :session *R*
plot(data=data, Malfunction/Count ~ Temperature, ylim = c(0,1))
#+end_src
...@@ -174,7 +174,7 @@ plot(data=data, Malfunction/Count ~ Temperature, ylim = c(0,1))
Let's assume O-rings independently fail with the same probability
which solely depends on temperature. A logistic regression should
allow us to estimate the influence of temperature.
#+begin_src R :results output :session *R* :exports both
logistic_reg = glm(data=data, Malfunction/Count ~ Temperature, weights=Count,
                   family=binomial(link='logit'))
...@@ -185,15 +185,15 @@ summary(logistic_reg)
#+begin_example
Call:
glm(formula = Malfunction/Count ~ Temperature, family = binomial(link = "logit"),
    data = data, weights = Count)
Deviance Residuals:
     Min        1Q    Median        3Q       Max
-0.95227  -0.78299  -0.54117  -0.04379   2.65152
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  5.08498    3.05247   1.666   0.0957 .
Temperature -0.11560    0.04702  -2.458   0.0140 *
---
...@@ -215,13 +215,13 @@ are thus *$\hat{\alpha}$ = 5.0850* and *$\hat{\beta}$ = -0.1156* and their s
errors are *$s_{\hat{\alpha}}$ = 3.052* and *$s_{\hat{\beta}}$ = 0.04702*. The Residual
deviance corresponds to the Goodness of fit *$G^2$ = 18.086* with *21*
degrees of freedom. *I have therefore managed to replicate the results
of the Dalal /et al./ article*.
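The replication can also be checked programmatically; a small sketch (the tolerance is chosen to match the number of digits quoted above):

#+begin_src R :results output :session *R* :exports both
## Sketch: assert that the fitted coefficients agree with the published
## values up to the precision printed in the summary
stopifnot(abs(coef(logistic_reg)[["(Intercept)"]] - 5.08498) < 1e-3,
          abs(coef(logistic_reg)[["Temperature"]] + 0.11560) < 1e-3)
#+end_src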
* Predicting failure probability
The temperature when launching the shuttle was 31°F. Let's try to
estimate the failure probability for such temperature using our model:
#+begin_src R :results output graphics :file (org-babel-temp-file "figure" ".png") :exports both :width 600 :height 400 :session *R*
# shuttle=shuttle[shuttle$r!=0,]
tempv = seq(from=30, to=90, by = .5)
rmv <- predict(logistic_reg,list(Temperature=tempv),type="response")
...@@ -230,28 +230,28 @@ points(data=data, Malfunction/Count ~ Temperature)
#+end_src
This figure is very similar to the Figure 4 of Dalal /et al./ *I have
managed to replicate the Figure 4 of the Dalal /et al./ article.*
* Confidence on the prediction
Let's try to plot confidence intervals with ~ggplot2~.
#+begin_src R :results output graphics :file (org-babel-temp-file "figure" ".png") :exports both :width 600 :height 400 :session *R*
ggplot(data, aes(y=Malfunction/Count, x=Temperature)) +
  geom_point(alpha=.2, size = 2, color="blue") +
  geom_smooth(method = "glm", method.args = list(family = "binomial"),
              fullrange=T) +
  xlim(30,90) + ylim(0,1) + theme_bw()
#+end_src
I don't get any warning from ~ggplot2~ indicating /"non-integer
#successes in a binomial glm!"/ but this confidence region seems
huge... It seems strange to me that the uncertainty grows so large for
higher temperatures. And compared to my previous call to ~glm~, I
haven't indicated the weight which accounts for the fact that each
ratio Malfunction/Count corresponds to ~Count~ observations (if someone
knows how to do this...). There must be something wrong.
So let's provide the "raw" data to ~ggplot2~.
#+begin_src R :results output :session *R* :exports both
data_flat = data.frame()
for(i in 1:nrow(data)) {
...@@ -280,7 +280,7 @@ str(data_flat)
Let's check whether I obtain the same regression or not:
#+begin_src R :results output :session *R* :exports both
logistic_reg_flat = glm(data=data_flat, Malfunction ~ Temperature,
                        family=binomial(link='logit'))
summary(logistic_reg)
#+end_src
...@@ -289,15 +289,15 @@ summary(logistic_reg)
#+begin_example
Call:
glm(formula = Malfunction/Count ~ Temperature, family = binomial(link = "logit"),
    data = data, weights = Count)
Deviance Residuals:
     Min        1Q    Median        3Q       Max
-0.95227  -0.78299  -0.54117  -0.04379   2.65152
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  5.08498    3.05247   1.666   0.0957 .
Temperature -0.11560    0.04702  -2.458   0.0140 *
---
...@@ -316,13 +316,13 @@ Perfect. The estimates and the standard errors for him are the same
although the Residual deviance is different since the distance is now
measured with respect to each =0/1= measurement and not to
ratios. Let's now plot the regression for /data_flat/ along with the
ratios (/data/).
#+begin_src R :results output graphics :file (org-babel-temp-file "figure" ".png") :exports both :width 600 :height 400 :session *R*
ggplot(data=data_flat, aes(y=Malfunction, x=Temperature)) +
  geom_smooth(method = "glm", method.args = list(family = "binomial"),
              fullrange=T) +
  geom_point(data=data, aes(y=Malfunction/Count, x=Temperature),
             alpha=.2, size = 2, color="blue") +
  geom_point(alpha=.5, size=.5) + xlim(30,90) + ylim(0,1) + theme_bw()
#+end_src
...@@ -330,7 +330,7 @@ This confidence interval seems much more reasonable (in accordance
with the data) than the previous one. Let's check whether it
corresponds to the prediction obtained when calling ~predict~
directly. Obtaining the prediction can be done directly or through the
link function.
Here is the "direct" (response) version I used in my very first plot:
#+begin_src R :results output :session *R* :exports both
...@@ -341,12 +341,12 @@ pred
#+RESULTS:
#+begin_example
$fit
       1
0.834373
$se.fit
        1
0.2293304
$residual.scale
[1] 1
...@@ -354,7 +354,7 @@ $residual.scale
The estimated Failure probability for 30° is thus 0.834. However the
/se.fit/ value seems pretty hard to use as I obviously cannot simply
add \pm2 /se.fit/ to /fit/ to compute a confidence interval.
Here is the "link" version:
#+begin_src R :results output :session *R* :exports both
...@@ -364,12 +364,12 @@ pred_link
#+RESULTS:
: $fit
:        1
: 1.616942
:
: $se.fit
: [1] 1.659473
:
: $residual.scale
: [1] 1
...@@ -378,7 +378,7 @@ logistic_reg$family$linkinv(pred_link$fit)
#+end_src
#+RESULTS:
:        1
: 0.834373
I recover 0.834 for the Estimated Failure probability at 30°. But now, going through the /linkinv/ function, we can use /se.fit/:
...@@ -388,11 +388,11 @@ logistic_reg$family$linkinv(c(pred_link$fit-critval*pred_link$se.fit, pred_link$
#+end_src
#+RESULTS:
:         1         1
: 0.1630612 0.9923814
The 95% confidence interval for our estimation is thus [0.163,0.992]. This
is what ~ggplot2~ just plotted for me. This seems coherent.
*I am now rather confident that I have managed to correctly compute and
plot the uncertainty of my prediction.* Let's be honest, it took me a
...