# -*- coding: utf-8 -*-
# -*- mode: org -*-
#+TITLE: Challenger - Python - Emacs - Windows 7 64 bits
* Risk Analysis of the Space Shuttle: Pre-Challenger Prediction of Failure
In this document we reperform some of the analysis provided in
/Risk Analysis of the Space Shuttle: Pre-Challenger Prediction of
Failure/ by /Siddhartha R. Dalal, Edward B. Fowlkes, Bruce Hoadley/
published in /Journal of the American Statistical Association/, Vol. 84,
No. 408 (Dec., 1989), pp. 945-957 and available at
http://www.jstor.org/stable/2290069.

On the fourth page of this article, they indicate that the maximum
likelihood estimates of the logistic regression using only temperature
are: *$\hat{\alpha}$ = 5.085* and *$\hat{\beta}$ = -0.1156* and their
asymptotic standard errors are *$s_{\hat{\alpha}}$ = 3.052* and
*$s_{\hat{\beta}}$ = 0.047*. The goodness of fit indicated for this model was
*$G^2$ = 18.086* with *21* degrees of freedom. Our goal is to reproduce
the computation behind these values and Figure 4 of this article,
possibly in a nicer-looking way.
** Technical information on the computer on which the analysis is run
We will be using the Python 3 language with the pandas, statsmodels,
and numpy libraries.
#+begin_src python :results output :session :exports both
def print_imported_modules():
    import sys
    for name, val in sorted(sys.modules.items()):
...
import seaborn as sns

print_sys_info()
print_imported_modules()
#+end_src
** Loading and inspecting data
Let's start by reading the data.
#+begin_src python :results output :session :exports both
data = pd.read_csv("https://app-learninglab.inria.fr/gitlab/moocrr-session1/moocrr-reproducibility-study/raw/master/data/shuttle.csv")
print(data)
#+end_src
We know from our previous experience on this data set that filtering
the data is a really bad idea. We will therefore keep all of it.
#+begin_src python :results file :session :var matplot_lib_filename=(org-babel-temp-file "figure" ".png") :exports both
%matplotlib inline
pd.set_option('mode.chained_assignment', None)  # this removes a useless warning from pandas
...
plt.tight_layout()
plt.savefig(matplot_lib_filename)
matplot_lib_filename
#+end_src
** Logistic regression
Let's assume O-rings independently fail with the same probability,
which depends solely on temperature. A logistic regression should
allow us to estimate the influence of temperature.
#+begin_src python :results output :session :exports both
import statsmodels.api as sm

data["Success"] = data.Count - data.Malfunction
...
logmodel = sm.GLM(data['Frequency'], data[['Intercept','Temperature']],
                  family=sm.families.Binomial(sm.families.links.logit)).fit()
print(logmodel.summary())
#+end_src
The maximum likelihood estimators of the intercept and of Temperature
are thus *$\hat{\alpha}$ = 5.0850* and *$\hat{\beta}$ = -0.1156*. This *corresponds*
to the values from the article of Dalal /et al./ The standard errors are
/$s_{\hat{\alpha}}$ = 7.477/ and /$s_{\hat{\beta}}$ = 0.115/, which is *different* from
the *3.052* and *0.04702* reported by Dalal /et al./ The deviance is
/3.01444/ with *21* degrees of freedom. I cannot find any value similar
to the goodness of fit (*$G^2$ = 18.086*) reported by Dalal /et al./ There
seems to be something wrong. Oh I know, I haven't indicated that each of
my observations is actually the aggregate of 6 O-ring observations per
rocket launch. Let's indicate these weights (since the weights are the
same throughout all experiments, this does not change the estimates of
the fit, but it does influence the variance estimates).
#+begin_src python :results output :session :exports both
logmodel = sm.GLM(data['Frequency'], data[['Intercept','Temperature']],
                  family=sm.families.Binomial(sm.families.links.logit),
                  var_weights=data['Count']).fit()
print(logmodel.summary())
#+end_src
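As a quick sanity check on what the weights do (a sketch using only
the standard errors printed by the two fits, not the data itself):
uniform weights of 6 multiply the Fisher information by 6, so each
standard error of the unweighted fit should shrink by a factor of
$\sqrt{6}$.

#+begin_src python :results output :session :exports both
import numpy as np

# Standard errors from the unweighted fit above, divided by sqrt(6)
# (six O-rings observed per launch).
se_alpha, se_beta = 7.477, 0.115
print(se_alpha / np.sqrt(6))  # ~ 3.052
print(se_beta / np.sqrt(6))   # ~ 0.047
#+end_src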
Good, now I have recovered the asymptotic standard errors
*$s_{\hat{\alpha}}$ = 3.052* and *$s_{\hat{\beta}}$ = 0.047*. The goodness of fit
(deviance) indicated for this model is *$G^2$ = 18.086* with *21* degrees
of freedom (Df Residuals).

*I have therefore managed to fully replicate the results of the Dalal
/et al./ article*.
** Predicting failure probability
The temperature when launching the shuttle was 31°F. Let's try to
estimate the failure probability for such a temperature using our model:
#+begin_src python :results output :session :exports both
data_pred = pd.DataFrame({'Temperature': np.linspace(start=30, stop=90, num=121),
                          'Intercept': 1})
data_pred['Frequency'] = logmodel.predict(data_pred)
print(data_pred.head())
#+end_src

#+begin_src python :results file :session :var matplot_lib_filename=(org-babel-temp-file "figure" ".png") :exports both
%matplotlib inline
data_pred.plot(x="Temperature",y="Frequency",kind="line",ylim=[0,1])
plt.scatter(x=data["Temperature"],y=data["Frequency"])
...
plt.grid(True)
plt.savefig(matplot_lib_filename)
matplot_lib_filename
#+end_src
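As a cross-check that does not depend on the fitted model object, we
can plug the coefficients reported in the article ($\hat{\alpha}$ = 5.085,
$\hat{\beta}$ = -0.1156) directly into the inverse logit at 31°F (a sketch;
the model's own prediction uses more decimal places, so it can differ
slightly):

#+begin_src python :results output :session :exports both
import numpy as np

# Per-O-ring failure probability at 31 degrees F, using the
# coefficients reported by Dalal et al.
alpha_hat, beta_hat = 5.085, -0.1156
eta = alpha_hat + beta_hat * 31
p_31 = 1 / (1 + np.exp(-eta))
print(p_31)  # ~ 0.82
#+end_src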
The function =logmodel.predict(data_pred)= does not work with recent
versions of pandas (Frequency = 1 for all temperatures). We can
instead use the following code to compute the predictions and plot
the curve:
#+begin_src python :results output :session :exports both
# Inspired by http://blog.yhat.com/posts/logistic-regression-and-python.html
def logit_inv(x):
    return np.exp(x)/(np.exp(x)+1)

data_pred['Prob'] = logit_inv(data_pred['Temperature'] * logmodel.params['Temperature'] +
                              logmodel.params['Intercept'])
print(data_pred.head())
#+end_src

#+begin_src python :results file :session :var matplot_lib_filename=(org-babel-temp-file "figure" ".png") :exports both
%matplotlib inline
data_pred.plot(x="Temperature",y="Prob",kind="line",ylim=[0,1])
plt.scatter(x=data["Temperature"],y=data["Frequency"])
plt.grid(True)
...
plt.savefig(matplot_lib_filename)
matplot_lib_filename
#+end_src
This figure is very similar to Figure 4 of Dalal /et al./ *I have
managed to replicate the Figure 4 of the Dalal /et al./ article.*
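As an aside, the hand-rolled =logit_inv= above is the standard inverse
logit; =scipy.special.expit= computes the same function (and, unlike
the =np.exp(x)/(np.exp(x)+1)= form, does not overflow for large
positive arguments), so it can serve as a drop-in check:

#+begin_src python :results output :session :exports both
import numpy as np
from scipy.special import expit

def logit_inv(x):  # same definition as above
    return np.exp(x)/(np.exp(x)+1)

x = np.linspace(-30, 30, 601)
print(np.allclose(logit_inv(x), expit(x)))  # True
#+end_src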
** Computing and plotting uncertainty
Following the documentation of
[[https://seaborn.pydata.org/generated/seaborn.regplot.html][Seaborn]],
I use regplot.
#+begin_src python :results file :session :var matplot_lib_filename=(org-babel-temp-file "figure" ".png") :exports both
sns.set(color_codes=True)
plt.xlim(30,90)
plt.ylim(0,1)
...
plt.show()
plt.savefig(matplot_lib_filename)
matplot_lib_filename
#+end_src
*I think I have managed to correctly compute and plot the uncertainty
of my prediction.* Although the shaded area seems very similar to
[[https://app-learninglab.inria.fr/gitlab/moocrr-session1/moocrr-reproducibility-study/raw/5c9dbef11b4d7638b7ddf2ea71026e7bf00fcfb0/challenger.pdf][the one obtained with R]],
I can spot a few differences (e.g., the blue point for temperature
63 is outside the band)... Could this be a numerical error? Or a
difference in the statistical method? It is not clear which one is
"right".
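For readers who want a band that is traceable to the GLM itself rather
than to regplot's internal resampling, here is a self-contained sketch
of the delta-method approach (the data, seed, and grid below are
illustrative synthetic values, not the shuttle data): fit the logistic
regression by IRLS, propagate the coefficient covariance to the linear
predictor, then map the interval through the inverse logit.

#+begin_src python :results output :session :exports both
import numpy as np

rng = np.random.default_rng(42)

# Illustrative synthetic data (NOT the shuttle data): temperatures and
# Bernoulli outcomes drawn from a known logistic relationship.
t = rng.uniform(50, 85, size=200)
X = np.column_stack([np.ones_like(t), t])
y = rng.binomial(1, 1 / (1 + np.exp(-(X @ np.array([5.0, -0.1])))))

# Fit the logistic regression by Newton-Raphson / IRLS.
beta = np.zeros(2)
for _ in range(25):
    p = 1 / (1 + np.exp(-(X @ beta)))
    H = X.T @ ((p * (1 - p))[:, None] * X)   # Fisher information
    beta = beta + np.linalg.solve(H, X.T @ (y - p))
cov = np.linalg.inv(H)                       # asymptotic covariance of beta

# Delta method: 95% interval on the linear predictor, mapped through
# the inverse logit so the band stays inside [0, 1].
grid = np.column_stack([np.ones(121), np.linspace(30, 90, 121)])
eta = grid @ beta
se = np.sqrt(np.einsum('ij,jk,ik->i', grid, cov, grid))
lower = 1 / (1 + np.exp(-(eta - 1.96 * se)))
upper = 1 / (1 + np.exp(-(eta + 1.96 * se)))
print(float(lower.min()), float(upper.max()))
#+end_src

The same recipe applies to the shuttle fit by substituting the launch
data for =X= and =y= and, as above, weighting each launch by its 6
O-rings.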