From 901e0a3fe2ac1f157eeb0b81e1caf37648307600 Mon Sep 17 00:00:00 2001 From: Marie-Gabrielle Dondon <85bc36e0a8096c618fbd5993d1cca191@app-learninglab.inria.fr> Date: Mon, 12 Nov 2018 20:40:49 +0000 Subject: [PATCH] Replace challenger_Python_org.org --- src/Python3/challenger_Python_org.org | 190 ++++++++++++++------------ 1 file changed, 99 insertions(+), 91 deletions(-) diff --git a/src/Python3/challenger_Python_org.org b/src/Python3/challenger_Python_org.org index 06ecdcc..a1f6b90 100644 --- a/src/Python3/challenger_Python_org.org +++ b/src/Python3/challenger_Python_org.org @@ -1,29 +1,32 @@ -* Chalenger - Emacs - Python - Windows 7 64 bits +# -*- coding: utf-8 -*- +# -*- mode: org -*- -** Risk Analysis of the Space Shuttle: Pre-Challenger Prediction of Failure +#+TITLE: Challenger - Python - Emacs - Windows 7 64 bits - In this document we reperform some of the analysis provided in - /Risk Analysis of the Space Shuttle: Pre-Challenger Prediction of - Failure/ by /Siddhartha R. Dalal, Edward B. Fowlkes, Bruce Hoadley/ - published in /Journal of the American Statistical Association/, Vol. 84, - No. 408 (Dec., 1989), pp. 945-957 and available at - http://www.jstor.org/stable/2290069. +* Risk Analysis of the Space Shuttle: Pre-Challenger Prediction of Failure - On the fourth page of this article, they indicate that the maximum - likelihood estimates of the logistic regression using only temperature - are: *$\hat{\alpha}$ = 5.085* and *$\hat{\beta}$ = -0.1156* and their - asymptotic standard errors are *$s_{\hat{\alpha}}$ = 3.052* and - *$s_{\hat{\beta}}$ = 0.047*. The Goodness of fit indicated for this model was - *$G^2$ = 18.086* with *21* degrees of freedom. Our goal is to reproduce - the computation behind these values and the Figure 4 of this article, - possibly in a nicer looking way. +In this document we reperform some of the analysis provided in +/Risk Analysis of the Space Shuttle: Pre-Challenger Prediction of +Failure/ by /Siddhartha R. Dalal, Edward B. Fowlkes, Bruce Hoadley/ +published in /Journal of the American Statistical Association/, Vol. 84, +No. 408 (Dec., 1989), pp. 945-957 and available at +http://www.jstor.org/stable/2290069. -*** Technical information on the computer on which the analysis is run +On the fourth page of this article, they indicate that the maximum +likelihood estimates of the logistic regression using only temperature +are: *$\hat{\alpha}$ = 5.085* and *$\hat{\beta}$ = -0.1156* and their +asymptotic standard errors are *$s_{\hat{\alpha}}$ = 3.052* and +*$s_{\hat{\beta}}$ = 0.047*. The Goodness of fit indicated for this model was +*$G^2$ = 18.086* with *21* degrees of freedom. Our goal is to reproduce +the computation behind these values and the Figure 4 of this article, +possibly in a nicer looking way. - We will be using the Python 3 language using the pandas, statsmodels, - and numpy library. +** Technical information on the computer on which the analysis is run - #+begin_src python :results output :session :exports both +We will be using the Python 3 language using the pandas, statsmodels, +and numpy library. + +#+begin_src python :results output :session :exports both def print_imported_modules(): import sys for name, val in sorted(sys.modules.items()): @@ -45,21 +48,21 @@ import seaborn as sns print_sys_info() print_imported_modules() - #+end_src +#+end_src -*** Loading and inspecting data +** Loading and inspecting data - Let's start by reading data. +Let's start by reading data. - #+begin_src python :results output :session :exports both +#+begin_src python :results output :session :exports both data = pd.read_csv("https://app-learninglab.inria.fr/gitlab/moocrr-session1/moocrr-reproducibility-study/raw/master/data/shuttle.csv") print(data) - #+end_src +#+end_src - We know from our previous experience on this data set that filtering - data is a really bad idea. We will therefore process it as such. +We know from our previous experience on this data set that filtering +data is a really bad idea. We will therefore process it as such. - #+begin_src python :results file :session :var matplot_lib_filename=(org-babel-temp-file "figure" ".png") :exports both +#+begin_src python :results file :session :var matplot_lib_filename=(org-babel-temp-file "figure" ".png") :exports both %matplotlib inline pd.set_option('mode.chained_assignment',None) # this removes a useless warning from pandas @@ -70,15 +73,15 @@ plt.tight_layout() plt.savefig(matplot_lib_filename) matplot_lib_filename - #+end_src +#+end_src -*** Logistic regression +** Logistic regression - Let's assume O-rings independently fail with the same probability - which solely depends on temperature. A logistic regression should - allow us to estimate the influence of temperature. +Let's assume O-rings independently fail with the same probability +which solely depends on temperature. A logistic regression should +allow us to estimate the influence of temperature. - #+begin_src python :results output :session :exports both +#+begin_src python :results output :session :exports both import statsmodels.api as sm data["Success"]=data.Count-data.Malfunction @@ -88,46 +91,52 @@ logmodel=sm.GLM(data['Frequency'], data[['Intercept','Temperature']], family=sm.families.Binomial(sm.families.links.logit)).fit() print(logmodel.summary()) - #+end_src - - The maximum likelyhood estimator of the intercept and of Temperature - are thus *$\hat{\alpha}$ = 5.0850* and *$\hat{\beta}$ = -0.1156*. This *corresponds* - to the values from the article of Dalal /et al./ The standard errors are - /$s_{\hat{\alpha}}$ = 7.477/ and /$s_{\hat{\beta}}$ = 0.115/, which is *different* from - the *3.052* and *0.04702* reported by Dallal /et al./ The deviance is - /3.01444/ with *21* degrees of freedom. I cannot find any value similar - to the Goodness of fit (*$G^2$ = 18.086*) reported by Dalal /et al./ There - seems to be something wrong. Oh I know, I haven't indicated that my - observations are actually the result of 6 observations for each rocket - launch. Let's indicate these weights (since the weights are always the - same throughout all experiments, it does not change the estimates of - the fit but it does influence the variance estimates). - - #+begin_src python :results output :session :exports both +#+end_src + +The maximum likelyhood estimator of the intercept and of Temperature +are thus *$\hat{\alpha}$ = 5.0850* and *$\hat{\beta}$ = -0.1156*. This *corresponds* +to the values from the article of Dalal /et al./ The standard errors are +/$s_{\hat{\alpha}}$ = 7.477/ and /$s_{\hat{\beta}}$ = 0.115/, which is *different* from +the *3.052* and *0.04702* reported by Dallal /et al./ The deviance is +/3.01444/ with *21* degrees of freedom. I cannot find any value similar +to the Goodness of fit (*$G^2$ = 18.086*) reported by Dalal /et al./ There +seems to be something wrong. Oh I know, I haven't indicated that my +observations are actually the result of 6 observations for each rocket +launch. Let's indicate these weights (since the weights are always the +same throughout all experiments, it does not change the estimates of +the fit but it does influence the variance estimates). + +#+begin_src python :results output :session :exports both logmodel=sm.GLM(data['Frequency'], data[['Intercept','Temperature']], family=sm.families.Binomial(sm.families.links.logit), var_weights=data['Count']).fit() print(logmodel.summary()) - #+end_src +#+end_src - Good, now I have recovered the asymptotic standard errors - *$s_{\hat{\alpha}}$ = 3.052* and *$s_{\hat{\beta}}$ = 0.047*. The Goodness of fit - (Deviance) indicated for this model is *$G^2$ = 18.086* with *21* degrees - of freedom (Df Residuals). +Good, now I have recovered the asymptotic standard errors +*$s_{\hat{\alpha}}$ = 3.052* and *$s_{\hat{\beta}}$ = 0.047*. The Goodness of fit +(Deviance) indicated for this model is *$G^2$ = 18.086* with *21* degrees +of freedom (Df Residuals). - *I have therefore managed to fully replicate the results of the Dalal - /et al./ article*. +*I have therefore managed to fully replicate the results of the Dalal +/et al./ article*. -*** Predicting failure probability +** Predicting failure probability - The temperature when launching the shuttle was 31°F. Let's try to - estimate the failure probability for such temperature using our model: +The temperature when launching the shuttle was 31°F. Let's try to +estimate the failure probability for such temperature using our model: - #+begin_src python :results file :session :var matplot_lib_filename=(org-babel-temp-file "figure" ".png") :exports both -%matplotlib inline -data_pred = pd.DataFrame({'Temperature': np.linspace(start=30, stop=90, num=121), 'Intercept': 1}) +#+begin_src python :results output :session :exports both +data_pred = pd.DataFrame({'Temperature': np.linspace(start=30, stop=90, num=121), + 'Intercept': 1}) data_pred['Frequency'] = logmodel.predict(data_pred) +print(data_pred.head()) +#+end_src + +#+begin_src python :results file :session :var matplot_lib_filename=(org-babel-temp-file "figure" ".png") :exports both +%matplotlib inline + data_pred.plot(x="Temperature",y="Frequency",kind="line",ylim=[0,1]) plt.scatter(x=data["Temperature"],y=data["Frequency"]) @@ -135,27 +144,26 @@ plt.grid(True) plt.savefig(matplot_lib_filename) matplot_lib_filename - #+end_src - - #+begin_src python :results output :session :exports both -print(data_pred.head()) - #+end_src +#+end_src - La fonction =logmodel.predict(data_pred)= ne fonctionne pas correctement (Frequency = 1 pour toutes les - températures). +La fonction =logmodel.predict(data_pred)= ne fonctionne pas avec les +dernières versions de pandas (Frequency = 1 pour toutes les températures). - On peut alors utiliser le code suivant pour calculer les prédictions - et tracer la courbe : +On peut alors utiliser le code suivant pour calculer les prédictions +et tracer la courbe : - #+begin_src python :results file :session :var matplot_lib_filename=(org-babel-temp-file "figure" ".png") :exports both +#+begin_src python :results output :session :exports both # Inspiring from http://blog.yhat.com/posts/logistic-regression-and-python.html -%matplotlib inline def logit_inv(x): return(np.exp(x)/(np.exp(x)+1)) -data_pred['Prob']=logit_inv(data_pred['Temperature'] * logmodel.params['Temperature'] + logmodel.params['Intercept']) +data_pred['Prob']=logit_inv(data_pred['Temperature'] * logmodel.params['Temperature'] + + logmodel.params['Intercept']) print(data_pred.head()) +#+end_src +#+begin_src python :results file :session :var matplot_lib_filename=(org-babel-temp-file "figure" ".png") :exports both +%matplotlib inline data_pred.plot(x="Temperature",y="Prob",kind="line",ylim=[0,1]) plt.scatter(x=data["Temperature"],y=data["Frequency"]) plt.grid(True) @@ -164,16 +172,16 @@ plt.savefig(matplot_lib_filename) matplot_lib_filename #+end_src - This figure is very similar to the Figure 4 of Dalal /et al./ *I have - managed to replicate the Figure 4 of the Dalal /et al./ article.* +This figure is very similar to the Figure 4 of Dalal /et al./ *I have +managed to replicate the Figure 4 of the Dalal /et al./ article.* -*** Computing and plotting uncertainty +** Computing and plotting uncertainty - Following the documentation of - [Seaborn](https://seaborn.pydata.org/generated/seaborn.regplot.html), - I use regplot. +Following the documentation of +[Seaborn](https://seaborn.pydata.org/generated/seaborn.regplot.html), +I use regplot. - #+begin_src python :results file :session :var matplot_lib_filename=(org-babel-temp-file "figure" ".png") :exports both +#+begin_src python :results file :session :var matplot_lib_filename=(org-babel-temp-file "figure" ".png") :exports both sns.set(color_codes=True) plt.xlim(30,90) plt.ylim(0,1) @@ -182,12 +190,12 @@ plt.show() plt.savefig(matplot_lib_filename) matplot_lib_filename - #+end_src - - **I think I have managed to correctly compute and plot the uncertainty - of my prediction.** Although the shaded area seems very similar to - [the one obtained by with - R](https://app-learninglab.inria.fr/gitlab/moocrr-session1/moocrr-reproducibility-study/raw/5c9dbef11b4d7638b7ddf2ea71026e7bf00fcfb0/challenger.pdf), - I can spot a few differences (e.g., the blue point for temperature - 63 is outside)... Could this be a numerical error ? Or a difference - in the statistical method ? It is not clear which one is "right". +#+end_src + +**I think I have managed to correctly compute and plot the uncertainty + of my prediction.** Although the shaded area seems very similar to + [the one obtained by with + R](https://app-learninglab.inria.fr/gitlab/moocrr-session1/moocrr-reproducibility-study/raw/5c9dbef11b4d7638b7ddf2ea71026e7bf00fcfb0/challenger.pdf), + I can spot a few differences (e.g., the blue point for temperature + 63 is outside)... Could this be a numerical error ? Or a difference + in the statistical method ? It is not clear which one is "right". -- 2.18.1