Replace challenger_Python_org.org

Commit 901e0a3f authored Nov 12, 2018 by Marie-Gabrielle Dondon

Showing with 99 additions and 91 deletions

challenger_Python_org.org src/Python3/challenger_Python_org.org +99 -91

--- a/src/Python3/challenger_Python_org.org
+++ b/src/Python3/challenger_Python_org.org
-* Chalenger - Emacs - Python - Windows 7 64 bits
+# -*- coding: utf-8 -*-
+# -*- mode: org -*-

-** Risk Analysis of the Space Shuttle: Pre-Challenger Prediction of Failure
+#+TITLE: Challenger - Python - Emacs - Windows 7 64 bits

- In this document we reperform some of the analysis provided in 
- /Risk Analysis of the Space Shuttle: Pre-Challenger Prediction of
- Failure/ by /Siddhartha R. Dalal, Edward B. Fowlkes, Bruce Hoadley/
- published in /Journal of the American Statistical Association/, Vol. 84,
- No. 408 (Dec., 1989), pp. 945-957 and available at
- http://www.jstor.org/stable/2290069. 
+* Risk Analysis of the Space Shuttle: Pre-Challenger Prediction of Failure

- On the fourth page of this article, they indicate that the maximum
- likelihood estimates of the logistic regression using only temperature
- are: *$\hat{\alpha}$ = 5.085* and *$\hat{\beta}$ = -0.1156* and their
- asymptotic standard errors are *$s_{\hat{\alpha}}$ = 3.052* and
- *$s_{\hat{\beta}}$ = 0.047*. The Goodness of fit indicated for this model was
- *$G^2$ = 18.086* with *21* degrees of freedom. Our goal is to reproduce
- the computation behind these values and the Figure 4 of this article,
- possibly in a nicer looking way.
+In this document we reperform some of the analysis provided in 
+/Risk Analysis of the Space Shuttle: Pre-Challenger Prediction of
+Failure/ by /Siddhartha R. Dalal, Edward B. Fowlkes, Bruce Hoadley/
+published in /Journal of the American Statistical Association/, Vol. 84,
+No. 408 (Dec., 1989), pp. 945-957 and available at
+http://www.jstor.org/stable/2290069. 

-*** Technical information on the computer on which the analysis is run
+On the fourth page of this article, they indicate that the maximum
+likelihood estimates of the logistic regression using only temperature
+are: *$\hat{\alpha}$ = 5.085* and *$\hat{\beta}$ = -0.1156* and their
+asymptotic standard errors are *$s_{\hat{\alpha}}$ = 3.052* and
+*$s_{\hat{\beta}}$ = 0.047*. The Goodness of fit indicated for this model was
+*$G^2$ = 18.086* with *21* degrees of freedom. Our goal is to reproduce
+the computation behind these values and the Figure 4 of this article,
+possibly in a nicer looking way.

- We will be using the Python 3 language using the pandas, statsmodels,
- and numpy library.
+** Technical information on the computer on which the analysis is run

- #+begin_src python :results output :session :exports both
+We will be using the Python 3 language with the pandas, statsmodels,
+and numpy libraries.
+
+#+begin_src python :results output :session :exports both
 def print_imported_modules():
    import sys
    for name, val in sorted(sys.modules.items()):
@@ -45,21 +48,21 @@ import seaborn as sns

 print_sys_info()
 print_imported_modules()
- #+end_src
+#+end_src

-*** Loading and inspecting data
+** Loading and inspecting data

- Let's start by reading data.
+Let's start by reading data.

-  #+begin_src python :results output :session :exports both
+#+begin_src python :results output :session :exports both
 data = pd.read_csv("https://app-learninglab.inria.fr/gitlab/moocrr-session1/moocrr-reproducibility-study/raw/master/data/shuttle.csv")
 print(data)
-  #+end_src
+#+end_src

- We know from our previous experience on this data set that filtering
- data is a really bad idea. We will therefore process it as such.
+We know from our previous experience on this data set that filtering
+data is a really bad idea. We will therefore process it as is.

-  #+begin_src python :results file :session :var matplot_lib_filename=(org-babel-temp-file "figure" ".png") :exports both
+#+begin_src python :results file :session :var matplot_lib_filename=(org-babel-temp-file "figure" ".png") :exports both
 %matplotlib inline
 pd.set_option('mode.chained_assignment',None) # this removes a useless warning from pandas

@@ -70,15 +73,15 @@ plt.tight_layout()

 plt.savefig(matplot_lib_filename)
 matplot_lib_filename
-  #+end_src
+#+end_src

-*** Logistic regression
+** Logistic regression

- Let's assume O-rings independently fail with the same probability
- which solely depends on temperature. A logistic regression should
- allow us to estimate the influence of temperature.
+Let's assume O-rings independently fail with the same probability
+which solely depends on temperature. A logistic regression should
+allow us to estimate the influence of temperature.

- #+begin_src python :results output :session :exports both
+#+begin_src python :results output :session :exports both
 import statsmodels.api as sm

 data["Success"]=data.Count-data.Malfunction
@@ -88,46 +91,52 @@ logmodel=sm.GLM(data['Frequency'], data[['Intercept','Temperature']],
                family=sm.families.Binomial(sm.families.links.logit)).fit()

 print(logmodel.summary())
- #+end_src
-
- The maximum likelyhood estimator of the intercept and of Temperature
- are thus *$\hat{\alpha}$ = 5.0850* and *$\hat{\beta}$ = -0.1156*. This *corresponds*
- to the values from the article of Dalal /et al./ The standard errors are 
- /$s_{\hat{\alpha}}$ = 7.477/ and /$s_{\hat{\beta}}$ = 0.115/, which is *different* from
- the *3.052* and *0.04702* reported by Dallal /et al./ The deviance is
- /3.01444/ with *21* degrees of freedom. I cannot find any value similar
- to the Goodness of fit (*$G^2$ = 18.086*) reported by Dalal /et al./ There
- seems to be something wrong. Oh I know, I haven't indicated that my
- observations are actually the result of 6 observations for each rocket
- launch. Let's indicate these weights (since the weights are always the
- same throughout all experiments, it does not change the estimates of
- the fit but it does influence the variance estimates).
-
- #+begin_src python :results output :session :exports both
+#+end_src
+
+The maximum likelihood estimates of the intercept and of Temperature
+are thus *$\hat{\alpha}$ = 5.0850* and *$\hat{\beta}$ = -0.1156*. This *corresponds*
+to the values from the article of Dalal /et al./ The standard errors are
+/$s_{\hat{\alpha}}$ = 7.477/ and /$s_{\hat{\beta}}$ = 0.115/, which is *different* from
+the *3.052* and *0.04702* reported by Dalal /et al./ The deviance is
+/3.01444/ with *21* degrees of freedom. I cannot find any value similar
+to the Goodness of fit (*$G^2$ = 18.086*) reported by Dalal /et al./ There
+seems to be something wrong. Oh I know, I haven't indicated that my
+observations are actually the result of 6 observations for each rocket
+launch. Let's indicate these weights (since the weights are always the
+same throughout all experiments, it does not change the estimates of
+the fit but it does influence the variance estimates).
+
+#+begin_src python :results output :session :exports both
 logmodel=sm.GLM(data['Frequency'], data[['Intercept','Temperature']], 
                family=sm.families.Binomial(sm.families.links.logit),
                var_weights=data['Count']).fit()

 print(logmodel.summary())
- #+end_src
+#+end_src

- Good, now I have recovered the asymptotic standard errors
- *$s_{\hat{\alpha}}$ = 3.052* and *$s_{\hat{\beta}}$ = 0.047*. The Goodness of fit
- (Deviance) indicated for this model is *$G^2$ = 18.086* with *21* degrees
- of freedom (Df Residuals).
+Good, now I have recovered the asymptotic standard errors
+*$s_{\hat{\alpha}}$ = 3.052* and *$s_{\hat{\beta}}$ = 0.047*. The Goodness of fit
+(Deviance) indicated for this model is *$G^2$ = 18.086* with *21* degrees
+of freedom (Df Residuals).
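
The effect of the constant weight can be checked by hand from the numbers quoted above. The sketch below assumes that, for a binomial GLM with a fixed scale, multiplying every variance weight by the same constant 6 divides the standard errors by sqrt(6) and multiplies the deviance by 6. It only uses the rounded values from the text, so the agreement is approximate; the variable names are illustrative.

```python
import math

# Values quoted in the text for the fit without weights.
se_alpha_unweighted = 7.477
se_beta_unweighted = 0.115
deviance_unweighted = 3.01444

# Each launch aggregates 6 O-ring observations (constant weight).
n_orings = 6

# Assumption: a constant weight w scales the covariance matrix by 1/w,
# hence the standard errors by 1/sqrt(w), and the deviance by w.
se_alpha_weighted = se_alpha_unweighted / math.sqrt(n_orings)
se_beta_weighted = se_beta_unweighted / math.sqrt(n_orings)
deviance_weighted = deviance_unweighted * n_orings

print(round(se_alpha_weighted, 3))
print(round(se_beta_weighted, 3))
print(round(deviance_weighted, 3))
```

Up to rounding, these agree with the standard errors and the $G^2$ value reported in the article.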

- *I have therefore managed to fully replicate the results of the Dalal
- /et al./ article*.
+*I have therefore managed to fully replicate the results of the Dalal
+/et al./ article*.

-*** Predicting failure probability
+** Predicting failure probability

- The temperature when launching the shuttle was 31°F. Let's try to
- estimate the failure probability for such temperature using our model:
+The temperature when launching the shuttle was 31°F. Let's try to
+estimate the failure probability for such temperature using our model:

- #+begin_src python :results file :session :var matplot_lib_filename=(org-babel-temp-file "figure" ".png") :exports both
-%matplotlib inline
-data_pred = pd.DataFrame({'Temperature': np.linspace(start=30, stop=90, num=121), 'Intercept': 1})
+#+begin_src python :results output :session :exports both
+data_pred = pd.DataFrame({'Temperature': np.linspace(start=30, stop=90, num=121), 
+                          'Intercept': 1})
 data_pred['Frequency'] = logmodel.predict(data_pred)
+print(data_pred.head())
+#+end_src
+
+#+begin_src python :results file :session :var matplot_lib_filename=(org-babel-temp-file "figure" ".png") :exports both
+%matplotlib inline
+

 data_pred.plot(x="Temperature",y="Frequency",kind="line",ylim=[0,1])
 plt.scatter(x=data["Temperature"],y=data["Frequency"])
@@ -135,27 +144,26 @@ plt.grid(True)

 plt.savefig(matplot_lib_filename)
 matplot_lib_filename
- #+end_src
-
- #+begin_src python :results output :session :exports both
-print(data_pred.head())
- #+end_src
+#+end_src

- The function =logmodel.predict(data_pred)= does not work correctly (Frequency = 1 for all
- temperatures).
+The function =logmodel.predict(data_pred)= does not work with the
+latest versions of pandas (Frequency = 1 for all temperatures).

- The following code can then be used to compute the predictions
- and plot the curve:
+The following code can then be used to compute the predictions
+and plot the curve:

- #+begin_src python :results file :session :var matplot_lib_filename=(org-babel-temp-file "figure" ".png") :exports both
+#+begin_src python :results output :session :exports both
 # Inspired by http://blog.yhat.com/posts/logistic-regression-and-python.html
-%matplotlib inline
 def logit_inv(x):
    return(np.exp(x)/(np.exp(x)+1))

-data_pred['Prob']=logit_inv(data_pred['Temperature'] * logmodel.params['Temperature'] + logmodel.params['Intercept'])
+data_pred['Prob']=logit_inv(data_pred['Temperature'] * logmodel.params['Temperature'] + 
+                            logmodel.params['Intercept'])
 print(data_pred.head())
+#+end_src

+#+begin_src python :results file :session :var matplot_lib_filename=(org-babel-temp-file "figure" ".png") :exports both
+%matplotlib inline
 data_pred.plot(x="Temperature",y="Prob",kind="line",ylim=[0,1])
 plt.scatter(x=data["Temperature"],y=data["Frequency"])
 plt.grid(True)
@@ -164,16 +172,16 @@ plt.savefig(matplot_lib_filename)
 matplot_lib_filename
 #+end_src

- This figure is very similar to the Figure 4 of Dalal /et al./ *I have
- managed to replicate the Figure 4 of the Dalal /et al./ article.*
+This figure is very similar to the Figure 4 of Dalal /et al./ *I have
+managed to replicate the Figure 4 of the Dalal /et al./ article.*
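
As a point check on the curve, the predicted probability at the launch temperature of 31°F can be computed directly from the coefficients quoted earlier. This is a minimal sketch: it hard-codes the rounded estimates from the text rather than reading them from =logmodel.params=, and the variable names are illustrative.

```python
import math

# Rounded maximum likelihood estimates quoted in the text.
alpha_hat = 5.085
beta_hat = -0.1156

def logit_inv(x):
    # Inverse of the logit link, same formula as in the document.
    return math.exp(x) / (math.exp(x) + 1)

launch_temperature = 31  # degrees Fahrenheit, as stated in the text
prob_at_launch = logit_inv(alpha_hat + beta_hat * launch_temperature)

print(round(prob_at_launch, 3))
```

With these rounded coefficients the estimate is a little above 0.8, which should match the left end of the plotted curve.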

-*** Computing and plotting uncertainty
+** Computing and plotting uncertainty

- Following the documentation of
- [Seaborn](https://seaborn.pydata.org/generated/seaborn.regplot.html),
- I use regplot.
+Following the documentation of
+[Seaborn](https://seaborn.pydata.org/generated/seaborn.regplot.html),
+I use regplot.

- #+begin_src python :results file :session :var matplot_lib_filename=(org-babel-temp-file "figure" ".png") :exports both
+#+begin_src python :results file :session :var matplot_lib_filename=(org-babel-temp-file "figure" ".png") :exports both
 sns.set(color_codes=True)
 plt.xlim(30,90)
 plt.ylim(0,1)
@@ -182,12 +190,12 @@ plt.show()

 plt.savefig(matplot_lib_filename)
 matplot_lib_filename
- #+end_src
-
- **I think I have managed to correctly compute and plot the uncertainty
-   of my prediction.** Although the shaded area seems very similar to
-   [the one obtained by with
-   R](https://app-learninglab.inria.fr/gitlab/moocrr-session1/moocrr-reproducibility-study/raw/5c9dbef11b4d7638b7ddf2ea71026e7bf00fcfb0/challenger.pdf),
-   I can spot a few differences (e.g., the blue point for temperature
-   63 is outside)... Could this be a numerical error ? Or a difference
-   in the statistical method ? It is not clear which one is "right".
+#+end_src
+
+**I think I have managed to correctly compute and plot the uncertainty
+  of my prediction.** Although the shaded area seems very similar to
+  [the one obtained with
+  R](https://app-learninglab.inria.fr/gitlab/moocrr-session1/moocrr-reproducibility-study/raw/5c9dbef11b4d7638b7ddf2ea71026e7bf00fcfb0/challenger.pdf),
+  I can spot a few differences (e.g., the blue point for temperature
+  63 is outside)... Could this be a numerical error? Or a difference
+  in the statistical method? It is not clear which one is "right".