From 901e0a3fe2ac1f157eeb0b81e1caf37648307600 Mon Sep 17 00:00:00 2001
From: Marie-Gabrielle Dondon
 <85bc36e0a8096c618fbd5993d1cca191@app-learninglab.inria.fr>
Date: Mon, 12 Nov 2018 20:40:49 +0000
Subject: [PATCH] Replace challenger_Python_org.org

---
 src/Python3/challenger_Python_org.org | 190 ++++++++++++++------------
 1 file changed, 99 insertions(+), 91 deletions(-)

diff --git a/src/Python3/challenger_Python_org.org b/src/Python3/challenger_Python_org.org
index 06ecdcc..a1f6b90 100644
--- a/src/Python3/challenger_Python_org.org
+++ b/src/Python3/challenger_Python_org.org
@@ -1,29 +1,32 @@
-* Chalenger - Emacs - Python - Windows 7 64 bits
+# -*- coding: utf-8 -*-
+# -*- mode: org -*-
 
-** Risk Analysis of the Space Shuttle: Pre-Challenger Prediction of Failure
+#+TITLE: Challenger - Python - Emacs - Windows 7 64 bits
 
- In this document we reperform some of the analysis provided in 
- /Risk Analysis of the Space Shuttle: Pre-Challenger Prediction of
- Failure/ by /Siddhartha R. Dalal, Edward B. Fowlkes, Bruce Hoadley/
- published in /Journal of the American Statistical Association/, Vol. 84,
- No. 408 (Dec., 1989), pp. 945-957 and available at
- http://www.jstor.org/stable/2290069. 
+* Risk Analysis of the Space Shuttle: Pre-Challenger Prediction of Failure
 
- On the fourth page of this article, they indicate that the maximum
- likelihood estimates of the logistic regression using only temperature
- are: *$\hat{\alpha}$ = 5.085* and *$\hat{\beta}$ = -0.1156* and their
- asymptotic standard errors are *$s_{\hat{\alpha}}$ = 3.052* and
- *$s_{\hat{\beta}}$ = 0.047*. The Goodness of fit indicated for this model was
- *$G^2$ = 18.086* with *21* degrees of freedom. Our goal is to reproduce
- the computation behind these values and the Figure 4 of this article,
- possibly in a nicer looking way.
+In this document we reperform some of the analysis provided in 
+/Risk Analysis of the Space Shuttle: Pre-Challenger Prediction of
+Failure/ by /Siddhartha R. Dalal, Edward B. Fowlkes, Bruce Hoadley/
+published in /Journal of the American Statistical Association/, Vol. 84,
+No. 408 (Dec., 1989), pp. 945-957 and available at
+http://www.jstor.org/stable/2290069. 
 
-*** Technical information on the computer on which the analysis is run
+On the fourth page of this article, they indicate that the maximum
+likelihood estimates of the logistic regression using only temperature
+are: *$\hat{\alpha}$ = 5.085* and *$\hat{\beta}$ = -0.1156* and their
+asymptotic standard errors are *$s_{\hat{\alpha}}$ = 3.052* and
+*$s_{\hat{\beta}}$ = 0.047*. The Goodness of fit indicated for this model was
+*$G^2$ = 18.086* with *21* degrees of freedom. Our goal is to reproduce
+the computation behind these values and the Figure 4 of this article,
+possibly in a nicer looking way.
 
- We will be using the Python 3 language using the pandas, statsmodels,
- and numpy library.
+** Technical information on the computer on which the analysis is run
 
- #+begin_src python :results output :session :exports both
+We will be using the Python 3 language using the pandas, statsmodels,
+and numpy library.
+
+#+begin_src python :results output :session :exports both
 def print_imported_modules():
     import sys
     for name, val in sorted(sys.modules.items()):
@@ -45,21 +48,21 @@ import seaborn as sns
 
 print_sys_info()
 print_imported_modules()
- #+end_src
+#+end_src
 
-*** Loading and inspecting data
+** Loading and inspecting data
 
- Let's start by reading data.
+Let's start by reading data.
 
-  #+begin_src python :results output :session :exports both
+#+begin_src python :results output :session :exports both
 data = pd.read_csv("https://app-learninglab.inria.fr/gitlab/moocrr-session1/moocrr-reproducibility-study/raw/master/data/shuttle.csv")
 print(data)
-  #+end_src
+#+end_src
 
- We know from our previous experience on this data set that filtering
- data is a really bad idea. We will therefore process it as such.
+We know from our previous experience on this data set that filtering
+data is a really bad idea. We will therefore process it as such.
 
-  #+begin_src python :results file :session :var matplot_lib_filename=(org-babel-temp-file "figure" ".png") :exports both
+#+begin_src python :results file :session :var matplot_lib_filename=(org-babel-temp-file "figure" ".png") :exports both
 %matplotlib inline
 pd.set_option('mode.chained_assignment',None) # this removes a useless warning from pandas
 
@@ -70,15 +73,15 @@ plt.tight_layout()
 
 plt.savefig(matplot_lib_filename)
 matplot_lib_filename
-  #+end_src
+#+end_src
 
-*** Logistic regression
+** Logistic regression
 
- Let's assume O-rings independently fail with the same probability
- which solely depends on temperature. A logistic regression should
- allow us to estimate the influence of temperature.
+Let's assume O-rings independently fail with the same probability
+which solely depends on temperature. A logistic regression should
+allow us to estimate the influence of temperature.
 
- #+begin_src python :results output :session :exports both
+#+begin_src python :results output :session :exports both
 import statsmodels.api as sm
 
 data["Success"]=data.Count-data.Malfunction
@@ -88,46 +91,52 @@ logmodel=sm.GLM(data['Frequency'], data[['Intercept','Temperature']],
                 family=sm.families.Binomial(sm.families.links.logit)).fit()
 
 print(logmodel.summary())
- #+end_src
-
- The maximum likelyhood estimator of the intercept and of Temperature
- are thus *$\hat{\alpha}$ = 5.0850* and *$\hat{\beta}$ = -0.1156*. This *corresponds*
- to the values from the article of Dalal /et al./ The standard errors are 
- /$s_{\hat{\alpha}}$ = 7.477/ and /$s_{\hat{\beta}}$ = 0.115/, which is *different* from
- the *3.052* and *0.04702* reported by Dallal /et al./ The deviance is
- /3.01444/ with *21* degrees of freedom. I cannot find any value similar
- to the Goodness of fit (*$G^2$ = 18.086*) reported by Dalal /et al./ There
- seems to be something wrong. Oh I know, I haven't indicated that my
- observations are actually the result of 6 observations for each rocket
- launch. Let's indicate these weights (since the weights are always the
- same throughout all experiments, it does not change the estimates of
- the fit but it does influence the variance estimates).
-
- #+begin_src python :results output :session :exports both
+#+end_src
+
+The maximum likelyhood estimator of the intercept and of Temperature
+are thus *$\hat{\alpha}$ = 5.0850* and *$\hat{\beta}$ = -0.1156*. This *corresponds*
+to the values from the article of Dalal /et al./ The standard errors are 
+/$s_{\hat{\alpha}}$ = 7.477/ and /$s_{\hat{\beta}}$ = 0.115/, which is *different* from
+the *3.052* and *0.04702* reported by Dallal /et al./ The deviance is
+/3.01444/ with *21* degrees of freedom. I cannot find any value similar
+to the Goodness of fit (*$G^2$ = 18.086*) reported by Dalal /et al./ There
+seems to be something wrong. Oh I know, I haven't indicated that my
+observations are actually the result of 6 observations for each rocket
+launch. Let's indicate these weights (since the weights are always the
+same throughout all experiments, it does not change the estimates of
+the fit but it does influence the variance estimates).
+
+#+begin_src python :results output :session :exports both
 logmodel=sm.GLM(data['Frequency'], data[['Intercept','Temperature']], 
                 family=sm.families.Binomial(sm.families.links.logit),
                 var_weights=data['Count']).fit()
 
 print(logmodel.summary())
- #+end_src
+#+end_src
 
- Good, now I have recovered the asymptotic standard errors
- *$s_{\hat{\alpha}}$ = 3.052* and *$s_{\hat{\beta}}$ = 0.047*. The Goodness of fit
- (Deviance) indicated for this model is *$G^2$ = 18.086* with *21* degrees
- of freedom (Df Residuals).
+Good, now I have recovered the asymptotic standard errors
+*$s_{\hat{\alpha}}$ = 3.052* and *$s_{\hat{\beta}}$ = 0.047*. The Goodness of fit
+(Deviance) indicated for this model is *$G^2$ = 18.086* with *21* degrees
+of freedom (Df Residuals).
 
- *I have therefore managed to fully replicate the results of the Dalal
- /et al./ article*.
+*I have therefore managed to fully replicate the results of the Dalal
+/et al./ article*.
 
-*** Predicting failure probability
+** Predicting failure probability
 
- The temperature when launching the shuttle was 31°F. Let's try to
- estimate the failure probability for such temperature using our model:
+The temperature when launching the shuttle was 31°F. Let's try to
+estimate the failure probability for such temperature using our model:
 
- #+begin_src python :results file :session :var matplot_lib_filename=(org-babel-temp-file "figure" ".png") :exports both
-%matplotlib inline
-data_pred = pd.DataFrame({'Temperature': np.linspace(start=30, stop=90, num=121), 'Intercept': 1})
+#+begin_src python :results output :session :exports both
+data_pred = pd.DataFrame({'Temperature': np.linspace(start=30, stop=90, num=121), 
+                          'Intercept': 1})
 data_pred['Frequency'] = logmodel.predict(data_pred)
+print(data_pred.head())
+#+end_src
+
+#+begin_src python :results file :session :var matplot_lib_filename=(org-babel-temp-file "figure" ".png") :exports both
+%matplotlib inline
+
 
 data_pred.plot(x="Temperature",y="Frequency",kind="line",ylim=[0,1])
 plt.scatter(x=data["Temperature"],y=data["Frequency"])
@@ -135,27 +144,26 @@ plt.grid(True)
 
 plt.savefig(matplot_lib_filename)
 matplot_lib_filename
- #+end_src
-
- #+begin_src python :results output :session :exports both
-print(data_pred.head())
- #+end_src
+#+end_src
 
- La fonction =logmodel.predict(data_pred)= ne fonctionne pas correctement (Frequency = 1 pour toutes les
- températures).
+La fonction =logmodel.predict(data_pred)= ne fonctionne pas avec les
+dernières versions de pandas (Frequency = 1 pour toutes les températures).
 
- On peut alors utiliser le code suivant pour calculer les prédictions
- et tracer la courbe :
+On peut alors utiliser le code suivant pour calculer les prédictions
+et tracer la courbe :
 
- #+begin_src python :results file :session :var matplot_lib_filename=(org-babel-temp-file "figure" ".png") :exports both
+#+begin_src python :results output :session :exports both
 # Inspiring from http://blog.yhat.com/posts/logistic-regression-and-python.html
-%matplotlib inline
 def logit_inv(x):
     return(np.exp(x)/(np.exp(x)+1))
 
-data_pred['Prob']=logit_inv(data_pred['Temperature'] * logmodel.params['Temperature'] + logmodel.params['Intercept'])
+data_pred['Prob']=logit_inv(data_pred['Temperature'] * logmodel.params['Temperature'] + 
+                            logmodel.params['Intercept'])
 print(data_pred.head())
+#+end_src
 
+#+begin_src python :results file :session :var matplot_lib_filename=(org-babel-temp-file "figure" ".png") :exports both
+%matplotlib inline
 data_pred.plot(x="Temperature",y="Prob",kind="line",ylim=[0,1])
 plt.scatter(x=data["Temperature"],y=data["Frequency"])
 plt.grid(True)
@@ -164,16 +172,16 @@ plt.savefig(matplot_lib_filename)
 matplot_lib_filename
 #+end_src
 
- This figure is very similar to the Figure 4 of Dalal /et al./ *I have
- managed to replicate the Figure 4 of the Dalal /et al./ article.*
+This figure is very similar to the Figure 4 of Dalal /et al./ *I have
+managed to replicate the Figure 4 of the Dalal /et al./ article.*
 
-*** Computing and plotting uncertainty
+** Computing and plotting uncertainty
 
- Following the documentation of
- [Seaborn](https://seaborn.pydata.org/generated/seaborn.regplot.html),
- I use regplot.
+Following the documentation of
+[Seaborn](https://seaborn.pydata.org/generated/seaborn.regplot.html),
+I use regplot.
 
- #+begin_src python :results file :session :var matplot_lib_filename=(org-babel-temp-file "figure" ".png") :exports both
+#+begin_src python :results file :session :var matplot_lib_filename=(org-babel-temp-file "figure" ".png") :exports both
 sns.set(color_codes=True)
 plt.xlim(30,90)
 plt.ylim(0,1)
@@ -182,12 +190,12 @@ plt.show()
 
 plt.savefig(matplot_lib_filename)
 matplot_lib_filename
- #+end_src
-
- **I think I have managed to correctly compute and plot the uncertainty
-   of my prediction.** Although the shaded area seems very similar to
-   [the one obtained by with
-   R](https://app-learninglab.inria.fr/gitlab/moocrr-session1/moocrr-reproducibility-study/raw/5c9dbef11b4d7638b7ddf2ea71026e7bf00fcfb0/challenger.pdf),
-   I can spot a few differences (e.g., the blue point for temperature
-   63 is outside)... Could this be a numerical error ? Or a difference
-   in the statistical method ? It is not clear which one is "right".
+#+end_src
+
+**I think I have managed to correctly compute and plot the uncertainty
+  of my prediction.** Although the shaded area seems very similar to
+  [the one obtained by with
+  R](https://app-learninglab.inria.fr/gitlab/moocrr-session1/moocrr-reproducibility-study/raw/5c9dbef11b4d7638b7ddf2ea71026e7bf00fcfb0/challenger.pdf),
+  I can spot a few differences (e.g., the blue point for temperature
+  63 is outside)... Could this be a numerical error ? Or a difference
+  in the statistical method ? It is not clear which one is "right".
-- 
2.18.1