* Risk Analysis of the Space Shuttle: Pre-Challenger Prediction of Failure
:PROPERTIES:
:CUSTOM_ID: risk-analysis-of-the-space-shuttle-pre-challenger-prediction-of-failure
:END:

NDT: This document is based on the [[https://app-learninglab.inria.fr/moocrr/gitlab/moocrr-session3/moocrr-reproducibility-study/blob/master/src/Python3/challenger.ipynb][original ipynb notebook]] from the MOOC, which was converted into an org notebook thanks to pandoc. The code blocks were then manually edited (to a minimal extent) to retrieve locally-executed results; edits include removing indentation, adding =plt.savefig()=, etc.

In this document we reperform some of the analysis provided in /Risk Analysis of the Space Shuttle: Pre-Challenger Prediction of Failure/ by /Siddhartha R. Dalal, Edward B. Fowlkes, Bruce Hoadley/, published in the /Journal of the American Statistical Association/, Vol. 84, No. 408 (Dec., 1989), pp. 945-957, and available at [[http://www.jstor.org/stable/2290069]].

On the fourth page of this article, the authors indicate that the maximum likelihood estimates of the logistic regression using only temperature are $\hat{\alpha}=5.085$ and $\hat{\beta}=-0.1156$, and that their asymptotic standard errors are $s_{\hat{\alpha}}=3.052$ and $s_{\hat{\beta}}=0.047$. The goodness of fit indicated for this model is $G^2=18.086$ with 21 degrees of freedom. Our goal is to reproduce the computation behind these values and Figure 4 of this article, possibly in a nicer-looking way.

** Technical information on the computer on which the analysis is run
:PROPERTIES:
:CUSTOM_ID: technical-information-on-the-computer-on-which-the-analysis-is-run
:END:

We will be using Python 3 with the pandas, statsmodels, numpy, matplotlib and seaborn libraries.
#+BEGIN_SRC python :session :exports both :results output
def print_imported_modules():
    import sys
    for name, val in sorted(sys.modules.items()):
        if hasattr(val, '__version__'):
            print(val.__name__, val.__version__)
        # else:
        #     print(val.__name__, "(unknown version)")

def print_sys_info():
    import sys
    import platform
    print(sys.version)
    print(platform.uname())

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
import seaborn as sns

print_sys_info()
print_imported_modules()
#+END_SRC

#+RESULTS:
#+begin_example
3.8.1 (default, Jan  8 2020, 22:29:32)
[GCC 7.3.0]
uname_result(system='Linux', node='hpisor', release='4.9.0-12-amd64', version='#1 SMP Debian 4.9.210-1 (2020-01-20)', machine='x86_64', processor='')
_csv 1.0
_ctypes 1.1.0
decimal 1.70
argparse 1.1
csv 1.0
ctypes 1.1.0
cycler 0.10.0
dateutil 2.8.1
decimal 1.70
distutils 3.8.1
json 2.0.9
kiwisolver 1.2.0
logging 0.5.1.2
matplotlib 3.1.3
matplotlib.backends.backend_agg 3.1.3
matplotlib.backends.backend_qt5 5.9.2
matplotlib.backends.qt_compat 5.9.2
matplotlib.backends.qt_editor._formlayout 1.0.10
mkl 2.3.0
numpy 1.18.1
numpy.core 1.18.1
numpy.core._multiarray_umath 3.1
numpy.lib 1.18.1
numpy.linalg._umath_linalg b'0.1.5'
pandas 1.0.3
patsy 0.5.1
patsy.version 0.5.1
platform 1.0.8
pyparsing 2.4.7
pytz 2020.1
re 2.2.1
scipy 1.4.1
scipy._lib._uarray 0.5.1+5.ga864a57.scipy
scipy._lib.decorator 4.0.5
scipy._lib.six 1.2.0
scipy.integrate._dop b'$Revision: $'
scipy.integrate._ode $Id$
scipy.integrate._odepack 1.9
scipy.integrate._quadpack 1.13
scipy.integrate.lsoda b'$Revision: $'
scipy.integrate.vode b'$Revision: $'
scipy.interpolate._fitpack 1.7
scipy.interpolate.dfitpack b'$Revision: $'
scipy.linalg 0.4.9
scipy.linalg._fblas b'$Revision: $'
scipy.linalg._flapack b'$Revision: $'
scipy.linalg._flinalg b'$Revision: $'
scipy.ndimage 2.0
scipy.optimize._cobyla b'$Revision: $'
scipy.optimize._lbfgsb b'$Revision: $'
scipy.optimize._minpack 1.10
scipy.optimize._nnls b'$Revision: $'
scipy.optimize._slsqp b'$Revision: $'
scipy.optimize.minpack2 b'$Revision: $'
scipy.signal.spline 0.2
scipy.sparse.linalg.eigen.arpack._arpack b'$Revision: $'
scipy.sparse.linalg.isolve._iterative b'$Revision: $'
scipy.special.specfun b'$Revision: $'
scipy.stats.mvn b'$Revision: $'
scipy.stats.statlib b'$Revision: $'
seaborn 0.10.1
seaborn.external.husl 2.1.0
six 1.14.0
statsmodels 0.11.0
statsmodels.__init__ 0.11.0
statsmodels.api 0.11.0
statsmodels.tools.web 0.11.0
urllib.request 3.8
zlib 1.0
#+end_example

** Loading and inspecting data
:PROPERTIES:
:CUSTOM_ID: loading-and-inspecting-data
:END:

Let's start by reading the data.

Note: the url was corrected based on a forum post at https://www.fun-mooc.fr/courses/course-v1:inria+41016+self-paced/courseware/7bf2267c336246f9b6518db624692e14/96b7ce47bd11466a9a2e63d8e8a93d99/

#+BEGIN_SRC python :session :exports both :results value
data = pd.read_csv("https://app-learninglab.inria.fr/moocrr/gitlab/moocrr-session3/moocrr-reproducibility-study/raw/master/data/shuttle.csv")
data
#+END_SRC

#+RESULTS:
#+begin_example
        Date  Count  Temperature  Pressure  Malfunction
0    4/12/81      6           66        50            0
1   11/12/81      6           70        50            1
2    3/22/82      6           69        50            0
3   11/11/82      6           68        50            0
4    4/04/83      6           67        50            0
5    6/18/82      6           72        50            0
6    8/30/83      6           73       100            0
7   11/28/83      6           70       100            0
8    2/03/84      6           57       200            1
9    4/06/84      6           63       200            1
10   8/30/84      6           70       200            1
11  10/05/84      6           78       200            0
12  11/08/84      6           67       200            0
13   1/24/85      6           53       200            2
14   4/12/85      6           67       200            0
15   4/29/85      6           75       200            0
16   6/17/85      6           70       200            0
17   7/29/85      6           81       200            0
18   8/27/85      6           76       200            0
19  10/03/85      6           79       200            0
20  10/30/85      6           75       200            2
21  11/26/85      6           76       200            0
22   1/12/86      6           58       200            1
#+end_example

We know from our previous experience on this data set that filtering data is a really bad idea. We will therefore process the data set as a whole.
#+BEGIN_SRC python :session :exports both :results output
# %matplotlib inline
pd.set_option('mode.chained_assignment', None)  # this removes a useless warning from pandas
import matplotlib.pyplot as plt

data["Frequency"] = data.Malfunction / data.Count
data.plot(x="Temperature", y="Frequency", kind="scatter", ylim=[0, 1])
plt.grid(True)
plt.savefig("fig1.png")
#+END_SRC

#+RESULTS:
: /home/eliox/miniconda3/envs/mooc-rr-emacs/lib/python3.8/site-packages/pandas/plotting/_matplotlib/core.py:320: RuntimeWarning: More than 20 figures have been opened. Figures created through the pyplot interface (`matplotlib.pyplot.figure`) are retained until explicitly closed and may consume too much memory. (To control this warning, see the rcParam `figure.max_open_warning`).
:   fig = self.plt.figure(figsize=self.figsize)
: 'c' argument looks like a single numeric RGB or RGBA sequence, which should be avoided as value-mapping will have precedence in case its length matches with 'x' & 'y'. Please use a 2-D array with a single row if you really want to specify the same RGB or RGBA value for all points.

[[./fig1.png]]

** Logistic regression
:PROPERTIES:
:CUSTOM_ID: logistic-regression
:END:

Let's assume that the O-rings independently fail with the same probability, which solely depends on temperature. A logistic regression should allow us to estimate the influence of temperature.

#+BEGIN_SRC python :session :exports both :results output
import statsmodels.api as sm

data["Success"] = data.Count - data.Malfunction
data["Intercept"] = 1
logmodel = sm.GLM(data['Frequency'], data[['Intercept', 'Temperature']],
                  family=sm.families.Binomial(sm.families.links.logit)).fit()
print(logmodel.summary())
#+END_SRC

#+RESULTS:
#+begin_example
/tmp/babel-tMc2Ef/python-Skn7ZF:7: DeprecationWarning: Calling Family(..) with a link class as argument is deprecated. Use an instance of a link class instead.
  family=sm.families.Binomial(sm.families.links.logit)).fit()
                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:              Frequency   No. Observations:                   23
Model:                            GLM   Df Residuals:                       21
Model Family:                Binomial   Df Model:                            1
Link Function:                  logit   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -3.9210
Date:               mer., 06 mai 2020   Deviance:                       3.0144
Time:                        18:03:21   Pearson chi2:                     5.00
No. Iterations:                     6
Covariance Type:            nonrobust
===============================================================================
                  coef    std err          z      P>|z|      [0.025      0.975]
-------------------------------------------------------------------------------
Intercept       5.0850      7.477      0.680      0.496      -9.570      19.740
Temperature    -0.1156      0.115     -1.004      0.316      -0.341       0.110
===============================================================================
#+end_example

The maximum likelihood estimates of the intercept and of Temperature are thus $\hat{\alpha}=5.0850$ and $\hat{\beta}=-0.1156$. This *corresponds* to the values from the article of Dalal /et al./ The standard errors, however, are $s_{\hat{\alpha}} = 7.477$ and $s_{\hat{\beta}} = 0.115$, which is *different* from the $3.052$ and $0.04702$ reported by Dalal /et al./ The deviance is $3.0144$ with 21 degrees of freedom, and I cannot find any value similar to the goodness of fit ($G^2=18.086$) reported by Dalal /et al./

There seems to be something wrong. Oh, I know: I haven't indicated that each of my observations is actually the aggregate of 6 trials (one per O-ring) for each rocket launch. Let's indicate these weights. Since the weights are the same throughout all experiments, this does not change the estimates of the fit, but it does influence the variance estimates.
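This claim about the weights can also be cross-checked independently. The sketch below is my own check, not part of the original analysis: it fits the same binomial GLM on aggregated (malfunction, success) counts, which encodes the 6 trials per launch explicitly, and should therefore reproduce the article's standard errors and goodness of fit. The data arrays are inlined from the table above so the block is self-contained.

#+BEGIN_SRC python :exports both :results output
# Cross-check (my own, not in the original notebook): a binomial GLM with a
# two-column endog (events, non-events) accounts for the 6 trials per launch.
import numpy as np
import statsmodels.api as sm

temperature = np.array([66, 70, 69, 68, 67, 72, 73, 70, 57, 63, 70, 78,
                        67, 53, 67, 75, 70, 81, 76, 79, 75, 76, 58])
malfunction = np.array([0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0,
                        0, 2, 0, 0, 0, 0, 0, 0, 2, 0, 1])
count = np.full_like(malfunction, 6)

# endog as (malfunctions, successes): each launch contributes 6 Bernoulli trials
endog = np.column_stack([malfunction, count - malfunction])
exog = sm.add_constant(temperature)  # intercept column first, then temperature
result = sm.GLM(endog, exog, family=sm.families.Binomial()).fit()
print(result.params)    # ≈ [5.085, -0.1156]
print(result.deviance)  # ≈ 18.086
#+END_SRC

This gives the same estimates as the frequency fit, but with the article's standard errors and $G^2$.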
#+BEGIN_SRC python :session :exports both :results value
logmodel = sm.GLM(data['Frequency'], data[['Intercept', 'Temperature']],
                  family=sm.families.Binomial(sm.families.links.logit),
                  var_weights=data['Count']).fit()
logmodel.summary()
#+END_SRC

#+RESULTS:
#+begin_example
                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:              Frequency   No. Observations:                   23
Model:                            GLM   Df Residuals:                       21
Model Family:                Binomial   Df Model:                            1
Link Function:                  logit   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -23.526
Date:               mer., 06 mai 2020   Deviance:                       18.086
Time:                        18:03:21   Pearson chi2:                     30.0
No. Iterations:                     6
Covariance Type:            nonrobust
===============================================================================
                  coef    std err          z      P>|z|      [0.025      0.975]
-------------------------------------------------------------------------------
Intercept       5.0850      3.052      1.666      0.096      -0.898      11.068
Temperature    -0.1156      0.047     -2.458      0.014      -0.208      -0.023
===============================================================================
#+end_example

Good, now I have recovered the asymptotic standard errors $s_{\hat{\alpha}}=3.052$ and $s_{\hat{\beta}}=0.047$. The goodness of fit (Deviance) indicated for this model is $G^2=18.086$ with 21 degrees of freedom (Df Residuals). *I have therefore managed to fully replicate the results of the Dalal /et al./ article.*

** Predicting failure probability
:PROPERTIES:
:CUSTOM_ID: predicting-failure-probability
:END:

The temperature when launching the shuttle was 31°F.
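Before querying the fitted model over a temperature grid, the order of magnitude can be obtained by hand. The following is a minimal sketch (my own, not in the original notebook) that plugs the article's estimates $\hat{\alpha}=5.085$ and $\hat{\beta}=-0.1156$ into the logistic function:

#+BEGIN_SRC python :exports both :results output
# Hand computation using the published estimates of Dalal et al.
import math

alpha, beta = 5.085, -0.1156  # estimates reported in the article

def failure_probability(temperature):
    """Logistic model: p = 1 / (1 + exp(-(alpha + beta * T)))."""
    linear = alpha + beta * temperature
    return 1 / (1 + math.exp(-linear))

print(round(failure_probability(31), 3))  # ≈ 0.818
#+END_SRC

At 31°F the model predicts a failure probability of roughly 0.82 per O-ring, far above anything observed in the launch data.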
Let's try to estimate the failure probability for such a temperature using our model.

#+BEGIN_SRC python :session :results output :exports both
# %matplotlib inline
# The columns passed to predict() must be in the same order as in the fit,
# i.e., Intercept first, then Temperature.
data_pred = pd.DataFrame({'Intercept': 1,
                          'Temperature': np.linspace(start=30, stop=90, num=121)})
data_pred['Frequency'] = logmodel.predict(data_pred[['Intercept', 'Temperature']])
data_pred.plot(x="Temperature", y="Frequency", kind="scatter", ylim=[0, 1])
plt.scatter(x=data["Temperature"], y=data["Frequency"])
plt.grid(True)
plt.savefig("fig2.png")
#+END_SRC

#+RESULTS:

[[./fig2.png]]

This figure is very similar to Figure 4 of Dalal /et al./ *I have managed to replicate the Figure 4 of the Dalal /et al./ article.*

** Computing and plotting uncertainty
:PROPERTIES:
:CUSTOM_ID: computing-and-plotting-uncertainty
:END:

Following the documentation of [[https://seaborn.pydata.org/generated/seaborn.regplot.html][Seaborn]], I use regplot.

#+BEGIN_SRC python :session :results output :exports both
sns.set(color_codes=True)
plt.figure(figsize=(5, 3))
plt.xlim(30, 90)
plt.ylim(0, 1)
sns.regplot(x='Temperature', y='Frequency', data=data, logistic=True)
plt.savefig("fig3.png")
#+END_SRC

#+RESULTS:
: /tmp/babel-tMc2Ef/python-j6HG2E:2: RuntimeWarning: More than 20 figures have been opened. Figures created through the pyplot interface (`matplotlib.pyplot.figure`) are retained until explicitly closed and may consume too much memory. (To control this warning, see the rcParam `figure.max_open_warning`).
:   plt.figure(figsize=(5, 3))

[[./fig3.png]]

*I think I have managed to correctly compute and plot the uncertainty of my prediction.* Although the shaded area seems very similar to [[https://app-learninglab.inria.fr/moocrr/gitlab/moocrr-session3/moocrr-reproducibility-study/tree/master/challenger.pdf][the one obtained with R]], I can spot a few differences (e.g., the blue point for temperature 63 is outside the band). Could this be a numerical error? Or a difference in the statistical method? It is not clear which one is "right".
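One way to probe where the difference comes from is to draw the band ourselves. The sketch below is my own check, not part of the original notebook: it computes an asymptotic 95% confidence band with the delta method from the covariance matrix of the weighted fit, whereas regplot obtains its band by bootstrapping, which alone could explain small discrepancies in the shaded area.

#+BEGIN_SRC python :exports both :results output
# Delta-method confidence band (a sketch; regplot bootstraps instead).
import numpy as np
import statsmodels.api as sm
from scipy.special import expit

temperature = np.array([66, 70, 69, 68, 67, 72, 73, 70, 57, 63, 70, 78,
                        67, 53, 67, 75, 70, 81, 76, 79, 75, 76, 58])
malfunction = np.array([0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0,
                        0, 2, 0, 0, 0, 0, 0, 0, 2, 0, 1])
endog = np.column_stack([malfunction, 6 - malfunction])
exog = sm.add_constant(temperature.astype(float))
fit = sm.GLM(endog, exog, family=sm.families.Binomial()).fit()

grid = np.linspace(30, 90, 121)
X = sm.add_constant(grid)
linpred = X @ fit.params
# Variance of the linear predictor: diag(X V X'), V = covariance of estimates
se = np.sqrt(np.einsum('ij,jk,ik->i', X, fit.cov_params(), X))
lower = expit(linpred - 1.96 * se)  # map the band back through the logistic
upper = expit(linpred + 1.96 * se)
#+END_SRC

Plotting =lower= and =upper= against =grid= (e.g. with =plt.fill_between=) would give a band directly comparable with regplot's shaded area.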