"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"%matplotlib inline\n",
"pd.set_option('mode.chained_assignment',None) # this removes a useless warning from pandas\n",
"import matplotlib.pyplot as plt\n",
"\n",
"data[\"Frequency\"]=data.Malfunction/data.Count\n",
"data.plot(x=\"Temperature\",y=\"Frequency\",kind=\"scatter\",ylim=[0,1])\n",
"plt.grid(True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Logistic regression\n",
"\n",
"Let's assume that the O-rings fail independently, each with the same probability, which depends solely on temperature. A logistic regression should then allow us to estimate the influence of temperature."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"\n",
"\"\"\"\n",
" Generalized Linear Model Regression Results \n",
"==============================================================================\n",
"Dep. Variable: Frequency No. Observations: 23\n",
"Model: GLM Df Residuals: 21\n",
"Model Family: Binomial Df Model: 1\n",
"Link Function: logit Scale: 1.0000\n",
"Method: IRLS Log-Likelihood: -3.9210\n",
"Date: Mon, 06 Apr 2020 Deviance: 3.0144\n",
"Time: 11:51:34 Pearson chi2: 5.00\n",
"No. Iterations: 6 Covariance Type: nonrobust\n",
"===============================================================================\n",
" coef std err z P>|z| [0.025 0.975]\n",
"-------------------------------------------------------------------------------\n",
"Intercept 5.0850 7.477 0.680 0.496 -9.570 19.740\n",
"Temperature -0.1156 0.115 -1.004 0.316 -0.341 0.110\n",
"===============================================================================\n",
"\"\"\""
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import statsmodels.api as sm\n",
"\n",
"data[\"Success\"]=data.Count-data.Malfunction\n",
"data[\"Intercept\"]=1\n",
"\n",
"logmodel=sm.GLM(data['Frequency'], data[['Intercept','Temperature']], \n",
" family=sm.families.Binomial(sm.families.links.logit)).fit()\n",
"\n",
"logmodel.summary()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The maximum likelihood estimates of the intercept and of the Temperature coefficient are thus $\\hat{\\alpha}=5.0850$ and $\\hat{\\beta}=-0.1156$. These **correspond** to the values from the article of Dalal *et al.* The standard errors are $s_{\\hat{\\alpha}} = 7.477$ and $s_{\\hat{\\beta}} = 0.115$, which are **different** from the $3.052$ and $0.04702$ reported by Dalal *et al.* The deviance is $3.01444$ with 21 degrees of freedom. I cannot find any value similar to the goodness of fit ($G^2=18.086$) reported by Dalal *et al.* Something seems to be wrong. The cause: I have not indicated that each of my observations actually aggregates 6 observations per rocket launch. Let's supply these counts as weights (since the weights are the same for every launch, they do not change the estimates of the fit, but they do influence the variance estimates)."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"\n",
"\"\"\"\n",
" Generalized Linear Model Regression Results \n",
"==============================================================================\n",
"Dep. Variable: Frequency No. Observations: 23\n",
"Model: GLM Df Residuals: 21\n",
"Model Family: Binomial Df Model: 1\n",
"Link Function: logit Scale: 1.0000\n",
"Method: IRLS Log-Likelihood: -23.526\n",
"Date: Mon, 06 Apr 2020 Deviance: 18.086\n",
"Time: 11:51:34 Pearson chi2: 30.0\n",
"No. Iterations: 6 Covariance Type: nonrobust\n",
"===============================================================================\n",
" coef std err z P>|z| [0.025 0.975]\n",
"-------------------------------------------------------------------------------\n",
"Intercept 5.0850 3.052 1.666 0.096 -0.898 11.068\n",
"Temperature -0.1156 0.047 -2.458 0.014 -0.208 -0.023\n",
"===============================================================================\n",
"\"\"\""
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"logmodel=sm.GLM(data['Frequency'], data[['Intercept','Temperature']], \n",
" family=sm.families.Binomial(sm.families.links.logit),\n",
" var_weights=data['Count']).fit()\n",
"\n",
"logmodel.summary()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Good, now I have recovered the asymptotic standard errors $s_{\\hat{\\alpha}}=3.052$ and $s_{\\hat{\\beta}}=0.047$.\n",
"The Goodness of fit (Deviance) indicated for this model is $G^2=18.086$ with 21 degrees of freedom (Df Residuals).\n",
"\n",
"**I have therefore managed to fully replicate the results of the Dalal *et al.* article**."
]
},
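{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sanity check on the two fits above (a sketch, assuming the weights act as the number of trials $n=6$ behind each observed frequency): multiplying every weight by $n$ multiplies the Fisher information and the deviance by $n$, so the standard errors shrink by a factor $\\sqrt{n}$ while the estimates stay the same.\n",
"\n",
"$$\\frac{7.477}{3.052} \\approx 2.450 \\approx \\sqrt{6}, \\qquad \\frac{0.115}{0.047} \\approx 2.45, \\qquad 6 \\times 3.0144 \\approx 18.086$$\n",
"\n",
"The values reported in the two summary tables are consistent with this scaling."
]
},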
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Predicting failure probability\n",
"The temperature at launch was 31°F. Let's estimate the failure probability at that temperature using our model:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
""
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"%matplotlib inline\n",
"data_pred = pd.DataFrame({'Temperature': np.linspace(start=30, stop=90, num=121), 'Intercept': 1})\n",
"data_pred['Frequency'] = logmodel.predict(data_pred)\n",
"data_pred.plot(x=\"Temperature\",y=\"Frequency\",kind=\"line\",ylim=[0,1])\n",
"plt.scatter(x=data[\"Temperature\"],y=data[\"Frequency\"])\n",
"plt.grid(True)"
]
},
{
"cell_type": "markdown",
"metadata": {
"hideCode": false,
"hidePrompt": false,
"scrolled": true
},
"source": [
"This figure is very similar to Figure 4 of Dalal *et al.* **I have managed to replicate Figure 4 of the Dalal *et al.* article.**"
]
},
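{
"cell_type": "markdown",
"metadata": {},
"source": [
"For reference, the point estimate at 31°F can be worked out by hand from the fitted coefficients (a sketch using the rounded values $\\hat{\\alpha}=5.0850$ and $\\hat{\\beta}=-0.1156$ from the summary table, so the last digits are approximate):\n",
"\n",
"$$\\hat{p}(t) = \\frac{1}{1+e^{-(\\hat{\\alpha}+\\hat{\\beta}t)}}, \\qquad \\hat{\\alpha}+31\\,\\hat{\\beta} \\approx 5.0850 - 3.5836 = 1.5014, \\qquad \\hat{p}(31) \\approx \\frac{1}{1+e^{-1.5014}} \\approx 0.82$$\n",
"\n",
"This is the estimated per-observation failure probability under the model, read at the cold end of the 30-90°F range plotted above."
]
},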
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Computing and plotting uncertainty"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Following the documentation of [Seaborn](https://seaborn.pydata.org/generated/seaborn.regplot.html), I use regplot."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/opt/conda/lib/python3.6/site-packages/scipy/stats/stats.py:1713: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.\n",
" return np.add.reduce(sorted[indexer] * weights, axis=axis) / sumval\n"
]
},
{
"data": {
"image/png": "\n",
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"sns.set(color_codes=True)\n",
"plt.xlim(30,90)\n",
"plt.ylim(0,1)\n",
"sns.regplot(x='Temperature', y='Frequency', data=data, logistic=True)\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**I think I have managed to correctly compute and plot the uncertainty of my prediction.** Although the shaded area seems very similar to [the one obtained with R](https://app-learninglab.inria.fr/gitlab/moocrr-session1/moocrr-reproducibility-study/raw/5c9dbef11b4d7638b7ddf2ea71026e7bf00fcfb0/challenger.pdf), I can spot a few differences (e.g., the blue point for temperature 63 is outside the shaded area). Could this be a numerical error? Or a difference in the statistical method? It is not clear which one is \"right\"."
]
}
],
"metadata": {
"celltoolbar": "Hide code",
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.4"
}
},
"nbformat": 4,
"nbformat_minor": 2
}