mooc-rr-ressources, commit 0d300a94
Authored Oct 01, 2018 by Konrad Hinsen
English translation of challenger.org
Parent: ce95d01f
Showing 1 changed file with 263 additions and 0 deletions
module2/exo5/challenger-en.org (new file, mode 100644)

#+TITLE: Analysis of the risk of failure of the O-rings of the space shuttle Challenger
#+AUTHOR: Konrad Hinsen, Arnaud Legrand, Christophe Pouzat
#+DATE: June 2018
#+LANGUAGE: en
#+OPTIONS: H:3 creator:nil timestamp:nil skip:nil toc:nil num:t ^:nil ~:~
# #+OPTIONS: author:nil title:nil date:nil
#+HTML_HEAD: <link rel="stylesheet" type="text/css" href="http://www.pirilampo.org/styles/readtheorg/css/htmlize.css"/>
#+HTML_HEAD: <link rel="stylesheet" type="text/css" href="http://www.pirilampo.org/styles/readtheorg/css/readtheorg.css"/>
#+HTML_HEAD: <script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.3/jquery.min.js"></script>
#+HTML_HEAD: <script src="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.4/js/bootstrap.min.js"></script>
#+HTML_HEAD: <script type="text/javascript" src="http://www.pirilampo.org/styles/lib/js/jquery.stickytableheaders.js"></script>
#+HTML_HEAD: <script type="text/javascript" src="http://www.pirilampo.org/styles/readtheorg/js/readtheorg.js"></script>
#+LATEX_HEADER: \usepackage[utf8]{inputenc}
#+LATEX_HEADER: \usepackage[T1]{fontenc}
#+LATEX_HEADER: \usepackage{textcomp}
#+LATEX_HEADER: \usepackage[a4paper,margin=.8in]{geometry}
#+LATEX_HEADER: \usepackage[usenames,dvipsnames,svgnames,table]{xcolor}
#+LATEX_HEADER: \usepackage{palatino}
#+LATEX_HEADER: \usepackage{svg}
#+LATEX_HEADER: \let\epsilon=\varepsilon

*Foreword:* The explanations given in this document about the context
of the study have been taken from the excellent book /Visual
Explanations: Images and Quantities, Evidence and Narrative/ by Edward
R. Tufte, published in 1997 by /Graphics Press/ and reissued in 2005,
and from the article /Risk Analysis of the Space Shuttle:
Pre-Challenger Prediction of Failure/ by Dalal et al., published in
1989 in the /Journal of the American Statistical Association/.

* Context

In this study, we propose a re-examination of the [[https://en.wikipedia.org/wiki/Space_Shuttle_Challenger_disaster][space shuttle
Challenger disaster]]. On January 28th, 1986, the space shuttle
Challenger exploded 73 seconds after launch (see Figure [[fig:photo]]),
causing the death of the seven astronauts on board. The explosion was
caused by the failure of the two O-ring seals between the upper and
lower parts of the boosters (see Figure [[fig:oring]]). The seals had
lost their effectiveness because of the exceptionally cold weather at
the time of launch. The temperature on that morning was just below
0°C, whereas the preceding flights had been launched at temperatures
at least 7 to 10°C higher.

#+NAME: fig:photo
#+ATTR_LATEX: :width 10cm
#+CAPTION: Photographs of the Challenger catastrophe.
file:challenger5.jpg

#+NAME: fig:oring
#+ATTR_LATEX: :width 10cm
#+CAPTION: Diagram of the boosters of space shuttle Challenger. The rubber O-ring seals (a primary and a secondary one), more than 11 meters in circumference, prevent leaks between the upper and lower parts.
file:o-ring.png

# From
# https://i0.wp.com/www.kylehailey.com/wp-content/uploads/2014/01/Screen-Shot-2013-12-30-at-12.05.04-PM-1024x679.png?zoom=2&resize=594%2C393

What is most astonishing is that the precise cause of the accident had
been intensely debated several days before and was still under
discussion the day before the launch, during a three-hour
teleconference involving engineers from Morton Thiokol (the supplier
of the engines) and from NASA. Whereas the immediate cause of the
accident, the failure of the O-rings, was quickly identified, the
underlying causes of the disaster have regularly served as a case
study, be it in management training (work organisation, decision
making in spite of political pressure, communication problems),
statistics (risk evaluation, modelling, data visualization), or
sociology (history, bureaucracy, conforming to organisational norms).

In the study that we propose, we are mainly concerned with the
statistical aspect, which however is only one piece of the puzzle. We
invite you to read the documents cited in the foreword for more
information. The following study takes up a part of the analyses that
were done that night with the goal of evaluating the potential impact
of temperature and air pressure on the probability of O-ring
malfunction. The starting point is experimental results obtained by
NASA engineers over the six years preceding the Challenger launch.

In the directory ~module2/exo5/~ of your GitLab workspace, you will
find the original data as well as an analysis for each of the paths we
offer. This analysis consists of four steps:
1. Loading the data
2. Visual inspection
3. Estimation of the influence of temperature
4. Estimation of the probability of O-ring malfunction

The first two steps require only a basic knowledge of R or Python. The
third step assumes some familiarity with logistic regression, and the
fourth a basic knowledge of probability. In the next section, we give
an introduction to logistic regression that skips the details of the
computations and focuses instead on the interpretation of the results.

* Introduction to logistic regression

Suppose we have the following dataset that indicates, for a group of
people of varying age, whether or not they suffer from a specific
illness. I will present the analysis in R but Python code would look
quite similar. The data are stored in a data frame that is summarized
as:
#+begin_src R :results output :session *R* :exports none
library(Hmisc) # to compute a confidence interval on binomial data
library(ggplot2)
library(dplyr)
set.seed(42)
# True probability of being ill at a given age (logistic curve centred on 50)
proba = function(age) {
  val=(age-50)/4
  return(exp(val)/(1+exp(val)))
}
# Simulate 400 people: uniform ages, one Bernoulli draw per person
df = data.frame(Age = runif(400,min=22,max=80))
df$Ill = sapply(df$Age, function(x) rbinom(n=1,size=1,prob=proba(x)))
#+end_src

#+RESULTS:

#+begin_src R :results output :session *R* :exports both
summary(df)
str(df)
#+end_src

#+RESULTS:
#+begin_example
      Age             Ill
 Min.   :22.01   Min.   :0.000
 1st Qu.:35.85   1st Qu.:0.000
 Median :50.37   Median :1.000
 Mean   :50.83   Mean   :0.515
 3rd Qu.:65.37   3rd Qu.:1.000
 Max.   :79.80   Max.   :1.000
'data.frame':	400 obs. of  2 variables:
 $ Age: num  75.1 76.4 38.6 70.2 59.2 ...
 $ Ill: int  1 1 0 1 1 1 0 0 1 1 ...
#+end_example

Here is a plot that provides a better indication of the link that
could exist between age and illness:
#+begin_src R :results output graphics :file fig1.svg :exports both :width 4 :height 3 :session *R*
ggplot(df,aes(x=Age,y=Ill)) + geom_point(alpha=.3,size=3) + theme_bw()
#+end_src

#+ATTR_LATEX: :width 8cm
#+RESULTS:
[[file:fig1.svg]]

Clearly the probability of suffering from this illness increases with
age. But how can we estimate this probability based only on this
binary data ill/not ill? For each age group (of, for example, 5
years), we could look at the frequency of the illness. The following
code is a bit complicated because the computation of the confidence
interval for this kind of data requires a particular treatment using
the function =binconf=.
#+begin_src R :results output graphics :file fig1bis.svg :exports both :width 4 :height 3 :session *R*
age_range=5
df_grouped = df %>% mutate(Age=age_range*(floor(Age/age_range)+.5)) %>%
    group_by(Age) %>% summarise(Ill=sum(Ill),N=n()) %>%
    rowwise() %>%
    do(data.frame(Age=.$Age, binconf(.$Ill, .$N, alpha=0.05))) %>%
    as.data.frame()

ggplot(df_grouped,aes(x=Age)) + geom_point(data=df,aes(y=Ill),alpha=.3,size=3) +
    geom_errorbar(data=df_grouped,
                  aes(x=Age,ymin=Lower,ymax=Upper,y=PointEst), color="darkred") +
    geom_point(data=df_grouped, aes(x=Age,y=PointEst), size=3, shape=21, color="darkred") +
    theme_bw()
#+end_src

#+ATTR_LATEX: :width 8cm
#+RESULTS:
[[file:fig1bis.svg]]

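
For readers following the Python path, the same grouping can be
sketched with the standard library only. This is an illustration, not
the code used for the figure: the helper names =wilson_interval= and
=grouped_frequencies= are ours, the data are re-simulated with the
same law (so the individual draws differ from the R session), and the
Wilson score interval is used on the assumption that it corresponds
to the default method of =binconf=.

#+begin_src python :results output :exports both
import math
import random
from collections import defaultdict

def wilson_interval(k, n, z=1.959964):
    """Wilson score interval for k successes out of n trials (95% by default)."""
    p_hat = k / n
    denom = 1 + z * z / n
    centre = (p_hat + z * z / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half

def grouped_frequencies(ages, ill, width=5):
    """Group by age bracket and return (bracket centre, frequency, lower, upper)."""
    counts = defaultdict(lambda: [0, 0])      # centre -> [number ill, group size]
    for age, y in zip(ages, ill):
        centre = width * (math.floor(age / width) + 0.5)
        counts[centre][0] += y
        counts[centre][1] += 1
    rows = []
    for centre in sorted(counts):
        k, n = counts[centre]
        low, high = wilson_interval(k, n)
        rows.append((centre, k / n, low, high))
    return rows

# Hypothetical data with the same law as the simulated dataset:
# ages uniform on [22, 80], P[ill] = logistic((age - 50) / 4).
random.seed(42)
ages = [random.uniform(22, 80) for _ in range(400)]
ill = [1 if random.random() < 1 / (1 + math.exp(-(a - 50) / 4)) else 0 for a in ages]

for centre, freq, low, high in grouped_frequencies(ages, ill):
    print(f"{centre:5.1f}  {freq:.2f}  [{low:.2f}, {high:.2f}]")
#+end_src

Each printed row gives the centre of an age bracket, the observed
frequency of illness, and the bounds of the interval, i.e. the
quantities drawn as dark red points and error bars in the figure.
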
A disadvantage of this method is that the computation is done
independently for each age group, whose width moreover has been chosen
arbitrarily. For describing the evolution in a more continuous
fashion, we could apply a linear regression (which is the simplest
model for taking into account the influence of a parameter) and thus
estimate the impact of age on the probability of illness:
#+begin_src R :results output graphics :file fig2.svg :exports both :width 4 :height 3 :session *R*
ggplot(df,aes(x=Age,y=Ill)) + geom_point(alpha=.3,size=3) +
    theme_bw() + geom_smooth(method="lm")
#+end_src

#+ATTR_LATEX: :width 8cm
#+RESULTS:
[[file:fig2.svg]]

The blue line is the linear regression in the sense of least squares,
and the grey zone is the 95% confidence interval of this estimation.
In other words, given the dataset and the hypothesis of linearity, the
blue line is the most probable one and there is a 95% chance that the
true line is in the grey zone. It is, however, clear from the plot
that this estimation is meaningless. A probability must lie between 0
and 1, whereas a linear regression will inevitably lead to impossible
values (negative or greater than 1) for somewhat extreme age values
(young or old).

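
The problem can also be checked numerically. The following Python
sketch is only an illustration: it re-simulates data with the same
law as the dataset above (the draws differ from the R session), and
the helper =least_squares_line= is a name we introduce here. It fits
an ordinary least-squares line and evaluates it at both ends of the
age range.

#+begin_src python :results output :exports both
import math
import random

# Hypothetical data with the same law as the simulated dataset:
# ages uniform on [22, 80], P[ill] = logistic((age - 50) / 4).
random.seed(42)
ages = [random.uniform(22, 80) for _ in range(400)]
ill = [1 if random.random() < 1 / (1 + math.exp(-(a - 50) / 4)) else 0 for a in ages]

def least_squares_line(xs, ys):
    """Return (slope, intercept) of the ordinary least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

slope, intercept = least_squares_line(ages, ill)
for age in (22, 50, 80):
    print(f"age {age}: fitted 'probability' = {intercept + slope * age:.2f}")
#+end_src

With data of this kind, the fitted value is negative at the young end
of the range and larger than 1 at the old end, which no probability
can be.
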
The reason is simply that a linear regression implies the hypothesis
$\textsf{Ill} = \alpha.\textsf{Age} + \beta + \epsilon$, where
$\alpha$ and $\beta$ are real numbers and $\epsilon$ is a noise term
(a random variable of mean zero), with $\alpha$ and $\beta$ estimated
from the data. This doesn't make sense for estimating a probability,
and therefore [[https://en.wikipedia.org/wiki/Logistic_regression][logistic regression]] is a better choice:
#+begin_src R :results output graphics :file fig3.svg :exports both :width 4 :height 3 :session *R*
ggplot(df,aes(x=Age,y=Ill)) + geom_point(alpha=.3,size=3) +
theme_bw() +
geom_smooth(method = "glm",
method.args = list(family = "binomial")) + xlim(20,80)
#+end_src
#+ATTR_LATEX: :width 8cm
#+RESULTS:
[[file:fig3.svg]]

Here the =ggplot2= library does all the computations for us and only
shows the result graphically, but in the Challenger risk analysis we
perform the regression and prediction "by hand" in =R= or =Python=
(depending on the path you have chosen), so that we can inspect the
results in more detail. As before, the blue line indicates the
estimate of the probability of being ill as a function of age, and
the grey zone informs us about the uncertainty of this estimate, i.e.
given the hypotheses and the dataset, there is a 95% chance for the
true curve to lie somewhere in the grey zone.

In this model, the assumption is $P[\textsf{Ill}] = \pi(\textsf{Age})$
with $\displaystyle\pi(x)=\frac{e^{\alpha.x + \beta}}{1+e^{\alpha.x +
\beta}}$. This formula may look strange at first, but it has the nice
property of always yielding a value between zero and one, and (for
$\alpha > 0$) of approaching 0 and 1 rapidly as the age tends to
$-\infty$ or $+\infty$, respectively. This is, however, not the only
motivation for this choice.

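
One way to read this formula is to solve it for the linear term. The
short derivation below is a standard identity, given here only as a
reading aid; it uses the same symbols $\pi$, $\alpha$ and $\beta$ as
above.

\[
\pi(x)=\frac{e^{\alpha.x + \beta}}{1+e^{\alpha.x + \beta}}
\quad\Longrightarrow\quad
1-\pi(x)=\frac{1}{1+e^{\alpha.x + \beta}}
\quad\Longrightarrow\quad
\log\frac{\pi(x)}{1-\pi(x)} = \alpha.x + \beta
\]

The model is therefore linear in the logarithm of the odds: $\alpha$
is the change in log-odds per unit of $x$, and $\pi(x)=1/2$ exactly
when $\alpha.x + \beta = 0$, i.e. at $x=-\beta/\alpha$. For the
simulated dataset above, generated with $\alpha=1/4$ and
$\beta=-12.5$, this midpoint is an age of 50.
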
In summary, when we have event-like (binary) data and we wish to
estimate the influence of a parameter on the probability of the event
occurring (illness, failure, ...), the most natural and simple model
is logistic regression. Note that even if we restrict ourselves to a
small part of the data, e.g. only patients less than 50 years old, it
is possible to get a reasonable estimate, even though, as is to be
expected, the uncertainty grows rapidly.
#+begin_src R :results output graphics :file fig4.svg :exports both :width 4 :height 3 :session *R*
ggplot(df[df$Age<50,],aes(x=Age,y=Ill)) + geom_point(alpha=.3,size=3) +
theme_bw() +
geom_smooth(method = "glm",
method.args = list(family = "binomial"),fullrange = TRUE) + xlim(20,80)
#+end_src
#+ATTR_LATEX: :width 8cm
#+RESULTS:
[[file:fig4.svg]]
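
To give an idea of what fitting such a model "by hand" involves, here
is a minimal Python sketch that estimates $\alpha$ and $\beta$ by
maximum likelihood with Newton-Raphson iterations, using the standard
library only. It is not the code of the Challenger analysis, which
relies on the regression functions of R or Python libraries; the
function names =logistic= and =fit_logistic= are ours, and the data
are re-simulated with the same law as the dataset above.

#+begin_src python :results output :exports both
import math
import random

def logistic(z):
    """Numerically stable logistic function 1 / (1 + exp(-z))."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def fit_logistic(xs, ys, n_iter=50, tol=1e-10):
    """Fit P[y=1] = logistic(alpha*x + beta) by Newton-Raphson; return (alpha, beta)."""
    x_mean = sum(xs) / len(xs)
    cs = [x - x_mean for x in xs]            # centred predictor, better conditioned
    a, b = 0.0, 0.0                          # slope and intercept on the centred scale
    for _ in range(n_iter):
        ga = gb = haa = hab = hbb = 0.0
        for c, y in zip(cs, ys):
            p = logistic(a * c + b)
            w = p * (1.0 - p)
            ga += (y - p) * c                # gradient of the log-likelihood
            gb += (y - p)
            haa += w * c * c                 # Fisher information (negative Hessian)
            hab += w * c
            hbb += w
        det = haa * hbb - hab * hab
        da = (hbb * ga - hab * gb) / det     # solve the 2x2 Newton system
        db = (haa * gb - hab * ga) / det
        a += da
        b += db
        if abs(da) + abs(db) < tol:
            break
    return a, b - a * x_mean                 # back to the uncentred (alpha, beta)

# Hypothetical data with the same law as the simulated dataset:
# ages uniform on [22, 80], P[ill] = logistic((age - 50) / 4).
random.seed(42)
ages = [random.uniform(22, 80) for _ in range(400)]
ill = [1 if random.random() < logistic((age - 50) / 4) else 0 for age in ages]

alpha, beta = fit_logistic(ages, ill)
print(f"alpha = {alpha:.3f}, beta = {beta:.2f}, midpoint = {-beta / alpha:.1f}")
#+end_src

The estimates should be close to the values used for the simulation
($\alpha=1/4$, $\beta=-12.5$, midpoint at 50 years), up to sampling
noise. Centring the ages before iterating is a design choice made
here to keep the Newton steps well behaved; it does not change the
fitted model.
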
* Emacs Setup :noexport:
This document has local variables in its postamble, which should
allow Org-mode (9) to work seamlessly without any setup. If you're
uncomfortable using such variables, you can safely ignore them at
startup. Exporting may require that you copy them into your .emacs.

# Local Variables:
# eval: (add-to-list 'org-latex-packages-alist '("" "minted"))
# eval: (setq org-latex-listings 'minted)
# eval: (setq org-latex-minted-options '(("style" "Tango") ("bgcolor" "Moccasin") ("frame" "lines") ("linenos" "true") ("fontsize" "\\small")))
# eval: (setq org-latex-pdf-process '("pdflatex -shell-escape -interaction nonstopmode -output-directory %o %f" "pdflatex -shell-escape -interaction nonstopmode -output-directory %o %f" "pdflatex -shell-escape -interaction nonstopmode -output-directory %o %f"))
# End: