Ex 04-1 : analyse Challenger en Julia

b1342638 · François Févotte · 7ad30490 · b1342638 · b1342638 · b1342638
Commit b1342638 authored Apr 16, 2020 by François Févotte
9 changed files
--- a/module4/Manifest.toml
+++ b/module4/Manifest.toml
--- a/module4/Project.toml
+++ b/module4/Project.toml
+name = "Challenger"
+uuid = "457123a2-77f0-4d56-a035-49d1b3c810fd"
+authors = ["François Févotte"]
+version = "0.1.0"
+
+[deps]
+CSV = "336ed68f-0bac-5ca0-87d4-7b16caf5d00b"
+DataFrames = "a93c6f00-e57d-5684-b7b6-d8193f3e46c0"
+DataInterpolations = "82cc6244-b520-54b8-b5a6-8a565e85f1d0"
+GLM = "38e38edf-8417-5370-95a0-9cbb8c7f171a"
+HTTP = "cd3eb016-35fb-5094-929b-558a96fad6f3"
+PackageCompiler = "9b87118b-4619-50d2-8e1e-99f35a4d4d9d"
+Plots = "91a5bcdd-55d7-5caf-9e0b-520d859cae80"
+Weave = "44d3d7a6-8a23-5bf8-98c5-b353f8df5ec9"
--- a/module4/challenger.html
+++ b/module4/challenger.html
--- a/module4/challenger.jmd
+++ b/module4/challenger.jmd
+---
+title : "Risk Analysis of the Space Shuttle: Pre-Challenger Prediction of Failure"
+options:
+  css: skeleton_css.css
+  template: julia_html.tpl
+---
+
+In this document we reperform some of the analysis provided in *Risk Analysis of
+the Space Shuttle: Pre-Challenger Prediction of Failure* by Siddhartha R. Dalal,
+Edward B. Fowlkes, Bruce Hoadley published in Journal of the American
+Statistical Association, Vol. 84, No. 408 (Dec., 1989), pp. 945-957 and
+available at [http://www.jstor.org/stable/2290069](http://www.jstor.org/stable/2290069).
+
+On the fourth page of this article, they indicate that the maximum likelihood
+estimates of the logistic regression using only temperature are:
+$\hat{\alpha}=5.085$ and $\hat{\beta}=-0.1156$ and their asymptotic standard
+errors are $s_{\hat{\alpha}}=3.052$ and $s_{\hat{\beta}}=0.047$. The Goodness of
+fit indicated for this model was $G^2=18.086$ with 21 degrees of freedom. Our
+goal is to reproduce the computation behind these values and the Figure 4 of
+this article, possibly in a nicer looking way.
+
+
+# Technical information on the computer on which the analysis is run
+
+We will be using the [Julia](http://julialang.org) language:
+```julia
+using InteractiveUtils
+versioninfo()
+```
+
+The computations rely on a number of packages in the Julia ecosystem. The direct
+dependencies are summarized hereafter; the complete environment is described in
+the [`Manifest.toml`](Manifest.toml) file.
+
+```julia
+# Setup environment
+using Pkg
+Pkg.activate(@__DIR__)
+Pkg.instantiate()
+
+# Load dependencies
+using HTTP, CSV
+using Plots; plotly()
+using GLM
+using DataFrames
+using Printf
+include("utils.jl")
+
+# Summary
+Pkg.status()
+```
+
+
+# Loading and inspecting data
+
+Let's start by reading data.
+
+```julia
+res = HTTP.request(:GET, "https://app-learninglab.inria.fr/moocrr/gitlab/moocrr-session3/moocrr-reproducibility-study/raw/master/data/shuttle.csv?inline=false")
+data = CSV.read(res.body)
+```
+
+We know from our previous experience on this data set that filtering data is a
+really bad idea. We will therefore process it as such.
+
+```julia; results="raw"
+data.Frequency = data.Malfunction ./ data.Count
+
+plot(xlabel="Temperature [F]", ylabel="Frequency")
+plot!(data.Temperature, data.Frequency, seriestype=:scatter, label=nothing)
+disp()
+```
+
+
+# Logistic regression
+
+Let's assume O-rings independently fail with the same probability which solely
+depends on temperature. A logistic regression should allow us to estimate the
+influence of temperature.
+
+```julia; wrap=false; hold=true
+model = glm(@formula(Frequency ~ Temperature), data,
+            Binomial(), LogitLink())
+
+α, β   = coef(model)
+σα, σβ = stderror(model)
+
+G²     = deviance(model)
+nDOF   = Int(dof_residual(model))
+
+model
+```
+
+The maximum likelyhood estimator of the intercept and of Temperature are thus
+$\hat{\alpha}=`j @printf "%.3f" α`$ and
+$\hat{\beta}=`j @printf "%.3f" β`$.
+
+This corresponds to the values from the article of Dalal et al. The standard
+errors are
+$s_{\hat{\alpha}} = `j @printf "%.3f" σα`$ and
+$s_{\hat{\beta}} = `j @printf "%.3f" σβ`$,
+which is different from the $3.052$ and $0.047$ reported by Dallal et al. The
+deviance is
+$G^2 = `j @printf "%.3f" G²`$ with `j nDOF` degrees of freedom.
+
+I cannot find any value similar to the Goodness of fit ($G^2=18.086$) reported
+by Dalal et al. However, the number of degrees of freedom is similar to theirs
+(21).
+
+There seems to be something wrong. Oh I know, I haven't indicated that my
+observations are actually the result of 6 observations for each rocket
+launch. The correct way to do this would be to weight the data using the `Count`
+column. Since I don't know how to do that with the
+[GLM](https://github.com/JuliaStats/GLM.jl) package I'm using, I will simply
+duplicate the data:
+
+```julia; wrap=false; hold=true
+weighted_data = DataFrame(Temperature=Int[], Frequency=Float64[])
+for row in eachrow(data)
+    for _ in 1:row.Count
+        push!(weighted_data, (Temperature=row.Temperature,
+                              Frequency=row.Frequency))
+    end
+end
+
+model = glm(@formula(Frequency ~ Temperature), weighted_data,
+            Binomial(), LogitLink())
+
+α, β   = coef(model)
+σα, σβ = stderror(model)
+
+G²     = deviance(model)
+nDOF   = Int(dof_residual(model))
+
+model
+```
+
+Good, now I have recovered the asymptotic standard errors
+$s_{\hat{\alpha}} = `j @printf "%.3f" σα`$ and
+$s_{\hat{\beta}} = `j @printf "%.3f" σβ`$,
+
+The Goodness of fit (Deviance) indicated for this model is
+$G^2 = `j @printf "%.3f" G²`$ with `j nDOF` degrees of freedom. Now $G^2$ is in
+good accordance to the results of the Dalal *et al.* article, but the number of
+degrees of freedom is 6 times larger than i should, due to my tampering of the
+data to duplicate them instead of weighting them.
+
+# Predicting failure probability
+
+The temperature when launching the shuttle was 31°F. Let's try to estimate the
+failure probability for such temperature using our model:
+
+```julia; results="raw"
+prediction = DataFrame(Temperature=30:0.25:90)
+prediction.Frequency = predict(model, prediction)
+
+plot(xlabel="Temperature [F]", ylabel="Frequency")
+plot!(data.Temperature, data.Frequency, seriestype=:scatter, label="data")
+plot!(prediction.Temperature, prediction.Frequency, label="prediction")
+disp()
+```
+
+This figure is very similar to the Figure 4 of Dalal *et al*.
+
+<!-- Local Variables: -->
+<!-- mode: markdown -->
+<!-- ispell-local-dictionary: "french" -->
+<!-- End: -->
--- a/module4/challenger.pdf
+++ b/module4/challenger.pdf
--- a/module4/julia_html.tpl
+++ b/module4/julia_html.tpl
+<!DOCTYPE html>
+<HTML lang = "en">
+<HEAD>
+  <meta charset="UTF-8"/>
+  <meta name="viewport" content="width=device-width, initial-scale=1.0, user-scalable=yes">
+  {{#:title}}<title>{{:title}}</title>{{/:title}}
+  {{{ :header_script }}}
+
+  <script src="plotly-latest.min.js"></script>
+
+  <script type="text/x-mathjax-config">
+    MathJax.Hub.Config({
+      tex2jax: {inlineMath: [['$','$'], ['\\(','\\)']]},
+      TeX: { equationNumbers: { autoNumber: "AMS" } }
+    });
+  </script>
+
+  <script type="text/javascript" async src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/MathJax.js?config=TeX-AMS-MML_HTMLorMML">
+  </script>
+
+  {{{ :highlightcss }}}
+
+  <style type="text/css">
+  {{{ :themecss }}}
+  </style>
+
+
+
+</HEAD>
+  <BODY>
+    <div class ="container">
+      <div class = "row">
+        <div class = "col-md-12 twelve columns">
+
+          <div class="title">
+            {{#:title}}<h1 class="title">{{:title}}</h1>{{/:title}}
+            {{#:author}}<h5>{{{:author}}}</h5>{{/:author}}
+            {{#:date}}<h5>{{{:date}}}</h5>{{/:date}}
+          </div>
+
+          {{{ :body }}}
+
+
+          <HR/>
+          <div class="footer"><p>
+          Published from <a href="{{{:source}}}">{{{:source}}}</a> using
+          <a href="http://github.com/mpastell/Weave.jl">Weave.jl</a>
+          {{:wversion}} on {{:wtime}}.
+          <p></div>
+
+
+        </div>
+      </div>
+    </div>
+  </BODY>
+</HTML>
--- a/module4/plotly-latest.min.js
+++ b/module4/plotly-latest.min.js
--- a/module4/skeleton_css.css
+++ b/module4/skeleton_css.css
--- a/module4/utils.jl
+++ b/module4/utils.jl
+# Affichage des DataFrames dans la sortie HTML
+using DataFrames
+using Printf
+function info(df::DataFrame)
+    try
+        WEAVE_ARGS
+    catch
+        display(df)
+        return
+    end
+
+    pretty(x) = x
+    pretty(x::Float64) = @sprintf "%.6f" x
+    pretty(::Type{Union{T,Missing}}) where {T} = "$T?"
+
+    function printrow(i)
+        print("<tr>")
+        print("<th>$i</th>")
+        for j in 1:ncol(df)
+            print("<td>$(pretty(df[i,j]))</td>")
+        end
+        print("</tr>")
+    end
+
+    function printrows()
+        i = 1
+        while i<=nrow(df) && i<=3
+            printrow(i)
+            i += 1
+        end
+
+        i>nrow(df) && return
+
+        if i < nrow(df)-2
+            print("<tr><td>...</td></tr>")
+        end
+
+        i = max(i, nrow(df)-2)
+        while i<=nrow(df)
+            printrow(i)
+            i += 1
+        end
+    end
+
+    println("$(nrow(df)) rows × $(ncol(df)) columns")
+    print("<table>")
+    print("<thead>")
+    print("<tr><th></th>")
+    for col in names(df)
+        print("<th>$col</th>")
+    end
+    print("</tr>")
+    print("<tr><th></th>")
+    for col in names(df)
+        print("<th>$(pretty(eltype(df[!,col])))</th>")
+    end
+    print("</tr>")
+    print("</thead>")
+    print("<tbody>")
+    printrows()
+    print("</tbody>")
+    print("</table>\n")
+end
+
+
+
+# Meilleure intégration entre Plotly et Weave
+function disp()
+    try
+        WEAVE_ARGS
+    catch
+        display(plot!())
+        return
+    end
+
+    buf = IOBuffer()
+    show(buf, MIME("text/html"), plot!())
+    seekstart(buf)
+    str = String(read(buf))
+    m = match(r"<body>(.*)</body>"s, str)
+
+    return if m === nothing
+        str
+    else
+        m.captures[1]
+    end |> print
+end