Commit 8a5d424d authored by Helene31's avatar Helene31
parents 4afaffe3 527f6951
......@@ -49,4 +49,8 @@ Non functional (expected values are $`5.085`$ and $`-0.1156`$)
| -------- | ---------------- | ------------------------------------------------------------- | ------- | ---------------------------- | ------------- | ---------- | ---------- | --------- | ---------- | ----------------------------------------------------------------- | ----------- |
| R | 3.5.1 | ggplot2 3.0.0 | RStudio | Debian GNU/Linux buster/sid | Identical | Identical | Identical | Identical | Identical | [Rmd](src/R/challenger.Rmd), [pdf](src/R/challenger_debian_alegrand.pdf) | A. Legrand |
| Python | 3.6.4 | statsmodels 0.9.0 numpy 1.13.3 pandas 0.22.0 matplotlib 2.2.2 | Jupyter | Linux Ubuntu 4.4.0-116-generic | Identical | Identical | Identical | Identical | Similar | [ipynb](src/Python3/challenger.ipynb), [pdf](src/Python3/challenger_ubuntuMOOC_alegrand.pdf) | A. Legrand |
| R | 3.5.1 | ggplot2 3.0.0 | RStudio | Windows >= 8 x64 (build 9200) | Identical | Identical | Identical | Identical | Similar | [Rmd](https://app-learninglab.inria.fr/moocrr/gitlab/8517fa92e97b3a318e653caefbfde6b5/mooc-rr/blob/master/module4/MOOC_exercice_module4.Rmd), [Pdf](https://app-learninglab.inria.fr/moocrr/gitlab/8517fa92e97b3a318e653caefbfde6b5/mooc-rr/blob/master/module4/MOOC_exercice_module4.pdf) | M. Saubin |
| Python | 3.6.4 | statsmodels 0.9.0 numpy 1.15.2 pandas 0.22.0 matplotlib 2.2.3 | Jupyter | Linux Ubuntu 4.4.0-164-generic | Identical | Identical | Identical | Identical | Similar | [ipynb](module4/challenger_Python_ipynb.ipynb), [pdf](module4/challenger_Python_ipynb.pdf) | 2992438755465b7fe3afd7856bde0599 |
| R | 3.4.4 | ggplot2_3.3.0 | RStudio | Linux Mint 19 | Identical | Identical | Identical | Identical | Identical | [Rmd](https://app-learninglab.inria.fr/moocrr/gitlab/b2c48a7ab4afbff5f4d26650b09eb6b4/mooc-rr/blob/master/module4/challenger_reexecuted.Rmd), [html](https://app-learninglab.inria.fr/moocrr/gitlab/b2c48a7ab4afbff5f4d26650b09eb6b4/mooc-rr/blob/master/module4/challenger_reexecuted.html) | b2c48a7ab4afbff5f4d26650b09eb6b4 |
| Python | 3.6.4 | statsmodels 0.9.0 numpy 1.15.2 pandas 0.22.0 matplotlib 2.2.3 | Jupyter | Linux Ubuntu 4.4.0-164-generic | Identical | Identical | Identical | Identical | Similar | [ipynb](https://app-learninglab.inria.fr/moocrr/gitlab/34ea1ee296fc8711adf020d9cc2cb571/mooc-rr/blob/master/module4/challenger.ipynb), [pdf](https://app-learninglab.inria.fr/moocrr/gitlab/34ea1ee296fc8711adf020d9cc2cb571/mooc-rr/blob/master/module4/challenger.pdf) | 34ea1ee296fc8711adf020d9cc2cb571 |
| Matlab | 9.6.0.1072779 (R2019a) | | Matlab Live Script | Windows 10.0.18362 | Identical | Identical | Non Functional | Similar | Did not succeed | [mlx](https://app-learninglab.inria.fr/moocrr/gitlab/34ea1ee296fc8711adf020d9cc2cb571/mooc-rr/blob/master/module4/challenger.mlx), [pdf](https://app-learninglab.inria.fr/moocrr/gitlab/34ea1ee296fc8711adf020d9cc2cb571/mooc-rr/blob/master/module4/challenger_matlab.pdf) | 34ea1ee296fc8711adf020d9cc2cb571 |
\ No newline at end of file
......@@ -434,7 +434,7 @@
}
],
"source": [
"data = pd.read_csv(\"https://app-learninglab.inria.fr/gitlab/moocrr-session1/moocrr-reproducibility-study/raw/master/data/shuttle.csv\")\n",
"data = pd.read_csv(\"https://app-learninglab.inria.fr/moocrr/gitlab/moocrr-session3/moocrr-reproducibility-study/blob/master/data/shuttle.csv\")\n",
"data"
]
},
......@@ -751,7 +751,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"**I think I have managed to correctly compute and plot the uncertainty of my prediction.** Although the shaded area seems very similar to [the one obtained with R](https://app-learninglab.inria.fr/gitlab/moocrr-session1/moocrr-reproducibility-study/raw/5c9dbef11b4d7638b7ddf2ea71026e7bf00fcfb0/challenger.pdf), I can spot a few differences (e.g., the blue point for temperature 63 is outside)... Could this be a numerical error? Or a difference in the statistical method? It is not clear which one is \"right\"."
"**I think I have managed to correctly compute and plot the uncertainty of my prediction.** Although the shaded area seems very similar to [the one obtained with R](https://app-learninglab.inria.fr/moocrr/gitlab/moocrr-session3/moocrr-reproducibility-study/tree/master/challenger.pdf), I can spot a few differences (e.g., the blue point for temperature 63 is outside)... Could this be a numerical error? Or a difference in the statistical method? It is not clear which one is \"right\"."
]
}
],
......
......@@ -424,7 +424,7 @@
}
],
"source": [
"data = pd.read_csv(\"https://app-learninglab.inria.fr/gitlab/moocrr-session1/moocrr-reproducibility-study/raw/master/data/shuttle.csv\")\n",
"data = pd.read_csv(\"https://app-learninglab.inria.fr/moocrr/gitlab/moocrr-session3/moocrr-reproducibility-study/blob/master/data/data/shuttle.csv\")\n",
"data"
]
},
......@@ -833,7 +833,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"**I think I have managed to correctly compute and plot the uncertainty of my prediction.** Although the shaded area seems very similar to [the one obtained with R](https://app-learninglab.inria.fr/gitlab/moocrr-session1/moocrr-reproducibility-study/raw/5c9dbef11b4d7638b7ddf2ea71026e7bf00fcfb0/challenger.pdf), I can spot a few differences (e.g., the blue point for temperature 63 is outside)... Could this be a numerical error? Or a difference in the statistical method? It is not clear which one is \"right\"."
"**I think I have managed to correctly compute and plot the uncertainty of my prediction.** Although the shaded area seems very similar to [the one obtained with R](https://app-learninglab.inria.fr/moocrr/gitlab/moocrr-session3/moocrr-reproducibility-study/tree/master/challenger.pdf), I can spot a few differences (e.g., the blue point for temperature 63 is outside)... Could this be a numerical error? Or a difference in the statistical method? It is not clear which one is \"right\"."
]
}
],
......
......@@ -5,12 +5,12 @@
* Risk Analysis of the Space Shuttle: Pre-Challenger Prediction of Failure
In this document we reperform some of the analysis provided in
/Risk Analysis of the Space Shuttle: Pre-Challenger Prediction of
Failure/ by /Siddhartha R. Dalal, Edward B. Fowlkes, Bruce Hoadley/
published in /Journal of the American Statistical Association/, Vol. 84,
No. 408 (Dec., 1989), pp. 945-957 and available at
http://www.jstor.org/stable/2290069.
On the fourth page of this article, they indicate that the maximum
likelihood estimates of the logistic regression using only temperature
......@@ -30,7 +30,7 @@ and numpy library.
def print_imported_modules():
import sys
for name, val in sorted(sys.modules.items()):
if(hasattr(val, '__version__')):
print(val.__name__, val.__version__)
# else:
# print(val.__name__, "(unknown version)")
......@@ -55,7 +55,7 @@ print_imported_modules()
Let's start by reading data.
#+begin_src python :results output :session :exports both
data = pd.read_csv("https://app-learninglab.inria.fr/gitlab/moocrr-session1/moocrr-reproducibility-study/raw/master/data/shuttle.csv")
data = pd.read_csv("https://app-learninglab.inria.fr/moocrr/gitlab/moocrr-session3/moocrr-reproducibility-study/tree/master/data/shuttle.csv")
print(data)
#+end_src
......@@ -87,7 +87,7 @@ import statsmodels.api as sm
data["Success"]=data.Count-data.Malfunction
data["Intercept"]=1
logmodel=sm.GLM(data['Frequency'], data[['Intercept','Temperature']],
family=sm.families.Binomial(sm.families.links.logit)).fit()
print(logmodel.summary())
......@@ -95,7 +95,7 @@ print(logmodel.summary())
The maximum likelihood estimator of the intercept and of Temperature
are thus *$\hat{\alpha}$ = 5.0850* and *$\hat{\beta}$ = -0.1156*. This *corresponds*
to the values from the article of Dalal /et al./ The standard errors are
/$s_{\hat{\alpha}}$ = 7.477/ and /$s_{\hat{\beta}}$ = 0.115/, which is *different* from
the *3.052* and *0.04702* reported by Dalal /et al./ The deviance is
/3.01444/ with *21* degrees of freedom. I cannot find any value similar
......@@ -107,7 +107,7 @@ same throughout all experiments, it does not change the estimates of
the fit but it does influence the variance estimates).
#+begin_src python :results output :session :exports both
logmodel=sm.GLM(data['Frequency'], data[['Intercept','Temperature']],
family=sm.families.Binomial(sm.families.links.logit),
var_weights=data['Count']).fit()
......@@ -128,7 +128,7 @@ The temperature when launching the shuttle was 31°F. Let's try to
estimate the failure probability for such temperature using our model:
#+begin_src python :results output :session :exports both
data_pred = pd.DataFrame({'Temperature': np.linspace(start=30, stop=90, num=121),
'Intercept': 1})
data_pred['Frequency'] = logmodel.predict(data_pred)
print(data_pred.head())
......@@ -157,7 +157,7 @@ and plot the curve:
def logit_inv(x):
return(np.exp(x)/(np.exp(x)+1))
data_pred['Prob']=logit_inv(data_pred['Temperature'] * logmodel.params['Temperature'] +
logmodel.params['Intercept'])
print(data_pred.head())
#+end_src
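As a quick sanity check, the headline prediction can be recovered from the published coefficients alone. This is a minimal standalone sketch, using only $\hat{\alpha}=5.085$ and $\hat{\beta}=-0.1156$ reported by Dalal /et al./, independent of the fitted model above:

```python
import numpy as np

def logit_inv(x):
    # inverse logit: map log-odds back to a probability in (0, 1)
    return np.exp(x) / (np.exp(x) + 1)

alpha, beta = 5.085, -0.1156  # point estimates reported by Dalal et al.
p_30 = logit_inv(alpha + beta * 30)
print(round(p_30, 3))  # failure probability at 30°F, about 0.834
```

Evaluating at 31°F, the temperature of the actual launch, gives roughly 0.82.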
......@@ -195,7 +195,7 @@ matplot_lib_filename
**I think I have managed to correctly compute and plot the uncertainty
of my prediction.** Although the shaded area seems very similar to
[the one obtained with
R](https://app-learninglab.inria.fr/gitlab/moocrr-session1/moocrr-reproducibility-study/raw/5c9dbef11b4d7638b7ddf2ea71026e7bf00fcfb0/challenger.pdf),
R](https://app-learninglab.inria.fr/moocrr/gitlab/moocrr-session3/moocrr-reproducibility-study/tree/master/challenger.pdf),
I can spot a few differences (e.g., the blue point for temperature
63 is outside)... Could this be a numerical error? Or a difference
in the statistical method? It is not clear which one is "right".
......@@ -5,8 +5,8 @@ date: "25 October 2018"
output: pdf_document
---
In this document we reperform some of the analysis provided in
*Risk Analysis of the Space Shuttle: Pre-Challenger Prediction of Failure* by *Siddhartha R. Dalal, Edward B. Fowlkes, Bruce Hoadley* published in *Journal of the American Statistical Association*, Vol. 84, No. 408 (Dec., 1989), pp. 945-957 and available at http://www.jstor.org/stable/2290069.
On the fourth page of this article, they indicate that the maximum likelihood estimates of the logistic regression using only temperature are: $\hat{\alpha}=5.085$ and $\hat{\beta}=-0.1156$ and their asymptotic standard errors are $s_{\hat{\alpha}}=3.052$ and $s_{\hat{\beta}}=0.047$. The Goodness of fit indicated for this model was $G^2=18.086$ with 21 degrees of freedom. Our goal is to reproduce the computation behind these values and the Figure 4 of this article, possibly in a nicer looking way.
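For reference, the model behind these estimates is a binomial logistic regression of the malfunction probability on temperature $t$, with the logit link:

$$\mathrm{logit}\,\hat{p}(t) = \hat{\alpha} + \hat{\beta}\,t, \qquad \hat{p}(t) = \frac{e^{\hat{\alpha}+\hat{\beta}t}}{1+e^{\hat{\alpha}+\hat{\beta}t}}.$$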
......@@ -26,7 +26,7 @@ devtools::session_info()
# Loading and inspecting data
Let's start by reading data:
```{r}
data = read.csv("https://app-learninglab.inria.fr/gitlab/moocrr-session1/moocrr-reproducibility-study/raw/master/data/shuttle.csv",header=T)
data = read.csv("https://app-learninglab.inria.fr/moocrr/gitlab/moocrr-session3/moocrr-reproducibility-study/tree/master/data/shuttle.csv",header=T)
data
```
......@@ -42,7 +42,7 @@ plot(data=data, Malfunction/Count ~ Temperature, ylim=c(0,1))
Let's assume O-rings independently fail with the same probability which solely depends on temperature. A logistic regression should allow us to estimate the influence of temperature.
```{r}
logistic_reg = glm(data=data, Malfunction/Count ~ Temperature, weights=Count,
family=binomial(link='logit'))
summary(logistic_reg)
```
......@@ -50,10 +50,10 @@ summary(logistic_reg)
The maximum likelihood estimator of the intercept and of Temperature are thus $\hat{\alpha}=5.0849$ and $\hat{\beta}=-0.1156$ and their standard errors are $s_{\hat{\alpha}} = 3.052$ and $s_{\hat{\beta}} = 0.04702$. The Residual deviance corresponds to the Goodness of fit $G^2=18.086$ with 21 degrees of freedom. **I have therefore managed to replicate the results of the Dalal *et al.* article**.
# Predicting failure probability
The temperature when launching the shuttle was 31°F. Let's try to
estimate the failure probability for such temperature using our model:
```{r}
# shuttle=shuttle[shuttle$r!=0,]
tempv = seq(from=30, to=90, by = .5)
rmv <- predict(logistic_reg,list(Temperature=tempv),type="response")
plot(tempv,rmv,type="l",ylim=c(0,1))
......@@ -65,7 +65,7 @@ This figure is very similar to the Figure 4 of Dalal et al. **I have managed to
# Confidence on the prediction
Let's try to plot confidence intervals with ggplot2.
```{r, fig.height=3.3}
ggplot(data, aes(y=Malfunction/Count, x=Temperature)) + geom_point(alpha=.2, size = 2, color="blue") +
geom_smooth(method = "glm", method.args = list(family = "binomial"), fullrange=T) +
xlim(30,90) + ylim(0,1) + theme_bw()
```
......@@ -96,10 +96,10 @@ summary(logistic_reg)
Perfect. The estimates and the standard errors are the same although the Residual deviance is different since the distance is now measured with respect to each 0/1 measurement and not to ratios. Let's plot the regression for *data_flat* along with the ratios (*data*).
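The construction of *data_flat* is not shown in this excerpt; here is a minimal Python sketch of the idea (column names assumed to match the R data frame), expanding each flight's counts into one 0/1 row per O-ring:

```python
import pandas as pd

# Toy rows in the shape of the shuttle data: 6 O-rings per flight.
data = pd.DataFrame({"Temperature": [66, 70], "Count": [6, 6], "Malfunction": [0, 1]})

rows = []
for _, r in data.iterrows():
    # one row per O-ring: 1 for each malfunction, 0 for the rest
    rows += [{"Temperature": r["Temperature"], "Malfunction": 1}] * int(r["Malfunction"])
    rows += [{"Temperature": r["Temperature"], "Malfunction": 0}] * int(r["Count"] - r["Malfunction"])

data_flat = pd.DataFrame(rows)
print(len(data_flat))                  # 12 rows: 6 + 6 O-rings
print(data_flat["Malfunction"].sum())  # total of 1 malfunction preserved
```

Fitting a Bernoulli logistic regression on such rows gives the same coefficient estimates as the weighted ratio fit, which is the point being made here.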
```{r, fig.height=3.3}
ggplot(data=data_flat, aes(y=Malfunction, x=Temperature)) +
geom_smooth(method = "glm", method.args = list(family = "binomial"), fullrange=T) +
geom_point(data=data, aes(y=Malfunction/Count, x=Temperature),alpha=.2, size = 2, color="blue") +
geom_point(alpha=.5, size = .5) +
xlim(30,90) + ylim(0,1) + theme_bw()
```
......@@ -121,7 +121,7 @@ logistic_reg$family$linkinv(pred_link$fit)
I recover $0.834$ for the estimated Failure probability at 30°. But now, going through the *linkinv* function, we can use $se.fit$:
```{r}
critval = 1.96
logistic_reg$family$linkinv(c(pred_link$fit-critval*pred_link$se.fit,
pred_link$fit+critval*pred_link$se.fit))
```
The 95% confidence interval for our estimation is thus [0.163,0.992]. This is what ggplot2 just plotted. This seems coherent.
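This interval can also be checked by hand. A Python sketch of the same computation on the link (log-odds) scale, where the standard error of about 1.66 is back-computed from the interval reported here rather than taken from R's `se.fit` (an assumption, not a value from the source):

```python
import numpy as np

def logit(p):
    # log-odds of a probability
    return np.log(p / (1 - p))

def logit_inv(x):
    # inverse logit: back to the probability scale
    return np.exp(x) / (np.exp(x) + 1)

critval = 1.96      # 95% normal quantile
fit = logit(0.834)  # point estimate on the link scale
se = 1.66           # assumed link-scale standard error (back-computed)
lo, hi = logit_inv(fit - critval * se), logit_inv(fit + critval * se)
print(round(lo, 3), round(hi, 3))  # about 0.163 and 0.992
```

Building the interval on the link scale and transforming at the end keeps both bounds inside (0, 1), which a naive interval on the probability scale would not.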
......