diff --git a/module3/exo1/Untitled.ipynb b/module3/exo1/Untitled.ipynb new file mode 100644 index 0000000000000000000000000000000000000000..7fec51502cbc3200b3d0ffc6bbba1fe85e197f3d --- /dev/null +++ b/module3/exo1/Untitled.ipynb @@ -0,0 +1,6 @@ +{ + "cells": [], + "metadata": {}, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/module3/exo2/.ipynb b/module3/exo2/.ipynb new file mode 100644 index 0000000000000000000000000000000000000000..14d057d745584329b45180a9684827be14aa39db --- /dev/null +++ b/module3/exo2/.ipynb @@ -0,0 +1,393 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Incidence of influenza-like illness" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "%matplotlib inline\n", + "import matplotlib.pyplot as plt\n", + "import pandas as pd\n", + "import isoweek" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The incidence data for influenza-like illness are available from the website of the [Réseau Sentinelles](http://www.sentiweb.fr/). We retrieve them as a CSV file in which each row corresponds to one week of the requested period. We always download the complete dataset, which starts in 1984 and ends with a recent week." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "data_url = \"http://www.sentiweb.fr/datasets/incidence-PAY-3.csv\"" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Here is the explanation of the columns given [on the source site](https://ns.sentiweb.fr/incidence/csv-schema-v1.json):\n", + "\n", + "| Column name | Column description |\n", + "|----------------|--------------------------------------------------------------------------------------------|\n", + "| week | ISO 8601 calendar week |\n", + "| indicator | Code of the surveillance indicator |\n", + "| inc | Estimated incidence of consultations, in number of cases |\n", + "| inc_low | Lower bound of the 95% CI of the estimated number of consultation cases |\n", + "| inc_up | Upper bound of the 95% CI of the estimated number of consultation cases |\n", + "| inc100 | Estimated incidence rate of consultation cases (cases per 100,000 inhabitants) |\n", + "| inc100_low | Lower bound of the 95% CI of the estimated incidence rate (cases per 100,000 inhabitants) |\n", + "| inc100_up | Upper bound of the 95% CI of the estimated incidence rate (cases per 100,000 inhabitants) |\n", + "| geo_insee | Code of the geographic area concerned (INSEE code) http://www.insee.fr/fr/methodes/nomenclatures/cog/ |\n", + "| geo_name | Label of the geographic area (this label may change without notice) |\n", + "\n", + "The first line of the CSV file is a comment, which we skip by passing `skiprows=1`." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We check whether a local copy already exists. To do so, we choose a file name, data_file, and a directory, folder_path. 
If the file does not exist, we create it from data_url, saving the data locally. We then work with this local copy." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import os.path\n", + "data_file = \"incidence-PAY-3.csv\"\n", + "folder_path = \"myLocalisation/\"\n", + "local_path = os.path.join(folder_path, data_file)  # folder_path must already exist\n", + "if not os.path.exists(local_path):\n", + "    pd.read_csv(data_url, skiprows=1).to_csv(local_path, index=False)  # skip the comment line once" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "raw_data = pd.read_csv(local_path)  # the comment line was already skipped when saving\n", + "raw_data" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Are there missing points in this dataset? Yes, week 19 of 1989 has no associated values." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "raw_data[raw_data.isnull().any(axis=1)]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We drop this point, which has little impact on our fairly simple analysis." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "data = raw_data.dropna().copy()\n", + "data" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Our data use an unusual convention: the week number is\n", + "appended to the year, which makes the value look like\n", + "an integer. That is how pandas interprets it.\n", + " \n", + "A second problem is that pandas does not understand week\n", + "numbers. We have to give it the start and end dates of each\n", + "week. We use the `isoweek` library for this.\n", + "\n", + "Since the week conversion has become fairly complex, we\n", + "write a small Python function for it. 
We then\n", + "apply it to every point in our data. The results go\n", + "into a new column 'period'." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def convert_week(year_and_week_int):\n", + "    year_and_week_str = str(year_and_week_int)\n", + "    year = int(year_and_week_str[:4])\n", + "    week = int(year_and_week_str[4:])\n", + "    w = isoweek.Week(year, week)\n", + "    return pd.Period(w.day(0), 'W')\n", + "\n", + "data['period'] = [convert_week(yw) for yw in data['week']]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Two small changes remain.\n", + "\n", + "First, we set the observation periods as the new index\n", + "of our dataset. This turns it into a time series,\n", + "which will be convenient later on.\n", + "\n", + "Second, we sort the points by period, in\n", + "chronological order." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "sorted_data = data.set_index('period').sort_index()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We check the consistency of the data. Between the end of one period and\n", + "the start of the next, the time difference should be\n", + "zero, or at least very small. We allow a \"margin of error\"\n", + "of one second.\n", + "\n", + "This holds for all but two consecutive periods,\n", + "between which one week is missing.\n", + "\n", + "We recognize these dates: this is the week without observations\n", + "that we removed!" 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "periods = sorted_data.index\n", + "for p1, p2 in zip(periods[:-1], periods[1:]):\n", + "    delta = p2.to_timestamp() - p1.end_time\n", + "    if delta > pd.Timedelta('1s'):\n", + "        print(p1, p2)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "A first look at the data!" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "sorted_data['inc'].plot()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Zooming in on the last few years shows the winter peaks more clearly. The incidence troughs occur in summer." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "sorted_data['inc'][-200:].plot()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Study of the annual incidence" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Since the peak of the epidemic occurs in winter, straddling\n", + "two calendar years, we define the reference period\n", + "between two incidence minima, from August 1 of year $N$ to\n", + "August 1 of year $N+1$.\n", + "\n", + "Our task is slightly complicated by the fact that a year does not contain\n", + "a whole number of weeks. We therefore adjust our reference periods\n", + "a little: instead of August 1 of each year, we use the\n", + "first day of the week that contains August 1.\n", + "\n", + "Since the incidence of influenza-like illness is very low in summer, this\n", + "change is unlikely to distort our conclusions.\n", + "\n", + "One more detail: the data start in October 1984, which\n", + "makes the first year incomplete. We therefore start the analysis in 1985." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "first_august_week = [pd.Period(pd.Timestamp(y, 8, 1), 'W')\n", + "                     for y in range(1985,\n", + "                                    sorted_data.index[-1].year)]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Starting from this list of weeks that contain an August 1, we obtain our intervals of roughly one year as the periods between two adjacent weeks in the list. We compute the sum of the weekly incidences over each of these periods.\n", + "\n", + "We also check that each period contains between 51 and 53 weeks, to guard against possible errors in our code." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "year = []\n", + "yearly_incidence = []\n", + "for week1, week2 in zip(first_august_week[:-1],\n", + "                        first_august_week[1:]):\n", + "    one_year = sorted_data['inc'][week1:week2-1]\n", + "    assert abs(len(one_year)-52) < 2\n", + "    yearly_incidence.append(one_year.sum())\n", + "    year.append(week2.year)\n", + "yearly_incidence = pd.Series(data=yearly_incidence, index=year)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Here are the annual incidences." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "yearly_incidence.plot(style='*')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "A sorted list makes it easier to spot the highest values (at the end)." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "yearly_incidence.sort_values()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Finally, a histogram shows clearly that strong epidemics, which affect about 10% of the French\n", + "population, are fairly rare: there have been three in the last 35 years." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "yearly_incidence.hist(xrot=20)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.6.4" + } + }, + "nbformat": 4, + "nbformat_minor": 1 +} diff --git a/module3/exo2/analyse-syndrome-var.ipynb b/module3/exo2/analyse-syndrome-var.ipynb new file mode 100644 index 0000000000000000000000000000000000000000..be611faf77d2bc491de08147f09e26911100e792 --- /dev/null +++ b/module3/exo2/analyse-syndrome-var.ipynb @@ -0,0 +1,2395 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Incidence of chickenpox" + ] + }, + { + "cell_type": "code", + "execution_count": 22, + "metadata": {}, + "outputs": [], + "source": [ + "%matplotlib inline\n", + "import matplotlib.pyplot as plt\n", + "import pandas as pd\n", + "import isoweek" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The incidence data for chickenpox are available from the website of the [Réseau Sentinelles](http://www.sentiweb.fr/). 
We retrieve them as a CSV file in which each row corresponds to one week of the requested period. We always download the complete dataset, which starts in week 49 of 1990 and ends with a recent week." + ] + }, + { + "cell_type": "code", + "execution_count": 23, + "metadata": {}, + "outputs": [], + "source": [ + "data_url = \"http://www.sentiweb.fr/datasets/incidence-PAY-7.csv\"" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Here is the explanation of the columns given [on the source site](https://ns.sentiweb.fr/incidence/csv-schema-v1.json):\n", + "\n", + "| Column name | Column description |\n", + "|----------------|--------------------------------------------------------------------------------------------|\n", + "| week | ISO 8601 calendar week |\n", + "| indicator | Code of the surveillance indicator |\n", + "| inc | Estimated incidence of consultations, in number of cases |\n", + "| inc_low | Lower bound of the 95% CI of the estimated number of consultation cases |\n", + "| inc_up | Upper bound of the 95% CI of the estimated number of consultation cases |\n", + "| inc100 | Estimated incidence rate of consultation cases (cases per 100,000 inhabitants) |\n", + "| inc100_low | Lower bound of the 95% CI of the estimated incidence rate (cases per 100,000 inhabitants) |\n", + "| inc100_up | Upper bound of the 95% CI of the estimated incidence rate (cases per 100,000 inhabitants) |\n", + "| geo_insee | Code of the geographic area concerned (INSEE code) http://www.insee.fr/fr/methodes/nomenclatures/cog/ |\n", + "| geo_name | Label of the geographic area (this label may change without notice) |\n", + "\n", + "The first line of the CSV file is a comment, which we skip by passing `skiprows=1`." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 24, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
(HTML rendering of raw_data: 1537 rows × 10 columns, with columns week, indicator, inc, inc_low, inc_up, inc100, inc100_low, inc100_up, geo_insee, geo_name)
" + ], + "text/plain": [ + " week indicator inc inc_low inc_up inc100 inc100_low \\\n", + "0 202020 7 848 25 1671 1 0 \n", + "1 202019 7 310 0 753 0 0 \n", + "2 202018 7 849 98 1600 1 0 \n", + "3 202017 7 272 0 658 0 0 \n", + "4 202016 7 758 78 1438 1 0 \n", + "5 202015 7 1918 675 3161 3 1 \n", + "6 202014 7 3879 2227 5531 6 3 \n", + "7 202013 7 7326 5236 9416 11 8 \n", + "8 202012 7 8123 5790 10456 12 8 \n", + "9 202011 7 10198 7568 12828 15 11 \n", + "10 202010 7 9011 6691 11331 14 10 \n", + "11 202009 7 13631 10544 16718 21 16 \n", + "12 202008 7 10424 7708 13140 16 12 \n", + "13 202007 7 8959 6574 11344 14 10 \n", + "14 202006 7 9264 6925 11603 14 10 \n", + "15 202005 7 8505 6314 10696 13 10 \n", + "16 202004 7 7991 5831 10151 12 9 \n", + "17 202003 7 5968 4100 7836 9 6 \n", + "18 202002 7 6534 4530 8538 10 7 \n", + "19 202001 7 9835 7019 12651 15 11 \n", + "20 201952 7 7941 5246 10636 12 8 \n", + "21 201951 7 5823 3675 7971 9 6 \n", + "22 201950 7 6424 4276 8572 10 7 \n", + "23 201949 7 6621 4540 8702 10 7 \n", + "24 201948 7 5542 3383 7701 8 5 \n", + "25 201947 7 7536 5058 10014 11 7 \n", + "26 201946 7 2638 1316 3960 4 2 \n", + "27 201945 7 4492 2615 6369 7 4 \n", + "28 201944 7 5728 3627 7829 9 6 \n", + "29 201943 7 4834 2751 6917 7 4 \n", + "... ... ... ... ... ... ... ... 
\n", + "1507 199126 7 17608 11304 23912 31 20 \n", + "1508 199125 7 16169 10700 21638 28 18 \n", + "1509 199124 7 16171 10071 22271 28 17 \n", + "1510 199123 7 11947 7671 16223 21 13 \n", + "1511 199122 7 15452 9953 20951 27 17 \n", + "1512 199121 7 14903 8975 20831 26 16 \n", + "1513 199120 7 19053 12742 25364 34 23 \n", + "1514 199119 7 16739 11246 22232 29 19 \n", + "1515 199118 7 21385 13882 28888 38 25 \n", + "1516 199117 7 13462 8877 18047 24 16 \n", + "1517 199116 7 14857 10068 19646 26 18 \n", + "1518 199115 7 13975 9781 18169 25 18 \n", + "1519 199114 7 12265 7684 16846 22 14 \n", + "1520 199113 7 9567 6041 13093 17 11 \n", + "1521 199112 7 10864 7331 14397 19 13 \n", + "1522 199111 7 15574 11184 19964 27 19 \n", + "1523 199110 7 16643 11372 21914 29 20 \n", + "1524 199109 7 13741 8780 18702 24 15 \n", + "1525 199108 7 13289 8813 17765 23 15 \n", + "1526 199107 7 12337 8077 16597 22 15 \n", + "1527 199106 7 10877 7013 14741 19 12 \n", + "1528 199105 7 10442 6544 14340 18 11 \n", + "1529 199104 7 7913 4563 11263 14 8 \n", + "1530 199103 7 15387 10484 20290 27 18 \n", + "1531 199102 7 16277 11046 21508 29 20 \n", + "1532 199101 7 15565 10271 20859 27 18 \n", + "1533 199052 7 19375 13295 25455 34 23 \n", + "1534 199051 7 19080 13807 24353 34 25 \n", + "1535 199050 7 11079 6660 15498 20 12 \n", + "1536 199049 7 1143 0 2610 2 0 \n", + "\n", + " inc100_up geo_insee geo_name \n", + "0 2 FR France \n", + "1 1 FR France \n", + "2 2 FR France \n", + "3 1 FR France \n", + "4 2 FR France \n", + "5 5 FR France \n", + "6 9 FR France \n", + "7 14 FR France \n", + "8 16 FR France \n", + "9 19 FR France \n", + "10 18 FR France \n", + "11 26 FR France \n", + "12 20 FR France \n", + "13 18 FR France \n", + "14 18 FR France \n", + "15 16 FR France \n", + "16 15 FR France \n", + "17 12 FR France \n", + "18 13 FR France \n", + "19 19 FR France \n", + "20 16 FR France \n", + "21 12 FR France \n", + "22 13 FR France \n", + "23 13 FR France \n", + "24 11 FR France \n", + "25 15 FR 
France \n", + "26 6 FR France \n", + "27 10 FR France \n", + "28 12 FR France \n", + "29 10 FR France \n", + "... ... ... ... \n", + "1507 42 FR France \n", + "1508 38 FR France \n", + "1509 39 FR France \n", + "1510 29 FR France \n", + "1511 37 FR France \n", + "1512 36 FR France \n", + "1513 45 FR France \n", + "1514 39 FR France \n", + "1515 51 FR France \n", + "1516 32 FR France \n", + "1517 34 FR France \n", + "1518 32 FR France \n", + "1519 30 FR France \n", + "1520 23 FR France \n", + "1521 25 FR France \n", + "1522 35 FR France \n", + "1523 38 FR France \n", + "1524 33 FR France \n", + "1525 31 FR France \n", + "1526 29 FR France \n", + "1527 26 FR France \n", + "1528 25 FR France \n", + "1529 20 FR France \n", + "1530 36 FR France \n", + "1531 38 FR France \n", + "1532 36 FR France \n", + "1533 45 FR France \n", + "1534 43 FR France \n", + "1535 28 FR France \n", + "1536 5 FR France \n", + "\n", + "[1537 rows x 10 columns]" + ] + }, + "execution_count": 24, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "raw_data = pd.read_csv(data_url, skiprows=1)\n", + "raw_data" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Are there missing points in this dataset? At first glance, no." + ] + }, + { + "cell_type": "code", + "execution_count": 26, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
(HTML rendering of the result: an empty DataFrame with columns week, indicator, inc, inc_low, inc_up, inc100, inc100_low, inc100_up, geo_insee, geo_name)" + ], + "text/plain": [ + "Empty DataFrame\n", + "Columns: [week, indicator, inc, inc_low, inc_up, inc100, inc100_low, inc100_up, geo_insee, geo_name]\n", + "Index: []" + ] + }, + "execution_count": 26, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "raw_data[raw_data.isnull().any(axis=1)]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Since no row has missing values, dropping incomplete rows leaves the data unchanged and has no impact on our analysis." + ] + }, + { + "cell_type": "code", + "execution_count": 27, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " week indicator inc inc_low inc_up inc100 inc100_low \\\n", + "0 202020 7 848 25 1671 1 0 \n", + "1 202019 7 310 0 753 0 0 \n", + "2 202018 7 849 98 1600 1 0 \n", + "3 202017 7 272 0 658 0 0 \n", + "4 202016 7 758 78 1438 1 0 \n", + "5 202015 7 1918 675 3161 3 1 \n", + "6 202014 7 3879 2227 5531 6 3 \n", + "7 202013 7 7326 5236 9416 11 8 \n", + "8 202012 7 8123 5790 10456 12 8 \n", + "9 202011 7 10198 7568 12828 15 11 \n", + "10 202010 7 9011 6691 11331 14 10 \n", + "11 202009 7 13631 10544 16718 21 16 \n", + "12 202008 7 10424 7708 13140 16 12 \n", + "13 202007 7 8959 6574 11344 14 10 \n", + "14 202006 7 9264 6925 11603 14 10 \n", + "15 202005 7 8505 6314 10696 13 10 \n", + "16 202004 7 7991 5831 10151 12 9 \n", + "17 202003 7 5968 4100 7836 9 6 \n", + "18 202002 7 6534 4530 8538 10 7 \n", + "19 202001 7 9835 7019 12651 15 11 \n", + "20 201952 7 7941 5246 10636 12 8 \n", + "21 201951 7 5823 3675 7971 9 6 \n", + "22 201950 7 6424 4276 8572 10 7 \n", + "23 201949 7 6621 4540 8702 10 7 \n", + "24 201948 7 5542 3383 7701 8 5 \n", + "25 201947 7 7536 5058 10014 11 7 \n", + "26 201946 7 2638 1316 3960 4 2 \n", + "27 201945 7 4492 2615 6369 7 4 \n", + "28 201944 7 5728 3627 7829 9 6 \n", + "29 201943 7 4834 2751 6917 7 4 \n", + "... ... ... ... ... ... ... ... 
\n", + "1507 199126 7 17608 11304 23912 31 20 \n", + "1508 199125 7 16169 10700 21638 28 18 \n", + "1509 199124 7 16171 10071 22271 28 17 \n", + "1510 199123 7 11947 7671 16223 21 13 \n", + "1511 199122 7 15452 9953 20951 27 17 \n", + "1512 199121 7 14903 8975 20831 26 16 \n", + "1513 199120 7 19053 12742 25364 34 23 \n", + "1514 199119 7 16739 11246 22232 29 19 \n", + "1515 199118 7 21385 13882 28888 38 25 \n", + "1516 199117 7 13462 8877 18047 24 16 \n", + "1517 199116 7 14857 10068 19646 26 18 \n", + "1518 199115 7 13975 9781 18169 25 18 \n", + "1519 199114 7 12265 7684 16846 22 14 \n", + "1520 199113 7 9567 6041 13093 17 11 \n", + "1521 199112 7 10864 7331 14397 19 13 \n", + "1522 199111 7 15574 11184 19964 27 19 \n", + "1523 199110 7 16643 11372 21914 29 20 \n", + "1524 199109 7 13741 8780 18702 24 15 \n", + "1525 199108 7 13289 8813 17765 23 15 \n", + "1526 199107 7 12337 8077 16597 22 15 \n", + "1527 199106 7 10877 7013 14741 19 12 \n", + "1528 199105 7 10442 6544 14340 18 11 \n", + "1529 199104 7 7913 4563 11263 14 8 \n", + "1530 199103 7 15387 10484 20290 27 18 \n", + "1531 199102 7 16277 11046 21508 29 20 \n", + "1532 199101 7 15565 10271 20859 27 18 \n", + "1533 199052 7 19375 13295 25455 34 23 \n", + "1534 199051 7 19080 13807 24353 34 25 \n", + "1535 199050 7 11079 6660 15498 20 12 \n", + "1536 199049 7 1143 0 2610 2 0 \n", + "\n", + " inc100_up geo_insee geo_name \n", + "0 2 FR France \n", + "1 1 FR France \n", + "2 2 FR France \n", + "3 1 FR France \n", + "4 2 FR France \n", + "5 5 FR France \n", + "6 9 FR France \n", + "7 14 FR France \n", + "8 16 FR France \n", + "9 19 FR France \n", + "10 18 FR France \n", + "11 26 FR France \n", + "12 20 FR France \n", + "13 18 FR France \n", + "14 18 FR France \n", + "15 16 FR France \n", + "16 15 FR France \n", + "17 12 FR France \n", + "18 13 FR France \n", + "19 19 FR France \n", + "20 16 FR France \n", + "21 12 FR France \n", + "22 13 FR France \n", + "23 13 FR France \n", + "24 11 FR France \n", + "25 15 FR 
France \n", + "26 6 FR France \n", + "27 10 FR France \n", + "28 12 FR France \n", + "29 10 FR France \n", + "... ... ... ... \n", + "1507 42 FR France \n", + "1508 38 FR France \n", + "1509 39 FR France \n", + "1510 29 FR France \n", + "1511 37 FR France \n", + "1512 36 FR France \n", + "1513 45 FR France \n", + "1514 39 FR France \n", + "1515 51 FR France \n", + "1516 32 FR France \n", + "1517 34 FR France \n", + "1518 32 FR France \n", + "1519 30 FR France \n", + "1520 23 FR France \n", + "1521 25 FR France \n", + "1522 35 FR France \n", + "1523 38 FR France \n", + "1524 33 FR France \n", + "1525 31 FR France \n", + "1526 29 FR France \n", + "1527 26 FR France \n", + "1528 25 FR France \n", + "1529 20 FR France \n", + "1530 36 FR France \n", + "1531 38 FR France \n", + "1532 36 FR France \n", + "1533 45 FR France \n", + "1534 43 FR France \n", + "1535 28 FR France \n", + "1536 5 FR France \n", + "\n", + "[1537 rows x 10 columns]" + ] + }, + "execution_count": 27, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "data = raw_data.dropna().copy()\n", + "data" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Our data use an unusual convention: the week number is appended to\n", + "the year, which makes the value look like an integer. That is how\n", + "pandas interprets it.\n", + " \n", + "A second problem is that pandas does not understand week numbers.\n", + "It needs the start and end dates of each week. We use the\n", + "`isoweek` library for this.\n", + "\n", + "Since the week conversion has become fairly involved, we write a\n", + "small Python function for it. We then apply it to every point in\n", + "our data. The results go into a new column, 'period'."
+ ] + }, + { + "cell_type": "code", + "execution_count": 28, + "metadata": {}, + "outputs": [], + "source": [ + "def convert_week(year_and_week_int):\n", + " year_and_week_str = str(year_and_week_int)\n", + " year = int(year_and_week_str[:4])\n", + " week = int(year_and_week_str[4:])\n", + " w = isoweek.Week(year, week)\n", + " return pd.Period(w.day(0), 'W')\n", + "\n", + "data['period'] = [convert_week(yw) for yw in data['week']]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Two small changes remain.\n", + "\n", + "First, we set the observation periods as the new index of our\n", + "data set. This turns it into a time series, which will be\n", + "convenient later on.\n", + "\n", + "Second, we sort the points by period, in chronological\n", + "order." + ] + }, + { + "cell_type": "code", + "execution_count": 29, + "metadata": {}, + "outputs": [], + "source": [ + "sorted_data = data.set_index('period').sort_index()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We check the consistency of the data. Between the end of one period and\n", + "the start of the next, the time difference should be zero, or at\n", + "least very small. We allow a \"margin of error\"\n", + "of one second.\n", + "\n", + "This holds everywhere except for two consecutive periods\n", + "with a missing week between them.\n", + "\n", + "We recognize these dates: this is the week without observations\n", + "that we removed!"
+ ] + }, + { + "cell_type": "code", + "execution_count": 30, + "metadata": {}, + "outputs": [], + "source": [ + "periods = sorted_data.index\n", + "for p1, p2 in zip(periods[:-1], periods[1:]):\n", + " delta = p2.to_timestamp() - p1.end_time\n", + " if delta > pd.Timedelta('1s'):\n", + " print(p1, p2)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "A first look at the data!" + ] + }, + { + "cell_type": "code", + "execution_count": 31, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "" + ] + }, + "execution_count": 31, + "metadata": {}, + "output_type": "execute_result" + }, + { + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "needs_background": "light" + }, + "output_type": "display_data" + } + ], + "source": [ + "sorted_data['inc'].plot()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Zooming in on the most recent years shows the location of the peaks in spring and summer more clearly. The incidence trough falls at the end of summer." + ] + }, + { + "cell_type": "code", + "execution_count": 35, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "" + ] + }, + "execution_count": 35, + "metadata": {}, + "output_type": "execute_result" + }, + { + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "needs_background": "light" + }, + "output_type": "display_data" + } + ], + "source": [ + "sorted_data['inc'][-100:].plot()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Study of the annual incidence" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Since the peak of the epidemic occurs in winter, straddling\n", + "two calendar years, we define the reference period\n", + "between two minima of the incidence, from August 1st of year $N$ to\n", + "August 1st of year $N+1$.\n", + "\n", + "Our task is complicated slightly by the fact that a year does not\n", + "contain a whole number of weeks. We therefore adjust our reference\n", + "periods a little: instead of August 1st of each year, we use the\n", + "first day of the week that contains August 1st.\n", + "\n", + "Since the incidence of influenza-like illness is very low in summer, this\n", + "adjustment is unlikely to distort our conclusions.\n", + "\n", + "One more detail: the data shown above begin in December 1990 (week 199049), which\n", + "makes the first year incomplete. We therefore start the analysis in 1991." + ] + }, + { + "cell_type": "code", + "execution_count": 36, + "metadata": {}, + "outputs": [], + "source": [ + "first_august_week = [pd.Period(pd.Timestamp(y, 8, 1), 'W')\n", + " for y in range(1991,\n", + " sorted_data.index[-1].year)]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Starting from this list of weeks that contain an August 1st, we obtain our intervals of roughly one year as the periods between two adjacent weeks in the list. We compute the sum of the weekly incidences for each of these periods.\n", + "\n", + "We also check that each period contains about 52 weeks (between 51 and 53, as the assertion in the code allows), to guard against possible errors in our code."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "year = []\n", + "yearly_incidence = []\n", + "for week1, week2 in zip(first_august_week[:-1],\n", + " first_august_week[1:]):\n", + " one_year = sorted_data['inc'][week1:week2-1]\n", + " assert abs(len(one_year)-52) < 2\n", + " yearly_incidence.append(one_year.sum())\n", + " year.append(week2.year)\n", + "yearly_incidence = pd.Series(data=yearly_incidence, index=year)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Here are the annual incidences." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "yearly_incidence.plot(style='*')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "A sorted list makes it easier to spot the highest values (at the end)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "yearly_incidence.sort_values()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Finally, a histogram clearly shows that strong epidemics, which affect about 10% of the French\n", + " population, are fairly rare: there have been three over the last 35 years."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "yearly_incidence.hist(xrot=20)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.6.4" + } + }, + "nbformat": 4, + "nbformat_minor": 1 +} diff --git a/module4/src_Python3_challenger.ipynb b/module4/src_Python3_challenger.ipynb new file mode 100644 index 0000000000000000000000000000000000000000..59b4d0ae2af9be254a47255677d1604736f33bdc --- /dev/null +++ b/module4/src_Python3_challenger.ipynb @@ -0,0 +1,812 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Risk Analysis of the Space Shuttle: Pre-Challenger Prediction of Failure" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "In this document we reperform some of the analysis provided in \n", + "*Risk Analysis of the Space Shuttle: Pre-Challenger Prediction of Failure* by *Siddhartha R. Dalal, Edward B. Fowlkes, Bruce Hoadley* published in *Journal of the American Statistical Association*, Vol. 84, No. 408 (Dec., 1989), pp. 945-957 and available at http://www.jstor.org/stable/2290069. \n", + "\n", + "On the fourth page of this article, they indicate that the maximum likelihood estimates of the logistic regression using only temperature are: $\\hat{\\alpha}=5.085$ and $\\hat{\\beta}=-0.1156$ and their asymptotic standard errors are $s_{\\hat{\\alpha}}=3.052$ and $s_{\\hat{\\beta}}=0.047$. The Goodness of fit indicated for this model was $G^2=18.086$ with 21 degrees of freedom. 
Our goal is to reproduce the computation behind these values and Figure 4 of the article, possibly in a nicer-looking way." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Technical information on the computer on which the analysis is run" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We will use Python 3 with the pandas, statsmodels, numpy, matplotlib and seaborn libraries." + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "3.6.4 |Anaconda, Inc.| (default, Mar 13 2018, 01:15:57) \n", + "[GCC 7.2.0]\n", + "uname_result(system='Linux', node='a7ade41352bb', release='4.4.0-164-generic', version='#192-Ubuntu SMP Fri Sep 13 12:02:50 UTC 2019', machine='x86_64', processor='x86_64')\n", + "IPython 7.12.0\n", + "IPython.core.release 7.12.0\n", + "PIL 7.0.0\n", + "PIL.Image 7.0.0\n", + "PIL._version 7.0.0\n", + "_csv 1.0\n", + "_ctypes 1.1.0\n", + "_curses b'2.2'\n", + "decimal 1.70\n", + "argparse 1.1\n", + "backcall 0.1.0\n", + "cffi 1.13.2\n", + "csv 1.0\n", + "ctypes 1.1.0\n", + "cycler 0.10.0\n", + "dateutil 2.8.1\n", + "decimal 1.70\n", + "decorator 4.4.1\n", + "distutils 3.6.4\n", + "ipaddress 1.0\n", + "ipykernel 5.1.4\n", + "ipykernel._version 5.1.4\n", + "ipython_genutils 0.2.0\n", + "ipython_genutils._version 0.2.0\n", + "ipywidgets 7.2.1\n", + "ipywidgets._version 7.2.1\n", + "jedi 0.16.0\n", + "json 2.0.9\n", + "jupyter_client 6.0.0\n", + "jupyter_client._version 6.0.0\n", + "jupyter_core 4.6.3\n", + "jupyter_core.version 4.6.3\n", + "kiwisolver 1.1.0\n", + "logging 0.5.1.2\n", + "matplotlib 2.2.3\n", + "matplotlib.backends.backend_agg 2.2.3\n", + "numpy 1.15.2\n", + "numpy.core 1.15.2\n", + "numpy.core.multiarray 3.1\n", + "numpy.lib 1.15.2\n", + "numpy.linalg._umath_linalg b'0.1.5'\n", + "numpy.matlib 1.15.2\n", + "optparse 1.5.3\n", + "pandas 0.22.0\n", + "_libjson 
1.33\n", + "parso 0.6.0\n", + "patsy 0.5.1\n", + "patsy.version 0.5.1\n", + "pexpect 4.8.0\n", + "pickleshare 0.7.5\n", + "platform 1.0.8\n", + "prompt_toolkit 3.0.3\n", + "ptyprocess 0.6.0\n", + "pygments 2.5.2\n", + "pyparsing 2.4.6\n", + "pytz 2019.3\n", + "re 2.2.1\n", + "scipy 1.1.0\n", + "scipy._lib.decorator 4.0.5\n", + "scipy._lib.six 1.2.0\n", + "scipy.fftpack._fftpack b'$Revision: $'\n", + "scipy.fftpack.convolve b'$Revision: $'\n", + "scipy.integrate._dop b'$Revision: $'\n", + "scipy.integrate._ode $Id$\n", + "scipy.integrate._odepack 1.9 \n", + "scipy.integrate._quadpack 1.13 \n", + "scipy.integrate.lsoda b'$Revision: $'\n", + "scipy.integrate.vode b'$Revision: $'\n", + "scipy.interpolate._fitpack 1.7 \n", + "scipy.interpolate.dfitpack b'$Revision: $'\n", + "scipy.linalg 0.4.9\n", + "scipy.linalg._fblas b'$Revision: $'\n", + "scipy.linalg._flapack b'$Revision: $'\n", + "scipy.linalg._flinalg b'$Revision: $'\n", + "scipy.ndimage 2.0\n", + "scipy.optimize._cobyla b'$Revision: $'\n", + "scipy.optimize._lbfgsb b'$Revision: $'\n", + "scipy.optimize._minpack 1.10 \n", + "scipy.optimize._nnls b'$Revision: $'\n", + "scipy.optimize._slsqp b'$Revision: $'\n", + "scipy.optimize.minpack2 b'$Revision: $'\n", + "scipy.signal.spline 0.2\n", + "scipy.sparse.linalg.eigen.arpack._arpack b'$Revision: $'\n", + "scipy.sparse.linalg.isolve._iterative b'$Revision: $'\n", + "scipy.special.specfun b'$Revision: $'\n", + "scipy.stats.mvn b'$Revision: $'\n", + "scipy.stats.statlib b'$Revision: $'\n", + "seaborn 0.8.1\n", + "seaborn.external.husl 2.1.0\n", + "seaborn.external.six 1.10.0\n", + "six 1.14.0\n", + "statsmodels 0.9.0\n", + "statsmodels.__init__ 0.9.0\n", + "traitlets 4.3.3\n", + "traitlets._version 4.3.3\n", + "urllib.request 3.6\n", + "zlib 1.0\n", + "zmq 17.1.2\n", + "zmq.sugar 17.1.2\n", + "zmq.sugar.version 17.1.2\n" + ] + } + ], + "source": [ + "def print_imported_modules():\n", + " import sys\n", + " for name, val in sorted(sys.modules.items()):\n", + " 
if(hasattr(val, '__version__')): \n", + " print(val.__name__, val.__version__)\n", + "# else:\n", + "# print(val.__name__, \"(unknown version)\")\n", + "def print_sys_info():\n", + " import sys\n", + " import platform\n", + " print(sys.version)\n", + " print(platform.uname())\n", + "\n", + "import numpy as np\n", + "import pandas as pd\n", + "import matplotlib.pyplot as plt\n", + "import statsmodels.api as sm\n", + "import seaborn as sns\n", + "\n", + "print_sys_info()\n", + "print_imported_modules()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Loading and inspecting data\n", + "Let's start by reading data." + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " Date Count Temperature Pressure Malfunction\n", + "0 4/12/81 6 66 50 0\n", + "1 11/12/81 6 70 50 1\n", + "2 3/22/82 6 69 50 0\n", + "3 11/11/82 6 68 50 0\n", + "4 4/04/83 6 67 50 0\n", + "5 6/18/82 6 72 50 0\n", + "6 8/30/83 6 73 100 0\n", + "7 11/28/83 6 70 100 0\n", + "8 2/03/84 6 57 200 1\n", + "9 4/06/84 6 63 200 1\n", + "10 8/30/84 6 70 200 1\n", + "11 10/05/84 6 78 200 0\n", + "12 11/08/84 6 67 200 0\n", + "13 1/24/85 6 53 200 2\n", + "14 4/12/85 6 67 200 0\n", + "15 4/29/85 6 75 200 0\n", + "16 6/17/85 6 70 200 0\n", + "17 7/2903/85 6 81 200 0\n", + "18 8/27/85 6 76 200 0\n", + "19 10/03/85 6 79 200 0\n", + "20 10/30/85 6 75 200 2\n", + "21 11/26/85 6 76 200 0\n", + "22 1/12/86 6 58 200 1" + ] + }, + "execution_count": 7, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "data = pd.read_csv(\"https://app-learninglab.inria.fr/moocrr/gitlab/moocrr-session3/moocrr-reproducibility-study/raw/master/data/shuttle.csv\")\n", + "data" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We know from our previous experience on this data set that filtering data is a really bad idea. We will therefore process it as such." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [ + { + "data": { + "image/png": "iVBORw0KGgoAAAANSUhEUgAAAYUAAAEKCAYAAAD9xUlFAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMi4zLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvIxREBQAAF9JJREFUeJzt3X2UXXV97/H3d5IACYlAg02VQAFJsVyBCOFJtDfx6Qa7JPUCBbyCl940ZUlul9y2htvVa6m1a1V8qHpFY+SiQldNVRBom14e1Ii0IASM4UHBuYBhEhogBshASGYy3/vH2bN7Mkxmzhlmz5lzeL/WmpWz9/mdne939pz5zN5nn9+JzESSJICuVhcgSZo8DAVJUslQkCSVDAVJUslQkCSVDAVJUqmyUIiIqyPiqYh4YC/3R0R8PiK6I2JDRJxQVS2SpMZUeaTwNWDxCPefAcwrvpYBX6qwFklSAyoLhcy8HfjlCEOWANdkzV3AgRHxuqrqkSSNbmoL/+9DgCfqlnuKdU8OHRgRy6gdTTB9+vQTDz300AkpsFEDAwN0dXXmyzOd2pt9tZ9O7W2i+nrkkUeeyczXjjaulaEQw6wbds6NzFwFrAJYsGBBrlu3rsq6mrZ27VoWLlzY6jIq0am92Vf76dTeJqqviPhFI+NaGbs9QP2f/HOBzS2qRZJEa0PhJuDC4iqkU4HnMvNlp44kSROnstNHEfENYCFwcET0AH8OTAPIzJXAGuA9QDfwInBRVbVIkhpTWShk5vmj3J/AJVX9/5Kk5nXeS/mSpDEzFCRJJUNBklQyFCRJJUNBklQyFCRJJUNBklQyFCRJJUNBklQyFCRJJUNBklQyFCRJJUNBklQyFCRJJUNBklQyFCRJJUNBklQyFCRJJUNBklQyFCRJJUNBklQyFCRJJUNBklQyFCRJJUNBklQyFCRJJUNBklQyFCRJJUNBklQyFCRJJUNBklQyFCRJJUNBklQyFCRJJUNBklSqNBQiYnFEPBwR3RFx2TD3HxAR/xARP4mIByPioirrkSSNrLJQiIgpwJXAGcAxwPkRccyQYZcAD2Xm8cBC4NMRsU9VNUmSRlblkcLJQHdmPpqZu4DVwJIhYxKYFREBzAR+CfRXWJMkaQSRmdVsOOJsYHFmLi2WLwBOyczldWNmATcBbwRmAedm5j8Ns61lwDKAOXPmnLh69epKah6r3t5eZs6c2eoyKtGpvdlX++nU3iaqr0WLFt2bmQtGGze1whpimHVDE+g/AeuBtwNvAG6NiB9m5vN7PChzFbAKYMGCBblw4cLxr/YVWLt2LZOtpvHSqb3ZV/vp1N4mW19Vnj7qAQ6tW54LbB4y5iLg+qzpBh6jdtQgSWqBKkPhHmBeRBxRvHh8HrVTRfU2Au8AiIg5wNHAoxXWJEkaQWWnjzKzPyKWAzcDU4CrM/PBiLi4uH8l8JfA1yLifmqnm1Zk5jNV1SRJGlmVrymQmWuANUPWray7vRl4d5U1SJIa5zuaJUklQ0GSVDIUJEklQ0GSVDIUJEklQ0GSVDIUJEklQ0GSVDIUJEklQ0GSVDIUJEklQ0GSVDIUJEklQ0GSVDIUJEklQ0GSVDIUJEklQ0GSVDIUJEklQ0GSVDIUJEklQ0GSVDIUJEklQ0GSVDIUJEklQ0GSVDIUJEklQ0GSVDIUJEklQ0GSVDIUJEklQ0GSVDIUJEklQ0GSVKo0FCJicUQ8HBHdEXHZXsYsjIj1EfFgRPygynokSSOb2sigiHhTZj7QzIYjYgpwJfAuoAe4JyJuysyH6sYcCHwRWJyZGyPiV5v5PyRJ46vRI4WVEXF3RHyo+EXeiJOB7sx8NDN3AauBJUPGvB+4PjM
3AmTmUw1uW5JUgcjMxgZGzAN+DzgHuBv4ambeOsL4s6kdASwtli8ATsnM5XVjPgtMA/4DMAv4XGZeM8y2lgHLAObMmXPi6tWrG+tugvT29jJz5sxWl1GJTu3NvtpPp/Y2UX0tWrTo3sxcMOrAzGz4C5gCnAVsAn4K/Az4z3sZew5wVd3yBcD/HjLmC8BdwP7AwcDPgd8YqYYTTzwxJ5vvf//7rS6hMp3am321n07tbaL6AtZlA7/nG31N4TjgIuC3gVuB92bmfRHxeuBO4PphHtYDHFq3PBfYPMyYZzLzBeCFiLgdOB54pJG6JEnjq9HXFL4A3Accn5mXZOZ9AJm5GfizvTzmHmBeRBwREfsA5wE3DRlzI/C2iJgaETOAU6gdgUiSWqChIwXgPcCOzNwNEBFdwH6Z+WJmXjvcAzKzPyKWAzdTO+10dWY+GBEXF/evzMyfRsT/BTYAA9RONzV1lZMkafw0Ggq3Ae8EeovlGcAtwFtGelBmrgHWDFm3csjyJ4FPNliHJKlCjZ4+2i8zBwOB4vaMakqSJLVKo6HwQkScMLgQEScCO6opSZLUKo2ePvow8K2IGLx66HXAudWUJElqlYZCITPviYg3AkcDAfwsM/sqrUySNOEaPVIAOAk4vHjMmyOCHObdx5Kk9tXom9euBd4ArAd2F6sTMBQkqYM0eqSwADimeKu0JKlDNXr10QPAr1VZiCSp9Ro9UjgYeCgi7gZ2Dq7MzDMrqUqS1BKNhsLlVRYhSZocGr0k9QcR8evAvMy8rZi8bkq1pUmSJlpDrylExO8D3wa+XKw6BLihqqIkSa3R6AvNlwCnA88DZObPAT9PWZI6TKOhsDNrn7MMQERMpfY+BUlSB2k0FH4QEX8KTI+IdwHfAv6hurIkSa3QaChcBjwN3A/8AbXPSNjbJ65JktpUo1cfDQBfKb4kSR2q0bmPHmOY1xAy88hxr0iS1DLNzH00aD/gHOBXxr8cSVIrNfSaQmZurfvalJmfBd5ecW2SpAnW6OmjE+oWu6gdOcyqpCJJUss0evro03W3+4HHgd8d92okSS3V6NVHi6ouRJLUeo2ePvofI92fmZ8Zn3IkSa3UzNVHJwE3FcvvBW4HnqiiKElSazTzITsnZOZ2gIi4HPhWZi6tqjBJ0sRrdJqLw4Bddcu7gMPHvRpJUks1eqRwLXB3RHyH2jub3wdcU1lVkqSWaPTqo7+KiH8G3lasuigzf1xdWZKkVmj09BHADOD5zPwc0BMRR1RUkySpRRr9OM4/B1YA/7NYNQ3426qKkiS1RqNHCu8DzgReAMjMzTjNhSR1nEZDYVdmJsX02RGxf3UlSZJapdFQ+GZEfBk4MCJ+H7gNP3BHkjpOo1cffar4bObngaOBj2bmrZVWJkmacKMeKUTElIi4LTNvzcw/ycw/bjQQImJxRDwcEd0RcdkI406KiN0RcXYzxUuSxteooZCZu4EXI+KAZjYcEVOAK4EzgGOA8yPimL2M+wRwczPblySNv0bf0fwScH9E3EpxBRJAZv7hCI85GejOzEcBImI1sAR4aMi4/w5cR23CPUlSCzUaCv9UfDXjEPacRbUHOKV+QEQcQu1y17czQihExDJgGcCcOXNYu3Ztk6VUq7e3d9LVNF46tTf7aj+d2ttk62vEUIiIwzJzY2Z+fQzbjmHW5ZDlzwIrMnN3xHDDiwdlrgJWASxYsCAXLlw4hnKqs3btWiZbTeOlU3uzr/bTqb1Ntr5Ge03hhsEbEXFdk9vuAQ6tW54LbB4yZgGwOiIeB84GvhgRv9Pk/yNJGiejnT6q//P9yCa3fQ8wr5gjaRNwHvD++gGZWc6fFBFfA/4xM29AktQSo4VC7uX2qDKzPyKWU7uqaApwdWY+GBEXF/evbKpSSVLlRguF4yPieWpHDNOL2xTLmZmvGenBmbkGWDNk3bBhkJn/taGKJUmVGTEUMnPKRBUiSWq9Zj5PQZLU4QwFSVLJUJAklQwFSVLpVRMKW3t
38pMnnmVr785WlyKpSVt7d7Kjb7fP3wnwqgiFG9dv4vRPfI8PXPUjTv/E97hp/aZWlySpQYPP38eefsHn7wTo+FDY2ruTFddt4KW+Abbv7OelvgE+ct0G/+KQ2kD983d3ps/fCdDxodCzbQfTuvZsc1pXFz3bdrSoIkmN8vk78To+FOYeNJ2+gYE91vUNDDD3oOktqkhSo3z+TryOD4XZM/flirOOY79pXczadyr7TeviirOOY/bMfVtdmqRR1D9/p0T4/J0AjX7ITls7c/4hnH7UwfRs28Hcg6b7AyW1kcHn79133sG/nPlWn78Ve1WEAtT+4vCHSWpPs2fuy/RpU3wOT4COP30kSWqcoSBJKhkKkqSSoSBJKhkKkqSSoSBJKhkKkqSSoSBJKhkKkqSSoSBJKhkKkqSSoSBJKhkKkqSSoSBJKhkKkqSSoSBJKhkKkqSSoSBJKhkKkqSSoSBJKhkKkqRSpaEQEYsj4uGI6I6Iy4a5/79ExIbi618j4vgq65EkjayyUIiIKcCVwBnAMcD5EXHMkGGPAf8xM48D/hJYVVU9kqTRVXmkcDLQnZmPZuYuYDWwpH5AZv5rZm4rFu8C5lZYjyRpFJGZ1Ww44mxgcWYuLZYvAE7JzOV7Gf/HwBsHxw+5bxmwDGDOnDknrl69upKax6q3t5eZM2e2uoxKdGpv9tV+OrW3iepr0aJF92bmgtHGTa2whhhm3bAJFBGLgP8GvHW4+zNzFcWppQULFuTChQvHqcTxsXbtWiZbTeOlU3uzr/bTqb1Ntr6qDIUe4NC65bnA5qGDIuI44CrgjMzcWmE9kqRRVPmawj3AvIg4IiL2Ac4DbqofEBGHAdcDF2TmIxXWIklqQGVHCpnZHxHLgZuBKcDVmflgRFxc3L8S+CgwG/hiRAD0N3LOS5JUjSpPH5GZa4A1Q9atrLu9FHjZC8uCrb076dm2g7kHTWf2zH3HbWw76dS+qtK9ZTvbXuyje8t2jpozq9XlqE1VGgoamxvXb2LFdRuY1tVF38AAV5x1HGfOP+QVj20nndpXVT56w/1cc9dG/ujYfi79m9u58LTD+NiSY1tdltqQ01xMMlt7d7Liug281DfA9p39vNQ3wEeu28DW3p2vaGw76dS+qtK9ZTvX3LVxj3XX3LmR7i3bW1SR2pmhMMn0bNvBtK49d8u0ri56tu14RWPbSaf2VZX1Tzzb1HppJIbCJDP3oOn0DQzssa5vYIC5B01/RWPbSaf2VZX5hx7Y1HppJIbCJDN75r5ccdZx7Deti1n7TmW/aV1ccdZxw77Q2szYdtKpfVXlqDmzuPC0w/ZYd+Fph/lis8bEF5onoTPnH8LpRx3c0JU3zYxtJ53aV1U+tuRYLjz1cO6/9y5uu/RUA0FjZihMUrNn7tvwL8JmxraTTu2rKkfNmUXPjGkGgl4RTx9JkkqGgiSpZChIkkqGgiSpZChIkkqGgiSpZChIkkqGgiSpZChIkkqGgiSpZChIkkqGgiSpZChIkkqGgiSpZChIkkqGgiSpZChIkkqGgiSpZChIkkqGgiSpZChIkkqGgiSpZChIkkqGgiSpZChIkkqGgiSpZChIkkqGgiSpVGkoRMTiiHg4Iroj4rJh7o+I+Hxx/4aIOKHKeqRmbe3dyU+eeJatvTtHHbvusa185paHWffY1nHbZjNju7dsZ9uLfXRv2T7q2GZUVW+zNezo2z3qdru3bOfb657o2O9BFdsdampVG46IKcCVwLuAHuCeiLgpMx+qG3YGMK/4OgX4UvGv1HI3rt/Eius2MK2ri76BAa446zjOnH/IsGM/cNVd3NFdC4PPf6+btx01m2uXnvqKttnM2I/ecD/X3LWRPzq2n0v/5nYuPO0wPrbk2DF2Xn29Y6nhD3+zj0s/8b29bnfwezCoE78H473d4VR5pHAy0J2Zj2bmLmA1sGTImCXANVlzF3BgRLyuwpqkhmzt3cmK6zbwUt8A23f281LfAB+5bsOwf6Wte2x
rGQiDfti99WVHDM1ss5mx3Vu27/HLEOCaOze+4r+Wq6p3rDXsztzrdl8t34Px3O7eRGZWs+GIs4HFmbm0WL4AOCUzl9eN+UfgrzPzjmL5u8CKzFw3ZFvLgGXF4tHAw5UUPXYHA8+0uoiKdGpvI/YV06bPmHrQ634jurqmDK7LgYHd/duefCT7drxYP3bKrINfP2X/A1/2x8zuF559cvf2ZzaPZZvNjO2accDsqa957eEAu198jikzDgCg//mnHx948bmRz2WNoKp6x1rDYG/Dbbf+e1CvTb4H4/azOIpfz8zXjjaostNHQAyzbmgCNTKGzFwFrBqPoqoQEesyc0Gr66hCp/bWyX31P/dUx/UFndvbZPtZrPL0UQ9waN3yXGDzGMZIkiZIlaFwDzAvIo6IiH2A84Cbhoy5CbiwuArpVOC5zHyywpokSSOo7PRRZvZHxHLgZmAKcHVmPhgRFxf3rwTWAO8BuoEXgYuqqqdik/bU1jjo1N7sq/10am+Tqq/KXmiWJLUf39EsSSoZCpKkkqEwBhHxeETcHxHrI2Jdse7yiNhUrFsfEe9pdZ3NiogDI+LbEfGziPhpRJwWEb8SEbdGxM+Lfw9qdZ3N2ktfnbC/jq6rf31EPB8RH273fTZCX52wzy6NiAcj4oGI+EZE7DfZ9pevKYxBRDwOLMjMZ+rWXQ70ZuanWlXXKxURXwd+mJlXFVeMzQD+FPhlZv51MX/VQZm5oqWFNmkvfX2YNt9f9YppZTZRmybmEtp8nw0a0tdFtPE+i4hDgDuAYzJzR0R8k9rFNscwifaXRwoCICJeA/wW8H8AMnNXZj5LbSqSrxfDvg78TmsqHJsR+uo07wD+X2b+gjbfZ0PU99UJpgLTI2IqtT9ONjPJ9pehMDYJ3BIR9xZTcAxaXsz2enWrDwHH4EjgaeCrEfHjiLgqIvYH5gy+d6T491dbWeQY7K0vaO/9NdR5wDeK2+2+z+rV9wVtvM8ycxPwKWAj8CS192XdwiTbX4bC2JyemSdQm+X1koj4LWozvL4BmE9th3+6hfWNxVTgBOBLmflm4AXgZdOdt6G99dXu+6tUnBI7E/hWq2sZT8P01db7rAixJcARwOuB/SPiA62t6uUMhTHIzM3Fv08B3wFOzswtmbk7MweAr1CbJbad9AA9mfmjYvnb1H6Zbhmcubb496kW1TdWw/bVAfur3hnAfZm5pVhu9302aI++OmCfvRN4LDOfzsw+4HrgLUyy/WUoNCki9o+IWYO3gXcDD8SeU36/D3igFfWNVWb+G/BERBxdrHoH8BC1qUg+WKz7IHBjC8obs7311e77a4jz2fMUS1vvszp79NUB+2wjcGpEzIiIoPaz+FMm2f7y6qMmRcSR1I4OoHZq4u8y868i4lpqh7UJPA78QbvN4xQR84GrgH2AR6ld7dEFfBM4jNoP9TmZ+cuWFTkGe+nr87T5/gKIiBnAE8CRmflcsW427b/PhuurE55jfwGcC/QDPwaWAjOZRPvLUJAklTx9JEkqGQqSpJKhIEkqGQqSpJKhIEkqVfbJa9JEKy7F/G6x+GvAbmpTXEDtDYa7WlLYCCLi94A1xfsppJbzklR1pMk0a21ETMnM3Xu57w5geWaub2J7UzOzf9wKlOp4+kivChHxwYi4u5iH/4sR0RURUyPi2Yj4ZETcFxE3R8QpEfGDiHh0cL7+iFgaEd8p7n84Iv6swe1+PCLuBk6OiL+IiHuKefRXRs251N6M9ffF4/eJiJ6IOLDY9qkRcVtx++MR8eWIuJXa5H5TI+Izxf+9ISKWTvx3VZ3IUFDHi4g3UZsW4S2ZOZ/aadPzirsPAG4pJjjcBVxObfqBc4CP1W3m5OIxJwDvj4j5DWz3vsw8OTPvBD6XmScBxxb3Lc7MvwfWA+dm5vwGTm+9GXhvZl4ALAOeysyTgZOoTcx42Fi+P1I9X1PQq8E7qf3iXFebcobp1KZQANiRmbcWt++nNp1xf0TcDxxet42bM3M
bQETcALyV2vNnb9vdxb9PhwLwjoj4E2A/4GDgXuCfm+zjxsx8qbj9buA3I6I+hOZRmyZBGjNDQa8GAVydmf9rj5W1Dzqp/+t8ANhZd7v++TH0xbccZbs7snjBrpjH5wvUZmfdFBEfpxYOw+nn34/gh455YUhPH8rM7yKNI08f6dXgNuB3I+JgqF2lNIZTLe+O2mc9z6A2J/6/NLHd6dRC5pliht2z6u7bDsyqW34cOLG4XT9uqJuBDxUBNPi5xtOb7El6GY8U1PEy8/5idsrbIqIL6AMupvZRiI26A/g7ah/ycu3g1UKNbDczt0btc6IfAH4B/Kju7q8CV0XEDmqvW1wOfCUi/g24e4R6vkxtVs31xamrp6iFlfSKeEmqNIriyp43ZeaHW12LVDVPH0mSSh4pSJJKHilIkkqGgiSpZChIkkqGgiSpZChIkkr/HzHofwgP0tIHAAAAAElFTkSuQmCC\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "needs_background": "light" + }, + "output_type": "display_data" + } + ], + "source": [ + "%matplotlib inline\n", + "pd.set_option('mode.chained_assignment',None) # this removes a useless warning from pandas\n", + "import matplotlib.pyplot as plt\n", + "\n", + "data[\"Frequency\"]=data.Malfunction/data.Count\n", + "data.plot(x=\"Temperature\",y=\"Frequency\",kind=\"scatter\",ylim=[0,1])\n", + "plt.grid(True)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Logistic regression\n", + "\n", + "Let's assume O-rings independently fail with the same probability which solely depends on temperature. A logistic regression should allow us to estimate the influence of temperature." + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "\n", + "\n", + "\n", + " \n", + "\n", + "\n", + " \n", + "\n", + "\n", + " \n", + "\n", + "\n", + " \n", + "\n", + "\n", + " \n", + "\n", + "\n", + " \n", + "\n", + "\n", + " \n", + "\n", + "\n", + " \n", + "\n", + "
Generalized Linear Model Regression Results
Dep. Variable: Frequency No. Observations: 23
Model: GLM Df Residuals: 21
Model Family: Binomial Df Model: 1
Link Function: logit Scale: 1.0000
Method: IRLS Log-Likelihood: -3.9210
Date: Tue, 26 May 2020 Deviance: 3.0144
Time: 10:19:52 Pearson chi2: 5.00
No. Iterations: 6 Covariance Type: nonrobust
\n", + "\n", + "\n", + " \n", + "\n", + "\n", + " \n", + "\n", + "\n", + " \n", + "\n", + "
coef std err z P>|z| [0.025 0.975]
Intercept 5.0850 7.477 0.680 0.496 -9.570 19.740
Temperature -0.1156 0.115 -1.004 0.316 -0.341 0.110
" + ], + "text/plain": [ + "\n", + "\"\"\"\n", + " Generalized Linear Model Regression Results \n", + "==============================================================================\n", + "Dep. Variable: Frequency No. Observations: 23\n", + "Model: GLM Df Residuals: 21\n", + "Model Family: Binomial Df Model: 1\n", + "Link Function: logit Scale: 1.0000\n", + "Method: IRLS Log-Likelihood: -3.9210\n", + "Date: Tue, 26 May 2020 Deviance: 3.0144\n", + "Time: 10:19:52 Pearson chi2: 5.00\n", + "No. Iterations: 6 Covariance Type: nonrobust\n", + "===============================================================================\n", + " coef std err z P>|z| [0.025 0.975]\n", + "-------------------------------------------------------------------------------\n", + "Intercept 5.0850 7.477 0.680 0.496 -9.570 19.740\n", + "Temperature -0.1156 0.115 -1.004 0.316 -0.341 0.110\n", + "===============================================================================\n", + "\"\"\"" + ] + }, + "execution_count": 9, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "import statsmodels.api as sm\n", + "\n", + "data[\"Success\"]=data.Count-data.Malfunction\n", + "data[\"Intercept\"]=1\n", + "\n", + "logmodel=sm.GLM(data['Frequency'], data[['Intercept','Temperature']], \n", + " family=sm.families.Binomial(sm.families.links.logit)).fit()\n", + "\n", + "logmodel.summary()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The maximum likelyhood estimator of the intercept and of Temperature are thus $\\hat{\\alpha}=5.0849$ and $\\hat{\\beta}=-0.1156$. This **corresponds** to the values from the article of Dalal *et al.* The standard errors are $s_{\\hat{\\alpha}} = 7.477$ and $s_{\\hat{\\beta}} = 0.115$, which is **different** from the $3.052$ and $0.04702$ reported by Dallal *et al.* The deviance is $3.01444$ with 21 degrees of freedom. 
I cannot find any value similar to the Goodness of fit ($G^2=18.086$) reported by Dalal *et al.* There seems to be something wrong. Oh I know, I haven't indicated that my observations are actually the result of 6 observations for each rocket launch. Let's indicate these weights (since the weights are always the same throughout all experiments, it does not change the estimates of the fit but it does influence the variance estimates)." + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "\n", + "\n", + "\n", + " \n", + "\n", + "\n", + " \n", + "\n", + "\n", + " \n", + "\n", + "\n", + " \n", + "\n", + "\n", + " \n", + "\n", + "\n", + " \n", + "\n", + "\n", + " \n", + "\n", + "\n", + " \n", + "\n", + "
Generalized Linear Model Regression Results
Dep. Variable: Frequency No. Observations: 23
Model: GLM Df Residuals: 21
Model Family: Binomial Df Model: 1
Link Function: logit Scale: 1.0000
Method: IRLS Log-Likelihood: -23.526
Date: Tue, 26 May 2020 Deviance: 18.086
Time: 10:19:58 Pearson chi2: 30.0
No. Iterations: 6 Covariance Type: nonrobust
\n", + "\n", + "\n", + " \n", + "\n", + "\n", + " \n", + "\n", + "\n", + " \n", + "\n", + "
coef std err z P>|z| [0.025 0.975]
Intercept 5.0850 3.052 1.666 0.096 -0.898 11.068
Temperature -0.1156 0.047 -2.458 0.014 -0.208 -0.023
" + ], + "text/plain": [ + "\n", + "\"\"\"\n", + " Generalized Linear Model Regression Results \n", + "==============================================================================\n", + "Dep. Variable: Frequency No. Observations: 23\n", + "Model: GLM Df Residuals: 21\n", + "Model Family: Binomial Df Model: 1\n", + "Link Function: logit Scale: 1.0000\n", + "Method: IRLS Log-Likelihood: -23.526\n", + "Date: Tue, 26 May 2020 Deviance: 18.086\n", + "Time: 10:19:58 Pearson chi2: 30.0\n", + "No. Iterations: 6 Covariance Type: nonrobust\n", + "===============================================================================\n", + " coef std err z P>|z| [0.025 0.975]\n", + "-------------------------------------------------------------------------------\n", + "Intercept 5.0850 3.052 1.666 0.096 -0.898 11.068\n", + "Temperature -0.1156 0.047 -2.458 0.014 -0.208 -0.023\n", + "===============================================================================\n", + "\"\"\"" + ] + }, + "execution_count": 10, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "logmodel=sm.GLM(data['Frequency'], data[['Intercept','Temperature']], \n", + " family=sm.families.Binomial(sm.families.links.logit),\n", + " var_weights=data['Count']).fit()\n", + "\n", + "logmodel.summary()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Good, now I have recovered the asymptotic standard errors $s_{\\hat{\\alpha}}=3.052$ and $s_{\\hat{\\beta}}=0.047$.\n", + "The Goodness of fit (Deviance) indicated for this model is $G^2=18.086$ with 21 degrees of freedom (Df Residuals).\n", + "\n", + "**I have therefore managed to fully replicate the results of the Dalal *et al.* article**." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Predicting failure probability\n", + "The temperature when launching the shuttle was 31°F. 
Let's try to estimate the failure probability for such a temperature using our model:" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [ + { + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "needs_background": "light" + }, + "output_type": "display_data" + } + ], + "source": [ + "%matplotlib inline\n", + "data_pred = pd.DataFrame({'Temperature': np.linspace(start=30, stop=90, num=121), 'Intercept': 1})\n", + "data_pred['Frequency'] = logmodel.predict(data_pred)\n", + "data_pred.plot(x=\"Temperature\",y=\"Frequency\",kind=\"line\",ylim=[0,1])\n", + "plt.scatter(x=data[\"Temperature\"],y=data[\"Frequency\"])\n", + "plt.grid(True)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "hideCode": false, + "hidePrompt": false, + "scrolled": true + }, + "source": [ + "This figure is very similar to the Figure 4 of Dalal *et al.* **I have managed to replicate the Figure 4 of the Dalal *et al.* article.**" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Computing and plotting uncertainty" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Following the documentation of [Seaborn](https://seaborn.pydata.org/generated/seaborn.regplot.html), I use regplot." + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "/opt/conda/lib/python3.6/site-packages/scipy/stats/stats.py:1713: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.\n", + " return np.add.reduce(sorted[indexer] * weights, axis=axis) / sumval\n" + ] + }, + { + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "sns.set(color_codes=True)\n", + "plt.xlim(30,90)\n", + "plt.ylim(0,1)\n", + "sns.regplot(x='Temperature', y='Frequency', data=data, logistic=True)\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**I think I have managed to correctly compute and plot the uncertainty of my prediction.** Although the shaded area seems very similar to [the one obtained by with R](https://app-learninglab.inria.fr/moocrr/gitlab/moocrr-session3/moocrr-reproducibility-study/tree/master/challenger.pdf), I can spot a few differences (e.g., the blue point for temperature 63 is outside)... Could this be a numerical error ? Or a difference in the statistical method ? It is not clear which one is \"right\"." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "celltoolbar": "Hide code", + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.6.4" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +}