Commit be6e3673 authored by rloic

Replicable analysis

parent 960c9f5d
{
"cells": [],
"cells": [
{
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"# Analyse reproductible : Paradoxe de Simpson\n",
"\n",
"## Description\n",
"\n",
"Dans cette analyse, nous nous intéressons à l'impact du tabagisme sur la durée de vie.\n",
"\n",
"### Description des données\n",
"\n",
"| Nom | Type | Description |\n",
"|-----|------|-------------|\n",
"| Smoker | Enum (Yes, No) | Indique si la personne est fumeuse ou non |\n",
"| Status | Enum (Alive, Dead) | Indique si la personne est en vie |\n",
"| Age | Float | Indique l'âge de la personne |"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Analyse\n",
"\n",
"### Chargement des librairies"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
},
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"import pandas as pd\n",
"import matplotlib.pyplot as plt\n",
"from sklearn.linear_model import LogisticRegression"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Chargement du jeu de données"
]
},
{
"cell_type": "code",
"execution_count": 48,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
},
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Smoker</th>\n",
" <th>Status</th>\n",
" <th>Age</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Yes</td>\n",
" <td>Alive</td>\n",
" <td>21.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Yes</td>\n",
" <td>Alive</td>\n",
" <td>19.3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>No</td>\n",
" <td>Dead</td>\n",
" <td>57.5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>No</td>\n",
" <td>Alive</td>\n",
" <td>47.1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Yes</td>\n",
" <td>Alive</td>\n",
" <td>81.4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1309</th>\n",
" <td>Yes</td>\n",
" <td>Alive</td>\n",
" <td>35.9</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1310</th>\n",
" <td>No</td>\n",
" <td>Alive</td>\n",
" <td>22.3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1311</th>\n",
" <td>Yes</td>\n",
" <td>Dead</td>\n",
" <td>62.1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1312</th>\n",
" <td>No</td>\n",
" <td>Dead</td>\n",
" <td>88.6</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1313</th>\n",
" <td>No</td>\n",
" <td>Alive</td>\n",
" <td>39.1</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>1314 rows × 3 columns</p>\n",
"</div>"
],
"text/plain": [
" Smoker Status Age\n",
"0 Yes Alive 21.0\n",
"1 Yes Alive 19.3\n",
"2 No Dead 57.5\n",
"3 No Alive 47.1\n",
"4 Yes Alive 81.4\n",
"... ... ... ...\n",
"1309 Yes Alive 35.9\n",
"1310 No Alive 22.3\n",
"1311 Yes Dead 62.1\n",
"1312 No Dead 88.6\n",
"1313 No Alive 39.1\n",
"\n",
"[1314 rows x 3 columns]"
]
},
"execution_count": 48,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"raw = pd.read_csv(\"https://gitlab.inria.fr/learninglab/mooc-rr/mooc-rr-ressources/-/raw/master/module3/Practical_session/Subject6_smoking.csv\")\n",
"raw"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Transformation du dataset\n",
"\n",
"Ici je transforme le data set pandas en list d'objets. Ce qui me rend le traitement plus simple par la suite.\n",
"\n",
"#### Définition des types"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
},
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"from enum import IntEnum\n",
"class Smoker(IntEnum):\n",
" No = 0\n",
" Yes = 1\n",
"\n",
"class Status(IntEnum):\n",
" Dead = 0\n",
" Alive = 1\n",
"\n",
"class Record:\n",
" def __init__(self, smoker: Smoker, status: Status, age: float):\n",
" self.smoker = smoker\n",
" self.status = status\n",
" self.age = age\n",
"\n",
" def __repr__(self):\n",
" return \"Record(smoker={}, alive={}, age={}\".format(self.smoker, self.status, self.age)\n",
" \n",
"def parse(d) -> Record:\n",
" if d[1] == 'Alive':\n",
" status = Status.Alive\n",
" else:\n",
" status = Status.Dead\n",
" if d[0] == 'Yes':\n",
" smoker = Smoker.Yes\n",
" else:\n",
" smoker = Smoker.No\n",
" age = float(d[2])\n",
" return Record(smoker, status, age)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Convertion des données"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[Record(smoker=1, alive=1, age=21.0,\n",
" Record(smoker=1, alive=1, age=19.3,\n",
" Record(smoker=0, alive=0, age=57.5,\n",
" Record(smoker=0, alive=1, age=47.1,\n",
" Record(smoker=1, alive=1, age=81.4,\n",
" Record(smoker=0, alive=1, age=36.8,\n",
" Record(smoker=0, alive=1, age=23.8,\n",
" Record(smoker=1, alive=0, age=57.5,\n",
" Record(smoker=1, alive=1, age=24.8,\n",
" Record(smoker=1, alive=1, age=49.5]"
]
},
"execution_count": 32,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = [parse(line[1]) for line in raw.iterrows()]\n",
"df[:10]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Definitions des features\n",
"\n",
"Ici, nous définissons comment sont convertis les différentes informations en valeur numérique."
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
},
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"def features(r: Record):\n",
" return [ int(r.smoker), r.age ]\n",
" \n",
"def label(r: Record) -> int:\n",
" return int(r.status)\n",
" \n",
"def smoker(r: Record) -> Smoker:\n",
" return r.smoker\n",
"\n",
"def is_smoker(r: Record) -> bool:\n",
" return r.smoker == Smoker.Yes\n",
"\n",
"def is_not_smoker(r: Record) -> bool:\n",
" return r.smoker == Smoker.No\n",
"\n",
"def status(r: Record) -> Status:\n",
" return r.status\n",
"\n",
"def is_alive(r: Record) -> bool:\n",
" return r.status == Status.Alive\n",
"\n",
"def age(r: Record) -> float:\n",
" return r.age"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Première Analyse\n",
"\n",
"#### Calcul du taux de survie des deux groupes\n",
"\n",
"Pour cela, on scinde le jeu de données en deux parties, le groupe Fumeur (Smoker.Yes) et le groupe non fumeur (Smoker.No).\n",
"Ensuite, on scinde le jeu de données en 4 parties :\n",
"1. Le groupe fumeur\n",
" - En vie\n",
" - Décédé\n",
"2. Le groupe non fumeur\n",
" - En vie\n",
" - Décédé"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
},
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"<table>\n",
" <tr>\n",
" <th></th>\n",
" <th>Alive</th>\n",
" <th>Dead</th>\n",
" <th>Ratio (deaths / total)</th>\n",
" </tr>\n",
" <tr>\n",
" <th>Smoker</th>\n",
" <td>443</td>\n",
" <td>139</td>\n",
" <td>0.23883161512027493</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Non-smoker</th>\n",
" <td>502</td>\n",
" <td>230</td>\n",
" <td>0.31420765027322406</td>\n",
" </tr>\n",
"</table>"
],
"text/plain": [
"<IPython.core.display.HTML object>"
]
},
"execution_count": 39,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"def group_by(list, key):\n",
" groups = {}\n",
" for element in list:\n",
" current_key = key(element)\n",
" if current_key not in groups:\n",
" groups[current_key] = []\n",
" groups[current_key].append(element)\n",
" return groups\n",
"\n",
"groups = group_by(df, smoker)\n",
"sub_groups = group_by(df, lambda r: (r.smoker, r.status))\n",
"\n",
"from IPython.core.display import HTML\n",
"HTML(\"\"\"\n",
"<table>\n",
" <tr>\n",
" <th></th>\n",
" <th>En vie</th>\n",
" <th>Décédé</th>\n",
" <th>Taux de mortalité</th>\n",
" </tr>\n",
" <tr>\n",
" <th>Fumeur</th>\n",
" <td>{smoker_alive}</td>\n",
" <td>{smoker_dead}</td>\n",
" <td>{smoker_ratio}</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Non Fumeur</th>\n",
" <td>{non_smoker_alive}</td>\n",
" <td>{non_smoker_dead}</td>\n",
" <td>{non_smoker_ratio}</td>\n",
" </tr>\n",
"</table>\"\"\".format(\n",
" smoker_alive = len(sub_groups[(Smoker.Yes, Status.Alive)]),\n",
" smoker_dead = len(sub_groups[(Smoker.Yes, Status.Dead)]),\n",
" smoker_ratio = len(sub_groups[(Smoker.Yes, Status.Dead)]) / len(groups[Smoker.Yes]),\n",
" non_smoker_alive = len(sub_groups[(Smoker.No, Status.Alive)]),\n",
" non_smoker_dead = len(sub_groups[(Smoker.No, Status.Dead)]),\n",
" non_smoker_ratio = len(sub_groups[(Smoker.No, Status.Dead)]) / len(groups[Smoker.No]))\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Ce premier résultat indique que le groupe de fumeur à un taux de mortalité moins élevé que le groupe de non fumeurs. Mais il faut aller plus loin dans l'analyse.\n",
"Nous commençont par représenter la distribution des deux groupes."
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
},
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 864x576 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"smokers = list(filter(is_smoker, df))\n",
"non_smokers = list(filter(is_not_smoker, df))\n",
"\n",
"plt.figure(figsize=(12,8))\n",
"plt.hist([list(map(age, smokers)), list(map(age, non_smokers))])\n",
"plt.legend([\"Fumeurs\", \"Non Fumeurs\"])\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"On s'aperçoit que la distribution des données est différente dans les deux groupes. La population des non fumeurs est globalement plus âgée (ci-dessous)."
]
},
{
"cell_type": "code",
"execution_count": 45,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(44.26975945017182, 49.815846994535534)"
]
},
"execution_count": 45,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"def mean(arr):\n",
" sum = 0\n",
" for el in arr:\n",
" sum += el\n",
" return sum / len(arr)\n",
"\n",
"( mean([ age(r) for r in smokers ]), mean([ age(r) for r in non_smokers ]) )"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"On calcule alors la regression logistique pour chacun des deux groupes."
]
},
{
"cell_type": "code",
"execution_count": 46,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
},
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 864x576 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"X_smokers = [ [x.age] for x in smokers ]\n",
"Y_smokers = [ label(x) for x in smokers ]\n",
"smokers_model_logit = LogisticRegression(penalty='l2',solver='newton-cg')\n",
"smokers_model_logit.fit(X_smokers, Y_smokers)\n",
"smokers_P = [p[1] for p in smokers_model_logit.predict_proba(X_smokers)]\n",
"\n",
"X_non_smokers = [ [x.age] for x in non_smokers ]\n",
"Y_non_smokers = [ label(x) for x in non_smokers]\n",
"non_smokers_model_logit = LogisticRegression(penalty='l2',solver='newton-cg')\n",
"non_smokers_model_logit.fit(X_non_smokers, Y_non_smokers)\n",
"non_smokers_P = [p[1] for p in non_smokers_model_logit.predict_proba(X_non_smokers)]\n",
"\n",
"plt.figure(figsize=(12,8))\n",
"plt.scatter(list(map(age, smokers)), smokers_P)\n",
"plt.scatter(list(map(age, non_smokers)), non_smokers_P)\n",
"plt.legend([\"Fumeur\", \"Non Fumeur\"])\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"On voit sur les courbes ci dessus que le groupe non fumeur à une probabilité plus élevée jeune alors que le groupe fumeur à une probabilié plus élevée en fin de vie.\n",
"Je ne suis pas arrivé à calculer l'erreur standard de la regression, mais on voit clairement que le faible nombre de non fumeur au dela de 65 ans à un impact sur les résultats.\n",
"Pour palier à se problème, j'effectue une regression, non seuleument sur l'âge mais j'introduis également le status de fumeur dans les coordonnées."
]
},
{
"cell_type": "code",
"execution_count": 47,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
},
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 864x576 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"global_model_logit = LogisticRegression(penalty='l2',solver='newton-cg')\n",
"X = [ features(x) for x in df ]\n",
"Y = [ label(x) for x in df ]\n",
"\n",
"global_model_logit.fit(X, Y)\n",
"\n",
"P_smokers = [ p[1] for p in global_model_logit.predict_proba([ features(x) for x in smokers ]) ]\n",
"P_non_smokers = [ p[1] for p in global_model_logit.predict_proba([ features(x) for x in non_smokers ]) ]\n",
"plt.figure(figsize=(12,8))\n",
"plt.scatter(list(map(age, smokers)), P_smokers, c=\"#0078ba\")\n",
"plt.scatter(list(map(age, non_smokers)), P_non_smokers, c=\"#ff7f0e\")\n",
"plt.legend([\"Fumeur\", \"Non Fumeur\"])\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Le graphique ci-dessus représente la regression en tenant compte de l'âge de la personne et de son status de fumeur. On voit ici que le fait de fumer à un impact notable sur la santé, la courbe de survie étant en dessous de celle des non fumeurs."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
......@@ -16,10 +653,9 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.3"
"version": "3.7.6"
}
},
"nbformat": 4,
"nbformat_minor": 2
"nbformat_minor": 4
}
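A note on the first analysis in this notebook: the reversal it describes (lower overall mortality for smokers, yet a worse outcome once age is taken into account) can also be seen without any regression, by comparing mortality rates within age bands. The following is only a minimal sketch under stated assumptions: the records are made-up toy tuples chosen to exhibit the effect (they are not the notebook's dataset), the 65-year cut-off is borrowed from the notebook's remark about that age, and `make`, `age_band` and `mortality_by` are hypothetical helper names.

```python
from collections import defaultdict

def make(smoker, age, n_alive, n_dead):
    """Build hypothetical (smoker, alive, age) tuples; not real data."""
    return [(smoker, True, age)] * n_alive + [(smoker, False, age)] * n_dead

# Made-up toy records chosen to exhibit the paradox (NOT the notebook's dataset).
records = (
    make(True, 30.0, 6, 2) + make(True, 70.0, 0, 2)
    + make(False, 30.0, 2, 0) + make(False, 70.0, 2, 6)
)

def age_band(age, limit=65.0):
    """Label an age as below or above an assumed cut-off."""
    return "65+" if age >= limit else "<65"

def mortality_by(records, key):
    """Return {group: deaths / total} for the groups produced by key(record)."""
    deaths = defaultdict(int)
    totals = defaultdict(int)
    for rec in records:
        group = key(rec)
        totals[group] += 1
        if not rec[1]:  # rec[1] is the alive flag
            deaths[group] += 1
    return {group: deaths[group] / totals[group] for group in totals}

# Mortality per smoking status, ignoring age
overall = mortality_by(records, key=lambda r: r[0])
# Mortality per (smoking status, age band)
stratified = mortality_by(records, key=lambda r: (r[0], age_band(r[2])))

for group, rate in sorted(stratified.items()):
    print(group, round(rate, 2))
```

In this toy sample the smokers have the lower overall rate simply because most of them sit in the younger band, while within each band their rate is the higher one. Applying the same grouping to the notebook's `df` list would mean replacing the tuple indexing with the `Record` attributes.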