{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Subject 6: Around Simpson's Paradox"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Study context"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This study looks at [Simpson's paradox](https://fr.wikipedia.org/wiki/Paradoxe_de_Simpson) (Simpson 1951, Yule 1903). It is a statistical paradox \"in which a phenomenon observed in several groups seems to reverse when the groups are combined. This result, which seems impossible at first sight, is linked to elements that are not taken into account (such as non-independent variables or differences in group sizes) and is often encountered in practice, particularly in the social sciences and medical statistics\" (Wikipedia).\n",
"\n",
"To illustrate the paradox, we use data from a survey carried out in the 1970s on one sixth of the electorate of a town in north-east England, complemented by a follow-up study of the same people 20 years later (Vanderpump et al. 1995). The initial survey was conducted to study thyroid and heart disease (Tunbridge et al. 1977). The follow-up aimed to determine whether the participants were still alive, in particular in relation to their smoking habits.\n",
"\n",
"For this MOOC: \"We restrict ourselves to women, and among them to the 1314 who were categorised as \"currently smoking\" or \"never having smoked\". Relatively few women in the initial survey had smoked and since quit (162), and for very few the information was unavailable (18). Survival at 20 years was determined for all the women in the first survey\" (MOOC Recherche Reproductible)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Importing the Python libraries"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"import urllib.request\n",
"import pandas as pd\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"import seaborn as sns\n",
"import statsmodels.api as sm\n",
"from statsmodels.formula.api import logit\n",
"%matplotlib inline\n",
"\n",
"# Suppress warnings, including the deprecation notices raised by some functions\n",
"import warnings\n",
"warnings.simplefilter('ignore')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Data processing\n",
"\n",
"The data are available on the GitLab of the reproducible research MOOC. For accessibility, and to guard against the link disappearing or changing, we keep a local copy of the downloaded data. The file is downloaded only if the local copy does not exist.\n"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"data_url = 'https://gitlab.inria.fr/learninglab/mooc-rr/mooc-rr-ressources/-/raw/master/module3/Practical_session/Subject6_smoking.csv?inline=false'\n",
"data_file = 'simpson_paradox.csv'\n",
"\n",
"if not os.path.exists(data_file):\n",
" urllib.request.urlretrieve(data_url, data_file)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Each row of the data describes one person, with the following information:\n",
"- Whether the person smokes (Yes/No)\n",
"- Whether she was alive or dead at the time of the second study (Alive/Dead)\n",
"- Her age at the first survey (rounded to one decimal place)"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" Smoker Status Age\n",
"0 Yes Alive 21.0\n",
"1 Yes Alive 19.3\n",
"2 No Dead 57.5\n",
"3 No Alive 47.1\n",
"4 Yes Alive 81.4\n",
"5 No Alive 36.8\n",
"6 No Alive 23.8\n",
"7 Yes Dead 57.5\n",
"8 Yes Alive 24.8\n",
"9 Yes Alive 49.5\n",
"10 Yes Alive 30.0\n",
"11 No Dead 66.0\n",
"12 Yes Alive 49.2\n",
"13 No Alive 58.4\n",
"14 No Dead 60.6\n",
"15 No Alive 25.1\n",
"16 No Alive 43.5\n",
"17 No Alive 27.1\n",
"18 No Alive 58.3\n",
"19 Yes Alive 65.7\n",
"20 No Dead 73.2\n",
"21 Yes Alive 38.3\n",
"22 No Alive 33.4\n",
"23 Yes Dead 62.3\n",
"24 No Alive 18.0\n",
"25 No Alive 56.2\n",
"26 Yes Alive 59.2\n",
"27 No Alive 25.8\n",
"28 No Dead 36.9\n",
"29 No Alive 20.2\n",
"... ... ... ...\n",
"1284 Yes Dead 36.0\n",
"1285 Yes Alive 48.3\n",
"1286 No Alive 63.1\n",
"1287 No Alive 60.8\n",
"1288 Yes Dead 39.3\n",
"1289 No Alive 36.7\n",
"1290 No Alive 63.8\n",
"1291 No Dead 71.3\n",
"1292 No Alive 57.7\n",
"1293 No Alive 63.2\n",
"1294 No Alive 46.6\n",
"1295 Yes Dead 82.4\n",
"1296 Yes Alive 38.3\n",
"1297 Yes Alive 32.7\n",
"1298 No Alive 39.7\n",
"1299 Yes Dead 60.0\n",
"1300 No Dead 71.0\n",
"1301 No Alive 20.5\n",
"1302 No Alive 44.4\n",
"1303 Yes Alive 31.2\n",
"1304 Yes Alive 47.8\n",
"1305 Yes Alive 60.9\n",
"1306 No Dead 61.4\n",
"1307 Yes Alive 43.0\n",
"1308 No Alive 42.1\n",
"1309 Yes Alive 35.9\n",
"1310 No Alive 22.3\n",
"1311 Yes Dead 62.1\n",
"1312 No Dead 88.6\n",
"1313 No Alive 39.1\n",
"\n",
"[1314 rows x 3 columns]"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data = pd.read_csv(data_file)\n",
"data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We check that every row is complete and that the ages are plausible."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Empty DataFrame\n",
"Columns: [Smoker, Status, Age]\n",
"Index: []"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data[data.isnull().any(axis=1)]"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Minimum and maximum ages: [18.0, 89.9]\n"
]
}
],
"source": [
"print('Minimum and maximum ages: ' + str([data.Age.min(), data.Age.max()]))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Analyses\n",
"\n",
"### Deaths by smoking habit\n",
"\n",
"The counts of women dead or alive by smoking status, and the resulting mortality rate, are summarised below."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
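],
"source": [
"# Alive/Dead counts by smoking status plus the mortality rate\n",
"# (assumed construction of data_death, which the plot below relies on)\n",
"data_death = pd.crosstab(data['Smoker'], data['Status'])\n",
"data_death['Mortality'] = data_death['Dead'] / (data_death['Alive'] + data_death['Dead'])\n",
"\n",
"x = np.arange(2) # the label locations\n",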
"width = 0.35 # the width of the bars\n",
"\n",
"fig, ax = plt.subplots()\n",
"ax.bar(x - width/2, data_death['Alive'], width, label='Alive')\n",
"ax.bar(x + width/2, data_death['Dead'], width, label='Dead')\n",
"ax2 = ax.twinx()\n",
"ax2.plot(x, data_death['Mortality'], color='r', marker='o', label='Mortality')\n",
"\n",
"ax.set_ylabel('Number of women')\n",
"ax2.set_ylabel('Mortality rate')\n",
"ax2.set_ylim(0,1)\n",
"ax.set_xticks(x)\n",
"ax.set_xticklabels(['Non Smoker', 'Smoker'])\n",
"ax.legend()\n",
"ax2.legend(bbox_to_anchor=(0.8, 1))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"From these plots and results, it would seem logical to conclude that non-smokers have a higher mortality (31%) than smokers (24%), and therefore that smoking helps one live longer. Even the confidence intervals on the outcome (dead **1** or alive **0**) by smoking status suggest that smokers have a better chance of survival."
]
},
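{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a minimal sketch of how such a confidence interval can be obtained, the snippet below computes a mortality rate and its 95% normal-approximation interval from a pair of counts. The function name `mortality_with_ci` and the counts in the example are hypothetical and only illustrate the arithmetic; they are not the survey's figures.\n",
"\n",
"```python\n",
"import math\n",
"\n",
"def mortality_with_ci(dead, total, z=1.96):\n",
"    # Proportion of deaths and its normal-approximation (Wald) interval\n",
"    rate = dead / total\n",
"    half_width = z * math.sqrt(rate * (1 - rate) / total)\n",
"    return rate, rate - half_width, rate + half_width\n",
"\n",
"# Hypothetical counts, chosen only to illustrate the calculation\n",
"rate, low, high = mortality_with_ci(dead=30, total=100)\n",
"print(round(rate, 3), round(low, 3), round(high, 3))\n",
"```\n",
"\n",
"seaborn's `barplot` estimates its error bars by bootstrapping rather than with this formula, so its intervals may differ slightly."
]
},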
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
""
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAYwAAAEKCAYAAAAB0GKPAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMi4zLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvIxREBQAAE+VJREFUeJzt3X+MXWd95/H3h/G62XWT0uIpYeO4CcVq1moTatxQSJYStURxu7sGsiJhUUMJyGstJmK1qRupWihFZQX9sQs01OulpqRSCEWtJbd140AQZFGI8Lib2nGEWeMGMjFubEJJQrNxTL77xz2j3Ewmnmccn7kT5v2SRvec58ed71jWfOacc89zUlVIkjSbF4y6AEnS84OBIUlqYmBIkpoYGJKkJgaGJKmJgSFJamJgSJKaGBiSpCYGhiSpyZJRF3A6LV++vM4777xRlyFJzxt79uw5VlXjLWN/oALjvPPOY2JiYtRlSNLzRpJvtI71lJQkqYmBIUlqYmBIkpr0GhhJrkhyIMnBJDfM0L8+yd4kdyeZSHLpUN99SfZN9fVZpyRpdr1d9E4yBtwIvA6YBHYn2VFV9w4Nux3YUVWV5ELgz4ALhvovq6pjfdUoSWrX5xHGxcDBqjpUVceBW4D1wwOq6tF66glOywCf5iRJC1SfgXEOcP/Q/mTX9jRJ3pDkq8BfA9cOdRVwW5I9STb0WKckqUGfgZEZ2p5xBFFV26vqAuD1wPuHui6pqjXAOuCdSV4z4zdJNnTXPyaOHj16OuqWJM2gz8CYBM4d2l8BHH62wVV1B/CTSZZ3+4e71weB7QxOcc00b2tVra2qtePjTTcrSnoe27x5M9dccw2bN28edSmLTp+BsRtYleT8JEuBq4EdwwOSvCxJuu01wFLg20mWJTmza18GXA7c02Otkp4njhw5wgMPPMCRI0dGXcqi09unpKrqRJJNwC5gDNhWVfuTbOz6twBXAtckeQJ4DLiq+8TUi4HtXZYsAW6uqlv7qlWSNLte15Kqqp3AzmltW4a2Pwh8cIZ5h4CL+qxNkjQ33uktSWpiYEiSmhgYkqQmBoYkqYmBIUlqYmBIkpoYGJKkJgaGJKmJgSFJamJgSJKaGBiSpCYGhiSpiYEhSWpiYEiSmhgYkqQmBoYkqUmvD1CSdPp887d/ZtQlLAgnHvoxYAknHvqG/ybAyvfsm7fv5RGGJKmJgSFJamJgSJKaGBiSpCa9BkaSK5IcSHIwyQ0z9K9PsjfJ3UkmklzaOleSNL96C4wkY8CNwDpgNfDmJKunDbsduKiqXg5cC3x8DnMlSfOozyOMi4GDVXWoqo4DtwDrhwdU1aNVVd3uMqBa50qS5lefgXEOcP/Q/mTX9jRJ3pDkq8BfMzjKaJ4rSZo/fQZGZmirZzRUba+qC4DXA++fy1yAJBu66x8TR48ePeViJUkn12dgTALnDu2vAA4/2+CqugP4ySTL5zK3qrZW1dqqWjs+Pv7cq5YkzajPwNgNrEpyfpKlwNXAjuEBSV6WJN32GmAp8O2WuZKk+dXbWlJVdSLJJmAXMAZsq6r9STZ2/VuAK4FrkjwBPAZc1V0En3FuX7VKkmbX6+KDVbUT2DmtbcvQ9geBD7bOlSSNjnd6S5KaGBiSpCYGhiSpiYEhSWpiYEiSmhgYkqQmBoYkqUmv92FI0um2/IwngRPdq+aTgSHpeeX6C/9x1CUsWp6SkiQ1MTAkSU08JaVn2Lx5M0eOHOHss8/mQx/60KjLkbRAGBh6hiNHjvDAAw+MugxJC4ynpCRJTQwMSVITA0OS1MTAkCQ1MTAkSU0MDElSEwNDktTEwJAkNek1MJJckeRAkoNJbpih/y1J9nZfdya5aKjvviT7ktydZKLPOiVJs+vtTu8kY8CNwOuASWB3kh1Vde/QsL8HfqGqvpNkHbAVeOVQ/2VVdayvGiVJ7fo8wrgYOFhVh6rqOHALsH54QFXdWV
Xf6XbvAlb0WI8k6TnoMzDOAe4f2p/s2p7N24G/Gdov4LYke5Js6KE+SdIc9Ln4YGZoqxkHJpcxCIxLh5ovqarDSX4c+GySr1bVHTPM3QBsAFi5cuVzr1qSNKM+jzAmgXOH9lcAh6cPSnIh8HFgfVV9e6q9qg53rw8C2xmc4nqGqtpaVWurau34+PhpLF+SNKzPwNgNrEpyfpKlwNXAjuEBSVYCfwH8alV9bah9WZIzp7aBy4F7eqxVkjSL3k5JVdWJJJuAXcAYsK2q9ifZ2PVvAd4DvAj4WBKAE1W1FngxsL1rWwLcXFW39lWrJGl2vT5Aqap2AjuntW0Z2n4H8I4Z5h0CLpreLkkaHe/0liQ18RGtQ17x6zeNuoQF4cxjjzAGfPPYI/6bAHt+95pRlyAtCB5hSJKaGBiSpCYGhiSpiYEhSWpiYEiSmhgYkqQmBoYkqYmBIUlqYmBIkpoYGJKkJgaGJKmJgSFJamJgSJKaGBiSpCYGhiSpiYEhSWpiYEiSmhgYkqQmBoYkqcmcAyPJjya5sHHsFUkOJDmY5IYZ+t+SZG/3dWeSi1rnSpLmV1NgJPlCkrOS/Bjwd8AnkvzBLHPGgBuBdcBq4M1JVk8b9vfAL1TVhcD7ga1zmCtJmketRxg/UlUPA28EPlFVrwB+aZY5FwMHq+pQVR0HbgHWDw+oqjur6jvd7l3Aita56s+TS5fx/R86iyeXLht1KZIWkCWt45K8BHgT8JuNc84B7h/anwReeZLxbwf+Zq5zk2wANgCsXLmysTSdzPdWXT7qEiQtQK1HGL8N7GLwV//uJC8F/u8sczJDW804MLmMQWD8xlznVtXWqlpbVWvHx8dnKUmSdKqajjCq6jPAZ4b2DwFXzjJtEjh3aH8FcHj6oO4C+seBdVX17bnMlSTNn6bASPIJZvgLv6quPcm03cCqJOcDDwBXA/9h2vuuBP4C+NWq+tpc5kqS5lfrNYy/Gto+A3gDs/zFX1UnkmxicCprDNhWVfuTbOz6twDvAV4EfCwJwInu9NKMc+fwc0mSTrPWU1J/Pryf5FPA5xrm7QR2TmvbMrT9DuAdrXMlSaNzqnd6rwL8SJIkLSKt1zAe4enXMI7w1CeaJEmLQOspqTP7LkSStLC1Lg1ye0ubJOkH10mPMJKcAfwLYHmSH+WpG+rOAv5lz7VJkhaQ2U5J/Ufg3QzCYQ9PBcbDDBYHlCQtEicNjKr6MPDhJO+qqo/OU02SpAWo9aL3R5P8NIOlxs8Yar+pr8IkSQtL68dq3wu8lkFg7GTwnIovAQaGJC0SrTfu/XvgF4EjVfU24CLgh3qrSpK04LQGxmNV9SRwIslZwIPAS/srS5K00LQuPjiR5IXA/2LwaalHga/0VpUkacFpvej9n7rNLUluBc6qqr39lSVJWmjmfKd3Vd1XVXu901uSFhfv9JYkNZnrnd5THsE7vSVpUZntlNSdwKuB66vqpcD7gHuALwI391ybJGkBmS0w/ifweHen92uA/wZ8EvgusLXv4iRJC8dsp6TGquqhbvsqYGv3uNY/T3J3v6VJkhaS2Y4wxpJMhcovAp8f6mu9h0OS9ANgtl/6nwK+mOQY8BjwvwGSvIzBaSlJ0iJx0iOMqvod4L8AfwJcWlVTz/V+AfCu2d48yRVJDiQ5mOSGGfovSPLlJI8nuX5a331J9iW5O8lE6w8kSerHrKeVququGdq+Ntu8JGMMPnr7OmAS2J1kR1XdOzTsIeA64PXP8jaXVdWx2b6XJKl/rYsPnoqLgYNVdaiqjgO3AOuHB1TVg1W1G3iixzokSadBn4FxDnD/0P5k19aqgNuS7Emy4dkGJdmQZCLJxNGjR0+xVEnSbPoMjMzQVjO0PZtLqmoNg4c1vbO7D+SZb1i1tarWVtXa8fHxU6lTktSgz8CYBM4d2l8BHG6dXFWHu9cHge0MTnFJkkakz8DYDaxKcn6SpcDVwI6WiUmWJTlzahu4nMGSJJKkEent5ruqOpFkE7
ALGAO2VdX+JBu7/i1JzgYmGKx++2SSdzN4bvhyYHuSqRpvrqpb+6pVkjS7Xu/WrqqdwM5pbVuGto8wOFU13cMMnhsuSVog+jwlJUn6AWJgSJKaGBiSpCYGhiSpiYEhSWpiYEiSmhgYkqQmBoYkqYmBIUlqYmBIkpoYGJKkJgaGJKmJgSFJamJgSJKaGBiSpCYGhiSpiYEhSWpiYEiSmhgYkqQmBoYkqUmvgZHkiiQHkhxMcsMM/Rck+XKSx5NcP5e5kqT51VtgJBkDbgTWAauBNydZPW3YQ8B1wO+dwlxJ0jzq8wjjYuBgVR2qquPALcD64QFV9WBV7QaemOtcSdL86jMwzgHuH9qf7Nr6nitJ6kGfgZEZ2up0z02yIclEkomjR482FydJmps+A2MSOHdofwVw+HTPraqtVbW2qtaOj4+fUqGSpNn1GRi7gVVJzk+yFLga2DEPcyVJPVjS1xtX1Ykkm4BdwBiwrar2J9nY9W9JcjYwAZwFPJnk3cDqqnp4prl91SpJml1vgQFQVTuBndPatgxtH2FwuqlpriRpdLzTW5LUxMCQJDUxMCRJTQwMSVITA0OS1MTAkCQ1MTAkSU0MDElSEwNDktTEwJAkNTEwJElNDAxJUhMDQ5LUxMCQJDUxMCRJTQwMSVITA0OS1MTAkCQ1MTAkSU0MDElSEwNDktSk18BIckWSA0kOJrlhhv4k+UjXvzfJmqG++5LsS3J3kok+65QkzW5JX2+cZAy4EXgdMAnsTrKjqu4dGrYOWNV9vRL4o+51ymVVdayvGiVJ7fo8wrgYOFhVh6rqOHALsH7amPXATTVwF/DCJC/psSZJ0inqMzDOAe4f2p/s2lrHFHBbkj1JNvRWpSSpSW+npIDM0FZzGHNJVR1O8uPAZ5N8tarueMY3GYTJBoCVK1c+l3olSSfR5xHGJHDu0P4K4HDrmKqaen0Q2M7gFNczVNXWqlpbVWvHx8dPU+mSpOn6DIzdwKok5ydZClwN7Jg2ZgdwTfdpqZ8HvltV30qyLMmZAEmWAZcD9/RYqyRpFr2dkqqqE0k2AbuAMWBbVe1PsrHr3wLsBH4ZOAj8E/C2bvqLge1Jpmq8uapu7atWSdLs+ryGQVXtZBAKw21bhrYLeOcM8w4BF/VZmyRpbrzTW5LUxMCQJDUxMCRJTQwMSVITA0OS1MTAkCQ1MTAkSU0MDElSEwNDktTEwJAkNTEwJElNDAxJUhMDQ5LUxMCQJDUxMCRJTQwMSVITA0OS1MTAkCQ1MTAkSU0MDElSEwNDktSk18BIckWSA0kOJrlhhv4k+UjXvzfJmta5kqT51VtgJBkDbgTWAauBNydZPW3YOmBV97UB+KM5zJUkzaM+jzAuBg5W1aGqOg7cAqyfNmY9cFMN3AW8MMlLGudKkuZRn4FxDnD/0P5k19YypmWuJGkeLenxvTNDWzWOaZk7eINkA4PTWQCPJjnQXKFOZjlwbNRFLAT5vbeOugQ9k/8/p7x3pl+Xc/ITrQP7DIxJ4Nyh/RXA4cYxSxvmAlBVW4Gtz7VYPV2SiapaO+o6pJn4/3M0+jwltRtYleT8JEuBq4Ed08bsAK7pPi3188B3q+pbjXMlSfOotyOMqjqRZBOwCxgDtlXV/iQbu/4twE7gl4GDwD8BbzvZ3L5qlSTNLlUzXhrQIpdkQ3e6T1pw/P85GgaGJKmJS4NIkpoYGItU90GDLyVZN9T2piS3jrIuaViSSvL7Q/vXJ/mtEZa0qBkYi1QNzkVuBP4gyRlJlgG/A7xztJVJT/M48MYky0ddiAyMRa2q7gH+EvgN4L0Mlmn5epK3JvlKkruTfCzJC5IsSfKnSfYluSfJdaOtXovECQb3Wf3n6R1JfiLJ7d3CpbcnWTn/5S0ufd64p+eH9wF/CxwH1ib5aeANwKu7jzdvZXAfzNeB5VX1MwBJXjiqgrXo3AjsTfKhae1/yOCPnE8muRb4CPD6ea9uETEwFrmq+l6STwOPVtXjSX4J+D
lgIgnAP2ewrtcu4KeSfJjB/TO3japmLS5V9XCSm4DrgMeGul4FvLHb/lNgeqDoNDMwBPBk9wWDdby2VdV/nT4oyYUMlpy/DriSp9bwkvr2PxgcCX/iJGO8R6BnXsPQdJ8D3jR1kTHJi5KsTDLO4L6dzzC43rHmZG8inU5V9RDwZ8Dbh5rvZHC6FOAtwJfmu67FxiMMPU1V7UvyPuBzSV4APMHg01TfB/44g/NUxeBCuTSffh/YNLR/HbAtya8DR+mWFlJ/vNNbktTEU1KSpCYGhiSpiYEhSWpiYEiSmhgYkqQmBoY0iyS/mWR/t2bR3Ule+Rzf77VJ/up01SfNF+/DkE4iyauAfwOs6ZZOWQ4sHWE9S6rqxKi+vxY3jzCkk3sJcKyqHgeoqmNVdTjJfUk+kOTLSSaSrEmyK8nXp55b3z1z5He71X33Jblq+psn+bkk/yfJS5MsS7Itye6ubX035teSfCbJX+IaXhohjzCkk7sNeE+SrzFYNuXTVfXFru/+qnpVkv8O/AlwCXAGsB/YwmBhvJcDFwHLgd1J7ph64ySvBj4KrK+qbyb5APD5qrq2Ww34K0k+1w1/FXBht0SGNBIGhnQSVfVoklcA/xq4DPh0khu67h3d6z7gh6vqEeCRJP+v+4V/KfCpqvo+8A9JvshgJeCHgX/F4DkPl1fV4e59Lgf+XZLru/0zgKlnPHzWsNCoGRjSLLpf+F8AvpBkH/DWruvx7vXJoe2p/SUMVv59Nt9iEAg/C0wFRoArq+rA8MDuIvv3nsOPIJ0WXsOQTiLJTyVZNdT0cuAbjdPvAK5KMtat9vsa4Ctd3z8CvwJ8IMlru7ZdwLu6BR5J8rPPtX7pdDIwpJP7YeCTSe5NshdYDfxW49ztwF7g74DPA5ur6shUZ1X9A/BvgRu7o4j3A/+MwdPl7un2pQXD1WolSU08wpAkNTEwJElNDAxJUhMDQ5LUxMCQJDUxMCRJTQwMSVITA0OS1OT/A93m54EwE3MoAAAAAElFTkSuQmCC\n",
"text/plain": [
"
"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"sns.barplot(x='Smoker', y='Status', ci=95, data=data.replace('Alive', 0).replace('Dead', 1))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Yet it is common knowledge that \"smoking kills\". **So how are the data misleading us?** We looked at the data as a whole without going into detail. Looking at the women's ages by smoking status, a paradox starts to appear:"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
""
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAYIAAAEKCAYAAAAfGVI8AAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMi4zLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvIxREBQAADzVJREFUeJzt3XuQXnV9x/H3x0QHB+0IZgmpiGk7kcp4AV0viG1VhMF6CUK9TS/bykzGGS1qq2naTr116jixWnuhtmmlLtRacFqGyFgxRhFtrbAoEigi4igK2WSBosC0aMi3fzwn7RoTdlM5z9nk937NZM5zznP7wmT2nXPO85xNVSFJatdDhh5AkjQsQyBJjTMEktQ4QyBJjTMEktQ4QyBJjTMEktQ4QyBJjTMEktS45UMPsBgrVqyo1atXDz2GJB1Urr766turamKhxx0UIVi9ejUzMzNDjyFJB5Uk31rM4zw0JEmNMwSS1DhDIEmNMwSS1DhDIEmNMwSS1DhDIEmNMwSS1LiD4gtlkg5969evZ3Z2lqOPPpqNGzcOPU5TDIGkJWF2dpZbb7116DGa1GsIknwTuBu4H9hVVZNJjgQuBFYD3wReUVX/2ecckqT9G8c5gudV1QlVNdmtbwC2VtUaYGu3LkkayBAni9cC093taeCMAWaQJHX6DkEBn0xydZJ13baVVbUdoFse1fMMkqQH0PfJ4pOr6rYkRwFbknx1sU/swrEO4Nhjj+1rPklqXq8hqKrbuuXOJBcDzwB2JFlVVduTrAJ27ue5m4BNAJOTk9XnnNKQbnnnk4YeYUnYdeeRwHJ23fkt/58Ax75129jeq7dDQ0kOT/LIPbeB04DrgM3AVPewKeCSvmaQJC2szz2ClcDFSfa8zz9U1SeSXAVclORs4Bbg5T3OIElaQG8hqKpvAE/Zx/Y7gFP6el9J0oHxWkOS1DhDIEmNMwSS1DgvOidpSVhx2G5gV7fUOBkCSUvCm59819AjNMtDQ5LUOEMgSY0zBJLUOEMgSY0zBJLUOEMgSY0zBJLUOL9H0Jj169czOzvL0UcfzcaNG4ceR9ISYAgaMzs7y6233jr0GJKWEA8NSVLjDIEkNc4QSFLjDIEkNa6Zk8VPe8v5Q4+wJDzy9rtZBtxy+93+PwGufs+vDT2CNDj3CCSpcYZAkhpnCCSpcYZAkhpnCCSpcc18akgjux92+A8tJckQNObeNacNPYKkJcZDQ5LUOEMgSY0zBJLUOEMgSY0zBJLUOEMgSY3rPQRJliX5cpJLu/Ujk2xJclO3PKLvGSRJ+zeOPYI3ADfMW98AbK2qNcDWbl2SNJBeQ5DkGOBFwN/O27wWmO5uTwNn9DmDJOmB9b1H8H5gPbB73raVVbUdoFseta8nJlmXZCbJzNzcXM9jSlK7egtBkhcDO6vq6v/P86tqU1VNVtXkxMTEgzydJGmPPq81dDLw0iS/CBwG/ESSvwd2JFlVVduTrAJ29jiDJGkBve0RVNXvVtUxVbUaeBXw6ar6FWAzMNU9bAq4pK8ZJEkLG+J7BO8GTk1yE3Bqty5JGshYLkNdVZcDl3e37wBOGcf7SpIW5jeLJalxhkCSGmcIJKlxhkCSGmcIJKlxhkCSGmcIJKlxhkCSGmcIJKlxhkCSGmcIJKlxhkCSGmcIJKlxhkCSGmcIJKlxhkCSGmcIJKlxhkCSGmcIJKlxhkCSGmcIJKlxhkCSGmcIJKlxhkCSGmcIJKlxhkCSGmcIJKlxhkCSGmcIJKlxhkCSGmcIJKlxvYUgyWFJrkzylSTXJ3lHt/3IJFuS3NQtj+hrBknSwvrcI7gPeH5VPQU4ATg9ybOADcDWqloDbO3WJUkD6S0ENXJPt/rQ7k8Ba4Hpbvs0cEZfM0iSFtbrOYIky5JcA+wEtlTVF4GVVbUdoFse1ecMkqQH1msIqur+qjoBOAZ4RpInLva5SdYlmUkyMzc319+QktS4sX
xqqKruAi4HTgd2JFkF0C137uc5m6pqsqomJyYmxjGmJDWpz08NTSR5VHf74cALgK8Cm4Gp7mFTwCV9zSBJWtjyHl97FTCdZBmj4FxUVZcm+QJwUZKzgVuAl/c4gyRpAb2FoKquBU7cx/Y7gFP6el9J0oFZ8NBQkpVJPpjkX7r147t/zUuSDgGLOUfwIeAy4Ce79a8Bb+xrIEnSeC0mBCuq6iJgN0BV7QLu73UqSdLYLCYE9yZ5NKNvBdNdJuK7vU4lSRqbxZws/i1GH/n8mST/CkwAv9TrVJKksVkwBFX1pSS/ABwHBLixqn7Q+2SSpLFYMARJztxr0+OTfBfYVlX7/FawJOngsZhDQ2cDJwGf6dafC/w7oyC8s6ou6Gk2SdIYLCYEu4EnVNUOGH2vAPgA8EzgCsAQSNJBbDGfGlq9JwKdncDjq+pOwHMFknSQW8weweeSXAp8tFs/C7giyeHAXb1NJkkai8WE4HXAmcBzuvUrgVVVdS/wvL4GkySNx4KHhqqqgJsZHQZ6GaMLxt3Q81ySpDHZ7x5BkscDrwJeDdwBXAikqtwLkKRDyAMdGvoq8DngJVX1dYAkbxrLVJKksXmgQ0NnAbPAZ5L8TZJTGH2zWJJ0CNlvCKrq4qp6JfCzjH7f8JuAlUk+kOS0Mc0nSerZYk4W31tVH66qFwPHANcAG3qfTJI0Fgf0y+ur6s6q+uuqen5fA0mSxuuAQiBJOvQYAklqnCGQpMYZAklqnCGQpMYZAklqnCGQpMYZAklqnCGQpMYZAklqnCGQpMYZAklqXG8hSPLYJJ9JckOS65O8odt+ZJItSW7qlkf0NYMkaWF97hHsAn67qp4APAt4XZLjGV3CemtVrQG24iWtJWlQvYWgqrZX1Ze623cz+oX3jwHWAtPdw6aBM/qaQZK0sLGcI0iyGjgR+CKwsqq2wygWwFHjmEGStG+9hyDJI4B/At5YVd87gOetSzKTZGZubq6/ASWpcb2GIMlDGUXgw1X1z93mHUlWdfevAnbu67lVtamqJqtqcmJios8xJalpfX5qKMAHgRuq6n3z7toMTHW3p4BL+ppBkrSw5T2+9snArwLbklzTbfs94N3ARUnOBm4BXt7jDJKkBfQWgqr6PJD93H1KX+8rSTowfrNYkhpnCCSpcYZAkhpnCCSpcYZAkhpnCCSpcYZAkhpnCCSpcYZAkhpnCCSpcYZAkhpnCCSpcYZAkhpnCCSpcYZAkhpnCCSpcYZAkhpnCCSpcYZAkhpnCCSpcYZAkhpnCCSpcYZAkhpnCCSpcYZAkhpnCCSpcYZAkhpnCCSpcYZAkhpnCCSpcYZAkhpnCCSpcb2FIMl5SXYmuW7etiOTbElyU7c8oq/3lyQtTp97BB8CTt9r2wZga1WtAbZ265KkAfUWgqq6Arhzr81rgenu9jRwRl/vL0lanHGfI1hZVdsBuuVR+3tgknVJZpLMzM3NjW1ASWrNkj1ZXFWbqmqyqiYnJiaGHkeSDlnjDsGOJKsAuuXOMb+/JGkv4w7BZmCquz0FXDLm95ck7aXPj49+BPgCcFyS7yQ5G3g3cGqSm4BTu3VJ0oCW9/XCVfXq/dx1Sl/vKUk6cEv2ZLEkaTwMgSQ1zhBIUuMMgSQ1zhBIUuMMgSQ1zhBIUuMMgSQ1zhBIUuMMgSQ1zhBIUuMMgSQ1zhBIUuMMgSQ1zhBIUuMMgSQ1zhBIUuMMgSQ1zhBIUuMMgSQ1zhBIUuMMgSQ1zhBIUuMMgSQ1zhBIUuMMgSQ1zhBIUuMMgSQ1zhBIUuMMgSQ1zhBIUuMGCUGS05PcmOTrSTYMMYMkaWTsIUiyDDgXeCFwPPDqJMePew5J0sgQewTPAL5eVd+oqu8D/wisHWAOSRLDhOAxwLfnrX+n2yZJGsDyAd4z+9hWP/KgZB2wrlu9J8mNvU7VlhXA7UMPsRTkj6eGHkE/zL+be7xtXz8qD9jjFvOgIULwHeCx89aPAW7b+0FVtQnYNK6hWpJkpqomh5
5D2pt/N4cxxKGhq4A1SX4qycOAVwGbB5hDksQAewRVtSvJ64HLgGXAeVV1/bjnkCSNDHFoiKr6OPDxId5bgIfctHT5d3MAqfqR87SSpIZ4iQlJapwhOMRk5PNJXjhv2yuSfGLIuaT5klSS985bf3OStw84UtMMwSGmRsf6Xgu8L8lhSQ4H/gh43bCTST/kPuDMJCuGHkSG4JBUVdcBHwN+B3gbcH5V3ZxkKsmVSa5J8pdJHpJkeZILkmxLcl2Sc4adXo3YxejE8Jv2viPJ45JsTXJttzx2/OO1ZZBPDWks3gF8Cfg+MJnkicDLgGd3H+HdxOg7HDcDK6rqSQBJHjXUwGrOucC1STbutf0vGP3jZTrJa4A/A84Y+3QNMQSHqKq6N8mFwD1VdV+SFwBPB2aSADyc0TWfLgOOS/KnjD7S+8mhZlZbqup7Sc4HzgH+a95dJwFndrcvAPYOhR5khuDQtrv7A6NrPJ1XVX+w94OSPJnRZcHPAc7i/67xJPXt/Yz2XP/uAR7jZ9x75jmCdnwKeMWek3NJHp3k2CQTjL5P8lFG5xOeOuSQaktV3QlcBJw9b/O/MTpsCfDLwOfHPVdr3CNoRFVtS/IO4FNJHgL8gNGni+4HPpjR8aJidIJZGqf3Aq+ft34OcF6StwBzwG8MMlVD/GaxJDXOQ0OS1DhDIEmNMwSS1DhDIEmNMwSS1DhDoGYl+f0k13fXtLkmyTN/zNd7bpJLH6z5pHHxewRqUpKTgBcDT+0uwbECeNiA8yyvql1Dvb/a5h6BWrUKuL2q7gOoqtur6rYk30zyriRfSDKT5KlJLktyc5LXwv/+zof3dFdr3ZbklXu/eJKnJ/lykp9OcniS85Jc1W1b2z3m15N8NMnH8BpPGpB7BGrVJ4G3Jvkao8tvXFhVn+3u+3ZVnZTkT4APAScDhwHXA3/F6IJoJwBPAVYAVyW5Ys8LJ3k28OfA2qq6Jcm7gE9X1Wu6q7temeRT3cNPAp7cXWpBGoQhUJOq6p4kTwN+DngecGGSDd3dm7vlNuARVXU3cHeS/+5+kD8H+EhV3Q/sSPJZRld2/R7wBEbX2T+tqm7rXuc04KVJ3tytHwbsucb+FiOgoRkCNav7QX45cHmSbcBUd9d93XL3vNt71pczupLr/mxn9IP+RGBPCAKcVVU3zn9gd3L63h/jP0F6UHiOQE1KclySNfM2nQB8a5FPvwJ4ZZJl3dVbfx64srvvLuBFwLuSPLfbdhnwm92F/Uhy4o87v/RgMgRq1SOA6ST/keRa4Hjg7Yt87sXAtcBXgE8D66tqds+dVbUDeAlwbvev/j8EHsrot3Fd161LS4ZXH5WkxrlHIEmNMwSS1DhDIEmNMwSS1DhDIEmNMwSS1DhDIEmNMwSS1Lj/AZp5iWFf4mMkAAAAAElFTkSuQmCC\n",
"text/plain": [
"
"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"sns.barplot(x='Smoker', y='Age', ci=95, data=data)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The non-smokers are clearly older on average, so the observations are not evenly distributed between the two groups. But how does age come into play?\n",
"\n",
"The next step is therefore to study the data in more detail, in particular by age group.\n",
"\n",
"### Smoking-related deaths by age\n",
"\n",
"Taking the previous data and adding an age category (18-34, 35-54, 55-64, 65 and over), we repeat the same analyses."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
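],
"source": [
"# Mortality rate by age group and smoking status\n",
"\n",
"tranche_age_label = ['[18-34]', '[35-54]', '[55-64]', '[65-100]'] # the label text\n",
"x = np.arange(len(tranche_age_label)) # the label locations\n",
"width = 0.35 # the width of the bars\n",
"\n",
"# Alive/Dead counts and mortality per age group and smoking status\n",
"# (assumed construction of data_age, which the plot below relies on)\n",
"age_group = pd.cut(data['Age'], bins=[18, 35, 55, 65, 101], right=False, labels=tranche_age_label).rename('Age_group')\n",
"data_age = pd.crosstab([age_group, data['Smoker']], data['Status'])\n",
"data_age['Mortality'] = data_age['Dead'] / (data_age['Alive'] + data_age['Dead'])\n",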
"\n",
"fig, ax = plt.subplots()\n",
"ax.bar(x - width/2, data_age.reset_index()[data_age.reset_index().Smoker == 'Yes']['Mortality'], width, label='Smoker')\n",
"ax.bar(x + width/2, data_age.reset_index()[data_age.reset_index().Smoker == 'No']['Mortality'], width, label='Non smoker')\n",
"\n",
"ax.set_ylabel('Mortality rate')\n",
"ax.set_xlabel(\"Age group\")\n",
"ax.set_xticks(x)\n",
"ax.set_xticklabels(tranche_age_label)\n",
"ax.legend()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The plot above shows that, within each age group, the mortality rate of smokers is in fact greater than or equal to that of non-smokers!\n",
"\n",
"The histogram of ages in the two populations below shows that there are more non-smokers aged over 65, who are more likely to die of natural causes. This age group is therefore over-represented among non-smokers, which raises their average mortality rate.\n",
"\n",
"**Studying data as a whole can give very different results from studying subgroups, and this can lead to serious errors of interpretation.**"
]
},
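{
"cell_type": "markdown",
"metadata": {},
"source": [
"The reversal can be reproduced with a few lines of arithmetic. The sketch below uses invented counts (not the survey's figures) and hypothetical names (`counts`, `rate`, `overall_rate`) to show that smokers can have the higher mortality in every age group and still the lower mortality once the groups are pooled, simply because the two populations have different age compositions.\n",
"\n",
"```python\n",
"# Hypothetical (dead, total) counts per age group, invented to illustrate the reversal\n",
"counts = {\n",
"    'young': {'smoker': (10, 100), 'non_smoker': (4, 50)},\n",
"    'old': {'smoker': (16, 20), 'non_smoker': (75, 100)},\n",
"}\n",
"\n",
"def rate(dead, total):\n",
"    return dead / total\n",
"\n",
"def overall_rate(status):\n",
"    # Pool deaths and totals over all age groups for one smoking status\n",
"    dead = sum(group[status][0] for group in counts.values())\n",
"    total = sum(group[status][1] for group in counts.values())\n",
"    return rate(dead, total)\n",
"\n",
"for name, group in counts.items():\n",
"    print(name, rate(*group['smoker']), rate(*group['non_smoker']))\n",
"print('overall', overall_rate('smoker'), overall_rate('non_smoker'))\n",
"```\n",
"\n",
"In this toy example most smokers are young and most non-smokers are old, so the pooled non-smoker rate is dominated by the high-mortality older group."
]
},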
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
""
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "\n",
"text/plain": [
"
"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"# Age distribution of smokers and non-smokers\n",
"\n",
"sns.distplot(data[data.Smoker == 'Yes']['Age'], label='Smoker', kde=True)\n",
"sns.distplot(data[data.Smoker == 'No']['Age'], label='Non smoker')\n",
"plt.legend()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Two conclusions can be drawn about this bias:\n",
"- It arises largely from the **heterogeneity of the sample**. As shown above, the age groups are not represented in the same proportions among smokers and non-smokers. Care must nevertheless be taken to use *regular age groups suited to the study*.\n",
"- In addition, in the first part the participants' age was set aside in favour of an overall average. **Ignoring this parameter** led to a wrong interpretation.\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Deaths and logistic regression\n",
"\n",
"In this last part, a logistic regression is fitted to remove the bias introduced by arbitrary, irregular age groups.\n",
"\n",
"First, the `Status` column is recoded as follows:\n",
"- If the woman is dead: 1\n",
"- If the woman is alive: 0\n"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Exemple :\n"
]
},
{
"data": {
"text/plain": [
" Smoker Status Age\n",
"0 Yes 0 21.0\n",
"1 Yes 0 19.3\n",
"2 No 1 57.5\n",
"3 No 0 47.1\n",
"4 Yes 0 81.4\n",
"5 No 0 36.8\n",
"6 No 0 23.8\n",
"7 Yes 1 57.5\n",
"8 Yes 0 24.8\n",
"9 Yes 0 49.5\n",
"10 Yes 0 30.0"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data_reg = data.replace('Alive', 0).replace('Dead', 1)\n",
"\n",
"print ('Exemple :')\n",
"data_reg.loc[0:10, ]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For each of the *'Smoker'* and *'Non smoker'* groups, we fit a logistic regression to model the relationship between age and death (and hence the probability of death as a function of age)."
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Optimization terminated successfully.\n",
" Current function value: 0.412727\n",
" Iterations 7\n"
]
},
{
"data": {
"text/plain": [
"\n",
"\"\"\"\n",
" Logit Regression Results \n",
"==============================================================================\n",
"Dep. Variable: Status No. Observations: 582\n",
"Model: Logit Df Residuals: 580\n",
"Method: MLE Df Model: 1\n",
"Date: Fri, 31 Jul 2020 Pseudo R-squ.: 0.2492\n",
"Time: 15:58:40 Log-Likelihood: -240.21\n",
"converged: True LL-Null: -319.94\n",
" LLR p-value: 1.477e-36\n",
"==============================================================================\n",
" coef std err z P>|z| [0.025 0.975]\n",
"------------------------------------------------------------------------------\n",
"Intercept -5.5081 0.466 -11.814 0.000 -6.422 -4.594\n",
"Age 0.0890 0.009 10.203 0.000 0.072 0.106\n",
"==============================================================================\n",
"\"\"\""
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Smokers\n",
"data_reg_smoker = data_reg[data_reg.Smoker == 'Yes']\n",
"model = logit('Status ~ Age', data=data_reg_smoker)\n",
"result_smoker = model.fit()  # Newton-Raphson is the default optimizer\n",
"logit_smoker = result_smoker.predict(data_reg_smoker)  # predicted probabilities of death\n",
"result_smoker.summary()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For smokers, age is a statistically significant predictor (P < 0.05), with a slope coefficient of 0.089 (standard error 0.009, about 10% of the estimate) and a 95% confidence interval of 0.072 to 0.106."
]
},
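{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a reading aid for the slope reported in the summary above, the sketch below assumes the standard logistic model form, with `Status` coded as 1 for death. The symbols $\\beta_0$ and $\\beta_1$ are notation introduced here for the fitted intercept and `Age` coefficient.\n",
"\n",
"$$P(\\text{death} \\mid \\text{age}) = \\frac{1}{1 + e^{-(\\beta_0 + \\beta_1\\,\\text{age})}}, \\qquad \\beta_0 = -5.5081, \\quad \\beta_1 = 0.0890$$\n",
"\n",
"Under this form, each additional year of age multiplies the odds of death by $e^{\\beta_1}$. Applying the same transform to the reported interval bounds gives:\n",
"\n",
"$$e^{0.0890} \\approx 1.093, \\qquad [e^{0.072},\\, e^{0.106}] \\approx [1.075,\\, 1.112]$$\n",
"\n",
"In other words, the fitted odds of death for smokers rise by roughly 9% per year of age."
]
},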
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Optimization terminated successfully.\n",
" Current function value: 0.354560\n",
" Iterations 7\n"
]
},
{
"data": {
"text/html": [
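"<table class=\"simpletable\">\n",
"<caption>Logit Regression Results</caption>\n",
"<tr>\n",
"  <th>Dep. Variable:</th> <td>Status</td> <th>No. Observations:</th> <td>732</td>\n",
"</tr>\n",
"<tr>\n",
"  <th>Model:</th> <td>Logit</td> <th>Df Residuals:</th> <td>730</td>\n",
"</tr>\n",
"<tr>\n",
"  <th>Method:</th> <td>MLE</td> <th>Df Model:</th> <td>1</td>\n",
"</tr>\n",
"<tr>\n",
"  <th>Date:</th> <td>Fri, 31 Jul 2020</td> <th>Pseudo R-squ.:</th> <td>0.4304</td>\n",
"</tr>\n",
"<tr>\n",
"  <th>Time:</th> <td>15:58:40</td> <th>Log-Likelihood:</th> <td>-259.54</td>\n",
"</tr>\n",
"<tr>\n",
"  <th>converged:</th> <td>True</td> <th>LL-Null:</th> <td>-455.62</td>\n",
"</tr>\n",
"<tr>\n",
"  <th></th> <td></td> <th>LLR p-value:</th> <td>2.808e-87</td>\n",
"</tr>\n",
"</table>\n",
"<table class=\"simpletable\">\n",
"<tr>\n",
"  <td></td> <th>coef</th> <th>std err</th> <th>z</th> <th>P&gt;|z|</th> <th>[0.025</th> <th>0.975]</th>\n",
"</tr>\n",
"<tr>\n",
"  <th>Intercept</th> <td>-6.7955</td> <td>0.479</td> <td>-14.174</td> <td>0.000</td> <td>-7.735</td> <td>-5.856</td>\n",
"</tr>\n",
"<tr>\n",
"  <th>Age</th> <td>0.1073</td> <td>0.008</td> <td>13.742</td> <td>0.000</td> <td>0.092</td> <td>0.123</td>\n",
"</tr>\n",
"</table>"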
],
"text/plain": [
"\n",
"\"\"\"\n",
" Logit Regression Results \n",
"==============================================================================\n",
"Dep. Variable: Status No. Observations: 732\n",
"Model: Logit Df Residuals: 730\n",
"Method: MLE Df Model: 1\n",
"Date: Fri, 31 Jul 2020 Pseudo R-squ.: 0.4304\n",
"Time: 15:58:40 Log-Likelihood: -259.54\n",
"converged: True LL-Null: -455.62\n",
" LLR p-value: 2.808e-87\n",
"==============================================================================\n",
" coef std err z P>|z| [0.025 0.975]\n",
"------------------------------------------------------------------------------\n",
"Intercept -6.7955 0.479 -14.174 0.000 -7.735 -5.856\n",
"Age 0.1073 0.008 13.742 0.000 0.092 0.123\n",
"==============================================================================\n",
"\"\"\""
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Non-smokers\n",
"\n",
"data_reg_nosmoker = data_reg[data_reg.Smoker == 'No']\n",
"model = logit('Status ~ Age', data=data_reg_nosmoker)\n",
"result_nosmoker = model.fit()  # Newton-Raphson is the default optimizer\n",
"logit_nosmoker = result_nosmoker.predict(data_reg_nosmoker)  # predicted probabilities of death\n",
"result_nosmoker.summary()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For non-smokers, age is also a statistically significant predictor (P < 0.05), with a slope coefficient of 0.1073 (standard error 0.008, under 10% of the estimate, small enough to allow comparison with the smokers' result) and a 95% confidence interval of 0.092 to 0.123. This slope is steeper than for smokers, but the intercept is more negative (-6.80 versus -5.51).\n",
"\n",
"To better visualize how the probability of death varies with age, the logistic curves are plotted. Since Seaborn relies on the statsmodels package for the `logistic=True` option of lmplot, it offers a simple way to draw both curves on the same graph, each with its confidence band."
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
""
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "\n",
"text/plain": [
""
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"# Logistic fit of death status against age, one curve per smoking group\n",
"sns.lmplot(x='Age', y='Status', data=data_reg, logistic=True, ci=97.5, hue='Smoker')"
]
},
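{
"cell_type": "markdown",
"metadata": {},
"source": [
"The age at which the two fitted curves cross can be estimated from the coefficients in the two regression summaries. This is a sketch that assumes the logistic form $P = 1/(1 + e^{-(\\beta_0 + \\beta_1\\,\\text{age})})$; the superscripts $S$ (smokers) and $N$ (non-smokers) and the symbol $a_c$ (crossing age) are notation introduced here. The curves meet where the two linear predictors are equal:\n",
"\n",
"$$\\beta_0^{S} + \\beta_1^{S}\\, a_c = \\beta_0^{N} + \\beta_1^{N}\\, a_c \\quad\\Longrightarrow\\quad a_c = \\frac{\\beta_0^{S} - \\beta_0^{N}}{\\beta_1^{N} - \\beta_1^{S}} = \\frac{-5.5081 - (-6.7955)}{0.1073 - 0.0890} = \\frac{1.2874}{0.0183} \\approx 70.3$$\n",
"\n",
"Below about 70 years the smokers' fitted probability of death is the higher of the two. Because the coefficients are rounded and carry standard errors, this crossing age is only indicative."
]
},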
{
"cell_type": "markdown",
"metadata": {},
"source": [
"From the results above, we can see that:\n",
"- Between the ages of 35 and 60, the probability of death is higher for smokers than for non-smokers.\n",
"- At older ages, the curves converge and the confidence intervals overlap, so we cannot conclude that the probability of death is higher in either group.\n",
"- The regression coefficient for non-smokers is higher, with a more negative intercept, notably because their probability of death rises sharply beyond age 60, whereas that of smokers increases more steadily.\n",
"\n",
"**These regressions thus show that the effect of smoking is substantial within a certain age range, but that beyond it other causes of death come into play, bringing mortality in the two groups to the same level.**\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.4"
}
},
"nbformat": 4,
"nbformat_minor": 2
}