{ "cells": [ { "cell_type": "markdown", "metadata": { "hideCode": true, "hidePrompt": true }, "source": [ "# Subject 6: Around Simpson's Paradox" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Context" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "In 1972-1974, in Whickham, a town in the north-east of England located about 6.5 kilometres south-west of Newcastle upon Tyne, a survey of one sixth of the electorate was carried out to inform work on thyroid and heart disease (Tunbridge et al. 1977). A follow-up study was conducted twenty years later (Vanderpump et al. 1995). Some of the results concerned smoking and whether the individuals were still alive at the time of the second study. For simplicity, we restrict ourselves to women, and among them to the 1314 who were categorised as \"currently smoking\" or \"never having smoked\". Relatively few women in the initial survey had smoked and since quit (162), and very few had no information available (18). Twenty-year survival was determined for all the women in the first survey." ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## This subject is studied in 3 steps:" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "1. Tabulate the total number of women alive and dead over the period according to their smoking habit. In each group (smokers / non-smokers), compute the mortality rate (the ratio of the number of women who died in a group to the total number of women in that group). Analyse the result.\n", "\n", "2. Repeat question 1 (counts and mortality rates), adding a new category for the age group. Use the following groups: 18-34, 35-54, 55-64, and over 65. Analyse the result.\n", "\n", "3. Fit a logistic regression by introducing a variable Death equal to 1 or 0 depending on whether or not the person died during the 20 years between the first survey and the follow-up study. Conclude." ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Step 1: Computing the mortality rate for smokers and non-smokers" ] },
{ "cell_type": "markdown", "metadata": { "hideCode": true, "hidePrompt": true }, "source": [ "First, we import the libraries we will need." ] },
{ "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "import pandas as pd\n", "import statsmodels.api as sm\n", "import numpy as np\n" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Next, we load and read the file." ] },
{ "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "data_file = \"Subject6_smoking.csv\"\n", "#data_file = \"https://gitlab.inria.fr/learninglab/mooc-rr/mooc-rr-ressources/blob/master/module3/Practical_session/Subject6_smoking.csv\"" ] },
{ "cell_type": "code", "execution_count": 17, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ " Smoker Status Age\n", "0 Yes Alive 21.0\n", "1 Yes Alive 19.3\n", "2 No Dead 57.5\n", "3 No Alive 47.1\n", "4 Yes Alive 81.4\n", "5 No Alive 36.8\n", "6 No Alive 23.8\n", "7 Yes Dead 57.5\n", "8 Yes Alive 24.8\n", "9 Yes Alive 49.5\n", "10 Yes Alive 30.0\n", "11 No Dead 66.0\n", "12 Yes Alive 49.2\n", "13 No Alive 58.4\n", "14 No Dead 60.6\n", "15 No Alive 25.1\n", "16 No Alive 43.5\n", "17 No Alive 27.1\n", "18 No Alive 58.3\n", "19 Yes Alive 65.7\n", "20 No Dead 73.2\n", "21 Yes Alive 38.3\n", "22 No Alive 33.4\n", "23 Yes Dead 62.3\n", "24 No Alive 18.0\n", "25 No Alive 56.2\n", "26 Yes Alive 59.2\n", "27 No Alive 25.8\n", "28 No Dead 36.9\n", "29 No Alive 20.2\n", "... ... ... ...\n", "1284 Yes Dead 36.0\n", "1285 Yes Alive 48.3\n", "1286 No Alive 63.1\n", "1287 No Alive 60.8\n", "1288 Yes Dead 39.3\n", "1289 No Alive 36.7\n", "1290 No Alive 63.8\n", "1291 No Dead 71.3\n", "1292 No Alive 57.7\n", "1293 No Alive 63.2\n", "1294 No Alive 46.6\n", "1295 Yes Dead 82.4\n", "1296 Yes Alive 38.3\n", "1297 Yes Alive 32.7\n", "1298 No Alive 39.7\n", "1299 Yes Dead 60.0\n", "1300 No Dead 71.0\n", "1301 No Alive 20.5\n", "1302 No Alive 44.4\n", "1303 Yes Alive 31.2\n", "1304 Yes Alive 47.8\n", "1305 Yes Alive 60.9\n", "1306 No Dead 61.4\n", "1307 Yes Alive 43.0\n", "1308 No Alive 42.1\n", "1309 Yes Alive 35.9\n", "1310 No Alive 22.3\n", "1311 Yes Dead 62.1\n", "1312 No Dead 88.6\n", "1313 No Alive 39.1\n", "\n", "[1314 rows x 3 columns]" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "raw_data = pd.read_csv(data_file)\n", "raw_data" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Create two DataFrames from the contents of the CSV file:\n", "*nonFumeuses* holds the rows for the women who do not smoke (\"No\" in the \"Smoker\" column),\n", "and *fumeuses* holds the rows for the women who smoke (\"Yes\" in the \"Smoker\" column)." ] },
{ "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "#trier = raw_data.sort_values(by = [\"Smoker\"])\n", "masq = raw_data[\"Smoker\"] == \"Yes\"\n", "fumeuses = raw_data.loc[masq]\n", "nonFumeuses = raw_data.loc[raw_data[\"Smoker\"]==\"No\"]\n", "\n" ] },
{ "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ " Smoker Status Age\n", "0 Yes Alive 21.0\n", "1 Yes Alive 19.3\n", "4 Yes Alive 81.4\n", "7 Yes Dead 57.5\n", "8 Yes Alive 24.8\n", "9 Yes Alive 49.5\n", "10 Yes Alive 30.0\n", "12 Yes Alive 49.2\n", "19 Yes Alive 65.7\n", "21 Yes Alive 38.3\n", "23 Yes Dead 62.3\n", "26 Yes Alive 59.2\n", "30 Yes Alive 34.6\n", "31 Yes Alive 51.9\n", "32 Yes Alive 49.9\n", "35 Yes Alive 46.7\n", "36 Yes Alive 44.4\n", "37 Yes Alive 29.5\n", "38 Yes Dead 33.0\n", "39 Yes Alive 35.6\n", "40 Yes Alive 39.1\n", "42 Yes Alive 35.7\n", "46 Yes Dead 44.3\n", "48 Yes Alive 37.5\n", "49 Yes Alive 22.1\n", "53 Yes Alive 39.0\n", "56 Yes Alive 40.1\n", "60 Yes Alive 58.1\n", "61 Yes Alive 37.3\n", "63 Yes Dead 36.3\n", "... ... ... ...\n", "1240 Yes Alive 29.7\n", "1243 Yes Alive 40.1\n", "1251 Yes Alive 27.8\n", "1252 Yes Alive 52.4\n", "1253 Yes Alive 27.8\n", "1254 Yes Alive 41.0\n", "1259 Yes Alive 40.8\n", "1260 Yes Alive 20.4\n", "1263 Yes Alive 20.9\n", "1264 Yes Alive 45.5\n", "1269 Yes Alive 38.8\n", "1270 Yes Alive 55.5\n", "1271 Yes Alive 24.9\n", "1273 Yes Alive 55.7\n", "1276 Yes Alive 58.5\n", "1278 Yes Alive 43.7\n", "1282 Yes Alive 51.2\n", "1284 Yes Dead 36.0\n", "1285 Yes Alive 48.3\n", "1288 Yes Dead 39.3\n", "1295 Yes Dead 82.4\n", "1296 Yes Alive 38.3\n", "1297 Yes Alive 32.7\n", "1299 Yes Dead 60.0\n", "1303 Yes Alive 31.2\n", "1304 Yes Alive 47.8\n", "1305 Yes Alive 60.9\n", "1307 Yes Alive 43.0\n", "1309 Yes Alive 35.9\n", "1311 Yes Dead 62.1\n", "\n", "[582 rows x 3 columns]" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Display\n", "fumeuses" ] },
{ "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ " Smoker Status Age\n", "2 No Dead 57.5\n", "3 No Alive 47.1\n", "5 No Alive 36.8\n", "6 No Alive 23.8\n", "11 No Dead 66.0\n", "13 No Alive 58.4\n", "14 No Dead 60.6\n", "15 No Alive 25.1\n", "16 No Alive 43.5\n", "17 No Alive 27.1\n", "18 No Alive 58.3\n", "20 No Dead 73.2\n", "22 No Alive 33.4\n", "24 No Alive 18.0\n", "25 No Alive 56.2\n", "27 No Alive 25.8\n", "28 No Dead 36.9\n", "29 No Alive 20.2\n", "33 No Alive 19.4\n", "34 No Alive 56.9\n", "41 No Dead 69.7\n", "43 No Dead 75.8\n", "44 No Alive 25.3\n", "45 No Dead 83.0\n", "47 No Alive 18.5\n", "50 No Alive 82.8\n", "51 No Alive 45.0\n", "52 No Dead 73.3\n", "54 No Alive 28.4\n", "55 No Dead 73.7\n", "... ... ... ...\n", "1262 No Alive 41.2\n", "1265 No Alive 26.7\n", "1266 No Alive 41.8\n", "1267 No Alive 33.7\n", "1268 No Alive 56.5\n", "1272 No Alive 33.0\n", "1274 No Alive 25.7\n", "1275 No Alive 19.5\n", "1277 No Alive 23.4\n", "1279 No Alive 34.4\n", "1280 No Dead 83.9\n", "1281 No Alive 34.9\n", "1283 No Dead 86.3\n", "1286 No Alive 63.1\n", "1287 No Alive 60.8\n", "1289 No Alive 36.7\n", "1290 No Alive 63.8\n", "1291 No Dead 71.3\n", "1292 No Alive 57.7\n", "1293 No Alive 63.2\n", "1294 No Alive 46.6\n", "1298 No Alive 39.7\n", "1300 No Dead 71.0\n", "1301 No Alive 20.5\n", "1302 No Alive 44.4\n", "1306 No Dead 61.4\n", "1308 No Alive 42.1\n", "1310 No Alive 22.3\n", "1312 No Dead 88.6\n", "1313 No Alive 39.1\n", "\n", "[732 rows x 3 columns]" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Display\n", "nonFumeuses" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Compute the **total** number of smokers (*nbTotalF*) and of non-smokers (*nbTotalNF*)." ] },
{ "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Le nombre total de fumeuses est de : 582\n", "Le nombre total de non fumeuses est de : 732\n" ] } ], "source": [ "nbTotalF = 
len(fumeuses)\n", "nbTotalNF = len(nonFumeuses)\n", "print(\"Le nombre total de fumeuses est de :\", nbTotalF)\n", "print(\"Le nombre total de non fumeuses est de :\", nbTotalNF)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Compute the number of **smokers who died** (*nbDecedeesF*)." ] },
{ "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "139" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "nbDecedeesF = len(fumeuses.loc[fumeuses[\"Status\"]==\"Dead\"])\n", "nbDecedeesF" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Compute the number of **non-smokers who died** (*nbDecedeesNF*)." ] },
{ "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "230" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "nbDecedeesNF = len(nonFumeuses.loc[nonFumeuses[\"Status\"]==\"Dead\"])\n", "nbDecedeesNF" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Compute the **mortality rate** of smokers (*tauxMortF*) and of non-smokers (*tauxMortNF*)." ] },
{ "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Sur la période donnée, il y a pour les fumeuses un taux de mortalité de : 23.883161512027492 %\n", "et il y a pour les non fumeuses un taux de mortalité de : 31.420765027322407 %\n" ] } ], "source": [ "tauxMortF = nbDecedeesF/nbTotalF*100\n", "tauxMortNF = nbDecedeesNF/nbTotalNF*100\n", "print(\"Sur la période donnée, il y a pour les fumeuses un taux de mortalité de : \", tauxMortF, \"%\")\n", "print(\"et il y a pour les non fumeuses un taux de mortalité de : \", tauxMortNF, \"%\")" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Create a new pandas DataFrame (*dt*) holding the mortality rate for each status (smoker or not), in preparation for a plot based on these data." ] },
{ "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ " Statut tauxMortalite\n", "0 Fumeuses 23.883162\n", "1 nonFumeuses 31.420765" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "d = {\"tauxMortalite\" : [tauxMortF, tauxMortNF], \"Statut\" : [\"Fumeuses\", \"nonFumeuses\"]}\n", "dt = pd.DataFrame(data = d)\n", "dt" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Draw a bar chart to illustrate the previous computations." ] },
{ "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "%matplotlib inline\n", "plt.figure(figsize=(8, 5))\n", "plt.bar(dt[\"Statut\"], dt[\"tauxMortalite\"], color=['salmon', 'skyblue'])\n", "\n", "plt.title(\"Taux de mortalité par statut de tabagisme\")\n", "plt.xlabel(\"Statut\")\n", "plt.ylabel(\"Taux de mortalité (%)\")\n", "\n", "plt.show()\n" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "These results are rather surprising: since we are so often told that smoking is bad for health, we expected this study to confirm it.\n", "Instead, the computed rates show the opposite: the group of women who did not smoke has a higher mortality rate (about 31.4%) than the group of women who smoked (about 23.9%)." ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Step 2: Computing the mortality rate for smokers and non-smokers by age group" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "First attempt at counting the smokers and non-smokers aged between 18 and 34 (the code keeps ages from 18 inclusive up to 34 exclusive)." ] },
{ "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "179 219\n" ] } ], "source": [ "nb18_34F = len(fumeuses.loc[fumeuses[\"Age\"]<34]) - len(fumeuses.loc[fumeuses[\"Age\"]<18])\n", "nb18_34NF = len(nonFumeuses.loc[nonFumeuses[\"Age\"]<34]) - len(nonFumeuses.loc[nonFumeuses[\"Age\"]<18])\n", "print(nb18_34F, nb18_34NF)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Count the smokers aged between 18 and 34 with another method, then count how many of the smokers in this age interval died."
] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "179\n", "5 fumeuses ayant entre 18 et 34 ans lors du premier sondage sont décédées durant la période avant la suite de l'étude\n" ] } ], "source": [ "test = fumeuses.loc[fumeuses[\"Age\"]<34]\n", "t2 = test.loc[test[\"Age\"]>=18]\n", "print(len(t2))\n", "nbDecedees18_34F = len(t2.loc[t2[\"Status\"]==\"Dead\"])\n", "print(nbDecedees18_34F, \"fumeuses ayant entre 18 et 34 ans lors du premier sondage sont décédées durant la période avant la suite de l'étude\")" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Compute the mortality rate for smokers aged between 18 and 34." ] },
{ "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2.793296089385475" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tauxMort18_34F = nbDecedees18_34F/nb18_34F*100\n", "tauxMort18_34F" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Now that the computation has been worked out and tested on the first age interval, from 18 inclusive to 34 exclusive, it is better to write a function that computes the mortality rate for a given interval and a given DataFrame."
] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "def calculTMparClAge(borneInf, borneSup, data): # the upper bound of the interval is excluded\n", "    t1 = data.loc[data[\"Age\"] < borneSup]\n", "    t2 = t1.loc[t1[\"Age\"] >= borneInf]\n", "    nb = len(t2)\n", "    #print(nb)\n", "    nbMort = len(t2.loc[t2[\"Status\"]==\"Dead\"])\n", "    #print(nbMort)\n", "    tauxM = nbMort/nb*100\n", "    return tauxM" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Apply the function to every age interval." ] },
{ "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Le taux de mortalité des fumeuses pour la classe d'âge 18-34 est de : 2.793296089385475 %\n", "Le taux de mortalité des non fumeuses pour la classe d'âge 18-34 est de : 2.73972602739726 %\n" ] } ], "source": [ "tauxMort18_34Fv2 = calculTMparClAge(18, 34, fumeuses)\n", "print(\"Le taux de mortalité des fumeuses pour la classe d'âge 18-34 est de :\", tauxMort18_34Fv2, \"%\")\n", "\n", "tauxMort18_34NF = calculTMparClAge(18, 34, nonFumeuses)\n", "print(\"Le taux de mortalité des non fumeuses pour la classe d'âge 18-34 est de :\", tauxMort18_34NF, \"%\")" ] },
{ "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Le taux de mortalité des fumeuses pour la classe d'âge 34-54 est de : 17.154811715481173 %\n", "Le taux de mortalité des non fumeuses pour la classe d'âge 34-54 est de : 9.547738693467336 %\n" ] } ], "source": [ "tauxMort34_54F = calculTMparClAge(34, 54, fumeuses)\n", "print(\"Le taux de mortalité des fumeuses pour la classe d'âge 34-54 est de :\", tauxMort34_54F, \"%\")\n", "\n", "tauxMort34_54NF = calculTMparClAge(34, 54, nonFumeuses)\n", "print(\"Le taux de mortalité des non fumeuses pour la classe d'âge 34-54 est de :\", tauxMort34_54NF, \"%\")" ] },
{ "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Le taux de mortalité des fumeuses pour la classe d'âge 54-64 est de : 44.34782608695652 %\n", "Le taux de mortalité des non fumeuses pour la classe d'âge 54-64 est de : 32.773109243697476 %\n" ] } ], "source": [ "tauxMort54_64F = calculTMparClAge(54, 64, fumeuses)\n", "print(\"Le taux de mortalité des fumeuses pour la classe d'âge 54-64 est de :\", tauxMort54_64F, \"%\")\n", "\n", "tauxMort54_64NF = calculTMparClAge(54, 64, nonFumeuses)\n", "print(\"Le taux de mortalité des non fumeuses pour la classe d'âge 54-64 est de :\", tauxMort54_64NF, \"%\")" ] },
{ "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Le taux de mortalité des fumeuses de la classe d'âge 64-150 est de : 85.71428571428571 %\n", "Le taux de mortalité des non fumeuses de la classe d'âge 64-150 est de : 85.12820512820512 %\n" ] } ], "source": [ "tauxMort64_150F = calculTMparClAge(64, 150, fumeuses)\n", "print(\"Le taux de mortalité des fumeuses de la classe d'âge 64-150 est de :\", tauxMort64_150F, \"%\")\n", "\n", "tauxMort64_150NF = calculTMparClAge(64, 150, nonFumeuses)\n", "print(\"Le taux de mortalité des non fumeuses de la classe d'âge 64-150 est de :\", tauxMort64_150NF, \"%\")" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Create a new DataFrame *dt2* (built from the dictionary *d2*) containing the age groups, suffixed with *F* for smokers or *NF* for non-smokers, together with the corresponding mortality rates."
] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [], "source": [ "d2 = {\"classeAge\" : [\"18-34F\", \"18-34NF\", \"34-54F\", \"34-54NF\", \"54-64F\", \"54-64NF\", \"64+F\", \"64+NF\"],\n", " \"tauxMortalite\" : [tauxMort18_34Fv2, tauxMort18_34NF, tauxMort34_54F, tauxMort34_54NF, tauxMort54_64F, tauxMort54_64NF, tauxMort64_150F, tauxMort64_150NF]}\n", "dt2 = pd.DataFrame(data = d2)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Draw the bar chart of the mortality rates computed above, by age group." ] },
{ "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "%matplotlib inline\n", "plt.figure(figsize=(8, 5))\n", "plt.bar(dt2[\"classeAge\"], dt2[\"tauxMortalite\"], color=['salmon', 'skyblue'])\n", "\n", "plt.title(\"Taux de mortalité par classe d'âge\")\n", "plt.xlabel(\"Classe d'âge (F -> fumeuses et NF -> non fumeuses)\")\n", "plt.ylabel(\"Taux de mortalité (%)\")\n", "\n", "plt.show()\n" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Once the data are split by age group, the central groups (34-54 and 54-64) give the opposite result to the previous step. In these two groups, the mortality rate over the period between the first survey and the follow-up study is clearly higher among smokers than among non-smokers (about 17.2% versus 9.5%, and 44.3% versus 32.8%). This is closer to what prior knowledge alone would lead us to expect.\n", "The women's age is therefore a variable that cannot be ignored in this study: taking it into account changes the conclusion.\n", "This is consistent with the description of [Simpson's paradox](https://fr.wikipedia.org/wiki/Paradoxe_de_Simpson)." ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Step 3: Logistic regression" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Add a column Death containing 1 if the person died during the period between the first survey and the follow-up study and 0 otherwise, for every row of the DataFrame." ] },
{ "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " 
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
\n" ], "text/plain": [ " Smoker Status Age Death\n", "0 Yes Alive 21.0 0\n", "1 Yes Alive 19.3 0\n", "2 No Dead 57.5 1\n", "3 No Alive 47.1 0\n", "4 Yes Alive 81.4 0\n", "5 No Alive 36.8 0\n", "6 No Alive 23.8 0\n", "7 Yes Dead 57.5 1\n", "8 Yes Alive 24.8 0\n", "9 Yes Alive 49.5 0\n", "10 Yes Alive 30.0 0\n", "11 No Dead 66.0 1\n", "12 Yes Alive 49.2 0\n", "13 No Alive 58.4 0\n", "14 No Dead 60.6 1\n", "15 No Alive 25.1 0\n", "16 No Alive 43.5 0\n", "17 No Alive 27.1 0\n", "18 No Alive 58.3 0\n", "19 Yes Alive 65.7 0\n", "20 No Dead 73.2 1\n", "21 Yes Alive 38.3 0\n", "22 No Alive 33.4 0\n", "23 Yes Dead 62.3 1\n", "24 No Alive 18.0 0\n", "25 No Alive 56.2 0\n", "26 Yes Alive 59.2 0\n", "27 No Alive 25.8 0\n", "28 No Dead 36.9 1\n", "29 No Alive 20.2 0\n", "... ... ... ... ...\n", "1284 Yes Dead 36.0 1\n", "1285 Yes Alive 48.3 0\n", "1286 No Alive 63.1 0\n", "1287 No Alive 60.8 0\n", "1288 Yes Dead 39.3 1\n", "1289 No Alive 36.7 0\n", "1290 No Alive 63.8 0\n", "1291 No Dead 71.3 1\n", "1292 No Alive 57.7 0\n", "1293 No Alive 63.2 0\n", "1294 No Alive 46.6 0\n", "1295 Yes Dead 82.4 1\n", "1296 Yes Alive 38.3 0\n", "1297 Yes Alive 32.7 0\n", "1298 No Alive 39.7 0\n", "1299 Yes Dead 60.0 1\n", "1300 No Dead 71.0 1\n", "1301 No Alive 20.5 0\n", "1302 No Alive 44.4 0\n", "1303 Yes Alive 31.2 0\n", "1304 Yes Alive 47.8 0\n", "1305 Yes Alive 60.9 0\n", "1306 No Dead 61.4 1\n", "1307 Yes Alive 43.0 0\n", "1308 No Alive 42.1 0\n", "1309 Yes Alive 35.9 0\n", "1310 No Alive 22.3 0\n", "1311 Yes Dead 62.1 1\n", "1312 No Dead 88.6 1\n", "1313 No Alive 39.1 0\n", "\n", "[1314 rows x 4 columns]" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# apply() runs the anonymous lambda on every value of the Status column\n", "raw_data[\"Death\"] = raw_data[\"Status\"].apply(lambda x: 1 if x == \"Dead\" else 0)\n", "raw_data" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Create new DataFrames holding the same rows as *fumeuses* and *nonFumeuses*, together with the Death column added just above." ] },
{ "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": [ "nonFumeusesv2 = raw_data.loc[raw_data[\"Smoker\"]==\"No\"]\n", "fumeusesv2 = raw_data.loc[raw_data[\"Smoker\"]==\"Yes\"]" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Logistic regression on the smokers group" ] },
{ "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Optimization terminated successfully.\n", " Current function value: 0.412727\n", " Iterations 7\n", "Fumeuses:\n", " Logit Regression Results \n", "==============================================================================\n", "Dep. Variable: Death No. Observations: 582\n", "Model: Logit Df Residuals: 580\n", "Method: MLE Df Model: 1\n", "Date: Thu, 31 Oct 2024 Pseudo R-squ.: 0.2492\n", "Time: 21:26:13 Log-Likelihood: -240.21\n", "converged: True LL-Null: -319.94\n", " LLR p-value: 1.477e-36\n", "==============================================================================\n", " coef std err z P>|z| [0.025 0.975]\n", "------------------------------------------------------------------------------\n", "const -5.5081 0.466 -11.814 0.000 -6.422 -4.594\n", "Age 0.0890 0.009 10.203 0.000 0.072 0.106\n", "==============================================================================\n" ] } ], "source": [ "# Model for the smokers\n", "X_fumeuses = sm.add_constant(fumeusesv2['Age']) # Add the intercept column\n", "y_fumeuses = fumeusesv2['Death']\n", "model_fumeuses = sm.Logit(y_fumeuses, X_fumeuses).fit()\n", "\n", "# Print the summary of the fit\n", "print(\"Fumeuses:\\n\", model_fumeuses.summary())" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "**Analysis** of the logistic regression results for the smokers:" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "The p-value (P>|z|) for age is displayed as 0.000, far below the usual 0.05 threshold, so age has a statistically significant effect on the probability of death among smokers. Its coefficient is 0.0890, with a 95% confidence interval of \\[0.072, 0.106]. Because the coefficient is positive, the probability of death increases with age: each additional year multiplies the odds of death by about exp(0.0890) ≈ 1.09.\n", "The pseudo R-squared gives an indication of the quality of the fit. For the smokers it is 0.2492, a moderate value: age alone does not explain everything, but the fit still confirms that age affects the probability of death.\n", "The constant, -5.5081, is not a probability. It is the log-odds of death extrapolated to age 0, well outside the observed ages (the youngest woman in the data is 18), so it has little meaning on its own." ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Logistic regression on the non-smokers group" ] },
{ "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Optimization terminated successfully.\n", " Current function value: 0.354560\n", " Iterations 7\n", "Non-fumeuses:\n", " Logit Regression Results \n", "==============================================================================\n", "Dep. Variable: Death No. Observations: 732\n", "Model: Logit Df Residuals: 730\n", "Method: MLE Df Model: 1\n", "Date: Thu, 31 Oct 2024 Pseudo R-squ.: 0.4304\n", "Time: 21:26:13 Log-Likelihood: -259.54\n", "converged: True LL-Null: -455.62\n", " LLR p-value: 2.808e-87\n", "==============================================================================\n", " coef std err z P>|z| [0.025 0.975]\n", "------------------------------------------------------------------------------\n", "const -6.7955 0.479 -14.174 0.000 -7.735 -5.856\n", "Age 0.1073 0.008 13.742 0.000 0.092 0.123\n", "==============================================================================\n" ] } ], "source": [ "# Model for the non-smokers\n", "X_non_fumeuses = sm.add_constant(nonFumeusesv2['Age']) # Add the intercept column\n", "y_non_fumeuses = nonFumeusesv2['Death']\n", "model_non_fumeuses = sm.Logit(y_non_fumeuses, X_non_fumeuses).fit()\n", "\n", "# Print the summary of the fit\n", "print(\"Non-fumeuses:\\n\", model_non_fumeuses.summary())" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "**Analysis** of the logistic regression results for the non-smokers:" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "The p-value (P>|z|) for age is again displayed as 0.000, far below the usual 0.05 threshold, so age has a statistically significant effect on the probability of death among non-smokers. Its coefficient is 0.1073, with a 95% confidence interval of \\[0.092, 0.123]. Because the coefficient is positive, the probability of death increases with age: each additional year multiplies the odds of death by about exp(0.1073) ≈ 1.11.\n", "For the non-smokers the pseudo R-squared is 0.4304, which is fairly high: age alone gives a noticeably better fit here than for the smokers.\n", "The constant is -6.7955.\n", "\n" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "**Comparison** of the two logistic regressions above:" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "The age coefficient is higher for the non-smokers (0.1073) than for the smokers (0.0890), so the odds of death rise somewhat faster with age for non-smokers. The two 95% confidence intervals, \\[0.092, 0.123] and \\[0.072, 0.106], overlap, so this difference should be read with caution. \n", "The constants do not give a probability of death that ignores age: each one is the log-odds of death extrapolated to age 0. The non-smokers' constant (-6.7955) is lower than the smokers' (-5.5081), which means that the fitted curve for non-smokers starts lower at young ages. \n", "A lower starting point combined with a steeper slope suggests that age has a stronger effect on the mortality of non-smokers than on that of smokers.\n", "This could give the impression that smoking reduces the effect of age, but it may equally be due to a bias in the data, or to another factor not taken into account in this study that affects the smokers group more than the non-smokers group." ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Create a series of 100 evenly spaced age values running from the smallest to the largest age in the data."
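, "\n", "\n", "
As a rough illustration of what such an age grid is used for, and of the comparison between the two models, the sketch below evaluates both logistic curves by hand and computes the age at which they cross. It assumes only the rounded coefficients printed in the two summaries above; the 18 to 90 bounds and the helper names `death_probability` and `crossover_age` are hypothetical choices made for this example, whereas the notebook itself takes the bounds from the Age column and relies on `predict`.

```python
import numpy as np

# Rounded coefficients copied from the two Logit summaries in this notebook.
SMOKERS = {'const': -5.5081, 'age': 0.0890}
NON_SMOKERS = {'const': -6.7955, 'age': 0.1073}


def death_probability(age, coefs):
    '''Logistic curve: 1 / (1 + exp(-(const + age_coef * age))).'''
    eta = coefs['const'] + coefs['age'] * np.asarray(age, dtype=float)
    return 1.0 / (1.0 + np.exp(-eta))


def crossover_age(a, b):
    '''Age at which the two linear predictors, and so the two probabilities, are equal.'''
    return (a['const'] - b['const']) / (b['age'] - a['age'])


# Assumed bounds for illustration; the notebook takes them from the Age column.
age_grid = np.linspace(18.0, 90.0, 100)
p_smokers = death_probability(age_grid, SMOKERS)
p_non_smokers = death_probability(age_grid, NON_SMOKERS)

age_star = crossover_age(SMOKERS, NON_SMOKERS)
print(f'The fitted curves cross at about {age_star:.1f} years')

# Multiplicative change in the odds of death for one extra year of age.
for name, coefs in (('smokers', SMOKERS), ('non-smokers', NON_SMOKERS)):
    ratio = np.exp(coefs['age'])
    print(f'{name}: odds of death multiplied by {ratio:.3f} per year')
```

Because the coefficients are rounded to four decimals, the crossing age is only approximate; it is a quick consistency check on the fitted curves, not an estimate with its own uncertainty.
"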
] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [], "source": [ "age_range = np.linspace(raw_data['Age'].min(), raw_data['Age'].max(), 100)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Compute the predicted probabilities for the smokers (*pred_fumeuses*) and the non-smokers (*pred_non_fumeuses*)." ] },
{ "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [], "source": [ "# add_constant prepends the intercept column expected by each fitted model\n", "pred_fumeuses = model_fumeuses.predict(sm.add_constant(age_range))\n", "\n", "pred_non_fumeuses = model_non_fumeuses.predict(sm.add_constant(age_range))" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Plot the probability of death as a function of age." ] },
{ "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "plt.figure(figsize=(10, 6))\n", "plt.plot(age_range, pred_fumeuses, label=\"Fumeuses\", color=\"salmon\")\n", "plt.plot(age_range, pred_non_fumeuses, label=\"Non Fumeuses\", color=\"skyblue\")\n", "\n", "# 95% confidence bands: standard error of the linear predictor (delta method),\n", "# mapped back through the logistic function\n", "X_range = sm.add_constant(age_range)\n", "for model, couleur in [(model_fumeuses, \"salmon\"), (model_non_fumeuses, \"skyblue\")]:\n", "    eta = X_range @ np.asarray(model.params)\n", "    se = np.sqrt(np.sum((X_range @ np.asarray(model.cov_params())) * X_range, axis=1))\n", "    plt.fill_between(age_range, 1 / (1 + np.exp(-(eta - 1.96 * se))), 1 / (1 + np.exp(-(eta + 1.96 * se))), color=couleur, alpha=0.2)\n", "\n", "# Labels, title and legend\n", "plt.xlabel(\"Âge\")\n", "plt.ylabel(\"Probabilité de décès\")\n", "plt.title(\"Probabilité de décès en fonction de l'âge et du statut (fumeuses ou non fumeuses)\")\n", "plt.legend()\n", "plt.show()" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "On this graph, the probabilities of death for smokers and non-smokers are almost equal between 18 and 34 years and between 64 and 90 years, which matches the calculations and the bar chart from step 2.\n", "Between 34 and 64 years, the probability of death is higher for smokers than for non-smokers, which also matches the results of step 2. \n", "More precisely, the fitted curves put the smokers' probability of death above the non-smokers' from 18 to about 70 years, after which the trend reverses. \n", "According to these models, smoking would therefore increase the risk of death of the women who smoke up to a certain age."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "hide_code_all_hidden": true, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.4" } }, "nbformat": 4, "nbformat_minor": 4 }