{ "cells": [ { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false }, "source": [ "# Sujet 6 : Autour du paradoxe de Simpson" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false }, "source": [ "## Rappel du contexte\n", "\n", "_En 1972-1974, à Whickham, une ville du nord-est de l'Angleterre, située à environ 6,5 kilomètres au sud-ouest de Newcastle upon Tyne, un sondage d'un sixième des électeurs a été effectué afin d'éclairer des travaux sur les maladies thyroïdiennes et cardiaques (Tunbridge et al. 1977). Une suite de cette étude a été menée vingt ans plus tard (Vanderpump et al. 1995). Certains des résultats avaient trait au tabagisme et cherchaient à savoir si les individus étaient toujours en vie lors de la seconde étude. Par simplicité, nous nous restreindrons aux femmes et parmi celles-ci aux 1314 qui ont été catégorisées comme \"fumant actuellement\" ou \"n'ayant jamais fumé\". Il y avait relativement peu de femmes dans le sondage initial ayant fumé et ayant arrêté depuis (162) et très peu pour lesquelles l'information n'était pas disponible (18). 
Twenty-year survival was determined for all the women in the first survey._" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false }, "source": [ "## Data preparation" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false }, "source": [ "### Downloading the data\n", "\n", "The data for this study of Simpson's paradox are available from the INRIA GitLab, at the following address: " ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "hideCode": false, "hidePrompt": false }, "outputs": [], "source": [ "data_url = \"https://gitlab.inria.fr/learninglab/mooc-rr/mooc-rr-ressources/-/raw/master/module3/Practical_session/Subject6_smoking.csv?inline=false\"" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false }, "source": [ "To guard against the data disappearing from the server, we make a local copy of the dataset and work from that copy afterwards. This also avoids downloading the file every time the code is run. Before downloading, we check that a local copy does not already exist." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "hideCode": false, "hidePrompt": false }, "outputs": [], "source": [ "import os\n", "import urllib.request\n", "\n", "data_file = \"data.csv\"\n", "\n", "# Download only if no local copy exists yet\n", "if not os.path.exists(data_file):\n", "    urllib.request.urlretrieve(data_url, data_file)" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false }, "source": [ "### Checking the data\n", "\n", "Before starting the analysis, we check that we have the data we expect and that there are no missing or aberrant values.\n", "\n", "We therefore display the data: " ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "hideCode": false, "hidePrompt": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Smoker Status Age\n", "0 Yes Alive 21.0\n", "1 Yes Alive 19.3\n", "2 No Dead 57.5\n", "3 No Alive 47.1\n", "4 Yes Alive 81.4\n", "5 No Alive 36.8\n", "6 No Alive 23.8\n", "7 Yes Dead 57.5\n", "8 Yes Alive 24.8\n", "9 Yes Alive 49.5\n", "10 Yes Alive 30.0\n", "11 No Dead 66.0\n", "12 Yes Alive 49.2\n", "13 No Alive 58.4\n", "14 No Dead 60.6\n", "15 No Alive 25.1\n", "16 No Alive 43.5\n", "17 No Alive 27.1\n", "18 No Alive 58.3\n", "19 Yes Alive 65.7\n", "20 No Dead 73.2\n", "21 Yes Alive 38.3\n", "22 No Alive 33.4\n", "23 Yes Dead 62.3\n", "24 No Alive 18.0\n", "25 No Alive 56.2\n", "26 Yes Alive 59.2\n", "27 No Alive 25.8\n", "28 No Dead 36.9\n", "29 No Alive 20.2\n", "... ... ... ...\n", "1284 Yes Dead 36.0\n", "1285 Yes Alive 48.3\n", "1286 No Alive 63.1\n", "1287 No Alive 60.8\n", "1288 Yes Dead 39.3\n", "1289 No Alive 36.7\n", "1290 No Alive 63.8\n", "1291 No Dead 71.3\n", "1292 No Alive 57.7\n", "1293 No Alive 63.2\n", "1294 No Alive 46.6\n", "1295 Yes Dead 82.4\n", "1296 Yes Alive 38.3\n", "1297 Yes Alive 32.7\n", "1298 No Alive 39.7\n", "1299 Yes Dead 60.0\n", "1300 No Dead 71.0\n", "1301 No Alive 20.5\n", "1302 No Alive 44.4\n", "1303 Yes Alive 31.2\n", "1304 Yes Alive 47.8\n", "1305 Yes Alive 60.9\n", "1306 No Dead 61.4\n", "1307 Yes Alive 43.0\n", "1308 No Alive 42.1\n", "1309 Yes Alive 35.9\n", "1310 No Alive 22.3\n", "1311 Yes Dead 62.1\n", "1312 No Dead 88.6\n", "1313 No Alive 39.1\n", "\n", "[1314 rows x 3 columns]" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "data_file\n", "raw_data = pd.read_csv(data_file)\n", "raw_data" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false }, "source": [ "On note bien qu'on a nos trois colonnes :\n", "- __Smoker__ : si la personne fume ou non\n", "- __Status__ : si la personne est vivante ou décédée au moment de la seconde étude\n", "- __Age__ : son âge lors du premier sondage \n", "\n", 
"On vérifie ensuite qu'on n'a pas de données abérantes." ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "hideCode": false, "hidePrompt": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ "Empty DataFrame\n", "Columns: [Smoker, Status, Age]\n", "Index: []" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "raw_data[raw_data.isnull().any(axis=1)]" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false }, "source": [ "Il n'y a à priori pas de données manquantes. " ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "hideCode": false, "hidePrompt": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Smoker Status Age\n", "0 Yes Alive 21.0\n", "1 Yes Alive 19.3\n", "2 No Dead 57.5\n", "3 No Alive 47.1\n", "4 Yes Alive 81.4\n", "5 No Alive 36.8\n", "6 No Alive 23.8\n", "7 Yes Dead 57.5\n", "8 Yes Alive 24.8\n", "9 Yes Alive 49.5\n", "10 Yes Alive 30.0\n", "11 No Dead 66.0\n", "12 Yes Alive 49.2\n", "13 No Alive 58.4\n", "14 No Dead 60.6\n", "15 No Alive 25.1\n", "16 No Alive 43.5\n", "17 No Alive 27.1\n", "18 No Alive 58.3\n", "19 Yes Alive 65.7\n", "20 No Dead 73.2\n", "21 Yes Alive 38.3\n", "22 No Alive 33.4\n", "23 Yes Dead 62.3\n", "24 No Alive 18.0\n", "25 No Alive 56.2\n", "26 Yes Alive 59.2\n", "27 No Alive 25.8\n", "28 No Dead 36.9\n", "29 No Alive 20.2\n", "... ... ... ...\n", "1284 Yes Dead 36.0\n", "1285 Yes Alive 48.3\n", "1286 No Alive 63.1\n", "1287 No Alive 60.8\n", "1288 Yes Dead 39.3\n", "1289 No Alive 36.7\n", "1290 No Alive 63.8\n", "1291 No Dead 71.3\n", "1292 No Alive 57.7\n", "1293 No Alive 63.2\n", "1294 No Alive 46.6\n", "1295 Yes Dead 82.4\n", "1296 Yes Alive 38.3\n", "1297 Yes Alive 32.7\n", "1298 No Alive 39.7\n", "1299 Yes Dead 60.0\n", "1300 No Dead 71.0\n", "1301 No Alive 20.5\n", "1302 No Alive 44.4\n", "1303 Yes Alive 31.2\n", "1304 Yes Alive 47.8\n", "1305 Yes Alive 60.9\n", "1306 No Dead 61.4\n", "1307 Yes Alive 43.0\n", "1308 No Alive 42.1\n", "1309 Yes Alive 35.9\n", "1310 No Alive 22.3\n", "1311 Yes Dead 62.1\n", "1312 No Dead 88.6\n", "1313 No Alive 39.1\n", "\n", "[1314 rows x 3 columns]" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data = raw_data.copy()\n", "data" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false }, "source": [ "On vérifie également qu'il n'y a pas de données abérantes, c'est à dire de personnes dont l'âge n'est pas absurde. Pour cela, on récupère le minimum et le maximum de la colonne __Age__. Un age négatif sera par exemple considéré comme abérant." 
] }, { "cell_type": "code", "execution_count": 7, "metadata": { "hideCode": false, "hidePrompt": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "L'age min est : 18.0\n", "L'age max est : 89.9\n" ] } ], "source": [ "age_min = data[\"Age\"].min()\n", "age_max = data[\"Age\"].max()\n", "\n", "print(\"L'age min est : \", age_min)\n", "print(\"L'age max est : \", age_max)" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false }, "source": [ "On conserve toutes les personnes. " ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false }, "source": [ "## Exercice" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hideOutput": true, "hidePrompt": false }, "source": [ "### Partie 1" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false }, "source": [ "__Consigne : Représentez dans un tableau le nombre total de femmes vivantes et décédées sur la période en fonction de leur habitude de tabagisme. " ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false }, "source": [ "On récupère les données des femmes fumeuses et non fumeuses:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "hideCode": false, "hidePrompt": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Smoker Status Age\n", "0 Yes Alive 21.0\n", "1 Yes Alive 19.3\n", "4 Yes Alive 81.4\n", "7 Yes Dead 57.5\n", "8 Yes Alive 24.8\n", "9 Yes Alive 49.5\n", "10 Yes Alive 30.0\n", "12 Yes Alive 49.2\n", "19 Yes Alive 65.7\n", "21 Yes Alive 38.3\n", "23 Yes Dead 62.3\n", "26 Yes Alive 59.2\n", "30 Yes Alive 34.6\n", "31 Yes Alive 51.9\n", "32 Yes Alive 49.9\n", "35 Yes Alive 46.7\n", "36 Yes Alive 44.4\n", "37 Yes Alive 29.5\n", "38 Yes Dead 33.0\n", "39 Yes Alive 35.6\n", "40 Yes Alive 39.1\n", "42 Yes Alive 35.7\n", "46 Yes Dead 44.3\n", "48 Yes Alive 37.5\n", "49 Yes Alive 22.1\n", "53 Yes Alive 39.0\n", "56 Yes Alive 40.1\n", "60 Yes Alive 58.1\n", "61 Yes Alive 37.3\n", "63 Yes Dead 36.3\n", "... ... ... ...\n", "1240 Yes Alive 29.7\n", "1243 Yes Alive 40.1\n", "1251 Yes Alive 27.8\n", "1252 Yes Alive 52.4\n", "1253 Yes Alive 27.8\n", "1254 Yes Alive 41.0\n", "1259 Yes Alive 40.8\n", "1260 Yes Alive 20.4\n", "1263 Yes Alive 20.9\n", "1264 Yes Alive 45.5\n", "1269 Yes Alive 38.8\n", "1270 Yes Alive 55.5\n", "1271 Yes Alive 24.9\n", "1273 Yes Alive 55.7\n", "1276 Yes Alive 58.5\n", "1278 Yes Alive 43.7\n", "1282 Yes Alive 51.2\n", "1284 Yes Dead 36.0\n", "1285 Yes Alive 48.3\n", "1288 Yes Dead 39.3\n", "1295 Yes Dead 82.4\n", "1296 Yes Alive 38.3\n", "1297 Yes Alive 32.7\n", "1299 Yes Dead 60.0\n", "1303 Yes Alive 31.2\n", "1304 Yes Alive 47.8\n", "1305 Yes Alive 60.9\n", "1307 Yes Alive 43.0\n", "1309 Yes Alive 35.9\n", "1311 Yes Dead 62.1\n", "\n", "[582 rows x 3 columns]" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "smoker = data[data[\"Smoker\"]==\"Yes\"]\n", "smoker" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "hideCode": false, "hidePrompt": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Smoker Status Age\n", "2 No Dead 57.5\n", "3 No Alive 47.1\n", "5 No Alive 36.8\n", "6 No Alive 23.8\n", "11 No Dead 66.0\n", "13 No Alive 58.4\n", "14 No Dead 60.6\n", "15 No Alive 25.1\n", "16 No Alive 43.5\n", "17 No Alive 27.1\n", "18 No Alive 58.3\n", "20 No Dead 73.2\n", "22 No Alive 33.4\n", "24 No Alive 18.0\n", "25 No Alive 56.2\n", "27 No Alive 25.8\n", "28 No Dead 36.9\n", "29 No Alive 20.2\n", "33 No Alive 19.4\n", "34 No Alive 56.9\n", "41 No Dead 69.7\n", "43 No Dead 75.8\n", "44 No Alive 25.3\n", "45 No Dead 83.0\n", "47 No Alive 18.5\n", "50 No Alive 82.8\n", "51 No Alive 45.0\n", "52 No Dead 73.3\n", "54 No Alive 28.4\n", "55 No Dead 73.7\n", "... ... ... ...\n", "1262 No Alive 41.2\n", "1265 No Alive 26.7\n", "1266 No Alive 41.8\n", "1267 No Alive 33.7\n", "1268 No Alive 56.5\n", "1272 No Alive 33.0\n", "1274 No Alive 25.7\n", "1275 No Alive 19.5\n", "1277 No Alive 23.4\n", "1279 No Alive 34.4\n", "1280 No Dead 83.9\n", "1281 No Alive 34.9\n", "1283 No Dead 86.3\n", "1286 No Alive 63.1\n", "1287 No Alive 60.8\n", "1289 No Alive 36.7\n", "1290 No Alive 63.8\n", "1291 No Dead 71.3\n", "1292 No Alive 57.7\n", "1293 No Alive 63.2\n", "1294 No Alive 46.6\n", "1298 No Alive 39.7\n", "1300 No Dead 71.0\n", "1301 No Alive 20.5\n", "1302 No Alive 44.4\n", "1306 No Dead 61.4\n", "1308 No Alive 42.1\n", "1310 No Alive 22.3\n", "1312 No Dead 88.6\n", "1313 No Alive 39.1\n", "\n", "[732 rows x 3 columns]" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "no_smoker = data[data[\"Smoker\"]==\"No\"]\n", "no_smoker" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false }, "source": [ "On en déduit le total de femmes fumeuses (582) et non fumeuses (732). 
\n", "\n", "We then count how many smokers are dead and alive with the `value_counts` function: " ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "hideCode": false, "hidePrompt": false }, "outputs": [ { "data": { "text/plain": [ "Alive 443\n", "Dead 139\n", "Name: Status, dtype: int64" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "smoker[\"Status\"].value_counts()" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false }, "source": [ "We do the same for the non-smokers: " ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "hideCode": false, "hidePrompt": false }, "outputs": [ { "data": { "text/plain": [ "Alive 502\n", "Dead 230\n", "Name: Status, dtype: int64" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "no_smoker[\"Status\"].value_counts()" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false }, "source": [ "We summarise the results in a table, with an extra row giving the total: " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hideCode": false, "hidePrompt": false }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false }, "source": [ "| | Smokers | Non-smokers |\n", "| :------------ | :-------------: | :-------------: |\n", "| Number of women alive | 443 | 502 |\n", "| Number of women dead | 139 | 230 |\n", "| Total | 582 | 732 |" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false }, "source": [ "__Task: Compute the mortality rate in each group (smokers / non-smokers), that is, the ratio between the number of women who died in a group and the total number of women in that group. 
You may propose a graphical representation of these data and compute confidence intervals if you wish. Why is this result surprising?__" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false }, "source": [ "We compute the mortality rate of smokers and of non-smokers __within each group__: it is the ratio of the number of women who died in a group to the total number of women in that group. \n", "\n", "We obtain the number of deaths in a `DataFrame` by selecting the rows for people who died, `DataFrame[DataFrame[\"Status\"]==\"Dead\"]`, and taking the length of that subset with the `len` function. The total number of smokers (resp. non-smokers) is the length of the `smoker` (resp. `no_smoker`) table, also obtained with `len`." ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "hideCode": false, "hidePrompt": false }, "outputs": [ { "data": { "text/plain": [ "0.23883161512027493" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "death_rate_smoker = len(smoker[smoker[\"Status\"]==\"Dead\"])/len(smoker)\n", "death_rate_smoker" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false }, "source": [ "__The mortality rate of smokers is therefore $23.88\\%$__" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "hideCode": false, "hidePrompt": false }, "outputs": [ { "data": { "text/plain": [ "0.31420765027322406" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "death_rate_nosmoker = len(no_smoker[no_smoker[\"Status\"]==\"Dead\"])/len(no_smoker)\n", "death_rate_nosmoker" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false }, "source": [ "__The mortality rate of non-smokers is therefore $31.42\\%$__" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false }, "source": [ "We plot these rates as a bar chart: " ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "hideCode": false, "hidePrompt": false }, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "%matplotlib inline\n", "import matplotlib.pyplot as plt\n", "import numpy as np\n", "\n", "fig = plt.figure()\n", "\n", "height = [death_rate_smoker, death_rate_nosmoker]\n", "x = [\"Yes\", \"No\"]\n", "width = 1.0\n", "\n", "plt.title(\"Taux de mortalité en fonction de la catégorie femmmes fumeuses/non-fumeuses\")\n", "plt.ylabel(u\"Death Rate\")\n", "plt.xlabel(u\"Smoker\")\n", "plt.bar(x, height, width/2)\n", "\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false }, "source": [ "Ces données sont surprenantes car le taux de mortalité pour les femmes fumeuses est plus petit que celui des femmes non-fumeuses ce qui est contradictoire avec la littérature à ce sujet." ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false }, "source": [ "### Partie 2" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false }, "source": [ "__Consigne : Reprenez la question 1 (effectifs et taux de mortalité) en rajoutant une nouvelle catégorie liée à la classe d'âge. On considérera par exemple les classes suivantes : 18-34 ans, 34-54 ans, 54-64 ans, plus de 64 ans. En quoi ce résultat est-il surprenant ? Arrivez-vous à expliquer ce paradoxe ? De même, vous pourrez proposer une représentation graphique de ces données pour étayer vos explications.__" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false }, "source": [ "On extrait de notre jeu de données les femmes âgées de 18 à 34 ans, puis on applique les étapes de la partie 1." ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "hideCode": false, "hidePrompt": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Smoker Status Age\n", "0 Yes Alive 21.0\n", "1 Yes Alive 19.3\n", "6 No Alive 23.8\n", "8 Yes Alive 24.8\n", "10 Yes Alive 30.0\n", "15 No Alive 25.1\n", "17 No Alive 27.1\n", "22 No Alive 33.4\n", "24 No Alive 18.0\n", "27 No Alive 25.8\n", "29 No Alive 20.2\n", "33 No Alive 19.4\n", "37 Yes Alive 29.5\n", "38 Yes Dead 33.0\n", "44 No Alive 25.3\n", "47 No Alive 18.5\n", "49 Yes Alive 22.1\n", "54 No Alive 28.4\n", "58 No Alive 22.9\n", "65 Yes Alive 33.0\n", "67 Yes Alive 27.9\n", "71 Yes Alive 26.2\n", "76 No Alive 27.6\n", "77 Yes Alive 31.4\n", "79 No Alive 18.9\n", "81 Yes Alive 25.4\n", "84 No Alive 27.3\n", "86 No Alive 32.8\n", "91 No Alive 18.3\n", "92 Yes Alive 20.2\n", "... ... ... ...\n", "1205 No Alive 23.2\n", "1207 Yes Alive 31.4\n", "1208 Yes Alive 30.0\n", "1213 No Alive 21.4\n", "1216 Yes Alive 27.9\n", "1217 Yes Alive 29.5\n", "1219 Yes Alive 27.0\n", "1223 Yes Alive 28.3\n", "1226 Yes Alive 31.0\n", "1232 No Alive 28.3\n", "1240 Yes Alive 29.7\n", "1247 No Alive 26.0\n", "1250 No Alive 19.8\n", "1251 Yes Alive 27.8\n", "1253 Yes Alive 27.8\n", "1255 No Dead 28.5\n", "1256 No Alive 26.7\n", "1260 Yes Alive 20.4\n", "1263 Yes Alive 20.9\n", "1265 No Alive 26.7\n", "1267 No Alive 33.7\n", "1271 Yes Alive 24.9\n", "1272 No Alive 33.0\n", "1274 No Alive 25.7\n", "1275 No Alive 19.5\n", "1277 No Alive 23.4\n", "1297 Yes Alive 32.7\n", "1301 No Alive 20.5\n", "1303 Yes Alive 31.2\n", "1310 No Alive 22.3\n", "\n", "[398 rows x 3 columns]" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "extract = data[data[\"Age\"]>=18]\n", "extract = extract[extract[\"Age\"]<34]\n", "extract" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false }, "source": [ "Pour cette première catégorie d'âge, on extrait les femmes fumeuses (`smoker`) et les femmes non-fumeuses (`nosmoker`), et on extrait à chaque fois le nombre femmes décédées et vivantes avec la fonction 
`value_counts()`." ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "hideCode": false, "hidePrompt": false }, "outputs": [ { "data": { "text/plain": [ "Alive 174\n", "Dead 5\n", "Name: Status, dtype: int64" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "smoker = extract[extract[\"Smoker\"]==\"Yes\"]\n", "smoker[\"Status\"].value_counts()" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "hideCode": false, "hidePrompt": false }, "outputs": [ { "data": { "text/plain": [ "Alive 213\n", "Dead 6\n", "Name: Status, dtype: int64" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "nosmoker = extract[extract[\"Smoker\"]==\"No\"]\n", "nosmoker[\"Status\"].value_counts()" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false }, "source": [ "For this age group, we compute the mortality rates with the formula given earlier: " ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "hideCode": false, "hidePrompt": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Mortality rate for smokers: 0.0279\n", "Mortality rate for non-smokers: 0.0274\n" ] } ], "source": [ "death_rate_smoker_cat1 = len(smoker[smoker[\"Status\"]==\"Dead\"])/len(smoker)\n", "death_rate_nosmoker_cat1 = len(nosmoker[nosmoker[\"Status\"]==\"Dead\"])/len(nosmoker)\n", "# Print the rates of this age group (not the overall rates from Part 1)\n", "print(\"Mortality rate for smokers: {:.4f}\".format(death_rate_smoker_cat1))\n", "print(\"Mortality rate for non-smokers: {:.4f}\".format(death_rate_nosmoker_cat1))" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false }, "source": [ "We do the same for the other age groups. 
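
The cells of this part repeat the same filtering once per age group. As a hedged sketch (not the method used in this notebook), the same counts and rates could be obtained in one pass with `pd.cut` and `groupby`. The helper name `mortality_by_age_group`, the columns `AgeGroup` and `Dead`, and the small `sample` frame are illustrative assumptions; `sample` is made-up data, not the Whickham survey.

```python
import pandas as pd


def mortality_by_age_group(df,
                           bins=(18, 34, 54, 64, float('inf')),
                           labels=('18-34', '34-54', '54-64', '>64')):
    '''Return counts and mortality rate per (age group, smoker) pair.'''
    out = df.copy()
    # right=False gives left-closed bins [18, 34), [34, 54), [54, 64), [64, inf),
    # which mirrors the >= / < bounds used in the individual cells.
    out['AgeGroup'] = pd.cut(out['Age'], bins=list(bins),
                             labels=list(labels), right=False)
    # 1 if the woman died during the follow-up, 0 otherwise.
    out['Dead'] = (out['Status'] == 'Dead').astype(int)
    grouped = out.groupby(['AgeGroup', 'Smoker'], observed=True)['Dead']
    # total = group size, dead = number of deaths, rate = dead / total.
    return grouped.agg(total='count', dead='sum', rate='mean')


# Hypothetical toy data, only to show the call.
sample = pd.DataFrame({
    'Smoker': ['Yes', 'Yes', 'No', 'No', 'Yes', 'No'],
    'Status': ['Alive', 'Dead', 'Alive', 'Alive', 'Dead', 'Dead'],
    'Age': [21.0, 33.0, 25.1, 40.2, 70.3, 66.0],
})
summary = mortality_by_age_group(sample)
print(summary)
```

Applied to the notebook's `data` frame, `mortality_by_age_group(data)` would be expected to give the same counts as the individual cells, since the bins use the same bounds.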
" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false }, "source": [ "__Tranche 34-54ans__ : " ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "hideCode": false, "hidePrompt": false }, "outputs": [], "source": [ "extract = data[data[\"Age\"]>=34]\n", "extract = extract[extract[\"Age\"]<54]" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "hideCode": false, "hidePrompt": false }, "outputs": [ { "data": { "text/plain": [ "Alive 198\n", "Dead 41\n", "Name: Status, dtype: int64" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "smoker = extract[extract[\"Smoker\"]==\"Yes\"]\n", "smoker[\"Status\"].value_counts()" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "hideCode": false, "hidePrompt": false }, "outputs": [ { "data": { "text/plain": [ "Alive 180\n", "Dead 19\n", "Name: Status, dtype: int64" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "nosmoker = extract[extract[\"Smoker\"]==\"No\"]\n", "nosmoker\n", "nosmoker[\"Status\"].value_counts()" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "hideCode": false, "hidePrompt": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Taux de mortalité pour les femmes fumeuses : 0.23883161512027493\n", "Taux de mortalité pour les femmes non-fumeuses : 0.31420765027322406\n" ] } ], "source": [ "death_rate_smoker_cat2 = len(smoker[smoker[\"Status\"]==\"Dead\"])/len(smoker)\n", "death_rate_nosmoker_cat2 = len(nosmoker[nosmoker[\"Status\"]==\"Dead\"])/len(nosmoker)\n", "print(\"Taux de mortalité pour les femmes fumeuses : \",death_rate_smoker)\n", "print(\"Taux de mortalité pour les femmes non-fumeuses : \",death_rate_nosmoker)" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false }, "source": [ "__Tranche 54-64ans__ :" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { 
"hideCode": false, "hidePrompt": false }, "outputs": [], "source": [ "extract = data[data[\"Age\"]>=54]\n", "extract = extract[extract[\"Age\"]<64]" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "hideCode": false, "hidePrompt": false }, "outputs": [ { "data": { "text/plain": [ "Alive 64\n", "Dead 51\n", "Name: Status, dtype: int64" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "smoker = extract[extract[\"Smoker\"]==\"Yes\"]\n", "smoker[\"Status\"].value_counts()" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "hideCode": false, "hidePrompt": false }, "outputs": [ { "data": { "text/plain": [ "Alive 80\n", "Dead 39\n", "Name: Status, dtype: int64" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "nosmoker = extract[extract[\"Smoker\"]==\"No\"]\n", "nosmoker\n", "nosmoker[\"Status\"].value_counts()" ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "hideCode": false, "hidePrompt": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Taux de mortalité pour les femmes fumeuses : 0.23883161512027493\n", "Taux de mortalité pour les femmes non-fumeuses : 0.31420765027322406\n" ] } ], "source": [ "death_rate_smoker_cat3 = len(smoker[smoker[\"Status\"]==\"Dead\"])/len(smoker)\n", "death_rate_nosmoker_cat3 = len(nosmoker[nosmoker[\"Status\"]==\"Dead\"])/len(nosmoker)\n", "print(\"Taux de mortalité pour les femmes fumeuses : \",death_rate_smoker)\n", "print(\"Taux de mortalité pour les femmes non-fumeuses : \",death_rate_nosmoker)" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false }, "source": [ "__Tranche plus de 64ans__ :" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "hideCode": false, "hidePrompt": false }, "outputs": [], "source": [ "extract = data[data[\"Age\"]>=64]" ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "hideCode": false, 
"hidePrompt": false }, "outputs": [ { "data": { "text/plain": [ "Dead 42\n", "Alive 7\n", "Name: Status, dtype: int64" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "smoker = extract[extract[\"Smoker\"]==\"Yes\"]\n", "smoker[\"Status\"].value_counts()" ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "hideCode": false, "hidePrompt": false }, "outputs": [ { "data": { "text/plain": [ "Dead 166\n", "Alive 29\n", "Name: Status, dtype: int64" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "nosmoker = extract[extract[\"Smoker\"]==\"No\"]\n", "nosmoker\n", "nosmoker[\"Status\"].value_counts()" ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "hideCode": false, "hidePrompt": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Taux de mortalité pour les femmes fumeuses : 0.23883161512027493\n", "Taux de mortalité pour les femmes non-fumeuses : 0.31420765027322406\n" ] } ], "source": [ "death_rate_smoker_cat4 = len(smoker[smoker[\"Status\"]==\"Dead\"])/len(smoker)\n", "death_rate_nosmoker_cat4 = len(nosmoker[nosmoker[\"Status\"]==\"Dead\"])/len(nosmoker)\n", "print(\"Taux de mortalité pour les femmes fumeuses : \",death_rate_smoker)\n", "print(\"Taux de mortalité pour les femmes non-fumeuses : \",death_rate_nosmoker)" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false }, "source": [ "On trace alors les taux de mortalité pour les différentes classes d'âges précédemment calculés." ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "hideCode": false, "hidePrompt": false }, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "%matplotlib inline\n", "import matplotlib\n", "import matplotlib.pyplot as plt\n", "import numpy as np\n", "\n", "\n", "labels = ['18-34', '34-54', '54-64', '>64']\n", "death_rate_smoker = [death_rate_smoker_cat1, death_rate_smoker_cat2, death_rate_smoker_cat3, death_rate_smoker_cat4]\n", "death_rate_nosmoker = [death_rate_nosmoker_cat1, death_rate_nosmoker_cat2, death_rate_nosmoker_cat3, death_rate_nosmoker_cat4]\n", "\n", "death_rate_smoker= np.around(death_rate_smoker, decimals=3)\n", "death_rate_nosmoker= np.around(death_rate_nosmoker, decimals=3)\n", "\n", "x = np.arange(len(labels)) \n", "width = 0.35 \n", "\n", "fig, ax = plt.subplots()\n", "rects1 = ax.bar(x - width/2, death_rate_smoker, width, label='Smoker')\n", "rects2 = ax.bar(x + width/2, death_rate_nosmoker, width, label='No Smoker')\n", "\n", "ax.set_ylabel('Death Rate')\n", "ax.set_xlabel('Age')\n", "ax.set_xticks(x)\n", "ax.set_xticklabels(labels)\n", "ax.legend()\n", "\n", "\n", "def autolabel(rects):\n", " \"\"\"Attach a text label above each bar in *rects*, displaying its height.\"\"\"\n", " for rect in rects:\n", " height = rect.get_height()\n", " ax.annotate('{}'.format(height),\n", " xy=(rect.get_x() + rect.get_width() / 2, height),\n", " xytext=(0, 3), # 3 points vertical offset\n", " textcoords=\"offset points\",\n", " ha='center', va='bottom')\n", "\n", "\n", "autolabel(rects1)\n", "autolabel(rects2)\n", "\n", "fig.tight_layout()\n", "\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false }, "source": [ "On remarque que pour les femmes fumeuses ont globalement un taux de mortalité plus élevé à l'intérieur de chacune des classes d'âge, et que la différence est plus marqué pour les classes d'âges moyennes que pour celles extrêmes. On peut penser que pour les jeunes et les personnes âgées, les causes de la mort sont autres que la cigarette. 
Hence the nearly identical mortality rates. Conversely, between 34 and 64, smoking goes with a clear increase in the mortality rate. As expected, the mortality rate increases with age. \n", "\n", "This result contradicts the previous one. The discrepancy comes from factors that are ignored when only the whole group is considered (such as variables that are not independent, or unequal group sizes). Let us look more closely: we compute the mean age and its standard deviation for the two groups, using the `mean` and `std` functions. " ] }, { "cell_type": "code", "execution_count": 32, "metadata": { "hideCode": false, "hidePrompt": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The mean age of the smokers is 44.26975945017182, with a standard deviation of 16.21788646063739\n" ] } ], "source": [ "print('The mean age of the smokers is {0}, with a standard deviation of {1}'.format(data[data[\"Smoker\"]==\"Yes\"][\"Age\"].mean(),data[data[\"Smoker\"]==\"Yes\"][\"Age\"].std()))" ] }, { "cell_type": "code", "execution_count": 33, "metadata": { "hideCode": false, "hidePrompt": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The mean age of the non-smokers is 49.81584699453551, with a standard deviation of 20.89829374608753\n" ] } ], "source": [ "print('The mean age of the non-smokers is {0}, with a standard deviation of {1}'.format(data[data[\"Smoker\"]==\"No\"][\"Age\"].mean(),data[data[\"Smoker\"]==\"No\"][\"Age\"].std()))" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false }, "source": [ "The non-smokers are thus noticeably older on average than the smokers (about 49.8 versus 44.3 years). Conversely, the ages of the smokers are less spread out. 
This introduces a bias that explains the result of Part 1, which was obtained from the whole group rather than from the age groups: because the smokers in our data are younger, their __overall__ mortality rate is lower than that of the non-smokers, even though they smoke. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Part 3\n", "\n", "__Instructions: To avoid the bias induced by arbitrary and irregular age groupings, one can try a logistic regression. Introducing a variable Death equal to 1 or 0 according to whether the individual died during the 20-year period, one can fit the model Death ~ Age to study the probability of death as a function of age, for the smokers and for the non-smokers separately. Do these regressions allow you to conclude on the harmfulness of smoking? You may propose a graphical representation of these regressions (without omitting the confidence regions).__\n", "\n", "We start again from the initial data, `data`. We build two lists, `data_rate_smoker` and `data_rate_nosmoker`, containing the mortality rate at a given age $i$, computed as the average mortality rate over the interval $[i-10,i+10]$, which is 20 years wide. To do so, we repeat the previous steps: we extract the women whose age lies in the interval, then compute the mortality rate for the smokers and for the non-smokers. We store each age $i$ in a list `age` to make plotting easier. We let $i$ range from 18 to the maximum age." ] }, { "cell_type": "code", "execution_count": 34, "metadata": { "hideCode": false, "hidePrompt": false }, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "data_rate_smoker = []\n", "data_rate_nosmoker = []\n", "age = []\n", "for i in range(18,int(data[\"Age\"].max())+1):\n", " extract = data[data[\"Age\"]>=i-10]\n", " extract = extract[extract[\"Age\"]