{ "cells": [ { "cell_type": "markdown", "metadata": { "hideCode": true, "hidePrompt": true }, "source": [ "# A study of Simpson's paradox" ] }, { "cell_type": "markdown", "metadata": { "hideCode": true, "hidePrompt": true }, "source": [ "So what exactly is Simpson's paradox?" ] }, { "cell_type": "markdown", "metadata": { "hideCode": true, "hidePrompt": true }, "source": [ "It is a rather interesting paradox that hides in datasets. Knowing about it helps avoid serious mistakes, especially in medicine, where health data must be analyzed correctly.\n", "\n", "It arises in at least two situations:\n", "1) A confounding factor that is not obvious at first glance still affects the final result.\n", "2) The data are not evenly distributed across the groups being compared, so the groups carry very different weights in the aggregate.\n", "\n", "It is a useful way to sharpen one's critical thinking about what is presented in the literature or on TV, so as not to be misled. A simple safeguard is to check how the data are distributed across subgroups before drawing conclusions from the aggregate." ] }, { "cell_type": "markdown", "metadata": { "hideCode": true, "hidePrompt": true }, "source": [ "To understand it, we will use a historical dataset comparing the mortality of women who smoked with that of women who did not, over a 20-year study: Appleton, David R., Joyce M. French, and Mark PJ Vanderpump. « Ignoring a covariate: An example of Simpson’s paradox. » The American Statistician 50.4 (1996): 340-341." 
] }, { "cell_type": "markdown", "metadata": { "hideCode": true, "hidePrompt": true }, "source": [ "Before starting, let's import the modules needed for the analysis:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "hideCode": true, "hidePrompt": true }, "outputs": [], "source": [ "%matplotlib inline\n", "import matplotlib.pyplot as plt\n", "import pandas as pd\n" ] }, { "cell_type": "markdown", "metadata": { "hideCode": true, "hidePrompt": true }, "source": [ "The data are available at the following URL:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "hideCode": true, "hideOutput": true, "hidePrompt": true }, "outputs": [], "source": [ "data_url = \"https://gitlab.inria.fr/learninglab/mooc-rr/mooc-rr-ressources/-/raw/master/module3/Practical_session/Subject6_smoking.csv?inline=false\"" ] }, { "cell_type": "markdown", "metadata": { "hideCode": true, "hidePrompt": true }, "source": [ "Let's read this dataset into a pandas DataFrame so that we can analyze it properly." ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "hideCode": true, "hideOutput": true, "hidePrompt": true }, "outputs": [], "source": [ "raw_data = pd.read_csv(data_url)" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "hideCode": true, "hidePrompt": true }, "outputs": [ { "data": { "text/html": [ "
| | Smoker | Status | Age |
|---|---|---|---|
| 0 | Yes | Alive | 21.0 |
| 1 | Yes | Alive | 19.3 |
| 2 | No | Dead | 57.5 |
| 3 | No | Alive | 47.1 |
| 4 | Yes | Alive | 81.4 |
| 5 | No | Alive | 36.8 |
| 6 | No | Alive | 23.8 |
| 7 | Yes | Dead | 57.5 |
| 8 | Yes | Alive | 24.8 |
| 9 | Yes | Alive | 49.5 |
| 10 | Yes | Alive | 30.0 |
| 11 | No | Dead | 66.0 |
| 12 | Yes | Alive | 49.2 |
| 13 | No | Alive | 58.4 |
| 14 | No | Dead | 60.6 |
| 15 | No | Alive | 25.1 |
| 16 | No | Alive | 43.5 |
| 17 | No | Alive | 27.1 |
| 18 | No | Alive | 58.3 |
| 19 | Yes | Alive | 65.7 |
| 20 | No | Dead | 73.2 |
| 21 | Yes | Alive | 38.3 |
| 22 | No | Alive | 33.4 |
| 23 | Yes | Dead | 62.3 |
| 24 | No | Alive | 18.0 |
| 25 | No | Alive | 56.2 |
| 26 | Yes | Alive | 59.2 |
| 27 | No | Alive | 25.8 |
| 28 | No | Dead | 36.9 |
| 29 | No | Alive | 20.2 |
| ... | ... | ... | ... |
| 1284 | Yes | Dead | 36.0 |
| 1285 | Yes | Alive | 48.3 |
| 1286 | No | Alive | 63.1 |
| 1287 | No | Alive | 60.8 |
| 1288 | Yes | Dead | 39.3 |
| 1289 | No | Alive | 36.7 |
| 1290 | No | Alive | 63.8 |
| 1291 | No | Dead | 71.3 |
| 1292 | No | Alive | 57.7 |
| 1293 | No | Alive | 63.2 |
| 1294 | No | Alive | 46.6 |
| 1295 | Yes | Dead | 82.4 |
| 1296 | Yes | Alive | 38.3 |
| 1297 | Yes | Alive | 32.7 |
| 1298 | No | Alive | 39.7 |
| 1299 | Yes | Dead | 60.0 |
| 1300 | No | Dead | 71.0 |
| 1301 | No | Alive | 20.5 |
| 1302 | No | Alive | 44.4 |
| 1303 | Yes | Alive | 31.2 |
| 1304 | Yes | Alive | 47.8 |
| 1305 | Yes | Alive | 60.9 |
| 1306 | No | Dead | 61.4 |
| 1307 | Yes | Alive | 43.0 |
| 1308 | No | Alive | 42.1 |
| 1309 | Yes | Alive | 35.9 |
| 1310 | No | Alive | 22.3 |
| 1311 | Yes | Dead | 62.1 |
| 1312 | No | Dead | 88.6 |
| 1313 | No | Alive | 39.1 |

1314 rows × 3 columns
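
With columns like the ones in the table above (Smoker, Status, Age), the paradox can be illustrated by comparing the overall death rate per smoking status with the death rate inside each age group. The sketch below is only an illustration under stated assumptions: the `toy` DataFrame, its values, the `Dead` and `AgeGroup` helper columns and the 55-year cut-off are all hypothetical and are not taken from the study data.

```python
import pandas as pd

# Hypothetical toy data with the same columns as the table above.
# The values are invented for illustration; they are NOT the study data.
rows = (
    [('Yes', 'Alive', a) for a in (21.0, 25.0, 30.0, 35.0, 40.0)]
    + [('Yes', 'Dead', 45.0)]
    + [('Yes', 'Dead', a) for a in (65.0, 75.0)]
    + [('No', 'Alive', a) for a in (23.0, 33.0)]
    + [('No', 'Alive', a) for a in (60.0, 62.0)]
    + [('No', 'Dead', a) for a in (66.0, 70.0, 73.0, 80.0)]
)
toy = pd.DataFrame(rows, columns=['Smoker', 'Status', 'Age'])

# Boolean helper column: the mean of a boolean column is a proportion.
toy['Dead'] = toy['Status'].eq('Dead')

# Two age classes; the 55-year threshold is an arbitrary assumption.
toy['AgeGroup'] = pd.cut(toy['Age'], bins=[0, 55, 120],
                         labels=['up to 55', 'over 55'])

# Aggregate view: death rate per smoking status, ignoring age.
overall = toy.groupby('Smoker')['Dead'].mean()

# Stratified view: death rate per smoking status within each age class.
by_age = toy.groupby(['AgeGroup', 'Smoker'], observed=True)['Dead'].mean()

print(overall)
print(by_age)
```

In this toy example the smokers are mostly young and the non-smokers mostly old, so the aggregate rate is lower for smokers even though it is higher for smokers inside each age class. Age plays the role of the confounding factor.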
| | Smoker | Status | Age |
|---|---|---|---|
| 0 | Yes | Alive | 21.0 |
| 1 | Yes | Alive | 19.3 |
| 4 | Yes | Alive | 81.4 |
| 7 | Yes | Dead | 57.5 |
| 8 | Yes | Alive | 24.8 |
| 9 | Yes | Alive | 49.5 |
| 10 | Yes | Alive | 30.0 |
| 12 | Yes | Alive | 49.2 |
| 19 | Yes | Alive | 65.7 |
| 21 | Yes | Alive | 38.3 |
| 23 | Yes | Dead | 62.3 |
| 26 | Yes | Alive | 59.2 |
| 30 | Yes | Alive | 34.6 |
| 31 | Yes | Alive | 51.9 |
| 32 | Yes | Alive | 49.9 |
| 35 | Yes | Alive | 46.7 |
| 36 | Yes | Alive | 44.4 |
| 37 | Yes | Alive | 29.5 |
| 38 | Yes | Dead | 33.0 |
| 39 | Yes | Alive | 35.6 |
| 40 | Yes | Alive | 39.1 |
| 42 | Yes | Alive | 35.7 |
| 46 | Yes | Dead | 44.3 |
| 48 | Yes | Alive | 37.5 |
| 49 | Yes | Alive | 22.1 |
| 53 | Yes | Alive | 39.0 |
| 56 | Yes | Alive | 40.1 |
| 60 | Yes | Alive | 58.1 |
| 61 | Yes | Alive | 37.3 |
| 63 | Yes | Dead | 36.3 |
| ... | ... | ... | ... |
| 1240 | Yes | Alive | 29.7 |
| 1243 | Yes | Alive | 40.1 |
| 1251 | Yes | Alive | 27.8 |
| 1252 | Yes | Alive | 52.4 |
| 1253 | Yes | Alive | 27.8 |
| 1254 | Yes | Alive | 41.0 |
| 1259 | Yes | Alive | 40.8 |
| 1260 | Yes | Alive | 20.4 |
| 1263 | Yes | Alive | 20.9 |
| 1264 | Yes | Alive | 45.5 |
| 1269 | Yes | Alive | 38.8 |
| 1270 | Yes | Alive | 55.5 |
| 1271 | Yes | Alive | 24.9 |
| 1273 | Yes | Alive | 55.7 |
| 1276 | Yes | Alive | 58.5 |
| 1278 | Yes | Alive | 43.7 |
| 1282 | Yes | Alive | 51.2 |
| 1284 | Yes | Dead | 36.0 |
| 1285 | Yes | Alive | 48.3 |
| 1288 | Yes | Dead | 39.3 |
| 1295 | Yes | Dead | 82.4 |
| 1296 | Yes | Alive | 38.3 |
| 1297 | Yes | Alive | 32.7 |
| 1299 | Yes | Dead | 60.0 |
| 1303 | Yes | Alive | 31.2 |
| 1304 | Yes | Alive | 47.8 |
| 1305 | Yes | Alive | 60.9 |
| 1307 | Yes | Alive | 43.0 |
| 1309 | Yes | Alive | 35.9 |
| 1311 | Yes | Dead | 62.1 |

582 rows × 3 columns
\n", "