{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Autour du Paradoxe de Simpson" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "%matplotlib inline\n", "import matplotlib.pyplot as plt\n", "import pandas as pd\n", "import numpy as np" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "En 1972-1974, à Whickham, une ville du nord-est de l'Angleterre, située à environ 6,5 kilomètres au sud-ouest de Newcastle upon Tyne, un sondage d'un sixième des électeurs a été effectué afin d'éclairer des travaux sur les maladies thyroïdiennes et cardiaques (Tunbridge et al. 1977). Une suite de cette étude a été menée vingt ans plus tard (Vanderpump et al. 1995). Certains des résultats avaient trait au tabagisme et cherchaient à savoir si les individus étaient toujours en vie lors de la seconde étude. Par simplicité, nous nous restreindrons aux femmes et parmi celles-ci aux 1314 qui ont été catégorisées comme \"fumant actuellement\" ou \"n'ayant jamais fumé\". Il y avait relativement peu de femmes dans le sondage initial ayant fumé et ayant arrêté depuis (162) et très peu pour lesquelles l'information n'était pas disponible (18). La survie à 20 ans a été déterminée pour l'ensemble des femmes du premier sondage." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Les données de ces études sont disponibles sur le gitlab de l'inria dans un [document csv](https://gitlab.inria.fr/learninglab/mooc-rr/mooc-rr-ressources/-/blob/master/module3/Practical_session/Subject6_smoking.csv). Dans ce document, chaque ligne indique si la personne fume ou non, si elle est vivante ou décédée au moment de la seconde étude, et son âge lors du premier sondage. 
We always download the complete data set from this file.\n" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "data_url = \"https://gitlab.inria.fr/learninglab/mooc-rr/mooc-rr-ressources/-/raw/master/module3/Practical_session/Subject6_smoking.csv\" " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To protect ourselves against the GitLab server disappearing or changing, we make a local copy of this data set and keep it alongside our analysis. Downloading the data at every execution is unnecessary and even risky: if the server malfunctions, we could overwrite our data with a corrupted file. For this reason, we download the data only if the local copy does not exist." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "data_file = \"survey-data-subject6.csv\"\n", "\n", "import os\n", "import urllib.request\n", "if not os.path.exists(data_file):\n", "    urllib.request.urlretrieve(data_url, data_file)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The file has three columns: the first gives each person's smoking habit, the second tells whether she was alive or dead at the time of the second study, and the third gives her age at the first study." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " 
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
SmokerStatusAge
0YesAlive21.0
1YesAlive19.3
2NoDead57.5
3NoAlive47.1
4YesAlive81.4
5NoAlive36.8
6NoAlive23.8
7YesDead57.5
8YesAlive24.8
9YesAlive49.5
10YesAlive30.0
11NoDead66.0
12YesAlive49.2
13NoAlive58.4
14NoDead60.6
15NoAlive25.1
16NoAlive43.5
17NoAlive27.1
18NoAlive58.3
19YesAlive65.7
20NoDead73.2
21YesAlive38.3
22NoAlive33.4
23YesDead62.3
24NoAlive18.0
25NoAlive56.2
26YesAlive59.2
27NoAlive25.8
28NoDead36.9
29NoAlive20.2
............
1284YesDead36.0
1285YesAlive48.3
1286NoAlive63.1
1287NoAlive60.8
1288YesDead39.3
1289NoAlive36.7
1290NoAlive63.8
1291NoDead71.3
1292NoAlive57.7
1293NoAlive63.2
1294NoAlive46.6
1295YesDead82.4
1296YesAlive38.3
1297YesAlive32.7
1298NoAlive39.7
1299YesDead60.0
1300NoDead71.0
1301NoAlive20.5
1302NoAlive44.4
1303YesAlive31.2
1304YesAlive47.8
1305YesAlive60.9
1306NoDead61.4
1307YesAlive43.0
1308NoAlive42.1
1309YesAlive35.9
1310NoAlive22.3
1311YesDead62.1
1312NoDead88.6
1313NoAlive39.1
\n", "

1314 rows × 3 columns

\n", "
" ], "text/plain": [ " Smoker Status Age\n", "0 Yes Alive 21.0\n", "1 Yes Alive 19.3\n", "2 No Dead 57.5\n", "3 No Alive 47.1\n", "4 Yes Alive 81.4\n", "5 No Alive 36.8\n", "6 No Alive 23.8\n", "7 Yes Dead 57.5\n", "8 Yes Alive 24.8\n", "9 Yes Alive 49.5\n", "10 Yes Alive 30.0\n", "11 No Dead 66.0\n", "12 Yes Alive 49.2\n", "13 No Alive 58.4\n", "14 No Dead 60.6\n", "15 No Alive 25.1\n", "16 No Alive 43.5\n", "17 No Alive 27.1\n", "18 No Alive 58.3\n", "19 Yes Alive 65.7\n", "20 No Dead 73.2\n", "21 Yes Alive 38.3\n", "22 No Alive 33.4\n", "23 Yes Dead 62.3\n", "24 No Alive 18.0\n", "25 No Alive 56.2\n", "26 Yes Alive 59.2\n", "27 No Alive 25.8\n", "28 No Dead 36.9\n", "29 No Alive 20.2\n", "... ... ... ...\n", "1284 Yes Dead 36.0\n", "1285 Yes Alive 48.3\n", "1286 No Alive 63.1\n", "1287 No Alive 60.8\n", "1288 Yes Dead 39.3\n", "1289 No Alive 36.7\n", "1290 No Alive 63.8\n", "1291 No Dead 71.3\n", "1292 No Alive 57.7\n", "1293 No Alive 63.2\n", "1294 No Alive 46.6\n", "1295 Yes Dead 82.4\n", "1296 Yes Alive 38.3\n", "1297 Yes Alive 32.7\n", "1298 No Alive 39.7\n", "1299 Yes Dead 60.0\n", "1300 No Dead 71.0\n", "1301 No Alive 20.5\n", "1302 No Alive 44.4\n", "1303 Yes Alive 31.2\n", "1304 Yes Alive 47.8\n", "1305 Yes Alive 60.9\n", "1306 No Dead 61.4\n", "1307 Yes Alive 43.0\n", "1308 No Alive 42.1\n", "1309 Yes Alive 35.9\n", "1310 No Alive 22.3\n", "1311 Yes Dead 62.1\n", "1312 No Dead 88.6\n", "1313 No Alive 39.1\n", "\n", "[1314 rows x 3 columns]" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "raw_data = pd.read_csv(data_file)\n", "raw_data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Pour nous assurer que le jeu de données est complet, nous vérifions qu'il n'y a pas d'informations manquantes conernant l'une des personnes du sondage. Après vérification, il n'y a pas de données manquantes." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
SmokerStatusAge
\n", "
" ], "text/plain": [ "Empty DataFrame\n", "Columns: [Smoker, Status, Age]\n", "Index: []" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "raw_data[raw_data.isnull().any(axis=1)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Effectif et taux de mortalite" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Nous calculons le nombre total de femmes vivantes et décédées sur la période en fonction de leur habitude de tabagisme" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "alive_and_smoker = 0\n", "alive_and_non_smoker = 0\n", "dead_and_smoker = 0\n", "dead_and_non_smoker = 0\n", "for i in range(len(raw_data)):\n", " if raw_data.iloc[i][0] == \"Yes\":\n", " if raw_data.iloc[i][1] == \"Alive\":\n", " alive_and_smoker += 1\n", " else :\n", " dead_and_smoker += 1\n", " else :\n", " if raw_data.iloc[i][1] == \"Alive\":\n", " alive_and_non_smoker += 1\n", " else :\n", " dead_and_non_smoker += 1\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "D'apres nos calculs, dans l'etude il y avait 582 fumeuses dont 139 sont mortes et 732 non-fumeuses dont 230 sont decedees. Nous représentons ensuite ces données sous la forme d'un tableau." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
SmokerNon-SmokerTotal
Alive443502945
Dead139230369
Total5827321314
\n", "
" ], "text/plain": [ " Smoker Non-Smoker Total\n", "Alive 443 502 945\n", "Dead 139 230 369\n", "Total 582 732 1314" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data = [[alive_and_smoker,alive_and_non_smoker,(alive_and_smoker+alive_and_non_smoker)],[dead_and_smoker, dead_and_non_smoker,(dead_and_non_smoker+dead_and_smoker)], [(dead_and_smoker+alive_and_smoker),(dead_and_non_smoker+alive_and_non_smoker),(alive_and_smoker+alive_and_non_smoker + dead_and_non_smoker+dead_and_smoker)]]\n", "\n", "pd.DataFrame(data, columns=[\"Smoker\", \"Non-Smoker\", \"Total\"], index = [\"Alive\", \"Dead\",\"Total\"])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A present, nous allons calculer le taux de mortalite dans chacun de ces deux groupes. Pour cela, nous allons determiner le rapport entre le nombre de femmes décédées dans un groupe et le nombre total de femmes dans ce groupe.\n", "\n", "Le taux de mortalite chez les fumeuses etait de 24% tandis que celui des non-fumeuses etait de 31%. Nous obtenons un resultat assez surprenant car d'apres ces etudes, les femmes non-fumeuses meurent plus que les femmes qui fument, ce qui est contraire aux campagnes de prevention que l'on peut croiser un peu partout." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "mortality_rate_smoker = dead_and_smoker/(alive_and_smoker+dead_and_smoker)\n", "mortality_rate_non_smoker = dead_and_non_smoker /(alive_and_non_smoker + dead_and_non_smoker)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Nous representons les taux de mortalite calcules precedemment dans un diagramme de barres afin d'illustrer visuellement nos resultats et le fait que, d'apres ces sondages, les femmes qui fument meurent moins que celle qui ne fument pas." 
] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Text(0,0.5,'Mortality Rate')" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAZ8AAAEKCAYAAADNSVhkAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMi4zLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvIxREBQAAE3hJREFUeJzt3XvUXXV95/H3h/tNpEJAEELwRgsqUSKC1ap4AxSYES3qoIAXxq5Rl7ZdFMexS6S1LdraOqNl8AJRp1oVqYACWhU7IiBJgZB4oYjcjDe8Eh0Qwnf+2DtwjHlOTkL27wkn79daZz1n//Zv7/M9WTnP5/ntvc9vp6qQJKmlzWa7AEnSpsfwkSQ1Z/hIkpozfCRJzRk+kqTmDB9JUnOGjySpOcNHktSc4SNJam6L2S5gY7XLLrvUvHnzZrsMSXpAWbx48W1VNWdt/QyfGcybN49FixbNdhmS9ICS5KZJ+nnYTZLUnOEjSWrO8JEkNWf4SJKaM3wkSc0NGj5JDkvyrSTXJzllDeuPTrIkydVJFiV5ysi6Dyb5YZKlM+z7T5NUkl365WcnWZzk2v7noSN9/zLJLUlWDPE+JUnrZrDwSbI58B7gcGA/4CVJ9lut2xeAA6pqPvAK4P0j684GDpth33sBzwZuHmm+DTiyqh4LHA98eGTd+cBB6/1mJEkb1JAjn4OA66vqhqr6NfAx4OjRDlW1ou67j/f2QI2s+zfgJzPs+13Ayav1v6qqlveLy4Btkmzdr7u8qr63Ad6TJGkDGDJ8HgbcMrJ8a9/2G5L85yTfBD5DN/oZK8lRwHer6pox3Y4BrqqqO9etZElSC0POcJA1tNVvNVSdC5yb5A+A04BnzbjDZDvgzcBzxvTZH/ibcX3GbHsScBLA3Llz13VzSVPm2M/8cLZLaO6fn7drk9cZcuRzK7DXyPKewPIZ+q46zPaIVRcQzOARwD7ANUlu7Pf570keCpBkT+Bc4OVV9e11LbiqzqyqBVW1YM6ctU5NJElaT0OOfK4EHpVkH+C7wIuBl452SPJI4NtVVUmeAGwF/HimHVbVtcCuI9vfCCyoqtuS7ER36O5NVXXphn4zkqQNZ7CRT1XdDbwWuBj4BvDxqlqW5DVJXtN3OwZYmuRquivjjl11AUKSjwKXAfsmuTXJK9fykq8FHgm8pb90++oku/b7Oj3JrcB2/b7euoHfriRpHeS+i800asGCBeWs1tKmzXM+6y7J4qpasLZ+znAgSWrO8JEkNWf4SJKaM3wkSc0ZPpKk5gwfSVJzho8kqTnDR5LUnOEjSWrO8JEkNWf4SJKaM3wkSc0ZPpKk5gwfSVJzho8kqTnDR5LUnOEjSWrO8JEkNWf4SJKaM3wkSc0ZPpKk5gwfSVJzho8kqTnDR5LUnOEjSWrO8JEkNWf4SJKaM3wkSc0ZPpKk5gwfSVJzho8kqTnDR5LUnOEjSWrO8JEkNWf4SJKaM3wkSc0ZPpKk5gwfSVJzho8kqTnDR5LUnOEjSWpu0PBJcliSbyW5Pskpa1j/X5Is6R9fTXJA375Nkq8luSbJsiSnjmwzP8nlSa5OsijJQX37lkkWJrk2yTeSvGlkm4tG9nVGks2HfN+SpPEGC5/+F/x7gMOB/YCXJNlvtW7fAZ5WVY8DTgPO7NvvBA6tqgOA+cBhSQ7u150OnFpV84E/75cBXgRsXVWPBQ4E/muSef26P+z39RhgTt9XkjRLhhz5HARcX1U3VNWvgY
8BR492qKqvVtVP+8XLgT379qqqFX37lv2jVm0G7Ng/fzCwfKR9+yRbANsCvwZ+0e/vF32fLYCtRvYlSZoFWwy474cBt4ws3wo8aUz/VwIXrlroR06LgUcC76mqK/pVbwAuTvJOuvB8ct/+Sbpw+x6wHfDGqvrJyP4upgvEC/u+vyXJScBJAHPnzp3oTa7J987Oem/7QLX7Cea5pMkNOfJZ02/gNf6GSvIMuvD5s3s7Vq3sD63tCRyU5DH9qj+iC5a9gDcCH+jbDwJWAnsA+wB/kuThI/t7LrA7sDVw6JrqqKozq2pBVS2YM2fOxG9UkrRuhgyfW4G9Rpb35L5DZPdK8jjg/cDRVfXj1ddX1c+AS4DD+qbjgU/1zz9BFzoALwUuqqq7quqHwKXAgtX2dQdwHqsd/pMktTVk+FwJPCrJPkm2Al5M94v/Xknm0gXJy6rqupH2OUl26p9vCzwL+Ga/ejnwtP75ocB/9M9vBg5NZ3vgYOCbSXZIsnu/ry2AI0b2JUmaBYOd86mqu5O8FrgY2Bz4YFUtS/Kafv0ZdFer7Qy8NwnA3VW1gO7w2ML+vM9mwMer6oJ+168G/qEPkjvoz9HQXVl3FrCU7pDfWVW1JMluwHlJtu7r+CJwxlDvW5K0dkNecEBVfRb47GptZ4w8fxXwqjVstwR4/Az7/ArdpdSrt69gDZdQV9UPgCeua+2SpOE4w4EkqTnDR5LUnOEjSWrO8JEkNWf4SJKaM3wkSc0ZPpKk5gwfSVJzho8kqTnDR5LU3FrDJ8luST6Q5MJ+eb8krxy+NEnStJpk5HM23eSge/TL19Hd0E2SpPUySfjsUlUfB+6BbrZqupu2SZK0XiYJn18m2Zn+LqRJDgZ+PmhVkqSpNsktFf6Y7iZwj0hyKTCHNdy6QJKkSU0SPsvo7hy6L91N2r6FV8lJku6HSULksqq6u6qWVdXSqroLuGzowiRJ02vGkU+ShwIPA7ZN8ni6UQ/AjsB2DWqTJE2pcYfdngucAOwJ/N1I++3Afx+wJknSlJsxfKpqIbAwyTFVdU7DmiRJU26tFxxU1TlJngfsD2wz0v62IQuTJE2vSabXOQM4Fngd3XmfFwF7D1yXJGmKTXK125Or6uXAT6vqVOAQYK9hy5IkTbNJwuf/9T9/lWQP4C5gn+FKkiRNu0m+ZHpBkp2AdwD/TjfNzvsHrUqSNNUmueDgtP7pOUkuALapKud2kyStt3WaJqeq7gQOSvL5geqRJG0CZgyfJIcmuS7JiiQf6W8itwj4a+Af25UoSZo240Y+fwucBOwMfBK4HPhwVR1YVZ9qUZwkaTqNO+dTVXVJ//xfkvyoqv6hQU2SpCk3Lnx2SvKCkeWMLjv6kSStr3Hh82XgyBmWCzB8JEnrZdzEoie2LESStOnwjqSSpOYMH0lSc4aPJKm5SW6psCjJf0vyOy0KkiRNv0lGPi8G9gCuTPKxJM9NkoHrkiRNsbWGT1VdX1VvBh4N/BPwQeDmJKcmecjQBUqSps9E53ySPI5uup13AOcALwR+AXxxuNIkSdNqrbdUSLIY+BnwAeCUfmZrgCuS/P6QxUmSptMkN5N7UVXdMNqQZJ+q+k5VvWCmjSRJmskkh90+OWGbJEkTmXHkk+R3gf2BB682weiOwDZDFyZJml7jDrvtCzwf2InfnGD0duDVQxYlSZpu4yYW/TTw6SSHVNVlDWuSJE25cYfdTq6q04GXJnnJ6uur6vWDViZJmlrjDrt9o/+5qEUhkqRNx7jDbuf3Pxe2K0eStCkYd9jtfLo7lq5RVR01SEWSpKk37rDbO5tVIUnapIw77PblloVIkjYdk8zt9ijgr4D9GPlyaVU9fMC6JElTbJLpdc4C/hG4G3gG8CHgw0MWJUmabpOEz7ZV9QUgVXVTVb0VOHTYsiRJ02ySWa3vSLIZ8B9JXgt8F9h12LIkSdNskpHPG4DtgNcDBwLHAS8fsihJ0nSbJHzmVdWKqrq1qk6sqmOAuUMXJk
maXpOEz5smbJMkaSLjZjg4HDgCeFiSd4+s2pHuyjdJktbLuAsOltNNKnoUsHik/XbgjUMWJUmabuNmOLgmyVLgOU4uKknakMae86mqlcDOSbZqVI8kaRMwyfd8bgIuTXIe8MtVjVX1d4NVJUmaapOEz/L+sRnwoGHLkSRtCtYaPlV1KkCSB3WLtWLwqiRJU22t3/NJ8pgkVwFLgWVJFifZf/jSJEnTapIvmZ4J/HFV7V1VewN/Arxv2LIkSdNskvDZvqq+tGqhqi4Bth+sIknS1JvkgoMbkryF++7hcxzwneFKkiRNu0lGPq8A5gCfAs7tn584ZFGSpOk2ydVuP6W7nYIkSRvEuIlFzxu3YVUdteHLkSRtCsaNfA4BbgE+ClwBpElFkqSpNy58Hgo8G3gJ8FLgM8BHq2pZi8IkSdNrxgsOqmplVV1UVccDBwPXA5ckeV2z6iRJU2nsBQdJtgaeRzf6mQe8m+6qN0mS1tu4Cw4WAo8BLgROraqlzaqSJE21cSOfl9HdQuHRwOuTe683CN0EozsOXJskaUqNu5PpJF9AlSRpnRkwkqTmDB9JUnOGjySpOcNHktSc4SNJas7wkSQ1Z/hIkpozfCRJzRk+kqTmDB9JUnOGjySpOcNHktSc4SNJas7wkSQ1Z/hIkpozfCRJzRk+kqTmDB9JUnOGjySpOcNHktSc4SNJas7wkSQ1Z/hIkpozfCRJzRk+kqTmDB9JUnOGjySpOcNHktSc4SNJas7wkSQ1Z/hIkpozfCRJzRk+kqTmDB9JUnOGjySpOcNHktSc4SNJas7wkSQ1Z/hIkpozfCRJzRk+kqTmDB9JUnOGjySpOcNHktSc4SNJas7wkSQ1Z/hIkpozfCRJzRk+kqTmDB9JUnOGjySpOcNHktSc4SNJas7wkSQ1Z/hIkpozfCRJzRk+kqTmDB9JUnOGjySpOcNHktSc4SNJas7wkSQ1Z/hIkpozfCRJzRk+kqTmDB9JUnOGjySpOcNHktSc4SNJas7wkSQ1Z/hIkpozfCRJzaWqZruGjVKSHwE3zXYd62gX4LbZLkLSBvFA/TzvXVVz1tbJ8JkiSRZV1YLZrkPS/Tftn2cPu0mSmjN8JEnNGT7T5czZLkDSBjPVn2fP+UiSmnPkI0lqzvCZUkmenuSC2a5DUltJbkyyy2zXsTaGj35Lki1muwZJ7SXZvNVrGT4biSTbJ/lMkmuSLE1ybP8XzNuTXJZkUZInJLk4ybeTvKbfLkne0W9zbZJj17DvJya5KsnD+9f5YJIr+7aj+z4nJPlEkvOBzzV++9JGLcm8JN9I8r4ky5J8Lsm2SeYnuTzJkiTnJvmdvv8lSf4mydeSXJfkqTPs9/VJvt5v/7G+7a1JFvavcWOSFyQ5vf98X5Rky77fM/vP8LX9Z3rr1fa9bd//1f3ycX09Vyf536uCJsmKJG9LcgVwyID/jL+pqnxsBA/gGOB9I8sPBm4E/qhffhewBHgQMAf44ch2nwc2B3YDbgZ2B54OXAA8GVgMzO37vx04rn++E3AdsD1wAnAr8JDZ/rfw4WNjewDzgLuB+f3yx4Hj+s/k0/q2twF/3z+/BPjb/vkRwL/OsN/lwNb98536n28FvgJsCRwA/Ao4vF93LvCfgG2AW4BH9+0fAt7QP7+xr/dfgZf3bb8HnA9s2S+/d2RdAX/Y+t/Ukc/G41rgWf1fS0+tqp/37eeNrL+iqm6vqh8BdyTZCXgK8NGqWllVPwC+DDyx3+b36C7XPLKqbu7bngOckuRqug/INsDcft3nq+onA75H6YHsO1V1df98MfAIusD4ct+2EPiDkf6fGuk7b4Z9LgH+T5Lj6MJtlQur6i66z/3mwEV9+7X9vvbt67luhtf+NHBWVX2oX34mcCBwZf/Zfybw8H7dSuCcmd/2MDy2v5GoquuSHEj3V9JfJVl16OvO/uc9I89XLW8BZMxuv0cXLo+n+wuLvv8xVfWt0Y5JngT88n69CWm6jX7+VtIdOZ
ik/0r637VJzqL/PFbVEcDz6ELjKOAtSfYf3baq7klyV/VDFCb73ANcChye5J/6bQMsrKo3raHvHVW1ci372+Ac+WwkkuwB/KqqPgK8E3jChJv+G3Bsks2TzKH7j/y1ft3P6P5zvz3J0/u2i4HXJUn/uo/fQG9B2tT8HPjpyPmcl9EdeZhRVZ1YVfOr6ogkmwF7VdWXgJPpwmyHCV/7m8C8JI+c4bX/HPgx3eE1gC8AL0yyK0CShyTZe8LXGoThs/F4LPC1fkj8ZuAvJtzuXLqh+zXAF4GTq+r7q1b2h+KOBN7Tj25OozuWvCTJ0n5Z0vo5HnhHkiXAfLrzPpPaHPhIkmuBq4B3VdXPJtmwqu4ATgQ+0W9/D3DGat3eAGyT5PSq+jrwP4DP9bV+nu7c8KxxhgNJUnOOfCRJzRk+kqTmDB9JUnOGjySpOcNHktSc4SMNLMmb+/nAlvTzaj3pfu5vjTOWJzkqySn3Z99SK85wIA0oySHA84EnVNWd/VT3Ww3xWlV1HvdNxyRt1Bz5SMPaHbitqlZNl3JbVS0faMbyE5L8r7797CTvTvLVJDckeWHfvlmS9/YjsQuSfHbVOqklw0ca1ueAvfpp9d+b5Gkj626pqkOA/wucDbwQOJj7viX/ArpvzR8APIvum/T3fis9yZPpvtV+dFXdsIbX3p1u4tnnA389ss95dDNqvIqWU+hLIzzsJg2oqlb0E8Y+FXgG8M8j52VGZyzfoapuB25P8lszlgM/SLJqxvJfcN+M5c+pquWs2b9U1T3A15Ps1rc9BfhE3/79JF/asO9YmozhIw2sD49LgEv6ebiO71dt6BnLVze6z6z2U5pVHnaTBpRk3ySPGmmaD9w04ebrOmP5JL4CHNOf+9mN7qaDUnOOfKRh7QD8z/4w2t3A9cBJdOdh1uZcunMy19DdbfLkqvp+kt+FbsbyJEcCFyZ5xYT1nEN3I7GldHexvYLu1gBSU85qLW1ikuzQn4vamW4k9fujt+GQWnDkI216LuhHYlsBpxk8mg2OfCRJzXnBgSSpOcNHktSc4SNJas7wkSQ1Z/hIkpozfCRJzf1/LKn6aLjRV9QAAAAASUVORK5CYII=\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "mortality_rate = [mortality_rate_smoker,mortality_rate_non_smoker]\n", "smoking = ['smoker', 'non-smoker']\n", "plt.bar(smoking, mortality_rate,color=['#E69F00', '#56B4E9'],width = 0.25)\n", "plt.xticks(smoking)\n", "plt.yticks(mortality_rate)\n", "plt.xlabel('Smoking')\n", "plt.ylabel('Mortality Rate')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Enfin, nous allons estimer le taux de mortalite chez les femmes du au tabagisme a cette epoque sur la population anglaise d'apres les resultats de ces deux etudes. \n", "\n", "Pour faire cela, nous allons calculer des intervalles de confiance a 95% pour chaque categorie (fumeuses et non-fumeuses). La formule generale pour calculer un intervalle de\n", "confiance au niveau de confiance 0,95 est : $ [f-1/\\sqrt{n} ; f+1/\\sqrt{n}]$ si $n\\le30$ et si $nf\\le5$ et $n(1-f)\\le5$ avec $f$ la fréquence observée dans un échantillon de taille $n$. \n", "\n", "Dans notre cas, la frequence observee $f$ correspond au taux de mortalite et la taille $n$ de l'echantillon correspond au nombre de femmes ayant repondu aux sondages soit 1314." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Apres avoir verifie que les conditions pour calculer les intervalles de confiance de chacune des deux categories etaient respectees, nous effectuons les calculs.\n", "\n", "Pour les fumeuses, l'intervalle de confiance a 95% du taux de mortalite chez les femmes est $[0.21 ; 0.27]$. Pour les non-fumeuses, l'intervalle de confiance a 95% du taux de mortalite chez les femmes est $[0.29 ; 0.34]$. 
" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "import math\n", "n = 1314\n", "if (n*mortality_rate_smoker <= 5) and (n*(1-mortality_rate_smoker)<=5):\n", " confidence_interval_smoker_low = mortality_rate_smoker-(1/math.sqrt(n))\n", " confidence_interval_smoker_high = mortality_rate_smoker+(1/math.sqrt(n))\n", "if (n*mortality_rate_non_smoker <= 5) and (n*(1-mortality_rate_non_smoker)<=5):\n", " confidence_interval_non_smoker_low = mortality_rate_non_smoker-(1/math.sqrt(n))\n", " confidence_interval_non_smoker_high = mortality_rate_non_smoker+(1/math.sqrt(n))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Effectif et taux de mortalite par tranches d'age" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Nous allons reprendre les calculs d'effectif et de taux de mortalite calcules precedemment, mais nous allons les categoriser par tranche d'age. Les femmes ayant participe a ces etudes seront reparties dans quatre categories en fonction de leur age : 18-35 ans, 35-55 ans, 55-64 ans, plus de 65 ans." 
] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "582 732\n" ] } ], "source": [ "#class_18_to_35 = []\n", "#class_35_to_55 = []\n", "#class_55_to_64 = []\n", "#class_over_65 = []\n", "\n", "smoker = []\n", "non_smoker = []\n", "\n", "raw_data[\"Status\"].replace({\"Dead\": \"1\", \"Alive\": \"0\"}, inplace=True)\n", "#raw_data[\"Age\"] = raw_data[\"Age\"].astype(str)\n", "\n", "#raw_data\n", "\n", "for i in range(len(raw_data)):\n", " if raw_data.iloc[i][0] == \"Yes\":\n", " smoker.append(raw_data.iloc[i])\n", " else :\n", " non_smoker.append(raw_data.iloc[i])\n", " #if raw_data.iloc[i][2] < 35:\n", " # class_18_to_35.append(raw_data.iloc[i])\n", " #elif 35 <= raw_data.iloc[i][2] < 55:\n", " # class_35_to_55.append(raw_data.iloc[i])\n", " #elif 55 <= raw_data.iloc[i][2] < 65 :\n", " # class_55_to_64.append(raw_data.iloc[i])\n", " #else :\n", " # class_over_65.append(raw_data.iloc[i])\n", "print(len(smoker), len(non_smoker))" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "alive_and_smoker_18to35 = 0\n", "dead_and_smoker_18to35 = 0\n", "alive_and_smoker_35to55 = 0\n", "dead_and_smoker_35to55 = 0\n", "alive_and_smoker_55to64 = 0\n", "dead_and_smoker_55to64 = 0\n", "alive_and_smoker_over65 = 0\n", "dead_and_smoker_over65 = 0\n", "\n", "for i in range(len(smoker)):\n", " if smoker[i][1] == \"0\" :\n", " if smoker[i][2] < 35:\n", " alive_and_smoker_18to35 += 1\n", " elif 35 <= smoker[i][2] < 55:\n", " alive_and_smoker_35to55 += 1\n", " elif 55 <= smoker[i][2] < 65 :\n", " alive_and_smoker_55to64 += 1\n", " else :\n", " alive_and_smoker_over65 += 1\n", " else :\n", " if smoker[i][2] < 35:\n", " dead_and_smoker_18to35 += 1\n", " elif 35 <= smoker[i][2] < 55:\n", " dead_and_smoker_35to55 += 1\n", " elif 55 <= smoker[i][2] < 65 :\n", " dead_and_smoker_55to64 += 1\n", " else :\n", " dead_and_smoker_over65 += 1\n", " \n", 
"alive_and_non_smoker_18to35 = 0\n", "dead_and_non_smoker_18to35 = 0\n", "alive_and_non_smoker_35to55 = 0\n", "dead_and_non_smoker_35to55 = 0\n", "alive_and_non_smoker_55to64 = 0\n", "dead_and_non_smoker_55to64 = 0\n", "alive_and_non_smoker_over65 = 0\n", "dead_and_non_smoker_over65 = 0\n", " \n", "for i in range(len(non_smoker)):\n", " if non_smoker[i][1] == \"0\" :\n", " if non_smoker[i][2] < 35:\n", " alive_and_non_smoker_18to35 += 1\n", " elif 35 <= non_smoker[i][2] < 55:\n", " alive_and_non_smoker_35to55 += 1\n", " elif 55 <= non_smoker[i][2] < 65 :\n", " alive_and_non_smoker_55to64 += 1\n", " else :\n", " alive_and_non_smoker_over65 += 1\n", " else :\n", " if non_smoker[i][2] < 35:\n", " dead_and_non_smoker_18to35 += 1\n", " elif 35 <= non_smoker[i][2] < 55:\n", " dead_and_non_smoker_35to55 += 1\n", " elif 55 <= non_smoker[i][2] < 65 :\n", " dead_and_non_smoker_55to64 += 1\n", " else :\n", " dead_and_non_smoker_over65 += 1\n", " \n", " \n" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Smoker [18-35]Non-Smoker [18-35]Smoker [35-55]Non-Smoker [35-55]Smoker [55-64]Non-Smoker [55-64]Smoker [65+]Non-Smoker [65+]
Alive1822211901726481728
Dead763919514042165
\n", "
" ], "text/plain": [ " Smoker [18-35] Non-Smoker [18-35] Smoker [35-55] Non-Smoker [35-55] \\\n", "Alive 182 221 190 172 \n", "Dead 7 6 39 19 \n", "\n", " Smoker [55-64] Non-Smoker [55-64] Smoker [65+] Non-Smoker [65+] \n", "Alive 64 81 7 28 \n", "Dead 51 40 42 165 " ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data_18to35 = [[alive_and_smoker_18to35,alive_and_non_smoker_18to35],[dead_and_smoker_18to35, dead_and_non_smoker_18to35]]\n", "data_35to55 = [[alive_and_smoker_35to55,alive_and_non_smoker_35to55],[dead_and_smoker_35to55, dead_and_non_smoker_35to55]]\n", "data_55to64 = [[alive_and_smoker_55to64,alive_and_non_smoker_55to64],[dead_and_smoker_55to64, dead_and_non_smoker_55to64]]\n", "data_over65 = [[alive_and_smoker_over65,alive_and_non_smoker_over65],[dead_and_smoker_over65, dead_and_non_smoker_over65]]\n", "\n", "df1 = pd.DataFrame(data_18to35, columns=[\"Smoker [18-35]\", \"Non-Smoker [18-35]\"], index = [\"Alive\", \"Dead\"])\n", "df2 = pd.DataFrame(data_35to55, columns=[\"Smoker [35-55]\", \"Non-Smoker [35-55]\"], index = [\"Alive\", \"Dead\"])\n", "df3 = pd.DataFrame(data_55to64, columns=[\"Smoker [55-64]\", \"Non-Smoker [55-64]\"], index = [\"Alive\", \"Dead\"])\n", "df4 = pd.DataFrame(data_over65, columns=[\"Smoker [65+]\", \"Non-Smoker [65+]\"], index = [\"Alive\", \"Dead\"])\n", "\n", "df_total = pd.concat([df1,df2,df3,df4],axis=1)\n", "df_total" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.037037037037037035 0.02643171806167401\n", "0.1703056768558952 0.09947643979057591\n", "0.4434782608695652 0.3305785123966942\n", "0.8571428571428571 0.8549222797927462\n" ] } ], "source": [ "mortality_rate_smoker_18to35 = dead_and_smoker_18to35/(alive_and_smoker_18to35 + dead_and_smoker_18to35)\n", "mortality_rate_non_smoker_18to35 = dead_and_non_smoker_18to35/(alive_and_non_smoker_18to35 + dead_and_non_smoker_18to35)\n", 
"\n", "mortality_rate_smoker_35to55 = dead_and_smoker_35to55/(alive_and_smoker_35to55 + dead_and_smoker_35to55)\n", "mortality_rate_non_smoker_35to55 = dead_and_non_smoker_35to55/(alive_and_non_smoker_35to55 + dead_and_non_smoker_35to55)\n", "\n", "mortality_rate_smoker_55to64 = dead_and_smoker_55to64/(alive_and_smoker_55to64 + dead_and_smoker_55to64)\n", "mortality_rate_non_smoker_55to64 = dead_and_non_smoker_55to64/(alive_and_non_smoker_55to64 + dead_and_non_smoker_55to64)\n", "\n", "mortality_rate_smoker_over65 = dead_and_smoker_over65/(alive_and_smoker_over65 + dead_and_smoker_over65)\n", "mortality_rate_non_smoker_over65 = dead_and_non_smoker_over65/(alive_and_non_smoker_over65 + dead_and_non_smoker_over65)\n", "\n", "print(mortality_rate_smoker_18to35,mortality_rate_non_smoker_18to35)\n", "print(mortality_rate_smoker_35to55,mortality_rate_non_smoker_35to55)\n", "print(mortality_rate_smoker_55to64,mortality_rate_non_smoker_55to64)\n", "print(mortality_rate_smoker_over65,mortality_rate_non_smoker_over65)\n" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "mortality_rate_smoker = (mortality_rate_smoker_18to35,mortality_rate_smoker_35to55,mortality_rate_smoker_55to64,mortality_rate_smoker_over65)\n", "mortality_rate_non_smoker = (mortality_rate_non_smoker_18to35,mortality_rate_non_smoker_35to55,mortality_rate_non_smoker_55to64,mortality_rate_non_smoker_over65)\n", "age = ['18-35','35-55','55-64','65+']\n", "indices = range(len(mortality_rate_smoker))\n", "width = np.min(np.diff(indices))/3.\n", "\n", "fig = plt.figure()\n", "ax = fig.add_subplot(111)\n", "ax.bar(indices-width/2.,mortality_rate_smoker,width,color='#E69F00',label='Smoker')\n", "ax.bar(indices+width/2.,mortality_rate_non_smoker,width,color='#56B4E9',label='Non-Smoker')\n", "#tiks = ax.get_xticks().tolist()\n", "plt.xticks(indices + width / 2, age)\n", "plt.legend()\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Status Smoker\n", "0 No 502\n", " Yes 443\n", "1 No 230\n", " Yes 139\n", "dtype: int64\n" ] } ], "source": [ "#raw_data[\"Status\"].replace({\"Dead\": \"1\", \"Alive\": \"0\"}, inplace=True)\n", "#raw_data\n", "\n", "count = raw_data.groupby(['Status', 'Smoker']).size() \n", "print(count)" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "raw_data[\"Status\"] = raw_data[\"Status\"].astype(int)\n", "\n", "df_smoker = raw_data[raw_data['Smoker'] == 'Yes']\n", "df_non_smoker = raw_data[raw_data['Smoker'] == 'No']\n", " \n", "df_smoker.plot(kind='scatter',x='Age',y='Status',color='#E69F00')\n", "df_non_smoker.plot(kind='scatter',x='Age',y='Status',color='#56B4E9')\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.4" } }, "nbformat": 4, "nbformat_minor": 2 }