{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Around Simpson's Paradox" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "%matplotlib inline\n", "import matplotlib.pyplot as plt\n", "import pandas as pd\n", "import numpy as np\n", "import isoweek\n", "import os\n", "import requests" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In 1972-1974, in Whickham, a town in the north-east of England, located approximately 6.5 kilometres south-west of Newcastle upon Tyne, a survey of one-sixth of the electorate was conducted in order to inform work on thyroid and heart disease (Tunbridge et al. 1977). A continuation of this study was carried out twenty years later (Vanderpump et al. 1995). Some of the results were related to smoking and whether individuals were still alive at the time of the second study. For simplicity, we will restrict the data to women and among these to the 1314 who were categorized as \"smoking currently\" or \"never smoked\". There were relatively few women in the initial survey who had smoked but had since quit (162) and very few for whom information was not available (18). Survival at 20 years was determined for all women of the first survey.\n", "\n", "All these data are available in this [CSV file](https://gitlab.inria.fr/learninglab/mooc-rr/mooc-rr-ressources/blob/master/module3/Practical_session/Subject6_smoking.csv). Each line records whether the person smokes, whether she was alive or dead at the time of the second study, and her age at the time of the first survey." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "__The mission is to:__\n", "\n", "1. Tabulate the total number of women alive and dead over the period according to their smoking habits. Calculate in each group (smoking/non-smoking) the mortality rate (the ratio of the number of women who died in a group to the total number of women in that group).\n", "2. 
Go back to question 1 (numbers and mortality rates) and add a new category related to the age group.\n", "3. To avoid the bias induced by arbitrary, irregular age groupings, one can instead fit a logistic regression." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Smoking influence" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We first check whether a local copy of the CSV file exists and download it if it does not." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "data_file = \"smoking.csv\"\n", "data_url = \"https://gitlab.inria.fr/learninglab/mooc-rr/mooc-rr-ressources/-/raw/master/module3/Practical_session/Subject6_smoking.csv?inline=false\"\n", "# Download the data only if no local copy exists.\n", "if not os.path.exists(data_file):\n", "    with open(data_file, \"wb\") as file:\n", "        file.write(requests.get(data_url).content)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The dataset contains information about smoking habits, age, and survival. The first column contains the row index. Here is a description of the remaining columns:\n", "\n", "`Smoker` contains *Yes* or *No* and indicates whether the person smoked.\n", "\n", "`Status` contains *Alive* or *Dead* and indicates whether the person was alive or dead at the time of the second study.\n", "\n", "`Age` contains a float value and indicates the person's age at the time of the first survey." ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
\n", "|  | Smoker | Status | Age |\n", 
"|---|---|---|---|\n", 
"| 0 | Yes | Alive | 21.0 |\n", 
"| 1 | Yes | Alive | 19.3 |\n", 
"| 2 | No | Dead | 57.5 |\n", 
"| 3 | No | Alive | 47.1 |\n", 
"| 4 | Yes | Alive | 81.4 |\n", 
"| 5 | No | Alive | 36.8 |\n", 
"| 6 | No | Alive | 23.8 |\n", 
"| 7 | Yes | Dead | 57.5 |\n", 
"| 8 | Yes | Alive | 24.8 |\n", 
"| 9 | Yes | Alive | 49.5 |\n", 
"| 10 | Yes | Alive | 30.0 |\n", 
"| 11 | No | Dead | 66.0 |\n", 
"| 12 | Yes | Alive | 49.2 |\n", 
"| 13 | No | Alive | 58.4 |\n", 
"| 14 | No | Dead | 60.6 |\n", 
"| 15 | No | Alive | 25.1 |\n", 
"| 16 | No | Alive | 43.5 |\n", 
"| 17 | No | Alive | 27.1 |\n", 
"| 18 | No | Alive | 58.3 |\n", 
"| 19 | Yes | Alive | 65.7 |\n", 
"| 20 | No | Dead | 73.2 |\n", 
"| 21 | Yes | Alive | 38.3 |\n", 
"| 22 | No | Alive | 33.4 |\n", 
"| 23 | Yes | Dead | 62.3 |\n", 
"| 24 | No | Alive | 18.0 |\n", 
"| 25 | No | Alive | 56.2 |\n", 
"| 26 | Yes | Alive | 59.2 |\n", 
"| 27 | No | Alive | 25.8 |\n", 
"| 28 | No | Dead | 36.9 |\n", 
"| 29 | No | Alive | 20.2 |\n", 
"| ... | ... | ... | ... |\n", 
"| 1284 | Yes | Dead | 36.0 |\n", 
"| 1285 | Yes | Alive | 48.3 |\n", 
"| 1286 | No | Alive | 63.1 |\n", 
"| 1287 | No | Alive | 60.8 |\n", 
"| 1288 | Yes | Dead | 39.3 |\n", 
"| 1289 | No | Alive | 36.7 |\n", 
"| 1290 | No | Alive | 63.8 |\n", 
"| 1291 | No | Dead | 71.3 |\n", 
"| 1292 | No | Alive | 57.7 |\n", 
"| 1293 | No | Alive | 63.2 |\n", 
"| 1294 | No | Alive | 46.6 |\n", 
"| 1295 | Yes | Dead | 82.4 |\n", 
"| 1296 | Yes | Alive | 38.3 |\n", 
"| 1297 | Yes | Alive | 32.7 |\n", 
"| 1298 | No | Alive | 39.7 |\n", 
"| 1299 | Yes | Dead | 60.0 |\n", 
"| 1300 | No | Dead | 71.0 |\n", 
"| 1301 | No | Alive | 20.5 |\n", 
"| 1302 | No | Alive | 44.4 |\n", 
"| 1303 | Yes | Alive | 31.2 |\n", 
"| 1304 | Yes | Alive | 47.8 |\n", 
"| 1305 | Yes | Alive | 60.9 |\n", 
"| 1306 | No | Dead | 61.4 |\n", 
"| 1307 | Yes | Alive | 43.0 |\n", 
"| 1308 | No | Alive | 42.1 |\n", 
"| 1309 | Yes | Alive | 35.9 |\n", 
"| 1310 | No | Alive | 22.3 |\n", 
"| 1311 | Yes | Dead | 62.1 |\n", 
"| 1312 | No | Dead | 88.6 |\n", 
"| 1313 | No | Alive | 39.1 |\n", 
"\n", 
"1314 rows × 3 columns\n", "