{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Around Simpson's Paradox" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "%matplotlib inline\n", "import matplotlib.pyplot as plt\n", "import pandas as pd\n", "import numpy as np\n", "import isoweek\n", "import os\n", "import requests" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In 1972-1974, in Whickham, a town in the north-east of England located approximately 6.5 kilometres south-west of Newcastle upon Tyne, a survey of one-sixth of the electorate was conducted in order to inform work on thyroid and heart disease (Tunbridge et al. 1977). A follow-up of this study was carried out twenty years later (Vanderpump et al. 1995). Some of the results were related to smoking and whether individuals were still alive at the time of the second study. For simplicity, we restrict the data to women, and among these to the 1314 who were categorized as \"smoking currently\" or \"never smoked\". There were relatively few women in the initial survey who had smoked but since quit (162) and very few for whom information was not available (18). Survival at 20 years was determined for all women of the first survey.\n", "\n", "All these data are available in this [CSV file](https://gitlab.inria.fr/learninglab/mooc-rr/mooc-rr-ressources/blob/master/module3/Practical_session/Subject6_smoking.csv). Each line indicates whether the person smokes or not, whether she was alive or dead at the time of the second study, and her age at the time of the first survey." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "__The mission is to:__\n", "\n", "1. Tabulate the total number of women alive and dead over the period according to their smoking habits. Calculate in each group (smoking/non-smoking) the mortality rate (the ratio of the number of women who died in a group to the total number of women in that group).\n", "2. 
Go back to question 1 (numbers and mortality rates) and add a new category related to the age group.\n", "3. In order to avoid a bias induced by arbitrary and non-regular age groupings, it is possible to try to perform a logistic regression." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Smoking influence" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We first check whether the data file is already available locally and download it if it is not." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "data_file = \"smoking.csv\"\n", "data_url = \"https://gitlab.inria.fr/learninglab/mooc-rr/mooc-rr-ressources/-/raw/master/module3/Practical_session/Subject6_smoking.csv?inline=false\"\n", "# Download the file only if there is no local copy yet.\n", "if not os.path.exists(data_file):\n", "    with open(data_file, \"wb\") as file:\n", "        file.write(requests.get(data_url).content)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The dataset contains information about smoking habits, age and survival. The leftmost column of the display is the row index. Here is a description of the other columns:\n", "\n", "`Smoker` contains the value *Yes* or *No* and shows whether the person smoked or not.\n", "\n", "`Status` contains the value *Alive* or *Dead* and shows whether the person was alive or dead at the time of the second study.\n", "\n", "`Age` contains a float value and indicates the age of the person at the time of the first survey." ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\"><th></th><th>Smoker</th><th>Status</th><th>Age</th></tr>\n", " </thead>\n", " <tbody>\n",
" <tr><th>0</th><td>Yes</td><td>Alive</td><td>21.0</td></tr>\n",
" <tr><th>1</th><td>Yes</td><td>Alive</td><td>19.3</td></tr>\n",
" <tr><th>2</th><td>No</td><td>Dead</td><td>57.5</td></tr>\n",
" <tr><th>3</th><td>No</td><td>Alive</td><td>47.1</td></tr>\n",
" <tr><th>4</th><td>Yes</td><td>Alive</td><td>81.4</td></tr>\n",
" <tr><th>5</th><td>No</td><td>Alive</td><td>36.8</td></tr>\n",
" <tr><th>6</th><td>No</td><td>Alive</td><td>23.8</td></tr>\n",
" <tr><th>7</th><td>Yes</td><td>Dead</td><td>57.5</td></tr>\n",
" <tr><th>8</th><td>Yes</td><td>Alive</td><td>24.8</td></tr>\n",
" <tr><th>9</th><td>Yes</td><td>Alive</td><td>49.5</td></tr>\n",
" <tr><th>10</th><td>Yes</td><td>Alive</td><td>30.0</td></tr>\n",
" <tr><th>11</th><td>No</td><td>Dead</td><td>66.0</td></tr>\n",
" <tr><th>12</th><td>Yes</td><td>Alive</td><td>49.2</td></tr>\n",
" <tr><th>13</th><td>No</td><td>Alive</td><td>58.4</td></tr>\n",
" <tr><th>14</th><td>No</td><td>Dead</td><td>60.6</td></tr>\n",
" <tr><th>15</th><td>No</td><td>Alive</td><td>25.1</td></tr>\n",
" <tr><th>16</th><td>No</td><td>Alive</td><td>43.5</td></tr>\n",
" <tr><th>17</th><td>No</td><td>Alive</td><td>27.1</td></tr>\n",
" <tr><th>18</th><td>No</td><td>Alive</td><td>58.3</td></tr>\n",
" <tr><th>19</th><td>Yes</td><td>Alive</td><td>65.7</td></tr>\n",
" <tr><th>20</th><td>No</td><td>Dead</td><td>73.2</td></tr>\n",
" <tr><th>21</th><td>Yes</td><td>Alive</td><td>38.3</td></tr>\n",
" <tr><th>22</th><td>No</td><td>Alive</td><td>33.4</td></tr>\n",
" <tr><th>23</th><td>Yes</td><td>Dead</td><td>62.3</td></tr>\n",
" <tr><th>24</th><td>No</td><td>Alive</td><td>18.0</td></tr>\n",
" <tr><th>25</th><td>No</td><td>Alive</td><td>56.2</td></tr>\n",
" <tr><th>26</th><td>Yes</td><td>Alive</td><td>59.2</td></tr>\n",
" <tr><th>27</th><td>No</td><td>Alive</td><td>25.8</td></tr>\n",
" <tr><th>28</th><td>No</td><td>Dead</td><td>36.9</td></tr>\n",
" <tr><th>29</th><td>No</td><td>Alive</td><td>20.2</td></tr>\n",
" <tr><th>...</th><td>...</td><td>...</td><td>...</td></tr>\n",
" <tr><th>1284</th><td>Yes</td><td>Dead</td><td>36.0</td></tr>\n",
" <tr><th>1285</th><td>Yes</td><td>Alive</td><td>48.3</td></tr>\n",
" <tr><th>1286</th><td>No</td><td>Alive</td><td>63.1</td></tr>\n",
" <tr><th>1287</th><td>No</td><td>Alive</td><td>60.8</td></tr>\n",
" <tr><th>1288</th><td>Yes</td><td>Dead</td><td>39.3</td></tr>\n",
" <tr><th>1289</th><td>No</td><td>Alive</td><td>36.7</td></tr>\n",
" <tr><th>1290</th><td>No</td><td>Alive</td><td>63.8</td></tr>\n",
" <tr><th>1291</th><td>No</td><td>Dead</td><td>71.3</td></tr>\n",
" <tr><th>1292</th><td>No</td><td>Alive</td><td>57.7</td></tr>\n",
" <tr><th>1293</th><td>No</td><td>Alive</td><td>63.2</td></tr>\n",
" <tr><th>1294</th><td>No</td><td>Alive</td><td>46.6</td></tr>\n",
" <tr><th>1295</th><td>Yes</td><td>Dead</td><td>82.4</td></tr>\n",
" <tr><th>1296</th><td>Yes</td><td>Alive</td><td>38.3</td></tr>\n",
" <tr><th>1297</th><td>Yes</td><td>Alive</td><td>32.7</td></tr>\n",
" <tr><th>1298</th><td>No</td><td>Alive</td><td>39.7</td></tr>\n",
" <tr><th>1299</th><td>Yes</td><td>Dead</td><td>60.0</td></tr>\n",
" <tr><th>1300</th><td>No</td><td>Dead</td><td>71.0</td></tr>\n",
" <tr><th>1301</th><td>No</td><td>Alive</td><td>20.5</td></tr>\n",
" <tr><th>1302</th><td>No</td><td>Alive</td><td>44.4</td></tr>\n",
" <tr><th>1303</th><td>Yes</td><td>Alive</td><td>31.2</td></tr>\n",
" <tr><th>1304</th><td>Yes</td><td>Alive</td><td>47.8</td></tr>\n",
" <tr><th>1305</th><td>Yes</td><td>Alive</td><td>60.9</td></tr>\n",
" <tr><th>1306</th><td>No</td><td>Dead</td><td>61.4</td></tr>\n",
" <tr><th>1307</th><td>Yes</td><td>Alive</td><td>43.0</td></tr>\n",
" <tr><th>1308</th><td>No</td><td>Alive</td><td>42.1</td></tr>\n",
" <tr><th>1309</th><td>Yes</td><td>Alive</td><td>35.9</td></tr>\n",
" <tr><th>1310</th><td>No</td><td>Alive</td><td>22.3</td></tr>\n",
" <tr><th>1311</th><td>Yes</td><td>Dead</td><td>62.1</td></tr>\n",
" <tr><th>1312</th><td>No</td><td>Dead</td><td>88.6</td></tr>\n",
" <tr><th>1313</th><td>No</td><td>Alive</td><td>39.1</td></tr>\n",
" </tbody>\n", "</table>\n", "<p>1314 rows × 3 columns</p>\n", "</div>
" ], "text/plain": [ " Smoker Status Age\n", "0 Yes Alive 21.0\n", "1 Yes Alive 19.3\n", "2 No Dead 57.5\n", "3 No Alive 47.1\n", "4 Yes Alive 81.4\n", "5 No Alive 36.8\n", "6 No Alive 23.8\n", "7 Yes Dead 57.5\n", "8 Yes Alive 24.8\n", "9 Yes Alive 49.5\n", "10 Yes Alive 30.0\n", "11 No Dead 66.0\n", "12 Yes Alive 49.2\n", "13 No Alive 58.4\n", "14 No Dead 60.6\n", "15 No Alive 25.1\n", "16 No Alive 43.5\n", "17 No Alive 27.1\n", "18 No Alive 58.3\n", "19 Yes Alive 65.7\n", "20 No Dead 73.2\n", "21 Yes Alive 38.3\n", "22 No Alive 33.4\n", "23 Yes Dead 62.3\n", "24 No Alive 18.0\n", "25 No Alive 56.2\n", "26 Yes Alive 59.2\n", "27 No Alive 25.8\n", "28 No Dead 36.9\n", "29 No Alive 20.2\n", "... ... ... ...\n", "1284 Yes Dead 36.0\n", "1285 Yes Alive 48.3\n", "1286 No Alive 63.1\n", "1287 No Alive 60.8\n", "1288 Yes Dead 39.3\n", "1289 No Alive 36.7\n", "1290 No Alive 63.8\n", "1291 No Dead 71.3\n", "1292 No Alive 57.7\n", "1293 No Alive 63.2\n", "1294 No Alive 46.6\n", "1295 Yes Dead 82.4\n", "1296 Yes Alive 38.3\n", "1297 Yes Alive 32.7\n", "1298 No Alive 39.7\n", "1299 Yes Dead 60.0\n", "1300 No Dead 71.0\n", "1301 No Alive 20.5\n", "1302 No Alive 44.4\n", "1303 Yes Alive 31.2\n", "1304 Yes Alive 47.8\n", "1305 Yes Alive 60.9\n", "1306 No Dead 61.4\n", "1307 Yes Alive 43.0\n", "1308 No Alive 42.1\n", "1309 Yes Alive 35.9\n", "1310 No Alive 22.3\n", "1311 Yes Dead 62.1\n", "1312 No Dead 88.6\n", "1313 No Alive 39.1\n", "\n", "[1314 rows x 3 columns]" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data = pd.read_csv(data_file)\n", "data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We divide the dataset into two groups depending on the `Status` value." 
] }, { "cell_type": "code", "execution_count": 83, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Our dataset contains 945 alive and 369 dead persons.\n" ] } ], "source": [ "alive, dead = data[data[\"Status\"] == \"Alive\"], data[data[\"Status\"] == \"Dead\"]\n", "print(\"Our dataset contains\", alive.shape[0], \"alive and\", dead.shape[0], \"dead persons.\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We calculate the mortality rate in both smoking and non-smoking groups." ] }, { "cell_type": "code", "execution_count": 84, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The mortality rate is 0.23883161512027493 in smoking group and 0.31420765027322406 in non-smoking group.\n" ] } ], "source": [ "smokers, non_smokers = data[data[\"Smoker\"] == \"Yes\"], data[data[\"Smoker\"] == \"No\"]\n", "print(\"The mortality rate is\", smokers[smokers[\"Status\"] == \"Dead\"].shape[0] / smokers.shape[0] , \"in smoking group and\", non_smokers[non_smokers[\"Status\"] == \"Dead\"].shape[0] / non_smokers.shape[0] , \"in non-smoking group.\")\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We now plot these data." ] }, { "cell_type": "code", "execution_count": 80, "metadata": { "scrolled": true }, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# Combine both conditions in a single boolean mask instead of chaining two selections.\n", "smokers_alive = data[(data[\"Smoker\"] == \"Yes\") & (data[\"Status\"] == \"Alive\")].shape[0]\n", "non_smokers_alive = data[(data[\"Smoker\"] == \"No\") & (data[\"Status\"] == \"Alive\")].shape[0]\n", "smokers_dead = data[data[\"Smoker\"] == \"Yes\"].shape[0] - smokers_alive\n", "non_smokers_dead = data[data[\"Smoker\"] == \"No\"].shape[0] - non_smokers_alive\n", "\n", "x = np.arange(2)\n", "width = 0.2\n", "plt.bar(x-width, height=[smokers_alive, non_smokers_alive],width=width,color='green')\n", "plt.bar(x, [smokers_dead, non_smokers_dead], width, color='red')\n", "plt.xticks(x, ['Smokers', 'Non-smokers'])\n", "plt.ylabel(\"Number of people\")\n", "plt.legend([\"Alive\", \"Dead\"])\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "According to these data, smoking does not appear to increase mortality: the mortality rate is even higher in the non-smoking group (about 31%) than in the smoking group (about 24%). Perhaps we did not consider all the relevant factors. Therefore, we now look at the influence of age. We distinguish four age groups:\n", "- __young:__ under 34 years old,\n", "- __middle-aged:__ between 34 and 55 years old,\n", "- __elder adults:__ between 55 and 65 years old,\n", "- __seniors:__ above 65 years old." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. 
Age influence" ] }, { "cell_type": "code", "execution_count": 105, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The mortality rate is\n", " 0.028 in young smoking group,\n", " 0.027 in young non-smoking group,\n", " 0.172 in middle-aged smoking group,\n", " 0.095 in middle-aged non-smoking group,\n", " 0.443 in elder adults smoking group,\n", " 0.328 in elder adults non-smoking group,\n", " 0.857 in senior smoking group and\n", " 0.859 in senior non-smoking group.\n" ] } ], "source": [
"young = data[data.Age < 34]\n",
"middle_aged = data[(34 <= data.Age) & (data.Age < 55)]\n",
"elder_adults = data[(55 <= data.Age) & (data.Age <= 65)]\n",
"seniors = data[data.Age > 65]\n",
"\n",
"# Split each age group by smoking habit, masking each frame with its own column.\n",
"young_smokers = young[young[\"Smoker\"] == \"Yes\"]\n",
"young_non_smokers = young[young[\"Smoker\"] == \"No\"]\n",
"middle_aged_smokers = middle_aged[middle_aged[\"Smoker\"] == \"Yes\"]\n",
"middle_aged_non_smokers = middle_aged[middle_aged[\"Smoker\"] == \"No\"]\n",
"elder_adults_smokers = elder_adults[elder_adults[\"Smoker\"] == \"Yes\"]\n",
"elder_adults_non_smokers = elder_adults[elder_adults[\"Smoker\"] == \"No\"]\n",
"seniors_smokers = seniors[seniors[\"Smoker\"] == \"Yes\"]\n",
"seniors_non_smokers = seniors[seniors[\"Smoker\"] == \"No\"]\n",
"\n",
"# Survivors in each subgroup: every *_alive frame must filter on Status == \"Alive\".\n",
"young_smokers_alive = young_smokers[young_smokers[\"Status\"] == \"Alive\"]\n",
"young_non_smokers_alive = young_non_smokers[young_non_smokers[\"Status\"] == \"Alive\"]\n",
"middle_aged_smokers_alive = middle_aged_smokers[middle_aged_smokers[\"Status\"] == \"Alive\"]\n",
"middle_aged_non_smokers_alive = middle_aged_non_smokers[middle_aged_non_smokers[\"Status\"] == \"Alive\"]\n",
"elder_adults_smokers_alive = elder_adults_smokers[elder_adults_smokers[\"Status\"] == \"Alive\"]\n",
"elder_adults_non_smokers_alive = elder_adults_non_smokers[elder_adults_non_smokers[\"Status\"] == \"Alive\"]\n",
"seniors_smokers_alive = seniors_smokers[seniors_smokers[\"Status\"] == \"Alive\"]\n",
"seniors_non_smokers_alive = seniors_non_smokers[seniors_non_smokers[\"Status\"] == \"Alive\"]\n",
"\n",
"# Mortality rate = (group size - survivors) / group size, rounded to three decimals.\n",
"print(\"The mortality rate is\\n\",\n",
"      round((young_smokers.shape[0] - young_smokers_alive.shape[0]) / young_smokers.shape[0], 3),\n",
"      \"in young smoking group,\\n\",\n",
"      round((young_non_smokers.shape[0] - young_non_smokers_alive.shape[0]) / young_non_smokers.shape[0], 3),\n",
"      \"in young non-smoking group,\\n\",\n",
"      round((middle_aged_smokers.shape[0] - middle_aged_smokers_alive.shape[0]) / middle_aged_smokers.shape[0], 3),\n",
"      \"in middle-aged smoking group,\\n\",\n",
"      round((middle_aged_non_smokers.shape[0] - middle_aged_non_smokers_alive.shape[0]) / middle_aged_non_smokers.shape[0], 3),\n",
"      \"in middle-aged non-smoking group,\\n\",\n",
"      round((elder_adults_smokers.shape[0] - elder_adults_smokers_alive.shape[0]) / elder_adults_smokers.shape[0], 3),\n",
"      \"in elder adults smoking group,\\n\",\n",
"      round((elder_adults_non_smokers.shape[0] - elder_adults_non_smokers_alive.shape[0]) / elder_adults_non_smokers.shape[0], 3),\n",
"      \"in elder adults non-smoking group,\\n\",\n",
"      round((seniors_smokers.shape[0] - seniors_smokers_alive.shape[0]) / seniors_smokers.shape[0], 3),\n",
"      \"in senior smoking group and\\n\",\n",
"      round((seniors_non_smokers.shape[0] - seniors_non_smokers_alive.shape[0]) / seniors_non_smokers.shape[0], 3),\n",
"      \"in senior non-smoking group.\")\n"
] }, { "cell_type": "code", "execution_count": 124, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "x = np.arange(8)\n", "width = 0.4\n", "plt.bar(x-width, height=[young_smokers_alive.shape[0], middle_aged_smokers_alive.shape[0], elder_adults_smokers_alive.shape[0], seniors_smokers_alive.shape[0], young_non_smokers_alive.shape[0], middle_aged_non_smokers_alive.shape[0], elder_adults_non_smokers_alive.shape[0], seniors_non_smokers_alive.shape[0]],width=width,color='green')\n", "plt.bar(x, [young_smokers.shape[0]-young_smokers_alive.shape[0], middle_aged_smokers.shape[0]-middle_aged_smokers_alive.shape[0], elder_adults_smokers.shape[0]-elder_adults_smokers_alive.shape[0], seniors_smokers.shape[0]-seniors_smokers_alive.shape[0], young_non_smokers.shape[0]-young_non_smokers_alive.shape[0], middle_aged_non_smokers.shape[0]-middle_aged_non_smokers_alive.shape[0], elder_adults_non_smokers.shape[0]-elder_adults_non_smokers_alive.shape[0], seniors_non_smokers.shape[0]-seniors_non_smokers_alive.shape[0]], width, color='red')\n", "plt.xticks(x, ['Young\\nsmokers', 'Middle-\\naged\\nsmokers', 'Elder\\nadults\\nsmokers', 'Seniors\\nsmokers', 'Young\\nnon-\\nsmokers', 'Middle-\\naged\\nnon-\\nsmokers', 'Elder\\nadults\\nnon-\\nsmokers', 'Seniors\\nnon-\\nsmokers'])\n", "plt.ylabel(\"Number of people\")\n", "plt.legend([\"Alive\", \"Dead\"])\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Once age is taken into account, the picture changes. In the middle-aged and elder adult groups the mortality rate is clearly higher among smokers (about 17% versus 10%, and about 44% versus 33%). In the young and senior groups the two rates are almost identical (about 3% in both young groups and about 86% in both senior groups). In no age group do smokers have a clear advantage.\n", "\n", "The lower overall mortality of smokers observed in section 1 therefore comes from the age structure of the sample: seniors make up about 26% of the non-smokers but only about 8% of the smokers, and mortality in that age group is very high whatever the smoking habit. This reversal between the aggregated and the per-group comparison is an instance of Simpson's paradox.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. 
Logistic regression analysis" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As proposed, we now examine the joint influence of age and smoking with a logistic regression." ] }, { "cell_type": "code", "execution_count": 129, "metadata": {}, "outputs": [], "source": [ "from sklearn.linear_model import LogisticRegression\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.metrics import classification_report" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We first need to encode the categorical columns as numerical values." ] }, { "cell_type": "code", "execution_count": 126, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Smoker int64\n", "Status int64\n", "Age float64\n", "dtype: object" ] }, "execution_count": 126, "metadata": {}, "output_type": "execute_result" } ], "source": [ "numeric_data = data.copy()\n", "# Encode the categories as integers: smoker = 1, non-smoker = 0, alive = 1, dead = 0.\n", "numeric_data[\"Smoker\"] = numeric_data[\"Smoker\"].map({\"Yes\": 1, \"No\": 0})\n", "numeric_data[\"Status\"] = numeric_data[\"Status\"].map({\"Alive\": 1, \"Dead\": 0})\n", "numeric_data.dtypes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We take 90% of our data to train the model and 10% to test it." 
] }, { "cell_type": "code", "execution_count": 152, "metadata": { "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " precision recall f1-score support\n", "\n", " 0 0.71 0.67 0.69 33\n", " 1 0.89 0.91 0.90 99\n", "\n", "avg / total 0.85 0.85 0.85 132\n", "\n" ] } ], "source": [ "X_train, X_test, y_train, y_test = train_test_split(numeric_data.drop('Status', axis=1), numeric_data['Status'], test_size=0.10, random_state=1)\n", "model = LogisticRegression()\n", "model.fit(X_train, y_train)\n", "predictions = model.predict(X_test)\n", "print(classification_report(y_test, predictions))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The model reaches an average precision and recall of about 0.85 on the 132 held-out cases, which shows that age and smoking status together are informative about survival. However, classification scores alone are not sufficient to conclude how smoking relates to mortality at a given age; for that, the coefficient of `Smoker` in the fitted model would have to be examined. Further analysis is needed." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.4" } }, "nbformat": 4, "nbformat_minor": 2 }