{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Autour du Paradoxe de Simpson"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"%matplotlib inline\n",
"import matplotlib.pyplot as plt\n",
"import pandas as pd\n",
"import numpy as np\n",
"import statsmodels.api as sm"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Obtention et pré-traitement des données\n",
"\n",
"Les données sont présentes sur le Gitlab du MOOC. Par sécurité elles sont téléchargées localement. Il n'est néanmoins pas nécessaire (et contre-productif) de re-télécharger le fichier à chaque exécution, le téléchargement n'a lieux que si le fichier de données n'est pas présent sur la machine.\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"data_url=\"https://gitlab.inria.fr/learninglab/mooc-rr/mooc-rr-ressources/-/raw/master/module3/Practical_session/Subject6_smoking.csv?inline=false\"\n",
"data_file=\"Subject6_smoking.csv.csv\"\n",
"import os\n",
"import urllib.request\n",
"if not os.path.exists(data_file):\n",
" urllib.request.urlretrieve(data_url, data_file)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"On affiche un aperçu des données :"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
" \n",
" \n",
" Smoker \n",
" Status \n",
" Age \n",
" \n",
" \n",
" \n",
" \n",
" 0 \n",
" Yes \n",
" Alive \n",
" 21.0 \n",
" \n",
" \n",
" 1 \n",
" Yes \n",
" Alive \n",
" 19.3 \n",
" \n",
" \n",
" 2 \n",
" No \n",
" Dead \n",
" 57.5 \n",
" \n",
" \n",
" 3 \n",
" No \n",
" Alive \n",
" 47.1 \n",
" \n",
" \n",
" 4 \n",
" Yes \n",
" Alive \n",
" 81.4 \n",
" \n",
" \n",
" 5 \n",
" No \n",
" Alive \n",
" 36.8 \n",
" \n",
" \n",
" 6 \n",
" No \n",
" Alive \n",
" 23.8 \n",
" \n",
" \n",
" 7 \n",
" Yes \n",
" Dead \n",
" 57.5 \n",
" \n",
" \n",
" 8 \n",
" Yes \n",
" Alive \n",
" 24.8 \n",
" \n",
" \n",
" 9 \n",
" Yes \n",
" Alive \n",
" 49.5 \n",
" \n",
" \n",
" 10 \n",
" Yes \n",
" Alive \n",
" 30.0 \n",
" \n",
" \n",
" 11 \n",
" No \n",
" Dead \n",
" 66.0 \n",
" \n",
" \n",
" 12 \n",
" Yes \n",
" Alive \n",
" 49.2 \n",
" \n",
" \n",
" 13 \n",
" No \n",
" Alive \n",
" 58.4 \n",
" \n",
" \n",
" 14 \n",
" No \n",
" Dead \n",
" 60.6 \n",
" \n",
" \n",
" 15 \n",
" No \n",
" Alive \n",
" 25.1 \n",
" \n",
" \n",
" 16 \n",
" No \n",
" Alive \n",
" 43.5 \n",
" \n",
" \n",
" 17 \n",
" No \n",
" Alive \n",
" 27.1 \n",
" \n",
" \n",
" 18 \n",
" No \n",
" Alive \n",
" 58.3 \n",
" \n",
" \n",
" 19 \n",
" Yes \n",
" Alive \n",
" 65.7 \n",
" \n",
" \n",
" 20 \n",
" No \n",
" Dead \n",
" 73.2 \n",
" \n",
" \n",
" 21 \n",
" Yes \n",
" Alive \n",
" 38.3 \n",
" \n",
" \n",
" 22 \n",
" No \n",
" Alive \n",
" 33.4 \n",
" \n",
" \n",
" 23 \n",
" Yes \n",
" Dead \n",
" 62.3 \n",
" \n",
" \n",
" 24 \n",
" No \n",
" Alive \n",
" 18.0 \n",
" \n",
" \n",
" 25 \n",
" No \n",
" Alive \n",
" 56.2 \n",
" \n",
" \n",
" 26 \n",
" Yes \n",
" Alive \n",
" 59.2 \n",
" \n",
" \n",
" 27 \n",
" No \n",
" Alive \n",
" 25.8 \n",
" \n",
" \n",
" 28 \n",
" No \n",
" Dead \n",
" 36.9 \n",
" \n",
" \n",
" 29 \n",
" No \n",
" Alive \n",
" 20.2 \n",
" \n",
" \n",
" ... \n",
" ... \n",
" ... \n",
" ... \n",
" \n",
" \n",
" 1284 \n",
" Yes \n",
" Dead \n",
" 36.0 \n",
" \n",
" \n",
" 1285 \n",
" Yes \n",
" Alive \n",
" 48.3 \n",
" \n",
" \n",
" 1286 \n",
" No \n",
" Alive \n",
" 63.1 \n",
" \n",
" \n",
" 1287 \n",
" No \n",
" Alive \n",
" 60.8 \n",
" \n",
" \n",
" 1288 \n",
" Yes \n",
" Dead \n",
" 39.3 \n",
" \n",
" \n",
" 1289 \n",
" No \n",
" Alive \n",
" 36.7 \n",
" \n",
" \n",
" 1290 \n",
" No \n",
" Alive \n",
" 63.8 \n",
" \n",
" \n",
" 1291 \n",
" No \n",
" Dead \n",
" 71.3 \n",
" \n",
" \n",
" 1292 \n",
" No \n",
" Alive \n",
" 57.7 \n",
" \n",
" \n",
" 1293 \n",
" No \n",
" Alive \n",
" 63.2 \n",
" \n",
" \n",
" 1294 \n",
" No \n",
" Alive \n",
" 46.6 \n",
" \n",
" \n",
" 1295 \n",
" Yes \n",
" Dead \n",
" 82.4 \n",
" \n",
" \n",
" 1296 \n",
" Yes \n",
" Alive \n",
" 38.3 \n",
" \n",
" \n",
" 1297 \n",
" Yes \n",
" Alive \n",
" 32.7 \n",
" \n",
" \n",
" 1298 \n",
" No \n",
" Alive \n",
" 39.7 \n",
" \n",
" \n",
" 1299 \n",
" Yes \n",
" Dead \n",
" 60.0 \n",
" \n",
" \n",
" 1300 \n",
" No \n",
" Dead \n",
" 71.0 \n",
" \n",
" \n",
" 1301 \n",
" No \n",
" Alive \n",
" 20.5 \n",
" \n",
" \n",
" 1302 \n",
" No \n",
" Alive \n",
" 44.4 \n",
" \n",
" \n",
" 1303 \n",
" Yes \n",
" Alive \n",
" 31.2 \n",
" \n",
" \n",
" 1304 \n",
" Yes \n",
" Alive \n",
" 47.8 \n",
" \n",
" \n",
" 1305 \n",
" Yes \n",
" Alive \n",
" 60.9 \n",
" \n",
" \n",
" 1306 \n",
" No \n",
" Dead \n",
" 61.4 \n",
" \n",
" \n",
" 1307 \n",
" Yes \n",
" Alive \n",
" 43.0 \n",
" \n",
" \n",
" 1308 \n",
" No \n",
" Alive \n",
" 42.1 \n",
" \n",
" \n",
" 1309 \n",
" Yes \n",
" Alive \n",
" 35.9 \n",
" \n",
" \n",
" 1310 \n",
" No \n",
" Alive \n",
" 22.3 \n",
" \n",
" \n",
" 1311 \n",
" Yes \n",
" Dead \n",
" 62.1 \n",
" \n",
" \n",
" 1312 \n",
" No \n",
" Dead \n",
" 88.6 \n",
" \n",
" \n",
" 1313 \n",
" No \n",
" Alive \n",
" 39.1 \n",
" \n",
" \n",
"
\n",
"
1314 rows × 3 columns
\n",
"
"
],
"text/plain": [
" Smoker Status Age\n",
"0 Yes Alive 21.0\n",
"1 Yes Alive 19.3\n",
"2 No Dead 57.5\n",
"3 No Alive 47.1\n",
"4 Yes Alive 81.4\n",
"5 No Alive 36.8\n",
"6 No Alive 23.8\n",
"7 Yes Dead 57.5\n",
"8 Yes Alive 24.8\n",
"9 Yes Alive 49.5\n",
"10 Yes Alive 30.0\n",
"11 No Dead 66.0\n",
"12 Yes Alive 49.2\n",
"13 No Alive 58.4\n",
"14 No Dead 60.6\n",
"15 No Alive 25.1\n",
"16 No Alive 43.5\n",
"17 No Alive 27.1\n",
"18 No Alive 58.3\n",
"19 Yes Alive 65.7\n",
"20 No Dead 73.2\n",
"21 Yes Alive 38.3\n",
"22 No Alive 33.4\n",
"23 Yes Dead 62.3\n",
"24 No Alive 18.0\n",
"25 No Alive 56.2\n",
"26 Yes Alive 59.2\n",
"27 No Alive 25.8\n",
"28 No Dead 36.9\n",
"29 No Alive 20.2\n",
"... ... ... ...\n",
"1284 Yes Dead 36.0\n",
"1285 Yes Alive 48.3\n",
"1286 No Alive 63.1\n",
"1287 No Alive 60.8\n",
"1288 Yes Dead 39.3\n",
"1289 No Alive 36.7\n",
"1290 No Alive 63.8\n",
"1291 No Dead 71.3\n",
"1292 No Alive 57.7\n",
"1293 No Alive 63.2\n",
"1294 No Alive 46.6\n",
"1295 Yes Dead 82.4\n",
"1296 Yes Alive 38.3\n",
"1297 Yes Alive 32.7\n",
"1298 No Alive 39.7\n",
"1299 Yes Dead 60.0\n",
"1300 No Dead 71.0\n",
"1301 No Alive 20.5\n",
"1302 No Alive 44.4\n",
"1303 Yes Alive 31.2\n",
"1304 Yes Alive 47.8\n",
"1305 Yes Alive 60.9\n",
"1306 No Dead 61.4\n",
"1307 Yes Alive 43.0\n",
"1308 No Alive 42.1\n",
"1309 Yes Alive 35.9\n",
"1310 No Alive 22.3\n",
"1311 Yes Dead 62.1\n",
"1312 No Dead 88.6\n",
"1313 No Alive 39.1\n",
"\n",
"[1314 rows x 3 columns]"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"raw_data = pd.read_csv(data_file)\n",
"raw_data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"On vérifie qu'aucune ligne ne soit vide de valeur."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" \n",
" Smoker \n",
" Status \n",
" Age \n",
" \n",
" \n",
" \n",
" \n",
"
\n",
"
"
],
"text/plain": [
"Empty DataFrame\n",
"Columns: [Smoker, Status, Age]\n",
"Index: []"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"raw_data[raw_data.isnull().any(axis=1)]\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Aucun soucis n'a été repéré sur les données, elles semblent être exploitables en l'état."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"data=raw_data #we rename for coherence"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Première exploitation des données\n",
"\n",
"On effectue une analyse simple (simpliste?) sur les données. On commence par compter le nombre de fumeurs et non-fumeur"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Nombre de fumeurs = 582\n",
"Nombre de non fumeurs = 732\n",
"Taille de l'échantillon = 1314\n"
]
}
],
"source": [
"smokers=pd.DataFrame.sum(data['Smoker']=='Yes')\n",
"print('Nombre de fumeurs =',smokers)\n",
"non_smokers=pd.DataFrame.sum(data['Smoker']=='No')\n",
"print('Nombre de non fumeurs =',non_smokers)\n",
"total=smokers+non_smokers\n",
"print('Taille de l\\'échantillon =',total)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"On calcule maintenant le taux de mortalité pour ces deux groupes :"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Mortalité fumeur = 0.239\n",
"Mortalité non fumeur = 0.314\n",
"Mortalité de l'échantillon = 0.281\n"
]
}
],
"source": [
"deaths_smokers=pd.DataFrame.sum((data['Smoker']=='Yes')&(data['Status']=='Dead'))\n",
"death_rate_smokers=deaths_smokers/smokers\n",
"deaths_non_smokers=pd.DataFrame.sum((data['Smoker']=='No')&(data['Status']=='Dead'))\n",
"death_rate_non_smokers=deaths_non_smokers/non_smokers\n",
"death_rate_total=(deaths_smokers+deaths_non_smokers)/total\n",
"print('Mortalité fumeur =',round(death_rate_smokers,3))\n",
"print('Mortalité non fumeur =', round(death_rate_non_smokers,3))\n",
"print('Mortalité de l\\'échantillon =',round(death_rate_total,3))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"On arrange ces informations sous forme d'un tableau"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" Fumeurs Non-fumeurs Total\n",
" ------------------------------------\n",
"Taille du groupe 582 732 1314\n",
"Vivant 443 502 945\n",
"Mort 139 230 369\n",
"Mortalité 0.239 0.314 0.281\n"
]
}
],
"source": [
"print(' Fumeurs Non-fumeurs Total')\n",
"print(' ------------------------------------')\n",
"print('Taille du groupe ',smokers,' ',non_smokers,' ',total)\n",
"print('Vivant ',smokers-deaths_smokers,' ',non_smokers-deaths_non_smokers,' ',total-deaths_smokers-deaths_non_smokers)\n",
"print('Mort ',deaths_smokers,' ',deaths_non_smokers,' ',deaths_smokers+deaths_non_smokers)\n",
"print('Mortalité ',round(death_rate_smokers,3),' ',round(death_rate_non_smokers,3),' ',round(death_rate_total,3))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"On peut également les représenter sous forme de graphique circulaire"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"labels = 'Non-fumeurs', 'Fumeurs'\n",
"sizes = [non_smokers/total,(smokers/total)]\n",
"\n",
"\n",
"fig1, ax1 = plt.subplots()\n",
"ax1.pie(sizes, labels=labels,shadow=True,startangle=90,autopct='%1.1f%%')\n",
"ax1.axis('equal') \n",
"plt.title('Répartition de l\\'échantillon')\n",
"plt.show()\n"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"labels = 'Vivants', 'Morts'\n",
"sizes = [1-death_rate_smokers,death_rate_smokers]\n",
"explode = (0, 0.1)\n",
"\n",
"fig1, ax1 = plt.subplots()\n",
"ax1.pie(sizes, labels=labels,explode=explode,startangle=90,shadow=True,autopct='%1.1f%%',colors=('green','red'))\n",
"ax1.axis('equal') \n",
"plt.title('Mortalité échantillon de fumeurs')\n",
"plt.show()"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"labels = 'Vivants', 'Morts'\n",
"sizes = [1-death_rate_non_smokers,death_rate_non_smokers]\n",
"explode = (0, 0.1)\n",
"\n",
"fig1, ax1 = plt.subplots()\n",
"ax1.pie(sizes, labels=labels,explode=explode,startangle=90,shadow=True,autopct='%1.1f%%',colors=('green','red'))\n",
"ax1.axis('equal') \n",
"plt.title('Mortalité échantillon de non-fumeurs')\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Intervalle de confiance ?"
]
},
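{
"cell_type": "markdown",
"metadata": {},
"source": [
"One possible way to address the confidence-interval question raised above, offered as a sketch rather than as the original analysis: a 95% Wilson interval for each group's death rate, computed with `proportion_confint` from `statsmodels.stats.proportion`. The counts are copied from the table above; the Wilson method, the 95% level and the names `groups` and `intervals` are assumptions made for illustration."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from statsmodels.stats.proportion import proportion_confint\n",
"\n",
"# (deaths, group size) copied from the table above\n",
"groups = {'Smokers': (139, 582), 'Non-smokers': (230, 732)}\n",
"\n",
"intervals = {}\n",
"for name, (deaths, size) in groups.items():\n",
"    # Wilson score interval: an illustrative choice, other methods exist\n",
"    low, high = proportion_confint(deaths, size, alpha=0.05, method='wilson')\n",
"    intervals[name] = (low, high)\n",
"    print(f'{name}: death rate = {deaths / size:.3f}, 95% CI = [{low:.3f}, {high:.3f}]')"
]
},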
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Il apparait alors que la mortalité est plus importante au sein de l'échantillon 'non-fumeur', une conclusion hâtive peut donc nous amener à mettre en doute la plus connues des inscription figurant sur les paquets de cigarettes actuels."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Prise en compte de l'âge\n",
"\n",
"Notre analyse précédante nous mêne à une contradiction avec le célèbre _Fumer Tue_. On se penche donc sur la répartition d'âge au sein des groupes afin de voir si cela peut mener à une explication.\n",
"On commence par regrouper par tranche d'âge (18-34,34-54,55-64,65+)"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
"data.loc[data['Age']<35,'Categorie d\\'age'] = 'A'\n",
"data.loc[(data['Age']<55) & (data['Age']>=35),'Categorie d\\'age'] = 'B'\n",
"data.loc[(data['Age']<65) & (data['Age']>=55),'Categorie d\\'age'] = 'C'\n",
"data.loc[data['Age']>=65,'Categorie d\\'age'] = 'D'"
]
},
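{
"cell_type": "markdown",
"metadata": {},
"source": [
"The brackets defined above can also be built in a single call with `pandas.cut`. The sketch below is illustrative only: the `demo` DataFrame contains made-up ages chosen to sit on the bracket boundaries, and the column name `age_bracket` is hypothetical. With `right=False` the bins are closed on the left, which matches the `<35`, `<55` and `<65` thresholds used above."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"# Made-up ages, for illustration only\n",
"demo = pd.DataFrame({'Age': [18.0, 34.9, 35.0, 54.9, 55.0, 64.9, 65.0, 88.6]})\n",
"\n",
"# right=False gives left-closed bins: [0, 35), [35, 55), [55, 65), [65, inf)\n",
"demo['age_bracket'] = pd.cut(demo['Age'],\n",
"                             bins=[0, 35, 55, 65, np.inf],\n",
"                             right=False,\n",
"                             labels=['18-34', '35-54', '55-64', '65+'])\n",
"print(demo)"
]
},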
{
"cell_type": "markdown",
"metadata": {},
"source": [
"On vérifie que la somme des sous-groupe soit bien égale au nombre total des donnés. "
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Categorie d'âge 18-34 35-54 55-64 65+ total\n",
"-------------------------------------------------------------------------------------------\n",
"Taille de l'échantillon 416 420 236 242 1314\n"
]
}
],
"source": [
"A_total=pd.DataFrame.sum((data['Categorie d\\'age']=='A'))\n",
"B_total=pd.DataFrame.sum((data['Categorie d\\'age']=='B'))\n",
"C_total=pd.DataFrame.sum((data['Categorie d\\'age']=='C'))\n",
"D_total=pd.DataFrame.sum((data['Categorie d\\'age']=='D'))\n",
"print('Categorie d\\'âge 18-34 35-54 55-64 65+ total')\n",
"print('-------------------------------------------------------------------------------------------')\n",
"print('Taille de l\\'échantillon ',A_total,' ',B_total,' ',C_total,' ',D_total,' ',A_total+B_total+C_total+D_total)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Il ne semble pas y avoir d'erreur sur le découapage en sous-échantillons, on procède donc aux même analyses que précédement appliquées cette fois-ci par tranches d'âges."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Catégorie d'âge 18-34"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" Fumeurs Non-fumeurs Total\n",
" ------------------------------------\n",
"Taille du groupe 189 227 416\n",
"Vivant 182 221 403\n",
"Mort 7 6 13\n",
"Mortalité 0.037 0.026 0.031\n"
]
}
],
"source": [
"A_smokers=pd.DataFrame.sum((data['Smoker']=='Yes')& (data['Categorie d\\'age']=='A'))\n",
"A_non_smokers=pd.DataFrame.sum((data['Smoker']=='No')& (data['Categorie d\\'age']=='A'))\n",
"A_total=A_smokers+A_non_smokers\n",
"\n",
"A_deaths_smokers=pd.DataFrame.sum((data['Smoker']=='Yes')&(data['Status']=='Dead')&(data['Categorie d\\'age']=='A'))\n",
"A_death_rate_smokers=A_deaths_smokers/A_smokers\n",
"A_deaths_non_smokers=pd.DataFrame.sum((data['Smoker']=='No')&(data['Status']=='Dead')&(data['Categorie d\\'age']=='A'))\n",
"A_death_rate_non_smokers=A_deaths_non_smokers/A_non_smokers\n",
"A_death_rate_total=(A_deaths_smokers+A_deaths_non_smokers)/A_total\n",
"\n",
"\n",
"print(' Fumeurs Non-fumeurs Total')\n",
"print(' ------------------------------------')\n",
"print('Taille du groupe ',A_smokers,' ',A_non_smokers,' ',A_total)\n",
"print('Vivant ',A_smokers-A_deaths_smokers,' ',A_non_smokers-A_deaths_non_smokers,' ',A_total-A_deaths_smokers-A_deaths_non_smokers)\n",
"print('Mort ',A_deaths_smokers,' ',A_deaths_non_smokers,' ',A_deaths_smokers+A_deaths_non_smokers)\n",
"print('Mortalité ',round(A_death_rate_smokers,3),' ',round(A_death_rate_non_smokers,3),' ',round(A_death_rate_total,3))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Catégorie d'age 35-54"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" Fumeurs Non-fumeurs Total\n",
" ------------------------------------\n",
"Taille du groupe 229 191 420\n",
"Vivant 190 172 362\n",
"Mort 39 19 58\n",
"Mortalité 0.17 0.099 0.138\n"
]
}
],
"source": [
"B_smokers=pd.DataFrame.sum((data['Smoker']=='Yes')& (data['Categorie d\\'age']=='B'))\n",
"B_non_smokers=pd.DataFrame.sum((data['Smoker']=='No')& (data['Categorie d\\'age']=='B'))\n",
"B_total=B_smokers+B_non_smokers\n",
"\n",
"B_deaths_smokers=pd.DataFrame.sum((data['Smoker']=='Yes')&(data['Status']=='Dead')&(data['Categorie d\\'age']=='B'))\n",
"B_death_rate_smokers=B_deaths_smokers/B_smokers\n",
"B_deaths_non_smokers=pd.DataFrame.sum((data['Smoker']=='No')&(data['Status']=='Dead')&(data['Categorie d\\'age']=='B'))\n",
"B_death_rate_non_smokers=B_deaths_non_smokers/B_non_smokers\n",
"B_death_rate_total=(B_deaths_smokers+B_deaths_non_smokers)/B_total\n",
"\n",
"\n",
"print(' Fumeurs Non-fumeurs Total')\n",
"print(' ------------------------------------')\n",
"print('Taille du groupe ',B_smokers,' ',B_non_smokers,' ',B_total)\n",
"print('Vivant ',B_smokers-B_deaths_smokers,' ',B_non_smokers-B_deaths_non_smokers,' ',B_total-B_deaths_smokers-B_deaths_non_smokers)\n",
"print('Mort ',B_deaths_smokers,' ',B_deaths_non_smokers,' ',B_deaths_smokers+B_deaths_non_smokers)\n",
"print('Mortalité ',round(B_death_rate_smokers,3),' ',round(B_death_rate_non_smokers,3),' ',round(B_death_rate_total,3))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Catégorie d'age 55-64"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" Fumeurs Non-fumeurs Total\n",
" ------------------------------------\n",
"Taille du groupe 115 121 236\n",
"Vivant 64 81 145\n",
"Mort 51 40 91\n",
"Mortalité 0.443 0.331 0.386\n"
]
}
],
"source": [
"C_smokers=pd.DataFrame.sum((data['Smoker']=='Yes')& (data['Categorie d\\'age']=='C'))\n",
"C_non_smokers=pd.DataFrame.sum((data['Smoker']=='No')& (data['Categorie d\\'age']=='C'))\n",
"C_total=C_smokers+C_non_smokers\n",
"\n",
"C_deaths_smokers=pd.DataFrame.sum((data['Smoker']=='Yes')&(data['Status']=='Dead')&(data['Categorie d\\'age']=='C'))\n",
"C_death_rate_smokers=C_deaths_smokers/C_smokers\n",
"C_deaths_non_smokers=pd.DataFrame.sum((data['Smoker']=='No')&(data['Status']=='Dead')&(data['Categorie d\\'age']=='C'))\n",
"C_death_rate_non_smokers=C_deaths_non_smokers/C_non_smokers\n",
"C_death_rate_total=(C_deaths_smokers+C_deaths_non_smokers)/C_total\n",
"\n",
"\n",
"print(' Fumeurs Non-fumeurs Total')\n",
"print(' ------------------------------------')\n",
"print('Taille du groupe ',C_smokers,' ',C_non_smokers,' ',C_total)\n",
"print('Vivant ',C_smokers-C_deaths_smokers,' ',C_non_smokers-C_deaths_non_smokers,' ',C_total-C_deaths_smokers-C_deaths_non_smokers)\n",
"print('Mort ',C_deaths_smokers,' ',C_deaths_non_smokers,' ',C_deaths_smokers+C_deaths_non_smokers)\n",
"print('Mortalité ',round(C_death_rate_smokers,3),' ',round(C_death_rate_non_smokers,3),' ',round(C_death_rate_total,3))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Catégorie d'âge 65+"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" Fumeurs Non-fumeurs Total\n",
" ------------------------------------\n",
"Taille du groupe 49 193 242\n",
"Vivant 7 28 35\n",
"Mort 42 165 207\n",
"Mortalité 0.857 0.855 0.855\n"
]
}
],
"source": [
"D_smokers=pd.DataFrame.sum((data['Smoker']=='Yes')& (data['Categorie d\\'age']=='D'))\n",
"D_non_smokers=pd.DataFrame.sum((data['Smoker']=='No')& (data['Categorie d\\'age']=='D'))\n",
"D_total=D_smokers+D_non_smokers\n",
"\n",
"D_deaths_smokers=pd.DataFrame.sum((data['Smoker']=='Yes')&(data['Status']=='Dead')&(data['Categorie d\\'age']=='D'))\n",
"D_death_rate_smokers=D_deaths_smokers/D_smokers\n",
"D_deaths_non_smokers=pd.DataFrame.sum((data['Smoker']=='No')&(data['Status']=='Dead')&(data['Categorie d\\'age']=='D'))\n",
"D_death_rate_non_smokers=D_deaths_non_smokers/D_non_smokers\n",
"D_death_rate_total=(D_deaths_smokers+D_deaths_non_smokers)/D_total\n",
"\n",
"\n",
"print(' Fumeurs Non-fumeurs Total')\n",
"print(' ------------------------------------')\n",
"print('Taille du groupe ',D_smokers,' ',D_non_smokers,' ',D_total)\n",
"print('Vivant ',D_smokers-D_deaths_smokers,' ',D_non_smokers-D_deaths_non_smokers,' ',D_total-D_deaths_smokers-D_deaths_non_smokers)\n",
"print('Mort ',D_deaths_smokers,' ',D_deaths_non_smokers,' ',D_deaths_smokers+D_deaths_non_smokers)\n",
"print('Mortalité ',round(D_death_rate_smokers,3),' ',round(D_death_rate_non_smokers,3),' ',round(D_death_rate_total,3))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Analyse"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
""
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"labels = ['18-34', '35-54', '55-64', '65+']\n",
"nn_smkers_dth_rt = [A_death_rate_non_smokers,B_death_rate_non_smokers,C_death_rate_non_smokers,D_death_rate_non_smokers]\n",
"nn_smkers_dth_rt = [round(num, 2) for num in nn_smkers_dth_rt]\n",
"smkers_dth_rt = [A_death_rate_smokers,B_death_rate_smokers,C_death_rate_smokers,D_death_rate_smokers]\n",
"smkers_dth_rt = [round(num, 2) for num in smkers_dth_rt]\n",
"\n",
"x = np.arange(len(labels)) # the label locations\n",
"width = 0.35 # the width of the bars\n",
"\n",
"fig, ax = plt.subplots()\n",
"rects1 = ax.bar(x - width/2, nn_smkers_dth_rt, width, label='Non-fumeur')\n",
"rects2 = ax.bar(x + width/2, smkers_dth_rt, width, label='Fumeur')\n",
"\n",
"# Add some text for labels, title and custom x-axis tick labels, etc.\n",
"ax.set_ylabel('Taux de mortalité')\n",
"ax.set_title('Taux de mortalité par tabagisme et catégorie d\\'âge')\n",
"ax.set_xticks(x)\n",
"ax.set_xticklabels(labels)\n",
"ax.legend()\n",
"\n",
"\n",
"def autolabel(rects):\n",
" \n",
" for rect in rects:\n",
" height = rect.get_height()\n",
" ax.annotate('{}'.format(height),\n",
" xy=(rect.get_x() + rect.get_width() / 2, height),\n",
" xytext=(0, 3), # 3 points vertical offset\n",
" textcoords=\"offset points\",\n",
" ha='center', va='bottom')\n",
"\n",
"\n",
"autolabel(rects1)\n",
"autolabel(rects2)\n",
"\n",
"fig.tight_layout()\n",
"\n",
"plt.show()"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
""
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"labels = ['18-34', '35-54', '55-64', '65+']\n",
"nn_smkrs = [A_non_smokers,B_non_smokers,C_non_smokers,D_non_smokers]\n",
"smkrs = [A_smokers,B_smokers,C_smokers,D_smokers]\n",
"\n",
"x = np.arange(len(labels)) # the label locations\n",
"width = 0.35 # the width of the bars\n",
"\n",
"fig, ax = plt.subplots()\n",
"rects1 = ax.bar(x - width/2, nn_smkrs, width, label='Non-fumeur')\n",
"rects2 = ax.bar(x + width/2, smkrs, width, label='Fumeur')\n",
"\n",
"# Add some text for labels, title and custom x-axis tick labels, etc.\n",
"ax.set_ylabel('Taille du groupe')\n",
"ax.set_title('Répartition du tabagisme en fonction de la catégorie d\\'âge')\n",
"ax.set_xticks(x)\n",
"ax.set_xticklabels(labels)\n",
"ax.legend()\n",
"\n",
"\n",
"def autolabel(rects):\n",
" \n",
" for rect in rects:\n",
" height = rect.get_height()\n",
" ax.annotate('{}'.format(height),\n",
" xy=(rect.get_x() + rect.get_width() / 2, height),\n",
" xytext=(0, 3), # 3 points vertical offset\n",
" textcoords=\"offset points\",\n",
" ha='center', va='bottom')\n",
"\n",
"\n",
"autolabel(rects1)\n",
"autolabel(rects2)\n",
"\n",
"fig.tight_layout()\n",
"\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Ces deux graphique mettent en évidences plusieurs choses :\n",
" - le taux de mortalité à 20 ans est très dépendant de l'âge (ce qui après réflexion semble évident),\n",
" - la proportion de fumeur dépend de l'âge,\n",
" - pour chaque catégorie d'âge la mortalité des fumeurs est plus importantes que celles des non-fumeurs."
]
},
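{
"cell_type": "markdown",
"metadata": {},
"source": [
"The arithmetic behind the reversal can be checked directly: each pooled death rate is the average of the per-bracket rates, weighted by the share of the group that falls in each bracket. The sketch below uses only the counts printed in the four tables above; the variable names (`size_smokers`, `pooled_rates`, and so on) are illustrative and not part of the original analysis."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"# Counts copied from the per-bracket tables, in the order 18-34, 35-54, 55-64, 65+\n",
"size_smokers = np.array([189, 229, 115, 49])\n",
"dead_smokers = np.array([7, 39, 51, 42])\n",
"size_non_smokers = np.array([227, 191, 121, 193])\n",
"dead_non_smokers = np.array([6, 19, 40, 165])\n",
"\n",
"pooled_rates = {}\n",
"for name, size, dead in [('Smokers', size_smokers, dead_smokers),\n",
"                         ('Non-smokers', size_non_smokers, dead_non_smokers)]:\n",
"    rates = dead / size          # death rate within each age bracket\n",
"    weights = size / size.sum()  # share of the group in each age bracket\n",
"    # Weighted average of the bracket rates; equals dead.sum() / size.sum()\n",
"    pooled_rates[name] = (weights * rates).sum()\n",
"    print(name, 'age shares:', weights.round(2), 'pooled death rate:', round(pooled_rates[name], 3))"
]
},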
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Régression logistique"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Afin d'éviter un biais induit par des regroupements en tranches d'âges arbitraires et non régulières, on réalise une régression logistique. Pour cela on introduit la variable Death qui vaut 1 si l'individu est décédé dans la période de 20 ans, 0 sinon."
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [],
"source": [
"data['Death']=0\n",
"data.loc[data['Status']=='Dead','Death'] = 1\n"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [],
"source": [
"x=data['Age']\n",
"x=sm.add_constant(x)\n",
"y=data['Death']"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Optimization terminated successfully.\n",
" Current function value: 0.382339\n",
" Iterations 7\n"
]
},
{
"data": {
"text/html": [
"\n",
"Logit Regression Results \n",
"\n",
" Dep. Variable: Death No. Observations: 1314 \n",
" \n",
"\n",
" Model: Logit Df Residuals: 1312 \n",
" \n",
"\n",
" Method: MLE Df Model: 1 \n",
" \n",
"\n",
" Date: Tue, 28 Apr 2020 Pseudo R-squ.: 0.3560 \n",
" \n",
"\n",
" Time: 22:33:45 Log-Likelihood: -502.39 \n",
" \n",
"\n",
" converged: True LL-Null: -780.16 \n",
" \n",
"\n",
" LLR p-value: 7.883e-123 \n",
" \n",
"
\n",
"\n",
"\n",
" coef std err z P>|z| [0.025 0.975] \n",
" \n",
"\n",
" const -6.1045 0.321 -18.992 0.000 -6.735 -5.475 \n",
" \n",
"\n",
" Age 0.0977 0.006 17.578 0.000 0.087 0.109 \n",
" \n",
"
"
],
"text/plain": [
"\n",
"\"\"\"\n",
" Logit Regression Results \n",
"==============================================================================\n",
"Dep. Variable: Death No. Observations: 1314\n",
"Model: Logit Df Residuals: 1312\n",
"Method: MLE Df Model: 1\n",
"Date: Tue, 28 Apr 2020 Pseudo R-squ.: 0.3560\n",
"Time: 22:33:45 Log-Likelihood: -502.39\n",
"converged: True LL-Null: -780.16\n",
" LLR p-value: 7.883e-123\n",
"==============================================================================\n",
" coef std err z P>|z| [0.025 0.975]\n",
"------------------------------------------------------------------------------\n",
"const -6.1045 0.321 -18.992 0.000 -6.735 -5.475\n",
"Age 0.0977 0.006 17.578 0.000 0.087 0.109\n",
"==============================================================================\n",
"\"\"\""
]
},
"execution_count": 30,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"model = sm.Logit(y, x)\n",
"result = model.fit(method='newton')\n",
"result.summary()\n"
]
},
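{
"cell_type": "markdown",
"metadata": {},
"source": [
"To read the fitted model, the two coefficients from the summary above can be plugged into the logistic function, `p(age) = 1 / (1 + exp(-(const + coef_age * age)))`. The sketch below hard-codes the rounded estimates shown in the table; the function name `death_probability` and the sample ages are illustrative choices, not part of the original analysis."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"# Rounded estimates copied from the summary table above\n",
"const, coef_age = -6.1045, 0.0977\n",
"\n",
"\n",
"def death_probability(age):\n",
"    # 20-year death probability predicted by the age-only logit model\n",
"    return 1.0 / (1.0 + np.exp(-(const + coef_age * age)))\n",
"\n",
"\n",
"# Each additional year of age multiplies the odds of dying by exp(coef_age)\n",
"odds_ratio_per_year = np.exp(coef_age)\n",
"print('Odds ratio per year of age:', round(odds_ratio_per_year, 3))\n",
"\n",
"for age in (25, 45, 65, 85):\n",
"    print('Age', age, '-> predicted death probability', round(death_probability(age), 3))"
]
},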
{
"cell_type": "code",
"execution_count": 39,
"metadata": {},
"outputs": [
{
"ename": "KeyError",
"evalue": "'Frequency'",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mKeyError\u001b[0m Traceback (most recent call last)",
"\u001b[0;32m/opt/conda/lib/python3.6/site-packages/pandas/core/indexes/base.py\u001b[0m in \u001b[0;36mget_loc\u001b[0;34m(self, key, method, tolerance)\u001b[0m\n\u001b[1;32m 2524\u001b[0m \u001b[0;32mtry\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 2525\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_engine\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget_loc\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 2526\u001b[0m \u001b[0;32mexcept\u001b[0m \u001b[0mKeyError\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;32mpandas/_libs/index.pyx\u001b[0m in \u001b[0;36mpandas._libs.index.IndexEngine.get_loc\u001b[0;34m()\u001b[0m\n",
"\u001b[0;32mpandas/_libs/index.pyx\u001b[0m in \u001b[0;36mpandas._libs.index.IndexEngine.get_loc\u001b[0;34m()\u001b[0m\n",
"\u001b[0;32mpandas/_libs/hashtable_class_helper.pxi\u001b[0m in \u001b[0;36mpandas._libs.hashtable.PyObjectHashTable.get_item\u001b[0;34m()\u001b[0m\n",
"\u001b[0;32mpandas/_libs/hashtable_class_helper.pxi\u001b[0m in \u001b[0;36mpandas._libs.hashtable.PyObjectHashTable.get_item\u001b[0;34m()\u001b[0m\n",
"\u001b[0;31mKeyError\u001b[0m: 'Frequency'",
"\nDuring handling of the above exception, another exception occurred:\n",
"\u001b[0;31mKeyError\u001b[0m Traceback (most recent call last)",
"\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m\u001b[0m\n\u001b[1;32m 2\u001b[0m \u001b[0mdata_pred\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'Frequency'\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mresult\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mpredict\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mdata_pred\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'Constant'\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m'Age'\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 3\u001b[0m \u001b[0mdata_pred\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mplot\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mx\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m\"Age\"\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0my\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m\"Frequency\"\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0mkind\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m\"line\"\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0mylim\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;36m0\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;36m1\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 4\u001b[0;31m \u001b[0mplt\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mscatter\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mx\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mdata\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m\"Age\"\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0my\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mdata\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m\"Frequency\"\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 5\u001b[0m \u001b[0mplt\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mgrid\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;32mTrue\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;32m/opt/conda/lib/python3.6/site-packages/pandas/core/frame.py\u001b[0m in \u001b[0;36m__getitem__\u001b[0;34m(self, key)\u001b[0m\n\u001b[1;32m 2137\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_getitem_multilevel\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 2138\u001b[0m \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 2139\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_getitem_column\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 2140\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 2141\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0m_getitem_column\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;32m/opt/conda/lib/python3.6/site-packages/pandas/core/frame.py\u001b[0m in \u001b[0;36m_getitem_column\u001b[0;34m(self, key)\u001b[0m\n\u001b[1;32m 2144\u001b[0m \u001b[0;31m# get column\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 2145\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcolumns\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mis_unique\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 2146\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_get_item_cache\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 2147\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 2148\u001b[0m \u001b[0;31m# duplicate columns & possible reduce dimensionality\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;32m/opt/conda/lib/python3.6/site-packages/pandas/core/generic.py\u001b[0m in \u001b[0;36m_get_item_cache\u001b[0;34m(self, item)\u001b[0m\n\u001b[1;32m 1840\u001b[0m \u001b[0mres\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mcache\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mitem\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1841\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mres\u001b[0m \u001b[0;32mis\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 1842\u001b[0;31m \u001b[0mvalues\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_data\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mitem\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 1843\u001b[0m \u001b[0mres\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_box_item_values\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mitem\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mvalues\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1844\u001b[0m \u001b[0mcache\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mitem\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mres\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;32m/opt/conda/lib/python3.6/site-packages/pandas/core/internals.py\u001b[0m in \u001b[0;36mget\u001b[0;34m(self, item, fastpath)\u001b[0m\n\u001b[1;32m 3841\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 3842\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0misna\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mitem\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 3843\u001b[0;31m \u001b[0mloc\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mitems\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget_loc\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mitem\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 3844\u001b[0m \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 3845\u001b[0m \u001b[0mindexer\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mnp\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0marange\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mlen\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mitems\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0misna\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mitems\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;32m/opt/conda/lib/python3.6/site-packages/pandas/core/indexes/base.py\u001b[0m in \u001b[0;36mget_loc\u001b[0;34m(self, key, method, tolerance)\u001b[0m\n\u001b[1;32m 2525\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_engine\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget_loc\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 2526\u001b[0m \u001b[0;32mexcept\u001b[0m \u001b[0mKeyError\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 2527\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_engine\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget_loc\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_maybe_cast_indexer\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 2528\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 2529\u001b[0m \u001b[0mindexer\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget_indexer\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mkey\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mmethod\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mmethod\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mtolerance\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mtolerance\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;32mpandas/_libs/index.pyx\u001b[0m in \u001b[0;36mpandas._libs.index.IndexEngine.get_loc\u001b[0;34m()\u001b[0m\n",
"\u001b[0;32mpandas/_libs/index.pyx\u001b[0m in \u001b[0;36mpandas._libs.index.IndexEngine.get_loc\u001b[0;34m()\u001b[0m\n",
"\u001b[0;32mpandas/_libs/hashtable_class_helper.pxi\u001b[0m in \u001b[0;36mpandas._libs.hashtable.PyObjectHashTable.get_item\u001b[0;34m()\u001b[0m\n",
"\u001b[0;32mpandas/_libs/hashtable_class_helper.pxi\u001b[0m in \u001b[0;36mpandas._libs.hashtable.PyObjectHashTable.get_item\u001b[0;34m()\u001b[0m\n",
"\u001b[0;31mKeyError\u001b[0m: 'Frequency'"
]
},
{
"data": {
"image/png": "\n",
"text/plain": [
""
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
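"# Prediction grid: ages from 18 to 100, plus the constant (intercept) column expected by the model\n",
"data_pred = pd.DataFrame({'Age': np.linspace(start=18, stop=100, num=100), 'Constant': 1})\n",
"data_pred['Frequency'] = result.predict(data_pred[['Constant','Age']])\n",
"data_pred.plot(x=\"Age\",y=\"Frequency\",kind=\"line\",ylim=[0,1])\n",
"# Observed outcomes: `data` has no 'Frequency' column, so, assuming it still has the original 'Status'\n",
"# column and that the model predicts the probability of death, plot it encoded as 1 = Dead, 0 = Alive\n",
"plt.scatter(x=data[\"Age\"],y=(data[\"Status\"]==\"Dead\").astype(int))\n",
"plt.grid(True)"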
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.4"
}
},
"nbformat": 4,
"nbformat_minor": 2
}