{ "cells": [ { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false }, "source": [ "# Smoking and mortality by Donato Tiano" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false }, "source": [ "#### Loading Dataset" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false }, "source": [ "As a first step, I downloaded the file and put it on GitLab. I modified the .csv file by removing the header.\n", "Then I downloaded the file via Python and printed every row of the dataset. I also created a DataFrame to organize the information better.\n", "The measurements extracted from the .csv are converted to float." ] }, { "cell_type": "code", "execution_count": 77, "metadata": { "hideCode": false, "hidePrompt": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " Smocker Status Age\n", "0 Yes Alive 21.0\n", "1 Yes Alive 19.3\n", "2 No Dead 57.5\n", "3 No Alive 47.1\n", "4 Yes Alive 81.4\n", "5 No Alive 36.8\n", "6 No Alive 23.8\n", "7 Yes Dead 57.5\n", "8 Yes Alive 24.8\n", "9 Yes Alive 49.5\n", "10 Yes Alive 30.0\n", "11 No Dead 66.0\n", "12 Yes Alive 49.2\n", "13 No Alive 58.4\n", "14 No Dead 60.6\n", "15 No Alive 25.1\n", "16 No Alive 43.5\n", "17 No Alive 27.1\n", "18 No Alive 58.3\n", "19 Yes Alive 65.7\n", "20 No Dead 73.2\n", "21 Yes Alive 38.3\n", "22 No Alive 33.4\n", "23 Yes Dead 62.3\n", "24 No Alive 18.0\n", "25 No Alive 56.2\n", "26 Yes Alive 59.2\n", "27 No Alive 25.8\n", "28 No Dead 36.9\n", "29 No Alive 20.2\n", "... ... ... 
...\n", "1284 Yes Dead 36.0\n", "1285 Yes Alive 48.3\n", "1286 No Alive 63.1\n", "1287 No Alive 60.8\n", "1288 Yes Dead 39.3\n", "1289 No Alive 36.7\n", "1290 No Alive 63.8\n", "1291 No Dead 71.3\n", "1292 No Alive 57.7\n", "1293 No Alive 63.2\n", "1294 No Alive 46.6\n", "1295 Yes Dead 82.4\n", "1296 Yes Alive 38.3\n", "1297 Yes Alive 32.7\n", "1298 No Alive 39.7\n", "1299 Yes Dead 60.0\n", "1300 No Dead 71.0\n", "1301 No Alive 20.5\n", "1302 No Alive 44.4\n", "1303 Yes Alive 31.2\n", "1304 Yes Alive 47.8\n", "1305 Yes Alive 60.9\n", "1306 No Dead 61.4\n", "1307 Yes Alive 43.0\n", "1308 No Alive 42.1\n", "1309 Yes Alive 35.9\n", "1310 No Alive 22.3\n", "1311 Yes Dead 62.1\n", "1312 No Dead 88.6\n", "1313 No Alive 39.1\n", "\n", "[1314 rows x 3 columns]\n" ] } ], "source": [ "import csv\n", "import requests\n", "import pandas as pd\n", "\n", "CSV_URL = 'https://gitlab.inria.fr/learninglab/mooc-rr/mooc-rr-ressources/-/raw/master/module3/Practical_session/Subject6_smoking.csv'\n", "\n", "# Download the CSV and parse it row by row\n", "with requests.Session() as s:\n", "    download = s.get(CSV_URL)\n", "    decoded_content = download.content.decode('utf-8')\n", "    cr = csv.reader(decoded_content.splitlines(), delimiter=',')\n", "    my_list = list(cr)\n", "\n", "datasetDict = {'Smocker': [], 'Status': [], 'Age': []}\n", "# Skip the first row of the file\n", "for row in my_list[1:]:\n", "    datasetDict['Smocker'].append(row[0])\n", "    datasetDict['Status'].append(row[1])\n", "    datasetDict['Age'].append(float(row[2]))  # ages are stored as floats\n", "\n", "df = pd.DataFrame(datasetDict, columns = ['Smocker', 'Status','Age'])\n", "print(df)" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false }, "source": [ "### Computation of Mortality of Smoker and Non-Smoker Women" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false }, "source": [ "I'm going to tabulate the total number of women alive and dead over the period according to their smoking habits." 
] }, { "cell_type": "code", "execution_count": 63, "metadata": { "hideCode": false, "hidePrompt": false }, "outputs": [], "source": [ "smockerWomen = df[df.Smocker == 'Yes']\n", "noSmockerWomen = df[df.Smocker == 'No']" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false }, "source": [ "First, I count the number of women in each group." ] }, { "cell_type": "code", "execution_count": 79, "metadata": { "hideCode": false, "hidePrompt": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of Smocker Women: 582\n", "Number of No Smocker Women: 732\n" ] } ], "source": [ "print(\"Number of Smocker Women: \" + str(len(list(smockerWomen.Status))))\n", "print(\"Number of No Smocker Women: \" + str(len(list(noSmockerWomen.Status))))" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false }, "source": [ "Next, I compute the mortality rate of each group: the number of women marked \"*Dead*\" in the group, divided by the total number of women in that group." ] }, { "cell_type": "code", "execution_count": 101, "metadata": { "hideCode": false, "hidePrompt": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Mortality rate of Smocker Women: 0.23883161512027493\n", "Mortality rate of No Smocker Women: 0.31420765027322406\n" ] } ], "source": [ "# Mortality rate = number of \"Dead\" entries / size of the group\n", "print(\"Mortality rate of Smocker Women: \" + str((list(smockerWomen.Status).count(\"Dead\"))/len(list(smockerWomen.Status))))\n", "print(\"Mortality rate of No Smocker Women: \" + str((list(noSmockerWomen.Status).count(\"Dead\"))/len(list(noSmockerWomen.Status))))\n" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false }, "source": [ "The mortality rate of the smokers is lower than that of the non-smokers, but the smoker group also contains fewer women." 
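, "\n", "\n", "As a worked check of the two rates above (a sketch: the death counts 139 and 230 are not printed by the notebook, they are back-calculated from the printed rates and group sizes):\n", "\n", "$$\\text{mortality rate} = \\frac{\\text{women marked Dead}}{\\text{women in the group}}, \\qquad \\frac{139}{582} \\approx 0.239 \\ \\text{(smokers)}, \\qquad \\frac{230}{732} \\approx 0.314 \\ \\text{(non-smokers)}$$"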
] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false }, "source": [ "### Mortality Rate by Age Group" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false }, "source": [ "I'm going to compute the mortality rate grouped by the age of the women. First, I split the smoker and non-smoker groups by age." ] }, { "cell_type": "code", "execution_count": 90, "metadata": { "hideCode": false, "hidePrompt": false }, "outputs": [], "source": [ "# Strict inequalities: ages of exactly 34, from 54 to 55, and from 64 to 65 fall into no group\n", "smockerWomen1834 = smockerWomen[smockerWomen.Age < 34]\n", "smockerWomen3454 = smockerWomen[(smockerWomen.Age > 34) & (smockerWomen.Age < 54)]\n", "smockerWomen5564 = smockerWomen[(smockerWomen.Age > 55) & (smockerWomen.Age < 64)]\n", "smockerWomen65 = smockerWomen[smockerWomen.Age > 65]\n", "\n", "noSmockerWomen1834 = noSmockerWomen[noSmockerWomen.Age < 34]\n", "noSmockerWomen3454 = noSmockerWomen[(noSmockerWomen.Age > 34) & (noSmockerWomen.Age < 54)]\n", "noSmockerWomen5564 = noSmockerWomen[(noSmockerWomen.Age > 55) & (noSmockerWomen.Age < 64)]\n", "noSmockerWomen65 = noSmockerWomen[noSmockerWomen.Age > 65]" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false }, "source": [ "Then the mortality rate can be shown for each combination of age group and smoking status." ] }, { "cell_type": "code", "execution_count": 98, "metadata": { "hideCode": false, "hidePrompt": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Mortality rate of Smocker Women 18-34: 0.027932960893854747\n", "Mortality rate of Smocker Women 35-54: 0.1729957805907173\n", "Mortality rate of Smocker Women 55-64: 0.43859649122807015\n", "Mortality rate of Smocker Women 65: 0.8571428571428571\n", "\n", "Mortality rate of No Smocker Women 18-34: 0.0273972602739726\n", "Mortality rate of No Smocker Women 35-54: 0.09547738693467336\n", "Mortality rate of No Smocker Women 55-64: 0.3277310924369748\n", "Mortality rate of No Smocker 
Women 65: 0.859375\n" ] } ], "source": [ "smockerWomen1834Mort = (list(smockerWomen1834.Status).count(\"Dead\"))/len(list(smockerWomen1834.Status))\n", "smockerWomen3454Mort = (list(smockerWomen3454.Status).count(\"Dead\"))/len(list(smockerWomen3454.Status))\n", "smockerWomen5564Mort = (list(smockerWomen5564.Status).count(\"Dead\"))/len(list(smockerWomen5564.Status))\n", "smockerWomen65Mort = (list(smockerWomen65.Status).count(\"Dead\"))/len(list(smockerWomen65.Status))\n", "\n", "noSmockerWomen1834Mort = (list(noSmockerWomen1834.Status).count(\"Dead\"))/len(list(noSmockerWomen1834.Status))\n", "noSmockerWomen3454Mort = (list(noSmockerWomen3454.Status).count(\"Dead\"))/len(list(noSmockerWomen3454.Status))\n", "noSmockerWomen5564Mort = (list(noSmockerWomen5564.Status).count(\"Dead\"))/len(list(noSmockerWomen5564.Status))\n", "noSmockerWomen65Mort = (list(noSmockerWomen65.Status).count(\"Dead\"))/len(list(noSmockerWomen65.Status))\n", "\n", "print(\"Mortality rate of Smocker Women 18-34: \" + str(smockerWomen1834Mort))\n", "print(\"Mortality rate of Smocker Women 35-54: \" + str(smockerWomen3454Mort))\n", "print(\"Mortality rate of Smocker Women 55-64: \" + str(smockerWomen5564Mort))\n", "print(\"Mortality rate of Smocker Women 65: \" + str(smockerWomen65Mort))\n", "print()\n", "print(\"Mortality rate of No Smocker Women 18-34: \" + str(noSmockerWomen1834Mort))\n", "print(\"Mortality rate of No Smocker Women 35-54: \" + str(noSmockerWomen3454Mort))\n", "print(\"Mortality rate of No Smocker Women 55-64: \" + str(noSmockerWomen5564Mort))\n", "print(\"Mortality rate of No Smocker Women 65: \" + str(noSmockerWomen65Mort))" ] }, { "cell_type": "markdown", "metadata": { "hideCode": false, "hidePrompt": false }, "source": [ "The graph below shows the mortality rate of each age group for smokers and non-smokers. The difference is most notable in the middle age groups, 35-54 and 55-64, where the mortality rate of the smokers is higher. 
" ] }, { "cell_type": "code", "execution_count": 100, "metadata": { "hideCode": false, "hidePrompt": false }, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# libraries\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", " \n", "# width of the bars\n", "barWidth = 0.2\n", "\n", "# Bars of Smocker Women\n", "bars1 = [smockerWomen1834Mort,smockerWomen3454Mort, smockerWomen5564Mort, smockerWomen65Mort]\n", "\n", "\n", "# Bars of No Smocker Women\n", "bars2 =[ noSmockerWomen1834Mort, noSmockerWomen3454Mort, noSmockerWomen5564Mort,noSmockerWomen65Mort]\n", "\n", " \n", "# The x position of bars\n", "r1 = np.arange(len(bars1))\n", "r2 = [x + barWidth for x in r1]\n", " \n", "# Create blue bars\n", "plt.bar(r1, bars1, width = barWidth, color = 'blue', edgecolor = 'black', capsize=7, label='Smockers')\n", " \n", "# Create cyan bars\n", "plt.bar(r2, bars2, width = barWidth, color = 'cyan', edgecolor = 'black', capsize=7, label='No Smockers')\n", " \n", "# general layout\n", "plt.xticks([r + barWidth for r in range(len(bars1))], ['18-34', '35-54', '54-65','65'])\n", "plt.ylabel('Rate Mortality')\n", "plt.legend()\n", " \n", "# Show graphic\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Computation of Logistic Regression" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "I've computed the linear regression, but the results appear similar." ] }, { "cell_type": "code", "execution_count": 162, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 162, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "import matplotlib.pyplot as plt\n", "import numpy as np\n", "from sklearn.linear_model import LogisticRegression\n", "from sklearn.metrics import classification_report, confusion_matrix\n", "from sklearn.preprocessing import LabelEncoder\n", "\n", "# print(smockerWomen5564)\n", "labelencoder = LabelEncoder()\n", "smockerWomen['Death'] = labelencoder.fit_transform(smockerWomen['Status'])\n", "noSmockerWomen['Death'] = labelencoder.fit_transform(noSmockerWomen['Status'])\n", "\n", "\n", "listApp = []\n", "\n", "listAppNoSmo = []\n", "\n", "for x in list(smockerWomen['Age']):\n", " listApp.append([x])\n", "\n", "for x in list(noSmockerWomen['Age']):\n", " listAppNoSmo.append([x])\n", "\n", " \n", "\n", "model = LogisticRegression(solver='liblinear', random_state=0).fit(listApp,list(smockerWomen['Death']))\n", "modelNoSm = LogisticRegression(solver='liblinear', random_state=0).fit(listAppNoSmo,list(noSmockerWomen['Death']))\n", "\n", "resSmok = model.predict_proba(listApp)\n", "resNoSmok = model.predict_proba(listAppNoSmo)\n", "\n", "# print(resSmok[0][0])\n", "dictSmokeRate = {'Age':[],'SurviveSmoke':[]}\n", "for x in range(0,len(listApp)):\n", " dictSmokeRate['Age'].append(listApp[x][0])\n", " dictSmokeRate['SurviveSmoke'].append(resSmok[x][0])\n", "\n", " \n", "\n", "dictNoSmokeRate = {'Age':[],'SurviveNoSmoke':[]}\n", "for x in range(0,len(listAppNoSmo)):\n", " dictNoSmokeRate['Age'].append(listAppNoSmo[x][0])\n", " dictNoSmokeRate['SurviveNoSmoke'].append(resNoSmok[x][0])\n", " \n", "dfSmoke = pd.DataFrame(dictSmokeRate, columns = ['Age','SurviveSmoke'])\n", "dfSmoke = dfSmoke.sort_values(by=['Age'])\n", "\n", "# print(dictNoSmokeRate)\n", "\n", "dfNoSmoke = pd.DataFrame(dictNoSmokeRate, columns = ['Age','SurviveNoSmoke'])\n", "dfNoSmoke = dfNoSmoke.sort_values(by=['Age'])\n", "\n", "ax = dfSmoke.plot(x='Age')\n", "dfNoSmoke.plot(x='Age',ax=ax)\n" ] } ], "metadata": 
{ "hide_code_all_hidden": false, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.4" } }, "nbformat": 4, "nbformat_minor": 2 }