progress on peer evaluated exercise

parent 704ad9c6
......@@ -9,7 +9,7 @@
},
{
"cell_type": "code",
"execution_count": 10,
"execution_count": 49,
"metadata": {},
"outputs": [],
"source": [
......@@ -29,7 +29,7 @@
},
{
"cell_type": "code",
"execution_count": 2,
"execution_count": 50,
"metadata": {},
"outputs": [
{
......@@ -109,7 +109,7 @@
"4 Yes Alive 81.4"
]
},
"execution_count": 2,
"execution_count": 50,
"metadata": {},
"output_type": "execute_result"
}
......@@ -130,9 +130,9 @@
},
{
"cell_type": "code",
"execution_count": 3,
"execution_count": 51,
"metadata": {
"scrolled": true
"scrolled": false
},
"outputs": [
{
......@@ -151,21 +151,23 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"\n",
"Let's visualize the number of women alive and dead after twenty years, according to their smoking habits. A heatmap is effective in this case."
]
},
{
"cell_type": "code",
"execution_count": 71,
"execution_count": 52,
"metadata": {
"scrolled": false
},
"outputs": [
{
"data": {
"image/png": "\n",
"image/png": "\n",
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
"<Figure size 720x576 with 1 Axes>"
]
},
"metadata": {
......@@ -179,13 +181,383 @@
"count = np.reshape(count, (2, 2))\n",
"annots = np.array([f\"{v}\\n{v/len(data):.2%}\" for v in count.flatten()]).reshape(2,2)\n",
"\n",
"sns.heatmap(count, annot=annots, fmt=\"\", cmap='Blues', cbar=False,\n",
" xticklabels=['Alive', 'Dead'], yticklabels=['No', 'Yes'])\n",
"plt.figure(figsize=(10,8))\n",
"sns.heatmap(count, annot=annots, fmt=\"\", cmap='Blues', cbar=False, square=True,\n",
" xticklabels=['Alive', 'Dead'], yticklabels=['No', 'Yes'], annot_kws={\"fontsize\": 25})\n",
"plt.title(\"Number and percentage of alive/dead women after 20 years, according to smoking habits\")\n",
"plt.xlabel(\"Status\")\n",
"plt.ylabel(\"Smoker\")\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The heatmap shows that smokers and non-smokers are fairly balanced in number (582 smokers and 732 non-smokers in total). As expected, fewer women are dead than alive (369 versus 945).\n",
"\n",
"We can then compute the mortality rate for the two groups. For a population proportion $p$, the normal-approximation confidence interval is computed as $\\hat{p} \\pm z \\cdot \\sqrt{\\frac{\\hat{p}(1-\\hat{p})}{n}}$, where $\\hat{p}$ is the sample proportion, $n$ is the sample size and $z$ is the critical value of the standard normal distribution for the chosen confidence level. For a 95% confidence interval, $z \\approx 1.96$."
]
},
{
"cell_type": "code",
"execution_count": 53,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Mortality rate for smokers:\t23.88% ± 3.46%\n",
"Mortality rate for non smokers:\t31.42% ± 3.36%\n"
]
}
],
"source": [
"z = 1.96\n",
"\n",
"num_smokers = sum(data['Smoker'] == \"Yes\")\n",
"num_dead_smokers = sum(np.logical_and(data['Smoker'] == \"Yes\", data['Status'] == \"Dead\"))\n",
"rate_smokers = num_dead_smokers / num_smokers\n",
"ci_smokers = z * (rate_smokers * (1 - rate_smokers) / num_smokers) ** 0.5\n",
"print(f\"Mortality rate for smokers:\\t{rate_smokers:.2%} \" + u\"\\u00B1\" + f\" {ci_smokers:.2%}\")\n",
"\n",
"num_non_smokers = len(data) - num_smokers\n",
"num_dead_non_smokers = sum(np.logical_and(data['Smoker'] == \"No\", data['Status'] == \"Dead\"))\n",
"rate_non_smokers = num_dead_non_smokers / num_non_smokers\n",
"ci_non_smokers = z * (rate_non_smokers * (1 - rate_non_smokers) / num_non_smokers) ** 0.5\n",
"print(f\"Mortality rate for non smokers:\\t{rate_non_smokers:.2%} \" + u\"\\u00B1\" + f\" {ci_non_smokers:.2%}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Surprisingly, the mortality rate is noticeably higher for women categorized as non-smokers. However, we are not taking into account an important piece of information: the age of these women at the time of the poll. This result would be expected, for example, if the polled non-smokers were older on average than the smokers.\n",
"\n",
"---\n",
"\n",
"Let's now include age in the analysis, using the following age classes: 18-34 years, 35-54 years, 55-64 years, and over 65 years."
]
},
{
"cell_type": "code",
"execution_count": 54,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Smoker</th>\n",
" <th>Status</th>\n",
" <th>Age</th>\n",
" <th>Binned age</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Yes</td>\n",
" <td>Alive</td>\n",
" <td>21.0</td>\n",
" <td>18-34 years</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Yes</td>\n",
" <td>Alive</td>\n",
" <td>19.3</td>\n",
" <td>18-34 years</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>No</td>\n",
" <td>Dead</td>\n",
" <td>57.5</td>\n",
" <td>55-64 years</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>No</td>\n",
" <td>Alive</td>\n",
" <td>47.1</td>\n",
" <td>35-54 years</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Yes</td>\n",
" <td>Alive</td>\n",
" <td>81.4</td>\n",
" <td>Over 65 years</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Smoker Status Age Binned age\n",
"0 Yes Alive 21.0 18-34 years\n",
"1 Yes Alive 19.3 18-34 years\n",
"2 No Dead 57.5 55-64 years\n",
"3 No Alive 47.1 35-54 years\n",
"4 Yes Alive 81.4 Over 65 years"
]
},
"execution_count": 54,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"def bin_age(age):\n",
" if age < 18:\n",
" return None\n",
" if age < 35:\n",
" return \"18-34 years\"\n",
" elif age < 55:\n",
" return \"35-54 years\"\n",
" elif age < 65:\n",
" return \"55-64 years\"\n",
" else:\n",
" return \"Over 65 years\"\n",
"\n",
"data['Binned age'] = data['Age'].apply(bin_age)\n",
"data.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Again, let's check that no missing values are present. Since `bin_age` returns `None` for ages under 18, a missing value in the new column would mean that a woman under 18 was polled."
]
},
{
"cell_type": "code",
"execution_count": 55,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Number of rows with missing values: 0\n"
]
}
],
"source": [
"print(\"Number of rows with missing values:\", data.isnull().any(axis=1).sum())"
]
},
{
"cell_type": "code",
"execution_count": 56,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[221],\n",
" [ 6],\n",
" [182],\n",
" [ 7],\n",
" [172],\n",
" [ 19],\n",
" [190],\n",
" [ 39],\n",
" [ 81],\n",
" [ 40],\n",
" [ 64],\n",
" [ 51],\n",
" [ 28],\n",
" [165],\n",
" [ 7],\n",
" [ 42]])"
]
},
"execution_count": 56,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"np.array(data.groupby(['Binned age', 'Smoker', 'Status']).count())"
]
},
{
"cell_type": "code",
"execution_count": 57,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th>Age</th>\n",
" </tr>\n",
" <tr>\n",
" <th>Binned age</th>\n",
" <th>Smoker</th>\n",
" <th>Status</th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th rowspan=\"4\" valign=\"top\">18-34 years</th>\n",
" <th rowspan=\"2\" valign=\"top\">No</th>\n",
" <th>Alive</th>\n",
" <td>221</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Dead</th>\n",
" <td>6</td>\n",
" </tr>\n",
" <tr>\n",
" <th rowspan=\"2\" valign=\"top\">Yes</th>\n",
" <th>Alive</th>\n",
" <td>182</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Dead</th>\n",
" <td>7</td>\n",
" </tr>\n",
" <tr>\n",
" <th rowspan=\"4\" valign=\"top\">35-54 years</th>\n",
" <th rowspan=\"2\" valign=\"top\">No</th>\n",
" <th>Alive</th>\n",
" <td>172</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Dead</th>\n",
" <td>19</td>\n",
" </tr>\n",
" <tr>\n",
" <th rowspan=\"2\" valign=\"top\">Yes</th>\n",
" <th>Alive</th>\n",
" <td>190</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Dead</th>\n",
" <td>39</td>\n",
" </tr>\n",
" <tr>\n",
" <th rowspan=\"4\" valign=\"top\">55-64 years</th>\n",
" <th rowspan=\"2\" valign=\"top\">No</th>\n",
" <th>Alive</th>\n",
" <td>81</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Dead</th>\n",
" <td>40</td>\n",
" </tr>\n",
" <tr>\n",
" <th rowspan=\"2\" valign=\"top\">Yes</th>\n",
" <th>Alive</th>\n",
" <td>64</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Dead</th>\n",
" <td>51</td>\n",
" </tr>\n",
" <tr>\n",
" <th rowspan=\"4\" valign=\"top\">Over 65 years</th>\n",
" <th rowspan=\"2\" valign=\"top\">No</th>\n",
" <th>Alive</th>\n",
" <td>28</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Dead</th>\n",
" <td>165</td>\n",
" </tr>\n",
" <tr>\n",
" <th rowspan=\"2\" valign=\"top\">Yes</th>\n",
" <th>Alive</th>\n",
" <td>7</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Dead</th>\n",
" <td>42</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Age\n",
"Binned age Smoker Status \n",
"18-34 years No Alive 221\n",
" Dead 6\n",
" Yes Alive 182\n",
" Dead 7\n",
"35-54 years No Alive 172\n",
" Dead 19\n",
" Yes Alive 190\n",
" Dead 39\n",
"55-64 years No Alive 81\n",
" Dead 40\n",
" Yes Alive 64\n",
" Dead 51\n",
"Over 65 years No Alive 28\n",
" Dead 165\n",
" Yes Alive 7\n",
" Dead 42"
]
},
"execution_count": 57,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data.groupby(['Binned age', 'Smoker', 'Status']).count()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Bar chart with one bar for each age class, divided into four parts (smoker or not, dead or alive)"
]
}
],
"metadata": {
......