"plt.title(\"Number and percentage of alive/dead women after 20 years, according to smoking habits\")\n",
"plt.xlabel(\"Status\")\n",
"plt.ylabel(\"Smoker\")\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It is possible to see that the fraction of smokers and non smokers is quite balanced (in total, 582 smokers and 732 non smokers). As expected, there are less dead than alive people (369 versus 945).\n",
"\n",
"We can then compute the mortality rate for the two groups. For a population proportion $p$, confidence intervals are computed as $\\hat{p} \\pm z \\cdot \\sqrt{\\frac{\\hat{p}(1-\\hat{p})}{n}}$, where $\\hat{p}$ is the sample proportion, $n$ is the sample size and $z$ is the value derived from the standard normal distribution. For 95% confidence intervals, $z=1.96$."
]
},
{
"cell_type": "code",
"execution_count": 53,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Mortality rate for smokers:\t23.88% ± 3.46%\n",
"Mortality rate for non smokers:\t31.42% ± 3.36%\n"
"print(f\"Mortality rate for non smokers:\\t{rate_non_smokers:.2%} \" + u\"\\u00B1\" + f\" {ci_non_smokers:.2%}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Surprisingly, the mortality rate is sensibly higher for women categorized as non smokers. However, we are not taking into account an important information: the age of those people at the time of the poll. This result can be expected, for example, if the average age of polled non smokers was higher than the one of smokers.\n",
"\n",
"---\n",
"\n",
"Let's now include the age in the analysis. The following age classes are considered: 18-34 years, 35-54 years, 55-64 years, over 65 years."