"print(\"Number of rows with missing values:\", data.isnull().any(axis=1).sum())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's visualize the number on women alive and dead after twenty years, according to their smoking habits and age. Different colors correspond to different couples of smoking habits and status."
"ax.set_title(\"Number of alive/dead women after 20 years, according to their smoking habits and age\")\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"One can see that most women that were over 65 in 1972 are dead twenty years after and that, at that a great majority of polled older women were non-smokers. For the other age groups results look similar but are difficult to interpret.\n",
"\n",
"Let's therefore compute the mortality rates for smokers and non-smokers in different age groups."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Age group: 18-34 years\n",
"\tMortality rate for smokers:\t3.70% ± 2.69%\n",
"\tMortality rate for non smokers:\t2.64% ± 2.09%\n",
"\n",
"Age group: 35-54 years\n",
"\tMortality rate for smokers:\t17.03% ± 4.87%\n",
"\tMortality rate for non smokers:\t9.95% ± 4.24%\n",
"\n",
"Age group: 55-64 years\n",
"\tMortality rate for smokers:\t44.35% ± 9.08%\n",
"\tMortality rate for non smokers:\t33.06% ± 8.38%\n",
"\n",
"Age group: Over 65 years\n",
"\tMortality rate for smokers:\t85.71% ± 9.80%\n",
"\tMortality rate for non smokers:\t85.49% ± 4.97%\n",
" print(f\"\\tMortality rate for non smokers:\\t{rate_non_smokers:.2%} \" + u\"\\u00B1\" + f\" {ci_non_smokers:.2%}\")\n",
" \n",
"for group in sorted(data['Age group'].unique()):\n",
" ci_per_age(group)\n",
" print()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, the mortality rate is considerably higher for smokers, especially for people between 35 and 65 years old. This might seem like a contradiction, as before the rate was higher for non-smokers. However, from the previous bar chart it is clear that the percentage of polled smokers/non-smokers is different in different age groups, in particular for older women as mentioned. In addition, the fact that most polled women over 65 are non-smokers can be an argument in favor of the hypothesis that smoking is dangerous for health, but this can't be proven through statistics.\n",
"\n",
"---\n",
"\n",
"The age groups are fixed a-priori. In order to have more flexible results and reduce the introduced bias it is possible to try to perform a logistic regression, studying the probability of death in the two groups according to the age."