We will be using the libraries ggplot2 and dplyr in this document.
```{r, warning=FALSE, message=FALSE}
library(dplyr)
library(ggplot2)
```
We are studying a dataset containing information gathered by two surveys conducted in 1977 and 1995 respectively.
One sixth of the electorate was surveyed, but the dataset used in this analysis is restricted to women and, for the sake of simplicity, to the 1314 of them who were categorized as "smoking currently" or "never smoked".
Very few women in the initial dataset were categorized differently (162 as "smoked but quit" and 18 for whom data was not available).
Each row of the dataset records whether the person smokes, whether they were alive at the time of the second survey, and their age at the time of the first one.
The two confidence intervals do not overlap, so we can say with more than 90% confidence that the mortality rate among smokers is lower than among non-smokers.
This is quite surprising, as we would normally expect the mortality rate among smokers to be higher.
However, this mortality rate does not take into account the cause of death.
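As a hedged illustration of the comparison above, the sketch below computes a mortality rate and a 90% confidence interval for each group. It is not the code that produced the intervals discussed here: the toy data frame `toy`, the helper `mortality_ci()`, and the normal (Wald) approximation are assumptions made for the example, as is the coding of `Death` as 1 for a death and 0 otherwise.

```{r}
library(dplyr)

# Hypothetical helper: mortality rate per group with a normal-approximation
# (Wald) confidence interval. `Death` is assumed to be coded 1 = dead, 0 = alive.
mortality_ci <- function(df, level = 0.90) {
  z <- qnorm(1 - (1 - level) / 2)  # two-sided critical value
  df %>%
    group_by(Smoker) %>%
    summarise(
      n = n(),
      rate = mean(Death),
      se = sqrt(rate * (1 - rate) / n),  # standard error of a proportion
      lower = rate - z * se,
      upper = rate + z * se
    )
}

# Toy data, invented for illustration only (not the survey data)
toy <- data.frame(
  Smoker = rep(c("No", "Yes"), times = c(100, 100)),
  Death  = c(rep(c(1, 0), times = c(30, 70)), rep(c(1, 0), times = c(20, 80)))
)

ci <- mortality_ci(toy)
ci
```

Two groups can then be compared by checking whether their `lower` to `upper` ranges overlap. Applied to this document's data frame, the call would be `mortality_ci(data)`, provided `data` has `Smoker` and `Death` columns coded as assumed.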
```{r}
# Number of old (65 years old or more) people in each group and the percentage of the group it represents
data %>% group_by(Smoker) %>% summarise(n_old = sum(Age >= 65), pct_old = 100 * mean(Age >= 65))
```
The result above shows that there are many more old people in the non-smoker group than in the smoker group.
The mortality rate in the non-smoker group may therefore be inflated by natural deaths.
### Splitting the groups into age classes
We split both groups into four age classes: 18 to 34 years old, 34 to 54, 54 to 65, and 65 and above (each lower bound is included and each upper bound excluded).
```{r}
# Assign each person to an age class (lower bound included, upper bound excluded)
data2 <- data %>%
  mutate(age_class = case_when(
    Age >= 18 & Age < 34 ~ "18-34",
    Age >= 34 & Age < 54 ~ "34-54",
    Age >= 54 & Age < 65 ~ "54-65",
    Age >= 65 ~ "65+"
  ))
```
We plot the data in the following graph:
```{r}
# One panel per (group, age class) pair, with one bar per survival status
ggplot(data2) +
  facet_grid(Smoker ~ age_class) +
  aes(x = Status, fill = factor(Status)) +
  geom_bar() +
  theme_bw() +
  labs(
    x = "", fill = "",
    title = "Number of deaths by age class",
    subtitle = "Between two surveys conducted in 1977 and 1995 (among the surveyed population)"
  )
```
The following table shows, for each class, the number of people in this class, the mortality rate, the standard deviation and the confidence interval.
The mortality rates for the classes 34-54 and 54-65 are much higher for smokers than for non-smokers. The other two classes have similar mortality rates in both groups (although the rate is still slightly higher for smokers).
However, the confidence intervals all overlap, so we cannot draw a firm conclusion here.
Simpson's paradox appears here: the conclusion suggested by this plot and table is the opposite of the one we reached without the age classes.
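The reversal described above can be reproduced on a small invented example. The sketch below is only an illustration: the counts in `counts` are made up (they are not the survey figures), and the two age classes are a simplification of the four used in this document. Smokers have the higher mortality rate within each age class, yet the lower rate overall, because most of the non-smokers in the toy data are old.

```{r}
library(dplyr)

# Invented counts, for illustration only (not the survey data)
counts <- data.frame(
  Smoker    = c("Yes", "No", "Yes", "No"),
  age_class = c("18-64", "18-64", "65+", "65+"),
  n         = c(80, 20, 20, 80),   # people in each cell
  deaths    = c(8, 1, 14, 48)      # deaths in each cell
)

# Mortality rate within each age class
by_class <- counts %>% mutate(rate = deaths / n)

# Mortality rate per group, ignoring age
overall <- counts %>%
  group_by(Smoker) %>%
  summarise(rate = sum(deaths) / sum(n))

by_class
overall
```

In this toy table the within-class rates are 10% against 5% and 70% against 60% in favour of non-smokers, while the aggregated rates are 22% for smokers and 49% for non-smokers.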
### Logistic regression
```{r, warning=F, message=F}
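# Logistic fit per group; geom_smooth() draws each curve with its confidence band
ggplot(data, aes(x = Age, y = Death, col = Smoker)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "glm", method.args = list(family = "binomial")) +
  labs(colour = "Group", title = "Probability of death as a function of age", subtitle = "In each group, between the two surveys")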
```
Looking at the curves, we see that the mortality rate in this sample is higher for smokers up to approximately 70 years old, and lower beyond that age (which is close to what we got in the previous section).
However, we still cannot conclude on the harmfulness of smoking, as the confidence intervals overlap everywhere.
With more measurements to narrow the confidence intervals, we might be able to say that the mortality rate is lower for non-smokers under the age of 50, where the intervals barely overlap.
It is likely that the dataset we are using does not contain the measurements we would need to settle this kind of question (for example, we could compute the life expectancy of smokers versus non-smokers if we had the dates of death).
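The curves and intervals discussed above can also be obtained numerically rather than read off a plot. The following is a minimal sketch under stated assumptions: the data frame `sim` is simulated (it is not the survey data), the coefficients used to generate it are arbitrary, and `Death` is assumed to be coded 1 for a death and 0 otherwise. The model `Death ~ Age * Smoker` gives each group its own intercept and slope, which corresponds to fitting one logistic curve per group; the 95% level is an assumption chosen to match the default of `geom_smooth()`.

```{r}
# Simulated data, invented for illustration only (not the survey data)
set.seed(1)
n <- 400
sim <- data.frame(
  Age    = runif(n, 18, 90),
  Smoker = factor(sample(c("No", "Yes"), n, replace = TRUE))
)
# Arbitrary "true" model, used only to generate the toy outcome
p <- plogis(-6 + 0.09 * sim$Age + 0.4 * (sim$Smoker == "Yes"))
sim$Death <- rbinom(n, size = 1, prob = p)

# Logistic regression with a separate age effect for each group
fit <- glm(Death ~ Age * Smoker, data = sim, family = binomial)

# Predicted probability of death with a 95% interval, computed on the
# link (log-odds) scale and then mapped back to probabilities
grid <- expand.grid(Age = seq(20, 90, by = 10), Smoker = c("No", "Yes"))
pred <- predict(fit, newdata = grid, type = "link", se.fit = TRUE)
grid$prob  <- plogis(pred$fit)
grid$lower <- plogis(pred$fit - 1.96 * pred$se.fit)
grid$upper <- plogis(pred$fit + 1.96 * pred$se.fit)
grid
```

Computing the interval on the link scale keeps both bounds between 0 and 1. Comparing the `lower` and `upper` columns of the two groups at a given age shows directly whether their intervals overlap there.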