title: 'Autour du Paradoxe de Simpson '
author: "Salem Zeiny"
output:
html_notebook: default
pdf_document: default
---

# Intro
In 1972-1974,a poll was conducted on 1/6th of the voters to shed light on thyroid and heart disease research (Tunbridge et al. 1977). 20 years later, a continuation of this study was carried (Vanderpump et al. 1995)

To simplify our analysis, we will restrict to females (N=1314) in either of the following categories (current smokers or never smokers)

# Analysis section 

## load library and set up working environment
Rversion 4.0.3 was used for this analysis 
```{r}
library(ggplot2)
library(dplyr)
library(broom)
library(tidyr)
```
## Load the data 
Data was downloaded from [github](https://gitlab.inria.fr/learninglab/mooc-rr/mooc-rr-ressources/blob/master/module3/Practical_session/Subject6_smoking.csv)

```{r}
data <- read.csv("C:/Users/zeinys/Desktop/module3_Practical_session_Subject6_smoking.csv", h=T)
# data checking 
head(data)
str(data) #1314 obs
table(data$Smoker, useNA = "always")
```

## Descriptive analysis
### Describe mortality status of women by smoking habits 
```{r}
scale_round <- function(x){
  x <- x*100
  x <- round(x,digits = 2)
}

#manual number check
with(data, table(Status, Smoker, useNA = "always"))

#creating a better table to summarize all results needed
data%>%group_by(Smoker)%>%summarise(N=n(), Alive=sum(Status=="Alive"), Dead=sum(Status=="Dead"), alive_per=Alive/N, dead_per=Dead/N, death_lower=dead_per-1.96*sqrt(dead_per*(1-dead_per)/N), death_upper=dead_per+1.96*sqrt(dead_per*(1-dead_per)/N))%>%mutate(across(c("alive_per","dead_per","death_lower","death_upper"), scale_round))%>%mutate(Alive = paste0(Alive, " (", alive_per ,"%)"), `Mortality Rate`=paste0(dead_per, "% (95%CI=", death_lower, ";",death_upper, ")"))%>%select(-c("alive_per", "death_lower","death_upper", "dead_per")) 

```
At baseline we had 732 non-smokers and 582 smokers, of wich 502 and 443 were still alive by the end of the study (respectively for non-smokers and smokers). Therefore the death rate in the smokers group 23.88% (95% CI: 20.42 to 27.35 %) and 31.4% (95% CI: 28.06 to 34.78 %) in the non-smoking group.

These results seem surprising as this study is in the continuity of a study assesing the role of smoking in heart disease and illnesses in the thyroid, therefore we would expect the mortality rates to be higher in the smoking group, as smoking is known to negatively impact the cardiovascular and the respiratory systems 

## Could it be confounding ?
First we will assess the variation in mortality rates by age groups regardless of smoking status.
We will now take into account age groups. We will use the following groups : 18-34, 34-54, 55-64 and 65+
```{r}
#check that the minimum age is 18
summary(data$Age) #Ok

data <- data%>%mutate(age_group=cut(Age, breaks=c(-Inf, 34, 54, 64,Inf), labels=c("18-34","35-54","55-64","65+")))

data%>%group_by(age_group)%>%summarise(N=n(), Alive=sum(Status=="Alive"), Dead=sum(Status=="Dead"), alive_per=Alive/N, dead_per=Dead/N, death_lower=dead_per-1.96*sqrt(dead_per*(1-dead_per)/N), death_upper=dead_per+1.96*sqrt(dead_per*(1-dead_per)/N))%>%mutate(across(c("alive_per","dead_per","death_lower","death_upper"), scale_round))%>%mutate(Alive = paste0(Alive, " (", alive_per ,"%)"), `Mortality Rate`=paste0(dead_per, "% (95%CI=", death_lower, ";",death_upper, ")"))%>%select(-c("alive_per", "death_lower","death_upper", "dead_per"))%>%ungroup()
```
Mortality rates are increasing with age as expected.

Age could be acting as a confounder, therefore we will assess mortality rates within each age group by smoking status

```{r}
##standrize
data%>%group_by(age_group, Smoker)%>%summarise(N=n(), Alive=sum(Status=="Alive"), Dead=sum(Status=="Dead"), alive_per=Alive/N, dead_per=Dead/N, death_lower=dead_per-1.96*sqrt(dead_per*(1-dead_per)/N), death_upper=dead_per+1.96*sqrt(dead_per*(1-dead_per)/N))%>%mutate(across(c("alive_per","dead_per","death_lower","death_upper"), scale_round))%>%mutate(Alive = paste0(Alive, " (", alive_per ,"%)"), `Mortality Rate`=paste0(dead_per, "% (95%CI=", death_lower, ";",death_upper, ")"))%>%select(-c("alive_per", "death_lower","death_upper", "dead_per"))#%>%pivot_wider(names_from = age_group, values_from=`Mortality Rate`)


#plotting 
scale100 <- function(x){x <- x*100}
data%>%group_by(age_group, Smoker)%>%summarise(N=n(), Alive=sum(Status=="Alive"), Dead=sum(Status=="Dead"), alive_per=Alive/N, dead_per=Dead/N, death_lower=dead_per-1.96*sqrt(dead_per*(1-dead_per)/N), death_upper=dead_per+1.96*sqrt(dead_per*(1-dead_per)/N))%>%mutate(across(c("alive_per","dead_per","death_lower","death_upper"),scale100))%>%ggplot(aes(x=age_group, y=dead_per, color=Smoker))+geom_pointrange(aes(ymin=death_lower, ymax=death_upper),position = position_dodge(width = 0.2))+labs(title = "Mortality rates by age group in smokers vs. non smokers", x="Age group", y="Morality rate (in %)")
```
In the youngest age group, we can see that mortality rates are low and not different between smokers and non-smokers. 
With increasing age groups (35 to 64) we can see an increase in the mortality rates , and a notable difference between current smokers and nover smokers mortality rates. 
Finally in the highest age group, smoking doesn't seem to have the same impact on mortality.
These results seem to make sense, as the mortality in the youngest age group is generally low, and the the one in the older age groups is generally high regardless of smoking status. People at older ages tend to develop illnesses that can or not relate to smoking status. It's an interesting finding that emphasizes on the impact of smoking in middle aged individuals.

## We will verify our findings using a regression analysis 
This will allow us to adjust on age (continuous variable), without losing information (as is the case when using categories)
```{r}
head(data)
#we will recode status variables as 0:alive and 1: dead (as dead is the outcome of interest)
data <- data%>%mutate(death=ifelse(Status=="Alive","0","1"))%>%mutate(death=as.factor(death))
#checks
with(data, table(death, Status))

mod1 <- glm(death~Smoker+Age, family="binomial", data=data)
tidy(mod1)


#estimates and CI for estimated risk 
ilink <- family(mod1)$linkinv

## add fit and se.fit on the **link** scale
ndata <- bind_cols(data, setNames(as_tibble(predict(mod1, data, se.fit = TRUE)[1:2]),
                                   c('fit_link','se_link')))
## create the interval and backtransform
ndata <- mutate(ndata,
                fit_resp  = ilink(fit_link),
                right_upr = ilink(fit_link + (2 * se_link)),
                right_lwr = ilink(fit_link - (2 * se_link)))
## show
ndata

ndata%>%mutate(death=as.numeric(as.character(death)))%>%select(Smoker, Age, death, fit_resp, right_upr, right_lwr)%>%pivot_longer(cols=c("death", "fit_resp"), names_to="value_from", values_to="risk")%>%mutate(right_upr=ifelse(value_from=="death", NA, as.numeric(right_upr)),  right_lwr=ifelse(value_from=="death",NA,as.numeric(right_lwr)))%>%ggplot(aes(x=Age, y=risk, color=Smoker, shape=value_from))+geom_point()+ scale_shape_discrete(name=element_blank(),
                          breaks=c("death", "fit_resp"),
                          labels=c("Observed risk of death", "Estimated risk of death"))+ylab("Risk of death")+labs(caption = "0: death not observed during followup \n 1: death during followup")+theme(plot.caption = element_text(hjust = 0.5))+geom_ribbon(aes(ymin = right_lwr, ymax = right_upr),alpha = 0.1)


```
This graph compares risk of death for every age according to smoking status.
As already explained, we can see that most of the deaths observed in our study at the middle aged group occurred in smokers, but necessarily the case for older age groups (risk=1).
After fitting a logistic linear regression adjusted for age, we estimate the risk of death of every participant. The estimates are depicted in the sigmoid function. We can see that throughout different ages, risk estimates for smoker are higher than for non-smokers.
The confidence intervals do overlap, and therefore statistically speaking reject the hypothesis of a negative impact of cigarette smoking found in this dataset.

However, it does question the safety of cigarette use and gives reason to further analysis and bigger studies to firmly conclude on impact of cigarette.