---
title: "smoker"
author: "abigail pickard"
date: "17/11/2020"
output: html_document
---

In 1972-1974, in Whickham, a town in the north-east of England located approximately 6.5 kilometres south-west of Newcastle upon Tyne, a survey of one-sixth of the electorate was conducted in order to inform work on thyroid and heart disease (Tunbridge et al. 1977). A continuation of this study was carried out twenty years later (Vanderpump et al. 1995). Some of the results were related to smoking and to whether individuals were still alive at the time of the second study. For simplicity, the data are restricted to women, and among these to the 1314 who were categorized as "smoking currently" or "never smoked". Relatively few women in the initial survey had smoked but since quit (162), and for very few was the information unavailable (18). Survival at 20 years was determined for all women of the first survey.

All these data are available in this file [CSV](module3/Practical_session/Subject6_smoking.csv). Each line indicates whether the person smokes or not, whether she was alive or dead at the time of the second study, and her age at the time of the first survey.

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

```{r}
data <- read.csv("https://gitlab.inria.fr/learninglab/mooc-rr/mooc-rr-ressources/-/raw/master/module3/Practical_session/Subject6_smoking.csv?inline=false")
```

## R Markdown

First let's get a general summary of the data.

```{r}
summary(data)
```

We also want to check for any missing data points and how our variables are defined.

```{r}
# Flag rows that contain at least one missing value, then display them
na_records <- apply(data, 1, function(x) any(is.na(x)))
data[na_records, ]
```

```{r}
class(data$Age)
class(data$Status)
class(data$Smoker)
```

Tabulate the total number of women alive and dead over the period according to their smoking habits.
Calculate in each group (smoking/non-smoking) the mortality rate (the ratio of the number of women who died in a group to the total number of women in that group).

So we create a crosstabulation of status and smoker/non-smoker.

```{r}
# Cross-tabulate smoking habit (rows) against survival status (columns)
mytable <- table(Smoker = data$Smoker, Status = data$Status)
mytable # print
```

Using this information we can then calculate the proportion of deaths in the two respective groups.

```{r}
# Smokers: 443 alive + 139 dead
443 + 139
139/582
```

This shows that 23.88 percent of smokers died within the 20 years.

```{r}
# Non-smokers: 502 alive + 230 dead
502 + 230
230/732
```

And 31.42 percent of non-smokers died within the 20 years. This is interesting, as it indicates that the crude mortality rate was higher for non-smokers than for smokers.

We create a simple bar plot to get an overview of the number of deaths versus the number of survivals.

```{r}
# Simple Bar Plot
counts <- table(data$Status)
barplot(counts, main="smokers", xlab="status")
```

We then graph these data split by smoker or non-smoker.

```{r}
# Stacked Bar Plot with Colors and Legend
counts <- table(data$Status, data$Smoker)
barplot(counts, main="Distribution by Status and Smokers", xlab="Smoker", col=c("darkblue","red"), legend = rownames(counts))
```

We then want to define age groups: 18-34 years, 34-54 years, 55-64 years, and over 65 years.

```{r}
library(dplyr)
# case_when() uses the first matching condition, so the thresholds are
# listed from oldest to youngest; this also covers non-integer ages
data <- data %>%
  mutate(agegroup = case_when(Age >= 65 ~ '4',
                              Age >= 55 ~ '3',
                              Age >= 34 ~ '2',
                              Age >= 18 ~ '1'))
```

We then tabulate, within each age group, the number of women alive and dead by smoking status; the mortality rate in each group is the ratio of the number of women who died to the total number of women in that group.

```{r}
mytable <- table(Smoker = data$Smoker, Status = data$Status, agegroup = data$agegroup)
mytable # print
```

In order to avoid a bias induced by arbitrary and non-regular age groupings, we perform a logistic regression.
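Before turning to the regression, the per-age-group mortality rates can be derived from the counts rather than read off the table by hand. The chunk below is only a sketch: `mortality_by_group` is a hypothetical helper name, not part of the original analysis, and it assumes a data frame with the columns `Smoker`, `Status` (containing the level "Dead") and `agegroup`.

```{r}
# Hypothetical helper (an assumption, not from the original analysis):
# share of "Dead" within each Smoker x agegroup cell.
mortality_by_group <- function(df) {
  counts <- table(Smoker = df$Smoker, agegroup = df$agegroup, Status = df$Status)
  # Normalise over Status within every Smoker/agegroup combination
  rates <- prop.table(counts, margin = c(1, 2))
  round(rates[, , "Dead"], 3)
}

# Only run on the survey data if it has been loaded and grouped already
if (is.data.frame(data) && "agegroup" %in% names(data)) {
  mortality_by_group(data)
}
```

Cells with no observations come out as `NaN`, which is a reminder that some age groups may contain too few women for a stable rate.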
We introduce a Death variable (called `Status_num` in the code), coded 1 or 0 to indicate whether the individual died during the 20-year period, in order to study the probability of death as a function of age according to whether one considers the group of smokers or non-smokers.

```{r}
# 1 = died during the 20-year follow-up, 0 = still alive
data$Status_num <- ifelse(data$Status == "Dead", 1, 0)
```

```{r}
xtabs(~Status_num + Smoker, data = data)
```

```{r}
data$Smoker <- factor(data$Smoker)
# Logistic regression of death on age and smoking status (main effects only)
mylogit <- glm(Status_num ~ Age + Smoker, data = data, family = "binomial")
```

```{r}
summary(mylogit)
```

```{r}
confint(mylogit)
```

Based on this regression model, we do not have enough evidence to conclude on the harmfulness of smoking. Age and smoking may still act together on mortality, but this model includes only their main effects (no `Age:Smoker` term), so it adjusts the smoking effect for age without testing an interaction directly.
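One possible way to check whether the data support an age-by-smoking interaction is to compare the additive model with one that adds the `Age:Smoker` term, using a likelihood-ratio test. This is a sketch under stated assumptions: `compare_interaction` is a hypothetical helper name, and the data frame is assumed to contain `Status_num`, `Age` and `Smoker` as defined in this document.

```{r}
# Hypothetical helper (an assumption, not from the original analysis):
# likelihood-ratio comparison of the additive and interaction models.
compare_interaction <- function(df) {
  additive <- glm(Status_num ~ Age + Smoker, data = df, family = binomial)
  interact <- glm(Status_num ~ Age * Smoker, data = df, family = binomial)
  # A small p-value in the second row would favour keeping Age:Smoker
  anova(additive, interact, test = "Chisq")
}

# Only run on the survey data if it has been loaded and recoded already
if (is.data.frame(data) && "Status_num" %in% names(data)) {
  compare_interaction(data)
}
```

The nested-model comparison is chosen here because both models are fitted to the same rows and differ by a single term, which is the setting where a likelihood-ratio test applies.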