diff --git a/module3/exo3/exercice_en.Rmd b/module3/exo3/exercice_en.Rmd
index 13b258ddd0da29bc3bf08c64b6a1db742f6d5409..6cab17cb0a5c1a44cb50ac2320988de8c0f0dcf3 100644
--- a/module3/exo3/exercice_en.Rmd
+++ b/module3/exo3/exercice_en.Rmd
@@ -1,33 +1,108 @@
 ---
-title: "Your title"
-author: "Your name"
-date: "Today's date"
+title: "smoker"
+author: "abigail pickard"
+date: "17/11/2020"
 output: html_document
 ---
+In 1972-1974, in Whickham, a town in the north-east of England, located approximately 6.5 kilometres south-west of Newcastle upon Tyne, a survey of one-sixth of the electorate was conducted in order to inform work on thyroid and heart disease (Tunbridge et al. 1977). A continuation of this study was carried out twenty years later (Vanderpump et al. 1995). Some of the results were related to smoking and whether individuals were still alive at the time of the second study. For the purpose of simplicity, the data are restricted to women, and among these to the 1314 who were categorized as "smoking currently" or "never smoked". There were relatively few women in the initial survey who smoked but had since quit (162) and very few for whom information was not available (18). Survival at 20 years was determined for all women of the first survey.
-
+All these data are available in this [CSV file](module3/Practical_session/Subject6_smoking.csv). Each line records whether the person smokes, whether she was alive or dead at the time of the second study, and her age at the time of the first survey.
 ```{r setup, include=FALSE}
 knitr::opts_chunk$set(echo = TRUE)
 ```
+```{r}
+data <- read.csv("https://gitlab.inria.fr/learninglab/mooc-rr/mooc-rr-ressources/-/raw/master/module3/Practical_session/Subject6_smoking.csv?inline=false")
+```
-## Some explanations
+## R Markdown
-This is an R Markdown document that you can easily export to HTML, PDF, and MS Word formats. For more information on R Markdown, see .
+First let's get a general summary of the data.
-When you click on the button **Knit**, the document will be compiled in order to re-execute the R code and to include the results into the final document. As we have shown in the video, R code is inserted as follows:
+```{r}
+summary(data)
+```
+We also want to check for any missing data points and how our variables are defined.
-```{r cars}
-summary(cars)
+```{r}
+# Flag every row that contains at least one missing value
+na_records <- apply(data, 1, function (x) any(is.na(x)))
+data[na_records,]
 ```
+```{r}
+class(data$Age)
+class(data$Status)
+class(data$Smoker)
+```
+Tabulate the total number of women alive and dead over the period according to their smoking habits. Calculate in each group (smoking/non-smoking) the mortality rate (the ratio of the number of women who died in a group to the total number of women in that group). To do so, we create a cross-tabulation of smoking habit and status.
-It is also straightforward to include figures. For example:
+```{r}
+# with() avoids attach(), which would leave the columns masking other objects
+mytable <- with(data, table(Smoker, Status))
+mytable # print
+```
+Using this information we can then calculate the proportion of deaths in the two respective groups.
-```{r pressure, echo=FALSE}
-plot(pressure)
+```{r}
+# Smokers: 443 alive and 139 dead
+443 + 139
+139/582
 ```
+This shows that 23.88 percent of smokers died within the 20 years.
+```{r}
+# Non-smokers: 502 alive and 230 dead
+502+230
+230/732
+```
+And 31.42 percent of non-smokers died within the 20 years.
+This is interesting as it indicates that the mortality rate for non-smokers was higher than for smokers.
-Note the parameter `echo = FALSE` that indicates that the code will not appear in the final version of the document. We recommend not to use this parameter in the context of this MOOC, because we want your data analyses to be perfectly transparent and reproducible.
+We created a simple bar plot to get an overview of the number of deaths versus the number of survivors.
+```{r}
+# Simple Bar Plot
+counts <- table(data$Status)
+barplot(counts, main="Number of women by status",
+ xlab="Status")
+```
+We then graphed these data by smoking status.
+```{r}
+# Stacked Bar Plot with Colors and Legend
+counts <- table(data$Status, data$Smoker)
+barplot(counts, main="Distribution by Status and Smokers",
+ xlab="Smoker", col=c("darkblue","red"),
+ legend.text = rownames(counts))
+```
+We then want to define age groups: 18-34 years, 35-54 years, 55-64 years, 65 years and over.
-Since the results are not stored in Rmd files, you should generate an HTML or PDF version of your exercises and commit them. Otherwise reading and checking your analysis will be difficult for anyone else but you.
+```{r}
+library(dplyr)
+# case_when() keeps the first matching condition, so testing the thresholds
+# from oldest to youngest leaves no gap or overlap between groups
+data <- data %>% mutate(agegroup = case_when(Age >= 65 ~ '4',
+ Age >= 55 ~ '3',
+ Age >= 35 ~ '2',
+ Age >= 18 ~ '1'))
+```
+
+We then cross-tabulated smoking habit and status within each age group, which is the basis for the mortality rate of each group (the ratio of the number of women who died in a group to the total number of women in that group).
+```{r}
+# with() avoids attaching the data frame a second time
+mytable <- with(data, table(Smoker, Status, agegroup))
+mytable # print
+```
+In order to avoid a bias induced by arbitrary and non-regular age groupings, we perform a logistic regression. We introduce a binary variable, `Status_num`, equal to 1 if the individual died during the 20-year period and 0 otherwise, and model the probability of death as a function of age and of whether the individual is a smoker or a non-smoker.
+```{r}
+data$Status_num <- ifelse(data$Status =="Dead", 1, 0)
+```
+
+```{r}
+xtabs(~Status_num + Smoker, data = data)
+```
+```{r}
+data$Smoker <- factor(data$Smoker)
+mylogit <- glm(Status_num ~ Age + Smoker, data = data, family = "binomial")
+```
+```{r}
+summary(mylogit)
+```
+```{r}
+confint(mylogit)
+```
-Now it's your turn! You can delete all this information and replace it by your computational document.
+Based on this regression model, we do not have enough evidence to conclude whether or not smoking is harmful.
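+
+The model above assumes that age and smoking act additively. One way to check whether the effect of age differs between smokers and non-smokers, sketched here under stated assumptions rather than as part of the original analysis, is to fit a second logistic model with an `Age * Smoker` term and compare the two with a likelihood-ratio test. The names `smoking`, `Death`, `fit_additive`, `fit_interaction` and `lrt` are hypothetical, and the chunk re-reads the CSV so that it can run on its own.
+
+```{r}
+# Assumption: same CSV as above, with columns Smoker, Status and Age
+smoking <- read.csv("https://gitlab.inria.fr/learninglab/mooc-rr/mooc-rr-ressources/-/raw/master/module3/Practical_session/Subject6_smoking.csv?inline=false")
+smoking$Death <- ifelse(smoking$Status == "Dead", 1, 0)
+smoking$Smoker <- factor(smoking$Smoker)
+
+# Additive model: one age slope shared by smokers and non-smokers
+fit_additive <- glm(Death ~ Age + Smoker, data = smoking, family = "binomial")
+# Interaction model: the age slope may differ between the two groups
+fit_interaction <- glm(Death ~ Age * Smoker, data = smoking, family = "binomial")
+
+# Likelihood-ratio test of the nested models
+lrt <- anova(fit_additive, fit_interaction, test = "Chisq")
+lrt
+```
+
+A small p-value in the second row of `lrt` would suggest that the effect of age differs between smokers and non-smokers; a large one would give no support for an interaction.
+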
+But it does appear that age and smoking have an interaction effect on mortality, although `mylogit` contains no `Age:Smoker` term and so cannot confirm this on its own.
\ No newline at end of file