diff --git a/module3/exercice3/ConcentrationCO2.Rmd b/module3/exercice3/ConcentrationCO2.Rmd index 2ea30baee367634869bf2296639f7a40ba466d50..211782bc6ae8d5a87ff980a934cca32140b2ca5a 100644 --- a/module3/exercice3/ConcentrationCO2.Rmd +++ b/module3/exercice3/ConcentrationCO2.Rmd @@ -1,133 +1,121 @@ --- -title: "Concentration de CO2 dans l'atmosphère depuis 1958" -author: "Hélène Raynal" -date: "21 avril 2020" -output: html_document +title: 'Autour du Paradoxe de Simpson ' +author: "Hana Zahed" +output: +html_notebook: default +pdf_document: default --- +# Intro +In 1972-1974,a poll was conducted on 1/6th of the voters to shed light on thyroid and heart disease research (Tunbridge et al. 1977). 20 years later, a continuation of this study was carried (Vanderpump et al. 1995) -### Sujet - -En 1958, Charles David Keeling a initié une mesure de la concentration de CO2 dans l'atmosphère à l'observatoire de Mauna Loa, Hawaii, États-Unis qui continue jusqu'à aujourd'hui. L'objectif initial était d'étudier la variation saisonnière, mais l'intérêt s'est déplacé plus tard vers l'étude de la tendance croissante dans le contexte du changement climatique. En honneur à Keeling, ce jeu de données est souvent appelé "Keeling Curve" (voir (https://en.wikipedia.org/wiki/Keeling_Curve) pour l'histoire et l'importance de ces données). - -Les données sont disponibles sur le site Web de l'institut Scripps. Utilisez le fichier avec les observations hebdomadaires. Attention, ce fichier est mis à jour régulièrement avec de nouvelles observations. Notez donc bien la date du téléchargement, et gardez une copie locale de la version précise que vous analysez. Faites aussi attention aux données manquantes. - - -### Pré-requis -Traitement de suites chronologiques - -Quelques références: - -- [caschrono: Séries Temporelles Avec R] (https://cran.r-project.org/web/packages/caschrono/) - Yves Aragon - Université Toulouse Capitole - 28 janvier 2019 - - -### Create dataframe and load R libraries required for the different statistical treatments - -The data file below contains 10 columns. - -- Columns 1-4 give the dates in several redundant formats. -- Column 5 below gives monthly Mauna Loa CO2 concentrations in micro-mol CO2 per mole (ppm), reported on the 2008A SIO manometric mole fraction scale. This is thestandard version of the data most often sought. The monthly values have been adjusted to 24:00 hours on the 15th of each month. -- Column 6 gives the same data after a seasonal adjustment to remove the quasi-regular seasonal cycle. The adjustment involves subtracting from the data a 4-harmonic fit with a linear gain factor. -- Column 7 is a smoothed version of the data generated from a stiff cubic spline function plus 4-harmonic functions with linear gain. -- Column 8 is the same smoothed version with the seasonal cycle removed. -- Column 9 is identical to Column 5 except that the missing values from Column 5 have been filled with values from Column 7. -- Column 10 is identical to Column 6 except missing values have been filled with values from Column 8. - -Missing values are denoted by -99.99 - - CO2 concentrations are measured on the '08A' calibration scale - -```{r } -library(tidyverse) -library(forecast) -library(lubridate) -library(car) -library(scales) -library(patchwork) -library(kableExtra) - -dataCO2 <- read.csv("monthly_in_situ_co2_mlo.csv", sep="," ,skip = 57) -colnames(dataCO2) <- c("Year", "Month","Date1", "Date2", "ObsCO2", "SeasAdjCO2","SplineAdjCO2", "SplineAdjCO2Trend", "ObsCO2Comp", "SeasAdjCO2Comp") -summary(dataCO2) - -dataCO2$Date <- ymd(paste0(dataCO2$Year, " ", dataCO2$Month, " ", "15")) +To simplify our analysis, we will restrict to females (N=1314) in either of the following categories (current smokers or never smokers) +# Analysis section +## load library and set up working environment +Rversion 4.0.3 was used for this analysis +```{r} +library(ggplot2) +library(dplyr) +library(broom) +library(tidyr) ``` +## Load the data +Data was downloaded from [github](https://gitlab.inria.fr/learninglab/mooc-rr/mooc-rr-ressources/blob/master/module3/Practical_session/Subject6_smoking.csv) - -** Remplacement dans la série des valeurs observées, des valeurs manquantes -99.99 par celles qui sont interpolées -** on enlève ensuite les observations manquantes -```{r } - - - -dataCO2 <- dataCO2[dataCO2$ObsCO2Comp != "-99.99", ] - +```{r} +data <- read.csv("C:/Users/zahedh/Desktop/module3_Practical_session_Subject6_smoking.csv", h=T) +# data checking +head(data) +str(data) #1314 obs +table(data$Smoker, useNA = "always") ``` -** Create a column Date with format YYYY MM DD -```{r } +## Descriptive analysis +### Describe mortality status of women by smoking habits +```{r} +scale_round <- function(x){ + x <- x*100 + x <- round(x,digits = 2) +} +#manual number check +with(data, table(Status, Smoker, useNA = "always")) -dataCO2$Date <- ymd(paste0(dataCO2$Year, "-", dataCO2$Month, "-", "15")) +#creating a better table to summarize all results needed +data%>%group_by(Smoker)%>%summarise(N=n(), Alive=sum(Status=="Alive"), Dead=sum(Status=="Dead"), alive_per=Alive/N, dead_per=Dead/N, death_lower=dead_per-1.96*sqrt(dead_per*(1-dead_per)/N), death_upper=dead_per+1.96*sqrt(dead_per*(1-dead_per)/N))%>%mutate(across(c("alive_per","dead_per","death_lower","death_upper"), scale_round))%>%mutate(Alive = paste0(Alive, " (", alive_per ,"%)"), `Mortality Rate`=paste0(dead_per, "% (95%CI=", death_lower, ";",death_upper, ")"))%>%select(-c("alive_per", "death_lower","death_upper", "dead_per")) ``` +At baseline we had 732 non-smokers and 582 smokers, of wich 502 and 443 were still alive by the end of the study (respectively for non-smokers and smokers). Therefore the death rate in the smokers group 23.88% (95% CI: 20.42 to 27.35 %) and 31.4% (95% CI: 28.06 to 34.78 %) in the non-smoking group. +These results seem surprising as this study is in the continuity of a study assesing the role of smoking in heart disease and illnesses in the thyroid, therefore we would expect the mortality rates to be higher in the smoking group, as smoking is known to negatively impact the cardiovascular and the respiratory systems -### Représentation des résultats - -```{r } -ggplot(dataCO2,aes(Date, dataCO2$ObsCO2Comp)) + - geom_line(color='orange') + - xlab("Year, Month") + - scale_x_date(date_labels = "%Y-%m", date_breaks = "5 year") + - theme(axis.text.x = element_text(face = "bold", color = "#993333", - size = 12, angle = 45, hjust = 1)) + - ylab("CO2 Concentration (ppm)") + - scale_y_continuous() + - theme(axis.text.y = element_text(face = "bold", color = "#993333", - size = 10, hjust = 1),axis.title.y = element_text(size = 10)) + - ggtitle("Graphique 1") - -``` - -```{r } +## Could it be confounding ? +First we will assess the variation in mortality rates by age groups regardless of smoking status. +We will now take into account age groups. We will use the following groups : 18-34, 34-54, 55-64 and 65+ +```{r} +#check that the minimum age is 18 +summary(data$Age) #Ok - library(viridis) +data <- data%>%mutate(age_group=cut(Age, breaks=c(-Inf, 34, 54, 64,Inf), labels=c("18-34","35-54","55-64","65+"))) -dataCO2_by_year <- dataCO2 %>% group_by("Year") -ggplot(dataCO2_by_year, aes(dataCO2_by_year$Month,dataCO2_by_year$ObsCO2Comp )) + - geom_line(aes( group = dataCO2_by_year$Year , colour=dataCO2_by_year$Year)) + - xlab("Month")+ - ylab("CO2 Concentration (ppm)") + - ggtitle("Graphique saisonnier") +data%>%group_by(age_group)%>%summarise(N=n(), Alive=sum(Status=="Alive"), Dead=sum(Status=="Dead"), alive_per=Alive/N, dead_per=Dead/N, death_lower=dead_per-1.96*sqrt(dead_per*(1-dead_per)/N), death_upper=dead_per+1.96*sqrt(dead_per*(1-dead_per)/N))%>%mutate(across(c("alive_per","dead_per","death_lower","death_upper"), scale_round))%>%mutate(Alive = paste0(Alive, " (", alive_per ,"%)"), `Mortality Rate`=paste0(dead_per, "% (95%CI=", death_lower, ";",death_upper, ")"))%>%select(-c("alive_per", "death_lower","death_upper", "dead_per"))%>%ungroup() ``` +Mortality rates are increasing with age as expected. +Age could be acting as a confounder, therefore we will assess mortality rates within each age group by smoking status +```{r} +##standrize +data%>%group_by(age_group, Smoker)%>%summarise(N=n(), Alive=sum(Status=="Alive"), Dead=sum(Status=="Dead"), alive_per=Alive/N, dead_per=Dead/N, death_lower=dead_per-1.96*sqrt(dead_per*(1-dead_per)/N), death_upper=dead_per+1.96*sqrt(dead_per*(1-dead_per)/N))%>%mutate(across(c("alive_per","dead_per","death_lower","death_upper"), scale_round))%>%mutate(Alive = paste0(Alive, " (", alive_per ,"%)"), `Mortality Rate`=paste0(dead_per, "% (95%CI=", death_lower, ";",death_upper, ")"))%>%select(-c("alive_per", "death_lower","death_upper", "dead_per"))#%>%pivot_wider(names_from = age_group, values_from=`Mortality Rate`) -### Modélisation -** Doc +#plotting +scale100 <- function(x){x <- x*100} +data%>%group_by(age_group, Smoker)%>%summarise(N=n(), Alive=sum(Status=="Alive"), Dead=sum(Status=="Dead"), alive_per=Alive/N, dead_per=Dead/N, death_lower=dead_per-1.96*sqrt(dead_per*(1-dead_per)/N), death_upper=dead_per+1.96*sqrt(dead_per*(1-dead_per)/N))%>%mutate(across(c("alive_per","dead_per","death_lower","death_upper"),scale100))%>%ggplot(aes(x=age_group, y=dead_per, color=Smoker))+geom_pointrange(aes(ymin=death_lower, ymax=death_upper),position = position_dodge(width = 0.2))+labs(title = "Mortality rates by age group in smokers vs. non smokers", x="Age group", y="Morality rate (in %)") +``` +In the youngest age group, we can see that mortality rates are low and not different between smokers and non-smokers. +With increasing age groups (35 to 64) we can see an increase in the mortality rates , and a notable difference between current smokers and nover smokers mortality rates. +Finally in the highest age group, smoking doesn't seem to have the same impact on mortality. +These results seem to make sense, as the mortality in the youngest age group is generally low, and the the one in the older age groups is generally high regardless of smoking status. People at older ages tend to develop illnesses that can or not relate to smoking status. It's an interesting finding that emphasizes on the impact of smoking in middle aged individuals. -(https://github.com/Peymankor/Data-Science_Portfolio/blob/master/Time%20Series%20Analysis-Historical%20Co2/medium_post.Rmd) +## We will verify our findings using a regression analysis +This will allow us to adjust on age (continuous variable), without losing information (as is the case when using categories) +```{r} +head(data) +#we will recode status variables as 0:alive and 1: dead (as dead is the outcome of interest) +data <- data%>%mutate(death=ifelse(Status=="Alive","0","1"))%>%mutate(death=as.factor(death)) +#checks +with(data, table(death, Status)) -Série n'est pas stationnaire comme le montre le graphique +mod1 <- glm(death~Smoker+Age, family="binomial", data=data) +tidy(mod1) -Série montre une saisonnalité voir graphique 2 -Now, knowing the the non-statinary and seasonality of the data, it suggest to use theseasons differencing to model the data. To answer, +#estimates and CI for estimated risk +ilink <- family(mod1)$linkinv -* How is Autocorelation function and Partial Auto corellation? +## add fit and se.fit on the **link** scale +ndata <- bind_cols(data, setNames(as_tibble(predict(mod1, data, se.fit = TRUE)[1:2]), + c('fit_link','se_link'))) +## create the interval and backtransform +ndata <- mutate(ndata, + fit_resp = ilink(fit_link), + right_upr = ilink(fit_link + (2 * se_link)), + right_lwr = ilink(fit_link - (2 * se_link))) +## show +ndata -Here is the plot of ACF and PACF from the *forecast* package: +ndata%>%mutate(death=as.numeric(as.character(death)))%>%select(Smoker, Age, death, fit_resp, right_upr, right_lwr)%>%pivot_longer(cols=c("death", "fit_resp"), names_to="value_from", values_to="risk")%>%mutate(right_upr=ifelse(value_from=="death", NA, as.numeric(right_upr)), right_lwr=ifelse(value_from=="death",NA,as.numeric(right_lwr)))%>%ggplot(aes(x=Age, y=risk, color=Smoker, shape=value_from))+geom_point()+ scale_shape_discrete(name=element_blank(), + breaks=c("death", "fit_resp"), + labels=c("Observed risk of death", "Estimated risk of death"))+ylab("Risk of death")+labs(caption = "0: death not observed during followup \n 1: death during followup")+theme(plot.caption = element_text(hjust = 0.5))+geom_ribbon(aes(ymin = right_lwr, ymax = right_upr),alpha = 0.1) -```{r} -Co2_train <- ts(dataCO2_by_year$ObsCO2Comp, start = c(1958,3), frequency = 12) -Co2_train %>% ggtsdisplay() ``` +This graph compares risk of death for every age according to smoking status. +As already explained, we can see that most of the deaths observed in our study at the middle aged group occurred in smokers, but necessarily the case for older age groups (risk=1). +After fitting a logistic linear regression adjusted for age, we estimate the risk of death of every participant. The estimates are depicted in the sigmoid function. We can see that throughout different ages, risk estimates for smoker are higher than for non-smokers. +The confidence intervals do overlap, and therefore statistically speaking reject the hypothesis of a negative impact of cigarette smoking found in this dataset. +However, it does question the safety of cigarette use and gives reason to further analysis and bigger studies to firmly conclude on impact of cigarette. -```{r} -Co2_train %>% diff(lag=12) %>% diff() %>% ggtsdisplay() -``` \ No newline at end of file