title: "Concentration de CO2 dans l'atmosphère depuis 1958"
author: "Hélène Raynal"
date: "21 avril 2020"
output: html_document
title: 'Autour du Paradoxe de Simpson '
author: "Hana Zahed"
output:
html_notebook: default
pdf_document: default
---
# Intro
In 1972-1974,a poll was conducted on 1/6th of the voters to shed light on thyroid and heart disease research (Tunbridge et al. 1977). 20 years later, a continuation of this study was carried (Vanderpump et al. 1995)
### Sujet
En 1958, Charles David Keeling a initié une mesure de la concentration de CO2 dans l'atmosphère à l'observatoire de Mauna Loa, Hawaii, États-Unis qui continue jusqu'à aujourd'hui. L'objectif initial était d'étudier la variation saisonnière, mais l'intérêt s'est déplacé plus tard vers l'étude de la tendance croissante dans le contexte du changement climatique. En honneur à Keeling, ce jeu de données est souvent appelé "Keeling Curve" (voir (https://en.wikipedia.org/wiki/Keeling_Curve) pour l'histoire et l'importance de ces données).
Les données sont disponibles sur le site Web de l'institut Scripps. Utilisez le fichier avec les observations hebdomadaires. Attention, ce fichier est mis à jour régulièrement avec de nouvelles observations. Notez donc bien la date du téléchargement, et gardez une copie locale de la version précise que vous analysez. Faites aussi attention aux données manquantes.
### Pré-requis
Traitement de suites chronologiques
Quelques références:
- [caschrono: Séries Temporelles Avec R] (https://cran.r-project.org/web/packages/caschrono/) - Yves Aragon - Université Toulouse Capitole - 28 janvier 2019
### Create dataframe and load R libraries required for the different statistical treatments
The data file below contains 10 columns.
- Columns 1-4 give the dates in several redundant formats.
- Column 5 below gives monthly Mauna Loa CO2 concentrations in micro-mol CO2 per mole (ppm), reported on the 2008A SIO manometric mole fraction scale. This is thestandard version of the data most often sought. The monthly values have been adjusted to 24:00 hours on the 15th of each month.
- Column 6 gives the same data after a seasonal adjustment to remove the quasi-regular seasonal cycle. The adjustment involves subtracting from the data a 4-harmonic fit with a linear gain factor.
- Column 7 is a smoothed version of the data generated from a stiff cubic spline function plus 4-harmonic functions with linear gain.
- Column 8 is the same smoothed version with the seasonal cycle removed.
- Column 9 is identical to Column 5 except that the missing values from Column 5 have been filled with values from Column 7.
- Column 10 is identical to Column 6 except missing values have been filled with values from Column 8.
Missing values are denoted by -99.99
CO2 concentrations are measured on the '08A' calibration scale
To simplify our analysis, we will restrict to females (N=1314) in either of the following categories (current smokers or never smokers)
# Analysis section
## load library and set up working environment
Rversion 4.0.3 was used for this analysis
```{r}
library(ggplot2)
library(dplyr)
library(broom)
library(tidyr)
```
## Load the data
Data was downloaded from [github](https://gitlab.inria.fr/learninglab/mooc-rr/mooc-rr-ressources/blob/master/module3/Practical_session/Subject6_smoking.csv)
** Remplacement dans la série des valeurs observées, des valeurs manquantes -99.99 par celles qui sont interpolées
At baseline we had 732 non-smokers and 582 smokers, of wich 502 and 443 were still alive by the end of the study (respectively for non-smokers and smokers). Therefore the death rate in the smokers group 23.88% (95% CI: 20.42 to 27.35 %) and 31.4% (95% CI: 28.06 to 34.78 %) in the non-smoking group.
These results seem surprising as this study is in the continuity of a study assesing the role of smoking in heart disease and illnesses in the thyroid, therefore we would expect the mortality rates to be higher in the smoking group, as smoking is known to negatively impact the cardiovascular and the respiratory systems
### Représentation des résultats
```{r }
ggplot(dataCO2,aes(Date, dataCO2$ObsCO2Comp)) +
geom_line(color='orange') +
xlab("Year, Month") +
scale_x_date(date_labels = "%Y-%m", date_breaks = "5 year") +
theme(axis.text.x = element_text(face = "bold", color = "#993333",
size = 12, angle = 45, hjust = 1)) +
ylab("CO2 Concentration (ppm)") +
scale_y_continuous() +
theme(axis.text.y = element_text(face = "bold", color = "#993333",
data%>%group_by(age_group, Smoker)%>%summarise(N=n(), Alive=sum(Status=="Alive"), Dead=sum(Status=="Dead"), alive_per=Alive/N, dead_per=Dead/N, death_lower=dead_per-1.96*sqrt(dead_per*(1-dead_per)/N), death_upper=dead_per+1.96*sqrt(dead_per*(1-dead_per)/N))%>%mutate(across(c("alive_per","dead_per","death_lower","death_upper"),scale100))%>%ggplot(aes(x=age_group, y=dead_per, color=Smoker))+geom_pointrange(aes(ymin=death_lower, ymax=death_upper),position = position_dodge(width = 0.2))+labs(title = "Mortality rates by age group in smokers vs. non smokers", x="Age group", y="Morality rate (in %)")
```
In the youngest age group, we can see that mortality rates are low and not different between smokers and non-smokers.
With increasing age groups (35 to 64) we can see an increase in the mortality rates , and a notable difference between current smokers and nover smokers mortality rates.
Finally in the highest age group, smoking doesn't seem to have the same impact on mortality.
These results seem to make sense, as the mortality in the youngest age group is generally low, and the the one in the older age groups is generally high regardless of smoking status. People at older ages tend to develop illnesses that can or not relate to smoking status. It's an interesting finding that emphasizes on the impact of smoking in middle aged individuals.
labels=c("Observed risk of death", "Estimated risk of death"))+ylab("Risk of death")+labs(caption = "0: death not observed during followup \n 1: death during followup")+theme(plot.caption = element_text(hjust = 0.5))+geom_ribbon(aes(ymin = right_lwr, ymax = right_upr),alpha = 0.1)
```{r}
Co2_train <- ts(dataCO2_by_year$ObsCO2Comp, start = c(1958,3), frequency = 12)
Co2_train %>% ggtsdisplay()
```
This graph compares risk of death for every age according to smoking status.
As already explained, we can see that most of the deaths observed in our study at the middle aged group occurred in smokers, but necessarily the case for older age groups (risk=1).
After fitting a logistic linear regression adjusted for age, we estimate the risk of death of every participant. The estimates are depicted in the sigmoid function. We can see that throughout different ages, risk estimates for smoker are higher than for non-smokers.
The confidence intervals do overlap, and therefore statistically speaking reject the hypothesis of a negative impact of cigarette smoking found in this dataset.
However, it does question the safety of cigarette use and gives reason to further analysis and bigger studies to firmly conclude on impact of cigarette.