Around Simpson’s Paradox

Loading and inspecting data

module3_Practical_session_Subject6_smoking <- read.csv("~/R/module3_Practical_session_Subject6_smoking.csv")

convert the data to factor character

module3_Practical_session_Subject6_smoking$Smoker <- as.factor(module3_Practical_session_Subject6_smoking$Smoker)

module3_Practical_session_Subject6_smoking$Status <- as.factor(module3_Practical_session_Subject6_smoking$Status)

module3_Practical_session_Subject6_smoking$AgeG <- as.factor(module3_Practical_session_Subject6_smoking$AgeG)

the data checked for its value; either character, numeric or factor

sapply(module3_Practical_session_Subject6_smoking, class)

##    Smoker    Status       Age      AgeG 
##  "factor"  "factor" "numeric"  "factor"

I tabulated the total number of women alive and dead over the period according to their smoking habits

tabulate(module3_Practical_session_Subject6_smoking$Smoker)

## [1] 732 582

module3_Practical_session_Subject6_smoking.tab <- table(module3_Practical_session_Subject6_smoking$Status, module3_Practical_session_Subject6_smoking$Smoker)

TAB=table(module3_Practical_session_Subject6_smoking$Status, module3_Practical_session_Subject6_smoking$Smoker)

Graphic representation of these data and calculate confidence intervals, P-value, calculate confidence intervals and the Odds ratio. The surprising result is mortality was significantly high (p=0.003) among non exposed group (non-smoker). The graph below shows that hign mortality rate among non-smoker.

barplot(TAB, beside=T, legend=T)

chisq.test(TAB, correct=T)

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  TAB
## X-squared = 8.7515, df = 1, p-value = 0.003093

fisher.test(TAB)

## 
##  Fisher's Exact Test for Count Data
## 
## data:  TAB
## p-value = 0.002989
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  0.5307485 0.8822128
## sample estimates:
## odds ratio 
##  0.6850392

Mortality rates based on a new age category. The following classes were considered: 18-34 years, 34-54 years, 55-64 years, over 65 years

TAB1=table(module3_Practical_session_Subject6_smoking$Status, module3_Practical_session_Subject6_smoking$Smoker, module3_Practical_session_Subject6_smoking$AgeG)

prop.table(TAB1)

## , ,  = 18-34
## 
##        
##                  No         Yes
##   Alive 0.168188737 0.138508371
##   Dead  0.004566210 0.005327245
## 
## , ,  = 35-54
## 
##        
##                  No         Yes
##   Alive 0.130898021 0.143835616
##   Dead  0.014459665 0.029680365
## 
## , ,  = 55-64
## 
##        
##                  No         Yes
##   Alive 0.061643836 0.049467275
##   Dead  0.030441400 0.038812785
## 
## , ,  = Over 64
## 
##        
##                  No         Yes
##   Alive 0.021308980 0.005327245
##   Dead  0.125570776 0.031963470

New Death variable created and 1 or 0 to indicate whether the individual died during the 20-year period, I studied the Death ~ Age model to study the probability of death as a function of age according to whether one considers the group of smokers or non-smokers.These regressions doesn’t allow me to conclude on the harmfulness of smoking. Since the age increase the fitted value decrease.

Death <- module3_Practical_session_Subject6_smoking$Status
Deathcode<-ifelse(Death == "Alive", 1, 0)

logistic <- glm(Deathcode ~ module3_Practical_session_Subject6_smoking$Age, binomial)
summary(logistic)

## 
## Call:
## glm(formula = Deathcode ~ module3_Practical_session_Subject6_smoking$Age, 
##     family = binomial)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.8803  -0.4551   0.2848   0.5897   2.3335  
## 
## Coefficients:
##                                                 Estimate Std. Error z value
## (Intercept)                                     6.104537   0.321414   18.99
## module3_Practical_session_Subject6_smoking$Age -0.097651   0.005555  -17.58
##                                                Pr(>|z|)    
## (Intercept)                                      <2e-16 ***
## module3_Practical_session_Subject6_smoking$Age   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1560.3  on 1313  degrees of freedom
## Residual deviance: 1004.8  on 1312  degrees of freedom
## AIC: 1008.8
## 
## Number of Fisher Scoring iterations: 5

plot(module3_Practical_session_Subject6_smoking$Age, jitter(Deathcode, 0.5), xlab = "module3_Practical_session_Subject6_smoking$Age", ylab = "DeathCode (0 - Dead, 1 - Alive")

plot(module3_Practical_session_Subject6_smoking$Age,fitted.values(logistic))

Around Simpson’s Paradox

Tegegne Henok Ayalew

6/16/2020

Loading and inspecting data

convert the data to factor character

the data checked for its value; either character, numeric or factor

I tabulated the total number of women alive and dead over the period according to their smoking habits

Mortality rates based on a new age category. The following classes were considered: 18-34 years, 34-54 years, 55-64 years, over 65 years