# Analysis of Simpson's paradox

In [14]:
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd

## Importing the data

In [2]:
data_url = "https://gitlab.inria.fr/learninglab/mooc-rr/mooc-rr-ressources/-/raw/master/module3/Practical_session/Subject6_smoking.csv?inline=false"

To guard against the file disappearing from, or being modified on, the server that hosts it, we keep a local copy of this dataset alongside our analysis. The data are downloaded only if the local copy does not exist.

In [3]:
data_file = "donnees-Simpson.csv"

import os
import urllib.request
if not os.path.exists(data_file):
    urllib.request.urlretrieve(data_url, data_file)

In [4]:
raw_data = pd.read_csv(data_file)
raw_data

Unnamed: 0,Smoker,Status,Age
0,Yes,Alive,21.0
1,Yes,Alive,19.3
2,No,Dead,57.5
3,No,Alive,47.1
4,Yes,Alive,81.4
5,No,Alive,36.8
6,No,Alive,23.8
7,Yes,Dead,57.5
8,Yes,Alive,24.8
9,Yes,Alive,49.5


We look for rows with missing values and drop any we find.

In [5]:
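# Rows with at least one missing value (not displayed: only the last expression of a cell is shown)
raw_data[raw_data.isnull().any(axis=1)]
# Drop incomplete rows and renumber from 0 so that loops over range(len(data)) stay valid
data = raw_data.dropna().reset_index(drop=True)
data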

Unnamed: 0,Smoker,Status,Age
0,Yes,Alive,21.0
1,Yes,Alive,19.3
2,No,Dead,57.5
3,No,Alive,47.1
4,Yes,Alive,81.4
5,No,Alive,36.8
6,No,Alive,23.8
7,Yes,Dead,57.5
8,Yes,Alive,24.8
9,Yes,Alive,49.5
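The missing-value check can be tried on a small example. This is only a sketch: the three-row `sample` frame below is hypothetical toy data, not part of the dataset.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): the second row has a missing Age
sample = pd.DataFrame({
    "Smoker": ["Yes", "No", "No"],
    "Status": ["Alive", "Dead", "Alive"],
    "Age": [21.0, np.nan, 47.1],
})

# Number of missing values in each column
missing_per_column = sample.isnull().sum()
# Rows containing at least one missing value
incomplete_rows = sample[sample.isnull().any(axis=1)]
# Drop those rows and renumber the remaining ones from 0
clean = sample.dropna().reset_index(drop=True)

print(missing_per_column)
print(incomplete_rows)
print(len(clean), "complete rows kept")
```

Counting missing values per column first makes it easier to see which variable is affected before dropping anything.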


## First analysis

There are no rows with missing values. We can start the analysis by counting the smokers who are alive or dead, and the non-smokers who are alive or dead.

In [20]:
smoker_alive = 0
smoker_dead = 0
no_smoker_alive = 0
no_smoker_dead = 0
for it in range(len(data)):
    if(data["Smoker"][it]=="Yes" and data["Status"][it]=="Alive"):
        smoker_alive = smoker_alive + 1
    if(data["Smoker"][it]=="Yes" and data["Status"][it]=="Dead"):
        smoker_dead = smoker_dead + 1
    if(data["Smoker"][it]=="No" and data["Status"][it]=="Alive"):
        no_smoker_alive = no_smoker_alive + 1
    if(data["Smoker"][it]=="No" and data["Status"][it]=="Dead"):
        no_smoker_dead = no_smoker_dead + 1
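The same four counts can probably be obtained without an explicit loop, using `pd.crosstab`. The sketch below is not the method used in this notebook; it runs on the ten rows displayed earlier, copied into a small `sample` frame.

```python
import pandas as pd

# The ten rows displayed earlier, used as a small sample
sample = pd.DataFrame({
    "Smoker": ["Yes", "Yes", "No", "No", "Yes", "No", "No", "Yes", "Yes", "Yes"],
    "Status": ["Alive", "Alive", "Dead", "Alive", "Alive",
               "Alive", "Alive", "Dead", "Alive", "Alive"],
})

# Rows: Status, columns: Smoker, cells: number of people
counts = pd.crosstab(sample["Status"], sample["Smoker"])
print(counts)
```

Applied to the full `data` frame, the same call should give the counts computed by the loop.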

We arrange these counts in a table.

In [21]:
tableau = [
    {"-": "Alive", "Smoker": smoker_alive, "Non Smoker": no_smoker_alive},
    {"-": "Dead", "Smoker": smoker_dead, "Non Smoker": no_smoker_dead},
]
df = pd.DataFrame(tableau)
df

Unnamed: 0,-,Non Smoker,Smoker
0,Alive,502,443
1,Dead,230,139


We compute the mortality rate of the two groups (smokers and non-smokers) and add it to our table.

In [25]:
# Mortality rate in percent; int() drops the decimals (23.9 becomes 23)
smoker_mortality_rate = int(100*smoker_dead/(smoker_alive+smoker_dead))
no_smoker_mortality_rate = int(100*no_smoker_dead/(no_smoker_alive+no_smoker_dead))

tableau = [
    {"-": "Alive", "Smoker": smoker_alive, "Non Smoker": no_smoker_alive},
    {"-": "Dead", "Smoker": smoker_dead, "Non Smoker": no_smoker_dead},
    {"-": "Mortality (%)", "Smoker": smoker_mortality_rate, "Non Smoker": no_smoker_mortality_rate},
]
df = pd.DataFrame(tableau)
df

Unnamed: 0,-,Non Smoker,Smoker
0,Alive,502,443
1,Dead,230,139
2,Mortality (%),31,23
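Because `int()` truncates, the percentages in the table are rounded down. As a sketch of the same calculation with one decimal kept, using the counts from the table (the helper `mortality_rate` is a name introduced here for illustration):

```python
# Counts taken from the table above
smoker_alive, smoker_dead = 443, 139
no_smoker_alive, no_smoker_dead = 502, 230

def mortality_rate(dead, alive):
    """Return the share of deaths in percent, rounded to one decimal."""
    return round(100 * dead / (dead + alive), 1)

smoker_rate = mortality_rate(smoker_dead, smoker_alive)
no_smoker_rate = mortality_rate(no_smoker_dead, no_smoker_alive)
print(smoker_rate, no_smoker_rate)  # 23.9 31.4
```

The gap between the two groups is unchanged by the rounding choice: roughly 24% against 31%.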


This result is surprising: the mortality rate is higher among people recorded as non-smokers (31%) than among smokers (23%).
We would expect the opposite.

## Second analysis

We run a second analysis, splitting the data by age group. Four age groups are defined: 18-34, 35-54, 55-64, and 65 and over. In the code, the corresponding boundaries are `Age <= 34`, `34 < Age <= 54`, `54 < Age <= 64`, and `Age > 64`.

In [26]:
smoker_alive_18_34 = 0
smoker_dead_18_34 = 0
no_smoker_alive_18_34 = 0
no_smoker_dead_18_34 = 0

smoker_alive_35_54 = 0
smoker_dead_35_54 = 0
no_smoker_alive_35_54 = 0
no_smoker_dead_35_54 = 0

smoker_alive_55_64 = 0
smoker_dead_55_64 = 0
no_smoker_alive_55_64 = 0
no_smoker_dead_55_64 = 0

smoker_alive_65 = 0
smoker_dead_65 = 0
no_smoker_alive_65 = 0
no_smoker_dead_65 = 0

for it in range(len(data)):
    if(data["Age"][it]<=34):
        if(data["Smoker"][it]=="Yes" and data["Status"][it]=="Alive"):
            smoker_alive_18_34 = smoker_alive_18_34 + 1
        if(data["Smoker"][it]=="Yes" and data["Status"][it]=="Dead"):
            smoker_dead_18_34 = smoker_dead_18_34 + 1
        if(data["Smoker"][it]=="No" and data["Status"][it]=="Alive"):
            no_smoker_alive_18_34 = no_smoker_alive_18_34 + 1
        if(data["Smoker"][it]=="No" and data["Status"][it]=="Dead"):
            no_smoker_dead_18_34 = no_smoker_dead_18_34 + 1
    
    if(data["Age"][it]>34 and data["Age"][it]<=54):
        if(data["Smoker"][it]=="Yes" and data["Status"][it]=="Alive"):
            smoker_alive_35_54 = smoker_alive_35_54 + 1
        if(data["Smoker"][it]=="Yes" and data["Status"][it]=="Dead"):
            smoker_dead_35_54 = smoker_dead_35_54 + 1
        if(data["Smoker"][it]=="No" and data["Status"][it]=="Alive"):
            no_smoker_alive_35_54 = no_smoker_alive_35_54 + 1
        if(data["Smoker"][it]=="No" and data["Status"][it]=="Dead"):
            no_smoker_dead_35_54 = no_smoker_dead_35_54 + 1
    
    if(data["Age"][it]>54 and data["Age"][it]<=64):
        if(data["Smoker"][it]=="Yes" and data["Status"][it]=="Alive"):
            smoker_alive_55_64 = smoker_alive_55_64 + 1
        if(data["Smoker"][it]=="Yes" and data["Status"][it]=="Dead"):
            smoker_dead_55_64 = smoker_dead_55_64 + 1
        if(data["Smoker"][it]=="No" and data["Status"][it]=="Alive"):
            no_smoker_alive_55_64 = no_smoker_alive_55_64 + 1
        if(data["Smoker"][it]=="No" and data["Status"][it]=="Dead"):
            no_smoker_dead_55_64 = no_smoker_dead_55_64 + 1
    
    if(data["Age"][it]>64):
        if(data["Smoker"][it]=="Yes" and data["Status"][it]=="Alive"):
            smoker_alive_65 = smoker_alive_65 + 1
        if(data["Smoker"][it]=="Yes" and data["Status"][it]=="Dead"):
            smoker_dead_65 = smoker_dead_65 + 1
        if(data["Smoker"][it]=="No" and data["Status"][it]=="Alive"):
            no_smoker_alive_65 = no_smoker_alive_65 + 1
        if(data["Smoker"][it]=="No" and data["Status"][it]=="Dead"):
            no_smoker_dead_65 = no_smoker_dead_65 + 1

smoker_18_34_mortality_rate = int(100*smoker_dead_18_34/(smoker_alive_18_34+smoker_dead_18_34))
no_smoker_18_34_mortality_rate = int(100*no_smoker_dead_18_34/(no_smoker_alive_18_34+no_smoker_dead_18_34))

smoker_35_54_mortality_rate = int(100*smoker_dead_35_54/(smoker_alive_35_54+smoker_dead_35_54))
no_smoker_35_54_mortality_rate = int(100*no_smoker_dead_35_54/(no_smoker_alive_35_54+no_smoker_dead_35_54))

smoker_55_64_mortality_rate = int(100*smoker_dead_55_64/(smoker_alive_55_64+smoker_dead_55_64))
no_smoker_55_64_mortality_rate = int(100*no_smoker_dead_55_64/(no_smoker_alive_55_64+no_smoker_dead_55_64))

smoker_65_mortality_rate = int(100*smoker_dead_65/(smoker_alive_65+smoker_dead_65))
no_smoker_65_mortality_rate = int(100*no_smoker_dead_65/(no_smoker_alive_65+no_smoker_dead_65))
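The sixteen counters above could also be replaced by binning the ages with `pd.cut` and counting with `pd.crosstab`. This is a sketch of an alternative, not the code used in this notebook; it runs on the ten rows displayed earlier, and the `Age class` column name is introduced here for illustration.

```python
import pandas as pd

# The ten rows displayed earlier, used as a small sample
sample = pd.DataFrame({
    "Smoker": ["Yes", "Yes", "No", "No", "Yes", "No", "No", "Yes", "Yes", "Yes"],
    "Status": ["Alive", "Alive", "Dead", "Alive", "Alive",
               "Alive", "Alive", "Dead", "Alive", "Alive"],
    "Age": [21.0, 19.3, 57.5, 47.1, 81.4, 36.8, 23.8, 57.5, 24.8, 49.5],
})

# Same boundaries as the loop: Age <= 34, 34 < Age <= 54, 54 < Age <= 64, Age > 64
bins = [0, 34, 54, 64, float("inf")]
labels = ["18-34", "35-54", "55-64", "65+"]
# pd.cut uses right-closed intervals by default, e.g. (34, 54]
sample["Age class"] = pd.cut(sample["Age"], bins=bins, labels=labels).astype(str)

# Number of people per age class and smoking status, split by vital status
counts = pd.crosstab([sample["Age class"], sample["Smoker"]], sample["Status"])
print(counts)
```

Keeping the boundaries in one `bins` list makes it harder for the class limits to drift apart between branches.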

We can now put these figures in a table to view them.

In [27]:
tableau_2 = [
    {"-": "Alive (18-34)", "Smoker": smoker_alive_18_34, "Non Smoker": no_smoker_alive_18_34},
    {"-": "Dead (18-34)", "Smoker": smoker_dead_18_34, "Non Smoker": no_smoker_dead_18_34},
    {"-": "Mortality (18-34(%))", "Smoker": smoker_18_34_mortality_rate, "Non Smoker": no_smoker_18_34_mortality_rate},
    {"-": "Alive (35-54)", "Smoker": smoker_alive_35_54, "Non Smoker": no_smoker_alive_35_54},
    {"-": "Dead (35-54)", "Smoker": smoker_dead_35_54, "Non Smoker": no_smoker_dead_35_54},
    {"-": "Mortality (35-54(%))", "Smoker": smoker_35_54_mortality_rate, "Non Smoker": no_smoker_35_54_mortality_rate},
    {"-": "Alive (55-64)", "Smoker": smoker_alive_55_64, "Non Smoker": no_smoker_alive_55_64},
    {"-": "Dead (55-64)", "Smoker": smoker_dead_55_64, "Non Smoker": no_smoker_dead_55_64},
    {"-": "Mortality (55-64(%))", "Smoker": smoker_55_64_mortality_rate, "Non Smoker": no_smoker_55_64_mortality_rate},
    {"-": "Alive (65+)", "Smoker": smoker_alive_65, "Non Smoker": no_smoker_alive_65},
    {"-": "Dead (65+)", "Smoker": smoker_dead_65, "Non Smoker": no_smoker_dead_65},
    {"-": "Mortality (65+(%))", "Smoker": smoker_65_mortality_rate, "Non Smoker": no_smoker_65_mortality_rate},
]
df_2 = pd.DataFrame(tableau_2)
df_2

Unnamed: 0,-,Non Smoker,Smoker
0,Alive (18-34),213,176
1,Dead (18-34),6,5
2,Mortality (18-34(%)),2,2
3,Alive (35-54),180,196
4,Dead (35-54),19,41
5,Mortality (35-54(%)),9,17
6,Alive (55-64),81,64
7,Dead (55-64),40,51
8,Mortality (55-64(%)),33,44
9,Alive (65+),28,7


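For the 18-34 and 65+ age groups, the mortality rate is the same for smokers and non-smokers once truncated to whole percentages. For the 35-54 and 55-64 groups, in contrast, the mortality rate of smokers is markedly higher than that of non-smokers (17% against 9%, and 44% against 33%). Within an age group, then, smokers never have the lower rate: the result of the first analysis comes from mixing age groups with very different mortality levels and different proportions of smokers, which is what characterises Simpson's paradox.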
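To see why the aggregated and per-class results disagree, one can compare how much each age class weighs in the two groups. The sketch below reuses the counts from the tables above. One assumption is made: the "Dead (65+)" row is cut off in the displayed output, so those two values are derived by subtracting the other classes from the overall death counts (139 smokers, 230 non-smokers). The helper `summarize` is a name introduced here for illustration.

```python
# (alive, dead) counts per age class, copied from the tables above.
# Assumption: the 65+ deaths are not visible in the displayed output; they are
# derived by subtraction from the overall totals of the first table.
smokers = {
    "18-34": (176, 5),
    "35-54": (196, 41),
    "55-64": (64, 51),
    "65+": (7, 139 - 5 - 41 - 51),
}
non_smokers = {
    "18-34": (213, 6),
    "35-54": (180, 19),
    "55-64": (81, 40),
    "65+": (28, 230 - 6 - 19 - 40),
}

def summarize(group):
    """Return ({age class: (share of group in %, mortality in %)}, overall mortality in %)."""
    total = sum(alive + dead for alive, dead in group.values())
    per_class = {}
    for age_class, (alive, dead) in group.items():
        size = alive + dead
        per_class[age_class] = (round(100 * size / total, 1), round(100 * dead / size, 1))
    overall = round(100 * sum(dead for _, dead in group.values()) / total, 1)
    return per_class, overall

smoker_classes, smoker_overall = summarize(smokers)
non_smoker_classes, non_smoker_overall = summarize(non_smokers)

for age_class in smokers:
    print(age_class, "smokers:", smoker_classes[age_class],
          "non-smokers:", non_smoker_classes[age_class])
print("overall:", smoker_overall, non_smoker_overall)
```

Under that assumption, the 65+ class, where mortality is highest, makes up about 26% of the non-smokers but only about 8% of the smokers, which pulls the non-smokers' overall rate up even though their rate is not higher in any single class.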