sujet6

parent a7fefca9
...@@ -37,7 +37,7 @@ ...@@ -37,7 +37,7 @@
"hidePrompt": false "hidePrompt": false
}, },
"source": [ "source": [
"## Étape 1 : Calcul des effectifs vivants et décédés par statut de fumeur" "## Étape 1 : Importation des bibliothèques et des données"
] ]
}, },
{ {
...@@ -47,12 +47,12 @@ ...@@ -47,12 +47,12 @@
"hidePrompt": false "hidePrompt": false
}, },
"source": [ "source": [
"Représentez dans un tableau le nombre total de femmes vivantes et décédées sur la période en fonction de leur habitude de tabagisme. Calculez dans chaque groupe (fumeuses / non fumeuses) le taux de mortalité (le rapport entre le nombre de femmes décédées dans un groupe et le nombre total de femmes dans ce groupe). Vous pourrez proposer une représentation graphique de ces données et calculer des intervalles de confiance si vous le souhaitez. En quoi ce résultat est-il surprenant ?" "La première étape consiste à importer les bibliothèques nécessaires (Pandas pour la gestion des données, Statsmodels pour la régression logistique, Seaborn et Matplotlib pour les visualisations), puis à charger les données depuis un fichier CSV."
] ]
}, },
{ {
"cell_type": "code", "cell_type": "code",
"execution_count": 1, "execution_count": 71,
"metadata": { "metadata": {
"hideCode": false, "hideCode": false,
"hidePrompt": false "hidePrompt": false
...@@ -60,18 +60,905 @@ ...@@ -60,18 +60,905 @@
"outputs": [], "outputs": [],
"source": [ "source": [
"%matplotlib inline\n", "%matplotlib inline\n",
"import matplotlib.pyplot as plt\n", "import matplotlib.pyplot as plt # Pour afficher les graphiques\n",
"import seaborn as sns # Pour la visualisation\n",
"import pandas as pd\n", "import pandas as pd\n",
"import isoweek" "import isoweek\n",
"import statsmodels.api as sm # Pour la régression logistique"
]
},
{
"cell_type": "markdown",
"metadata": {
"hideCode": false,
"hidePrompt": false
},
"source": [
"### Charger les données depuis un fichier CSV"
] ]
}, },
{ {
"cell_type": "code", "cell_type": "code",
"execution_count": null, "execution_count": 72,
"metadata": {
"hideCode": false,
"hidePrompt": false
},
"outputs": [],
"source": [
"df = pd.read_csv('Subject6_smoking.csv')"
]
},
{
"cell_type": "markdown",
"metadata": {
"hideCode": false,
"hidePrompt": false
},
"source": [
"On affiche les 5 premières lignes du fichier pour vérifier si tout fonctionne bien"
]
},
{
"cell_type": "code",
"execution_count": 73,
"metadata": {
"hideCode": false,
"hidePrompt": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Smoker</th>\n",
" <th>Status</th>\n",
" <th>Age</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Yes</td>\n",
" <td>Alive</td>\n",
" <td>21.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Yes</td>\n",
" <td>Alive</td>\n",
" <td>19.3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>No</td>\n",
" <td>Dead</td>\n",
" <td>57.5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>No</td>\n",
" <td>Alive</td>\n",
" <td>47.1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Yes</td>\n",
" <td>Alive</td>\n",
" <td>81.4</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Smoker Status Age\n",
"0 Yes Alive 21.0\n",
"1 Yes Alive 19.3\n",
"2 No Dead 57.5\n",
"3 No Alive 47.1\n",
"4 Yes Alive 81.4"
]
},
"execution_count": 73,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {
"hideCode": false,
"hidePrompt": false
},
"source": [
"## Étape 2 : Calcul des effectifs vivants et décédés par statut de fumeur"
]
},
{
"cell_type": "markdown",
"metadata": {
"hideCode": false,
"hidePrompt": false
},
"source": [
"Dans cette étape, nous voulons calculer combien de femmes sont vivantes ou décédées en fonction de leur statut de fumeur (fumeuse ou non). On utilise ```groupby()``` pour regrouper les données par ```Smoker``` et ```Status```, puis on utilise ```size()``` pour compter le nombre d'éléments dans chaque groupe."
]
},
{
"cell_type": "markdown",
"metadata": {
"hideCode": false,
"hidePrompt": false
},
"source": [
"### Groupement des données par statut de fumeur et statut de vie/mort"
]
},
{
"cell_type": "code",
"execution_count": 74,
"metadata": {
"hideCode": false,
"hidePrompt": false
},
"outputs": [],
"source": [
"grouped = df.groupby(['Smoker', 'Status']).size().unstack(fill_value=0)"
]
},
{
"cell_type": "markdown",
"metadata": {
"hideCode": false,
"hidePrompt": false
},
"source": [
"Affichage du tableau des effectifs"
]
},
{
"cell_type": "code",
"execution_count": 75,
"metadata": {
"hideCode": false,
"hidePrompt": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th>Status</th>\n",
" <th>Alive</th>\n",
" <th>Dead</th>\n",
" </tr>\n",
" <tr>\n",
" <th>Smoker</th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>No</th>\n",
" <td>502</td>\n",
" <td>230</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Yes</th>\n",
" <td>443</td>\n",
" <td>139</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
"Status Alive Dead\n",
"Smoker \n",
"No 502 230\n",
"Yes 443 139"
]
},
"execution_count": 75,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"grouped"
]
},
{
"cell_type": "markdown",
"metadata": {
"hideCode": false,
"hidePrompt": false
},
"source": [
"## Étape 3 : Calcul du taux de mortalité"
]
},
{
"cell_type": "markdown",
"metadata": {
"hideCode": false,
"hidePrompt": false
},
"source": [
"Le taux de mortalité est défini comme le nombre de décès divisé par le nombre total de personnes dans chaque groupe (vivantes + décédées)."
]
},
{
"cell_type": "markdown",
"metadata": {
"hideCode": false,
"hidePrompt": false
},
"source": [
"### Calcul du taux de mortalité par groupe de fumeur"
]
},
{
"cell_type": "code",
"execution_count": 76,
"metadata": {
"hideCode": false,
"hidePrompt": false
},
"outputs": [],
"source": [
"grouped['Mortality Rate'] = grouped['Dead'] / (grouped['Alive'] + grouped['Dead'])"
]
},
{
"cell_type": "markdown",
"metadata": {
"hideCode": false,
"hidePrompt": false
},
"source": [
"Affichage des résultats avec le taux de mortalité"
]
},
{
"cell_type": "code",
"execution_count": 77,
"metadata": {
"hideCode": false,
"hidePrompt": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th>Status</th>\n",
" <th>Alive</th>\n",
" <th>Dead</th>\n",
" <th>Mortality Rate</th>\n",
" </tr>\n",
" <tr>\n",
" <th>Smoker</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>No</th>\n",
" <td>502</td>\n",
" <td>230</td>\n",
" <td>0.314208</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Yes</th>\n",
" <td>443</td>\n",
" <td>139</td>\n",
" <td>0.238832</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
"Status Alive Dead Mortality Rate\n",
"Smoker \n",
"No 502 230 0.314208\n",
"Yes 443 139 0.238832"
]
},
"execution_count": 77,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"grouped"
]
},
{
"cell_type": "markdown",
"metadata": {
"hideCode": false,
"hidePrompt": false
},
"source": [
"## Étape 4 : Introduction des classes d'âge"
]
},
{
"cell_type": "markdown",
"metadata": {
"hideCode": false,
"hidePrompt": false
},
"source": [
"Les classes d'âge sont divisées en intervalles (18-34, 34-54, 55-64, 65+), et ces catégories sont ajoutées à notre DataFrame à l'aide de la fonction ```pd.cut()```."
]
},
{
"cell_type": "markdown",
"metadata": {
"hideCode": false,
"hidePrompt": false
},
"source": [
"Définition des tranches d'âge"
]
},
{
"cell_type": "code",
"execution_count": 78,
"metadata": {
"hideCode": false,
"hidePrompt": false
},
"outputs": [],
"source": [
"bins = [0, 34, 54, 64, 100] # Tranches d'âge\n",
"labels = ['18-34', '34-54', '55-64', '65+']"
]
},
{
"cell_type": "markdown",
"metadata": {
"hideCode": false,
"hidePrompt": false
},
"source": [
"Ajouter une colonne 'Age Group' à df"
]
},
{
"cell_type": "code",
"execution_count": 79,
"metadata": {
"hideCode": false,
"hidePrompt": false
},
"outputs": [],
"source": [
"df['Age Group'] = pd.cut(df['Age'], bins=bins, labels=labels, right=False)"
]
},
{
"cell_type": "markdown",
"metadata": {
"hideCode": false,
"hidePrompt": false
},
"source": [
"Groupement par statut de fumeur, groupe d'âge et statut de vie/mort"
]
},
{
"cell_type": "code",
"execution_count": 80,
"metadata": {
"hideCode": false,
"hidePrompt": false
},
"outputs": [],
"source": [
"grouped_age = df.groupby(['Smoker', 'Age Group', 'Status']).size().unstack(fill_value=0)"
]
},
{
"cell_type": "markdown",
"metadata": { "metadata": {
"hideCode": false, "hideCode": false,
"hidePrompt": false "hidePrompt": false
}, },
"source": [
"Calcul du taux de mortalité par groupe d'âge et statut de fumeur"
]
},
{
"cell_type": "code",
"execution_count": 81,
"metadata": {},
"outputs": [],
"source": [
"grouped_age['Mortality Rate'] = grouped_age['Dead'] / (grouped_age['Alive'] + grouped_age['Dead'])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" Affichage des effectifs et du taux de mortalité par groupe d'âge et statut de fumeur"
]
},
{
"cell_type": "code",
"execution_count": 82,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Status</th>\n",
" <th>Alive</th>\n",
" <th>Dead</th>\n",
" <th>Mortality Rate</th>\n",
" </tr>\n",
" <tr>\n",
" <th>Smoker</th>\n",
" <th>Age Group</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th rowspan=\"4\" valign=\"top\">No</th>\n",
" <th>18-34</th>\n",
" <td>213</td>\n",
" <td>6</td>\n",
" <td>0.027397</td>\n",
" </tr>\n",
" <tr>\n",
" <th>34-54</th>\n",
" <td>180</td>\n",
" <td>19</td>\n",
" <td>0.095477</td>\n",
" </tr>\n",
" <tr>\n",
" <th>55-64</th>\n",
" <td>80</td>\n",
" <td>39</td>\n",
" <td>0.327731</td>\n",
" </tr>\n",
" <tr>\n",
" <th>65+</th>\n",
" <td>29</td>\n",
" <td>166</td>\n",
" <td>0.851282</td>\n",
" </tr>\n",
" <tr>\n",
" <th rowspan=\"4\" valign=\"top\">Yes</th>\n",
" <th>18-34</th>\n",
" <td>174</td>\n",
" <td>5</td>\n",
" <td>0.027933</td>\n",
" </tr>\n",
" <tr>\n",
" <th>34-54</th>\n",
" <td>198</td>\n",
" <td>41</td>\n",
" <td>0.171548</td>\n",
" </tr>\n",
" <tr>\n",
" <th>55-64</th>\n",
" <td>64</td>\n",
" <td>51</td>\n",
" <td>0.443478</td>\n",
" </tr>\n",
" <tr>\n",
" <th>65+</th>\n",
" <td>7</td>\n",
" <td>42</td>\n",
" <td>0.857143</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
"Status Alive Dead Mortality Rate\n",
"Smoker Age Group \n",
"No 18-34 213 6 0.027397\n",
" 34-54 180 19 0.095477\n",
" 55-64 80 39 0.327731\n",
" 65+ 29 166 0.851282\n",
"Yes 18-34 174 5 0.027933\n",
" 34-54 198 41 0.171548\n",
" 55-64 64 51 0.443478\n",
" 65+ 7 42 0.857143"
]
},
"execution_count": 82,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"grouped_age"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Étape 5 : Régression logistique"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Ici, nous analysons la probabilité de décès en fonction de l'âge et du statut de fumeur à l'aide d'une régression logistique."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Création de la variable binaire 'Death' où 1 = mort, 0 = vivant"
]
},
{
"cell_type": "code",
"execution_count": 83,
"metadata": {},
"outputs": [],
"source": [
"df['Death'] = df['Status'].apply(lambda x: 1 if x == 'Dead' else 0)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Modèle de régression logistique : 'Death' ~ 'Age' + 'Smoker'"
]
},
{
"cell_type": "code",
"execution_count": 84,
"metadata": {},
"outputs": [],
"source": [
"X = pd.get_dummies(df[['Age', 'Smoker']], drop_first=True) # Convertir 'Smoker' en variables binaires\n",
"y = df['Death']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Ajouter une constante pour l'interception"
]
},
{
"cell_type": "code",
"execution_count": 85,
"metadata": {},
"outputs": [],
"source": [
"X = sm.add_constant(X)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Création du modèle logistique"
]
},
{
"cell_type": "code",
"execution_count": 93,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Optimization terminated successfully.\n",
" Current function value: 0.381244\n",
" Iterations 7\n"
]
}
],
"source": [
"model = sm.Logit(y, X)\n",
"result = model.fit()"
]
},
{
"cell_type": "code",
"execution_count": 94,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<table class=\"simpletable\">\n",
"<caption>Logit Regression Results</caption>\n",
"<tr>\n",
" <th>Dep. Variable:</th> <td>Death</td> <th> No. Observations: </th> <td> 1314</td> \n",
"</tr>\n",
"<tr>\n",
" <th>Model:</th> <td>Logit</td> <th> Df Residuals: </th> <td> 1311</td> \n",
"</tr>\n",
"<tr>\n",
" <th>Method:</th> <td>MLE</td> <th> Df Model: </th> <td> 2</td> \n",
"</tr>\n",
"<tr>\n",
" <th>Date:</th> <td>Sun, 10 Nov 2024</td> <th> Pseudo R-squ.: </th> <td>0.3579</td> \n",
"</tr>\n",
"<tr>\n",
" <th>Time:</th> <td>20:08:45</td> <th> Log-Likelihood: </th> <td> -500.95</td> \n",
"</tr>\n",
"<tr>\n",
" <th>converged:</th> <td>True</td> <th> LL-Null: </th> <td> -780.16</td> \n",
"</tr>\n",
"<tr>\n",
" <th> </th> <td> </td> <th> LLR p-value: </th> <td>5.534e-122</td>\n",
"</tr>\n",
"</table>\n",
"<table class=\"simpletable\">\n",
"<tr>\n",
" <td></td> <th>coef</th> <th>std err</th> <th>z</th> <th>P>|z|</th> <th>[0.025</th> <th>0.975]</th> \n",
"</tr>\n",
"<tr>\n",
" <th>const</th> <td> -6.3519</td> <td> 0.360</td> <td> -17.637</td> <td> 0.000</td> <td> -7.058</td> <td> -5.646</td>\n",
"</tr>\n",
"<tr>\n",
" <th>Age</th> <td> 0.0998</td> <td> 0.006</td> <td> 17.290</td> <td> 0.000</td> <td> 0.089</td> <td> 0.111</td>\n",
"</tr>\n",
"<tr>\n",
" <th>Smoker_Yes</th> <td> 0.2787</td> <td> 0.165</td> <td> 1.689</td> <td> 0.091</td> <td> -0.045</td> <td> 0.602</td>\n",
"</tr>\n",
"</table>"
],
"text/plain": [
"<class 'statsmodels.iolib.summary.Summary'>\n",
"\"\"\"\n",
" Logit Regression Results \n",
"==============================================================================\n",
"Dep. Variable: Death No. Observations: 1314\n",
"Model: Logit Df Residuals: 1311\n",
"Method: MLE Df Model: 2\n",
"Date: Sun, 10 Nov 2024 Pseudo R-squ.: 0.3579\n",
"Time: 20:08:45 Log-Likelihood: -500.95\n",
"converged: True LL-Null: -780.16\n",
" LLR p-value: 5.534e-122\n",
"==============================================================================\n",
" coef std err z P>|z| [0.025 0.975]\n",
"------------------------------------------------------------------------------\n",
"const -6.3519 0.360 -17.637 0.000 -7.058 -5.646\n",
"Age 0.0998 0.006 17.290 0.000 0.089 0.111\n",
"Smoker_Yes 0.2787 0.165 1.689 0.091 -0.045 0.602\n",
"==============================================================================\n",
"\"\"\""
]
},
"execution_count": 94,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"result.summary()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Cette section vous donnera les coefficients du modèle et vous permettra d'interpréter l'effet de l'âge et du tabagisme sur la probabilité de décès."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Étape 6: Visualisation des résultats"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Dans cette étape, nous créons un graphique pour visualiser le taux de mortalité par groupe d'âge et statut de fumeur. Nous utilisons Seaborn pour créer un diagramme à barres."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Calcul du taux de mortalité par statut de fumeur et groupe d'âge"
]
},
{
"cell_type": "code",
"execution_count": 96,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"mortality_by_group = df.groupby(['Age Group', 'Smoker'])['Death'].mean().reset_index()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Renommer la colonne 'Death' en 'Mortality Rate'"
]
},
{
"cell_type": "code",
"execution_count": 97,
"metadata": {},
"outputs": [],
"source": [
"mortality_by_group.rename(columns={'Death': 'Mortality Rate'}, inplace=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Graphique des taux de mortalité par groupe d'âge et statut de fumeur"
]
},
{
"cell_type": "code",
"execution_count": 98,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<matplotlib.axes._subplots.AxesSubplot at 0x7fcb88ee6828>"
]
},
"execution_count": 98,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"sns.barplot(data=mortality_by_group, x='Age Group', y='Mortality Rate', hue='Smoker')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Titre et affichage du graphique"
]
},
{
"cell_type": "code",
"execution_count": 101,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 413.359x360 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"sns.lmplot(data=df, x='Age', y='Predicted Death Probability', hue='Smoker', logistic=True)\n",
"\n",
"plt.title(\"Probabilité de décès par âge et statut de fumeur\")\n",
"plt.show()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [], "outputs": [],
"source": [] "source": []
} }
......
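The original assignment text mentions confidence intervals, and Step 5 leaves the coefficients uninterpreted. The sketch below is not part of the notebook. It is a minimal standalone check, assuming the counts from the Step 2 table and the coefficients from the Step 5 summary are copied by hand, and using the simple normal (Wald) approximation for the intervals. The helper name `rate_with_ci` is hypothetical.

```python
import math

# Counts copied by hand from the Step 2 table: (Alive, Dead) per smoking status.
counts = {"No": (502, 230), "Yes": (443, 139)}


def rate_with_ci(alive, dead, z=1.96):
    """Return (rate, low, high) with a normal-approximation 95% interval."""
    n = alive + dead
    p = dead / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p, p - half_width, p + half_width


intervals = {smoker: rate_with_ci(a, d) for smoker, (a, d) in counts.items()}
for smoker, (p, low, high) in intervals.items():
    print(f"Smoker={smoker}: rate={p:.3f}, 95% CI=({low:.3f}, {high:.3f})")

# Odds ratios implied by the logit coefficients copied from the Step 5 summary.
odds_ratio_smoker = math.exp(0.2787)  # smokers vs non-smokers at equal age
odds_ratio_age = math.exp(0.0998)     # per additional year of age
print(f"Odds ratio (smoker): {odds_ratio_smoker:.2f}")
print(f"Odds ratio (per year of age): {odds_ratio_age:.2f}")
```

Read together, the two parts illustrate the paradox in the data: the crude mortality rate is lower for smokers, while the age-adjusted odds ratio for smoking is above 1. The Step 4 table suggests why, since non-smokers are over-represented in the oldest class.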