no commit message

parent 2537a5a2
{ {
"cells": [], "cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Around Simpson's Paradox"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Importing the required libraries"
]
},
{
"cell_type": "code",
"execution_count": 55,
"metadata": {},
"outputs": [],
"source": [
"%matplotlib inline\n",
"import matplotlib.pyplot as plt\n",
"import pandas as pd\n",
"import numpy as np\n",
"from statsmodels.tools.tools import add_constant\n",
"from statsmodels.discrete.discrete_model import Logit\n",
"from tqdm import tqdm"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Loading and exploring the data\n",
"\n",
"The data can be loaded from the copy stored in the GitLab folder or from this [link](https://gitlab.inria.fr/learninglab/mooc-rr/mooc-rr-ressources/blob/master/module3/Practical_session/Subject6_smoking.csv).\n",
"Each row records whether the person smokes, whether they were alive or dead at the time of the second study (1995), and their age at the time of the first survey (1977)."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Smoker</th>\n",
" <th>Status</th>\n",
" <th>Age</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Yes</td>\n",
" <td>Alive</td>\n",
" <td>21.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Yes</td>\n",
" <td>Alive</td>\n",
" <td>19.3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>No</td>\n",
" <td>Dead</td>\n",
" <td>57.5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>No</td>\n",
" <td>Alive</td>\n",
" <td>47.1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Yes</td>\n",
" <td>Alive</td>\n",
" <td>81.4</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Smoker Status Age\n",
"0 Yes Alive 21.0\n",
"1 Yes Alive 19.3\n",
"2 No Dead 57.5\n",
"3 No Alive 47.1\n",
"4 Yes Alive 81.4"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
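],
"source": [
"data_path = \"Subject6_smoking.csv\"\n",
"# Remote alternative: use the /raw/ path, because the /blob/ page serves HTML rather than the CSV itself\n",
"# data_path = \"https://gitlab.inria.fr/learninglab/mooc-rr/mooc-rr-ressources/raw/master/module3/Practical_session/Subject6_smoking.csv\"\n",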
"\n",
"raw_data = pd.read_csv(data_path)\n",
"raw_data.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's check whether any values are missing:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Smoker</th>\n",
" <th>Status</th>\n",
" <th>Age</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
"Empty DataFrame\n",
"Columns: [Smoker, Status, Age]\n",
"Index: []"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"raw_data[raw_data.isnull().any(axis=1)]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"No data are missing here."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Deaths by smoking status\n",
"\n",
"We can count the people alive or dead in 1995 according to whether or not they smoke."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th>Status</th>\n",
" <th>Alive</th>\n",
" <th>Dead</th>\n",
" </tr>\n",
" <tr>\n",
" <th>Smoker</th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>No</th>\n",
" <td>502</td>\n",
" <td>230</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Yes</th>\n",
" <td>443</td>\n",
" <td>139</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
"Status Alive Dead\n",
"Smoker \n",
"No 502 230\n",
"Yes 443 139"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"tableau_croise = pd.crosstab(raw_data[\"Smoker\"], raw_data[\"Status\"])\n",
"tableau_croise"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can also compute the mortality rate for each smoking category."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Smoker\n",
"No 0.314208\n",
"Yes 0.238832\n",
"dtype: float64"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"taux_mortalite = tableau_croise['Dead'] / tableau_croise.sum(axis=1)\n",
"taux_mortalite"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We observe a higher mortality rate among non-smokers (31% versus 24%), which may seem surprising at first sight."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can now examine whether age affects this mortality rate."
]
},
{
"cell_type": "code",
"execution_count": 58,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Cross-tabulation:\n",
"Status Alive Dead\n",
"Smoker Age_Category \n",
"No 18-34 years 213 6\n",
" 34-54 years 180 19\n",
" 54-64 years 80 39\n",
" 64 years and over 29 166\n",
"Yes 18-34 years 174 5\n",
" 34-54 years 198 41\n",
" 54-64 years 64 51\n",
" 64 years and over 7 42\n",
"\n",
"Mortality rate:\n",
"Smoker Age_Category \n",
"No 18-34 years 0.027397\n",
" 34-54 years 0.095477\n",
" 54-64 years 0.327731\n",
" 64 years and over 0.851282\n",
"Yes 18-34 years 0.027933\n",
" 34-54 years 0.171548\n",
" 54-64 years 0.443478\n",
" 64 years and over 0.857143\n",
"dtype: float64\n"
]
}
],
"source": [
"bins = [18, 34, 54, 64, float('inf')] # Category boundaries; right=False makes each interval [lower, upper)\n",
"labels = ['18-34 years', '34-54 years', '54-64 years', '64 years and over'] # Category names, matching the bins above\n",
"raw_data['Age_Category'] = pd.cut(raw_data['Age'], bins=bins, labels=labels, right=False)\n",
"\n",
"# Cross-tabulation by Smoker, Age_Category and Status\n",
"tableau_croise_age = pd.crosstab([raw_data['Smoker'], raw_data['Age_Category']], raw_data['Status'])\n",
"\n",
"# Mortality rate per smoking status and age category\n",
"taux_mortalite_age = tableau_croise_age['Dead'] / tableau_croise_age.sum(axis=1)\n",
"\n",
"print(\"Cross-tabulation:\")\n",
"print(tableau_croise_age)\n",
"print(\"\\nMortality rate:\")\n",
"print(taux_mortalite_age)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Within every age category, the mortality rate of smokers is higher than that of non-smokers, which reverses the overall comparison. This is Simpson's paradox: age and smoking status are not independent. Non-smokers are older on average (195 of the 732 non-smokers fall in the oldest category, against 49 of the 582 smokers), and mortality rises sharply with age."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Logistic regression\n",
"\n",
"First, let's create the Death variable."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"raw_data['Death'] = (raw_data['Status'] == 'Dead').astype(int)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, split the data into smokers and non-smokers."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"data_fumeurs = raw_data[raw_data['Smoker'] == 'Yes']\n",
"data_non_fumeurs = raw_data[raw_data['Smoker'] == 'No']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Fit a logistic regression model for each of the two groups."
]
},
{
"cell_type": "code",
"execution_count": 59,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Optimization terminated successfully.\n",
" Current function value: 0.412727\n",
" Iterations 7\n",
"Optimization terminated successfully.\n",
" Current function value: 0.354560\n",
" Iterations 7\n",
"Regression summary for smokers:\n",
" Logit Regression Results \n",
"==============================================================================\n",
"Dep. Variable: Death No. Observations: 582\n",
"Model: Logit Df Residuals: 580\n",
"Method: MLE Df Model: 1\n",
"Date: Wed, 06 Nov 2024 Pseudo R-squ.: 0.2492\n",
"Time: 14:40:04 Log-Likelihood: -240.21\n",
"converged: True LL-Null: -319.94\n",
" LLR p-value: 1.477e-36\n",
"==============================================================================\n",
" coef std err z P>|z| [0.025 0.975]\n",
"------------------------------------------------------------------------------\n",
"const -5.5081 0.466 -11.814 0.000 -6.422 -4.594\n",
"Age 0.0890 0.009 10.203 0.000 0.072 0.106\n",
"==============================================================================\n",
"\n",
"Regression summary for non-smokers:\n",
" Logit Regression Results \n",
"==============================================================================\n",
"Dep. Variable: Death No. Observations: 732\n",
"Model: Logit Df Residuals: 730\n",
"Method: MLE Df Model: 1\n",
"Date: Wed, 06 Nov 2024 Pseudo R-squ.: 0.4304\n",
"Time: 14:40:04 Log-Likelihood: -259.54\n",
"converged: True LL-Null: -455.62\n",
" LLR p-value: 2.808e-87\n",
"==============================================================================\n",
" coef std err z P>|z| [0.025 0.975]\n",
"------------------------------------------------------------------------------\n",
"const -6.7955 0.479 -14.174 0.000 -7.735 -5.856\n",
"Age 0.1073 0.008 13.742 0.000 0.092 0.123\n",
"==============================================================================\n"
]
}
],
"source": [
"def logistic_regression(data):\n",
"    X = add_constant(data['Age'])  # Add a constant column for the intercept\n",
"    y = data['Death']\n",
"    model = Logit(y, X)\n",
"    result = model.fit()\n",
"    return result\n",
"\n",
"result_fumeurs = logistic_regression(data_fumeurs)\n",
"result_non_fumeurs = logistic_regression(data_non_fumeurs)\n",
"\n",
"print(\"Regression summary for smokers:\")\n",
"print(result_fumeurs.summary())\n",
"\n",
"print(\"\\nRegression summary for non-smokers:\")\n",
"print(result_non_fumeurs.summary())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We now compute predictions over a range of ages running from the minimum to the maximum age observed in the data."
]
},
{
"cell_type": "code",
"execution_count": 60,
"metadata": {},
"outputs": [],
"source": [
"ages = np.linspace(raw_data['Age'].min(), raw_data['Age'].max(), 100)\n",
"X_ages = add_constant(ages)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Compute the predictions and their confidence intervals."
]
},
{
"cell_type": "code",
"execution_count": 61,
"metadata": {},
"outputs": [],
"source": [
"# Compute 95% confidence intervals for the predicted probabilities by hand\n",
"def compute_confidence_interval(result, X, z=1.96):\n",
"    predictions = result.predict(X)\n",
"    # Standard error of the linear predictor X @ beta (logit scale)\n",
"    cov = np.asarray(result.cov_params())\n",
"    std_error = np.sqrt(np.diag(X @ cov @ X.T))\n",
"\n",
"    # Build the interval on the logit scale (normal approximation), then map it\n",
"    # to probabilities so that the bounds stay within [0, 1]\n",
"    linear_pred = X @ np.asarray(result.params)\n",
"    lower_bound = 1 / (1 + np.exp(-(linear_pred - z * std_error)))\n",
"    upper_bound = 1 / (1 + np.exp(-(linear_pred + z * std_error)))\n",
"    return predictions, lower_bound, upper_bound\n",
"\n",
"# Confidence intervals for smokers\n",
"pred_fumeurs, lower_fumeurs, upper_fumeurs = compute_confidence_interval(result_fumeurs, X_ages)\n",
"\n",
"# Confidence intervals for non-smokers\n",
"pred_non_fumeurs, lower_non_fumeurs, upper_non_fumeurs = compute_confidence_interval(result_non_fumeurs, X_ages)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Finally, plot the results."
]
},
{
"cell_type": "code",
"execution_count": 62,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 720x432 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"plt.figure(figsize=(10, 6))\n",
"\n",
"# Curve for smokers\n",
"plt.plot(ages, pred_fumeurs, color='blue', label='Smokers')\n",
"plt.fill_between(ages, lower_fumeurs, upper_fumeurs, color='blue', alpha=0.2)\n",
"\n",
"# Curve for non-smokers\n",
"plt.plot(ages, pred_non_fumeurs, color='red', label='Non-smokers')\n",
"plt.fill_between(ages, lower_non_fumeurs, upper_non_fumeurs, color='red', alpha=0.2)\n",
"\n",
"plt.xlabel(\"Age\")\n",
"plt.ylabel(\"Probability of death\")\n",
"plt.title(\"Probability of death as a function of age for smokers and non-smokers\")\n",
"plt.legend()\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The regression does not show a large gap between the probabilities of death of smokers and non-smokers."
]
}
],
"metadata": { "metadata": {
"kernelspec": { "kernelspec": {
"display_name": "Python 3", "display_name": "Python 3",
...@@ -16,10 +600,9 @@ ...@@ -16,10 +600,9 @@
"name": "python", "name": "python",
"nbconvert_exporter": "python", "nbconvert_exporter": "python",
"pygments_lexer": "ipython3", "pygments_lexer": "ipython3",
"version": "3.6.3" "version": "3.6.4"
} }
}, },
"nbformat": 4, "nbformat": 4,
"nbformat_minor": 2 "nbformat_minor": 2
} }
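
The reversal between the overall and the age-stratified mortality rates reported in the notebook can be checked with a few lines of plain Python. The sketch below is not part of the notebook: it copies the counts from the age-stratified cross-tabulation and compares the crude rate of each group with a directly standardised rate, in which both groups are weighted by the same pooled age distribution. The names `crude_rate`, `standardised_rate` and `pooled` are illustrative assumptions, and direct standardisation is one possible way among others to adjust for age.

```python
# Counts (alive, dead) per age category, copied from the notebook's
# age-stratified cross-tabulation, youngest category first.
counts = {
    "No": [(213, 6), (180, 19), (80, 39), (29, 166)],
    "Yes": [(174, 5), (198, 41), (64, 51), (7, 42)],
}


def crude_rate(strata):
    """Deaths divided by group size, ignoring age."""
    deaths = sum(dead for _, dead in strata)
    total = sum(alive + dead for alive, dead in strata)
    return deaths / total


# Pooled size of each age category (smokers and non-smokers together),
# used as a common set of weights for both groups.
n_strata = len(counts["No"])
pooled = [
    sum(sum(counts[group][i]) for group in counts)
    for i in range(n_strata)
]


def standardised_rate(strata, weights):
    """Average of the per-category rates, weighted by a shared age distribution."""
    rates = [dead / (alive + dead) for alive, dead in strata]
    return sum(w * r for w, r in zip(weights, rates)) / sum(weights)


crude = {group: crude_rate(strata) for group, strata in counts.items()}
standardised = {
    group: standardised_rate(strata, pooled) for group, strata in counts.items()
}

for group in counts:
    print(f"{group}: crude={crude[group]:.3f}, standardised={standardised[group]:.3f}")
```

With these counts, the crude rate is higher for non-smokers, while the standardised rate is higher for smokers: once both groups are given the same age structure, the apparent protective effect of smoking disappears.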