From 758aa33b59651b170581f204067e54a744bff83b Mon Sep 17 00:00:00 2001 From: 7f9d4a2f9f536fc2da1beb7df3382bb3 <7f9d4a2f9f536fc2da1beb7df3382bb3@app-learninglab.inria.fr> Date: Fri, 19 Dec 2025 19:42:13 +0000 Subject: [PATCH] Add computational document on Simpson paradox --- module3/exo3/exercice_en.ipynb | 317 ++++++++++++++++++++++++++++++++- 1 file changed, 314 insertions(+), 3 deletions(-) diff --git a/module3/exo3/exercice_en.ipynb b/module3/exo3/exercice_en.ipynb index 0bbbe37..924d6ae 100644 --- a/module3/exo3/exercice_en.ipynb +++ b/module3/exo3/exercice_en.ipynb @@ -1,5 +1,317 @@ { - "cells": [], + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Around Simpson's paradox\n", + "\n", + "## Objective\n", + "Simpson's paradox describes a statistical situation in which a trend observed\n", + "in several subgroups disappears or reverses when the data are aggregated.\n", + "\n", + "This document aims to:\n", + "- illustrate the paradox on a simple dataset,\n", + "- show the trends within subgroups and overall,\n", + "- discuss the implications for data analysis and reproducibility.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [], + "source": [ + "import pandas as pd\n", + "import matplotlib.pyplot as plt\n" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " Groupe Succès Total Traitement Taux_succès\n", + "0 Jeunes 90 100 A 0.9\n", + "1 Âgés 10 100 A 0.1\n", + "2 Jeunes 80 100 B 0.8\n", + "3 Âgés 20 100 B 0.2" + ] + }, + "execution_count": 5, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Example data: successes by treatment and age group\n", + "data = pd.DataFrame({\n", + " \"Traitement\": [\"A\", \"A\", \"B\", \"B\"],\n", + " \"Groupe\": [\"Jeunes\", \"Âgés\", \"Jeunes\", \"Âgés\"],\n", + " \"Succès\": [90, 10, 80, 20],\n", + " \"Total\": [100, 100, 100, 100]\n", + "})\n", + "\n", + "data[\"Taux_succès\"] = data[\"Succès\"] / data[\"Total\"]\n", + "data\n" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + "Traitement A B\n", + "Groupe \n", + "Jeunes 0.9 0.8\n", + "Âgés 0.1 0.2" + ] + }, + "execution_count": 6, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "pivot = data.pivot(index=\"Groupe\", columns=\"Traitement\", values=\"Taux_succès\")\n", + "pivot\n" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " Succès Total Taux_succès\n", + "Traitement \n", + "A 100 200 0.5\n", + "B 100 200 0.5" + ] + }, + "execution_count": 7, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "global_data = data.groupby(\"Traitement\")[[\"Succès\", \"Total\"]].sum()\n", + "global_data[\"Taux_succès\"] = global_data[\"Succès\"] / global_data[\"Total\"]\n", + "global_data\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Discussion\n", + "\n", + "The paradox arises when the groups are not evenly distributed across treatments: the\n", + "relative weight of each subgroup then strongly influences the aggregated result. Here every\n", + "cell has 100 subjects, so the subgroup differences cancel out and both overall rates equal 0.5.\n", + "\n", + "This phenomenon underlines the importance of:\n", + "- stratifying the data before analysis,\n", + "- understanding confounding variables,\n", + "- not relying solely on aggregate statistics.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Reproducibility\n", + "\n", + "- The data are embedded directly in the code.\n", + "- The computations are deterministic (no randomness).\n", + "- The libraries used are standard (pandas, matplotlib).\n", + "- The document can be re-executed in full on another machine.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Conclusion\n", + "\n", + "Simpson's paradox shows that opposite conclusions can be drawn\n", + "depending on the level of aggregation of the data.\n", + "\n", + "It is a reminder that data analysis requires:\n", + "- a detailed understanding of the context,\n", + "- exploration at several levels,\n", + "- great caution when interpreting results.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], "metadata": { "kernelspec": { "display_name": "Python 3", @@ -16,10 +328,9 @@ "name": "python", 
"nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.6.3" + "version": "3.6.4" } }, "nbformat": 4, "nbformat_minor": 2 } - -- 2.18.1