{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# !pip install folium scikit-learn scipy" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "import folium\n", "import matplotlib.pyplot as plt\n", "from sklearn.cluster import KMeans\n", "from scipy.spatial import distance\n", "%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# The 1854 cholera epidemic in London\n", "\n", "This study builds an **epidemiological map** to better understand the 1854 cholera epidemic in London's Soho district. By analyzing the data, we try to locate the **center of the epidemic** and show that it lies close to one of the neighborhood's water pumps." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Loading and previewing the data" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "data_death = pd.read_csv(\"deaths.csv\")\n", "data_pumps = pd.read_csv(\"pumps.csv\")\n", "data_death_pumps = pd.read_csv(\"deaths_and_pumps.csv\")" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Death dataset columns : ['Death', 'X coordinate', 'Y coordinate']\n", "Pumps dataset columns : ['Pump Name', 'X coordinate', 'Y coordinate']\n", "Death/Pumps dataset columns : ['Number of deaths', 'X coordinate', 'Y coordinate']\n", "\n" ] } ], "source": [ "print(\"\"\"\n", "Death dataset columns : {}\n", "Pumps dataset columns : {}\n", "Death/Pumps dataset columns : {}\n", "\"\"\".format(list(data_death.columns), list(data_pumps.columns), list(data_death_pumps.columns)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We rename the columns to avoid typos caused by capital letters and spaces." 
] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "death_cols = {\n", " list(data_death.columns)[0]: 'd_count',\n", " list(data_death.columns)[1]: 'x', \n", " list(data_death.columns)[2]: 'y'}\n", "pump_cols = {\n", " list(data_pumps.columns)[0]: 'name',\n", " list(data_pumps.columns)[1]: 'x', \n", " list(data_pumps.columns)[2]: 'y'}\n", "d_p_cols = {\n", " list(data_death_pumps.columns)[0]: 'death_per_pumps',\n", " list(data_death_pumps.columns)[1]: 'x', \n", " list(data_death_pumps.columns)[2]: 'y'}\n", "\n", "data_death.rename(columns=death_cols, inplace=True)\n", "data_pumps.rename(columns=pump_cols, inplace=True)\n", "data_death_pumps.rename(columns=d_p_cols, inplace=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A quick look at the data." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " d_count x y\n", "0 1 51.513418 -0.137930\n", "1 1 51.513418 -0.137930\n", "2 1 51.513418 -0.137930\n", "3 1 51.513361 -0.137883\n", "4 1 51.513361 -0.137883\n", "\n", "\n", " name x y\n", "0 Broad St. 51.513341 -0.136668\n", "1 Crown Chapel 51.513876 -0.139586\n", "2 Gt Marlborough 51.514906 -0.139671\n", "3 Dean St. 
51.512354 -0.131630\n", "4 So Soho 51.512139 -0.133594\n", "\n", "\n", " death_per_pumps x y\n", "0 3 51.513418 -0.137930\n", "1 2 51.513361 -0.137883\n", "2 1 51.513317 -0.137853\n", "3 1 51.513262 -0.137812\n", "4 4 51.513204 -0.137767\n" ] } ], "source": [ "print(data_death.head())\n", "print('\\n')\n", "print(data_pumps.head())\n", "print('\\n')\n", "print(data_death_pumps.head())" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Rows with missing values in deaths.csv: 0\n", "Rows with missing values in pumps.csv: 0\n", "Rows with missing values in deaths_and_pumps.csv: 0\n" ] } ], "source": [ "print(\"Rows with missing values in deaths.csv: {}\".format(len(data_death[data_death.isnull().any(axis=1)])))\n", "print(\"Rows with missing values in pumps.csv: {}\".format(len(data_pumps[data_pumps.isnull().any(axis=1)])))\n", "print(\"Rows with missing values in deaths_and_pumps.csv: {}\".format(len(data_death_pumps[data_death_pumps.isnull().any(axis=1)])))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Building the map\n", "\n", "### Deaths\n", "\n", "We start by plotting the deaths on the map, centering it on a coordinate taken from the dataset." 
] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "data_death_df = data_death.groupby(['x', 'y']).d_count.count().to_frame()\n", "data_death_df.reset_index(inplace=True)\n", "death_coordinates = data_death_df[[\"x\",\"y\"]]\n", "death_coordinates = death_coordinates.values.tolist()" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "soho_c = death_coordinates[0]\n", "death_map = folium.Map(location=soho_c, tiles='Stamen Toner', zoom_start=17)\n", "for p in range(0, len(death_coordinates)):\n", " folium.CircleMarker(death_coordinates[p], radius=2*int(data_death_df['d_count'][p]), \n", " color='blue', fill=True, fill_color='blue',\n", " opacity = 0.4).add_to(death_map)" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ "" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "death_map" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Pumps\n", "\n", "We then add the pump locations." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "pump_coordinates = data_pumps[[\"x\",\"y\"]]\n", "pump_coordinates = pump_coordinates.values.tolist()" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ "" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# death_pump_map is the same Map object as death_map, not a copy\n", "death_pump_map = death_map\n", "for p in range(0, len(pump_coordinates)):\n", " folium.Marker(pump_coordinates[p],\n", " popup='Name : {}'.format(data_pumps['name'][p]),\n", " icon=folium.Icon(color='red', icon='info-sign')).add_to(death_pump_map)\n", "death_pump_map" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Finding the pump at the center of the epidemic" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "On the previous map, a circle larger than the others makes it clear that the highest density of deaths lies closest to the Broad St. pump. Let us try to demonstrate this through analysis.
\n", "For example, we can use the [K-means](https://fr.wikipedia.org/wiki/K-moyennes) algorithm to form **clusters** and check which pump lies closest to the center of the cluster containing the most cases.
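
The idea can be sketched on its own, independently of the dataset. This is a minimal sketch under stated assumptions: the points and the two pumps named A and B are hypothetical toy values (not taken from deaths.csv or pumps.csv), and a plain-Python Lloyd iteration stands in for scikit-learn's KMeans so that the example runs without any dependency.

```python
import math
from collections import Counter


def assign(points, centers):
    # Label each point with the index of its nearest center (Euclidean distance).
    return [min(range(len(centers)), key=lambda k: math.dist(p, centers[k]))
            for p in points]


def kmeans(points, centers, n_iter=50):
    # Plain Lloyd iteration: assign points, then move each center to the
    # mean of its members, until the centers stop changing.
    centers = list(centers)
    for _ in range(n_iter):
        labels = assign(points, centers)
        updated = []
        for k, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                updated.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                updated.append(old)  # an empty cluster keeps its previous center
        if updated == centers:
            break
        centers = updated
    return centers, assign(points, centers)


# Toy data (hypothetical, for illustration only).
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (0.5, 0.5),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
pumps = {'A': (0.4, 0.6), 'B': (9.0, 9.0)}

# Start from two of the points as initial centers (an arbitrary choice here).
centers, labels = kmeans(points, [points[0], points[5]])

# Cluster holding the most points, then the pump nearest to its center.
largest = Counter(labels).most_common(1)[0][0]
nearest = min(pumps, key=lambda name: math.dist(pumps[name], centers[largest]))
print(largest, nearest)
```

The same three steps (cluster, pick the largest cluster, take the nearest pump) are what the rest of the notebook carries out on the real coordinates with scikit-learn and scipy.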
\n", "We start by initializing K-means with as many clusters as there are pumps." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "n_pumps = len(data_pumps)" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "coords = data_death[['x', 'y']]\n", "kmeans = KMeans(n_clusters=n_pumps, init='k-means++')\n", "# fit_predict fits the model and returns each row's cluster label in one pass\n", "labels = kmeans.fit_predict(coords)\n", "data_death['cluster_label'] = labels\n", "centers = kmeans.cluster_centers_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let us look at how the clusters are distributed." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "data_death.plot.scatter(x = 'x', y = 'y', c=labels, s=20, \n", " ylim=[data_pumps['y'].min()-0.001, data_pumps['y'].max()-0.001],\n", " xlim=[data_pumps['x'].min()+0.0015, data_pumps['x'].max()+0.001], cmap='viridis')\n", "plt.scatter(centers[:, 0], centers[:, 1], c='red', s=100, alpha=0.8)\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here we can see that there are too many clusters. For some points, it is not clear why they belong to one cluster rather than another. The map also shows that many pumps lie outside the center of the epidemic.\n", "We need to find the optimal number of clusters. To do so, we use the [elbow method](https://en.wikipedia.org/wiki/Elbow_method_(clustering))." ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "K_clusters = range(1, n_pumps)\n", "models = [KMeans(n_clusters=k) for k in K_clusters]\n", "# Score each K on both coordinates (x and y), not on a single column\n", "coords = data_death[['x', 'y']]\n", "score = [model.fit(coords).score(coords) for model in models]\n", "plt.plot(K_clusters, score)\n", "plt.xlabel('Number of clusters')\n", "plt.ylabel('Score')\n", "plt.title('Elbow Curve')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The curve suggests that the optimal value of **K** is **3**." ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "coords = data_death[['x', 'y']]\n", "kmeans = KMeans(n_clusters=3, init='k-means++')\n", "# fit_predict fits the model and returns each row's cluster label in one pass\n", "labels = kmeans.fit_predict(coords)\n", "data_death['cluster_label'] = labels\n", "centers = kmeans.cluster_centers_" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "data_death.plot.scatter(x = 'x', y = 'y', c=labels, s=20, \n", " ylim=[data_pumps['y'].min()-0.001, data_pumps['y'].max()-0.001],\n", " xlim=[data_pumps['x'].min()+0.0015, data_pumps['x'].max()+0.001], cmap='viridis')\n", "plt.scatter(centers[:, 0], centers[:, 1], c='red', s=100, alpha=0.8)\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We retrieve the cluster that contains the most deaths." ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Cluster with the most deaths: 0\n" ] } ], "source": [ "cluster_by_death = data_death.groupby('cluster_label').count()['d_count']\n", "max_cluster = cluster_by_death.idxmax()\n", "print('Cluster with the most deaths: {}'.format(max_cluster))" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "cluster_label\n", "0 202\n", "1 137\n", "2 150\n", "Name: d_count, dtype: int64" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cluster_by_death" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can place the cluster centers on the map to visualize them better and check that the cluster we found lies near the Broad St. pump." ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ "" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cluster_map = death_pump_map\n", "for p in range(0, len(centers)):\n", " folium.Marker(centers[p],\n", " popup='Cluster : {}'.format(p),\n", " icon=folium.Icon(color='green')).add_to(death_pump_map)\n", "cluster_map" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The map supports our hypothesis visually. We can also compute the **Euclidean distance** between the center of cluster 0 and each pump to check that it is closest to the Broad St. pump." ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": [ "c_0_coordinates = centers[max_cluster]" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [], "source": [ "euclidean_distances = []\n", "for i in pump_coordinates:\n", " euclidean_distances.append(distance.euclidean(i, c_0_coordinates))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The index of the smallest distance gives the index of the pump at the center of the epidemic." ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": [ "pump_idx = euclidean_distances.index(min(euclidean_distances))" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "name Broad St.\n", "x 51.5133\n", "y -0.136668\n", "Name: 0, dtype: object" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data_pumps.iloc[pump_idx]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**The clustering therefore shows that the Broad St. pump is at the center of the epidemic.**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.4" } }, "nbformat": 4, "nbformat_minor": 4 }