{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Analyse de l'incidence du syndrome grippal\n", "\n", "Dans un premier temps nous allons inspecter les données et dans un deuxième temps nous allons les analyser et en traire une conclusion. \n", "**Remember the first step is to take a manual look at all the data all together!!**" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "#Importation des bibliothèques principales\n", " #Demander à python de garder les fichier inside the document and not on outside widows.\n", "%matplotlib inline\n", "import matplotlib.pyplot as plt\n", "import pandas as pd\n", "import isoweek" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
weekindicatorincinc_lowinc_upinc100inc100_lowinc100_upgeo_inseegeo_name
0202011310170493652.0109756.0154142.0166.0FRFrance
1202010310497796650.0113304.0159146.0172.0FRFrance
22020093110696102066.0119326.0168155.0181.0FRFrance
32020083143753133984.0153522.0218203.0233.0FRFrance
42020073183610172812.0194408.0279263.0295.0FRFrance
.................................
184119844837862060634.096606.0143110.0176.0FRFrance
184219844737202954274.089784.013199.0163.0FRFrance
184319844638733067686.0106974.0159123.0195.0FRFrance
18441984453135223101414.0169032.0246184.0308.0FRFrance
184519844436842220056.0116788.012537.0213.0FRFrance
\n", "

1846 rows × 10 columns

\n", "
" ], "text/plain": [ " week indicator inc inc_low inc_up inc100 inc100_low \\\n", "0 202011 3 101704 93652.0 109756.0 154 142.0 \n", "1 202010 3 104977 96650.0 113304.0 159 146.0 \n", "2 202009 3 110696 102066.0 119326.0 168 155.0 \n", "3 202008 3 143753 133984.0 153522.0 218 203.0 \n", "4 202007 3 183610 172812.0 194408.0 279 263.0 \n", "... ... ... ... ... ... ... ... \n", "1841 198448 3 78620 60634.0 96606.0 143 110.0 \n", "1842 198447 3 72029 54274.0 89784.0 131 99.0 \n", "1843 198446 3 87330 67686.0 106974.0 159 123.0 \n", "1844 198445 3 135223 101414.0 169032.0 246 184.0 \n", "1845 198444 3 68422 20056.0 116788.0 125 37.0 \n", "\n", " inc100_up geo_insee geo_name \n", "0 166.0 FR France \n", "1 172.0 FR France \n", "2 181.0 FR France \n", "3 233.0 FR France \n", "4 295.0 FR France \n", "... ... ... ... \n", "1841 176.0 FR France \n", "1842 163.0 FR France \n", "1843 195.0 FR France \n", "1844 308.0 FR France \n", "1845 213.0 FR France \n", "\n", "[1846 rows x 10 columns]" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data_url= \"https://www.sentiweb.fr/datasets/incidence-PAY-3.csv\"\n", "raw_data = pd.read_csv(data_url, skiprows=1) #Otherwise unable to read the first raw which is messy\n", "raw_data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Missing data wrangling\n", "Now that we have seen a line with missing data we are going to supress it well8" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
weekindicatorincinc_lowinc_upinc100inc100_lowinc100_upgeo_inseegeo_name
0202011310170493652.0109756.0154142.0166.0FRFrance
1202010310497796650.0113304.0159146.0172.0FRFrance
22020093110696102066.0119326.0168155.0181.0FRFrance
32020083143753133984.0153522.0218203.0233.0FRFrance
42020073183610172812.0194408.0279263.0295.0FRFrance
.................................
184119844837862060634.096606.0143110.0176.0FRFrance
184219844737202954274.089784.013199.0163.0FRFrance
184319844638733067686.0106974.0159123.0195.0FRFrance
18441984453135223101414.0169032.0246184.0308.0FRFrance
184519844436842220056.0116788.012537.0213.0FRFrance
\n", "

1845 rows × 10 columns

\n", "
" ], "text/plain": [ " week indicator inc inc_low inc_up inc100 inc100_low \\\n", "0 202011 3 101704 93652.0 109756.0 154 142.0 \n", "1 202010 3 104977 96650.0 113304.0 159 146.0 \n", "2 202009 3 110696 102066.0 119326.0 168 155.0 \n", "3 202008 3 143753 133984.0 153522.0 218 203.0 \n", "4 202007 3 183610 172812.0 194408.0 279 263.0 \n", "... ... ... ... ... ... ... ... \n", "1841 198448 3 78620 60634.0 96606.0 143 110.0 \n", "1842 198447 3 72029 54274.0 89784.0 131 99.0 \n", "1843 198446 3 87330 67686.0 106974.0 159 123.0 \n", "1844 198445 3 135223 101414.0 169032.0 246 184.0 \n", "1845 198444 3 68422 20056.0 116788.0 125 37.0 \n", "\n", " inc100_up geo_insee geo_name \n", "0 166.0 FR France \n", "1 172.0 FR France \n", "2 181.0 FR France \n", "3 233.0 FR France \n", "4 295.0 FR France \n", "... ... ... ... \n", "1841 176.0 FR France \n", "1842 163.0 FR France \n", "1843 195.0 FR France \n", "1844 308.0 FR France \n", "1845 213.0 FR France \n", "\n", "[1845 rows x 10 columns]" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#raw_data[raw_data.dropna().copy()]\n", "data = raw_data.dropna().copy()\n", "data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " Nos données utilisent une convention inhabituelle: le numéro de\n", "semaine est collé à l'année, donnant l'impression qu'il s'agit\n", "de nombre entier. C'est comme ça que Pandas les interprète.Un deuxième problème est que Pandas ne comprend pas les numéros de\n", "semaine. Il faut lui fournir les dates de début et de fin de\n", "semaine. Nous utilisons pour cela la bibliothèque isoweek.Comme la conversion des semaines est devenu assez complexe, nous\n", "écrivons une petite fonction Python pour cela. Ensuite, nous\n", "l'appliquons à tous les points de nos donnés. Les résultats vont\n", "dans une nouvelle colonne 'period'." 
] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 2020-03-09/2020-03-15\n", "1 2020-03-02/2020-03-08\n", "2 2020-02-24/2020-03-01\n", "3 2020-02-17/2020-02-23\n", "4 2020-02-10/2020-02-16\n", " ... \n", "1841 1984-11-26/1984-12-02\n", "1842 1984-11-19/1984-11-25\n", "1843 1984-11-12/1984-11-18\n", "1844 1984-11-05/1984-11-11\n", "1845 1984-10-29/1984-11-04\n", "Name: period, Length: 1845, dtype: object" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def convert_week(year_week_int):\n", "    # Split e.g. 202011 into year 2020 and ISO week 11\n", "    year_week_str = str(year_week_int)\n", "    year = int(year_week_str[:4])\n", "    week = int(year_week_str[4:])\n", "    w = isoweek.Week(year, week)\n", "    # w.day(0) is the Monday of that ISO week\n", "    return pd.Period(w.day(0), 'W')\n", "\n", "data['period'] = [convert_week(yw) for yw in data['week']]\n", "data['period']" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "# Index the data by period and sort chronologically\n", "sorted_data = data.set_index('period').sort_index()" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1989-05-01/1989-05-07 1989-05-15/1989-05-21\n" ] } ], "source": [ "# Check that consecutive periods are contiguous; print any pair separated by a gap\n", "periods = sorted_data.index\n", "for p1, p2 in zip(periods[:-1], periods[1:]):\n", "    delta = p2.to_timestamp() - p1.end_time\n", "    if delta > pd.Timedelta('1s'):\n", "        print(p1, p2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This gap is due to the week of 1989 that we removed because it contained no usable data. It is a good example of a justified removal of data." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "sorted_data['inc'].plot()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Questions et réponses\n", "\n", "### Quelles ont été les épidémies les plus fortes ? \n", "\n", "Epidemics are calculated in yearly incidence not in added months so the problem is to find a convention to define and add up all the weekly incidences. Here is one way " ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "first_august_week = [pd.Period(pd.Timestamp(y, 8, 1), 'W')\n", " for y in range(1985,\n", " sorted_data.index[-1].year)]" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "year = []\n", "yearly_incidence = []\n", "for week1, week2 in zip(first_august_week[:-1],\n", " first_august_week[1:]):\n", " one_year = sorted_data['inc'][week1:week2-1]\n", " assert abs(len(one_year)-52) < 2\n", " yearly_incidence.append(one_year.sum())\n", " year.append(week2.year)\n", "yearly_incidence = pd.Series(data=yearly_incidence, index=year)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "yearly_incidence.plot(style=\"*\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Quelle est la distribution des épidémies? " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.7" } }, "nbformat": 4, "nbformat_minor": 4 }