Read the file and generate the pandas DataFrame.

parent de4a10b3
{ {
"cells": [ "cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Subject 4: Estimating the latency and capacity of a connection from asymmetric measurements"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We start by importing the libraries used:\n",
"\n",
"Note: `urllib.request` is not listed here because it is only needed when the data has to be downloaded."
]
},
{ {
"cell_type": "code", "cell_type": "code",
"execution_count": 1, "execution_count": 1,
"metadata": {}, "metadata": {},
"outputs": [],
"source": [
"%matplotlib inline\n",
"import re\n",
"import gzip\n",
"import time\n",
"import pandas\n",
"import io\n",
"import os"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We then retrieve the data to study:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [ "outputs": [
{ {
"name": "stdout", "name": "stdout",
"output_type": "stream", "output_type": "stream",
"text": [ "text": [
"The data is not available locally, downloading it.\n", "The data is already available locally.\n"
"File retrieved.\n"
] ]
} }
], ],
...@@ -22,7 +59,6 @@ ...@@ -22,7 +59,6 @@
"if data_file[-7:] != \".log.gz\":\n", "if data_file[-7:] != \".log.gz\":\n",
"    raise Exception(\"The file name \"+data_file+\" does not end with \\\".log.gz\\\"!\")\n", "    raise Exception(\"The file name \"+data_file+\" does not end with \\\".log.gz\\\"!\")\n",
"\n", "\n",
"import os\n",
"if not os.access(data_file, os.R_OK):\n", "if not os.access(data_file, os.R_OK):\n",
"    import urllib.request\n", "    import urllib.request\n",
"    print(\"The data is not available locally, downloading it.\")\n", "    print(\"The data is not available locally, downloading it.\")\n",
...@@ -35,12 +71,579 @@ ...@@ -35,12 +71,579 @@
"    print(\"The data is already available locally.\")" "    print(\"The data is already available locally.\")"
] ]
}, },
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We define the function that reads each line and extracts its data. The returned line is formatted as CSV.\n",
"\n",
"Since the value we care about is the latency (or \"ping\") time, a line is only usable when that time is present; for well-formed lines that lack it, the function returns `None`.\n",
"\n",
"If the line cannot be parsed at all, the function raises an exception to warn the user that some lines are in a format the program does not understand. This is preferable to returning `None`, which could silently hide useful data, for example if `ping` had reported a time in seconds rather than milliseconds."
]
},
{ {
"cell_type": "code", "cell_type": "code",
"execution_count": null, "execution_count": 3,
"metadata": {}, "metadata": {},
"outputs": [], "outputs": [
"source": [] {
"name": "stdout",
"output_type": "stream",
"text": [
"1421761682.052172,665,22.5\n",
"\n",
"None\n",
"\n",
"Exception (expected): The line \"[1421761682.052172] 665 bytes from lig-publig.imag.fr (129.88.11.7): icmp_seq=1 ttl=60 time=22.5 s\" is not in the expected format.\n"
]
}
],
"source": [
"extractDataFromLineRegExp = re.compile(r\"^\\[([0-9\\.]+)\\] ([0-9]+) bytes[^:]*: icmp_seq=[0-9]+ ttl=[0-9]+( time=([0-9\\.]+) ms)?$\")\n",
"def extractDataFromLine(line):\n",
"    match = extractDataFromLineRegExp.match(line)\n",
"    if match and match[4]:\n",
"        return match[1]+\",\"+match[2]+\",\"+match[4]+\"\\n\"\n",
"    elif match:\n",
"        return None\n",
"    else:\n",
"        raise Exception(\"The line \\\"\"+line+\"\\\" is not in the expected format.\")\n",
"\n",
"# A few trial runs\n",
"print(extractDataFromLine(\"[1421761682.052172] 665 bytes from lig-publig.imag.fr (129.88.11.7): icmp_seq=1 ttl=60 time=22.5 ms\")) # The returned string includes the trailing newline\n",
"print(extractDataFromLine(\"[1421773281.582445] 13 bytes from stackoverflow.com (198.252.206.140): icmp_seq=1 ttl=50\"))\n",
"print()\n",
"try:\n",
"    print(extractDataFromLine(\"[1421761682.052172] 665 bytes from lig-publig.imag.fr (129.88.11.7): icmp_seq=1 ttl=60 time=22.5 s\"))\n",
"    print(\"An exception should have been raised here.\")\n",
"except Exception as e:\n",
"    print(\"Exception (expected): \"+e.args[0])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Reads the data from the file, uses the `extractDataFromLine` function defined above to extract the values, and stores them in a variable `csv_data` that holds the data in CSV format.\n",
"\n",
"I first tried to skip the intermediate variable and append the rows directly to the DataFrame, but that was extremely slow. Going through an intermediate file would also have been possible, and would have been necessary for a larger dataset."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Read 44413 lines in 0.185 sec\n"
]
}
],
"source": [
"nb = 0\n",
"start_time = time.time()\n",
"csv_data = '\"date\",\"size\",\"time\"\\n' # The first CSV line holds the field names\n",
"with gzip.open(data_file, 'rb') as file:\n",
"    for line in file:\n",
"        line_data = extractDataFromLine(line.decode('utf-8').strip())\n",
"        if line_data:\n",
"            csv_data += line_data\n",
"        nb += 1\n",
"\n",
"print(\"Read %d lines in %.3f sec\" % (nb, time.time() - start_time))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Converts the CSV table into a pandas DataFrame. Because we read from the contents of a variable, we use `io.StringIO`, which lets a string be read like a file."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>date</th>\n",
" <th>size</th>\n",
" <th>time</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1.421762e+09</td>\n",
" <td>665</td>\n",
" <td>22.50</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1.421762e+09</td>\n",
" <td>1373</td>\n",
" <td>21.20</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>1.421762e+09</td>\n",
" <td>262</td>\n",
" <td>21.20</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>1.421762e+09</td>\n",
" <td>1107</td>\n",
" <td>23.30</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>1.421762e+09</td>\n",
" <td>1128</td>\n",
" <td>1.41</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>1.421762e+09</td>\n",
" <td>489</td>\n",
" <td>21.90</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>1.421762e+09</td>\n",
" <td>1759</td>\n",
" <td>78.70</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>1.421762e+09</td>\n",
" <td>1146</td>\n",
" <td>25.10</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>1.421762e+09</td>\n",
" <td>884</td>\n",
" <td>24.00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>1.421762e+09</td>\n",
" <td>1422</td>\n",
" <td>19.50</td>\n",
" </tr>\n",
" <tr>\n",
" <th>10</th>\n",
" <td>1.421762e+09</td>\n",
" <td>1180</td>\n",
" <td>18.00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>11</th>\n",
" <td>1.421762e+09</td>\n",
" <td>999</td>\n",
" <td>18.80</td>\n",
" </tr>\n",
" <tr>\n",
" <th>12</th>\n",
" <td>1.421762e+09</td>\n",
" <td>1020</td>\n",
" <td>24.30</td>\n",
" </tr>\n",
" <tr>\n",
" <th>13</th>\n",
" <td>1.421762e+09</td>\n",
" <td>71</td>\n",
" <td>3.45</td>\n",
" </tr>\n",
" <tr>\n",
" <th>14</th>\n",
" <td>1.421762e+09</td>\n",
" <td>34</td>\n",
" <td>5.85</td>\n",
" </tr>\n",
" <tr>\n",
" <th>15</th>\n",
" <td>1.421762e+09</td>\n",
" <td>1843</td>\n",
" <td>2.31</td>\n",
" </tr>\n",
" <tr>\n",
" <th>16</th>\n",
" <td>1.421762e+09</td>\n",
" <td>407</td>\n",
" <td>1.14</td>\n",
" </tr>\n",
" <tr>\n",
" <th>17</th>\n",
" <td>1.421762e+09</td>\n",
" <td>356</td>\n",
" <td>1.10</td>\n",
" </tr>\n",
" <tr>\n",
" <th>18</th>\n",
" <td>1.421762e+09</td>\n",
" <td>1511</td>\n",
" <td>2.18</td>\n",
" </tr>\n",
" <tr>\n",
" <th>19</th>\n",
" <td>1.421762e+09</td>\n",
" <td>587</td>\n",
" <td>1.27</td>\n",
" </tr>\n",
" <tr>\n",
" <th>20</th>\n",
" <td>1.421762e+09</td>\n",
" <td>809</td>\n",
" <td>1.33</td>\n",
" </tr>\n",
" <tr>\n",
" <th>21</th>\n",
" <td>1.421762e+09</td>\n",
" <td>1364</td>\n",
" <td>1.51</td>\n",
" </tr>\n",
" <tr>\n",
" <th>22</th>\n",
" <td>1.421762e+09</td>\n",
" <td>1153</td>\n",
" <td>1.44</td>\n",
" </tr>\n",
" <tr>\n",
" <th>23</th>\n",
" <td>1.421762e+09</td>\n",
" <td>853</td>\n",
" <td>1.30</td>\n",
" </tr>\n",
" <tr>\n",
" <th>24</th>\n",
" <td>1.421762e+09</td>\n",
" <td>1510</td>\n",
" <td>2.17</td>\n",
" </tr>\n",
" <tr>\n",
" <th>25</th>\n",
" <td>1.421762e+09</td>\n",
" <td>123</td>\n",
" <td>1.21</td>\n",
" </tr>\n",
" <tr>\n",
" <th>26</th>\n",
" <td>1.421762e+09</td>\n",
" <td>1966</td>\n",
" <td>2.20</td>\n",
" </tr>\n",
" <tr>\n",
" <th>27</th>\n",
" <td>1.421762e+09</td>\n",
" <td>933</td>\n",
" <td>1.34</td>\n",
" </tr>\n",
" <tr>\n",
" <th>28</th>\n",
" <td>1.421762e+09</td>\n",
" <td>922</td>\n",
" <td>1.42</td>\n",
" </tr>\n",
" <tr>\n",
" <th>29</th>\n",
" <td>1.421762e+09</td>\n",
" <td>24</td>\n",
" <td>1.12</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>44006</th>\n",
" <td>1.421771e+09</td>\n",
" <td>1772</td>\n",
" <td>28.80</td>\n",
" </tr>\n",
" <tr>\n",
" <th>44007</th>\n",
" <td>1.421771e+09</td>\n",
" <td>41</td>\n",
" <td>1.14</td>\n",
" </tr>\n",
" <tr>\n",
" <th>44008</th>\n",
" <td>1.421771e+09</td>\n",
" <td>1944</td>\n",
" <td>2.32</td>\n",
" </tr>\n",
" <tr>\n",
" <th>44009</th>\n",
" <td>1.421771e+09</td>\n",
" <td>400</td>\n",
" <td>1.98</td>\n",
" </tr>\n",
" <tr>\n",
" <th>44010</th>\n",
" <td>1.421771e+09</td>\n",
" <td>226</td>\n",
" <td>3.01</td>\n",
" </tr>\n",
" <tr>\n",
" <th>44011</th>\n",
" <td>1.421771e+09</td>\n",
" <td>466</td>\n",
" <td>7.45</td>\n",
" </tr>\n",
" <tr>\n",
" <th>44012</th>\n",
" <td>1.421771e+09</td>\n",
" <td>350</td>\n",
" <td>13.50</td>\n",
" </tr>\n",
" <tr>\n",
" <th>44013</th>\n",
" <td>1.421771e+09</td>\n",
" <td>1829</td>\n",
" <td>45.90</td>\n",
" </tr>\n",
" <tr>\n",
" <th>44014</th>\n",
" <td>1.421771e+09</td>\n",
" <td>1954</td>\n",
" <td>58.50</td>\n",
" </tr>\n",
" <tr>\n",
" <th>44015</th>\n",
" <td>1.421771e+09</td>\n",
" <td>1074</td>\n",
" <td>1.45</td>\n",
" </tr>\n",
" <tr>\n",
" <th>44016</th>\n",
" <td>1.421771e+09</td>\n",
" <td>46</td>\n",
" <td>1.11</td>\n",
" </tr>\n",
" <tr>\n",
" <th>44017</th>\n",
" <td>1.421771e+09</td>\n",
" <td>1844</td>\n",
" <td>2.26</td>\n",
" </tr>\n",
" <tr>\n",
" <th>44018</th>\n",
" <td>1.421771e+09</td>\n",
" <td>645</td>\n",
" <td>1.24</td>\n",
" </tr>\n",
" <tr>\n",
" <th>44019</th>\n",
" <td>1.421771e+09</td>\n",
" <td>444</td>\n",
" <td>1.25</td>\n",
" </tr>\n",
" <tr>\n",
" <th>44020</th>\n",
" <td>1.421771e+09</td>\n",
" <td>1940</td>\n",
" <td>2.46</td>\n",
" </tr>\n",
" <tr>\n",
" <th>44021</th>\n",
" <td>1.421771e+09</td>\n",
" <td>1411</td>\n",
" <td>1.47</td>\n",
" </tr>\n",
" <tr>\n",
" <th>44022</th>\n",
" <td>1.421771e+09</td>\n",
" <td>49</td>\n",
" <td>1.21</td>\n",
" </tr>\n",
" <tr>\n",
" <th>44023</th>\n",
" <td>1.421771e+09</td>\n",
" <td>420</td>\n",
" <td>1.55</td>\n",
" </tr>\n",
" <tr>\n",
" <th>44024</th>\n",
" <td>1.421771e+09</td>\n",
" <td>227</td>\n",
" <td>1.22</td>\n",
" </tr>\n",
" <tr>\n",
" <th>44025</th>\n",
" <td>1.421771e+09</td>\n",
" <td>947</td>\n",
" <td>1.34</td>\n",
" </tr>\n",
" <tr>\n",
" <th>44026</th>\n",
" <td>1.421771e+09</td>\n",
" <td>1960</td>\n",
" <td>2.43</td>\n",
" </tr>\n",
" <tr>\n",
" <th>44027</th>\n",
" <td>1.421771e+09</td>\n",
" <td>531</td>\n",
" <td>1.19</td>\n",
" </tr>\n",
" <tr>\n",
" <th>44028</th>\n",
" <td>1.421771e+09</td>\n",
" <td>374</td>\n",
" <td>1.14</td>\n",
" </tr>\n",
" <tr>\n",
" <th>44029</th>\n",
" <td>1.421771e+09</td>\n",
" <td>1503</td>\n",
" <td>2.19</td>\n",
" </tr>\n",
" <tr>\n",
" <th>44030</th>\n",
" <td>1.421771e+09</td>\n",
" <td>572</td>\n",
" <td>1.29</td>\n",
" </tr>\n",
" <tr>\n",
" <th>44031</th>\n",
" <td>1.421771e+09</td>\n",
" <td>1338</td>\n",
" <td>1.47</td>\n",
" </tr>\n",
" <tr>\n",
" <th>44032</th>\n",
" <td>1.421771e+09</td>\n",
" <td>1515</td>\n",
" <td>7.02</td>\n",
" </tr>\n",
" <tr>\n",
" <th>44033</th>\n",
" <td>1.421771e+09</td>\n",
" <td>1875</td>\n",
" <td>2.33</td>\n",
" </tr>\n",
" <tr>\n",
" <th>44034</th>\n",
" <td>1.421771e+09</td>\n",
" <td>1006</td>\n",
" <td>1.61</td>\n",
" </tr>\n",
" <tr>\n",
" <th>44035</th>\n",
" <td>1.421771e+09</td>\n",
" <td>1273</td>\n",
" <td>1.35</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>44036 rows × 3 columns</p>\n",
"</div>"
],
"text/plain": [
" date size time\n",
"0 1.421762e+09 665 22.50\n",
"1 1.421762e+09 1373 21.20\n",
"2 1.421762e+09 262 21.20\n",
"3 1.421762e+09 1107 23.30\n",
"4 1.421762e+09 1128 1.41\n",
"5 1.421762e+09 489 21.90\n",
"6 1.421762e+09 1759 78.70\n",
"7 1.421762e+09 1146 25.10\n",
"8 1.421762e+09 884 24.00\n",
"9 1.421762e+09 1422 19.50\n",
"10 1.421762e+09 1180 18.00\n",
"11 1.421762e+09 999 18.80\n",
"12 1.421762e+09 1020 24.30\n",
"13 1.421762e+09 71 3.45\n",
"14 1.421762e+09 34 5.85\n",
"15 1.421762e+09 1843 2.31\n",
"16 1.421762e+09 407 1.14\n",
"17 1.421762e+09 356 1.10\n",
"18 1.421762e+09 1511 2.18\n",
"19 1.421762e+09 587 1.27\n",
"20 1.421762e+09 809 1.33\n",
"21 1.421762e+09 1364 1.51\n",
"22 1.421762e+09 1153 1.44\n",
"23 1.421762e+09 853 1.30\n",
"24 1.421762e+09 1510 2.17\n",
"25 1.421762e+09 123 1.21\n",
"26 1.421762e+09 1966 2.20\n",
"27 1.421762e+09 933 1.34\n",
"28 1.421762e+09 922 1.42\n",
"29 1.421762e+09 24 1.12\n",
"... ... ... ...\n",
"44006 1.421771e+09 1772 28.80\n",
"44007 1.421771e+09 41 1.14\n",
"44008 1.421771e+09 1944 2.32\n",
"44009 1.421771e+09 400 1.98\n",
"44010 1.421771e+09 226 3.01\n",
"44011 1.421771e+09 466 7.45\n",
"44012 1.421771e+09 350 13.50\n",
"44013 1.421771e+09 1829 45.90\n",
"44014 1.421771e+09 1954 58.50\n",
"44015 1.421771e+09 1074 1.45\n",
"44016 1.421771e+09 46 1.11\n",
"44017 1.421771e+09 1844 2.26\n",
"44018 1.421771e+09 645 1.24\n",
"44019 1.421771e+09 444 1.25\n",
"44020 1.421771e+09 1940 2.46\n",
"44021 1.421771e+09 1411 1.47\n",
"44022 1.421771e+09 49 1.21\n",
"44023 1.421771e+09 420 1.55\n",
"44024 1.421771e+09 227 1.22\n",
"44025 1.421771e+09 947 1.34\n",
"44026 1.421771e+09 1960 2.43\n",
"44027 1.421771e+09 531 1.19\n",
"44028 1.421771e+09 374 1.14\n",
"44029 1.421771e+09 1503 2.19\n",
"44030 1.421771e+09 572 1.29\n",
"44031 1.421771e+09 1338 1.47\n",
"44032 1.421771e+09 1515 7.02\n",
"44033 1.421771e+09 1875 2.33\n",
"44034 1.421771e+09 1006 1.61\n",
"44035 1.421771e+09 1273 1.35\n",
"\n",
"[44036 rows x 3 columns]"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"raw_data = pandas.read_csv(io.StringIO(csv_data))\n",
"# csv_data = None # Frees the memory\n",
"raw_data"
]
} }
], ],
"metadata": { "metadata": {
......
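The notebook in this diff builds a CSV string and then parses it again with `pandas.read_csv`. As a possible alternative, here is a minimal sketch that returns typed tuples and collects them in a list, so no intermediate CSV text is needed. It assumes the same log line format as the notebook. The names `PING_LINE`, `parse_ping_line` and `read_ping_log` are hypothetical and do not appear in the commit. The two-line sample log is built in memory purely for illustration.

```python
import gzip
import io
import re

# Pattern equivalent to the one in the notebook, written as a raw string.
PING_LINE = re.compile(
    r"^\[([0-9.]+)\] ([0-9]+) bytes[^:]*: "
    r"icmp_seq=[0-9]+ ttl=[0-9]+( time=([0-9.]+) ms)?$"
)


def parse_ping_line(line):
    """Return (date, size, time_ms), or None when the line has no time field.

    Raises ValueError when the line does not match the expected format at all.
    """
    match = PING_LINE.match(line)
    if match is None:
        raise ValueError('The line "%s" is not in the expected format.' % line)
    if match.group(4) is None:
        return None
    return float(match.group(1)), int(match.group(2)), float(match.group(4))


def read_ping_log(binary_stream):
    """Collect the parsed rows of a gzip-compressed ping log into a list."""
    rows = []
    # gzip.open accepts an already open binary file object as well as a path.
    with gzip.open(binary_stream, "rt", encoding="utf-8") as log:
        for line in log:
            row = parse_ping_line(line.strip())
            if row is not None:
                rows.append(row)
    return rows


# Hypothetical two-line log, compressed in memory so the sketch needs no file.
sample = (
    "[1421761682.052172] 665 bytes from lig-publig.imag.fr (129.88.11.7): "
    "icmp_seq=1 ttl=60 time=22.5 ms\n"
    "[1421773281.582445] 13 bytes from stackoverflow.com (198.252.206.140): "
    "icmp_seq=1 ttl=50\n"
)
rows = read_ping_log(io.BytesIO(gzip.compress(sample.encode("utf-8"))))
print(rows)
```

The resulting list could then be handed to `pandas.DataFrame(rows, columns=["date", "size", "time"])`. Whether this is faster than the CSV route on the real data file has not been measured here.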