{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Subject 4: Latency and capacity estimation for a network connection from asymmetric measurements"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%matplotlib inline\n",
    "import matplotlib.pyplot as plt\n",
    "import pandas as pd\n",
    "from os import path\n",
    "import urllib.request as request\n",
    "import gzip as gz\n",
    "import re\n",
    "from collections import OrderedDict\n",
    "import numpy as np\n",
    "from sklearn.linear_model import LinearRegression"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def download_file(url, filename) :\n",
    "    if not path.exists(filename) :\n",
    "        print(\"No local data copy available for \" + filename + \", creating new version\")\n",
    "        request.urlretrieve(url, filename)\n",
    "    else :\n",
    "        print(\"Using local version of \" + filename)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The first dataset consists of ping traces recorded over an on-campus connection.\n",
    "We first check whether a local copy of the data exists; if not, the archive is downloaded."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "liglab2_url = \"http://mescal.imag.fr/membres/arnaud.legrand/teaching/2014/RICM4_EP_ping/liglab2.log.gz\"\n",
    "liglab2_file = \"liglab2.log.gz\"\n",
    "\n",
    "download_file(liglab2_url, liglab2_file)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The second dataset consists of ping traces to stackoverflow. As before, we download the file only if a local copy does not already exist."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "stackoverflow_url = \"http://mescal.imag.fr/membres/arnaud.legrand/teaching/2014/RICM4_EP_ping/stackoverflow.log.gz\"\n",
    "stackoverflow_file = \"stackoverflow.log.gz\"\n",
    "\n",
    "download_file(stackoverflow_url, stackoverflow_file)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The files are logs in which each line has the following syntax:\n",
    "\n",
    "_[1421761682.052172] 665 bytes from lig-publig.imag.fr (129.88.11.7): icmp_seq=1 ttl=60 time=22.5 ms_\n",
    "\n",
    "which follows this pattern:\n",
    "\n",
    "_[`timestamp`] `size` bytes from `url` (`ip`): icmp_seq=`icmp_seq` ttl=`ttl` time=`time` ms_\n",
    "\n",
    "| Variable name  | Description                                                               |\n",
    "|----------------|---------------------------------------------------------------------------|\n",
    "| `timestamp`    | Epoch time, in seconds since 1 January 1970                               |\n",
    "| `size`         | Size of the transmitted packet, in bytes                                  |\n",
    "| `url`          | Host name (DNS resolution) of the server packets are exchanged with       |\n",
    "| `ip`           | IPv4 address of the server packets are exchanged with                     |\n",
    "| `icmp_seq`     | Sequence number of the ICMP packet (**not used in this analysis**)        |\n",
    "| `ttl`          | Time-to-live of the ICMP packet (**not used in this analysis**)           |\n",
    "| `time`         | Round-trip time to the server, in milliseconds                            |\n",
    "\n",
    "Neither `icmp_seq` nor `ttl` is used in this analysis, but both are still extracted so that each parsed record mirrors the full log line."
   ]
  },
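  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a quick illustration of the format described above, the sample line can be parsed with a single regular expression using named groups. This is only a sketch: the pattern and the names `LINE_RE`, `sample` and `fields` are assumptions introduced here, and the rest of the notebook uses its own per-variable expressions instead."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import re\n",
    "\n",
    "# Hypothetical one-pattern parser for a single ping log line (illustration only)\n",
    "LINE_RE = re.compile(\n",
    "    r'\\[(?P<timestamp>\\d+\\.\\d+)\\] (?P<size>\\d+) bytes from (?P<url>\\S+) '\n",
    "    r'\\((?P<ip>[\\d.]+)\\): icmp_seq=(?P<icmp_seq>\\d+) ttl=(?P<ttl>\\d+) '\n",
    "    r'time=(?P<time>\\d+(?:\\.\\d+)?) ms'\n",
    ")\n",
    "\n",
    "sample = '[1421761682.052172] 665 bytes from lig-publig.imag.fr (129.88.11.7): icmp_seq=1 ttl=60 time=22.5 ms'\n",
    "fields = LINE_RE.match(sample).groupdict()  # every value is still a string here\n",
    "fields"
   ]
  },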
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1st analysis: Liglab2"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Prepare the data\n",
    "\n",
    "We start our analysis with the first data file, from Liglab2. Because many of the operations will be the same for the other analyses, we wrap them in functions.\n",
    "\n",
    "The first operation is to decompress the archive and load the file into a list of lines in memory. We print the contents to check the operation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def decompress_archive (archive):\n",
    "    with gz.open(archive, 'rt') as f_in:\n",
    "        content = f_in.read().split(\"\\n\")\n",
    "    return content\n",
    "\n",
    "liglab2_decompressed = decompress_archive(liglab2_file)\n",
    "liglab2_decompressed"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, to analyse the data we must parse its textual form and extract all the variables. To do so we use one regular expression per variable. If a pattern does not match, the corresponding variable is set to `None`. Once again we print the contents to check that everything went correctly."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "reg = [('timestamp',r'\\[([0-9]+\\.[0-9]+)\\]', float),\n",
    "        ('size', r'([1-9][0-9]*) bytes', int),\n",
    "        ('url', r'from ([-a-zA-Z0-9@:%._\\+~#=]{1,256}\\.[a-zA-Z0-9()]{1,6}\\b([-a-zA-Z0-9()@:%_\\+.~#?&\\/=]*)?)', str),\n",
    "        ('ip', r'\\(\\b((25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])\\.(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])\\.(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])\\.(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9]))\\b\\)', str),\n",
    "        ('icmp_seq', r'icmp_seq=([1-9][0-9]*)', int),\n",
    "        ('ttl', r'ttl=([0-9]+)', int),\n",
    "        ('time', r'time=([0-9]+(?:\\.[0-9]+)?) ms', float)]\n",
    "\n",
    "def extract_data (content, reg):\n",
    "    records = []  # avoid shadowing the built-in `list`\n",
    "    for line in content:\n",
    "        values = OrderedDict()\n",
    "        if line:  # skip empty lines\n",
    "            for (name, regex, func) in reg:\n",
    "                obj = re.findall(regex, line)\n",
    "                val = None  # stays None when the pattern does not match\n",
    "                if len(obj) :\n",
    "                    val = obj[0]\n",
    "                    # with several groups, findall returns tuples: keep the outer group\n",
    "                    if isinstance(val, tuple) and len(val) :\n",
    "                        val = val[0]\n",
    "                    val = func(val)\n",
    "                values[name] = val\n",
    "            records.append(values)\n",
    "    return records\n",
    "\n",
    "liglab2_contents = extract_data(liglab2_decompressed, reg)\n",
    "liglab2_contents"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now that we have extracted the data and converted it into string and numerical values, we can create a pandas dataframe for our analysis."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "liglab2_raw_data = pd.DataFrame(liglab2_contents)\n",
    "liglab2_raw_data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Before continuing, we must check whether any incomplete entries were extracted from the data file."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "liglab2_raw_data[liglab2_raw_data.isnull().any(axis=1)]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here we notice several invalid entries in which the `time` variable is empty. Since this value is essential for the analysis, we cannot use these entries. We therefore drop the incomplete lines, as removing them will have little impact on the overall analysis."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "liglab2_data = liglab2_raw_data.dropna().copy()\n",
    "liglab2_data"
   ]
  },
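  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The effect of `dropna()` used above can be seen on a toy dataframe. The three records below are made up for this sketch (they are not taken from the logs), and the names `toy` and `toy_clean` are hypothetical."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "# Toy records (made up): the second one has no measured time\n",
    "toy = pd.DataFrame([\n",
    "    {'size': 665, 'time': 22.5},\n",
    "    {'size': 1373, 'time': None},\n",
    "    {'size': 262, 'time': 21.2},\n",
    "])\n",
    "\n",
    "toy_clean = toy.dropna().copy()  # keep only the rows without missing values\n",
    "toy_clean"
   ]
  },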
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the data easier to read, we add a new column in which the epoch timestamp is converted into a datetime value."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "liglab2_data['date'] = pd.to_datetime(liglab2_data['timestamp'], unit='s')\n",
    "liglab2_data"
   ]
  },
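  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a small check of this conversion, the timestamp of the sample log line shown earlier can be converted on its own. The names `sample_timestamp` and `sample_date` are introduced only for this sketch; the result is a timezone-naive value that pandas interprets as UTC."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "# Timestamp taken from the sample log line shown earlier in the notebook\n",
    "sample_timestamp = 1421761682.052172\n",
    "sample_date = pd.to_datetime(sample_timestamp, unit='s')  # seconds since the epoch\n",
    "sample_date"
   ]
  },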
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The time recorded in the log corresponds to a round trip between client and server: the delay between the first bit leaving the client and the last bit of the reply arriving back at the client. It also includes a processing delay, during which the server receives the whole packet, analyses it, then responds.\n",
    "\n",
    "A round trip can be expressed as `packet delivery time = 2 * transmission time + processing delay`. However, since we do not know the processing delay, for the sake of this analysis we simply assume `packet delivery time = 2 * transmission time`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "liglab2_data['time'] = liglab2_data['time'] / 2\n",
    "liglab2_data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### First view of the data\n",
    "\n",
    "Now that the data is prepared, we can begin by visualising it, in particular the evolution of the transmission time throughout the log. First, however, we index the data by the extracted date and sort it."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "liglab2_date_sorted = liglab2_data.set_index('date').sort_index()\n",
    "liglab2_date_sorted"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can now plot the transmission time as a function of the date."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "liglab2_date_sorted['time'].plot()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can see that the values vary a lot, so we take a closer look at the last 200 entries."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "liglab2_date_sorted['time'][-200:].plot()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can see relatively stable values with intermittent spikes, some of them very high. The first plot also shows one very high spike, so we look more closely at the interval between `15:00` and `15:02`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "liglab2_date_sorted.between_time('15:00', '15:02')['time'].plot()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Once again, we can see a relatively flat trend, with several variations towards both ends of the interval and one very large spike in the centre."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Analysis of size and communication time\n",
    "\n",
    "The previous analysis showed that the transmission time is significantly irregular over the course of the log. We now examine the time as a function of the packet size instead."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "liglab2_size_sorted = liglab2_data.set_index('size').sort_index()\n",
    "liglab2_size_sorted"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we can visualise the new data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "liglab2_size_sorted['time'].plot()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can immediately see that just short of *1500* bytes, there is a significant increase in transmission time. We will look more closely at this interval, between *1450* and *1500*."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "liglab2_size_sorted[(liglab2_size_sorted.index>=1450) & (liglab2_size_sorted.index<=1500)]['time'].plot()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can clearly see that the increase happens at around *1481* bytes. We therefore split the dataset into 2 classes, so that the two regimes with different mean transmission times are analysed separately."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "liglab2_data_class_a = liglab2_size_sorted[(liglab2_size_sorted.index<1481)]\n",
    "liglab2_data_class_a"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "liglab2_data_class_b = liglab2_size_sorted[(liglab2_size_sorted.index>=1481)]\n",
    "liglab2_data_class_b"
   ]
  },
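  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One plausible explanation for a threshold at 1481 bytes, offered here as an assumption rather than something the logs state, is IP fragmentation. If the path uses the standard Ethernet MTU of 1500 bytes and IPv4 headers without options (20 bytes), the largest ICMP message that fits in a single packet is 1480 bytes, and anything larger has to be split. The constants below are assumed values, not measurements."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Assumed values (not taken from the logs): standard Ethernet MTU and an\n",
    "# IPv4 header without options\n",
    "ETHERNET_MTU = 1500   # bytes\n",
    "IPV4_HEADER = 20      # bytes\n",
    "\n",
    "# Largest ICMP message (header + payload) that fits in one unfragmented IP packet\n",
    "max_unfragmented_size = ETHERNET_MTU - IPV4_HEADER\n",
    "# Smallest size that would need a second fragment under these assumptions\n",
    "first_fragmented_size = max_unfragmented_size + 1\n",
    "print(max_unfragmented_size, first_fragmented_size)"
   ]
  },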
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Estimate network time function\n",
    "\n",
    "In networking, the transmission duration can be modelled by the following function: `T(S) = L + S/C`\n",
    "\n",
    "| Variable name  | Description                                         |\n",
    "|----------------|-----------------------------------------------------|\n",
    "| `T`            | Transmission time, in **seconds**                   |\n",
    "| `S`            | Size of packet transmitted, in **bytes**            |\n",
    "| `L`            | Network latency, in **seconds**                     |\n",
    "| `C`            | Network capacity, in **bytes per second**           |\n",
    "\n",
    "We can loosely identify this duration function with a linear function of the form `Y(X) = ax + b`. Using linear regression, we can estimate both the coefficient `a` and the constant `b`.\n",
    "\n",
    "First, we convert and reshape the data corresponding to `X` in the formula, in this case the index, which holds the packet `size`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "liglab2_class_a_X = liglab2_data_class_a.index.values.reshape(-1, 1)\n",
    "liglab2_class_a_X"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Secondly, we convert and reshape the `time` to correspond to the `Y` value."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "liglab2_class_a_Y = liglab2_data_class_a['time'].values.reshape(-1, 1)\n",
    "liglab2_class_a_Y"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We now prepare the regressor and fit it with the training data, before predicting the `Y` values corresponding to the linear function."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "lin_reg = LinearRegression()\n",
    "lin_reg.fit(liglab2_class_a_X, liglab2_class_a_Y)\n",
    "liglab2_class_a_Y_pred = lin_reg.predict(liglab2_class_a_X)\n",
    "liglab2_class_a_Y_pred"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We now create a plot, superimposing the predicted `Y` values on top of the measured `Y` values."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.scatter(liglab2_class_a_X, liglab2_class_a_Y)\n",
    "plt.plot(liglab2_class_a_X, liglab2_class_a_Y_pred, color='red')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The predicted values, in red, correspond to the output of the function `Y(X) = ax + b`. We now want to determine both `a` and `b`.\n",
    "\n",
    "If we identify this function with the network transmission time `T(S) = L + S/C`, we can deduce that `b = L` and `a = 1 / C`.\n",
    "\n",
    "First, we determine the value of `C`, the network capacity."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "liglab2_class_a_C = 1 / lin_reg.coef_[0][0]\n",
    "liglab2_class_a_C"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Since the transmission time is given in *milliseconds* in the logs, the value of `C` here is in `bytes / millisecond`. We convert it into `bytes / second`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "liglab2_class_a_C_s = liglab2_class_a_C * 1000\n",
    "liglab2_class_a_C_s"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can convert it into a more human-readable form."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"C = \" + str(liglab2_class_a_C_s / 1024 / 1024 ) + \" MB/s\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we will determine the value of `L`, the network latency."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "liglab2_class_a_L = lin_reg.intercept_[0]\n",
    "liglab2_class_a_L"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As before, the regression yields the latency in *milliseconds*, whereas `L` is defined in `seconds`, so we convert it."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "liglab2_class_a_L_s = liglab2_class_a_L / 1000\n",
    "liglab2_class_a_L_s"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can therefore determine that for class **a**, the values for `C` and `L` are as follows:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"C = \" + str(liglab2_class_a_C_s) + \" B/s\")\n",
    "print(\"L = \" + str(liglab2_class_a_L_s) + \" s\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A more human-friendly representation is the following:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"C = \" + str(liglab2_class_a_C_s / 1024 / 1024 ) + \" MB/s\")\n",
    "print(\"L = \" + str(liglab2_class_a_L) + \" ms\")"
   ]
  },
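  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The identification `b = L` and `a = 1 / C` used above can be checked on synthetic data where both values are known in advance. This is a minimal sketch under stated assumptions: the latency and capacity below are made up, the data is noise-free, and `np.polyfit` is used instead of `LinearRegression` only to keep the example short. All names in the cell are hypothetical."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "# Synthetic, noise-free data: the latency and capacity below are made up\n",
    "true_L_ms = 1.5                 # latency, in milliseconds (assumed)\n",
    "true_C_bytes_per_ms = 5000.0    # capacity, in bytes per millisecond (assumed)\n",
    "sizes = np.arange(100, 1500, 100, dtype=float)       # packet sizes, in bytes\n",
    "times_ms = true_L_ms + sizes / true_C_bytes_per_ms   # T(S) = L + S/C\n",
    "\n",
    "# Degree-1 least-squares fit: slope a = 1/C, intercept b = L\n",
    "a, b = np.polyfit(sizes, times_ms, 1)\n",
    "est_C_s = (1 / a) * 1000   # bytes/ms -> bytes/s\n",
    "est_L_s = b / 1000         # ms -> s\n",
    "print('C = ' + str(est_C_s) + ' B/s')\n",
    "print('L = ' + str(est_L_s) + ' s')"
   ]
  },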
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}