# Subject4: Latency and capacity estimation for a network connection from asymmetric measurements

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import isoweek
from os import path
import urllib.request as request
import gzip as gz
import shutil
import re
from collections import OrderedDict
import numpy as np
from sklearn.linear_model import LinearRegression

## Data

In [None]:
def download_file(url, filename):
    """Download `url` to `filename` unless a local copy already exists."""
    if not path.exists(filename):
        print("No local data copy available for " + filename + ", creating new version")
        request.urlretrieve(url, filename)
    else:
        print("Using local version of " + filename)

The first dataset to examine contains ping traces from an on-campus connection.
We first check whether a local copy of the data exists. If none is available, the archive is downloaded.

In [None]:
liglab2_url = "http://mescal.imag.fr/membres/arnaud.legrand/teaching/2014/RICM4_EP_ping/liglab2.log.gz"
liglab2_file = "liglab2.log.gz"

download_file(liglab2_url, liglab2_file)

The second dataset to examine contains ping traces from stackoverflow. As before, we download the file only if a local copy does not already exist.

In [None]:
stackoverflow_url = "http://mescal.imag.fr/membres/arnaud.legrand/teaching/2014/RICM4_EP_ping/stackoverflow.log.gz"
stackoverflow_file = "stackoverflow.log.gz"

download_file(stackoverflow_url, stackoverflow_file)

The files are log files with the following syntax:

_[1421761682.052172] 665 bytes from lig-publig.imag.fr (129.88.11.7): icmp_seq=1 ttl=60 time=22.5 ms_

which can be described as follows:

_[`timestamp`] `size` bytes from `url` (`ip`): icmp_seq=`icmp_seq` ttl=`ttl` time=`time` ms_

| Variable name | Description |
|----------------|---------------------------------------------------------------------------|
| `timestamp` | Unix epoch time, in seconds since 1 January 1970 |
| `size` | Size of the transmitted packet, in bytes |
| `url` | Host name of the server with which packets are being exchanged |
| `ip` | IPv4 address of the server with which packets are being exchanged |
| `icmp_seq` | The sequence number of the ICMP packet (**not used in this analysis**) |
| `ttl` | The time-to-live of the ICMP packet (**not used in this analysis**) |
| `time` | The round-trip duration with the server, in milliseconds |

Neither `icmp_seq` nor `ttl` is used in this analysis, but both are extracted from the files to preserve the integrity of each record.
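
As an illustration of the format described above, here is a minimal sketch that parses the sample line with one regular expression using named groups. The names `LINE_PATTERN` and `parse_line` are hypothetical and are not used in the rest of the notebook. The pattern assumes that a line with no measurement simply omits the `time=... ms` field.

```python
import re

# Hypothetical single-pattern parser for one ping log line; the named groups
# mirror the variable names in the table above.
LINE_PATTERN = re.compile(
    r"\[(?P<timestamp>\d+\.\d+)\] "
    r"(?P<size>\d+) bytes from "
    r"(?P<url>\S+) \((?P<ip>[\d.]+)\): "
    r"icmp_seq=(?P<icmp_seq>\d+) ttl=(?P<ttl>\d+)"
    r"(?: time=(?P<time>\d+(?:\.\d+)?) ms)?"
)

def parse_line(line):
    """Return a dict of typed fields, or None if the line does not match."""
    match = LINE_PATTERN.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    return {
        "timestamp": float(fields["timestamp"]),
        "size": int(fields["size"]),
        "url": fields["url"],
        "ip": fields["ip"],
        "icmp_seq": int(fields["icmp_seq"]),
        "ttl": int(fields["ttl"]),
        # The time field is optional in this sketch, so a missing value stays None.
        "time": float(fields["time"]) if fields["time"] is not None else None,
    }

sample = ("[1421761682.052172] 665 bytes from lig-publig.imag.fr "
          "(129.88.11.7): icmp_seq=1 ttl=60 time=22.5 ms")
parsed = parse_line(sample)
print(parsed)
```

A single pattern is stricter than one pattern per field: a line that deviates from the expected layout is rejected as a whole instead of yielding a partially filled record.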

## 1st analysis: Liglab2

### Prepare the data

We start our analysis with the first data file, from Liglab2. Because many of the operations will be the same for the other analysis, we create functions to accomplish the tasks.

The first operation is to decompress the archive and load the file into a list in memory. We print the contents to check the operation.

In [None]:
def decompress_archive(archive):
    """Read a gzip-compressed text file and return its lines as a list."""
    with gz.open(archive, 'rt') as f_in:
        content = f_in.read().split("\n")
    return content

liglab2_decompressed = decompress_archive(liglab2_file)
liglab2_decompressed

Next, to analyse the data we must parse its textual form and extract all the variables. To do so we use one regular expression per variable. If a variable is not found on a line, it is set to `None`. Once again we print the contents to check that all went correctly.

In [None]:
reg = [('timestamp', r'\[([0-9]+\.[0-9]+)\]', float),
       ('size', r'([1-9][0-9]*) bytes', int),
       ('url', r'from ([-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&\/=]*)?)', str),
       ('ip', r'\(\b((25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])\.(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])\.(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])\.(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9]))\b\)', str),
       ('icmp_seq', r'icmp_seq=([1-9][0-9]*)', int),
       # [0-9]+ captures every digit; the former alternation stopped after the first one (6 for ttl=60)
       ('ttl', r'ttl=([0-9]+)', int),
       # Also accepts values below 1 ms, such as 0.5, which the former pattern rejected
       ('time', r'time=([0-9]+(?:\.[0-9]+)?) ms', float)]

def extract_data(content, reg):
    """Parse each non-empty line into an ordered dict of typed values."""
    records = []
    for line in content:
        values = OrderedDict()
        if line:
            for (name, regex, func) in reg:
                obj = re.findall(regex, line)
                val = None
                if len(obj):
                    val = obj[0]
                    # Patterns with several groups yield tuples; keep the outer group
                    if isinstance(val, tuple) and len(val):
                        val = val[0]
                    val = func(val)
                values[name] = val
            records.append(values)
    return records

liglab2_contents = extract_data(liglab2_decompressed, reg)
liglab2_contents

Now that we have extracted the data and converted it into string and numerical values, we can create a pandas dataframe for our analysis.

In [None]:
liglab2_raw_data = pd.DataFrame(liglab2_contents)
liglab2_raw_data

Before continuing, we must verify if any invalid entries have been extracted from the data file.

In [None]:
liglab2_raw_data[liglab2_raw_data.isnull().any(axis=1)]

Here we notice that there are multiple invalid entries, where the `time` variable is empty. Since this value is essential for the analysis, we cannot use these entries. We can therefore drop the incomplete lines, as removing them will have little impact on the overall analysis.

In [None]:
liglab2_data = liglab2_raw_data.dropna().copy()
liglab2_data

To make things easier, we create a new column in which the epoch timestamp is converted into a datetime value, which is easier to read.

In [None]:
liglab2_data['date'] = pd.to_datetime(liglab2_data['timestamp'], unit='s')
liglab2_data

The time specified in the log corresponds to a round trip between client and server. It is therefore the delay between the first bit leaving the client and the last bit of the reply arriving back at the client. However, it also includes a processing delay, during which the server receives the entire packet, analyses it, then responds.

A round trip can be expressed as `packet delivery time = 2 * transmission time + processing delay`. However, since we do not know the processing delay, for the sake of this analysis we simply approximate `packet delivery time = 2 * transmission time`.

In [None]:
liglab2_data['time'] = liglab2_data['time'] / 2
liglab2_data
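
To make the effect of this approximation concrete, the following sketch inverts the round-trip expression for a single measurement. The function `one_way_time_ms` and the 0.5 ms processing delay are hypothetical, chosen only for illustration. The real processing delay is unknown.

```python
def one_way_time_ms(round_trip_ms, processing_delay_ms=0.0):
    """One-way transmission time implied by
    round trip = 2 * transmission time + processing delay."""
    return (round_trip_ms - processing_delay_ms) / 2

# With the processing delay ignored, the 22.5 ms sample gives 11.25 ms.
approx = one_way_time_ms(22.5)
# With a hypothetical 0.5 ms processing delay, the estimate drops to 11.0 ms.
exact = one_way_time_ms(22.5, processing_delay_ms=0.5)
print(approx, exact, approx - exact)
```

Under this model, ignoring the processing delay overestimates each transmission time by half of that delay.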

### First view of the data

Now that the data is prepared, we can begin by visualising it, in particular the evolution of the transmission time throughout the log. First, however, we set the extracted date as the index.

In [None]:
liglab2_date_sorted = liglab2_data.set_index('date').sort_index()
liglab2_date_sorted

We can now plot the transmission time as a function of the date.

In [None]:
liglab2_date_sorted['time'].plot()

We can see that the values vary a lot, so we take a closer look at the last 200 entries.

In [None]:
liglab2_date_sorted['time'][-200:].plot()

We can see relatively stable values with intermittent spikes, some of them very high. The first plot also shows one very high spike, so we look more closely at the interval between `15:00` and `15:02`.

In [None]:
liglab2_date_sorted.between_time('15:00', '15:02')['time'].plot()

Once again, we can see a relatively level trend with several variations at both ends of the interval, and one very large spike in the centre.

### Analysis of size and communication time

The previous analysis indicated that the transmission time in the log is significantly irregular. We now examine the time again, this time as a function of the packet size.

In [None]:
liglab2_size_sorted = liglab2_data.set_index('size').sort_index()
liglab2_size_sorted

Now we can visualise the new data.

In [None]:
liglab2_size_sorted['time'].plot()

We can immediately see that just short of *1500*, there is a significant increase in transmission time. We look more closely at this interval, between *1450* and *1500*.

In [None]:
liglab2_size_sorted[(liglab2_size_sorted.index>=1450) & (liglab2_size_sorted.index<=1500)]['time'].plot()

We can clearly see that the increase happens at around *1481*. We can therefore split the dataset into two classes, one on each side of this threshold, since their mean transmission times differ.

In [None]:
liglab2_data_class_a = liglab2_size_sorted[(liglab2_size_sorted.index<1481)]
liglab2_data_class_a

In [None]:
liglab2_data_class_b = liglab2_size_sorted[(liglab2_size_sorted.index>=1481)]
liglab2_data_class_b
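
The threshold above was read off the plot. As a cross-check, one possible sketch locates it programmatically by taking the median time for each packet size and looking for the largest increase between consecutive sizes. This is an alternative technique and is not part of the original analysis. The function name `find_largest_jump` is hypothetical, and the data below is synthetic, not taken from the log.

```python
from collections import defaultdict
from statistics import median

def find_largest_jump(sizes, times):
    """Return the size at which the per-size median time jumps the most.

    Requires at least two distinct sizes.
    """
    by_size = defaultdict(list)
    for size, time in zip(sizes, times):
        by_size[size].append(time)
    ordered = sorted(by_size)
    medians = [median(by_size[s]) for s in ordered]
    # Increase of the median between each pair of consecutive sizes.
    jumps = [medians[i + 1] - medians[i] for i in range(len(medians) - 1)]
    best = max(range(len(jumps)), key=jumps.__getitem__)
    return ordered[best + 1]

# Synthetic data (not the real log): flat 1.0 ms up to 1480, then 2.0 ms,
# with one outlier spike that the median is meant to absorb.
sizes = [s for s in range(1470, 1491) for _ in range(3)]
times = [1.0 if s < 1481 else 2.0 for s in sizes]
times[0] = 50.0  # a single spike at size 1470
threshold = find_largest_jump(sizes, times)
print(threshold)
```

The median is used instead of the mean so that isolated spikes, like those seen in the time plots, do not dominate the comparison.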

### Estimate network time function

In networking, the transmission duration can be modelled by the following function: `T(S) = L + S/C`

| Variable name | Description |
|----------------|-----------------------------------------------------|
| `T` | Transmission time, in **seconds** |
| `S` | Size of packet transmitted, in **bytes** |
| `L` | Network latency, in **seconds** |
| `C` | Network capacity, in **bytes per second** |

We can loosely treat the duration function as a linear function of the form `Y(X) = ax + b`. Using linear regression, we can estimate both the coefficient `a` and the constant `b`.

Firstly, we convert and reshape the data corresponding to `X` in the formula, in this case the index corresponding to the packet `size`.

In [None]:
liglab2_class_a_X = liglab2_data_class_a.index.values.reshape(-1, 1)
liglab2_class_a_X

Secondly, we convert and reshape the `time` to correspond to the `Y` value.

In [None]:
liglab2_class_a_Y = liglab2_data_class_a['time'].values.reshape(-1, 1)
liglab2_class_a_Y

We now prepare the regressor and fit it with the training data, before predicting the `Y` values corresponding to the linear function.

In [None]:
lin_reg = LinearRegression()
lin_reg.fit(liglab2_class_a_X, liglab2_class_a_Y)
liglab2_class_a_Y_pred = lin_reg.predict(liglab2_class_a_X)
liglab2_class_a_Y_pred

We now create a plot, superimposing the predicted `Y` values on top of the real `Y` values.

In [None]:
plt.scatter(liglab2_class_a_X, liglab2_class_a_Y)
plt.plot(liglab2_class_a_X, liglab2_class_a_Y_pred, color='red')
plt.show()

We can see the predicted values in red corresponding to the output of the function `Y(X) = ax + b`. Now however, we want to determine both `a` and `b`.

If we match this function with the network transmission time function `T(S) = L + S/C`, we can deduce that `b = L` and `a = 1 / C`.

Firstly, we will determine the value of `C`, the network capacity.

In [None]:
liglab2_class_a_C = 1 / lin_reg.coef_[0][0]
liglab2_class_a_C

Since the transmission time is specified in *milliseconds* in the logs, the value of `C` here is in `bytes / millisecond`. We convert it into `bytes / second`.

In [None]:
liglab2_class_a_C_s = liglab2_class_a_C * 1000
liglab2_class_a_C_s

We can convert it into a more human-readable form.

In [None]:
print("C = " + str(liglab2_class_a_C_s / 1024 / 1024 ) + " MB/s")
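
Strictly speaking, dividing by 1024 twice yields mebibytes (MiB, 2^20 bytes), not decimal megabytes (MB, 10^6 bytes). The sketch below prints both so the difference is visible. The helper `format_rate` and the input value are hypothetical and only illustrate the formatting.

```python
def format_rate(bytes_per_second):
    """Return the rate in both decimal (MB/s) and binary (MiB/s) units."""
    mb = bytes_per_second / 1_000_000        # 1 MB  = 10**6 bytes
    mib = bytes_per_second / (1024 * 1024)   # 1 MiB = 2**20 bytes
    return f"{mb:.2f} MB/s ({mib:.2f} MiB/s)"

# Hypothetical capacity value, used only to show the formatting.
print(format_rate(10_485_760))
```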

Now we will determine the value of `L`, the network latency.

In [None]:
liglab2_class_a_L = lin_reg.intercept_[0]
liglab2_class_a_L

As before, the latency obtained here is in *milliseconds*, whereas `L` is defined in *seconds*, so we convert it.


In [None]:
liglab2_class_a_L_s = liglab2_class_a_L / 1000
liglab2_class_a_L_s

We can therefore determine that for class **a**, the values for `C` and `L` are as follows:

In [None]:
print("C = " + str(liglab2_class_a_C_s) + " B/s")
print("L = " + str(liglab2_class_a_L_s) + " s")

A more human friendly representation is the following:

In [None]:
print("C = " + str(liglab2_class_a_C_s / 1024 / 1024 ) + " MB/s")
print("L = " + str(liglab2_class_a_L) + " ms")
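
As a sanity check of the conversion steps above, the following sketch fits a line to synthetic, noise-free data generated from known values of `L` and `C`, then recovers them from the slope and intercept. It uses a least-squares fit written out by hand instead of `LinearRegression`, so that it has no dependencies. The function `fit_line` and the ground-truth values are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, mean_y - a * mean_x

# Hypothetical ground truth: L = 0.5 ms, C = 10,000 bytes per millisecond.
true_L_ms, true_C_bytes_per_ms = 0.5, 10_000.0
sizes = list(range(100, 1500, 100))
times_ms = [true_L_ms + s / true_C_bytes_per_ms for s in sizes]

a, b = fit_line(sizes, times_ms)
C_bytes_per_s = (1 / a) * 1000   # slope is ms per byte: invert, then ms -> s
L_s = b / 1000                   # intercept is in ms: convert to seconds
print(C_bytes_per_s, L_s)
```

On this synthetic input the recovered values should match the ground truth up to floating-point error, that is about 10^7 bytes per second and 0.0005 seconds. The real log is noisy, so its estimates are far less exact.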