---
title: "Notebook for the MOOC Reproducible Research, Module 3, Peer-reviewed Exercise, Subject 4"
subtitle: "Latency and capacity estimation for a network connection from asymmetric measurements"
author: "Killian Martin"
date: "5 May 2020"
output:
  pdf_document: default
  html_document:
    df_print: paged
  html_notebook: default
## note to self: complete the liglab2 estimation, finish the stackoverflow analysis, and try the quantile regression
---

# Preamble
Following [Subject 4 in the MOOC page](https://www.fun-mooc.fr/courses/course-v1:inria+41016+self-paced/courseware/5b932aa591d245d48d8943385cb3120a/57c96f2c7f7b42018eaac3e6b34546f4/), I attempt to estimate the performance of a network connection from data on the time $T$ (in seconds) required to send a message, as a function of its size $S$ (in bytes). More specifically, I will fit a simple linear model in which the relation between $T$ and $S$ is $$T(S) = L + \frac{S}{C}$$ where:

- $L$ is the connection's *latency*, in seconds
- $C$ is the connection's *capacity*, in bytes/second

The task is to estimate $L$ and $C$ from measurements of $T$ as a function of $S$, acquired with the `ping` command. As a purely illustrative example of the model, with $L = 10^{-3}$ s and $C = 10^{6}$ bytes/s, a 1500-byte message would take about $10^{-3} + 1500/10^{6} = 2.5 \times 10^{-3}$ s.
Each data point is originally a single row of the dataset (see the Data input section below for examples).
The data for this exercise comes from [A. Legrand's website](http://polaris.imag.fr/arnaud.legrand/teaching/2014/RICM4_EP.php). The exact download links are stored in the first code block of the Data input section below.

# Setting up the environment
## Packages
I first load all the packages I will use throughout the document:
```{r, message = F}
library(tidyverse)
library(magrittr)
library(broom)
```
The tidyverse is used for data manipulation with `dplyr` and `purrr`, and for plotting with `ggplot2`. `broom` and `magrittr` are mostly for single-use convenience, providing, respectively, concise model summaries (`tidy` and `glance`) and the `%<>%` assignment pipe (which pipes an object through an expression and assigns the result back to it).

## Working directory
No code is necessary, as the document by default looks for files in its own directory (this is also something I had not realised before, so I am writing it down).

## Details of the compilation environment
This notebook was computed in the following environment:
```{r}
sessionInfo()
```

# Data input
## Datasets used in the task
Two different datasets are provided for the task, `liglab2` and `stackoverflow`. Each dataset contains the data outlined in the preamble. `liglab2` contains data for a short-range on-campus connection, while `stackoverflow` contains data for a long-range Web-based connection.
Both datasets are hosted online at the URLs stored below.

NOTE: to get the script to work, I removed the `.gz` extensions present in the links on the main MOOC page; otherwise I get an error message when trying to read the URLs. This does not matter if the files are already present locally, though.
```{r}
liglab2_url <- "http://mescal.imag.fr/membres/arnaud.legrand/teaching/2014/RICM4_EP_ping/liglab2.log"
stackoverflow_url <- "http://mescal.imag.fr/membres/arnaud.legrand/teaching/2014/RICM4_EP_ping/stackoverflow.log"
```

## Loading the data
### Import function
For ease of use, the function `download` below handles everything.
The user merely needs to provide the URL and the name under which to save the data locally, via the `url` and `filename` arguments.
The remaining arguments are optional and are there for convenience: `path` lets the user specify a directory to use instead of the default one, `overwrite` allows overwriting the data even if a file already exists at the saving location (set it to `TRUE` to overwrite, keep `FALSE` to avoid overwriting), and any further arguments are passed on to `read.delim` through `...` (for instance `fileEncoding = "UTF-8"` to specify the file encoding).
The function reads the data at the provided URL using the base R `read.delim` function, then writes it to the desired location with `write.table`.

```{r}
download <- function(url, filename, path = NULL, overwrite = F, ...){
  # build the full path of the local copy
  if (!is.null(path)){
    full_name = file.path(path, filename)
  } else {
    full_name = filename
  }

  # only (re)download if there is no local copy yet, or if overwriting is requested
  if (!file.exists(full_name) || overwrite){
    tab <- read.delim(file = url, header = FALSE, ...)
    write.table(x = tab, file = full_name)
  } else {
    print(sprintf("File %s already found at provided location and overwrite = FALSE.", full_name))
    print("If you wish to overwrite the data anyway, set overwrite to TRUE")
  }
}
```

### Downloading the data from the hosting webpages
I can then apply `download` to both of the URLs:
```{r}
# set overwrite to TRUE if updating the logs is desired
download(url = liglab2_url, filename = "liglab2.log", overwrite = F)
download(url = stackoverflow_url, filename = "stackoverflow.log", overwrite = F)
```

### Importing the data
I finally import the datasets from the local files (whether or not I re-downloaded them above), using the `read.table` function. To keep character strings as character strings, I set the `stringsAsFactors` argument to `FALSE`.
```{r}
liglab2_original <- read.table(file = "liglab2.log", stringsAsFactors = FALSE)
stackoverflow_original <- read.table(file = "stackoverflow.log", stringsAsFactors = FALSE)

head(liglab2_original)
head(stackoverflow_original)
```

### Formatting the datasets
We can then parse the data to extract the information of interest, that is:

- the moment of measurement `M`, between brackets at the beginning of each row, in seconds since January 1st, 1970;
- the size in bytes `S` of the message sent ("XXX bytes");
- the time to send the message `T` ("time=XXX ms") at the end.
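To make the extraction patterns used below concrete, here is a minimal sketch on a single, made-up line mimicking the general `ping` output format (it is not taken from the actual logs, whose lines may differ in detail):
```{r}
# hypothetical example line, for illustration only
example_line <- "[1400000000.123456] 665 bytes from some.host.example: icmp_seq=1 ttl=60 time=2.05 ms"

str_extract(example_line, pattern = "(?<=\\[).+(?=\\])") # timestamp between the brackets -> M
str_extract(example_line, pattern = "[0-9]+ bytes")      # message size -> S
str_extract(example_line, pattern = "time=[:alnum:]+")   # round-trip time -> T
```
Note that `"time=[:alnum:]+"` stops at the first non-alphanumeric character, so on this made-up line it returns `"time=2"` rather than `"time=2.05"`; a pattern such as `"time=[0-9.]+"` would keep the fractional part, if the real log lines contain one.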
```{r}
reformat <- function(tab){
  transmute(tab,
            M = str_extract(string = V1, pattern = "(?<=\\[).+(?=\\])"),
            # the pattern matches the timestamp of interest: the lookarounds anchor it
            # to the surrounding brackets without including the brackets in the result
            S = ifelse(str_detect(string = V1, pattern = "bytes"),
                       str_extract(string = V1, pattern = "[0-9]+ bytes"),
                       NA) %>%
              str_remove(pattern = " bytes"), # keep only the number
            T = ifelse(str_detect(string = V1, pattern = "time="),
                       str_extract(string = V1, pattern = "time=[:alnum:]+"),
                       NA) %>%
              str_remove(pattern = "time=")) %>% # keep only the time
    type.convert() # convert the character strings to numeric doubles
  # could have used type_convert for more general purposes, but in this case it removes numbers after the decimal mark
}
```

```{r}
liglab2_original <- reformat(liglab2_original)
stackoverflow_original <- reformat(stackoverflow_original)

head(liglab2_original)
head(stackoverflow_original)
print(length(unique(liglab2_original$M)) == nrow(liglab2_original)) # sanity check: every row still has a distinct timestamp
```
Do note that the display rounds the timestamps to their closest integer value, but the true values are maintained:
```{r}
liglab2_original$M[1] - round(liglab2_original$M[1])
```

Now that we have the data we are interested in (and in the correct formats, thanks to our function), we are ready to proceed.
From now on, I first analyse the `liglab2` dataset, and then the `stackoverflow` dataset.


## Analysis on the `liglab2` dataset
### Missing data
First I check the proportion of NAs in the `liglab2_original` dataset:
```{r}
NAs_liglab2 <- liglab2_original %>%
  filter(is.na(S) | is.na(T))

print(sprintf("Total number of rows with NA: %i", nrow(NAs_liglab2)))
print(sprintf("Percentage of rows with NA: %f", nrow(NAs_liglab2)/nrow(liglab2_original) * 100))
```
Given that less than 0.1% of rows contain NAs, I would feel quite comfortable simply removing them; however, doing so could introduce bias if the values are not Missing At Random, which I can check graphically:

```{r}
liglab2_original %>%
  ggplot(aes(x = S)) +
  geom_histogram() +
  facet_wrap(~is.na(T), scales = "free_x")
```

We can see that, indeed, $T$ seems to be missing only for very low values of $S$.
We will therefore have to be a bit careful when interpreting our results, but there is not much else we can actually do.
However, given that we can probably expect messages in a real use case to exceed 30 bytes most of the time, it should not be too much of a problem.

```{r}
liglab2 <- liglab2_original %>%
  filter(!is.na(T)) # remove all missing values
```

### Graphical exploration

To get a better idea of our data, let's draw a scatterplot of the relation $T = f(S)$, using `ggplot2`.
To get an idea of possible effects of time, I map $M$ to the colour of the points.
```{r}
ggplot(liglab2, aes(x = S, y = T, colour = M)) +
  geom_point()
```

The relation looks highly non-linear, with a threshold of $S$ (hereafter $S_{threshold}$) a bit under 1500 bytes. This threshold separates two regimes of $T$: one with low variance, where $S < S_{threshold}$, and one with high variance, where $S > S_{threshold}$.
Moreover, the time of measurement $M$ seems to have no effect on $T$.
For a more direct visualisation of this, let's plot $T$ directly as a function of $M$:
```{r}
ggplot(liglab2, aes(x = M, y = T)) +
  geom_point()
```
Indeed, there is no evident relation between $T$ and $M$.
We can therefore focus on estimating the value of $S_{threshold}$.

### Graphical estimation of $S_{threshold}$
First I take a zoomed-in version of the graph above, to see if we can separate the two regimes by eye alone:
```{r}
liglab2 %>%
  filter(between(S, 1450, 1500)) %>% # take only values of S between 1450 and 1500
  ggplot(aes(x = S, y = T)) +
  geom_point()
```

We can see here that the separation between the two regimes of $S$ lies between 1480 and 1481 bytes.
We therefore set $S_{threshold} = 1480.5$ for simplicity (only integer values of $S$ are possible, as a half-byte message sounds difficult to encode, by definition), and store it in a variable for later:
```{r}
S_threshold <- 1480.5
```


### Estimation of $L$ and $C$ by regime of $S$ via linear regression (`lm`)
As a reminder, we have the following linear relationship between $T$ and $S$: $$T(S) = L + \frac{S}{C}$$
We can split the dataset into the values of $S$ below and above $S_{threshold}$, which we can do with `dplyr`:

```{r}
liglab2 <- liglab2 %>%
  mutate(`S>S_threshold` = S > S_threshold) %>% # check if a given row has S above or below S_threshold
  group_by(`S>S_threshold`) %>% # make two groups, corresponding to S being below or above S_threshold
  nest()
head(liglab2)
```

Here the FALSE row corresponds to the sub-dataset with values of $S$ below $S_{threshold}$, and the TRUE row to the one with values of $S$ above $S_{threshold}$.
We can then fit two linear models (defined by the same equation) on those sub-datasets, again using `dplyr`:

```{r}
liglab2 <- liglab2 %>%
  mutate(models = map(.x = data, .f = ~ lm(data = .x, T ~ S)))
liglab2
```

Finally, we can extract the model coefficients as well as some statistics on them using the `broom` package:

```{r}
liglab2 %<>%
  mutate(model_stats = map(.x = models, .f = glance), # glance: gives a few model-wide statistics like R-squared, AIC, BIC...
         model_coefs = map(.x = models, .f = tidy)) # tidy: gives the model coefficients, their standard errors and the results of a t-test of their significance
```

We can check these statistics:
```{r}
liglab2 %>%
  select(c("S>S_threshold", "model_stats")) %>%
  unnest(c(model_stats)) # extract the model_stats column for inspection
```

We can see that the $R^2$ values are extremely low for both models, which suggests that our linear model is a poor fit to the data, and in fact that linear modelling is probably not the best method to apply to our question.

Indeed, as we saw on the graph above, the variability of $T$ is quite high for a given $S$, which is reflected in the coefficients of our models:

```{r}
liglab2_coefs <- liglab2 %>%
  select(c("S>S_threshold", "model_coefs")) %>%
  unnest(c(model_coefs))
liglab2_coefs
```
We don't quite have our coefficients yet: while the $(Intercept)$ term is exactly $L$, the $S$ term is actually $\frac{1}{C}$. However, we can already see that while $L$ is significantly different from 0 in both regimes of $S$, $\frac{1}{C}$ is not, especially in the regime below $S_{threshold}$; in the regime above $S_{threshold}$ it comes closer to a trend.

We could interpret this as meaning that there is always a flat duration necessary for sending a message at all, but that, *on average*, this duration hardly increases as the message becomes larger. However, given the $R^2$ values, this interpretation should obviously be taken with a pinch of salt.

Nevertheless, in both cases the value of $\frac{1}{C}$ is extremely low, which makes an estimation of $C$ difficult at best (small variations of $\frac{1}{C}$ lead to large variations of $C$ once we take the inverse).
However, we can simply apply a transformation on these rows (and also relabel the terms so we have the correct notation):
```{r}
liglab2_coefs %>%
  mutate(estimate = ifelse(term == "S", 1/estimate, estimate),
         term = ifelse(term == "S", "C", "L"))
```
Indeed, the values of $C$ are extremely high, but they are also probably unreliable for the reasons cited above.
However, this illustrates the fact that the relation between $T$ and $S$ is *not* as simple as we assumed above, and in particular that both $L$ and $C$ depend on $S$, such that a better relation would be:
$$T = L(S) + \frac{S}{C(S)}$$
where the estimation of $L$ and $C$ would be more complex than what we have done.
Finally, note that the `std.error` column is still on the model scale (i.e. it applies to $\frac{1}{C}$, not $C$), largely because, given how poor a fit the model is, I would not pursue this analysis further in a real case.
One could however use the standard errors, or `confint` on the fitted models, to obtain confidence intervals; a sketch is given at the end of this section.

### Plotting the relationships
Despite our models not being very good at explaining the relation between $T$ and $S$, we can still plot the relationship between the two, separated by regime of $S$. We take the original plot above and add two `geom_smooth` layers, one for each of the sub-datasets below and above $S_{threshold}$, using `method = "lm"` to draw the fitted lines along with their 95% confidence intervals:
```{r}
# unnest to be able to access the values of S and T, as otherwise they are stored in a list and not in a data frame

unnest(liglab2, cols = c(data)) %>%
  ggplot(aes(x = S, y = T)) +
  geom_point() +
  geom_smooth(data = liglab2$data[[1]], method = "lm", size = .1) +
  geom_smooth(data = liglab2$data[[2]], method = "lm", size = .1)
```
The confidence intervals are there (as can be seen by playing with the `size` argument in the `geom_smooth` calls), but they are tiny. This could be because we have many points in our data.
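As mentioned above, here is a minimal sketch (an addition of mine, not part of the original analysis) of how `confint` could be applied to the below-threshold model and the resulting bounds transformed to the $(L, C)$ scale. It is left unevaluated, since the fits are too poor for these intervals to mean much:
```{r, eval=FALSE}
# sketch only: 95% confidence intervals for the below-threshold model
# (models[[1]] is the first row of the nested table shown above)
ci <- confint(liglab2$models[[1]], level = 0.95)
ci["(Intercept)", ] # interval for L, already on the right scale
1 / rev(ci["S", ])  # rough interval for C: invert the bounds of 1/C and swap them
# NB: if the interval for the S term includes 0, inverting its bounds is not
# meaningful (the implied interval for C is unbounded)
```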
## Same analysis with the `stackoverflow` dataset
Since I will largely follow the same initial workflow as above, I will merely repeat the titles and code blocks, and not the reasoning behind them.

### Missing data
```{r}
NAs_so <- stackoverflow_original %>%
  filter(is.na(T))

print(sprintf("Total number of rows with NA: %i", nrow(NAs_so)))
print(sprintf("Percentage of rows with NA: %f", nrow(NAs_so)/nrow(stackoverflow_original) * 100))
```

```{r}
stackoverflow_original %>%
  ggplot(aes(x = S)) +
  geom_histogram() +
  facet_wrap(~is.na(T), scales = "free_x")
```
As with the `liglab2` dataset, the NAs are not Missing At Random, but given that they make up roughly 1% of the data, I drop them for the analysis, for the same reason as above (namely, that real messages are more likely to be larger than the sizes for which $T$ is missing in the dataset).
```{r}
stackoverflow <- stackoverflow_original %>%
  filter(!(is.na(S) | is.na(T)))
```

### Graphical exploration
```{r}
ggplot(stackoverflow, aes(x = S, y = T, colour = M)) +
  geom_point()
```
Again, $M$ seems to have little effect on $T$ (i.e. we have temporal stability of $T$):
```{r}
ggplot(stackoverflow, aes(x = M, y = T)) +
  geom_point()
```

Moreover, it looks like we have the same threshold of $S$ as for the `liglab2` dataset.
Let's check it directly with the same plot as above, around $S = 1480$:
```{r}
stackoverflow %>%
  filter(between(S, 1475, 1485)) %>%
  ggplot(aes(x = S, y = log(T))) +
  geom_point()
```
Here the separation is a little less clear than for the previous dataset, but the threshold still appears likely to be the same.
Therefore, we'll keep the same value of $S_{threshold}$.
For comparison, one could repeat the analysis with $S_{threshold} = 1481.5$ (a sketch is given at the end of this section), but it is unlikely to change much.

### Estimating $L$ and $C$
The pipeline is a bit more involved here, but only because I'm condensing all the steps I did for `liglab2` into one code block.
```{r}
stackoverflow <- stackoverflow %>%
  mutate(`S > S_threshold` = S > S_threshold) %>%
  group_by(`S > S_threshold`) %>%
  nest() %>%
  mutate(models = map(.x = data, .f = ~ lm(data = .x, T ~ S)),
         coefs = map(.x = models, .f = broom::tidy),
         stats = map(.x = models, .f = broom::glance))
```

Let's check the model statistics:

```{r}
stackoverflow %>%
  select(c("S > S_threshold", "stats")) %>%
  unnest(c(stats))
```

Largely the same observation as before: the $R^2$ values are extremely low, so the models are poor fits to the data.
Now for the coefficients:

```{r}
stackoverflow %>%
  select(c("S > S_threshold", "coefs")) %>%
  unnest(c(coefs)) %>%
  mutate(estimate = ifelse(term == "S", 1/estimate, estimate),
         term = ifelse(term == "S", "C", "L"))
```
The situation is the same as for `liglab2`, but even more pronounced: there is no detectable effect of $C$ on the value of $T$, as only $L$ differs significantly from 0.

Interestingly, we find a *negative* value of $C$ for large messages, implying (statistical interpretation notwithstanding) that large messages are sent faster the larger they are, which would be quite a good thing if it were real... Unfortunately, it is most probably an artefact of the instability of the estimation of $C$.

### Plotting the relationships
```{r}
ggplot(unnest(stackoverflow, cols = data), aes(x = S, y = T)) +
  geom_point() +
  geom_smooth(data = stackoverflow$data[[1]], method = "lm", size = .1) +
  geom_smooth(data = stackoverflow$data[[2]], method = "lm", size = .1)
```
Once again, the confidence intervals are shown here but are very narrow, although they are also likely to make little sense, given the previous observations on the models...
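As mentioned in the graphical exploration above, here is a sketch (not evaluated, and not part of the original analysis) of what repeating the same pipeline with the alternative threshold $S_{threshold} = 1481.5$ would look like:
```{r, eval=FALSE}
# sketch only: same pipeline as above, with the alternative threshold of 1481.5 bytes
stackoverflow_alt <- stackoverflow_original %>%
  filter(!(is.na(S) | is.na(T))) %>%
  mutate(`S > S_threshold` = S > 1481.5) %>%
  group_by(`S > S_threshold`) %>%
  nest() %>%
  mutate(models = map(.x = data, .f = ~ lm(T ~ S, data = .x)),
         coefs = map(.x = models, .f = broom::tidy))

stackoverflow_alt %>%
  select(c("S > S_threshold", "coefs")) %>%
  unnest(c(coefs))
```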
## Going further: Quantile regression
### Estimating the minimal values of $T$ as a function of $S$
As we saw, estimating the average of $T$ as a function of $S$ made little statistical sense, in large part due to the variability of $T$. However, the lower quantiles of $T$ are a little less variable (as can be seen in the graph just above). We might then wonder whether, instead of the *average* value of $T$ as in `lm`, we might not have better luck trying to predict the *minimum* value of $T$. Moreover, the lowest values of $T$ do not seem to follow the different regimes of $S$.

One way to accomplish this, suggested on the MOOC page, is to filter the smallest values of $T$ for each message size $S$.

First, I want to get access to the original data again. We can go back to the original formatted data and remove the missing values:
```{r}
liglab2 <- liglab2_original %>%
  filter(!is.na(T))
stackoverflow <- stackoverflow_original %>%
  filter(!is.na(T))
```

Given that we're going to do the exact same things on both datasets, I then combine them into a single data frame, adding a column to record the origin of each data point:
```{r}
dataset_full <- bind_rows(liglab2 = liglab2, # naming the inputs to bind_rows lets it record the origin of each row in the .id column
                          stackoverflow = stackoverflow,
                          .id = "dataset") # this only gives a name to the .id column
head(dataset_full)
```
We can go ahead and add the `S > S_threshold` column as well, so we can separate the two regimes of $S$ later:

```{r}
dataset_full <- dataset_full %>%
  mutate(`S > S_threshold` = S > S_threshold)
```


Next, we'll use `group_by` and `filter` to keep, for each value of $S$, only the rows where $T$ is at its minimum:
```{r}
min_T_per_S <- function(df){
  df %>%
    group_by(dataset, S) %>%
    filter(T == min(T))
}
dataset_min <- min_T_per_S(dataset_full)
```
Interestingly, we can see that this actually keeps a large part of the original data anyway:

```{r}
nrow(dataset_min[dataset_min$dataset == "liglab2", ])/nrow(dataset_full[dataset_full$dataset == "liglab2", ])
nrow(dataset_min[dataset_min$dataset == "stackoverflow", ])/nrow(dataset_full[dataset_full$dataset == "stackoverflow", ])
```

### Working only with the minima of $T$ over $S$

We can plot the data again to get an idea of what we have to work with:
```{r}
ggplot(dataset_min, aes(x = S, y = T)) +
  geom_point() +
  facet_wrap(~dataset, scales = "free_y")
```
Interestingly, while this suppressed the variability of $T$ completely for `liglab2` (aside from the regimes of $S$, which are even more blatant here), some variability remains for the `stackoverflow` dataset.

For the `liglab2` dataset, the model could then be summed up by a simple enough equation:
$$T = \begin{cases} 1 \quad \text{if } S < S_{threshold} \\\\ 2 \quad \text{if } S > S_{threshold} \end{cases}$$
This makes it of little interest to actually model the relation, meaning we lost quite a bit of information by restricting the data to the minimum values of $T$ for each value of $S$. We could of course still fit the model, but it would make little theoretical sense. A quick summary of how degenerate these minima are is given below, and then we move on.
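As a quick sanity check on the step-function description above (an addition of mine, not part of the original analysis), we can summarise the values of $T$ that remain in each regime of the `liglab2` data once only the per-size minima are kept:
```{r}
# how many distinct values of T remain per regime of S, after keeping only the
# per-size minima of the liglab2 data?
dataset_min %>%
  filter(dataset == "liglab2") %>%
  group_by(`S > S_threshold`) %>%
  summarise(min_T = min(T), max_T = max(T), n_distinct_T = n_distinct(T))
```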
However, for `stackoverflow` there is still some variability left, so we could try making a model to estimate $L$ and $C$ like we did before:
```{r}
so_min <- dataset_min %>%
  filter(dataset == "stackoverflow") %>% # remove the liglab2 data
  group_by(`S > S_threshold`) %>%
  nest() %>%
  mutate(models = map(.x = data, .f = ~ lm(T ~ S, data = .x)),
         coefs = map(.x = models, .f = tidy),
         stats = map(.x = models, .f = glance))
```
Let's inspect the stats:
```{r}
so_min %>%
  select(c("S > S_threshold", "stats")) %>%
  unnest(c(stats))
```
It looks like we have the same issue as before with the $R^2$, so taking the minimal values of $T$ probably didn't help much in getting a good model.
Let's see the coefficients of the models:

```{r}
so_min %>%
  select(c("S > S_threshold", "coefs")) %>%
  unnest(c(coefs)) %>%
  mutate(estimate = ifelse(term == "S", 1/estimate, estimate),
         term = ifelse(term == "S", "C", "L"))
```
Ah, unlike earlier, we do have a significant effect of $C$ for $S$ below $S_{threshold}$, but not above it.
Let's see what this looks like in a plot:
```{r}
ggplot(unnest(so_min, cols = data), aes(x = S, y = T)) +
  geom_point() +
  geom_smooth(data = so_min$data[[1]], method = "lm", size = 1) +
  geom_smooth(data = so_min$data[[2]], method = "lm", size = 1)
```
Not really the most informative of graphs either... The regressions are barely more than flat lines, as above.
It looks like just taking the minimal values of $T$ doesn't solve our problem, so let's try something else, a little more general: quantile regression.


### Quantile regression with `quantreg`
Instead of estimating the average of $T$ as a function of $S$, or artificially filtering the values of $T$ over $S$ as we have done above, we can try to estimate *quantiles* of $T$ as a function of $S$, using quantile regression.
In R, this is handled by the `quantreg` package:
```{r}
library(quantreg)
```

We then apply the `rq` function to do quantile regression on both datasets.
To spice things up, instead of fitting a single model, let's treat `tau`, the parameter of `rq` that determines the quantile of $T$ to predict, as a hyperparameter, and see how it changes the resulting models.
The models are extremely unlikely to change below roughly $\tau = 0.65$ (given that 84% and 65% of the data in the `liglab2` and `stackoverflow` datasets, respectively, are equal to the minimum value of $T$), but we'll check anyway, because it will change the distribution of $S$ ever so slightly.
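Before sweeping over a grid of `tau` values, here is what a single call to `rq` looks like, for instance for the 10th percentile of $T$ on the below-threshold part of the `liglab2` data (an illustration I added, not required for what follows):
```{r}
# single quantile regression: 10th percentile of T as a function of S,
# on the below-threshold part of the liglab2 data (illustration only)
rq_example <- rq(T ~ S, tau = 0.1,
                 data = filter(dataset_full, dataset == "liglab2", !`S > S_threshold`))
coef(rq_example) # as with lm, the S coefficient is 1/C and the intercept is L
```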
We can do the estimation by taking values of `tau` every 0.05 from 0.05 to 0.95 (so from the bottom 5% of $T$ to the bottom 95%):
**NB: this will take some time (around 30 s on my MacBook 2015)**
```{r}
models <- dataset_full %>%
  group_by(dataset, `S > S_threshold`) %>%
  nest() %>%
  mutate(model = map(.x = data,
                     .f = ~ rq(T ~ S, tau = seq(0.05, .95, by = .05), data = .x)),
         coefs = map(.x = model, .f = tidy, conf.level = 0.95)) # computes the 95% confidence intervals of the coefficients
         # note: 0.95 is the default value anyway, but it is not recorded in the outputs, so I specify it here as a reminder
```
Then we can unpack the resulting coefficients:
```{r}
coefs <- models %>%
  select(c("dataset", "S > S_threshold", "coefs")) %>%
  unnest(c(coefs)) %>%
  mutate(term = ifelse(term == "S", "1/C", "L"))
coefs
```
We can immediately tell that we'll have a bit of a problem with the lower quantiles for `liglab2`, which makes sense: as seen above, the lower quantiles of $T$ show no variation with $S$ whatsoever, so of course the value of $\frac{1}{C}$ will be 0.
To get a better idea of the coefficients, let's plot their values as a function of `tau`.
For clarity, we'll facet by model term and by dataset, and use different colours for the two regimes of $S$.

```{r}
ggplot(coefs, aes(x = tau, y = estimate, colour = `S > S_threshold`)) +
  geom_point() +
  geom_errorbar(aes(ymin = conf.low, ymax = conf.high)) +
  facet_grid(term~dataset, scales = "free_y")
```
Predictably, only the higher quantiles start to differ, so let's just zoom in for a better look, say above `tau` = 0.75:
```{r}
ggplot(coefs %>% filter(tau >= 0.75), # take the higher-quantile models only
       aes(x = tau, y = estimate, colour = `S > S_threshold`)) +
  geom_point() +
  geom_errorbar(aes(ymin = conf.low, ymax = conf.high)) +
  facet_grid(term~dataset, scales = "free_y")
```

### Conclusions
We can see here that, much as we found previously, $C$ has relatively little influence on $T$, although this influence increases at the higher quantiles of $T$.

Interestingly, what little trends there are run in opposite directions in the two datasets, with $1/C$ increasing at higher values of `tau` (so $C$ decreasing) for the `liglab2` dataset, and vice versa for the `stackoverflow` dataset. This trend only exists for values of $S$ above $S_{threshold}$, not below. Overall, the capacity of the connection appears to vary little when any other parameter varies, which makes sense.

Moreover, $L$ has a much higher value for the `stackoverflow` dataset than for the `liglab2` dataset, regardless of the value of `tau`. This can likely be explained by the distance covered by the two connections: `liglab2` comes from a local, on-campus connection, while `stackoverflow` comes from a long-distance connection, which automatically means a higher latency to transfer information.
We have compared the parameters of two different connections using linear and quantile regression.
We have seen that linear regression was a poor choice of method for these data, as was simply restricting the data to the per-size minima of $T$, which oriented our choice towards a different technique, one that bypasses the main limitation of linear regression (namely, that it estimates the *average* value of the response variable) by estimating *quantiles* of the response variable $T$ instead.