#+TITLE: Analysis of custom data - Module 2 - Exercise 4
#+AUTHOR: Miguel Felipe Silva Vasconcelos
#+DATE: 28/02/2021
#+LANGUAGE: en
# #+PROPERTY: header-args :eval never-export

* Introduction

Solely for the purpose of this exercise, I'm using data that was randomly generated. The file /data.csv/ contains two columns:

- The first column represents the day of the month
- The second column represents how many minutes were spent on a given task (in this case, studying for this MOOC)

The analysis presents the following metrics for the time spent each day: median, average, standard deviation, maximum, and minimum.

* Results of the experiments

I'm using the [[https://pandas.pydata.org/][Pandas library]] to make it easier to read the dates from the CSV file, and to learn a new tool :).

#+begin_src python :results value :session *python* :exports both
# with ":results value", the block shows the value of the last expression instead of the console output
import pandas as pd

# using pandas to facilitate working with date and time
dataframe = pd.read_csv("data.csv", parse_dates=[0], delimiter=';', header=None)
dataframe
#+end_src

#+RESULTS:
#+begin_example
            0           1
0  2021-02-01   81.819914
1  2021-02-02   45.630108
2  2021-02-03   70.870649
3  2021-02-04    5.975111
4  2021-02-05  101.240122
5  2021-02-06  103.766044
6  2021-02-07   52.724327
7  2021-02-08   68.712419
8  2021-02-09   24.769924
9  2021-02-10  118.519012
10 2021-02-11   72.366803
11 2021-02-12  114.271576
12 2021-02-13   22.577226
13 2021-02-14    9.454489
14 2021-02-15   82.041779
15 2021-02-16  113.367189
16 2021-02-17   69.055952
17 2021-02-18   23.393082
18 2021-02-19   59.451386
19 2021-02-20   11.830620
20 2021-02-21   38.629430
21 2021-02-22   55.876251
22 2021-02-23   69.602759
23 2021-02-24   12.494400
24 2021-02-25  115.595595
25 2021-02-26   56.179007
26 2021-02-27   64.323035
27 2021-02-28    4.862036
#+end_example

* Calculating the average/mean

We can use pandas'
[[https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.mean.html][mean method]]:

#+begin_src python :results output :session *python* :exports both
# with ":results output", the block shows only what is printed to the console
average = dataframe[1].mean()
print(average)
#+end_src

#+RESULTS:
: 59.621437271033706

* Calculating the standard deviation

We can use pandas' [[https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.std.html][std method]]:

#+begin_src python :results value :session *python* :exports both
std = dataframe[1].std()
std
#+end_src

#+RESULTS:
: 36.12909565271962

* Calculating the median

We can use pandas' [[https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.median.html][median method]]:

#+begin_src python :results value :session *python* :exports both
median = dataframe[1].median()
median
#+end_src

#+RESULTS:
: 61.88721048734205

* Finding the minimum value (time spent)

We can use pandas' [[https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.min.html][min method]]:

#+begin_src python :results value :session *python* :exports both
# named min_value so that Python's built-in min() is not shadowed in the session
min_value = dataframe[1].min()
min_value
#+end_src

#+RESULTS:
: 4.86203636954475

* Finding the day with the minimum time spent studying

We can use pandas' [[https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.idxmin.html][idxmin method]]:

#+begin_src python :results output :session *python* :exports both
# idxmin returns the row label of the smallest value
idmin = dataframe[1].idxmin()
print(dataframe[0][idmin], dataframe[1][idmin])
#+end_src

#+RESULTS:
: 2021-02-28 00:00:00 4.86203636954475

* Finding the maximum value (time spent)

We can use pandas' [[https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.max.html][max method]]:

#+begin_src python :results value :session *python* :exports both
# named max_value so that Python's built-in max() is not shadowed in the session
max_value = dataframe[1].max()
max_value
#+end_src

#+RESULTS:
: 118.519011934154

* Finding the day with the maximum time spent studying

We can use pandas'
[[https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.idxmax.html][idxmax method]]:

#+begin_src python :results output :session *python* :exports both
# idxmax returns the row label of the largest value
idmax = dataframe[1].idxmax()
print(dataframe[0][idmax], dataframe[1][idmax])
#+end_src

#+RESULTS:
: 2021-02-10 00:00:00 118.519011934154

* Generating a plot of the data

#+begin_src python :results output file :session *python* :var matplot_lib_filename2="simple_plot.png" :exports both
from matplotlib import pyplot as plt

fig, ax = plt.subplots(figsize=(12, 12))
# use the day of the month (1-28) on the x axis rather than the row index (0-27)
ax.bar(dataframe[0].dt.day, dataframe[1], color='purple')
ax.set(xlabel="Day of the month",
       ylabel="Time spent (minutes)",
       title="Daily time spent studying for the MOOC on reproducible research - Feb/2021")
plt.savefig(matplot_lib_filename2)
# printing the file name lets Org insert a link to the image
print(matplot_lib_filename2)
#+end_src

#+RESULTS:
[[file:simple_plot.png]]
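
* Computing all the metrics in one pass

The metrics computed one by one in the sections above can also be gathered in a single call. The block below is only a minimal sketch and not part of the original analysis: the =summarize= helper, the column names =day= and =minutes=, and the inlined five-row sample (copied from the first rows of the table shown earlier, so that the block runs without /data.csv/) are assumptions made for illustration.

#+begin_src python :results output :exports both
import io

import pandas as pd

# Hypothetical sample: the first five rows of the table shown earlier,
# inlined so that this sketch does not depend on data.csv.
SAMPLE_CSV = """2021-02-01;81.819914
2021-02-02;45.630108
2021-02-03;70.870649
2021-02-04;5.975111
2021-02-05;101.240122
"""


def summarize(csv_text):
    """Return the summary statistics and the days with the least and most minutes."""
    df = pd.read_csv(
        io.StringIO(csv_text),
        sep=";",
        header=None,
        names=["day", "minutes"],  # assumed column names, not in the original file
        parse_dates=["day"],
    )
    # Series.agg with a list of names returns one value per statistic
    stats = df["minutes"].agg(["mean", "std", "median", "min", "max"])
    # idxmin/idxmax give the row label, which is then used to look up the day
    shortest_day = df.loc[df["minutes"].idxmin(), "day"]
    longest_day = df.loc[df["minutes"].idxmax(), "day"]
    return stats, shortest_day, longest_day


stats, shortest_day, longest_day = summarize(SAMPLE_CSV)
print(stats)
print("least time spent on:", shortest_day.date())
print("most time spent on:", longest_day.date())
#+end_src

Because the sample has only five rows, the numbers it prints differ from the results for the full month reported above. To run it on the real file, the same =read_csv= arguments could be applied to /data.csv/ instead of the inlined text.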