From 335924324dc3613c52631a8359ed634e1d6bbcc7 Mon Sep 17 00:00:00 2001
From: Arnaud Legrand <arnaud.legrand@imag.fr>
Date: Thu, 28 Mar 2019 12:34:53 +0100
Subject: [PATCH] English version

---
 module2/exo4/stat_activity.org | 159 ++++++++++++++++-----------------
 1 file changed, 76 insertions(+), 83 deletions(-)
diff --git a/module2/exo4/stat_activity.org b/module2/exo4/stat_activity.org
index b224e96..29bbb00 100644
--- a/module2/exo4/stat_activity.org
+++ b/module2/exo4/stat_activity.org
@@ -1,4 +1,4 @@
-#+TITLE: FIXME Analyse des mots-clés de mon journal
+#+TITLE: Analyzing my journal's keywords
 #+LANGUAGE: fr
 
 #+HTML_HEAD: <link rel="stylesheet" type="text/css" href="http://www.pirilampo.org/styles/readtheorg/css/htmlize.css"/>
@@ -10,33 +10,32 @@
 
 #+PROPERTY: header-args  :session  :exports both  :eval never-export
 
-J'ai la chance de ne pas avoir de comptes à rendre trop précis sur le
-temps que je passe à faire telle ou telle chose. Ça tombe bien car je
-n'aime pas vraiment suivre précisément et quotidiennement le temps que
-je passe à faire telle ou telle chose. Par contre, comme vous avez pu
-le voir dans une des vidéos de ce module, je note beaucoup
-d'informations dans mon journal et j'étiquette (quand j'y pense) ces
-informations. Je me suis dit qu'il pourrait être intéressant de voir
-si l'évolution de l'utilisation de ces étiquettes révélait quelque
-chose sur mes centres d'intérêts professionnels. Je ne compte pas en
-déduire grand chose de significatif sur le plan statistique vu que je
-sais que ma rigueur dans l'utilisation de ces étiquettes et leur
-sémantique a évolué au fil des années mais bon, on va bien voir ce
-qu'on y trouve.
-
-* Mise en forme des données
-Mon journal est stocké dans ~/home/alegrand/org/journal.org~. Les
-entrées de niveau 1 (une étoile) indiquent l'année, celles de niveau 2
-(2 étoiles) le mois, celles de niveau 3 (3 étoiles) la date du jour et
-enfin, celles de profondeur plus importantes ce sur quoi j'ai
-travaillé ce jour là. Ce sont généralement celles-ci qui sont
-étiquetées avec des mots-clés entre ":" à la fin de la ligne. 
-
-Je vais donc chercher à extraire les lignes comportant trois ~*~ en
-début de ligne et celles commençant par une ~*~ et terminant par des
-mots-clés (des ~:~ suivis éventuellement d'un espace). L'expression
-régulière n'est pas forcément parfaite mais ça me donne une première
-idée de ce que j'aurai besoin de faire en terme de remise en forme.
+I'm a lucky person as I do not have to account too precisely for how
+much time I spend working on this or that topic. This is good as I
+really like my freedom and I feel I would not like having to monitor
+my activity on a daily basis. However, as you may have noticed in the
+videos of this module, I keep track of a large amount of information
+in my journal and I tag it (most of the time). So I thought it might
+be interesting to see whether these tags could reveal something about
+the evolution of my professional interests. I have no intention of
+deducing anything really significant from a statistical point of view,
+in particular as I know my tagging rigor and the tag semantics have
+evolved over time. So it will be purely exploratory.
+
+* Data Processing and Curation
+My journal is stored in ~/home/alegrand/org/journal.org~. Level 1
+entries (1 star) indicate the year. Level 2 entries (2 stars) indicate
+the month. Level 3 entries (3 stars) indicate the day and finally
+entries with a depth larger than 3 are generally the important ones
+and indicate what I've been working on that particular day. These
+are the entries that may be tagged. The tags appear at the end of
+these lines and are surrounded by =:=.
+
+So let's try to extract the lines with exactly three ~*~ at the beginning
+of the line (the date) and those that start with a ~*~ and end with tags
+(between ~:~ and possibly followed by spaces). The corresponding regular
+expression is not perfect but it is a first attempt and will give me
+an idea of how much parsing and string processing I'll have to do.
 
 #+begin_src shell :results output :exports both :eval never-export 
 grep -e '^\*\*\* ' -e '^\*.*:.*: *$' ~/org/journal.org | tail -n 20
@@ -66,14 +65,13 @@ grep -e '^\*\*\* ' -e '^\*.*:.*: *$' ~/org/journal.org | tail -n 20
 ,**** Point budget/contrats POLARIS                         :POLARIS:INRIA:
 #+end_example
 
-OK, je suis sur la bonne voie. Je vois qu'il y a pas mal d'entrées
-sans annotation. Tant pis. Il y a aussi souvent plusieurs mots-clés
-pour une même date et pour pouvoir bien rajouter la date du jour en
-face de chaque mot-clé, je vais essayer un vrai langage plutôt que
-d'essayer de faire ça à coup de commandes shell. Je suis de l'ancienne
-génération donc j'ai plus l'habitude de Perl que de Python pour ce
-genre de choses. Curieusement, ça s'écrit bien plus facilement (ça m'a
-pris 5 minutes) que ça ne se relit... \smiley
+OK, that's not so bad. There are actually many entries that are not
+tagged. Never mind! There are also often several tags for the same entry
+and several entries for the same day. If I want to add the date in front
+of each keyword, I'd rather use a real language than try to
+do it only with shell commands. I'm old-school so I'm more used to
+using Perl than using Python. Amusingly, it is way easier to write (it
+took me about 5 minutes) than to read... \smiley
 
 #+begin_src perl :results output :exports both :eval never-export
 open INPUT, "/home/alegrand/org/journal.org" or die $_;
@@ -100,7 +98,7 @@ while(defined($line=<INPUT>)) {
 
 #+RESULTS:
 
-Vérifions à quoi ressemble le résultat :
+Let's check the result:
 #+begin_src shell :results output :exports both
 head org_keywords.csv
 echo "..."
@@ -132,12 +130,12 @@ Date,Keyword
 2018-06-26,INRIA
 #+end_example
 
-C'est parfait !
+Awesome! That's exactly what I wanted.
 
-* Statistiques de base
-Je suis bien plus à l'aise avec R qu'avec Python. J'utiliserai les
-package du tidyverse dès que le besoin s'en fera sentir. Commençons
-par lire ces données :
+* Basic Statistics
+Again, I'm much more comfortable using R than using Python. I'll try
+not to reinvent the wheel and I'll use the tidyverse packages as soon
+as they appear useful. Let's start by reading the data:
 #+begin_src R :results output :session *R* :exports both
 library(lubridate) # à installer via install.package("tidyverse")
 library(dplyr)
@@ -169,7 +167,7 @@ The following objects are masked from ‘package:base’:
     intersect, setdiff, setequal, union
 #+end_example
 
-Alors, à quoi ressemblent ces données :
+What does it look like?
 #+begin_src R :results output :session *R* :exports both
 str(df)
 summary(df)
@@ -191,7 +189,7 @@ summary(df)
  (Other)   :537   (Other) :271
 #+end_example
 
-Les types ont l'air corrects, 568 entrées, tout va bien.
+Types appear to be correct. 568 entries. Nothing strange, let's keep going.
 #+begin_src R :results output :session *R* :exports both
 df %>% group_by(Keyword, Year) %>% summarize(Count=n()) %>% 
    ungroup() %>% arrange(Keyword,Year) -> df_summarized
@@ -216,7 +214,7 @@ df_summarized
 # ... with 110 more rows
 #+end_example
 
-Commençons par compter combien d'annotations je fais par an.
+Let's start by counting how many annotations I make per year:
 #+begin_src R :results output :session *R* :exports both
 df_summarized_total_year = df_summarized %>% group_by(Year) %>% summarize(Cout=sum(Count))
 df_summarized_total_year
@@ -237,11 +235,10 @@ df_summarized_total_year
 8  2018    48
 #+end_example
 
-Ah, visiblement, je m'améliore au fil du temps et en 2014, j'ai oublié
-de le faire régulièrement.
+Good. It looks like I'm improving over time. 2014 was a bad year and I
+apparently forgot to review and tag on a regular basis.
 
-L'annotation étant libre, certains mots-clés sont peut-être très peu
-présents. Regardons ça.
+Tagging is free-form, so some tags may be rarely used. Let's have a look.
 #+begin_src R :results output :session *R* :exports both
 df_summarized %>% group_by(Keyword) %>% summarize(Count=sum(Count)) %>%  arrange(Count) %>% as.data.frame()
 #+end_src
@@ -287,15 +284,14 @@ df_summarized %>% group_by(Keyword) %>% summarize(Count=sum(Count)) %>%  arrange
 36           WP4    77
 #+end_example
 
-OK, par la suite, je me restraindrai probablement à ceux qui
-apparaissent au moins trois fois.
+OK, in the following, I'll probably restrict myself to the tags that
+appear at least three times.
 
-* Représentations graphiques
-Pour bien faire, il faudrait que je mette une sémantique et une
-hiérarchie sur ces mots-clés mais je manque de temps là. Comme
-j'enlève les mots-clés peu fréquents, je vais quand même aussi
-rajouter le nombre total de mots-clés pour avoir une idée de ce que
-j'ai perdu. Tentons une première représentation graphique :
+* Nice Looking Graphics
+Ideally, I would define semantics and a hierarchy for my tags but I'm
+running out of time. Since I've decided to remove rare tags, I'll also
+count the total number of tags to get an idea of how much information
+I've lost. Let's try a first representation:
 #+begin_src R :results output graphics :file barchart1.png :exports both :width 600 :height 400 :session *R* 
 library(ggplot2)
 df_summarized %>% filter(Count > 3) %>%
@@ -308,13 +304,11 @@ df_summarized %>% filter(Count > 3) %>%
 #+RESULTS:
 [[file:barchart1.png]]
 
-Aouch. C'est illisible avec une telle palette de couleurs mais vu
-qu'il y a beaucoup de valeurs différentes, difficile d'utiliser une
-palette plus discriminante. Je vais quand même essayer rapidement
-histoire de dire... Pour ça, j'utiliserai une palette de couleur
-("Set1") où les couleurs sont toutes bien différentes mais elle n'a
-que 9 couleurs. Je vais donc commencer par sélectionner les 9
-mots-clés les plus fréquents.
+Ouch! This is very hard to read, in particular because of the many
+different colors and the continuous palette that makes it hard to
+distinguish between tags. Let's try another palette ("Set1") where
+colors are very different. Unfortunately there are only 9 colors in
+this palette so I'll first have to select the 9 most frequent tags.
 
 #+begin_src R :results output graphics :file barchart2.png :exports both :width 600 :height 400 :session *R* 
 library(ggplot2)
@@ -332,23 +326,22 @@ df_summarized %>% filter(Keyword %in% frequent_keywords$Keyword) %>%
 #+RESULTS:
 [[file:barchart2.png]]
 
-OK. Visiblement, la part liée à l'administration (~Inria~, ~LIG~, ~POLARIS~)
-et à l'enseignement (~Teaching~) augmente. L'augmentation des parties
-sur ~R~ est à mes yeux signe d'une amélioration de ma maîtrise de
-l'outil. L'augmentation de la partie ~Seminar~ ne signifie pas grand
-chose car ce n'est que récemment que j'ai commencé à étiqueter
-systématiquement les notes que je prenais quand j'assiste à un
-exposé. Les étiquettes sur ~WP~ ont trait à la terminologie d'un ancien
-projet ANR que j'ai continué à utiliser (~WP4~ = prédiction de
-performance HPC, ~WP7~ = analyse et visualisation, ~WP8~ = plans
-d'expérience et moteurs d'expérimentation...). Le fait que ~WP4~
-diminue est plutôt le fait que les informations à ce sujet sont
-maintenant plutôt les journaux de mes doctorants qui réalisent
-vraiment les choses que je ne fais que superviser.
-
-Bon, une analyse de ce genre ne serait pas digne de ce nom sans un
-/wordcloud/ (souvent illisible, mais tellement sexy! \smiley). Pour ça, je
-m'inspire librement de ce post :
+OK. That's much better. It appears that the administration part
+(~Inria~, ~LIG~, ~POLARIS~) and the teaching part (~Teaching~) are increasing. The
+increasing usage of the ~R~ tag probably reflects my improvement in
+using this tool. The evolution of the ~Seminar~ tag is meaningless as I
+only recently started to systematically tag my seminar notes. The ~WP~
+tags are related to a former ANR project but I've kept using the same
+annotation style (~WP4~ = performance evaluation of HPC systems, ~WP7~ =
+data analysis and visualization, ~WP8~ = design of
+experiments/experiment engines/reproducible research...).
+~WP4~ is decreasing but this is because most of the work on this topic is
+now in my students' labbooks since they are doing all the real work
+which I'm mostly supervising.
+
+Well, this kind of exploratory analysis would not be complete without
+a /wordcloud/ (most of the time completely unreadable but also so trendy!
+\smiley). To this end, I followed the ideas presented in this blog post:
 http://onertipaday.blogspot.com/2011/07/word-cloud-in-r.html
 
 #+begin_src R :results output graphics :file wordcloud.png :exports both :width 600 :height 400 :session *R* 
@@ -363,5 +356,5 @@ wordcloud(df_summarized_keyword$Keyword, df_summarized_keyword$Count,
 #+RESULTS:
 [[file:wordcloud.png]]
 
-Bon... voilà, c'est "joli" mais sans grand intérêt, tout
-particulièrement quand il y a si peu de mots différents.
+Voilà! It is "nice" but rather useless, especially with so few words
+and such poor semantics.
-- 
2.18.1