mooc-rr-ressources
Commit 33592432 authored Mar 28, 2019 by Arnaud Legrand
English version
parent 60a8d0d2
Showing 1 changed file with 76 additions and 83 deletions
module2/exo4/stat_activity.org
#+TITLE: Analyzing my journal's keywords
#+LANGUAGE: fr
#+HTML_HEAD: <link rel="stylesheet" type="text/css" href="http://www.pirilampo.org/styles/readtheorg/css/htmlize.css"/>
...
@@ -10,33 +10,32 @@
#+PROPERTY: header-args :session :exports both :eval never-export
I'm a lucky person as I do not have to account too precisely for how
much time I spend working on this or that topic. This is good as I
really like my freedom and I feel I would not like having to monitor
my activity on a daily basis. However, as you may have noticed in the
videos of this module, I keep track of a large amount of information
in my journal and I tag it (most of the time). So I thought it might
be interesting to see whether these tags could reveal something about
the evolution of my professional interests. I do not intend to
deduce anything really significant from a statistical point of view,
in particular as I know my tagging rigor and the tag semantics have
evolved through time. So it will be purely exploratory.
* Data Processing and Curation
My journal is stored in ~/home/alegrand/org/journal.org~. Level 1
entries (1 star) indicate the year. Level 2 entries (2 stars) indicate
the month. Level 3 entries (3 stars) indicate the day and finally
entries with a depth larger than 3 are generally the important ones
and indicate what I worked on that particular day. These
are the entries that may be tagged. The tags appear at the end of
these lines and are surrounded with =:=.

So let's try to extract the lines with exactly three ~*~ at the beginning
of the line (the date) and those that start with a ~*~ and end with tags
(between ~:~ and possibly followed by spaces). The corresponding regular
expression is not perfect but it is a first attempt and will give me
an idea of how much parsing and string processing I'll have to do.
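To get a feel for what the two patterns select before running them on the real file, here is a minimal sketch on a made-up excerpt. The sample headings and the temporary file are assumptions for illustration only; the two ~grep~ patterns are the ones used in this analysis.

```shell
# Made-up excerpt (hypothetical content, not the real journal).
journal=$(mktemp)
cat > "$journal" <<'EOF'
* 2018
** 2018-06 June
*** 2018-06-25 Monday
**** Reading notes on scheduling
**** Point budget/contrats POLARIS :POLARIS:INRIA:
*** 2018-06-26 Tuesday
**** Lecture preparation :Teaching:
Plain body text, with a time such as 10:30: in it
EOF

# First pattern: day headings (three stars, then a space).
# Second pattern: any heading whose line ends with :tag: (optional spaces).
matches=$(grep -e '^\*\*\* ' -e '^\*.*:.*: *$' "$journal")
printf '%s\n' "$matches"
rm -f "$journal"
```

On this excerpt, only the two day headings and the two tagged headings are kept. The year and month headings, the untagged heading and the body text are dropped.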
#+begin_src shell :results output :exports both :eval never-export
grep -e '^\*\*\* ' -e '^\*.*:.*: *$' ~/org/journal.org | tail -n 20
...
@@ -66,14 +65,13 @@ grep -e '^\*\*\* ' -e '^\*.*:.*: *$' ~/org/journal.org | tail -n 20
,**** Point budget/contrats POLARIS :POLARIS:INRIA:
#+end_example
OK, that's not so bad. There are actually many entries that are not
tagged. Never mind! There are also often several tags for the same entry
and several entries for the same day. If I want to add the date in front
of each keyword, I'd rather use a real language than try to
do it only with shell commands. I'm old-school so I'm more used to
Perl than to Python. Amusingly, it is way easier to write (it
took me about 5 minutes) than to read... \smiley
#+begin_src perl :results output :exports both :eval never-export
open INPUT, "/home/alegrand/org/journal.org" or die $!;
...
@@ -100,7 +98,7 @@ while(defined($line=<INPUT>)) {
#+RESULTS:
Let's check the result:
#+begin_src shell :results output :exports both
head org_keywords.csv
echo "..."
...
@@ -132,12 +130,12 @@ Date,Keyword
2018-06-26,INRIA
#+end_example
Awesome! That's exactly what I wanted.
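The Perl script is elided in this view, so as a rough illustration of the date/tag pairing that produces ~org_keywords.csv~, here is a sketch in shell with ~awk~. This is not the Perl script used here: the function name ~extract_tags~ is hypothetical, and the sketch assumes that every day heading carries an ISO date right after the three stars and that tags form the last whitespace-separated field of a heading.

```shell
# Hypothetical helper (an assumption for illustration, not the original script).
extract_tags() {
  # $1: path to an org file organised as year / month / day / entries.
  awk '
    BEGIN { print "Date,Keyword" }
    # Day heading: remember its date (assumed to be the second field).
    /^\*\*\* / { date = $2; next }
    # Tagged heading: the last field looks like :tag1:tag2:
    /^\*.*:.*: *$/ {
      n = split($NF, tags, ":")
      for (i = 1; i <= n; i++)
        if (tags[i] != "") print date "," tags[i]
    }
  ' "$1"
}

# Usage on a made-up excerpt:
excerpt=$(mktemp)
printf '%s\n' '*** 2018-06-26 Tuesday' \
  '**** Point budget/contrats POLARIS :POLARIS:INRIA:' \
  '**** An untagged entry' > "$excerpt"
extract_tags "$excerpt"
rm -f "$excerpt"
```

Each tag ends up on its own line, prefixed with the date of the most recent day heading, which is the ~Date,Keyword~ layout shown in the output above.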
* Basic Statistics
Again, I'm much more comfortable using R than using Python. I'll try
not to reinvent the wheel and I'll use the tidyverse packages as soon
as they appear useful. Let's start by reading the data:
#+begin_src R :results output :session *R* :exports both
library(lubridate) # install via install.packages("tidyverse")
library(dplyr)
...
@@ -169,7 +167,7 @@ The following objects are masked from ‘package:base’:
intersect, setdiff, setequal, union
#+end_example
What does it look like?
#+begin_src R :results output :session *R* :exports both
str(df)
summary(df)
...
@@ -191,7 +189,7 @@ summary(df)
(Other) :537 (Other) :271
#+end_example
Types appear to be correct. 568 entries. Nothing strange, let's keep going.
#+begin_src R :results output :session *R* :exports both
df %>% group_by(Keyword, Year) %>% summarize(Count=n()) %>%
  ungroup() %>% arrange(Keyword,Year) -> df_summarized
...
@@ -216,7 +214,7 @@ df_summarized
# ... with 110 more rows
#+end_example
Let's start by counting how many annotations I make per year:
#+begin_src R :results output :session *R* :exports both
df_summarized_total_year = df_summarized %>% group_by(Year) %>% summarize(Cout=sum(Count))
df_summarized_total_year
...
@@ -237,11 +235,10 @@ df_summarized_total_year
8 2018 48
#+end_example
Good. It looks like I'm improving over time. 2014 was a bad year and I
apparently forgot to review and tag on a regular basis.
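As a quick cross-check of the yearly totals outside R, the same count can be sketched directly on the CSV file with standard shell tools. The function name ~count_per_year~ and the sample rows are assumptions for illustration; the sketch only relies on the ~Date,Keyword~ header and on dates starting with a four-digit year.

```shell
# Hypothetical helper: number of annotations per year.
count_per_year() {
  # $1: CSV file with a Date,Keyword header and ISO dates (YYYY-MM-DD).
  # Drop the header, keep the four-digit year, then count identical years.
  tail -n +2 "$1" | cut -c1-4 | LC_ALL=C sort | uniq -c |
    awk '{ print $2 "," $1 }'
}

# Made-up sample in the same format as org_keywords.csv.
sample=$(mktemp)
printf '%s\n' 'Date,Keyword' '2017-03-01,R' \
  '2018-06-26,POLARIS' '2018-06-26,INRIA' > "$sample"
count_per_year "$sample"
rm -f "$sample"
```

On the real file this would be called as ~count_per_year org_keywords.csv~ and should agree with the totals computed by the ~dplyr~ pipeline.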
Tags are free-form, so some tags may be scarcely used. Let's have a look.
#+begin_src R :results output :session *R* :exports both
df_summarized %>% group_by(Keyword) %>% summarize(Count=sum(Count)) %>% arrange(Count) %>% as.data.frame()
#+end_src
...
@@ -287,15 +284,14 @@ df_summarized %>% group_by(Keyword) %>% summarize(Count=sum(Count)) %>% arrange
36 WP4 77
#+end_example
OK, in the following, I'll restrict myself to the tags that appear at least
three times.
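To make the "at least three times" rule concrete, here is a minimal shell sketch that counts each tag over the whole CSV file and keeps those reaching a threshold. The function name ~frequent_tags~ and the sample rows are hypothetical, and the threshold is applied to the total over all years, which is one possible reading of the rule.

```shell
# Hypothetical helper: tags whose total count reaches a minimum.
frequent_tags() {
  # $1: CSV file with a Date,Keyword header; $2: minimum number of occurrences.
  tail -n +2 "$1" | cut -d, -f2 | LC_ALL=C sort | uniq -c |
    awk -v min="$2" '$1 >= min { print $2 "," $1 }'
}

# Made-up sample: R appears three times, Seminar only once.
sample=$(mktemp)
printf '%s\n' 'Date,Keyword' '2016-01-04,R' '2017-02-10,R' \
  '2018-05-02,R' '2018-05-03,Seminar' > "$sample"
frequent_tags "$sample" 3
rm -f "$sample"
```

Note that a per-year filter (as opposed to this overall total) would drop a tag used once a year for three years, so the two readings can give different results.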
* Nice Looking Graphics
Ideally, I would define semantics and a hierarchy for my tags but I'm
running out of time. Since I've decided to remove rare tags, I'll also
count the total number of tags to get an idea of how much information
I've lost. Let's try a first representation:
#+begin_src R :results output graphics :file barchart1.png :exports both :width 600 :height 400 :session *R*
library(ggplot2)
df_summarized %>% filter(Count > 3) %>%
...
@@ -308,13 +304,11 @@ df_summarized %>% filter(Count > 3) %>%
#+RESULTS:
[[file:barchart1.png]]
Ouch! This is very hard to read, in particular because of the many
different colors and the continuous palette, which makes it hard to
distinguish between tags. Let's try another palette ("Set1") where
colors are very different. Unfortunately there are only 9 colors in
this palette so I'll first have to select the 9 most frequent tags.
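The selection step on its own can be sketched in shell as follows. This is only an illustration of "keep the N most frequent tags": the function name ~top_tags~ and the sample rows are hypothetical, and ties are broken alphabetically, which is an arbitrary choice.

```shell
# Hypothetical helper: the N most frequent tags, most frequent first.
top_tags() {
  # $1: CSV file with a Date,Keyword header; $2: number of tags to keep.
  tail -n +2 "$1" | cut -d, -f2 | LC_ALL=C sort | uniq -c |
    awk '{ print $1 "," $2 }' |
    LC_ALL=C sort -t, -k1,1nr -k2,2 |   # by count (descending), then by name
    head -n "$2" | cut -d, -f2
}

# Made-up sample: WP4 appears three times, R twice, Seminar once.
sample=$(mktemp)
printf '%s\n' 'Date,Keyword' '2016-01-04,WP4' '2016-01-05,WP4' \
  '2016-01-06,WP4' '2017-02-10,R' '2017-02-11,R' \
  '2018-05-03,Seminar' > "$sample"
top_tags "$sample" 2
rm -f "$sample"
```

With ~top_tags org_keywords.csv 9~, the result would be the nine tags that fit the nine colors of the palette.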
#+begin_src R :results output graphics :file barchart2.png :exports both :width 600 :height 400 :session *R*
library(ggplot2)
...
@@ -332,23 +326,22 @@ df_summarized %>% filter(Keyword %in% frequent_keywords$Keyword) %>%
#+RESULTS:
[[file:barchart2.png]]
OK. That's much better. It appears that the administration part
(~Inria~, ~LIG~, ~POLARIS~) and the teaching part (~Teaching~) are increasing. The
increasing usage of the ~R~ tag probably reflects my improvement in
using this tool. The evolution of the ~Seminar~ tag is meaningless as I
only recently started to systematically tag my seminar notes. The ~WP~
tags are related to a former ANR project but I've kept using the same
annotation style (~WP4~ = performance evaluation of HPC systems, ~WP7~ =
data analysis and visualization, ~WP8~ = design of
experiments/experiment engines/reproducible research...).
~WP4~ is decreasing, but that is because most of the work on this topic is
now in my students' labbooks since they are doing all the real work,
which I'm mostly supervising.
Well, this kind of exploratory analysis would not be complete without
a /wordcloud/ (most of the time completely unreadable, but so trendy!
\smiley). To this end, I followed the ideas presented in this blog post:
http://onertipaday.blogspot.com/2011/07/word-cloud-in-r.html
#+begin_src R :results output graphics :file wordcloud.png :exports both :width 600 :height 400 :session *R*
...
@@ -363,5 +356,5 @@ wordcloud(df_summarized_keyword$Keyword, df_summarized_keyword$Count,
#+RESULTS:
[[file:wordcloud.png]]
Voilà! It is "nice" but rather useless, especially with so few words
and such poor semantics.