...
 
Commits (4)
# Partie 1
## Sous-partie 1 : texte
Une phrase sans rien
*Une phrase en italique*
**Une phrase en gras**
Un lien vers [fun-mooc.fr](https://www.fun-mooc.fr/)
Une ligne de `code`
## Sous-partie 2 : listes
**Liste à puce**
- item
- sous-item
- sous-item
- item
- item
**Liste numérotée**
1. item
2. item
3. item
## Sous-partie 3 : code
```
# Extrait de code
```
\ No newline at end of file
#+TITLE: Your title #+TITLE: The London cholera epidemic of 1854
#+AUTHOR: Your name #+AUTHOR: LT
#+DATE: Today's date
#+LANGUAGE: en #+LANGUAGE: en
# #+PROPERTY: header-args :eval never-export
#+HTML_HEAD: <link rel="stylesheet" type="text/css" href="http://www.pirilampo.org/styles/readtheorg/css/htmlize.css"/> #+HTML_HEAD: <link rel="stylesheet" type="text/css" href="http://www.pirilampo.org/styles/readtheorg/css/htmlize.css"/>
#+HTML_HEAD: <link rel="stylesheet" type="text/css" href="http://www.pirilampo.org/styles/readtheorg/css/readtheorg.css"/> #+HTML_HEAD: <link rel="stylesheet" type="text/css" href="http://www.pirilampo.org/styles/readtheorg/css/readtheorg.css"/>
...@@ -11,84 +9,353 @@ ...@@ -11,84 +9,353 @@
#+HTML_HEAD: <script type="text/javascript" src="http://www.pirilampo.org/styles/lib/js/jquery.stickytableheaders.js"></script> #+HTML_HEAD: <script type="text/javascript" src="http://www.pirilampo.org/styles/lib/js/jquery.stickytableheaders.js"></script>
#+HTML_HEAD: <script type="text/javascript" src="http://www.pirilampo.org/styles/readtheorg/js/readtheorg.js"></script> #+HTML_HEAD: <script type="text/javascript" src="http://www.pirilampo.org/styles/readtheorg/js/readtheorg.js"></script>
* Some explanations #+LATEX_HEADER: \usepackage{a4}
#+LATEX_HEADER: \usepackage[french]{babel}
This is an org-mode document with code examples in R. Once opened in #+PROPERTY: header-args :session :exports both :eval never-export
Emacs, this document can easily be exported to HTML, PDF, and Office
formats. For more information on org-mode, see
https://orgmode.org/guide/.
When you type the shortcut =C-c C-e h o=, this document will be # FIXME! https://towardsdatascience.com/plotting-a-map-of-london-crime-data-using-r-8dcefef1c397
exported as HTML. All the code in it will be re-executed, and the
results will be retrieved and included into the exported document. If
you do not want to re-execute all code each time, you can delete the #
and the space before ~#+PROPERTY:~ in the header of this document.
Like we showed in the video, Python code is included as follows (and * Data recovery and pre-processing
is exxecuted by typing ~C-c C-c~): Let's start by loading a number of libraries which we
will be useful later.
#+begin_src R :results output :session *R* :exports both
library(ggplot2)
library(stringr)
library(dplyr)
library(tidyr)
#+end_src
#+RESULTS:
#+begin_example
Package attachment : ‘dplyr’
The following objects are masked from ‘package:stats’:
filter, lag
The following objects are masked from ‘package:base’:
intersect, setdiff, setequal, union
#+end_example
Now download the data on the Cholera epidemic of 1854.
#+begin_src python :results output :exports both #+begin_src R :results output :session *R* :exports both
print("Hello world!") cholera_file = "./cholera.csv";
url = "https://fusiontables.google.com/exporttable?query=select+*+from+1HsIb_r4gYYmIz8y_UE1h-X8yUtAYW2INy99BR_c&o=csv"
if(!file.exists(cholera_file)) { download.file(url, cholera_file) }
cholera = read.csv(cholera_file, strip.white=T)
head(cholera)
str(cholera)
#+end_src #+end_src
#+RESULTS: #+RESULTS:
: Hello world! #+begin_example
essai de l'URL 'https://fusiontables.google.com/exporttable?query=select+*+from+1HsIb_r4gYYmIz8y_UE1h-X8yUtAYW2INy99BR_c&o=csv'
downloaded 16 KB
count geometry
1 3 <Point><coordinates>-0.13793,51.513418</coordinates></Point>
2 2 <Point><coordinates>-0.137883,51.513361</coordinates></Point>
3 1 <Point><coordinates>-0.137853,51.513317</coordinates></Point>
4 1 <Point><coordinates>-0.137812,51.513262</coordinates></Point>
5 4 <Point><coordinates>-0.137767,51.513204</coordinates></Point>
6 2 <Point><coordinates>-0.137537,51.513184</coordinates></Point>
'data.frame': 250 obs. of 2 variables:
$ count : int 3 2 1 1 4 2 2 2 3 2 ...
$ geometry: Factor w/ 250 levels "<Point><coordinates>-0.132933,51.513258</coordinates></Point>",..: 202 199 197 195 193 185 213 206 217 214 ...
#+end_example
And now the same but in an Python session. With a session, Python's
state, i.e. the values of all the variables, remains persistent from
one code block to the next. The code is still executed using ~C-c
C-c~.
#+begin_src python :results output :session :exports both #+begin_src R :results output :session *R* :exports both
import numpy pump_file = "./water_pump.csv";
x=numpy.linspace(-15,15) url = "https://fusiontables.google.com/exporttable?query=select+*+from+1Al4BQYgV0BzoyEaWmTcCRgLMkOJkiBo8QUQT_tU&o=csv"
print(x) if(!file.exists(pump_file)) { download.file(url, pump_file) }
pump = read.csv(pump_file, strip.white=T)
head(pump)
str(pump)
#+end_src #+end_src
#+RESULTS: #+RESULTS:
#+begin_example #+begin_example
[-15. -14.3877551 -13.7755102 -13.16326531 -12.55102041 essai de l'URL 'https://fusiontables.google.com/exporttable?query=select+*+from+1Al4BQYgV0BzoyEaWmTcCRgLMkOJkiBo8QUQT_tU&o=csv'
-11.93877551 -11.32653061 -10.71428571 -10.10204082 -9.48979592 Content type 'application/x-download; charset=UTF-8' length 520 bytes
-8.87755102 -8.26530612 -7.65306122 -7.04081633 -6.42857143 ==================================================
-5.81632653 -5.20408163 -4.59183673 -3.97959184 -3.36734694 downloaded 520 bytes
-2.75510204 -2.14285714 -1.53061224 -0.91836735 -0.30612245 geometry
0.30612245 0.91836735 1.53061224 2.14285714 2.75510204 1 <Point><coordinates>-0.136668,51.513341</coordinates></Point>
3.36734694 3.97959184 4.59183673 5.20408163 5.81632653 2 <Point><coordinates>-0.139586,51.513876</coordinates></Point>
6.42857143 7.04081633 7.65306122 8.26530612 8.87755102 3 <Point><coordinates>-0.139671,51.514906</coordinates></Point>
9.48979592 10.10204082 10.71428571 11.32653061 11.93877551 4 <Point><coordinates>-0.13163,51.512354</coordinates></Point>
12.55102041 13.16326531 13.7755102 14.3877551 15. ] 5 <Point><coordinates>-0.133594,51.512139</coordinates></Point>
6 <Point><coordinates>-0.135919,51.511542</coordinates></Point>
'data.frame': 8 obs. of 1 variable:
$ geometry: Factor w/ 8 levels "<Point><coordinates>-0.13163,51.512354</coordinates></Point>",..: 5 7 8 1 2 4 3 6
#+end_example #+end_example
Finally, an example for graphical output:
#+begin_src python :results output file :session :var matplot_lib_filename="./cosxsx.png" :exports results
import matplotlib.pyplot as plt
plt.figure(figsize=(10,5))
plt.plot(x,numpy.cos(x)/x)
plt.tight_layout()
plt.savefig(matplot_lib_filename)
print(matplot_lib_filename) These coordinates in the form of a character string are not very practical. Let's extract the latitudes and longitudes.
#+begin_src R :results output :session *R* :exports both
cholera$geometry %>%
str_replace_all("[</>A-Za-z]*","") %>%
str_split(",") %>% do.call("rbind",.) %>% as.numeric() -> cholera[,c("lon","lat")]
cholera %>% select(-geometry) -> cholera
# the do.call tricks was inspired by http://ggorjan.blogspot.com/2011/01/converting-strsplit-output-to-dataframe.html
head(cholera)
str(cholera)
#+end_src #+end_src
#+RESULTS: #+RESULTS:
[[file:./cosxsx.png]] #+begin_example
count lon lat
1 3 -0.137930 51.51342
2 2 -0.137883 51.51336
3 1 -0.137853 51.51332
4 1 -0.137812 51.51326
5 4 -0.137767 51.51320
6 2 -0.137537 51.51318
'data.frame': 250 obs. of 3 variables:
$ count: int 3 2 1 1 4 2 2 2 3 2 ...
$ lon : num -0.138 -0.138 -0.138 -0.138 -0.138 ...
$ lat : num 51.5 51.5 51.5 51.5 51.5 ...
#+end_example
Note the parameter ~:exports results~, which indicates that the code
will not appear in the exported document. We recommend that in the
context of this MOOC, you always leave this parameter setting as
~:exports both~, because we want your analyses to be perfectly
transparent and reproducible.
Watch out: the figure generated by the code block is /not/ stored in #+begin_src R :results output :session *R* :exports both
the org document. It's a plain file, here named ~cosxsx.png~. You have pump$geometry %>%
to commit it explicitly if you want your analysis to be legible and str_replace_all("[</>A-Za-z]*","") %>%
understandable on GitLab. str_split(",") %>% do.call("rbind",.) %>% as.numeric() -> pump[,c("lon","lat")]
pump %>% select(-geometry) -> pump
# the do.call tricks was inspired by http://ggorjan.blogspot.com/2011/01/converting-strsplit-output-to-dataframe.html
head(pump)
str(pump)
#+end_src
#+RESULTS:
#+begin_example
lon lat
1 -0.136668 51.51334
2 -0.139586 51.51388
3 -0.139671 51.51491
4 -0.131630 51.51235
5 -0.133594 51.51214
6 -0.135919 51.51154
'data.frame': 8 obs. of 2 variables:
$ lon: num -0.137 -0.14 -0.14 -0.132 -0.134 ...
$ lat: num 51.5 51.5 51.5 51.5 51.5 ...
#+end_example
On the other hand, later, some algorithms (density estimation, / clustering /) that I will use do not manage well the fact that an entry includes several deaths. So I'm going to prepare a "flattened" version of ~ cholera ~. For this, I am inspired by the following entry on
[[https://stackoverflow.com/questions/46841463/expand-a-data-frame-to-have-as-many-rows-as-range-of-two-columns-in-original-row[[stackoverflow ]].
#+begin_src R :results output :session *R* :exports both
cholera %>%
rowwise() %>%
do(data.frame(lon= .$lon, lat= .$lat, value = 1:.$count)) %>%
as.data.frame() -> cholera_flat
head(cholera_flat )
#+end_src
#+RESULTS:
: lon lat value
: 1 -0.137930 51.51342 1
: 2 -0.137930 51.51342 2
: 3 -0.137930 51.51342 3
: 4 -0.137883 51.51336 1
: 5 -0.137883 51.51336 2
: 6 -0.137853 51.51332 1
* Visualisations
From there, it's easy to represent the previous data:
#+begin_src R :results output graphics :file "map1.png" :exports both :width 600 :height 600 :session *R*
ggplot(cholera,aes(x=lon,y=lat)) +
geom_point(aes(size=count),color="red3") +
geom_point(data=pump, color="blue", size=4) +
theme_bw() + coord_fixed()
#+end_src
#+RESULTS:
[[file:map1.png]]
** With a map of London (current)
But to have a real map, it is better to use the right tool: [[https://journal.r-project.org/archive/2013-1/kahle-wickham.pdf[[ggmap]]. So I read this documentation well to improve things. I will center my map on the median of the previous coordinates.
#+begin_src R :results output :session *R* :exports both
library(ggmap)
mapImageData <- get_map(location = c(lon = median(cholera$lon), lat = median(cholera$lat)),
color = "color",
source = "google",
maptype = "toner",
zoom = 16)
LondonMap = ggmap(mapImageData,
extent = "device",
ylab = "Latitude",
xlab = "Longitude")
#+end_src
#+RESULTS:
#+begin_example
Google Maps API Terms of Service: http://developers.google.com/maps/terms.
Please cite ggmap if you use it: see citation('ggmap') for details.
maptype = "toner" is only available with source = "stamen".
resetting to source = "stamen"...
Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=51.513379,-0.136288&zoom=16&size=640x640&scale=2&maptype=terrain&sensor=false
Map from URL : http://tile.stamen.com/toner/16/32741/21789.png
Map from URL : http://tile.stamen.com/toner/16/32742/21789.png
Map from URL : http://tile.stamen.com/toner/16/32743/21789.png
Map from URL : http://tile.stamen.com/toner/16/32744/21789.png
Map from URL : http://tile.stamen.com/toner/16/32741/21790.png
Map from URL : http://tile.stamen.com/toner/16/32742/21790.png
Map from URL : http://tile.stamen.com/toner/16/32743/21790.png
Map from URL : http://tile.stamen.com/toner/16/32744/21790.png
Map from URL : http://tile.stamen.com/toner/16/32741/21791.png
Map from URL : http://tile.stamen.com/toner/16/32742/21791.png
Map from URL : http://tile.stamen.com/toner/16/32743/21791.png
Map from URL : http://tile.stamen.com/toner/16/32744/21791.png
Warning message:
`panel.margin` is deprecated. Please use `panel.spacing` property instead
#+end_example
Let’s go. As before, in red the cases of Cholera and in blue the
sources of water.
#+begin_src R :results output graphics :file :file "map2.png" :exports both :width 800 :height 800 :session *R*
LondonMap +
geom_point(data=cholera,aes(x=lon,y=lat, size=count),color="red3") +
geom_point(data=pump, color="blue",size=4) + theme_bw()
#+end_src
#+RESULTS:
[[file:map2.png]]
** Clustering
L'histoire, dit qu'une des dames qui a contracté le Choléra faisait un
peu office d'/outlier/ car elle n'habitait pas à proximité de la pompe
infectée. Peut-être est-ce le petit point le plus en haut à droite ?
Peut-être qu'un algorithme de clustering aurait été capable d'aider à
le repérer ? Essayons de faire ça bien avec une bibliothèque faisant
de l'estimation par maximum de vraisemblance de mixtures de
gaussiennes. Pour ça, je m'inspire des [[https://www.statmethods.net/advstats/cluster.html][commandes données ici]].
#+begin_src R :results output :session *R* :exports both
library(mclust)
fit <- Mclust(as.matrix(cholera_flat[,c("lon","lat")]),verbose=F)
#+end_src
#+RESULTS:
: __ ___________ __ _____________
: / |/ / ____/ / / / / / ___/_ __/
: / /|_/ / / / / / / / /\__ \ / /
: / / / / /___/ /___/ /_/ /___/ // /
: /_/ /_/\____/_____/\____//____//_/ version 5.4.1
: Type 'citation("mclust")' for citing this R package in publications.
#+begin_src R :results output graphics :file "clustering_BIC.png" :exports both :width 600 :height 400 :session *R*
plot(fit,what = c("BIC"))
#+end_src
#+RESULTS:
[[file:clustering_BIC.png]]
Ouh là, je ne trouve pas que ça soit bon signe, un BIC qui monte comme
ça sans maximum bien clair.
#+begin_src R :results output :session *R* :exports both
summary(fit)
#+end_src
#+RESULTS:
#+begin_example
----------------------------------------------------
Gaussian finite mixture model fitted by EM algorithm
----------------------------------------------------
Mclust VEV (ellipsoidal, equal shape) model with 7 components:
log.likelihood n df BIC ICL
5523.087 489 35 10829.44 10693.22
Clustering table:
1 2 3 4 5 6 7
121 119 26 73 83 23 44
#+end_example
Let's take a look at what the most likely classification looks like.
#+begin_src R :results output graphics :file "clustering.png" :exports both :width 600 :height 400 :session *R*
plot(fit,what = c("classification"))
#+end_src
#+RESULTS:
[[file:clustering.png]]
Bof, it's a little anything but hey, let's try anyway
to use this information on the previous card:
#+begin_src R :results output graphics :file "map3.png" :exports both :width 800 :height 800 :session *R*
cholera_flat$cluster = as.factor(fit$classification)
cluster = as.data.frame(t(fit$parameters$mean))
cluster$id = 1:nrow(cluster)
cholera_clustered = cholera_flat %>% group_by(lon,lat,cluster) %>% summarize(count=n())
outlier = cholera[cholera$lat == max(cholera$lat),]
LondonMap +
geom_point(data=cholera_clustered,
aes(x=lon, y=lat, size=count, color=cluster)) +
stat_ellipse(data=cholera_flat,aes(x=lon, y=lat, fill=cluster, group=cluster),
type = "norm",geom = "polygon", alpha = .3) +
geom_label(data = cluster, aes(x=lon, y=lat, label=id)) +
geom_label(data = outlier, aes(x=lon, y=lat+0.00012), fill="red", label="outlier",alpha=.3) +
scale_colour_brewer(palette="Set1") +
scale_fill_brewer(palette="Set1") + theme_bw()
#+end_src
#+RESULTS:
[[file:map3.png]]
** Estimation the density
#+begin_src R :results output graphics :file "map4.png" :exports both :width 800 :height 800 :session *R*
LondonMap +
stat_density2d(data=cholera_flat,
aes(x = lon, y = lat, fill = ..level.., alpha = ..level..),
size = 2, bins = 4,
geom = "polygon",
fill="red"
) +
geom_point(data=cholera,aes(x=lon,y=lat, size=count),color="red3") +
geom_point(data=pump, color="blue",size=4) +
geom_label(data = outlier, aes(x=lon, y=lat+0.00012), fill="red", label="outlier",alpha=.3) +
theme_bw()
#+end_src
#+RESULTS:
[[file:map4.png]]
#+begin_src R :results output graphics :file "map5.png" :exports both :width 800 :height 800 :session *R*
LondonMap +
stat_density2d(data=cholera,
aes(x = lon, y = lat, fill = ..level.., alpha = ..level..),
size = 2, bins = 4,
geom = "polygon",
fill="red"
) +
stat_ellipse(data=cholera,aes(x=lon, y=lat),
type = "norm",geom = "polygon", alpha = .3) +
geom_point(data=cholera,aes(x=lon,y=lat, size=count),color="red3") +
geom_point(data=pump, color="blue",size=4) +
geom_label(data = outlier, aes(x=lon, y=lat+0.00012), fill="red", label="outlier",alpha=.3) +
theme_bw()
#+end_src
#+RESULTS:
[[file:map5.png]]
Finally, don't forget that we provide in the resource section of this
MOOC a configuration with a few keyboard shortcuts that allow you to
quickly create code blocks in Python by typing ~<p~, ~<P~ or ~<PP~
followed by ~Tab~.
Now it's your turn! You can delete all this information and replace it
by your computational document.