diff --git a/module3/Peer_Review_cholera.Rmd b/module3/Peer_Review_cholera.Rmd index 8555d9ebcc4426ef95f12ba00f8a88772cb17630..dc9016894d44504e62e18ebfdf3fdd947ac26f3c 100644 --- a/module3/Peer_Review_cholera.Rmd +++ b/module3/Peer_Review_cholera.Rmd @@ -1,10 +1,10 @@ --- title: "Peer_Review_Cholera" -author: "Marc" -date: "15/04/2020" +author: "Marc OUDART" +date: "15-20/04/2020" output: - html_document: default pdf_document: default + html_document: default --- ```{r setup, include=FALSE} @@ -13,9 +13,9 @@ knitr::opts_chunk$set(echo = TRUE) # Instructions -1. From numerical data, draw map in John Snow's spirit. Show death places with markers whose size indicates number of deaths. Show water pumps on the same map with a different symbol. +1. From numerical data, draw a map in John Snow's spirit. Show death places with markers whose size indicates number of deaths. Show water pumps on the same map with a different symbol. -2. Try to find different ways to show that the Broad street pump is at the center of the outbreak. (Density of deaths in the neighbourhood ? Other approaches ?). +2. Try to find different ways to show that the Broad street pump is at the center of the outbreak. (Density of deaths in the neighborhood ? Other approaches ?). 3. Submit work on FUN. @@ -23,11 +23,11 @@ Use `ggmap` with OpenStreetMaps as map template with `source="osm"`. # Importing data -There is already an version of John Snow's data sets and map in the package `HistData` in `Snow.deaths` and `Snow.pumps`. However, the coordinates fit John Snow's map and no other maps. +There is already a version of John Snow's data sets and map in the package `HistData` in `Snow.deaths` and `Snow.pumps`. However, the coordinates fit John Snow's map and no other maps. -First download and extract the zip file from data sets available at this adress : [http://rtwilson.com/downloads/SnowGIS_SHP.zip](http://rtwilson.com/downloads/SnowGIS_SHP.zip). It is the first link on [Robin's blog](http://blog.rtwilson.com/john-snows-cholera-data-in-more-formats/). +First download and extract the zip file from data sets available at this address : [http://rtwilson.com/downloads/SnowGIS_SHP.zip](http://rtwilson.com/downloads/SnowGIS_SHP.zip). It is the first link on [Robin's blog](http://blog.rtwilson.com/john-snows-cholera-data-in-more-formats/). Two files of interest are the `Cholera_Deaths.shp` and the `Pumps.shp` files containing death people and water pumps coordinates. -The next code lines test if the file `Cholera_Deaths.shp`exists. If not, you have to download it manually as I do not know how to unzip documents in R ... +The next code lines test if the file `Cholera_Deaths.shp` exists. If not, you have to download it manually as I do not know how to unzip documents in R ... ```{r} if(file.exists("SnowGIS_SHP/Cholera_Deaths.shp") == FALSE){ print("You must download the zip file as explained above") @@ -48,29 +48,29 @@ if(!require("maptools")){ } ``` Therefore, we can read the data with the `readShapePoints` function : -```{r} +```{r message=FALSE, warning=FALSE} Deaths <- readShapePoints("SnowGIS_SHP/Cholera_Deaths") head(Deaths) ``` We have coordinates, Id (I do not know what it is but I think it is not important) and Count that are the number of deaths at this location. -To extract coordinates, we can use the `coords`function. +To extract coordinates, we can use the `coords` function. ```{r} head(Deaths@coords) ``` -The number of recensed deaths is : +The number of registered deaths is : ```{r} sum(Deaths$Count) ``` -It seems that this is less than recorded deaths (616) displayed on the FUN website but let's suppose that it Robin's fault... +It seems that this is less than recorded deaths (616) displayed on the FUN website but let's suppose that it is Robin's fault... We can also read the file with the Pumps coordinates : -```{r} +```{r message=FALSE, warning=FALSE} Pumps <- readShapePoints("SnowGIS_SHP/Pumps") head(Pumps) ``` There are 6 pumps. -# Extracting london map +# Extracting London map For this purpose we will need the `ggmap` package. The next code lines test if ggmap is installed. If not, it does and load it : @@ -80,19 +80,21 @@ if(!require("ggmap")){ library("ggmap") } ``` -I got the london coordinates centrered on Broad Street on [Stamen Maps website](maps.stamen.com/toner/#17/51.51413/-0.13650 +I got the London coordinates centered on Broad Street on [Stamen Maps website](maps.stamen.com/toner/#17/51.51413/-0.13650 ) : ```{r} -london <- c(left = -0.14454, bottom = 51.51139, right = -0.13119, top = 51.51630) +london <- c(left = -0.143, bottom = 51.51, right = -0.131, top = 51.517) ``` -We will ge the map based on the above coordinates with streets names with the `maptype = "toner"`. +We will get the map based on the above coordinates with streets names with the `maptype = "toner"`. ```{r} london_map = get_stamenmap(london, zoom = 17, maptype = "toner") ``` We can display the map with ggmap. ```{r} map <- ggmap(london_map) +map ``` +Good, it looks like John Snow's map ! # Displaying the Cholera deaths and pumps on the map We will first extract the points and their coordinates from the `.shp` file. @@ -101,10 +103,10 @@ Remember that we put the Deaths points in the `Deaths` variable and the Pumps po Deaths_coord <- data.frame(Deaths@coords) Pumps_coord <- data.frame(Pumps@coords) ``` -Unfortunetely, both coordinates are in _OSGB36 National Grid_ reference while our map is in classic _decimal degrees_ reference. -Fortunetely, I found on the internet this following code that works but I do not know exactly how it works ... +Unfortunately, both coordinates are in _OSGB36 National Grid_ reference while our map is in classic _decimal degrees_ reference. +Fortunately, I found on the internet this following code that works but I do not know exactly how it works ... It seems that it specifies the _Coordinate Reference System (CRS)_ first to the Deaths and Pumps coordinates with the `CRS` function. -And then, it transforms the system to a more classic longitude and lattitude system with the `sptransform` function. +And then, it transforms the system to a more classic longitude and latitude system with the `sptransform` function. ```{r} coordinates(Deaths_coord)=~coords.x1+coords.x2 coordinates(Pumps_coord)=~coords.x1+coords.x2 @@ -124,35 +126,47 @@ head(df_Pumps) ``` Looks fine for both points :)! -Let's project the Deaths points on our map with the `geom_point`function in red and with a size depending on the number of deaths recorded at the particular coordinates : +Let's project the Deaths points on our map with the `geom_point` function in red and with a size depending on the number of deaths recorded at the particular coordinates : ```{r} -map + geom_point(df_Deaths, mapping = aes(coords.x1, coords.x2), color = "red", size = Deaths$Count, alpha = 0.5) +mp <- map + geom_point(df_Deaths, mapping = aes(coords.x1, coords.x2), color = "red", + size = Deaths$Count, alpha = 0.5) +mp ``` Let's add the pumps in another color and symbol : ```{r} -map + geom_point(df_Deaths, mapping = aes(coords.x1, coords.x2), color = "red", size = Deaths$Count, alpha = 0.5) + - geom_point(df_Pumps, mapping = aes(coords.x1, coords.x2), color = "blue", shape = 24, size = 3, fill = "blue") +mp2 <- mp + geom_point(df_Pumps, mapping = aes(coords.x1, coords.x2), color = "blue", + shape = 24, size = 3, fill = "blue") +mp2 ``` -Highlighting the Broad street pump (number 1 in the coordinates list : +Highlighting the Broad street pump (number 1 in the coordinates list) : ```{r} -map + geom_point(df_Deaths, mapping = aes(coords.x1, coords.x2), color = "red", size = Deaths$Count, alpha = 0.5) + - geom_point(df_Pumps, mapping = aes(coords.x1, coords.x2), color = "blue", shape = 24, size = 3, fill = "blue") + - geom_point(df_Pumps, mapping = aes(coords.x1[1], coords.x2[1]), color = "green", shape = 24, size = 3, fill = "green") +mp3 <- mp2 + geom_point(df_Pumps, mapping = aes(coords.x1[1], coords.x2[1]), + color = "green", shape = 24, size = 3, fill = "green") +mp3 ``` +From this map, we can clearly see that the Pump on Broad street is in the center of the outbreak. # Analysis to show that the Broad street pump is at the center of the outbreak We can try to plot a density of Deaths with the `stat_density_2d` function : ```{r} -map + geom_point(df_Deaths, mapping = aes(coords.x1, coords.x2), color = "red", size = Deaths$Count, alpha = 0.5) + - stat_density_2d(data = df_Deaths, aes(coords.x1, coords.x2, fill = after_stat(level)), geom = "polygon", alpha = 0.3) + - geom_point(df_Pumps, mapping = aes(coords.x1, coords.x2), color = "blue", shape = 24, size = 3, fill = "blue") + - geom_point(df_Pumps, mapping = aes(coords.x1[1], coords.x2[1]), color = "green", shape = 24, size = 3, fill = "green") +mp4 <- mp3 + stat_density_2d(data = df_Deaths, aes(coords.x1, coords.x2, + fill = after_stat(level)), + geom = "polygon", alpha = 0.3) + + geom_point(df_Pumps, mapping = aes(coords.x1, coords.x2), color = "blue", + shape = 24, size = 3, fill = "blue") + + geom_point(df_Pumps, mapping = aes(coords.x1[1], coords.x2[1]), color = "green", + shape = 24, size = 3, fill = "green") + + theme(legend.position = "none") +mp4 ``` We can clearly see here that the green pump on broadwick street seems to be at the center of the outbreak. - -We can try to get the coordinates of the maximum density points with function in `MASS` and `raster` libraries : -```{r} + +## Computing distances between deaths density and each Pump + +In order to have numerical data to compare, we can find the maxima of the Deaths density and compute the distances between them and the pumps. +We can try to get the coordinates of the maximum density points with functions in `MASS` and `raster` libraries : +```{r message=FALSE, warning=FALSE} if(!require("MASS")){ install.packages("MASS") library("MASS") @@ -162,8 +176,8 @@ if(!require("raster")){ library("raster") } ``` -As I am no mathematician I copy pasted a code from the internet. -I think that it computes and retrieve the coordiantes of the local maxima from the Deaths data frame. +As I am no mathematician I copied pasted a code from the internet. +I think that it computes and retrieve the coordinates of the local maxima from the Deaths data frame. ```{r} w = matrix(1,3,3) x = kde2d(x = df_Deaths$coords.x1, y = df_Deaths$coords.x2) @@ -178,48 +192,99 @@ maxXY In maxXY there are the coordinates of the 5 local maxima. We can plot them on the map in yellow. ```{r} -map + geom_point(df_Deaths, mapping = aes(coords.x1, coords.x2), color = "red", size = Deaths$Count, alpha = 0.5) + - stat_density_2d(data = df_Deaths, aes(coords.x1, coords.x2, fill = after_stat(level)), geom = "polygon", alpha = 0.3) + - geom_point(df_Pumps, mapping = aes(coords.x1, coords.x2), color = "blue", shape = 24, size = 3, fill = "blue") + - geom_point(df_Pumps, mapping = aes(coords.x1[1], coords.x2[1]), color = "green", shape = 24, size = 3, fill = "green") + - geom_point(maxXY, mapping = aes(x, y), color = "yellow", size = 3, fill = "yellow") +mp5 <- mp4 + geom_point(maxXY, mapping = aes(x, y), color = "yellow", size = 3, fill = "yellow") +mp5 ``` It really seems that the closest pump to the maxima is the one on Broad street. Let's see if we can compute the distance between these maxima and the pumps to show what pump is the closest to what maximum. -If I remember well, the distance between 2 points on a graph can be calculated with the following equation : $$\frac{y2 - y1}{x2 - x1}$$. +If I remember well, the distance between 2 points on a graph can be calculated with the following equation : $$\frac{y2 - y1}{x2 - x1}$$ +In the next code lines, I initialized distance vectors between each 5 local maxima and the 8 pumps named `dist_pumpi` with i the number of the pump. +Then, it computes the distances with local maxima and the pump in a `for` loop to go around all local maxima. There is one `for` loop for each pump. +I think that I could have done one `for` loop in another loop to reduce the code and go around all pumps that way but it was fast the following way. ```{r} +dist_pump0 <- c() dist_pump1 <- c() dist_pump2 <- c() dist_pump3 <- c() dist_pump4 <- c() dist_pump5 <- c() +dist_pump6 <- c() +dist_pump7 <- c() +for (i in c(1:length(maxXY$x))){ + dist_pump0[i] <- abs((maxXY$y[i]-df_Pumps$coords.x2[1])/(maxXY$x[i]-df_Pumps$coords.x1[1])) +} +for (i in c(1:length(maxXY$x))){ + dist_pump1[i] <- abs((maxXY$y[i]-df_Pumps$coords.x2[2])/(maxXY$x[i]-df_Pumps$coords.x1[2])) +} for (i in c(1:length(maxXY$x))){ - dist_pump1[i] <- abs(maxXY$y[i]-df_Pumps$coords.x2[1])/(maxXY$x[i]-df_Pumps$coords.x1[1]) + dist_pump2[i] <- abs((maxXY$y[i]-df_Pumps$coords.x2[3])/(maxXY$x[i]-df_Pumps$coords.x1[3])) } for (i in c(1:length(maxXY$x))){ - dist_pump2[i] <- abs(maxXY$y[i]-df_Pumps$coords.x2[2])/(maxXY$x[i]-df_Pumps$coords.x1[2]) + dist_pump3[i] <- abs((maxXY$y[i]-df_Pumps$coords.x2[4])/(maxXY$x[i]-df_Pumps$coords.x1[4])) } for (i in c(1:length(maxXY$x))){ - dist_pump3[i] <- abs(maxXY$y[i]-df_Pumps$coords.x2[3])/(maxXY$x[i]-df_Pumps$coords.x1[3]) + dist_pump4[i] <- abs((maxXY$y[i]-df_Pumps$coords.x2[5])/(maxXY$x[i]-df_Pumps$coords.x1[5])) } for (i in c(1:length(maxXY$x))){ - dist_pump4[i] <- abs(maxXY$y[i]-df_Pumps$coords.x2[4])/(maxXY$x[i]-df_Pumps$coords.x1[4]) + dist_pump5[i] <- abs((maxXY$y[i]-df_Pumps$coords.x2[6])/(maxXY$x[i]-df_Pumps$coords.x1[6])) } for (i in c(1:length(maxXY$x))){ - dist_pump5[i] <- abs(maxXY$y[i]-df_Pumps$coords.x2[5])/(maxXY$x[i]-df_Pumps$coords.x1[5]) + dist_pump6[i] <- abs((maxXY$y[i]-df_Pumps$coords.x2[7])/(maxXY$x[i]-df_Pumps$coords.x1[7])) } +for (i in c(1:length(maxXY$x))){ + dist_pump7[i] <- abs((maxXY$y[i]-df_Pumps$coords.x2[8])/(maxXY$x[i]-df_Pumps$coords.x1[8])) +} +dist_pump0 dist_pump1 dist_pump2 dist_pump3 dist_pump4 dist_pump5 +dist_pump6 +dist_pump7 ``` +From these vectors, we see that the 2nd vector seems to have the shortest distances. It corresponds to the pump number 1 which is the one on Broad street. + +We can now plot the mean distance of each pump on a barplot to see if it is the case. +```{r} +Mean_dist_pump <- data.frame(index = c(0:7), dist = c(mean(dist_pump0),mean(dist_pump1), + mean(dist_pump2), mean(dist_pump3), + mean(dist_pump4), mean(dist_pump5), + mean(dist_pump6), mean(dist_pump7))) +Sd_dist_pump <- data.frame(index = c(0:7), dist = c(sd(dist_pump0),sd(dist_pump1), + sd(dist_pump2), sd(dist_pump3), + sd(dist_pump4), sd(dist_pump5), + sd(dist_pump6), sd(dist_pump7))) +``` +We call ggplot with `geom_bar` and `geom_error_bar` to plot the mean and its standard deviation. +```{r} +gg <- ggplot(data = Mean_dist_pump, aes(x = index, y = dist)) + + geom_bar(stat = "identity") + + geom_errorbar(aes(ymin=dist-Sd_dist_pump$dist, ymax=dist+Sd_dist_pump$dist)) + + labs( x = "Pump number", y = "Mean distance between local maxima and pump", + title ="Distances of pumps from outbreak") + + scale_x_discrete(limits = c(0:7)) + + scale_x_continuous(breaks = c(-1:50)) +gg +``` +Pump number 1 is on average closer to the local maxima of Deaths and the number 5 is the farthest. +Perhaps, we can try to test if all the pump distance means are statistically greater than the mean distance of pump 1 by testing 2 by 2 with a one-sided Mann-Whitney test with the function `wilcox.test` and the option `alternative = "g"` with "g" standing for "greater". +```{r} +wilcox.test(dist_pump0, dist_pump1, alternative = "g") +wilcox.test(dist_pump2, dist_pump1, alternative = "g") +wilcox.test(dist_pump3, dist_pump1, alternative = "g") +wilcox.test(dist_pump4, dist_pump1, alternative = "g") +wilcox.test(dist_pump5, dist_pump1, alternative = "g") +wilcox.test(dist_pump6, dist_pump1, alternative = "g") +wilcox.test(dist_pump7, dist_pump1, alternative = "g") +``` +Except for pumps number 5, 6 and 7, none of the means are greater than the mean of pump number 1. +# Conclusion - - - - +I think that I kind of showed that the pump on Broad street was in the middle of the outbreak. However, my statistical tests are not really conclusive ... This is because I do not know how to deal with density and distances. The fact that I reduced the density in local maxima lower the tests power and that is why it is not significantly different. + +For the troubles part, I had difficulties in getting the data as I do not have experience with maps in R. At first, I almost believed that there was something wrong with Robin's blog website because some links did not work. But, by going on forums on the internet, I understood that I could read `.shp` files. Then I had real difficulties getting a map. The OpenStreetMaps source in the `ggmap` function did not work as it asked a Google API key ... I got around by getting Stanemmap maps. However, as I finally get the map to work, I realized that the files with the deaths and pumps coordinates had a different coordinate system than the map. I am lucky that the code I copied pasted worked. I think, I would have used the map that was in the folder `SnowGIS_SHP` if it did not work. Next steps was quite easy as I do know how to plot with `ggplot`. diff --git a/module3/Peer_Review_cholera.pdf b/module3/Peer_Review_cholera.pdf index f0142339c2b175716098341c3a721bace48de6ad..6af1f4b46029e238079bb765d964e88094352344 100644 Binary files a/module3/Peer_Review_cholera.pdf and b/module3/Peer_Review_cholera.pdf differ