1. From numerical data, draw map in John Snow's spirit. Show death places with markers whose size indicates number of deaths. Show water pumps on the same map with a different symbol.
1. From numerical data, draw a map in John Snow's spirit. Show death places with markers whose size indicates number of deaths. Show water pumps on the same map with a different symbol.
2. Try to find different ways to show that the Broad street pump is at the center of the outbreak. (Density of deaths in the neighbourhood ? Other approaches ?).
2. Try to find different ways to show that the Broad street pump is at the center of the outbreak. (Density of deaths in the neighborhood ? Other approaches ?).
3. Submit work on FUN.
...
...
@@ -23,11 +23,11 @@ Use `ggmap` with OpenStreetMaps as map template with `source="osm"`.
# Importing data
There is already an version of John Snow's data sets and map in the package `HistData` in `Snow.deaths` and `Snow.pumps`. However, the coordinates fit John Snow's map and no other maps.
There is already a version of John Snow's data sets and map in the package `HistData` in `Snow.deaths` and `Snow.pumps`. However, the coordinates fit John Snow's map and no other maps.
First download and extract the zip file from data sets available at this adress : [http://rtwilson.com/downloads/SnowGIS_SHP.zip](http://rtwilson.com/downloads/SnowGIS_SHP.zip). It is the first link on [Robin's blog](http://blog.rtwilson.com/john-snows-cholera-data-in-more-formats/).
First download and extract the zip file from data sets available at this address : [http://rtwilson.com/downloads/SnowGIS_SHP.zip](http://rtwilson.com/downloads/SnowGIS_SHP.zip). It is the first link on [Robin's blog](http://blog.rtwilson.com/john-snows-cholera-data-in-more-formats/).
Two files of interest are the `Cholera_Deaths.shp` and the `Pumps.shp` files containing death people and water pumps coordinates.
The next code lines test if the file `Cholera_Deaths.shp`exists. If not, you have to download it manually as I do not know how to unzip documents in R ...
The next code lines test if the file `Cholera_Deaths.shp`exists. If not, you have to download it manually as I do not know how to unzip documents in R ...
# Displaying the Cholera deaths and pumps on the map
We will first extract the points and their coordinates from the `.shp` file.
...
...
@@ -101,10 +103,10 @@ Remember that we put the Deaths points in the `Deaths` variable and the Pumps po
Deaths_coord <- data.frame(Deaths@coords)
Pumps_coord <- data.frame(Pumps@coords)
```
Unfortunetely, both coordinates are in _OSGB36 National Grid_ reference while our map is in classic _decimal degrees_ reference.
Fortunetely, I found on the internet this following code that works but I do not know exactly how it works ...
Unfortunately, both coordinates are in _OSGB36 National Grid_ reference while our map is in classic _decimal degrees_ reference.
Fortunately, I found on the internet this following code that works but I do not know exactly how it works ...
It seems that it specifies the _Coordinate Reference System (CRS)_ first to the Deaths and Pumps coordinates with the `CRS` function.
And then, it transforms the system to a more classic longitude and lattitude system with the `sptransform` function.
And then, it transforms the system to a more classic longitude and latitude system with the `sptransform` function.
```{r}
coordinates(Deaths_coord)=~coords.x1+coords.x2
coordinates(Pumps_coord)=~coords.x1+coords.x2
...
...
@@ -124,35 +126,47 @@ head(df_Pumps)
```
Looks fine for both points :)!
Let's project the Deaths points on our map with the `geom_point`function in red and with a size depending on the number of deaths recorded at the particular coordinates :
Let's project the Deaths points on our map with the `geom_point`function in red and with a size depending on the number of deaths recorded at the particular coordinates :
From these vectors, we see that the 2nd vector seems to have the shortest distances. It corresponds to the pump number 1 which is the one on Broad street.
We can now plot the mean distance of each pump on a barplot to see if it is the case.
labs( x = "Pump number", y = "Mean distance between local maxima and pump",
title ="Distances of pumps from outbreak") +
scale_x_discrete(limits = c(0:7)) +
scale_x_continuous(breaks = c(-1:50))
gg
```
Pump number 1 is on average closer to the local maxima of Deaths and the number 5 is the farthest.
Perhaps, we can try to test if all the pump distance means are statistically greater than the mean distance of pump 1 by testing 2 by 2 with a one-sided Mann-Whitney test with the function `wilcox.test` and the option `alternative = "g"` with "g" standing for "greater".
```{r}
wilcox.test(dist_pump0, dist_pump1, alternative = "g")
wilcox.test(dist_pump2, dist_pump1, alternative = "g")
wilcox.test(dist_pump3, dist_pump1, alternative = "g")
wilcox.test(dist_pump4, dist_pump1, alternative = "g")
wilcox.test(dist_pump5, dist_pump1, alternative = "g")
wilcox.test(dist_pump6, dist_pump1, alternative = "g")
wilcox.test(dist_pump7, dist_pump1, alternative = "g")
```
Except for pumps number 5, 6 and 7, none of the means are greater than the mean of pump number 1.
# Conclusion
I think that I kind of showed that the pump on Broad street was in the middle of the outbreak. However, my statistical tests are not really conclusive ... This is because I do not know how to deal with density and distances. The fact that I reduced the density in local maxima lower the tests power and that is why it is not significantly different.
For the troubles part, I had difficulties in getting the data as I do not have experience with maps in R. At first, I almost believed that there was something wrong with Robin's blog website because some links did not work. But, by going on forums on the internet, I understood that I could read `.shp` files. Then I had real difficulties getting a map. The OpenStreetMaps source in the `ggmap` function did not work as it asked a Google API key ... I got around by getting Stanemmap maps. However, as I finally get the map to work, I realized that the files with the deaths and pumps coordinates had a different coordinate system than the map. I am lucky that the code I copied pasted worked. I think, I would have used the map that was in the folder `SnowGIS_SHP` if it did not work. Next steps was quite easy as I do know how to plot with `ggplot`.