This notebook aims to be an introduction to Geographical Information Systems (GIS) in R and is based on the corresponding 2 day course which is taught in ArcGIS. It relies on the simple features (sf) package. Simple features or simple feature access is a standard that describes how spatial objects can be represented and stored. More on this at https://r-spatial.github.io/sf/articles/sf1.html. We will also make use of the tidyverse collection of packages, especially ggplot2 for plotting maps, dplyr for data transformation and pipes, so it is advisable that you have at least some knowledge of the tidyverse before going through this notebook. We will also need the datasets from the monkeypox, london-pct and crypto folders which can be found in the LSHTM U drive under downloads/teach/GIS/ArcGIS.
Let’s start by loading our packages (installing them beforehand if necessary) and setting our working directory. Make sure this file is is the directory as the monkeypox, london-pct and crypto folders.
#install.packages("sf")
#install.packages("tidyverse")
library(sf)
library(tidyverse)
Before we start: We need to choose a location to save our data, and tell R where to look for the dataets we will be using. In your file manager, create to a new folder and copy the folders “monkeypox”, “London-PCT”, and “crypto” into your new location.
These files can be found at U:/SHARED/DATA/Download/Teach/GIS/ArcGIS
Now, tell R studio where to find your files by setting your working directory to the location where you copied these folders.
setwd('path/to/your/folders')
Let’s start by loading our data using the st_read function from the sf package and plotting it using ggplot. If we view the contents of our newly created dataframe we can see that the spatial data is stored as a variable called geometry.
#Windows: zaire <- st_read("H:/GIS/ArcGIS/monkeypox/zaire.shp")
zaire <- st_read("monkeypox/ZAIRE.SHP")
## Reading layer `ZAIRE' from data source `/Users/hamishgibbs/Documents/GIS_in_R/monkeypox/ZAIRE.SHP' using driver `ESRI Shapefile'
## Simple feature collection with 13 features and 1 field
## geometry type: POLYGON
## dimension: XY
## bbox: xmin: 12.20663 ymin: -13.45568 xmax: 31.30591 ymax: 5.386098
## epsg (SRID): 4326
## proj4string: +proj=longlat +datum=WGS84 +no_defs
ggplot(zaire) +
geom_sf()
Let’s load some of the other datasets and create a plot with multiple layers, coloring the forest green.
villages <- st_read("monkeypox/VILLAGES.SHP")
## Reading layer `VILLAGES' from data source `/Users/hamishgibbs/Documents/GIS_in_R/monkeypox/VILLAGES.SHP' using driver `ESRI Shapefile'
## Simple feature collection with 48 features and 3 fields
## geometry type: POINT
## dimension: XY
## bbox: xmin: 22.94857 ymin: -4.116706 xmax: 24.84879 ymax: -3.102252
## epsg (SRID): 4326
## proj4string: +proj=longlat +datum=WGS84 +no_defs
roads <- st_read("monkeypox/ROADS.SHP")
## Reading layer `ROADS' from data source `/Users/hamishgibbs/Documents/GIS_in_R/monkeypox/ROADS.SHP' using driver `ESRI Shapefile'
## Simple feature collection with 1202 features and 2 fields
## geometry type: LINESTRING
## dimension: XY
## bbox: xmin: 22.18297 ymin: -5 xmax: 25.79088 ymax: -1.594207
## epsg (SRID): 4326
## proj4string: +proj=longlat +datum=WGS84 +no_defs
rivers <- st_read("monkeypox/RIVERS.SHP")
## Reading layer `RIVERS' from data source `/Users/hamishgibbs/Documents/GIS_in_R/monkeypox/RIVERS.SHP' using driver `ESRI Shapefile'
## Simple feature collection with 476 features and 2 fields
## geometry type: LINESTRING
## dimension: XY
## bbox: xmin: 22.20613 ymin: -5 xmax: 25.52832 ymax: -1.876943
## epsg (SRID): 4326
## proj4string: +proj=longlat +datum=WGS84 +no_defs
forest <- st_read("monkeypox/FOREST.SHP")
## Reading layer `FOREST' from data source `/Users/hamishgibbs/Documents/GIS_in_R/monkeypox/FOREST.SHP' using driver `ESRI Shapefile'
## Simple feature collection with 75 features and 2 fields
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: 22.4703 ymin: -4.929236 xmax: 25.378 ymax: -1.901003
## epsg (SRID): 4326
## proj4string: +proj=longlat +datum=WGS84 +no_defs
ggplot() +
geom_sf(data = forest, fill = "green") +
geom_sf(data = roads) +
geom_sf(data = villages)
Now let’s make the following adjustments:
ggplot() +
geom_sf(data = forest, aes(fill = VEGNTX)) +
scale_fill_brewer(palette = "Greens") +
theme_bw() +
geom_sf(data = roads) +
geom_sf(data = villages, aes(size = POPULATION), colour = "red", show.legend = "point") +
coord_sf(xlim = c(23, 25), ylim = c(-2.5, -4.5), expand = FALSE)
We can also plot our data on an interactive map using the leaflet package. This can be useful to explore maps like in ArcGIS, however is less useful for producing publication grade maps, so for the rest of the notebook we will stick with ggplot.
#install.packages("mapview")
library(leaflet)
leaflet(forest) %>%
addTiles() %>%
addPolygons(color = 'green',
fillColor = 'green')
What we would call “layer attributes” in ArcGIS is stored together with our geometries in data frames, so to examine the contents we can use functions such as view(), glimpse(), head(), summary() just as we would with any other dataset in R.
head(forest)
## Simple feature collection with 6 features and 2 fields
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: 22.56791 ymin: -2.84007 xmax: 23.71322 ymax: -2.200254
## epsg (SRID): 4326
## proj4string: +proj=longlat +datum=WGS84 +no_defs
## VEGN VEGNTX geometry
## 1 130 Degraded rain forest MULTIPOLYGON (((23.34719 -2...
## 2 130 Degraded rain forest MULTIPOLYGON (((22.59153 -2...
## 3 130 Degraded rain forest MULTIPOLYGON (((23.60696 -2...
## 4 130 Degraded rain forest MULTIPOLYGON (((23.21732 -2...
## 5 130 Degraded rain forest MULTIPOLYGON (((23.48888 -2...
## 6 130 Degraded rain forest MULTIPOLYGON (((23.71322 -2...
glimpse(roads)
## Observations: 1,202
## Variables: 3
## $ RDLNTYPE <int> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 2, 2, 3, 3, …
## $ RDLNSTAT <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ geometry <LINESTRING [°]> LINESTRING (25.05655 -2.054..., LINESTRING (25.05…
summary(villages)
## ID NAME POPULATION geometry
## Min. : 1.00 Longala : 2 Min. : 48.0 POINT :48
## 1st Qu.:12.75 Ohoma : 2 1st Qu.: 298.5 epsg:4326 : 0
## Median :31.50 Akungula: 1 Median : 474.0 +proj=long...: 0
## Mean :34.35 Demba : 1 Mean : 1087.9
## 3rd Qu.:50.25 Dikunga : 1 3rd Qu.: 929.0
## Max. :77.00 Dimanga : 1 Max. :14269.0
## (Other) :40
We join tables using the family of join functions from dplyr, most commonly with left_join() which joins one table to columns from another, matching values with the rows that they correspond to.
Let’s add some more datasets from a shapefile and from some dbf files. We can do this with read.dbf from the foreign package.
province <- st_read("monkeypox/PROVINCE.SHP")
## Reading layer `PROVINCE' from data source `/Users/hamishgibbs/Documents/GIS_in_R/monkeypox/PROVINCE.SHP' using driver `ESRI Shapefile'
## Simple feature collection with 31 features and 1 field
## geometry type: POLYGON
## dimension: XY
## bbox: xmin: 12.20663 ymin: -13.45568 xmax: 31.30591 ymax: 5.386098
## epsg (SRID): 4326
## proj4string: +proj=longlat +datum=WGS84 +no_defs
provdata <- foreign::read.dbf("monkeypox/PROVDATA.DBF")
health <- foreign::read.dbf("monkeypox/HEALTH.DBF")
First let’s join village and health. The second dataset will be matched to the first one by the variable “ID” which is present in both datasets. We could assign the output of the left_join to the existing villages object, a new object or we can pipe the output directly into ggplot.
left_join(villages, health) %>%
ggplot() +
geom_sf(aes(size = RATE))
## Joining, by = "ID"
Now let’s join provdata to provinces, plot the resulting object and colour the provinces by population.
left_join(province, provdata) %>%
ggplot() +
geom_sf(aes(fill=POP94))
## Joining, by = "NAME"
We can also use the plot() function where we can also set the breaks to be quantiles.
provjoined <- left_join(province, provdata)
## Joining, by = "NAME"
plot(provjoined["POP94"], breaks = "quantile")
It’s a bit more complicated displaying quantiles in ggplot. Here is a method using the classInt package to determine breaks.
library(classInt)
breaks_qt <- classIntervals(c(min(provjoined$POP94) - .00001, provjoined$POP94), n = 7, style = "quantile")
breaks_qt
## style: quantile
## one of 736,281 possible partitions of this variable into 7 classes
## [183679,519080.7) [519080.7,849968.3) [849968.3,1035645) [1035645,1205096)
## 5 4 5 4
## [1205096,1357066) [1357066,2117232) [2117232,3485330]
## 5 4 5
provjoined <- mutate(provjoined, POP94cut = cut(POP94, breaks_qt$brks, dig.lab = 7))
#use dig.lab to limit the decimal places so the legend doesn't display in scientific notation.
ggplot(provjoined) +
geom_sf(aes(fill=POP94cut)) +
scale_fill_brewer(palette = "Greens")
We can select observations (rows in a data frame) in R using dplyr’s filter() function. We start by joining health to villages again, then we filter the resulting object to only include villages which have at least one case of chickenpox. We then pipe the resulting object directly into ggplot to plot said villages.
left_join(villages, health) %>%
filter(N_VZV_IGM >=1) %>%
ggplot() +
geom_sf()
## Joining, by = "ID"
Now let’s prectice what we’ve learned so far using health data from the greater London area from the London Health Observatory. If we think our Enivironment is getting too cluttered by now we can always clear all objects from the workspace using the command rm(list = ls()).
rm(list = ls())
londonla <- st_read("London-PCT/london_la.shp")
## Reading layer `london_la' from data source `/Users/hamishgibbs/Documents/GIS_in_R/London-PCT/london_la.shp' using driver `ESRI Shapefile'
## Simple feature collection with 33 features and 8 fields
## geometry type: POLYGON
## dimension: XY
## bbox: xmin: 503568.1 ymin: 155850.8 xmax: 561957.4 ymax: 200933.9
## epsg (SRID): NA
## proj4string: NA
londonpct <- st_read("London-PCT/london_pct.shp")
## Reading layer `london_pct' from data source `/Users/hamishgibbs/Documents/GIS_in_R/London-PCT/london_pct.shp' using driver `ESRI Shapefile'
## Simple feature collection with 31 features and 5 fields
## geometry type: POLYGON
## dimension: XY
## bbox: xmin: 503568.1 ymin: 155850.8 xmax: 561957.4 ymax: 200933.9
## epsg (SRID): NA
## proj4string: NA
londonha <- st_read("London-PCT/london_ha.shp")
## Reading layer `london_ha' from data source `/Users/hamishgibbs/Documents/GIS_in_R/London-PCT/london_ha.shp' using driver `ESRI Shapefile'
## Simple feature collection with 16 features and 5 fields
## geometry type: POLYGON
## dimension: XY
## bbox: xmin: 503568.1 ymin: 155850.8 xmax: 561957.4 ymax: 200933.9
## epsg (SRID): NA
## proj4string: NA
tb <- foreign::read.dbf("London-PCT/tb.dbf")
chd <- foreign::read.dbf("London-PCT/chd.dbf")
obesity <- foreign::read.dbf("London-PCT/obesity.dbf")
vaccinations <- foreign::read.dbf("London-PCT/vaccinations.dbf")
imd <- foreign::read.dbf("London-PCT/imd.dbf")
#If there are no identical variables in two objects we wish to join, we need to specify on which variables to join using "by".
left_join(londonha, obesity, by = c("HA95_CODE" = "HA_CODE")) %>%
ggplot() +
geom_sf(aes(fill=RATE)) +
scale_fill_gradient(low = "yellow", high = "brown") +
labs(title = "Obesity in London") +
geom_sf_label(aes(label=HA_NAME), size=2)
#We can use scale_fill_gradient to reverse the deafault blue shading, picking the colours low = "#56B1F7" and high = "#132B43", or we can choose whichever colors we like for the high and low values.
#To add text or label we can use geom_sf_text() and geom_sf_label()
Use the following empty chunk to create maps of:
We will now look at ways of creating or modifying datasets to suit our mapping needs.
Let’s import our datasets. Remeber you can clear all the previous objects from your workspace with rm(list = ls()). We’re going to need the package readxl to read xls files.
rm(list = ls())
library(readxl)
crypto <- read.csv("crypto/crypto.txt", stringsAsFactors=FALSE)
serve <- read.csv("crypto/SERVE.TXT", stringsAsFactors=FALSE)
wards <- st_read("crypto/WARDS.SHP")
## Reading layer `WARDS' from data source `/Users/hamishgibbs/Documents/GIS_in_R/crypto/WARDS.SHP' using driver `ESRI Shapefile'
## Simple feature collection with 779 features and 3 fields
## geometry type: POLYGON
## dimension: XY
## bbox: xmin: 503903 ymin: 155850 xmax: 561532.7 ymax: 201718
## epsg (SRID): NA
## proj4string: NA
gla <- st_read("crypto/gla.shp")
## Reading layer `gla' from data source `/Users/hamishgibbs/Documents/GIS_in_R/crypto/gla.shp' using driver `ESRI Shapefile'
## Simple feature collection with 1 feature and 2 fields
## geometry type: POLYGON
## dimension: XY
## bbox: xmin: 503903 ymin: 155850 xmax: 561952 ymax: 201718
## epsg (SRID): 27700
## proj4string: +proj=tmerc +lat_0=49 +lon_0=-2 +k=0.9996012717 +x_0=400000 +y_0=-100000 +ellps=airy +towgs84=446.448,-125.157,542.06,0.15,0.247,0.842,-20.489 +units=m +no_defs
coast <- st_read("crypto/COAST.SHP")
## Reading layer `COAST' from data source `/Users/hamishgibbs/Documents/GIS_in_R/crypto/COAST.SHP' using driver `ESRI Shapefile'
## Simple feature collection with 10 features and 1 field
## geometry type: POLYGON
## dimension: XY
## bbox: xmin: 134160 ymin: 11500 xmax: 655520 ymax: 976780
## epsg (SRID): NA
## proj4string: NA
gb_hosp <- read.csv("crypto/GB_HOSP.TXT", stringsAsFactors=FALSE)
pop <- read.csv("crypto/POP.TXT", stringsAsFactors=FALSE)
ward <- read_excel("crypto/WARD.XLS")
The geom_sf function needs a geometry object to plot the data but gb_hosp only has X and Y coordinates as seperate columns. Let’s fix that. We create a new object which contains the spatial data in sf format using the st_as_sf function.
gb_hosp_sf <- st_as_sf(gb_hosp, coords = c("X_coord", "Y_coord"))
ggplot() +
geom_sf(data = coast) +
geom_sf(data = gb_hosp_sf)
What if we only want the hospitals that are in the Greater London Area? We use the gla object to help us with that. But first, in order to use gb_hosp_sf with gla together they need to have the same coordinate reference system (crs). Otherwise we get the error “sfc object should have crs set” or “st_crs(x) == st_crs(y) is not TRUE”. This tells us the crs of the two objects don’t match and that the simple features collumn (sfc) of one of our datasets needs to have a coordinate reference system (crs) assigned. Try running the code in the chunk after this one to see for yourself. Let’s deal with this now:
#As gb_hosp_sf was created using just X and Y coordinates it doesn't have a crs assigned. Let's check the crs of the two objects using st_crs().
st_crs(gla)
## Coordinate Reference System:
## EPSG: 27700
## proj4string: "+proj=tmerc +lat_0=49 +lon_0=-2 +k=0.9996012717 +x_0=400000 +y_0=-100000 +ellps=airy +towgs84=446.448,-125.157,542.06,0.15,0.247,0.842,-20.489 +units=m +no_defs"
st_crs(gb_hosp_sf)
## Coordinate Reference System: NA
#For gla we get a lot of information including an EPSG number. For gb_hosp_sf we just get "NA".
#We now set the coordinate reference system (CRS) of gb_hosp_sf to that of gla_hosp_crypto.
st_crs(gb_hosp_sf) <- 27700
#Let's do the same for gla, just to be sure.
st_crs(gla) <- 27700
Now back to the task at hand; getting only the hospitals that are in the Greater London Area. We can use a geometry operation called st_intersection to do the trick. It returns only those points that share a location with gla. We can find a cheatsheet with all kinds of commands for spatial manipulation here: https://github.com/rstudio/cheatsheets/blob/master/sf.pdf
gla_hosp <- st_intersection(gb_hosp_sf, gla)
## Warning: attribute variables are assumed to be spatially constant throughout all
## geometries
ggplot() +
geom_sf(data = gla) +
geom_sf(data = gla_hosp)
Now let’s look at merging small areas together into larger ones. We will look at catchment areas. A catchment area consits of all the different wards that are served by the same hospital, i.e.: have the same hospital code. Let’s start by plotting the wards coloring them by hospital code.
left_join(wards, serve, by = c("WARD" = "Ward")) %>%
filter(Hosp_Code>=0) %>%
ggplot() +
geom_sf(aes(fill=as.factor(Hosp_Code)), show.legend = F)
## Warning: Column `WARD`/`Ward` joining factor and character vector, coercing into
## character vector
Now if we want to merge the wards together to create catchment areas we can just use the dplyr functions group_by and summarise, to group the wards by hospital codes. Let’s assign the output to a new objects and plot it together with the locations of the hospitals.
serveshp <- left_join(wards, serve, by = c("WARD" = "Ward")) %>%
filter(Hosp_Code>=0) %>%
group_by(Hosp_Code) %>%
summarise()
## Warning: Column `WARD`/`Ward` joining factor and character vector, coercing into
## character vector
#Again we need to change the crs of our newly created object to match that of gla.
st_crs(serveshp) <- 27700
ggplot() +
geom_sf(data = serveshp) +
geom_sf(data = gla_hosp)
Now let’s look at altering tables. When working with any type of data it is important to know:
We can get the number of rows (observations) and columns (variables) using the glimpse() function. The summary() function will give us summary statistics for each variable and will also include the number of missing values. We need to watch out if any missing values are coded in a different format, e.g. -999 as R won’t recongize these as missing values.
glimpse(crypto)
## Observations: 29
## Variables: 3
## $ hosp_code <int> 20, 47, 54, 71, 76, 80, 81, 82, 100, 116, 167, 168, 170, 19…
## $ hosp_name <chr> "FARNBOROUGH HOSPITAL", "MAYDAY HOSPITAL", "QUEEN MARYS HOS…
## $ n_cases <int> 80, 50, 102, 12, 38, 17, 26, 86, 33, 217, 121, 71, 25, 4, 4…
summary(crypto)
## hosp_code hosp_name n_cases
## Min. : 20.0 Length:29 Min. : 2
## 1st Qu.: 82.0 Class :character 1st Qu.: 13
## Median :201.0 Mode :character Median : 38
## Mean :189.4 Mean : 66
## 3rd Qu.:229.0 3rd Qu.: 86
## Max. :780.0 Max. :292
We can use the duplicated() function to check if any values for a certain variable are duplicated. It will return TRUE at the position of the duplicated values. To extract the duplicated values we can use subsetting. We can remove duplicates from a variable using distinct().
duplicated(crypto$hosp_code)
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [25] FALSE TRUE FALSE FALSE FALSE
crypto$hosp_code[duplicated(crypto$hosp_code)]
## [1] 211
distinct(crypto, hosp_code)
## hosp_code
## 1 20
## 2 47
## 3 54
## 4 71
## 5 76
## 6 80
## 7 81
## 8 82
## 9 100
## 10 116
## 11 167
## 12 168
## 13 170
## 14 199
## 15 201
## 16 207
## 17 209
## 18 211
## 19 218
## 20 228
## 21 229
## 22 252
## 23 253
## 24 256
## 25 268
## 26 269
## 27 270
## 28 780
We find that the hospital code 211 is present twice, but the rows have different case numbers. We don’t know exactly how this happened but for now let’s just ignore the duplicate value.
For the purpose of this exercise we will create a new object called rates from joining together multiple objects. We then add a new rates variable using mutate. If we ever want to export an object as a shapefile we can do this using the st_write function.
rates <- left_join(wards, pop, by=c("WARD"="Ward"))
## Warning: Column `WARD`/`Ward` joining factor and character vector, coercing into
## character vector
rates <- left_join(rates, serve, by=c("WARD"="Ward"))
rates <- left_join(rates, crypto, by=c("Hosp_Code"="hosp_code"))
rates <- rates %>% group_by(Hosp_Code) %>% summarise(population = sum(Population), cases = sum(n_cases[is.na(n_cases)==F]))
rates <- rates %>% mutate(rates=(cases/population)*1000)
#Again, we set the crs of the newly created object to match that of gla.
st_crs(rates) <- 27700
Let’s plot our new rates objects using the rates variable as fill.
ggplot() +
geom_sf(data = rates, aes(fill=rates)) +
scale_fill_gradient(low = "#56B1F7", high = "#132B43")
Now let’s try and plot rates together with the the hospitals. We first join gla_hosp to crypto so we can also plot the number of cases at each hospital and then plot this new object together with the rates object. We plot our object twice, the hospitals without cases in grey, and the hospitals with cases in black with the size of the dots relating to the number of cases.
gla_hosp_crypto <- left_join(gla_hosp, crypto, by=c("Hosp_Code"="hosp_code"))
ggplot() +
geom_sf(data=rates, aes(fill=rates)) +
scale_fill_gradient(low = "#56B1F7", high = "#132B43") +
geom_sf(data = gla_hosp_crypto[is.na(gla_hosp_crypto$n_cases)==T,], colour="grey", show.legend = "point") +
geom_sf(data = gla_hosp_crypto[is.na(gla_hosp_crypto$n_cases)==F,], aes(size = n_cases), show.legend = "point")
#We use squard brackets to subset our data on whether the n_cases variable is "NA" for each row. We use the trailing comma before we close our square brackets to indicate that we are subsetting by rows and not by columns.
Looks good! Now it’s time to get our map ready for publication.
Let’s add a few finishing touches to our map using the packages ggthemes and ggspatial.
#install.packages("ggthemes")
#install.packages("ggspatial")
library(ggthemes)
library(ggspatial)
ggplot() +
geom_sf(data=rates, aes(fill=rates)) +
scale_fill_gradient(low = "#56B1F7", high = "#132B43") +
geom_sf(data = gla_hosp_crypto[is.na(gla_hosp_crypto$n_cases)==T,], colour="grey", show.legend = "point") +
geom_sf(data = gla_hosp_crypto[is.na(gla_hosp_crypto$n_cases)==F,], aes(size = n_cases), show.legend = "point") +
theme_bw() +
annotation_scale(location = "bl", width_hint = 0.4) +
annotation_north_arrow(location = "bl", which_north = "true", pad_x = unit(0.2, "in"), pad_y = unit(0.2, "in"), style = north_arrow_fancy_orienteering) +
theme(panel.grid.major = element_line(color = gray(0.2), linetype = "dashed", size = 0.5), panel.background = element_rect(fill = "mintcream"))
#We can save the map to our working directory using ggsave in a variety of file formats.
ggsave("londoncryptorates.png")
## Saving 7 x 5 in image
And there we have it! Beautiful maps, all done in R.