knitr::opts_chunk$set(warning = FALSE, message = FALSE) 

Introduction

We will make use of the datasets available in the package SpatialEpi. We will focus on an exploratory data analysis of lip cancer recurrence in Scotland. Information available from the SpatialEpi package includes data for lip cancer among males in Scotland from 1975 to 1980, subdivided by (then) 56 shires (nowadays, 33 administrative counties).

Question: Can we find interesting patterns for further analysis in lip cancer cases at Scottish county level?

Epidemiological studies show a high incidence of lip cancer risk factors for agricultural workers: Khuder SA. Etiologic clues to lip cancer from epidemiologic studies on farmers. Scand J Work Environ Health. 1999 Apr;25(2):125-30. doi: 10.5271/sjweh.414. PMID: 10360467. & Nordby, K. C., et al. “Incidence of Lip Cancer in the Male Norwegian Agricultural Population.” Cancer Causes & Control, vol. 15, no. 6, 2004, pp. 619–26. JSTOR, http://www.jstor.org/stable/3554234. Accessed 16 Jan. 2023.

Secondly, with the former; we will aim to analyse whether there’s a relationship between the proportion of the population who work in agricultural fishing and farming and the number of cancer lips cases.

#rm(list=ls())
#setwd("C:/Users/santi/Documents/M_DSDM/Term_2/Spatial/A_1")

library(osmdata)
library(ggplot2)
library(leaflet)
library(SpatialEpi)
library(kableExtra)
library(vtable)
library(sp)
library(spdep)
library(tmap)
library(sf)
library(viridis)
library(mapview)
library(dplyr)
library(gridExtra)
library(grid)
library(leaflet.extras2)

Exploratory Data Analysis

Descriptive information

sumstat <- st(scotland$data, 
          vars = c('cases','expected','AFF'), 
          summ = list(c('notNA(x)','mean(x)','sd(x)','min(x)','pctile(x)[25]','median(x)','pctile(x)[75]','max(x)')),
          summ.names = list(
     c('N','Mean','SD','Min','25%','Median','75%','Max')),
     labels = c('Cases','Expected Cases','% of workforce*'),
     title = 'Summary Statistics',
     note = 'Scotland (1975 - 1980). *Participation of agricultural, fishing, and forestry workers.')

sumstat
Summary Statistics
Variable N Mean SD Min 25% Median 75% Max
Cases 56 9.571 7.908 0 4.75 8 11 39
Expected Cases 56 9.575 13.177 1.1 4.05 6.3 10.125 88.7
% of workforce* 56 0.087 0.068 0 0.01 0.07 0.115 0.24
Scotland (1975 - 1980). *Participation of agricultural, fishing, and forestry workers.

The Scotland dataframe comes directly with 3 variables. Observed cases, expected cases, and percentage of workforce employed in agriculture, fishing, and forestry (AFF). The expected number of lip cancer patients has been computed using indirect standardization for Scotland’s general disease rates, i.e: expected lip cancer cases if each shire (county) behaved like the whole population. Observed cases account for the registered scenarios in real life. Lastly, AFF participation is included as a covariate risk factor for the incidence of lip cancer amongst males for the second half of the 70’s.

When comparing trend statistics for expected and observed cases, we see that mean values for both these variables are closely computed. However, this similarity is not found when comparing distributions. Expected cases have a considerably higher standard deviation, lower median, and higher max point (p100) than the cases in actuality. This in return overestimates the amount of cases a population model will output when applied to county-level projections. Overall, the distribution of expected cases is significantly larger than that of observed cases. This is supported by comparing minimum values, quantiles, and the standard deviation.

Observing the variable for workforce participation we notice that some county areas in Scotland have no participation in agriculture, fishing, or forestry; supported by the minimum observation. Likewise the maximum value observed represents almost a quarter of the entire labor is focused on these sectors. From what was mentioned earlier, part of the EDA will focus on evaluating correlation of these observations with the incidence of lip cancer.

Preliminary correlation

p1 <- ggplot(data = scotland$data, aes(x = AFF, y = cases)) +
  geom_jitter() +
  theme_minimal() + theme(axis.title.x = element_blank())

p2 <- ggplot(data = scotland$data, aes(x = AFF, y = expected)) +
  geom_jitter() +
  theme_minimal()

grid.arrange(p1, p2, nrow=2,top=textGrob("Actual and expected cases vs AFF"))

In regard to what was mentioned in the introduction, our share of worker positions in AFF sectors do not expose an evident pattern between this variable and the registration of lip cancer cases for each shire. However, the presence of ‘clouded’ data around different workforce participation levels may encourage further evaluation of geographical patterns and epidemiological characteristics for the cancer morbidity.

data(scotland)
data <- scotland$data
map <- scotland$spatial.polygon

d <- scotland$data

names(d) <- c("county", "Y", "E", "AFF")

Expected cases

The expected cases were already computed and we couldn’t find any detailed information about how they were calculated in the documentation of the package SpatialEpi from where the data was retrieved. The expected number of cases in each county represents the total number of disease cases one would expect if the population in the county behaved the way the overall Scottish population behaves.

The expected counts Ei in each county i can be calculated using indirect standardization as:

\[ E_i = \sum_{j=1}^{m} r^{(s)}_j n^{(i)}_j \]

where \(r^{(s)}_j\) is the rate (number of cases divided by population) in stratum j in the population of Scotland, and \(n^{(i)}_j\) is the population in stratum j of county i.

SIR

For each county i the SIR (Standardized Incidence Rate) is defined as the ratio of observed counts to the expected counts. \[ SIR_i=\frac{Y_i}{E_i}. \]

If the SIR is 1, then the number of actual incidence is the same as the number of predicted incidence.

If the SIR is less than 1, then the number of actual incidences is less than the number of predicted incidences. There’s a smaller risk than expected of having a lip cancer in the counties where the SIR is lower than 1.

If the SIR is greater than 1, then the number of actual incidences is greater than the number of predicted incidences. There’s a higher risk than expected of having a lip cancer in the counties where the SIR is greater than 1.

We compute the vector of SIRs as defined above, and append it to the data frame. Afterwards, the SIR is mapped.

d$SIR <- d$Y / d$E


codes <- rgdal::make_EPSG()
codes[which(codes$code == "27700"), ]
proj4string(map) <- "+proj=tmerc +lat_0=49 +lon_0=-2
+k=0.9996012717 +x_0=400000 +y_0=-100000 +datum=OSGB36
+units=km +no_defs"

map <- spTransform(map,
                   CRS("+proj=longlat +datum=WGS84 +no_defs"))

head(d)

Visualizations

In this section we realize some visualizations with the packages showed in class

Using leaflet

sapply(slot(map, "polygons"), function(x){slot(x, "ID")})
##  [1] "skye-lochalsh" "banff-buchan"  "caithness"     "berwickshire" 
##  [5] "ross-cromarty" "orkney"        "moray"         "shetland"     
##  [9] "lochaber"      "gordon"        "western.isles" "sutherland"   
## [13] "nairn"         "wigtown"       "NE.fife"       "kincardine"   
## [17] "badenoch"      "ettrick"       "inverness"     "roxburgh"     
## [21] "angus"         "aberdeen"      "argyll-bute"   "clydesdale"   
## [25] "kirkcaldy"     "dunfermline"   "nithsdale"     "east.lothian" 
## [29] "perth-kinross" "west.lothian"  "cumnock-doon"  "stewartry"    
## [33] "midlothian"    "stirling"      "kyle-carrick"  "inverclyde"   
## [37] "cunninghame"   "monklands"     "dumbarton"     "clydebank"    
## [41] "renfrew"       "falkirk"       "clackmannan"   "motherwell"   
## [45] "edinburgh"     "kilmarnock"    "east.kilbride" "hamilton"     
## [49] "glasgow"       "dundee"        "cumbernauld"   "bearsden"     
## [53] "eastwood"      "strathkelvin"  "tweeddale"     "annandale"
library(sp)
rownames(d) <- d$county
map <- SpatialPolygonsDataFrame(map, d, match.ID = TRUE)

head(map@data)
library(leaflet)
l <- leaflet(map) %>% addTiles()

pal <- colorNumeric(palette = "Purples", domain = map$SIR)

library(ggplot2)

#improved map
labels <- sprintf("<strong> %s </strong> <br/>
  Observed: %s <br/> Expected: %s <br/>
  AFF: %s <br/> SIR: %s",
  map$county, map$Y, round(map$E, 2),
  map$AFF, round(map$SIR, 2)
) %>%
  lapply(htmltools::HTML)

l %>%
  addPolygons(
    color = "grey", weight = 1,
    fillColor = ~ pal(SIR), fillOpacity = 0.80,
    highlightOptions = highlightOptions(weight = 4),
    label = labels,
    labelOptions = labelOptions(
      style = list(
        "font-weight" = "normal",
        padding = "3px 8px"
      ),
      textsize = "15px", direction = "auto"
    )
  ) %>%
  addLegend(
    pal = pal, values = ~SIR, opacity = 0.5,
    title = "SIR", position = "topright"
  )

The SIR results can inform us about the riskiness of developing a lip cancer in a given area.

Areas that have a SIR greater than 1 are counties where the number of mapped cases is greater than the number of expected cases. This may indicate a riskier area where other covariates can influence the likeliness of developing a lip cancer.

However, it is important to mention that since we don’t have more detailed information on how the expected cases were computed, the distortions in SIR can be a result of misleading calculations that omit factors such as the population or the average age of the county for instance .

We can observe from the map that the county of Skye-Lochalsh has the highest SIR ratio (6.43).

Overall, we can observe that most of the counties located in the north of Scotland have a higher SIR than in the south. In other words, the expected cases of lip cancer for the counties in the north underestimate the actual number of cases.

Using mapview

data(scotland)
data <- scotland$data
scotland.map <- scotland$spatial.polygon

map <- scotland$spatial.polygon
rownames(scotland$data) <- scotland$data$county.names
map <- SpatialPolygonsDataFrame(map, scotland$data, match.ID = TRUE)
head(map@data)

We can compare the number of cases and expected cases using the side-by-side plot map created by the leaflet.extras2 package.

Actual cases are given by the left side of the map and expected cases by the right side of the map.

# Side by side map
m_cases <- mapview(map, zcol = "cases")
m_exp <- mapview(map, zcol = "expected")
m_cases | m_exp

By looking at the synchronized map we can see a visual association of the number of expected cases and the AFF. The counties with a higher number of expected cases have a lower proportion of people working in agricultural, fishing, farming.

# Synchronized plot
m_aff <- mapview(map, zcol = "AFF")
m <- leafsync::sync(m_exp, m_aff)
m

Using tmap

The visualization of the expected cases doesn’t give us detailed inference about the differences between counties, since almost all the regions have a number of expected cases between 0 and 20.

On the other hand, when we plot the map with the AFF variable that represents the percentage of people working in agriculture and the fishing sector we obtain a clearer picture just by looking at the map. We can highlight that the majority of workers of AAF are further away from the large cities.

tm_shape(map) + tm_polygons("expected") + tmap_style("col_blind")
tm_shape(map) + tm_polygons("AFF") + tmap_style("col_blind")

Measuring Spatial Autocorrelation

In this section we will measure the spatial autocorrelation of the areas using the Moran’s I

First, we retrieve the dataset again and plot the Scotland map using the scotland$spatial.polygon command.

data(scotland)
data <- scotland$data
scotland.map <- scotland$spatial.polygon

Global Moran’s I

# Selecting only the first two columns because that was the procedure
scotland$data <- scotland$data[, 1:2]
map <- scotland$spatial.polygon
rownames(scotland$data) <- scotland$data$county.names
map <- SpatialPolygonsDataFrame(map, scotland$data, match.ID = TRUE)
head(map@data)
map$vble <- map$data

tm_shape(map) + tm_polygons(col = "cases", title = "cases", style = "quantile") + tm_layout(legend.outside = TRUE) + tmap_style("col_blind")

To compute the Moran’s I we need to remove the areas that doesn’t share a border with any other areas like the islands. We can do that by setting the parameter zero.policy = TRUE.

nb <- poly2nb(map, queen = TRUE) # queen: sharing a point or a border
nbw <- spdep::nb2listw(nb, style = "W", zero.policy = TRUE)

(gmoran <- moran.test(map$cases, nbw, zero.policy = TRUE))
## 
##  Moran I test under randomisation
## 
## data:  map$cases  
## weights: nbw  n reduced by no-neighbour observations
##   
## 
## Moran I statistic standard deviate = 2.903, p-value = 0.001848
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##       0.245574553      -0.019230769       0.008320817

Since we have = 0.05 got a p-value equal to 0.001848, we have that the p-value < and we can thus reject the null hypothesis and conclude that there is evidence for spatial autocorrelation. Since the Moran’s I statistic is greater than the expectation, there is evidence for positive spatial autocorrelation.

The Monte Carlo approach leads us to the same result as we can see in the table and graph below.

gmoranMC <- moran.mc(map$cases, nbw, nsim = 999, zero.policy = TRUE)
gmoranMC
## 
##  Monte-Carlo simulation of Moran I
## 
## data:  map$cases 
## weights: nbw  
## number of simulations + 1: 1000 
## 
## statistic = 0.24557, observed rank = 998, p-value = 0.002
## alternative hypothesis: greater

By looking at the histogram we can see that our estimation is far from the value. Which supports the rejection of the null hypothesis that there is no spatial autocorrelation for our given variable (observed cases county-by-county).

datahist=data.frame(value=gmoranMC$res)

p3 <- ggplot(datahist, aes(x=value)) + geom_histogram() +
  theme_minimal()

p3 + geom_vline(aes(xintercept=mean(gmoranMC$statistic)),
            color="orange", linetype="dashed", size=1) +
  labs(title = "Distribution of Moran's I via Monte-Carlo Simulation") + theme_minimal()

moran.plot(map$cases, nbw, zero.policy = TRUE)

Conclusion

Despite the statistical support, we cannot realize a factual statement since we are using the number of absolute cases and not the relative. It may be the case that the concentration of areas with high cases is related to the high population in these areas. We would need data about the population to calculate the relative number of cases.