Interpolation

HES 505 Fall 2024: Session 18

Carolyn Koehn

Assignment Revision 2…

Chance to revise assignments 6-7 - please note why you made changes/what the new code does
Switching projects in RStudio

Objectives

By the end of today you should be able to:

Distinguish deterministic and stochastic processes
Define autocorrelation and describe its estimation
Articulate the benefits and drawbacks of autocorrelation
Leverage point patterns and autocorrelation to interpolate missing data

But first…

Patterns as realizations of spatial processes

A spatial process is a description of how a spatial pattern might be generated
Generative models
An observed pattern as a possible realization of an hypothesized process

Deterministic vs. stochastic processes

Deterministic processes: always produce the same outcome

\[ z = 2x + 3y \]

Results in a spatially continuous field

Deterministic vs. stochastic processes

x <- rast(nrows = 10, ncols=10, xmin = 0, xmax=10, ymin = 0, ymax=10)
values(x) <- 1
z <- x
values(z) <- 2 * crds(x)[,1] + 3*crds(x)[,2]

Deterministic vs. stochastic processes

Stochastic processes: variation makes each realization difficult to predict

\[ z = 2x + 3y + d \]

The process is random, not the result (!!)
Measurement error makes deterministic processes appear stochastic

x <- rast(nrows = 10, ncols=10, xmin = 0, xmax=10, ymin = 0, ymax=10)
values(x) <- 1
fun <- function(z){
a <- z
d <- runif(ncell(z), -50,50)
values(a) <- 2 * crds(x)[,1] + 3*crds(x)[,2] + d
return(a)
}

b <- replicate(n=6, fun(z=x), simplify=FALSE)
d <- do.call(c, b)

Deterministic vs. stochastic processes

Expected values and hypothesis testing

Considering each outcome as the realization of a process allows us to generate expected values
The simplest spatial process is Completely Spatial Random (CSR) process
First Order effects: any event has an equal probability of occurring in a location
Second Order effects: the location of one event is independent of the other events

Generating expactations for CSR

We can use quadrat counts to estimate the expected number of events in a given area
The probability of each possible count is given by:

\[ P(n,k) = {n \choose x}p^k(1-p)^{n-k} \]

Given total coverage of quadrats, then \(p=\frac{\frac{a}{x}}{a}\) and

\[ \begin{equation} P(k,n,x) = {n \choose k}\bigg(\frac{1}{x}\bigg)^k\bigg(\frac{x-1}{x}\bigg)^{n-k} \end{equation} \]

Revisiting Ripley’s \(K\)

Probability is tied to quadrat size – we can improve estimates with a moving window
If points have independent, fixed marginal densities, then they exhibit complete, spatial randomness (CSR)
The K function is an alternative, based on a series of circles with increasing radius

\[ \begin{equation} K(d) = \lambda^{-1}E(N_d) \end{equation} \]

We can test for clustering by comparing to the expectation:

\[ \begin{equation} K_{CSR}(d) = \pi d^2 \end{equation} \]

if \(k(d) > K_{CSR}(d)\) then there is clustering at the scale defined by \(d\)

Ripley’s \(K\) Function

When working with a sample the distribution of \(K\) is unknown
Estimate with

\[ \begin{equation} \hat{K}(d) = \hat{\lambda}^{-1}\sum_{i=1}^n\sum_{j=1}^n\frac{I(d_{ij} <d)}{n(n-1)} \end{equation} \]

where:

\[ \begin{equation} \hat{\lambda} = \frac{n}{|A|} \end{equation} \]

Ripley’s \(K\) Function

Using the spatstat package

Ripley’s \(K\) Function

kf <- Kest(bramblecanes, correction-"border")
plot(kf)

Ripley’s \(K\) Function

accounting for variation in \(d\)

kf.env <- envelope(bramblecanes, correction="border", envelope = FALSE, verbose = FALSE)
plot(kf.env)

Other functions

\(L\) function: square root transformation of \(K\)
\(G\) function: the cumulative frequency distribution of the nearest neighbor distances
\(F\) function: similar to \(G\) but based on randomly located points

Tobler’s Law

‘everything is usually related to all else but those which are near to each other are more related when compared to those that are further away’.
Waldo Tobler

Spatial autocorrelation

From Manuel Gimond

(One) Measure of autocorrelation

Moran’s I

Moran’s I: An example

Use spdep package
Estimate neighbors
Generate weighted average

set.seed(2354)
# Load the shapefile
s <- readRDS(url("https://github.com/mgimond/Data/raw/gh-pages/Exercises/fl_hr80.rds"))

# Define the neighbors (use queen case)
nb <- poly2nb(s, queen=TRUE)

# Compute the neighboring average homicide rates
lw <- nb2listw(nb, style="W", zero.policy=TRUE)
#estimate Moran's I
moran.test(s$HR80,lw, alternative="greater")


    Moran I test under randomisation

data:  s$HR80  
weights: lw    

Moran I statistic standard deviate = 1.8891, p-value = 0.02944
alternative hypothesis: greater
sample estimates:
Moran I statistic       Expectation          Variance 
      0.136277593      -0.015151515       0.006425761

Moran’s I: An example

M1 <- moran.mc(s$HR80, lw, nsim=9999, alternative="greater")



# Display the resulting statistics
M1


    Monte-Carlo simulation of Moran I

data:  s$HR80 
weights: lw  
number of simulations + 1: 10000 

statistic = 0.13628, observed rank = 9575, p-value = 0.0425
alternative hypothesis: greater

The challenge of areal data

Spatial autocorrelation threatens second order randomness
Areal data means an infinite number of potential distances
Neighbor matrices, \(\boldsymbol W\), allow different characterizations

Interpolation

Goal: estimate the value of \(z\) at new points in \(\mathbf{x_i}\)
Most useful for continuous values
Nearest-neighbor, Inverse Distance Weighting, Kriging

Nearest neighbor

find \(i\) such that \(| \mathbf{x_i} - \mathbf{x}|\) is minimized
The estimate of \(z\) is \(z_i\)

# data retrieved from https://www.epa.gov/outdoor-air-quality-data/download-daily-data

aq <- read_csv("/opt/data/data/classexamples/ad_viz_plotval_data_PM25_2024_ID.csv") %>% 
  st_as_sf(., coords = c("Site Longitude", "Site Latitude"), crs = "EPSG:4326") %>% 
  st_transform(., crs = "EPSG:8826") %>% 
  mutate(date = as_date(parse_datetime(Date, "%m/%d/%Y"))) %>% 
  filter(., date >= 2024-07-01) %>% 
  filter(., date > "2024-07-01" & date < "2024-07-31")
aq.sum <- aq %>% 
  group_by(., `Site ID`) %>% 
  summarise(., meanpm25 = mean(`Daily AQI Value`))

nodes <- st_make_grid(aq.sum,
                      what = "centers")

dist <- distance(vect(nodes), vect(aq.sum))
nearest <- apply(dist, 1, function(x) which(x == min(x)))
aq.nn <- aq.sum$meanpm25[nearest]
preds <- st_as_sf(nodes)
preds$aq <- aq.nn

preds <- as(preds, "Spatial")
sp::gridded(preds) <- TRUE
preds.rast <- rast(preds)

Inverse-Distance Weighting

Weight closer observations more heavily

\[ \begin{equation} \hat{z}(\mathbf{x}) = \frac{\sum_{i=1}w_iz_i}{\sum_{i=1}w_i} \end{equation} \] where

\[ \begin{equation} w_i = | \mathbf{x} - \mathbf{x}_i |^{-\alpha} \end{equation} \] and \(\alpha > 0\) (\(\alpha = 1\) is inverse; \(\alpha = 2\) is inverse square)

Inverse-Distance Weighting

terra::interpolate provides flexible interpolation methods
Use the gstat package to develop the formula

mgsf05 <- gstat(id = "meanpm25", formula = meanpm25~1, data=aq.sum,  nmax=7, set=list(idp = 0.5))
mgsf2 <- gstat(id = "meanpm25", formula = meanpm25~1, data=aq.sum,  nmax=7, set=list(idp = 2))
interpolate_gstat <- function(model, x, crs, ...) {
    v <- st_as_sf(x, coords=c("x", "y"), crs=crs)
    p <- predict(model, v, ...)
    as.data.frame(p)[,1:2]
}
zsf05 <- interpolate(preds.rast, mgsf05, debug.level=0, fun=interpolate_gstat, crs=crs(preds.rast), index=1)
zsf2 <- interpolate(preds.rast, mgsf2, debug.level=0, fun=interpolate_gstat, crs=crs(preds.rast), index=1)