2. Preparing Data for Spatial Accessibility Analysis
Source: vignettes/spax-102-data-prep.Rmd
# Load required packages
library(spax)
library(terra)
library(tidyverse)
library(sf)
library(osrm)
Data Preparation for Accessibility Analysis
“Give me six hours to chop down a tree, and I will spend the first four sharpening the axe.”
— Abraham Lincoln(?)
Overview
Spatial accessibility analysis requires three key components:
- Demand: Population or service need distribution
- Supply: Service locations and their capacities
- Distance: Travel time or distance between demand and supply
This guide walks through preparing each component for use with spax.
Dealing with Demand
When preparing demand data for spatial accessibility analysis, we have several approaches available, each with its own assumptions and trade-offs. Let’s explore these methods using a simple example of disease cases by province.
set.seed(42) # For reproducibility
# Example province-level case data
case_data <- tibble(
ADM1_PCODE = bound1$ADM1_PCODE,
cases = round(rnorm(nrow(bound1), 1000, 200)) # Simulated disease cases
)
# Join with spatial data
case_spatial <- bound1 |>
left_join(case_data, by = "ADM1_PCODE")
In a vector-based approach, we typically represent each area with a polygon that has an attribute for the number of cases. This often results in choropleth maps, like this:
# Plot the cases as a choropleth
ggplot() +
geom_sf(data = bound1) +
geom_sf(data = case_spatial, aes(fill = cases)) +
labs(title = "Cases by Province") +
scale_fill_viridis_b() +
theme_minimal()
But this can be misleading: in traditional spatial analysis, demand is often allocated to just the centroid of each area:
# Calculate centroids and rasterize
centroids <- st_centroid(case_spatial)
#> Warning: st_centroid assumes attributes are constant over geometries
# Plot the cases at province centroids
ggplot() +
geom_sf(data = bound1) +
geom_sf(data = centroids, aes(size = cases), col = "red") +
labs(title = "Population Data by Region") +
theme_minimal()
The centroid approach works for vector-centric analysis, but for spax, we need to think in terms of continuous surfaces. Let's explore different ways to convert our data into such a surface.
1. Centroid-Based Approach
The simplest approach treats all demand as concentrated at area centroids. This mirrors traditional vector-based accessibility analyses but in raster format:
# Load population density raster & distance
u5pd <- read_spax_example("u5pd.tif")
hos_iscr <- read_spax_example("hos_iscr.tif")
# Create template raster - Coarser resolution for example
template <- aggregate(u5pd, fact = 10)
# Rasterize centroids
centroid_demand <- terra::rasterize(
vect(centroids),
template,
field = "cases",
fun = "sum"
)
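To see what this produces, we can plot the result; a quick sketch using the objects created above:
# Visualize the rasterized centroid demand (sketch)
plot(centroid_demand, main = "Centroid-Based Demand")
plot(vect(bound1), add = TRUE)
plot(vect(centroids), add = TRUE, col = "red", pch = 16)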
As you can see, this approach creates a somewhat odd-looking result - all our demand is concentrated in single cells. However, this raster can still be used in accessibility analysis, just like the original population density raster. The key trade-off here is between simplicity and spatial accuracy.
2. Area-Weighted Approach [developing]
A more sophisticated method distributes cases evenly across each province’s area:
# 1. Calculate case density
case_spatial <- case_spatial |>
mutate(
area_m2 = st_area(case_spatial), # estimate area
case_density = as.numeric(cases / area_m2) # calc density
)
# 2. Rasterize case density
case_density_raster <- terra::rasterize(
vect(case_spatial),
template,
field = "case_density"
)
# 3. Calculate pixel area (in square meters, matching the units of st_area())
pixel_area <- terra::res(template)[1] * terra::res(template)[2]
# 4. Convert density to absolute cases per pixel
case_raster <- case_density_raster * pixel_area
Let’s see how this looks:
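One way to do that; a sketch using the case_raster created above:
# Plot the area-weighted demand surface (sketch)
plot(case_raster, main = "Area-Weighted Case Distribution")
plot(vect(bound1), add = TRUE)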
This gives us a more continuous surface, but we should validate that our totals still match:
# Extract raster values for each polygon
extracted_cases <- terra::extract(
case_raster,
vect(case_spatial),
fun = sum, na.rm = TRUE
)
# Compare original vs rasterized totals
case_spatial |>
st_drop_geometry() |>
select(ADM1_PCODE, cases) |>
mutate(
raster_cases = extracted_cases[, 2],
difference = abs(cases - raster_cases),
pct_diff = (difference / cases) * 100
) |>
mutate_if(is.numeric, round, 1)
#> ADM1_PCODE cases raster_cases difference pct_diff
#> 1 TH90 1274 1261.4 12.6 1.0
#> 2 TH91 887 867.4 19.6 2.2
#> 3 TH92 1073 1068.1 4.9 0.5
#> 4 TH93 1127 1131.1 4.1 0.4
#> 5 TH94 1081 1085.9 4.9 0.5
#> 6 TH95 979 995.2 16.2 1.7
#> 7 TH96 1302 1286.8 15.2 1.2
The differences between original and rasterized values arise from the discretization process. The results get closer as resolution increases, but there’s always a trade-off between accuracy and computational efficiency.
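If you want to check that for your own data, one option is to repeat the same steps on a finer template; a sketch (expect longer computation):
# Sketch: repeat the area-weighted steps on a finer grid
template_fine <- aggregate(u5pd, fact = 5) # half the cell size used above
density_fine <- terra::rasterize(vect(case_spatial), template_fine, field = "case_density")
cases_fine <- density_fine * prod(terra::res(template_fine)) # cases per (smaller) pixel
# Re-running the terra::extract() comparison should show smaller percentage differences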
3. Informed Distribution [developing]
An even more realistic distribution is possible by using auxiliary data, such as population density, to inform where cases are likely to occur within each province. This section will be expanded in future versions; a rough sketch of the idea follows.
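The sketch below allocates each province's cases in proportion to the under-five population within it. This is an illustration only, not the package's implementation; it assumes the case_spatial, u5pd, and template objects created earlier, and that u5pd stores population counts per cell.
# Sketch: allocate province cases in proportion to under-five population
pop <- terra::aggregate(u5pd, fact = 10, fun = "sum", na.rm = TRUE) # population per coarse cell
prov_id <- terra::rasterize(vect(case_spatial), pop, field = "ADM1_PCODE") # province membership
prov_cases <- terra::rasterize(vect(case_spatial), pop, field = "cases") # province case totals
prov_pop <- terra::zonal(pop, prov_id, fun = "sum", na.rm = TRUE, as.raster = TRUE) # province population
informed_demand <- prov_cases * (pop / prov_pop) # each cell receives its population share of the cases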
4. Probabilistic Sampling [developing]
For uncertainty analysis, spax provides tools to generate multiple potential demand distributions:
# This advanced approach is covered in detail in the
# "Probabilistic Accessibility Analysis" vignette
?sample_pmf # See documentation for details
The choice of demand preparation method can significantly impact your accessibility analysis results. Consider your specific context:
- Data availability (Do you have population density data?)
- Scale of analysis (Are districts large enough that distribution matters?)
- Purpose (Is point-based analysis sufficient for your needs?)
- Computational resources (Can you handle higher resolution distributions?)
Processing Supply Data
Supply data is usually the most straightforward component - we typically know exactly where facilities are and their capacities. However, these locations play a crucial role in the more complex task of distance calculation that follows.
# Basic Supply Structure
head(hc12_hos) |>
select(id, hoslvl, s_doc, s_nurse)
#> Simple feature collection with 6 features and 4 fields
#> Geometry type: POINT
#> Dimension: XY
#> Bounding box: xmin: 659422 ymin: 744135 xmax: 715990.7 ymax: 824148
#> Projected CRS: WGS 84 / UTM zone 47N
#> # A tibble: 6 × 5
#> id hoslvl s_doc s_nurse geometry
#> <chr> <fct> <int> <int> <POINT [m]>
#> 1 c172 3-Regional 522 973 (662036.1 775836.7)
#> 2 c173 2-General 321 706 (672852 789753.3)
#> 3 c174 2-General 67 63 (659422 824148)
#> 4 c175 2-General 81 103 (692195 764766)
#> 5 c176 2-General 175 180 (687203.3 744135)
#> 6 c177 2-General 77 103 (715990.7 752559.6)
For use in spax, supply data should be separated into:
- Spatial locations (for distance calculations)
- Capacity attributes (for accessibility computations)
# Separate spatial and attribute components
supply_attributes <- hc12_hos |>
st_drop_geometry() |>
select(id, s_doc, s_nurse) # Keep only relevant columns
# Visual check of supply distribution
ggplot() +
geom_sf(data = bound1, fill = "grey80") +
geom_sf(data = hc12_hos, aes(size = s_doc, color = hoslvl)) +
scale_size_continuous(name = "Number of Doctors") +
scale_color_viridis_d(name = "Hospital Level", end = 0.5) +
theme_minimal() +
labs(
title = "Healthcare Supply Distribution",
subtitle = "Size indicates number of doctors, color shows facility level"
)
Creating Distance/Travel Time Surfaces
While supply locations themselves are straightforward, calculating distances between supply and demand points is often the most complex part of data preparation. Although it’s theoretically possible to calculate distances from either supply to demand or demand to supply (assuming travel times are symmetric), there are practical reasons why we typically calculate from supply points:
- We usually have fewer supply points than demand points, making computation more efficient
- When redistributing accessibility scores back to the population, we need supply-based distances anyway
Let’s explore different approaches to creating these distance surfaces:
1. Euclidean Distance
The simplest approach - straight-line distances:
# Example for one facility using terra
facility_point <- hc12_hos[1, ]
template_rast <- u5pd # Use population raster as template
# Calculate Euclidean distance (in meters since our CRS is projected)
euclidean_dist <- terra::distance(
template_rast,
vect(facility_point)
) |> crop(vect(bound0), mask = TRUE)
# Plot the Euclidean distance
plot(euclidean_dist / 1000, # Convert to kilometers
main = "Euclidean Distance (km)"
)
plot(vect(facility_point), add = TRUE, col = "red", pch = 16)
plot(vect(bound0), add = TRUE)
While simple, this approach ignores real-world travel constraints like roads and barriers. It's most useful for:
- Initial exploratory analysis
- Areas with relatively uniform transportation networks
- Services that aren't strictly road-dependent
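If you do use Euclidean distances, you will want one distance layer per facility, mirroring the structure of the hos_iscr travel time stack used elsewhere in this vignette. A minimal sketch of building such a stack from the objects loaded above:
# Sketch: one Euclidean distance layer per facility, named by facility id
euclid_layers <- lapply(seq_len(nrow(hc12_hos)), function(i) {
  terra::distance(template_rast, vect(hc12_hos[i, ]))
})
euclid_stack <- rast(euclid_layers) |>
  crop(vect(bound0), mask = TRUE)
names(euclid_stack) <- hc12_hos$id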
2. Travel Time Isochrones
A more realistic approach uses travel time isochrones - areas reachable within specific time thresholds. While spax doesn't generate isochrones directly, we can use tools like the osrm package:
# Example: Calculate isochrones for the first hospital
# Our data use a projected CRS, so we transform to WGS84 (EPSG:4326) for the routing service
example_hos1 <- hc12_hos[1, ] |> # Select the first hospital
st_transform(4326) |> # Transform to WGS84 (geographic coordinates)
st_coordinates() |> # Extract coordinates
as.numeric() # Convert to numeric
# Calculate isochrones
example_isochrone <- osrmIsochrone(
loc = example_hos1, # Example coordinates
breaks = c(15, 30, 45, 60), # 15-minute intervals
res = 30
)
# Transform back to our working CRS
example_isochrone <- st_transform(example_isochrone, st_crs(bound0))
# Plot results
par(mfrow = c(1, 2))
plot(example_isochrone[, "isomax"], main = "Travel Time Isochrone for Hospital C1")
plot(example_isochrone[3, "isomax"], main = "Travel Time Isochrone ring for Hospital C1 (45 mins)")
The isochrones show travel time “rings” around our facility. If you take a vector-based approach, you can use functions like sf::st_contains() to determine the population within each isochrone ring. For spax, we need to convert these to a raster format:
# Prepare isochrones for rasterization
example_isochrone <- example_isochrone |>
mutate(
iso_mean = (isomin + isomax) / 2,
location_id = hc12_hos[1, ]$id
)
# Convert to raster format
example_raster <- terra::rasterize(
x = example_isochrone,
y = template_rast,
field = "iso_mean", # Fill the raster with the isomax values
background = NA,
fun = "sum",
by = "location_id"
) |>
crop(vect(bound0), mask = TRUE)
# Plot
plot(example_raster, main = "Travel Time Isochrone for Hospital C1")
plot(vect(bound0), add = TRUE)
plot(vect(hc12_hos[1, ]), add = TRUE, col = "red", pch = 16)
⚠️ Important Note about OSRM: By default, the osrm::osrmIsochrone() function uses OSRM's demo server. This is fine for testing, but for real analysis you should:
- Set up your own OSRM server (this vignette does not cover the server setup process)
- Consider rate limits (the demo server allows ~1 request/second)
- Be mindful of server load and ethical usage
To process all facilities, you’d need to loop through them:
# Example code (not run due to API limitations)
# Pre-allocate a list to store isochrones (optional; lapply below returns a list)
isochrones <- vector("list", nrow(hc12_hos))
# Extract coordinates
hos_coords <- hc12_hos |>
st_transform(4326) |> # only needed if the data are in a projected CRS
st_coordinates()
# Loop over each hospital - lapply
isochrones <- lapply(1:nrow(hc12_hos), function(i) {
# Calculate isochrone
osrmIsochrone(
loc = hos_coords[i, ],
breaks = c(15, 30, 45, 60),
res = 30
)
})
# Then proceed to rasterization ...
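And a sketch of that rasterization step, mirroring the single-facility example above (also not run; assumes the loop returned one sf object per hospital):
# Combine per-facility isochrones and rasterize into a multi-layer raster (sketch)
all_iso <- lapply(seq_along(isochrones), function(i) {
  isochrones[[i]] |>
    mutate(
      iso_mean = (isomin + isomax) / 2,
      location_id = hc12_hos$id[i]
    )
})
all_iso <- do.call(rbind, all_iso) |>
  st_transform(st_crs(bound0))
travel_time_stack <- terra::rasterize(
  x = vect(all_iso),
  y = template_rast,
  field = "iso_mean",
  by = "location_id"
) |>
  crop(vect(bound0), mask = TRUE)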
Service Provider Options
When creating travel time surfaces, you have several options:
OSRM (Open Source Routing Machine)
- ✓ Free and open source
- ✓ Can handle multiple travel modes (if setup properly)
- × Requires server setup for serious use
- × Basic routing without traffic
Commercial APIs (HERE, Google Maps)
- ✓ Includes real-time traffic
- ✓ More accurate travel times
- × Usage costs
- × API limits
Custom cost surfaces (not covered in this vignette)
- ✓ Can include terrain, land use
- ✓ Full control over parameters
- × Requires GIS expertise
- × More complex to validate
Validating Your Data: Essential Checks Before Analysis
Before running any accessibility analysis, it's crucial to validate your prepared data. Here's a systematic approach to validation (some of which is already implemented in the package):
1. Spatial Alignment
All raster inputs must align perfectly - same extent, resolution, and projection:
# Example:
# You might want to create a helper function to summarize raster properties
summarize_raster <- function(r, name) {
tibble(
Dataset = name,
Resolution = paste(round(res(r), 2), collapse = " x "),
Extent = paste(round(as.vector(ext(r)), 2), collapse = ", "),
CRS = crs(r, proj = TRUE)
)
}
# Compare properties of key rasters
raster_props <- bind_rows(
summarize_raster(u5pd, "Population"),
summarize_raster(hos_iscr, "Travel Time")
)
raster_props
#> # A tibble: 2 × 4
#> Dataset Resolution Extent CRS
#> <chr> <chr> <chr> <chr>
#> 1 Population 520.4 x 520.4 505646.53, 842347.77, 620843.72, 885729.24 +proj=ut…
#> 2 Travel Time 520.4 x 520.4 505646.53, 842347.77, 620843.72, 885729.24 +proj=ut…
# Visual check for alignment
plot(u5pd, main = "Spatial Alignment Check")
plot(hos_iscr[[1]], add = TRUE, alpha = 0.5)
plot(vect(bound1), add = TRUE)
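terra can also check alignment programmatically; a short sketch using compareGeom():
# Programmatic alignment check (errors if extent, resolution, or CRS differ)
terra::compareGeom(u5pd, hos_iscr, ext = TRUE, res = TRUE, crs = TRUE)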
2. Value Range Validation
Check for reasonable values in your input data:
# Check population density range
summary(u5pd)
#> Warning: [summary] used a sample
#> tha_children_under_five_2020
#> Min. : 0.00
#> 1st Qu.: 1.40
#> Median : 3.48
#> Mean : 6.03
#> 3rd Qu.: 7.59
#> Max. :82.96
#> NA's :84773
# Check a few travel time rasters
summary(hos_iscr[[1:10]])
#> Warning: [summary] used a sample
#> c172 c173 c174 c175
#> Min. : 6.0 Min. : 3.0 Min. : 3.5 Min. : 4.0
#> 1st Qu.: 67.5 1st Qu.: 67.5 1st Qu.: 82.5 1st Qu.: 67.5
#> Median : 97.5 Median :112.5 Median :127.5 Median :112.5
#> Mean :102.9 Mean :107.6 Mean :118.9 Mean :105.6
#> 3rd Qu.:142.5 3rd Qu.:142.5 3rd Qu.:157.5 3rd Qu.:142.5
#> Max. :172.5 Max. :172.5 Max. :172.5 Max. :172.5
#> NA's :71579 NA's :71501 NA's :74191 NA's :69709
#> c176 c177 c178 c179
#> Min. : 8.5 Min. : 11.0 Min. : 11.0 Min. : 7.5
#> 1st Qu.: 67.5 1st Qu.: 67.5 1st Qu.: 67.5 1st Qu.: 97.5
#> Median :112.5 Median :112.5 Median :112.5 Median :127.5
#> Mean :107.5 Mean :107.7 Mean :110.5 Mean :120.5
#> 3rd Qu.:142.5 3rd Qu.:142.5 3rd Qu.:157.5 3rd Qu.:157.5
#> Max. :172.5 Max. :172.5 Max. :172.5 Max. :172.5
#> NA's :70996 NA's :72338 NA's :74213 NA's :75822
#> c180 c181
#> Min. : 5.0 Min. : 8.50
#> 1st Qu.: 97.5 1st Qu.: 67.50
#> Median :127.5 Median : 97.50
#> Mean :127.3 Mean : 96.85
#> 3rd Qu.:157.5 3rd Qu.:127.50
#> Max. :172.5 Max. :172.50
#> NA's :76037 NA's :72831
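Beyond eyeballing the summaries, a quick programmatic check can flag obvious problems; a sketch with illustrative thresholds:
# Sanity checks on the travel time layers (thresholds are illustrative)
layer_min <- terra::global(hos_iscr, fun = "min", na.rm = TRUE)$min
layer_max <- terra::global(hos_iscr, fun = "max", na.rm = TRUE)$max
any(layer_min < 0) # negative travel times would indicate a processing error
any(layer_max > 24 * 60) # more than a day of travel is implausible for this study area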
3. Missing Value Analysis
Understand where and why you have NA values; note that not all NAs are bad:
# Function to analyze NA patterns within boundary
# PS. Will be implemented in the package later
analyze_nas <- function(r, name, bound = bound0) {
# Create mask from boundary
mask <- terra::rasterize(bound, r)
# Count cells only within boundary
cells_in_boundary <- global(!is.na(mask), "sum")$sum
nas_in_boundary <- global(is.na(r) & !is.na(mask), "sum")$sum
cat(
sprintf("%s:\n", name),
sprintf(
" NA cells within boundary: %d (%.1f%% of study area)\n",
nas_in_boundary,
100 * nas_in_boundary / cells_in_boundary
)
)
# Visualize NA pattern
na_map <- (is.na(r) & !is.na(mask)) |> crop(bound, mask = TRUE)
plot(na_map,
main = paste(name, "- NAs within Study Area"),
col = c("grey80", "red"),
legend = FALSE
)
plot(vect(bound1), add = TRUE)
legend("bottomright",
legend = c("Data present", "NA"),
fill = c("grey80", "red"),
bty = "n"
)
}
# Check NA patterns in key datasets
par(mfrow = c(1, 2))
analyze_nas(u5pd, "Pop")
#> Pop:
#> NA cells within boundary: 60551 (54.4% of study area)
analyze_nas(hos_iscr[[1]], "Time from 1st Fac")
#> Time from 1st Fac:
#> NA cells within boundary: 16841 (15.1% of study area)
Final reminders:
- Resolution Selection
  - Higher resolution ≠ better analysis
  - Consider computational resources
  - Match resolution to spatial scale of research question
- Coordinate Reference Systems
  - Be consistent with CRS across all datasets
  - Document CRS choices and transformations
  - Consider distortion implications
- Documentation
  - Note data sources and dates
  - Document all preprocessing steps
  - Record assumptions made
  - Note known limitations