An outline of {shadowpop}

library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
library(gt)
library(palmerpenguins)
library(shadowpop)
library(withr)

Using shadowpop

To illustrate what shadowpop() does, we can use the penguins dataset from the {palmerpenguins} package.

[Figure: Palmer penguins package artwork, a hand-drawn image of three penguins, one of each species (Chinstrap, Gentoo and Adélie). Artwork by @allison_horst]

I am going to restrict the example here to the Adélie penguins.

Let’s say we have a small sample of 5 Adélies from Biscoe island:

sample_data <- penguins |>
  dplyr::filter(species == "Adelie" & island == "Biscoe") |>
  dplyr::slice_sample(n = 5) |>
  withr::with_seed(seed = 777)

sample_data |>
  gt()
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
Adelie Biscoe 41.1 18.2 192 4050 male 2008
Adelie Biscoe 37.8 18.3 174 3400 female 2007
Adelie Biscoe 41.6 18.0 192 3950 male 2008
Adelie Biscoe 36.5 16.6 181 2850 female 2008
Adelie Biscoe 40.1 18.9 188 4300 male 2008

and we would like to find a “shadow population” of 5 Adélies from Dream island that have a similar set of characteristics.

The penguins dataset doesn’t have an ID column to uniquely identify each observation; in this situation, shadowpop() will add one automatically.

(If your data has a unique ID/key field, you should specify its name via the id_col argument.)
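If your data lacks a key and you would rather control it yourself, a unique ID column is easy to add before matching. The sketch below uses base R on a small made-up data frame; the column name penguin_id is hypothetical, and the commented-out call only assumes that id_col accepts the column name as a string.

```r
# Toy source data (values are illustrative, not taken from penguins)
source_toy <- data.frame(
  sex  = c("female", "male", "female"),
  year = c(2007L, 2007L, 2008L)
)

# Add a simple sequential key so every row can be identified uniquely
source_toy$penguin_id <- seq_len(nrow(source_toy))

# Check that the key really is unique before relying on it
stopifnot(anyDuplicated(source_toy$penguin_id) == 0)

# Assumed call shape, not verified against the package documentation:
# shadowpop(sample_toy, source_toy, match_vars, id_col = "penguin_id")
```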

Let’s create our source data. In this case we’re going to use all the Adélies found on Dream island (top five rows shown below):

source_data <- penguins |>
  dplyr::filter(species == "Adelie" & island == "Dream")

head(source_data, n = 5) |>
  gt()
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
Adelie Dream 39.5 16.7 178 3250 female 2007
Adelie Dream 37.2 18.1 178 3900 male 2007
Adelie Dream 39.5 17.8 188 3300 female 2007
Adelie Dream 40.9 18.9 184 3900 male 2007
Adelie Dream 36.4 17.0 195 3325 female 2007

Finally we need to decide which variables we want to use to generate the best match for our shadow population, and in which order.

Probably sex is important, and maybe body mass, and let’s also say we want to match by year if possible.

We hit a problem now because the body mass variable is a specific number of grams, so it’s very unlikely that exact matches will occur.

(Currently shadowpop() only works via exact matching of fields.)

Body mass can also be NA in the penguins data, and we might want to convert such values to a specific character value so that they can still be matched.

There are two ways we can solve the problem:

  1. convert the specific mass values to ranges and match on those, or
  2. match by closeness, for example accept any source penguin whose mass is within ±50 g of the sample individual's mass.
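To give a feel for the second option, here is a minimal base R sketch of closeness matching on body mass alone. It is not something shadowpop() does; the masses are copied from the sample table and the five source rows shown above, and tolerance_g is a name introduced only for this illustration.

```r
# Body masses (g) copied from the tables above
sample_mass <- c(4050, 3400, 3950, 2850, 4300)  # five Biscoe sample penguins
source_mass <- c(3250, 3900, 3300, 3900, 3325)  # first five Dream source penguins

tolerance_g <- 50  # hypothetical tolerance: accept masses within +/- 50 g

# For each sample penguin, list the source row positions that are close enough.
# which() silently drops NA comparisons, so missing masses never match.
close_matches <- lapply(
  sample_mass,
  function(m) which(abs(source_mass - m) <= tolerance_g)
)

# Number of candidate source rows per sample penguin
lengths(close_matches)
```

With only these five source rows, most sample penguins find no candidate at all, which shows how quickly a tight tolerance shrinks the pool of possible shadows.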

For now we will use the range option.

sample_data2 <- sample_data |>
  dplyr::mutate(body_mass_range = dplyr::case_when(
    .data[["body_mass_g"]] < 3000 ~ "<3000",
    .data[["body_mass_g"]] < 3250 ~ "3000-3249",
    .data[["body_mass_g"]] < 3500 ~ "3250-3499",
    .data[["body_mass_g"]] < 3750 ~ "3500-3749",
    .data[["body_mass_g"]] < 4000 ~ "3750-3999",
    .data[["body_mass_g"]] < 4250 ~ "4000-4249",
    .data[["body_mass_g"]] < 4500 ~ "4250-4499",
    .data[["body_mass_g"]] >= 4500 ~ "4500+",
    is.na(.data[["body_mass_g"]]) ~ "Missing"
  ))

source_data2 <- source_data |>
  dplyr::mutate(body_mass_range = dplyr::case_when(
    .data[["body_mass_g"]] < 3000 ~ "<3000",
    .data[["body_mass_g"]] < 3250 ~ "3000-3249",
    .data[["body_mass_g"]] < 3500 ~ "3250-3499",
    .data[["body_mass_g"]] < 3750 ~ "3500-3749",
    .data[["body_mass_g"]] < 4000 ~ "3750-3999",
    .data[["body_mass_g"]] < 4250 ~ "4000-4249",
    .data[["body_mass_g"]] < 4500 ~ "4250-4499",
    .data[["body_mass_g"]] >= 4500 ~ "4500+",
    is.na(.data[["body_mass_g"]]) ~ "Missing"
  ))
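The two identical case_when() blocks above could be wrapped in a single helper so that the bands are defined only once. The following is a sketch using base R's cut(); mass_range() is a hypothetical helper name and is not part of {shadowpop}.

```r
# Hypothetical helper: map body mass in grams to the same labelled bands
# used in the case_when() blocks above.
mass_range <- function(body_mass_g) {
  breaks <- c(-Inf, seq(3000, 4500, by = 250), Inf)
  labels <- c("<3000", "3000-3249", "3250-3499", "3500-3749",
              "3750-3999", "4000-4249", "4250-4499", "4500+")
  # right = FALSE gives left-closed intervals [lower, upper),
  # matching the "<" comparisons in case_when()
  out <- as.character(
    cut(body_mass_g, breaks = breaks, labels = labels, right = FALSE)
  )
  out[is.na(out)] <- "Missing"  # NA masses get their own category
  out
}

# Masses of the five sample penguins, plus one missing value
mass_range(c(4050, 3400, 3950, 2850, 4300, NA))
```

The helper could then be applied to both data frames, for example inside dplyr::mutate(body_mass_range = mass_range(body_mass_g)).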

We can now use body_mass_range as one of our match variables, and create a shadow population:

match_vars <- c("sex", "body_mass_range", "year")

shadow_pop <- shadowpop(sample_data2, source_data2, match_vars) |>
  dplyr::select(!c("id", "body_mass_range"))

shadow_pop |>
  gt()
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
Adelie Dream 43.2 18.5 192 4100 male 2008
Adelie Dream 39.5 16.7 178 3250 female 2007
Adelie Dream 38.3 19.2 189 3950 male 2008
Adelie Dream 33.1 16.1 178 2900 female 2008
Adelie Dream 40.3 18.5 196 4350 male 2008

For comparison, here is the original sample data again:

sample_data |>
  gt()
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
Adelie Biscoe 41.1 18.2 192 4050 male 2008
Adelie Biscoe 37.8 18.3 174 3400 female 2007
Adelie Biscoe 41.6 18.0 192 3950 male 2008
Adelie Biscoe 36.5 16.6 181 2850 female 2008
Adelie Biscoe 40.1 18.9 188 4300 male 2008

How well does the shadow population match against the sample population?

Pretty well! Each row in the sample data found a match in the source data on all three of the variables we asked it to match against. The exact body mass values differ, of course, but every sample individual was matched to a source penguin in the same mass range.
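The same comparison can be made programmatically rather than by eye. The sketch below retypes the three match variables from the two tables above and assumes that the shadow rows are returned in the same order as the sample rows, as the tables suggest. The names sample_tbl, shadow_tbl and range_breaks are introduced only for this illustration.

```r
# Match variables copied from the sample and shadow tables above
sample_tbl <- data.frame(
  sex         = c("male", "female", "male", "female", "male"),
  body_mass_g = c(4050, 3400, 3950, 2850, 4300),
  year        = c(2008, 2007, 2008, 2008, 2008)
)
shadow_tbl <- data.frame(
  sex         = c("male", "female", "male", "female", "male"),
  body_mass_g = c(4100, 3250, 3950, 2900, 4350),
  year        = c(2008, 2007, 2008, 2008, 2008)
)

# Lower bounds of the mass bands; findInterval() returns the band index
# (0 means below 3000 g), using left-closed intervals like case_when() did
range_breaks <- seq(3000, 4500, by = 250)

# Row-by-row agreement between each sample penguin and its shadow
agreement <- data.frame(
  sex = sample_tbl$sex == shadow_tbl$sex,
  body_mass_range = findInterval(sample_tbl$body_mass_g, range_breaks) ==
    findInterval(shadow_tbl$body_mass_g, range_breaks),
  year = sample_tbl$year == shadow_tbl$year
)

agreement
all(as.matrix(agreement))  # TRUE when every pair agrees on every variable
```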

All other things being equal, the larger the source data, the more likely you are to get a good match for each member of the sample population.

The more variables you try to match against, the less likely you are to find an exact shadow in the source data across all chosen variables.

If shadowpop() doesn’t find a shadow match across all the initially specified match_cols, it will ignore the last one and try again to find a shadow row using the remaining variables. Then it will try again, this time ignoring the last two variables in the list, and so on until there are no more match variables to try.
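The fallback behaviour described above can be sketched in a few lines of base R. This is an illustration of the idea only, not the package's implementation. find_shadow_row() is a hypothetical helper, and details the text does not specify (how ties are broken, whether a source row can be reused) are filled in with the simplest assumption: take the first matching row.

```r
# Hypothetical helper: try all match variables first, then drop the last
# one and retry, until a match is found or no variables remain.
find_shadow_row <- function(sample_row, source, match_vars) {
  for (k in rev(seq_along(match_vars))) {
    vars <- match_vars[seq_len(k)]  # the first k variables, in priority order
    keep <- rep(TRUE, nrow(source))
    for (v in vars) {
      # %in% never returns NA, so missing values cannot break the filter
      keep <- keep & source[[v]] %in% sample_row[[v]]
    }
    if (any(keep)) {
      return(list(
        row = source[which(keep)[1], , drop = FALSE],  # assumption: first hit
        matched_on = vars
      ))
    }
  }
  NULL  # nothing matched, even on the first variable alone
}

# Match variables for the five Dream source rows shown earlier
source_toy <- data.frame(
  sex = c("female", "male", "female", "male", "female"),
  body_mass_range = c("3250-3499", "3750-3999", "3250-3499",
                      "3750-3999", "3250-3499"),
  year = c(2007, 2007, 2007, 2007, 2007)
)
match_vars <- c("sex", "body_mass_range", "year")

# A sample penguin that matches on all three variables
full <- find_shadow_row(
  data.frame(sex = "female", body_mass_range = "3250-3499", year = 2007),
  source_toy, match_vars
)

# A made-up sample penguin from 2008: no 2008 rows exist in source_toy,
# so the search falls back to sex and body_mass_range
partial <- find_shadow_row(
  data.frame(sex = "female", body_mass_range = "3250-3499", year = 2008),
  source_toy, match_vars
)

full$matched_on
partial$matched_on
```

Because variables are dropped from the end of the list, the order of the match variables acts as a priority ranking: the first variable is the last to be given up.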