library(gt)
library(palmerpenguins)
library(shadowpop)
library(withr)
Using shadowpop
To illustrate what shadowpop() does, we can use the {palmerpenguins} dataset.
[Image: Palmer penguins package artwork. Artwork by @allison_horst]
I am going to restrict the example here to the Adélie penguins.
Let’s say we have a small sample of 5 Adélies from Biscoe island:
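# Piping into with_seed() passes the whole pipeline as its `code` argument,
# so the random sample is drawn with the seed set to 777.
sample_data <- penguins |>
  dplyr::filter(species == "Adelie" & island == "Biscoe") |>
  dplyr::slice_sample(n = 5) |>
  withr::with_seed(seed = 777)

sample_data |>
  gt()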
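| species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
|---|---|---|---|---|---|---|---|
| Adelie | Biscoe | 41.1 | 18.2 | 192 | 4050 | male | 2008 |
| Adelie | Biscoe | 37.8 | 18.3 | 174 | 3400 | female | 2007 |
| Adelie | Biscoe | 41.6 | 18.0 | 192 | 3950 | male | 2008 |
| Adelie | Biscoe | 36.5 | 16.6 | 181 | 2850 | female | 2008 |
| Adelie | Biscoe | 40.1 | 18.9 | 188 | 4300 | male | 2008 |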
We would like to find a “shadow population” of 5 Adélies from Dream island that have a similar set of characteristics.
The penguins dataset doesn’t have an ID column to uniquely identify each observation; in this situation, shadowpop() will add one automatically.
(If your data has a unique ID/key field, you should specify its name via the id_col argument.)
Let’s create our source data. In this case we’re going to use all the Adélies found on Dream island (top five rows shown below):
source_data <- penguins |>
  dplyr::filter(species == "Adelie" & island == "Dream")

head(source_data, n = 5) |>
  gt()
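| species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
|---|---|---|---|---|---|---|---|
| Adelie | Dream | 39.5 | 16.7 | 178 | 3250 | female | 2007 |
| Adelie | Dream | 37.2 | 18.1 | 178 | 3900 | male | 2007 |
| Adelie | Dream | 39.5 | 17.8 | 188 | 3300 | female | 2007 |
| Adelie | Dream | 40.9 | 18.9 | 184 | 3900 | male | 2007 |
| Adelie | Dream | 36.4 | 17.0 | 195 | 3325 | female | 2007 |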
Finally we need to decide which variables we want to use to generate the best match for our shadow population, and in which order.
Probably sex is important, and maybe body mass, and let’s also say we want to match by year if possible.
We hit a problem now because the body mass variable is a specific number of grams, so it’s very unlikely that exact matches will occur.
(Currently shadowpop() only works via exact matching of fields.)
There are also some NA body mass values in the source data which we might want to convert to a specific character value.
There are two ways we can solve the problem:
1. convert the specific mass values to ranges and match on those,
2. create a match by closeness, for example match any source penguins that are within +/-50g of the mass of the sample individual.
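The second option, matching by closeness, could be sketched roughly as follows. This is only an illustration in base R, not something shadowpop() provides (it currently matches exactly). The helper name within_tolerance() and the example masses are made up for the sketch.

```r
# Hypothetical helper (not part of shadowpop): for each sample mass, return the
# indices of the source rows whose mass lies within +/- `tolerance` grams.
within_tolerance <- function(sample_mass, source_mass, tolerance = 50) {
  lapply(sample_mass, function(m) {
    # which() silently drops NA comparisons, so missing masses never match
    which(abs(source_mass - m) <= tolerance)
  })
}

# Made-up masses, purely for illustration
sample_mass <- c(4050, 3400)
source_mass <- c(3250, 3900, 3300, 4100, 3425, NA)

candidates <- within_tolerance(sample_mass, source_mass)
candidates
```

Each list element holds the candidate source rows for one sample individual; an empty element would mean that no source penguin fell inside the tolerance.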
For now we will use the range option.
sample_data2 <- sample_data |>
  dplyr::mutate(body_mass_range = dplyr::case_when(
    .data[["body_mass_g"]] < 3000 ~ "<3000",
    .data[["body_mass_g"]] < 3250 ~ "3000-3249",
    .data[["body_mass_g"]] < 3500 ~ "3250-3499",
    .data[["body_mass_g"]] < 3750 ~ "3500-3749",
    .data[["body_mass_g"]] < 4000 ~ "3750-3999",
    .data[["body_mass_g"]] < 4250 ~ "4000-4249",
    .data[["body_mass_g"]] < 4500 ~ "4250-4499",
    .data[["body_mass_g"]] >= 4500 ~ "4500+",
    is.na(.data[["body_mass_g"]]) ~ "Missing"
  ))

source_data2 <- source_data |>
  dplyr::mutate(body_mass_range = dplyr::case_when(
    .data[["body_mass_g"]] < 3000 ~ "<3000",
    .data[["body_mass_g"]] < 3250 ~ "3000-3249",
    .data[["body_mass_g"]] < 3500 ~ "3250-3499",
    .data[["body_mass_g"]] < 3750 ~ "3500-3749",
    .data[["body_mass_g"]] < 4000 ~ "3750-3999",
    .data[["body_mass_g"]] < 4250 ~ "4000-4249",
    .data[["body_mass_g"]] < 4500 ~ "4250-4499",
    .data[["body_mass_g"]] >= 4500 ~ "4500+",
    is.na(.data[["body_mass_g"]]) ~ "Missing"
  ))
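Because the same binning is applied to both data frames, one alternative is to wrap it in a small helper. The following is a minimal sketch using base R's cut(); mass_to_range() is a hypothetical name and not part of shadowpop. It is intended to produce the same labels as the case_when() calls above.

```r
# Hypothetical helper (not part of shadowpop): bin body mass into 250 g ranges.
mass_to_range <- function(mass_g) {
  breaks <- c(-Inf, seq(3000, 4500, by = 250), Inf)
  labels <- c(
    "<3000", "3000-3249", "3250-3499", "3500-3749",
    "3750-3999", "4000-4249", "4250-4499", "4500+"
  )
  # right = FALSE makes each bin closed on the left, e.g. [3000, 3250)
  ranges <- as.character(
    cut(mass_g, breaks = breaks, labels = labels, right = FALSE)
  )
  # cut() returns NA for missing masses; give them an explicit label instead
  ranges[is.na(ranges)] <- "Missing"
  ranges
}

example_ranges <- mass_to_range(c(4050, 3400, 2850, 4500, NA))
example_ranges
```

With such a helper, each data frame would need only a single line, for example dplyr::mutate(body_mass_range = mass_to_range(body_mass_g)).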
We can now use body_mass_range as one of our match variables, and create a shadow population:
match_vars <- c("sex", "body_mass_range", "year")

shadow_pop <- shadowpop(sample_data2, source_data2, match_vars) |>
  dplyr::select(!c("id", "body_mass_range"))

shadow_pop |>
  gt()
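| species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
|---|---|---|---|---|---|---|---|
| Adelie | Dream | 43.2 | 18.5 | 192 | 4100 | male | 2008 |
| Adelie | Dream | 39.5 | 16.7 | 178 | 3250 | female | 2007 |
| Adelie | Dream | 38.3 | 19.2 | 189 | 3950 | male | 2008 |
| Adelie | Dream | 33.1 | 16.1 | 178 | 2900 | female | 2008 |
| Adelie | Dream | 40.3 | 18.5 | 196 | 4350 | male | 2008 |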
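For comparison, here is the sample data again:

| species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
|---|---|---|---|---|---|---|---|
| Adelie | Biscoe | 41.1 | 18.2 | 192 | 4050 | male | 2008 |
| Adelie | Biscoe | 37.8 | 18.3 | 174 | 3400 | female | 2007 |
| Adelie | Biscoe | 41.6 | 18.0 | 192 | 3950 | male | 2008 |
| Adelie | Biscoe | 36.5 | 16.6 | 181 | 2850 | female | 2008 |
| Adelie | Biscoe | 40.1 | 18.9 | 188 | 4300 | male | 2008 |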
How well does the shadow population match against the sample population?
Pretty well! In fact, we can tell that each row in the sample data found a match in the source data across all three of the variables we asked it to match against (the exact body mass values don’t match of course, but there was a match in the 250g mass range for each of the sample individuals).
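One way to check this claim is to compare the two tables row by row. The sketch below copies the relevant columns from the sample and shadow tables above and assumes that the shadow rows are listed in the same order as the sample rows they were matched to. The object names are only illustrative.

```r
# Values copied from the sample and shadow tables above.
sample_tbl <- data.frame(
  sex = c("male", "female", "male", "female", "male"),
  body_mass_g = c(4050, 3400, 3950, 2850, 4300),
  year = c(2008, 2007, 2008, 2008, 2008)
)
shadow_tbl <- data.frame(
  sex = c("male", "female", "male", "female", "male"),
  body_mass_g = c(4100, 3250, 3950, 2900, 4350),
  year = c(2008, 2007, 2008, 2008, 2008)
)

# Lower edges of the 250 g bins; findInterval() returns the bin index
# (0 for masses below 3000 g, 7 for masses of 4500 g and above).
bin_edges <- seq(3000, 4500, by = 250)

comparison <- data.frame(
  same_sex = sample_tbl$sex == shadow_tbl$sex,
  same_mass_range = findInterval(sample_tbl$body_mass_g, bin_edges) ==
    findInterval(shadow_tbl$body_mass_g, bin_edges),
  same_year = sample_tbl$year == shadow_tbl$year
)
comparison
```

Every cell of comparison should be TRUE if each sample row was matched on all three variables.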
All other things being equal, the larger the source data, the more likely you are to get a good match for each member of the sample population.
The more variables you try to match against, the less likely you are to find an exact shadow in the source data across all chosen variables.
If shadowpop() doesn’t find a shadow match across all the initially specified match_cols, it will ignore the last one and try again to find a shadow row using the remaining variables. Then it will try again, this time ignoring the last two variables in the list, and so on until there are no more match variables to try.
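The fallback behaviour described above can be illustrated with a short base R sketch. This is not shadowpop's actual implementation. find_shadow() and the toy data are hypothetical, and the sketch handles a single sample row only, ignoring details such as whether a source row may be reused.

```r
# Hypothetical sketch (not shadowpop's code): find one source row matching a
# single sample row, dropping trailing match variables until a match appears.
find_shadow <- function(sample_row, source, match_vars) {
  for (n in rev(seq_along(match_vars))) {
    vars <- match_vars[seq_len(n)]
    # TRUE for source rows that agree with the sample row on every variable
    is_match <- Reduce(`&`, lapply(vars, function(v) {
      source[[v]] %in% sample_row[[v]]
    }))
    if (any(is_match)) {
      return(list(row = which(is_match)[1], matched_on = vars))
    }
  }
  # No variables left to try
  list(row = NA_integer_, matched_on = character(0))
}

# Toy data, made up for illustration
toy_source <- data.frame(
  sex = c("female", "male", "male"),
  body_mass_range = c("3250-3499", "3750-3999", "4000-4249"),
  year = c(2007, 2007, 2008)
)
toy_sample_row <- data.frame(
  sex = "male", body_mass_range = "3750-3999", year = 2008
)

result <- find_shadow(
  toy_sample_row, toy_source, c("sex", "body_mass_range", "year")
)
result
```

In this toy case no source row matches on all three variables, so the search falls back to sex and body_mass_range and picks the second source row. This is also why the order of the match variables matters: the ones listed last are the first to be dropped.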