dplyr’s Lesser Known Joins and When to Use Them

R
dplyr
data-science
I explain dplyr’s semi_join, anti_join, and join_by commands, and when they might be useful to anyone who works with data.
Author

Aaron Graybill

Published

April 28, 2024

Intro

A puzzle fitting together

 

In this post, I’ll assume that you have some exposure to data wrangling in R using dplyr. If you’re familiar with mutate, summarize, and in particular, (left/right/inner/full)_join, you should have no problem.

Introducing the Working Data

Before I get into the joins, I will introduce the primary datasets for this tutorial. You can expand the code below to see how I create the data for this tutorial.

Click to see Code that Creates the Working Data
# Load dplyr without printing masking messages
library(dplyr, warn.conflicts = FALSE)

## Get and clean Best New Artist data ----
best_new_artist_url <- 'https://en.wikipedia.org/w/index.php?title=Grammy_Award_for_Best_New_Artist&oldid=1205945357'
html_content <- rvest::read_html(best_new_artist_url)

# Convert each html table to a dataframe
tables <- 
  rvest::html_nodes(html_content, "table") %>% 
  lapply(rvest::html_table)

# For each table on the page, only keep tables that have the correct
# column names
correct_columns <- c("Year[I]","Recipient", "Nominees")
new_artist_tables <- tables[sapply(tables,function(df) {all(correct_columns %in% colnames(df))})]

# Combine multiple tables into one table
best_new_artists <- bind_rows(new_artist_tables)

# Clean up names and fix types
best_new_artists <- 
  best_new_artists %>% 
  select(
    year = `Year[I]`,
    recipient = Recipient
  ) %>% 
  mutate(year = as.integer(year))

## Get and Album of the Year data ----
album_of_the_year_url <- 'https://en.wikipedia.org/w/index.php?title=Grammy_Award_for_Album_of_the_Year&oldid=1208622825'
html_content <- rvest::read_html(album_of_the_year_url)

# Convert each html table to a dataframe
tables <- 
  rvest::html_nodes(html_content, "table") %>% 
  lapply(rvest::html_table)

# For each table on the page, only keep tables that have the correct
# column names
correct_columns <- correct_columns <- c("Year[a]","Album", "Artist(s)")
album_of_the_year_tables <- tables[sapply(tables,function(df) {all(correct_columns %in% colnames(df))})]

# Combine multiple tables into one table
albums_of_the_year <- bind_rows(album_of_the_year_tables)

# Remove missing awards, and remove suffix from year
albums_of_the_year <- 
  albums_of_the_year %>% 
  filter(!is.na(Album)) %>% 
  mutate(`Year[a]` = substr(`Year[a]`,1,4))

# For each year, get only the first year (which is the winner). The removed
# rows are non-winning nominees
albums_of_the_year <- 
  albums_of_the_year %>% 
  slice_head(n=1, by = `Year[a]`)

# Clean up names and fix types
albums_of_the_year <- 
  albums_of_the_year %>% 
  select(
    year = `Year[a]`,
    album = Album,
    artists = `Artist(s)`
  ) %>% 
  mutate(year = as.integer(year))

# Only keep what we need
rm(list = setdiff(ls(),c( "albums_of_the_year", "best_new_artists")))

For this tutorial I have two primary datasets, both scraped from Wikipedia. First, best_new_artists is the list of winners of the Grammy’s “Best New Artist” Award from 1960 to 2024. This data has two columns, year (the year of the award) and recipient (the winner of the award). The end of the data looks like this:

tail(best_new_artists)
# A tibble: 6 × 2
   year recipient          
  <int> <chr>              
1  2019 Dua Lipa           
2  2020 Billie Eilish      
3  2021 Megan Thee Stallion
4  2022 Olivia Rodrigo     
5  2023 Samara Joy         
6  2024 Victoria Monét     

The other dataset in this tutorial is albums_of_the_year which contains the winners of the Grammy’s “Album of the Year” award from 1959-2024. This dataset has three columns: year (the year of the award), album (the title of the winning album), and artists (the artist(s) who made the album).

The last few rows of this dataset are:

tail(albums_of_the_year)
# A tibble: 6 × 3
   year album                                    artists        
  <int> <chr>                                    <chr>          
1  2019 Golden Hour                              Kacey Musgraves
2  2020 When We All Fall Asleep, Where Do We Go? Billie Eilish  
3  2021 Folklore                                 Taylor Swift   
4  2022 We Are                                   Jon Batiste    
5  2023 Harry's House                            Harry Styles   
6  2024 Midnights                                Taylor Swift   

Going forward, I will ask questions about how these two datasets interact.

semi_join (everything from dataset x that matches dataset y)

The semi_join command is like a version of dplyr’s filter command that works between two datasets.

Basic Usage of semi_join

I think the easiest way to learn how to use these dplyr commands is to use them to answer motivating questions. A motivating question for semi_join could be:

Which winners of Best New Artist ever won Album of the Year?

The code below uses a semi_join to return all of the best new artists where the recipient field matches the artists field from the album of the year data

semi_join(
  best_new_artists,
  albums_of_the_year,
  by = c("recipient" = "artists")
)
# A tibble: 8 × 2
   year recipient        
  <int> <chr>            
1  1961 Bob Newhart      
2  1965 The Beatles      
3  1976 Natalie Cole     
4  1981 Christopher Cross
5  1999 Lauryn Hill      
6  2003 Norah Jones      
7  2009 Adele            
8  2020 Billie Eilish    

Notice that not all winners of Best New Artist appear here. For example Olivia Rodrigo won Best New Artist in 2022, but has not (yet) won Album of the Year.

The Beatles won album of the year in 1968 for Sgt. Pepper’s Lonely Hearts Club Band, but this semi_join is only showing us the Beatles in 1965. Where’s Sgt. Peppers? This result is actually by design! semi_join only returns rows and columns from the first dataset.

If you’re like me, you might initially have a hard time distinguishing semi_join from inner_join. The key difference is that the semi_join. Only has the columns from the first dataset. Compare the results of the semi_join (above) to the equivalent inner_join (below):

inner_join(
  best_new_artists,
  albums_of_the_year,
  by = c("recipient" = "artists")
)
# A tibble: 9 × 4
  year.x recipient         year.y album                                   
   <int> <chr>              <int> <chr>                                   
1   1961 Bob Newhart         1961 The Button-Down Mind of Bob Newhart     
2   1965 The Beatles         1968 Sgt. Pepper's Lonely Hearts Club Band   
3   1976 Natalie Cole        1992 Unforgettable... with Love              
4   1981 Christopher Cross   1981 Christopher Cross                       
5   1999 Lauryn Hill         1999 The Miseducation of Lauryn Hill         
6   2003 Norah Jones         2003 Come Away with Me                       
7   2009 Adele               2012 21                                      
8   2009 Adele               2017 25                                      
9   2020 Billie Eilish       2020 When We All Fall Asleep, Where Do We Go?

Notice that we now have a year.x column (the year of the Best New Artist award) and a year.y column (the year of the Album of the Year).

semi_joining on Multiple Columns

Just like any other dplyr join, you can semi_join on mutliple columns. This allows us to answer questions like:

Which winners of Best New Artist won Album of the Year in the same year as Best New Artist award?

Contrast this with our question from before:

Which winners of Best New Artist ever won Album of the Year?

Now we’re only asking for years where the Album of the Year and the Best New Artist occur simultaneously. To accomplish this, we can join on both artist and the year of the award:

semi_join(
  best_new_artists,
  albums_of_the_year,
  by = c("recipient" = "artists", "year")
)
# A tibble: 5 × 2
   year recipient        
  <int> <chr>            
1  1961 Bob Newhart      
2  1981 Christopher Cross
3  1999 Lauryn Hill      
4  2003 Norah Jones      
5  2020 Billie Eilish    

A very short list! Again notice that semi_join does not return the actual album, we would require a “mutating” join, like left/right/inner/full to get the album name as well.

When to Use semi_join

I often use semi_join when I have complicated dataset that I need to filter on a few conditions.

For example, I might have a dataset of multiple timeseries of economic indicators at the county level. And then, due to a research question, we have identified specific combinations of counties and years that are interesting. I could use semi_join to get the values of the economic indicators only in the years and counties that we are interested in.

anti_join (everything from dataset x that doesn’t match dataset y)

Basic Usage

The anti_join function is essentially the opposite of semi_join. Recall that semi_join was used to answer questions like:

Which winners of Best New Artist ever won Album of the Year?

anti_join can be used to answer questions like:

Which winners of Best New Artist never won Album of the Year?

That is, anti_join returns all of the rows in the first dataset, that do not have a match in the second dataset. See the example below:

anti_join(
  best_new_artists,
  albums_of_the_year,
  by = c("recipient" = "artists")
)
# A tibble: 57 × 2
    year recipient            
   <int> <chr>                
 1  1960 Bobby Darin          
 2  1962 Peter Nero           
 3  1963 Robert Goulet        
 4  1964 The Swingle Singers  
 5  1966 Tom Jones            
 6  1967 No award             
 7  1968 Bobbie Gentry        
 8  1969 José Feliciano       
 9  1970 Crosby, Stills & Nash
10  1971 The Carpenters       
# ℹ 47 more rows

Similar to semi_join, anti_join only returns rows and columns from the first dataset, so the results above do not include any of the columns from the Album of the Year dataset.

anti_join on Multiple columns

When using anti_join on multiple columns, we return all of the rows from the first dataset, that do not match all of the columns in the second dataset. This allows us to answer questions like:

Which winners of Best New Artists were not Album of the Year Winners in their same year that they won their Best New Artist Award?

anti_join(
  best_new_artists,
  albums_of_the_year,
  by = c("recipient" = "artists", "year")
)
# A tibble: 60 × 2
    year recipient            
   <int> <chr>                
 1  1960 Bobby Darin          
 2  1962 Peter Nero           
 3  1963 Robert Goulet        
 4  1964 The Swingle Singers  
 5  1965 The Beatles          
 6  1966 Tom Jones            
 7  1967 No award             
 8  1968 Bobbie Gentry        
 9  1969 José Feliciano       
10  1970 Crosby, Stills & Nash
# ℹ 50 more rows

The example above is a bit contrived, but notice that there is no value for the year 1961. That is because in 1961 Bob Newhart won both Best New Artist and Album of the Year (in the same year), so he would not appear in the anti_join.

Further notice that the Beatles do appear in the anti_join despite having won both Best New Artist and Album of the Year. They appear because their Album of the Year win and Best New Artist win occurred in different years.

When to Use anti_join

I have found anti_join useful in two main situations. The first situation is when I need to exclude/remove problematic entries in a dataframe. For example, if I have figured out that all of the data from certain counties is wrong, I might wish to exclude that from subsequent analysis. Adding an anti_join on state and county would remove only the records corresponding to these problematic counties.

My other use for anti_join is as a debugger. Sometimes when I’m building a complicated join that isn’t working as expected, I will replace the (left/full/semi)_join with an anti_join to see which records are not being matched between datasets. Isolating just the unmatched records allows you to quickly see if there’s anything off about those rows like capitalization, NAs, etc.

join_by (inequality conditions with other joins)

dplyr’s join_by function was the inspiration for this tutorial. It allows you to quickly implement complex joins that would feel almost impossible otherwise.

Basic Usage

Unlike the other functions in this tutorial join_by appears inside of other joins. Effectively, join_by creates a more specific version of what is normally specified in the by = c(...) argument. For example, the following two joins are equivalent.

df_regular <- semi_join(
  best_new_artists,
  albums_of_the_year,
  by = c("recipient" = "artists")
)

df_join_by <- semi_join(
  best_new_artists,
  albums_of_the_year,
  by = join_by(recipient == artists)
)

# print out dataframe
df_join_by
# A tibble: 8 × 2
   year recipient        
  <int> <chr>            
1  1961 Bob Newhart      
2  1965 The Beatles      
3  1976 Natalie Cole     
4  1981 Christopher Cross
5  1999 Lauryn Hill      
6  2003 Norah Jones      
7  2009 Adele            
8  2020 Billie Eilish    
# make sure two methods are exactly the same
all.equal(df_regular,df_join_by)
[1] TRUE

Just to refresh on what semi_join is doing, in the above, we are answering the question:

Which Best New Artist winners also won Album of the Year?

Notice that in the join_by example above, I am using unquoted variable names and the binary operator == instead of the assignment operator =. This is because join_by allows for inequality conditions as well! join_by allows us to ask complicated complicated questions like:

For the people who won both Album of the Year and Best New Artist, which Best New Artist winners won their award before they won Album of the Year?

semi_join(
  best_new_artists,
  albums_of_the_year,
  by = join_by(recipient == artists,
               year < year)
)
# A tibble: 3 × 2
   year recipient   
  <int> <chr>       
1  1965 The Beatles 
2  1976 Natalie Cole
3  2009 Adele       

Notice that this list is much smaller than the previous join! That’s because everyone who is now excluded won their Best New Artist award in the same year as their Best New Artist award.

The above code is much more elegant than an equivalent implementation using the basic by = c(...) syntax.

Advanced Usage

We have already discussed join_by’s inequality joins, but it can actually do even more than this! I’ll point you to the documentation for join_by which details all of the available options.

But as a teaser, I will tell you about the closest option. Let’s say you have two datasets. One is a dataset of the dates that a baseball player is traded between teams. The other dataset is the dates where that player has played in games. If you wanted to find this player’s first game with each team, you could use something like join_by(closest(game_date >= trade_date)) which would only match the first game that appears on or after the trade date!

Of course, this example is contrived, but the closest option can be extremly useful when the data you’re working with is large.

When to Use join_by

Often, there are ways to avoid using join by using clever joins and filters, but join_by can make this code much more clear. In addition, doing join and then filter often returns way more results than are actually needed before the filter occurs. When working with large datasets, often the full results of the join will not fit in your computer’s memory, so you have to use a join_by to return only the results that are needed.

I find join_by most useful when I have two datasets that both have dates that don’t match exactly. Here join_by allows me to look at which records from one dataset are closest or contained within the dates of another.

Wrapping Up 🎁

In this post I explained how to use dplyr’s semi_join, anti_join, and join_by functions. These functions can be very helpful to anyone who works with data!

Please feel free to leave a comment below if you have any questions!