Canonical tidyverse method to update some values of a vector from a look-up table

22

I frequently need to recode some (not all!) values in a data frame column based off of a look-up table. I'm not satisfied by the ways I know of to solve the problem. I'd like to be able to do it in a clear, stable, and efficient way. Before I write my own function, I'd want to make sure I'm not duplicating something standard that's already out there.

## Toy example
data = data.frame(
  id = 1:7,
  x = c("A", "A", "B", "C", "D", "AA", ".")
)

lookup = data.frame(
  old = c("A", "D", "."),
  new = c("a", "d", "!")
)

## desired result
#   id  x
# 1  1  a
# 2  2  a
# 3  3  B
# 4  4  C
# 5  5  d
# 6  6 AA
# 7  7  !

I can do it with a join, coalesce, unselect as below, but this isn't as clear as I'd like - too many steps.

## This works, but is more steps than I want
library(dplyr)
data %>%
  left_join(lookup, by = c("x" = "old")) %>%
  mutate(x = coalesce(new, x)) %>%
  select(-new)

It can also be done with dplyr::recode, as below, converting the lookup table to a named lookup vector. I prefer lookup as a data frame, but I'm okay with the named vector solution. My concern here is that recode is the Questioning lifecycle phase, so I'm worried that this method isn't stable.

lookup_v = pull(lookup, new) %>% setNames(lookup$old)
data %>%
  mutate(x = recode(x, !!!lookup_v))

It could also be done with, say, stringr::str_replace, but using regex for whole-string matching isn't efficient. I suppose there is forcats::fct_recode is a stable version of recode, but I don't want a factor output (though mutate(x = as.character(fct_recode(x, !!!lookup_v))) is perhaps my favorite option so far...).

I had hoped that the new-ish rows_update() family of dplyr functions would work, but it is strict about column names, and I don't think it can update the column it's joining on. (And it's Experimental, so doesn't yet meet my stability requirement.)

Summary of my requirements:

  • A single data column is updated based off of a lookup data frame (preferably) or named vector (allowable)
  • Not all values in the data are included in the lookup--the ones that are not present are not modified
  • Must work on character class input. Working more generally is a nice-to-have.
  • No dependencies outside of base R and tidyverse packages (though I'd also be interested in seeing a data.table solution)
  • No functions used that are in lifecycle phases like superseded or questioning. Please note any experimental lifecycle functions, as they have future potential.
  • Concise, clear code
  • I don't need extreme optimization, but nothing wildly inefficient (like regex when it's not needed)
Share
Improve this question
3
  • data %>% mutate(x = str_replace_all(x, setNames(lookup$new, lookup$old))) is certainly less efficient than the matching and join, however, I'm not sure whether it's wildly inefficient with decent sized datasets. – tmfmnk Apr 13 at 20:38
  • 1
    Well, to be safe you'd need to use something like sprintf("^%s$", lookup$new) as the pattern, and even then I'd have concerns about regex special characters in the old names. I'm biased from an old project I used to work on where regex was a performance bottleneck (quite complex regex, mostly), but I definitely have a pet peeve about using regex for exact whole string matches. But on 100k rows my guess is str_replace_all will be at least 10x slower than most of these other solutions. – Gregor Thomas Apr 13 at 22:52
  • If the recode and fct_recode variants can't deal with ., then they're not general enough. – Gregor Thomas Apr 20 at 13:30

Comments

Popular posts from this blog

Meaning of `{}` for return expression

Get current scroll position of ScrollView in React Native

flutter websocket connection issue