regex match with fuzzyjoin / dplyr

Question

I have two data frames that I want to join on their first columns, ignoring case:

df3 <- data.frame(A = c("XX28801", "ZZ9"), B = c("one", "two"), stringsAsFactors = FALSE)
df4 <- data.frame(Z = c("X2880", "Zz9"), C = c("three", "four"), stringsAsFactors = FALSE)

What I want is this:

df5<- data.frame(A = c("XX28801","ZZ9"), B = c("one","two"), Z = c(NA,"Zz9"), C = c(NA, "four"))

Instead, this call from the fuzzyjoin package matches both rows:

library(fuzzyjoin)
join <- regex_left_join(df3, df4, by = c("A" = "Z"), ignore_case = TRUE)

It's good that ZZ9 and Zz9 matched, but I have no idea why XX28801 matched X2880. The only similarity is that X2880 appears inside XX28801.

I also don't want to uppercase or lowercase the values before joining, because columns A and Z should retain their original values. Thanks.

Answer 1:

Regex joins treat the join column of the right-hand table as a regular expression and search for it within the text of the left-hand table. Because "X2880" is found within "XX28801", the pair counts as a match.

To understand regex better, you might find it useful to explore some comparisons using grepl(pattern, text), which returns TRUE or FALSE depending on whether the pattern is found within the text:

> grepl('X2880', 'XX28801', ignore.case = TRUE)
[1] TRUE
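As a small illustration of why the partial match happens, here is a sketch in base R that compares the unanchored pattern with one wrapped in the standard regex anchors ^ (start of string) and $ (end of string). The variable names are only for illustration.

```r
# Unanchored: the pattern may match anywhere inside the text.
unanchored <- grepl("X2880", "XX28801", ignore.case = TRUE)

# Anchored with ^ and $: the pattern has to span the whole text.
anchored_partial <- grepl("^X2880$", "XX28801", ignore.case = TRUE)
anchored_case    <- grepl("^Zz9$", "ZZ9", ignore.case = TRUE)

print(c(unanchored, anchored_partial, anchored_case))
# [1]  TRUE FALSE  TRUE
```

With anchors, the partial match is rejected while the case-insensitive whole-string match still succeeds.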

It seems like you want a match only when the two strings are identical apart from capitalization. For this I would recommend creating temporary columns to join on:

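library(dplyr)

# Lowercased copies of the keys, used only for joining
df3_w_lower = df3 %>%
  mutate(A_for_join = tolower(A))
df4_w_lower = df4 %>%
  mutate(Z_for_join = tolower(Z))

# left_join keeps only the left-hand key (A_for_join), so that is the only one to drop
join = left_join(df3_w_lower, df4_w_lower, by = c("A_for_join" = "Z_for_join")) %>%
  select(-A_for_join)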

By using temporary columns for joining you preserve the capitalization in the original columns.
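If you would rather keep using fuzzyjoin, one possible alternative is to anchor each pattern so that it has to match the whole string. This is only a sketch, not part of the original answer: Z_pattern is a hypothetical helper column, and it assumes the values in Z contain no regex metacharacters (otherwise they would need escaping first).

```r
library(dplyr)
library(fuzzyjoin)

df3 <- data.frame(A = c("XX28801", "ZZ9"), B = c("one", "two"), stringsAsFactors = FALSE)
df4 <- data.frame(Z = c("X2880", "Zz9"), C = c("three", "four"), stringsAsFactors = FALSE)

# Z_pattern (hypothetical helper column): Z wrapped in ^...$ so the regex
# must match all of A. Assumes Z holds no regex metacharacters.
df4_anchored <- df4 %>%
  mutate(Z_pattern = paste0("^", Z, "$"))

# Join on the anchored pattern, ignoring case, then drop the helper column.
join <- regex_left_join(df3, df4_anchored, by = c("A" = "Z_pattern"), ignore_case = TRUE) %>%
  select(-Z_pattern)

print(join)
```

Under these assumptions, XX28801 should no longer pick up X2880, while ZZ9 still pairs with Zz9 and both columns keep their original capitalization.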

Source: https://stackoverflow.com/questions/64914070/regex-match-with-fuzzyjoin-dplyr

Tags

dplyr