regex match with fuzzyjoin / dplyr

Submitted by China☆狼群 on 2021-02-08 11:18:21

Question


I have two data frames that I want to join on their first columns, ignoring case:

df3 <- data.frame("A" = c("XX28801", "ZZ9"), "B" = c("one", "two"), stringsAsFactors = FALSE)
df4 <- data.frame("Z" = c("X2880", "Zz9"), "C" = c("three", "four"), stringsAsFactors = FALSE)

What I want is this:

df5 <- data.frame(A = c("XX28801", "ZZ9"), B = c("one", "two"), Z = c(NA, "Zz9"), C = c(NA, "four"))

But, interestingly, I get a different result using the fuzzyjoin package:

library(fuzzyjoin)
join <- regex_left_join(df3, df4, by = c("A" = "Z"), ignore_case = TRUE)

It's good that ZZ9 and Zz9 matched, but I have no idea why XX28801 matched X2880. The only similarity is that X2880 appears inside XX28801.

I also don't want to uppercase/lowercase the values before joining as I want column A and column Z to retain their original values. Thanks.


Answer 1:


A regex join treats the values in the right-hand table as regular expressions and searches for them within the text of the left-hand table. Because "X2880" is found within "XX28801", this counts as a match.

To understand regex better, you might find it useful to explore some comparisons using grepl(pattern, text), which returns TRUE or FALSE depending on whether the pattern is found within the text:

> grepl('X2880', 'XX28801', ignore.case = TRUE)
[1] TRUE
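As a small sketch of the difference between a partial and a whole-string match, the same grepl call can be repeated with the pattern anchored by ^ and $. The variable names here are only illustrative, and this assumes the keys contain no regex metacharacters:

```r
# Unanchored: the pattern may match anywhere inside the text
partial <- grepl("X2880", "XX28801", ignore.case = TRUE)

# Anchored with ^ and $: the pattern has to span the whole text
whole_1 <- grepl("^X2880$", "XX28801", ignore.case = TRUE)
whole_2 <- grepl("^Zz9$", "ZZ9", ignore.case = TRUE)

print(c(partial = partial, whole_1 = whole_1, whole_2 = whole_2))
```

Only the unanchored call matches "XX28801"; once anchored, "X2880" no longer matches it, while "Zz9" still matches "ZZ9" because case is ignored.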

It seems like you want a match only when the two strings are identical apart from upper/lower case. For this I would recommend creating temporary columns to join on:

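library(dplyr)

# Add lowercase copies of the key columns to join on
df3_w_lower <- df3 %>%
  mutate(A_for_join = tolower(A))
df4_w_lower <- df4 %>%
  mutate(Z_for_join = tolower(Z))

# left_join keeps only the left-hand key (A_for_join) in its result,
# so that is the only helper column left to drop
join <- left_join(df3_w_lower, df4_w_lower, by = c("A_for_join" = "Z_for_join")) %>%
  select(-A_for_join)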

By using temporary columns for joining you preserve the capitalization in the original columns.
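For comparison, the same case-insensitive exact match can be sketched in base R without helper columns. This is an alternative to the answer's dplyr approach, not a copy of it, and it assumes each key appears at most once in df4, since match() returns only the first matching position. The names idx and result are illustrative:

```r
df3 <- data.frame(A = c("XX28801", "ZZ9"), B = c("one", "two"), stringsAsFactors = FALSE)
df4 <- data.frame(Z = c("X2880", "Zz9"), C = c("three", "four"), stringsAsFactors = FALSE)

# Position in df4 of the first case-insensitive exact match
# for each row of df3 (NA where there is none)
idx <- match(tolower(df3$A), tolower(df4$Z))

# Indexing df4 with those positions yields all-NA rows for the NA positions
result <- cbind(df3, df4[idx, , drop = FALSE])
rownames(result) <- NULL

print(result)
```

With the sample data, the XX28801 row gets NA in both Z and C, and the ZZ9 row picks up Zz9 and four, with the original capitalization in A and Z untouched.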



Source: https://stackoverflow.com/questions/64914070/regex-match-with-fuzzyjoin-dplyr
