Fuzzy matching two strings using R

偶尔善良 posted on 2021-01-27 19:08:41

Question


I have two vectors, each of which includes a series of strings. For example,

V1=c("pen", "document folder", "warn")
V2=c("pens", "copy folder", "warning")

I need to find which pairs match best. I tried using the Levenshtein distance directly, but it is not good enough. In my case, pen and pens should mean the same thing, document folder and copy folder are probably the same thing, and warn and warning are effectively the same. I am trying to use packages like tm, but I am not sure which functions are suitable for this. Can anyone advise?


Answer 1:


In my experience, the cosine distance (available in the stringdist package) works well for this kind of job:

library(stringdist)  # provides stringdist()

V1 <- c("pen", "document folder", "warn")
V2 <- c("copy folder", "warning", "pens")
# One column per V1 entry, one row per V2 entry
result <- sapply(V1, function(x) stringdist(x, V2, method = 'cosine', q = 1))
rownames(result) <- V2
result
                  pen document folder      warn
copy folder 0.6797437       0.2132042 0.8613250
warning     0.6150998       0.7817821 0.1666667
pens        0.1339746       0.6726732 0.7500000

You have to define a cutoff below which the distance counts as close enough: the lower the distance, the better the match. You can also play with the q parameter, which sets how many consecutive characters (the q-gram size) are compared at a time. For example:

result <- sapply(V1, function(x) stringdist(x, V2, method = 'cosine', q = 3))
rownames(result) <- V2
result
                  pen document folder      warn
copy folder 1.0000000       0.5377498 1.0000000
warning     1.0000000       1.0000000 0.3675445
pens        0.2928932       1.0000000 1.0000000
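The cutoff idea can be sketched as follows. This is only an illustration: the threshold of 0.3 and the names `cutoff`, `best`, `best_dist`, and `matches` are assumptions, not part of the answer, and `stringdistmatrix()` is used here instead of the `sapply()` call above (note that it puts V1 on the rows rather than the columns).

```r
library(stringdist)

V1 <- c("pen", "document folder", "warn")
V2 <- c("copy folder", "warning", "pens")

# Distance matrix: rows follow V1, columns follow V2
d <- stringdistmatrix(V1, V2, method = "cosine", q = 1)

cutoff <- 0.3  # hypothetical threshold; tune it on your own data

best <- apply(d, 1, which.min)               # closest V2 position per V1 entry
best_dist <- d[cbind(seq_along(V1), best)]   # distance to that closest entry

# Keep a match only when it is close enough, otherwise report NA
matches <- data.frame(
  string_to_match = V1,
  closest_match   = ifelse(best_dist <= cutoff, V2[best], NA_character_),
  distance        = best_dist
)
matches
```

With the q = 1 distances shown earlier, all three best matches fall below 0.3, so none would be dropped; a stricter cutoff would start returning NA for the weaker pairs.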



Answer 2:


The Levenshtein distance (see its Wikipedia article) measures how many delete/substitute/insert operations are needed to transform one string into another. One approach to fuzzy matching is to minimize this value.
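As a small worked illustration of that count, using two pairs from the question (the variable names `d_pen` and `d_warn` are only for this sketch):

```r
# adist() ships with base R (package utils) and returns a matrix of distances
d_pen  <- adist("pen", "pens")[1, 1]      # one insertion: "s"
d_warn <- adist("warn", "warning")[1, 1]  # three insertions: "i", "n", "g"

c(pen = d_pen, warn = d_warn)
```

The raw count does not take string length into account, which is one reason a plain Levenshtein threshold can feel too coarse for short words.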

Here's an example. I shuffled the order of V2 a bit to make it less boring:

V1 <- c("pen", "document folder", "warn")
V2 <- c("copy folder", "warning", "pens")

apply(adist(x = V1, y = V2), 1, which.min)
[1] 3 1 2

The output gives, for each element of V1 in order, the position in V2 of its closest match.

data.frame(string_to_match = V1, 
           closest_match = V2[apply(adist(x = V1, y = V2), 1, which.min)])
  string_to_match closest_match
1             pen          pens
2 document folder   copy folder
3            warn       warning
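Because `which.min` always returns some position, it can help to keep the distance next to each match so that poor matches can be filtered out later. A minimal sketch, assuming a length-scaled score is wanted; the `relative` column and the helper names `best`, `best_dist`, `longer`, and `res` are illustrative additions, not part of the answer:

```r
V1 <- c("pen", "document folder", "warn")
V2 <- c("copy folder", "warning", "pens")

d <- adist(V1, V2)                           # rows follow V1, columns follow V2
best <- apply(d, 1, which.min)               # closest V2 position per V1 entry
best_dist <- d[cbind(seq_along(V1), best)]   # edit distance to that entry

# Assumption: divide by the longer string of each pair, giving a score in [0, 1]
longer <- pmax(nchar(V1), nchar(V2[best]))

res <- data.frame(
  string_to_match = V1,
  closest_match   = V2[best],
  distance        = best_dist,
  relative        = best_dist / longer
)
res
```

A cutoff can then be applied to either column, depending on whether absolute or relative edit counts matter more for the data at hand.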


Source: https://stackoverflow.com/questions/40299192/fuzzy-matching-two-strings-uring-r
