remove duplicates from list based on semantic similarity/relatedness

…衆ロ難τιáo~ 提交于 2019-12-03 19:38:42

问题


R + tm: How do I de-duplicate items in a list, based on semantic similarity? v<-c("bank","banks","banking", "ford_suv',"toyota_suv","nissan_suv"). My expected solution would be c("bank", "ford_suv',"toyota_suv","nissan_suv"). That is, bank, banks and banking to be reduced to one term "bank." SnowBall::stemming is not an option because I have to retain the flavor of newspaper styles of various countries. Any help or direction will be useful.


回答1:


We could calculate the Levenshtein distance between words using adist and regroup them into clusters using hclust

d <- adist(v)
rownames(d) <- v

Which gives a matrix of distance between terms:

#              [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]
#bank              0    1    3    8    9    8    2   13    6     5     3     4
#banks             1    0    3    7    9    7    2   13    6     6     2     5
#banking           3    3    0    8   10    8    3   13    7     6     3     7
#ford_suv          8    7    8    0    5    6    8   12    7     7     8     4
#toyota_suv        9    9   10    5    0    6    9    7    4     9     9     9
#nissan_suv        8    7    8    6    6    0    8   13   10     4     8    10
#banker            2    2    3    8    9    8    0   12    6     6     1     6
#toyota_corolla   13   13   13   12    7   13   12    0    8    13    12    12
#toyota            6    6    7    7    4   10    6    8    0     6     7     5
#nissan            5    6    6    7    9    4    6   13    6     0     7     6
#bankers           3    2    3    8    9    8    1   12    7     7     0     6
#ford              4    5    7    4    9   10    6   12    5     6     6     0

Then we can pass it to hclust using method = ward.D

cl <- hclust(as.dist(d), method  = "ward.D")
plot(cl)

Which gives:

We notice 4 distinct clusters (that we can illustrate using rect.hclust(cl, 4))

Now, we can turn this result into a data.frame and tag each cluster with it's shortest term:

library(dplyr)
data.frame(group = cutree(cl, 4)) %>%
  tibble::rownames_to_column("term") %>%
  group_by(group) %>%
  mutate(tag = term[nchar(term) == min(nchar(term))]) 

Which gives:

#Source: local data frame [12 x 3]
#Groups: group [4]
#
#             term group      tag
#            <chr> <int>    <chr>
#1            bank     1     bank
#2           banks     1     bank
#3         banking     1     bank
#4        ford_suv     2     ford
#5      toyota_suv     3   toyota
#6      nissan_suv     4   nissan
#7          banker     1     bank
#8  toyota_corolla     3   toyota
#9          toyota     3   toyota
#10         nissan     4   nissan
#11        bankers     1     bank
#12           ford     2     ford

Should we want to extract only the unique tag for each cluster, we could add ... %>% distinct(tag) %>% .$tag to the pipe which would give:

#[1] "bank"   "ford"   "toyota" "nissan"

Reference

?adist

The (generalized) Levenshtein (or edit) distance between two strings s and t is the minimal possibly weighted number of insertions, deletions and substitutions needed to transform s into t (so that the transformation exactly matches t).

?hclust

This function performs a hierarchical cluster analysis using a set of dissimilarities for the n objects being clustered. Initially, each object is assigned to its own cluster and then the algorithm proceeds iteratively, at each stage joining the two most similar clusters, continuing until there is just a single cluster.


Note: I used data provided by @Abdou in the comments as it represents a more complete use case



来源:https://stackoverflow.com/questions/38956241/remove-duplicates-from-list-based-on-semantic-similarity-relatedness

标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!