remove duplicates from list based on semantic similarity/relatedness
R + tm: How do I de-duplicate items in a list based on semantic similarity?

v <- c("bank", "banks", "banking", "ford_suv", "toyota_suv", "nissan_suv")

My expected result would be c("bank", "ford_suv", "toyota_suv", "nissan_suv"). That is, "bank", "banks" and "banking" should be reduced to the single term "bank". SnowBall::stemming is not an option because I have to retain the flavor of the newspaper styles of various countries. Any help or direction would be useful.

We could calculate the Levenshtein distance between the words using adist and group them into clusters using hclust:

d <- adist(v)
rownames(d) <- v

Which
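For reference, here is a minimal sketch of how the adist/hclust idea could be carried through to a de-duplicated vector. The cut height of 4 and the rule "keep the shortest term in each cluster" are assumptions made for this illustration and are not part of the original answer. Edit distance measures spelling similarity rather than true semantic similarity, so the threshold would need tuning on real data.

```r
# Example vector from the question (with the quoting fixed)
v <- c("bank", "banks", "banking", "ford_suv", "toyota_suv", "nissan_suv")

# Pairwise Levenshtein (edit) distances between the terms
d <- adist(v)
rownames(d) <- v

# Hierarchical clustering on the distance matrix (complete linkage by default)
hc <- hclust(as.dist(d))

# Cut the tree at an ASSUMED height: terms whose cluster merges at an
# edit distance of 4 or less end up in the same group
groups <- cutree(hc, h = 4)

# ASSUMED rule for the representative: keep the shortest term per group
result <- as.vector(tapply(v, groups, function(x) x[which.min(nchar(x))]))

print(result)
```

On this example the three "bank" variants fall into one cluster and each "_suv" term stays in its own, so result matches the expected vector in the question. Calling plot(hc) shows the dendrogram and can help when choosing a cut height for other data.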