Identifying near duplicate entries using synonyms in R

Submitted by ぃ、小莉子 on 2020-01-04 16:58:34

Question


I am trying to identify near-duplicate entries of names in a database. I am new to databases, but I am familiar with R. I can get clusters of near duplicates using fuzzy matching and soundex in R. However, several of the names are synonyms of each other, and I would like to cluster the names on this criterion as well as the ones above.

I want to do what is suggested in "Techniques for finding near duplicate records", but with synonyms. I understand there is a database of synonyms for English words called WordNet, which groups synonyms into sets called synsets. However, the entries in my name field are in different formats and languages.

For example, suppose I know that "R version 3.0.3" and "Warm Puppy" are synonyms. I want to be able to use custom synsets such as syn1 <- c("R version 3.0.3", "Warm Puppy") when clustering near duplicates.

Down the road, I would also like to separate homonyms within clusters based on entries in other fields (columns) of a record.
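To make the homonym idea concrete, here is a minimal sketch in base R. It assumes a name-based cluster id already exists and that one other column is enough to tell homonyms apart. The data frame `records`, the columns `field`, `cluster` and `subcluster`, and the sample values are all hypothetical.

```r
# Hypothetical records: the same name refers to two different things.
records <- data.frame(
  name  = c("Mercury", "Mercury", "mercury", "Mercury"),
  field = c("planet", "element", "planet", "element"),
  stringsAsFactors = FALSE
)

# Assumption: a name-based clustering step has already put all four
# rows into one cluster.
records$cluster <- 1L

# Split each name cluster by the value of the other column, so that
# rows sharing a name but differing in `field` get different ids.
records$subcluster <- as.integer(
  interaction(records$cluster, records$field, drop = TRUE)
)

print(records)
```

With several distinguishing columns, they could all be passed to `interaction()` in the same way.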

Is there any method to implement this in R?


Answer 1:


Crops, this is not an answer, but it might help you or others who do answer.

First, as I assume you know, the tm package allows custom stop words, but I can't recall it supporting a custom vector of synonyms as in your Warm Puppy example. That would be very useful.
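In the meantime, a custom synonym vector can be approximated in base R without any text-mining package. The sketch below is only one possible approach, not a feature of tm: it computes edit distances with `adist()`, forces the distance between members of the same synset to zero, and then clusters with single linkage. The sample names, the `synsets` list and the cut height of 2 edits are assumptions made for illustration.

```r
# Hypothetical name entries; "Warm Pupy" is a misspelling.
names_raw <- c("Warm Puppy", "R version 3.0.3", "Warm Pupy",
               "Spring Dance", "Unrelated entry")

# Custom synsets: each vector lists names treated as equivalent.
synsets <- list(c("R version 3.0.3", "Warm Puppy"))

# Edit distance between every pair of names (utils::adist, base R).
d <- adist(names_raw)
dimnames(d) <- list(names_raw, names_raw)

# Override the string distance: names in the same synset are at distance 0.
for (s in synsets) {
  members <- intersect(s, names_raw)
  d[members, members] <- 0
}

# Single linkage lets a misspelling join a synonym group through the
# one name it resembles.
hc <- hclust(as.dist(d), method = "single")

# Cut at an arbitrary threshold of 2 edits (an assumption; tune for real data).
clusters <- cutree(hc, h = 2)
print(split(names_raw, clusters))
```

This assumes the names are unique (deduplicate exact matches first). A phonetic distance such as soundex could replace or be combined with the edit distance before the synset override is applied.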

Second, Tyler Rinker's qdap package has lots of capabilities and might have (or he might create) such a synonym capability.

Third, the RTextTools package amalgamates many packages and functions. The team behind it may be able to help.

A synonym-vector capability would be very useful for my own work too. Good luck, and I will check back.



Source: https://stackoverflow.com/questions/22403642/identifying-near-duplicate-entries-using-synonyms-in-r
