Identifying near duplicate entries using synonyms in R

Question

I am trying to identify near-duplicate entries of names in a database. I am new to databases, but I am familiar with R. I can get clusters of near duplicates using fuzzy matching and Soundex in R. However, several names are synonyms of each other, and I would like to cluster the names on this criterion as well as the ones above.

I want to do what is suggested in "Techniques for finding near duplicate records", but with synonyms. I understand there is a database of synonyms for English words called WordNet, which groups synonyms into sets called synsets. However, the entries in my name field are in different formats and languages.

For example, if I know that "R version 3.0.3" and "Warm Puppy" are synonyms, I want to be able to use custom synsets such as syn1 <- c("R version 3.0.3", "Warm Puppy") when clustering near duplicates.
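One way this could be approached in base R is sketched below. It is only an illustration under stated assumptions, not an established method: the sample names, the synsets list, and the cut height of 2 are all hypothetical. The idea is to compute pairwise edit distances with adist(), force the distance between members of the same custom synset to zero, and then cluster with single linkage so that a misspelling of one synonym ends up in the same cluster as the other synonym.

```r
# Hypothetical sample names; the third is a misspelling of the second.
names_raw <- c("R version 3.0.3", "Warm Puppy", "warm pupy", "Frisbee Sailing")

# Hypothetical custom synsets: each element lists names that mean the same thing.
synsets <- list(syn1 = c("R version 3.0.3", "Warm Puppy"))

# Pairwise Levenshtein distances, ignoring case.
d <- adist(tolower(names_raw))
dimnames(d) <- list(names_raw, names_raw)

# Force a distance of 0 between members of the same synset.
for (s in synsets) {
  idx <- which(names_raw %in% s)
  d[idx, idx] <- 0
}

# Single linkage lets a chain of matches (synonym -> misspelling) join one cluster.
hc <- hclust(as.dist(d), method = "single")

# The cut height of 2 edits is an arbitrary threshold chosen for this example.
clusters <- cutree(hc, h = 2)
print(split(names(clusters), clusters))
```

A phonetic code such as Soundex could be folded in the same way, by zeroing the distance between names that share a code before clustering.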

Down the road, I would also like to separate homonyms within a cluster based on the entries in other fields (columns) of a record.
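A minimal sketch of that homonym split, again in base R and under assumptions: the records data frame, its category column, and the lower-casing used as a stand-in for a real name-clustering step are all hypothetical. Records that share a name cluster are split further wherever the second field disagrees.

```r
# Hypothetical records: the same name refers to two different things.
records <- data.frame(
  name = c("Mercury", "mercury", "Mercury"),
  category = c("planet", "planet", "element"),
  stringsAsFactors = FALSE
)

# Stand-in for a real near-duplicate clustering step: group by lower-cased name.
records$name_cluster <- as.integer(factor(tolower(records$name)))

# Refine each name cluster by the other field, so homonyms fall into separate groups.
records$refined_cluster <- as.integer(
  interaction(records$name_cluster, records$category, drop = TRUE)
)

print(records)
```

With more than one disambiguating column, the extra columns could be passed to interaction() in the same call.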

Is there any method to implement this in R?

Answer 1:

Crops, this is not an answer, but it might help you or others who answer.

As I assume you know, the tm package allows custom stop words, but I can't recall it supporting a custom vector of synonyms as in your Warm Puppy example. That would be very useful.

Second, Tyler Rinker's qdap package has lots of capabilities and might have (or he might create) such a synonym capability.

Third, the RTextTools package amalgamates many packages and functions. The team behind it may help.

It would be very useful to have a synonym-vector capability for what I do. Good luck and I will check back.

Source: https://stackoverflow.com/questions/22403642/identifying-near-duplicate-entries-using-synonyms-in-r

Tags

duplicate-removal

synonym

duplicates