Is there an efficient algorithm for fuzzy deduplication of string lists? [duplicate]


挽巷 2021-01-06 11:55

2 answers

夕颜 (original poster)

2021-01-06 12:17

If your measure of similarity is strong (e.g. Levenshtein distance 1), then you can process your string list in order: for the current string, generate every possible "close" string (including the string itself) and look each one up in a hashtable. If any of them is there, skip the current string. If none is, output the current string and add it to the hashtable.

This algorithm depends on being able to generate all strings close to a given string, and on there not being too many of them. (This is what I mean by "strong" above.)
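A minimal Python sketch of this idea for Levenshtein distance 1 follows. The function names `edits1` and `fuzzy_dedup` and the default lowercase alphabet are assumptions made for illustration, not part of the original answer. The alphabet must cover every character that can appear in the input, otherwise some close strings are never generated.

```python
import string


def edits1(s, alphabet=string.ascii_lowercase):
    """Return s itself plus every string within Levenshtein distance 1 of s.

    Assumption: all input characters come from `alphabet`.
    """
    close = {s}
    for i in range(len(s) + 1):
        for c in alphabet:
            close.add(s[:i] + c + s[i:])          # insertion at position i
        if i < len(s):
            close.add(s[:i] + s[i + 1:])          # deletion of s[i]
            for c in alphabet:
                close.add(s[:i] + c + s[i + 1:])  # substitution of s[i]
    return close


def fuzzy_dedup(strings, alphabet=string.ascii_lowercase):
    """Keep a string only if no previously kept string is within distance 1."""
    seen = set()   # the "hashtable" of strings already output
    kept = []
    for s in strings:
        # Skip s if any close string was already output.
        if edits1(s, alphabet).isdisjoint(seen):
            kept.append(s)
            seen.add(s)
    return kept


if __name__ == "__main__":
    words = ["apple", "aple", "apply", "banana", "bananas", "cherry"]
    print(fuzzy_dedup(words))
```

Each lookup costs roughly (2n + 1) times the alphabet size set probes for a string of length n, so this is practical only while the neighborhood stays small. The result also depends on input order, because a skipped string is never added to the table.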

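As a possible optimization, you could store more than just the original strings in the hashtable. For instance, if you wanted Levenshtein distance 3, you could store all strings within distance 1 of your outputted strings in the hashtable, then look up all strings within distance 2 of a new string when checking it. This works because two strings are within distance 3 of each other exactly when some intermediate string is within distance 1 of one and within distance 2 of the other.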
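The split described above (store distance 1, probe distance 2) could be sketched as follows. This is only an illustration under assumptions: the names `edits1`, `within` and `fuzzy_dedup_k3` are hypothetical, and every input character is assumed to come from `alphabet`. The distance-2 neighborhood grows quickly with string length and alphabet size, so this suits short strings.

```python
import string


def edits1(s, alphabet):
    """Return s itself plus every string within Levenshtein distance 1 of s."""
    close = {s}
    for i in range(len(s) + 1):
        for c in alphabet:
            close.add(s[:i] + c + s[i:])          # insertion
        if i < len(s):
            close.add(s[:i] + s[i + 1:])          # deletion
            for c in alphabet:
                close.add(s[:i] + c + s[i + 1:])  # substitution
    return close


def within(s, k, alphabet):
    """Return every string within Levenshtein distance k of s."""
    result = {s}
    frontier = {s}
    for _ in range(k):
        # Expand by one more edit, keeping only newly reached strings.
        frontier = {t for u in frontier for t in edits1(u, alphabet)} - result
        result |= frontier
    return result


def fuzzy_dedup_k3(strings, alphabet=string.ascii_lowercase):
    """Keep a string only if no previously kept string is within distance 3."""
    index = set()  # every string within distance 1 of some kept string
    kept = []
    for s in strings:
        # Probe with the distance-2 neighborhood of s: 1 + 2 covers distance 3.
        if within(s, 2, alphabet).isdisjoint(index):
            kept.append(s)
            index |= within(s, 1, alphabet)
    return kept


if __name__ == "__main__":
    print(fuzzy_dedup_k3(["kitten", "sitting", "mitten", "banana"]))
```

The trade-off is memory for lookup time: the table holds a distance-1 neighborhood per kept string, while each check enumerates a distance-2 neighborhood instead of a much larger distance-3 one.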
