Fuzzy matching deduplication in less than exponential time?

轻奢々 2020-12-07 23:41

I have a large database (potentially millions of records) of relatively short strings of text (on the scale of street addresses, names, etc.).

I am looking for

6 answers

予麋鹿 (original poster)

2020-12-08 00:04

I think you may have miscalculated the complexity of comparing all the combinations. Because the strings are short, each individual comparison is effectively O(1), so comparing one string against all the others is linear. Comparing every string with every other string is therefore quadratic, not exponential, which is not all that bad. Put simply, you are comparing nC2 = n(n-1)/2 pairs of strings, so it's just O(n^2).
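To make the pair count concrete, here is a minimal brute-force sketch. It assumes Python's standard-library `difflib.SequenceMatcher` as the similarity measure and a threshold of 0.85; the function name `similar_pairs`, the threshold, and the sample records are all illustrative choices, not something taken from the question.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similar_pairs(records, threshold=0.85):
    """Compare every unordered pair once and collect the likely duplicates.

    Returns (matches, comparisons), where matches is a list of
    (i, j, score) tuples and comparisons is the number of pairs examined.
    The threshold is an assumed value chosen for illustration.
    """
    matches = []
    comparisons = 0
    # combinations(..., 2) yields each unordered pair exactly once: n(n-1)/2 pairs.
    for (i, a), (j, b) in combinations(enumerate(records), 2):
        comparisons += 1
        score = SequenceMatcher(None, a, b).ratio()
        if score >= threshold:
            matches.append((i, j, score))
    return matches, comparisons


# Hypothetical sample data.
records = [
    "12 Main Street",
    "12 Main Stret",
    "45 Oak Avenue",
    "45 Oak Avenue",
    "7 Elm Road",
]
matches, comparisons = similar_pairs(records)
print(comparisons)  # 5 records -> 5 * 4 / 2 = 10 comparisons
for i, j, score in matches:
    print(records[i], "|", records[j], "|", round(score, 3))
```

The comparison count grows as n(n-1)/2, so doubling the number of records roughly quadruples the work, which is quadratic growth rather than exponential.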

I couldn't think of a way to sort them, since you can't write an objective comparator for fuzzy matches. Even if you could, sorting would take O(n log n) with merge sort. With so many records you would probably prefer not to use extra memory, so you would use quicksort, which is O(n log n) on average but O(n^2) in the worst case, so its worst case is no improvement over brute force.
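One way to see the comparator problem is a small sketch, assuming ordinary lexicographic ordering and `difflib.SequenceMatcher` as the similarity measure. The names below are made-up sample data. A typo in the first character leaves two strings highly similar, yet sorting places them far apart, so checking only neighbors in sorted order would miss the pair.

```python
from difflib import SequenceMatcher

# Hypothetical sample data: the first and second entries differ by one typo.
names = ["Smith, John", "Zmith, John", "Taylor, Ann", "Brown, Bob"]

ordered = sorted(names)  # plain lexicographic order
pos_a = ordered.index("Smith, John")
pos_b = ordered.index("Zmith, John")

# Similarity of the two near-duplicates, independent of their sort positions.
score = SequenceMatcher(None, "Smith, John", "Zmith, John").ratio()

print(ordered)
print("positions:", pos_a, pos_b)
print("similarity:", round(score, 3))
```

The two strings share ten of eleven characters but end up with another record between them, which is why a lexicographic sort alone does not group fuzzy duplicates.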
