I have a rather big dataset of N documents, with fewer than 1% of them being near-duplicates, which I want to identify. I have many numeric fields and a few text fields. I consider two documents in the data set close if...
- all but one, two or three data fields are fully identical.
- corresponding text fields of the two documents are only a few edits away (that's the Levenshtein distance used by ElasticSearch); see the sketch below.
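To make this closeness criterion concrete, here is a minimal sketch of the check I have in mind. The field lists and the thresholds (`max_field_diffs`, `max_edits`) are placeholders, and the edit distance is a plain textbook implementation rather than whatever ElasticSearch uses internally:

```python
def levenshtein(a: str, b: str) -> int:
    """Plain dynamic-programming edit distance (illustration only)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,                 # deletion
                               current[j - 1] + 1,              # insertion
                               previous[j - 1] + (ca != cb)))   # substitution
        previous = current
    return previous[-1]

def are_near_duplicates(doc1: dict, doc2: dict,
                        data_fields: list, text_fields: list,
                        max_field_diffs: int = 3, max_edits: int = 2) -> bool:
    # Criterion 1: at most `max_field_diffs` of the data fields differ.
    differing = sum(1 for f in data_fields if doc1.get(f) != doc2.get(f))
    if differing > max_field_diffs:
        return False
    # Criterion 2: every text field is only a few edits away.
    return all(levenshtein(doc1.get(f, ""), doc2.get(f, "")) <= max_edits
               for f in text_fields)
```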
How would you approach this challenge of identifying fuzzy duplicates with ElasticSearch?
I already struggle to write a (general) ElasticSearch query for the first criterion that does not explicitly list the field names. Do I really have to build a huge query of the following pattern, or is there a smarter way?
-- one block per field that is allowed to differ (F1 in the first block, F2 in the second, ...)
( SELECT * FROM MessyData AS T1
  JOIN MessyData AS T2
    ON T1.F1 != T2.F1 AND T1.F2 = T2.F2 AND T1.F3 = T2.F3 AND ... )
UNION ALL
( SELECT * FROM MessyData AS T1
  JOIN MessyData AS T2
    ON T1.F1 = T2.F1 AND T1.F2 != T2.F2 AND T1.F3 = T2.F3 AND ... )
UNION ALL
( SELECT * FROM MessyData AS T1
  JOIN MessyData AS T2
    ON T1.F1 = T2.F1 AND T1.F2 = T2.F2 AND T1.F3 != T2.F3 AND ... )
UNION ALL
( ... )
Note: I used SQL pseudocode to show what I mean for the case where all fields except one are identical. F stands for field and T for table, although in ElasticSearch it would be an index.
Calculating dendrograms or using another similarity measure that compares each document with every other one results in a computational effort of N·(N-1) and is therefore not feasible.
The approach I am considering for the second part of the problem is to probe my data set with m test documents (where m is much smaller than N) and to add up ElasticSearch's scores over all m queries. That would give me a computational effort of O(m·N), but I would still have to sort all N score sums, at least partially or on the fly.
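A rough sketch of that probing idea, assuming the Python ElasticSearch client (8.x-style search call), a hypothetical index name `messydata`, and More Like This parameters that would certainly need tuning:

```python
from collections import defaultdict
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

def accumulate_scores(probe_ids, text_fields, index="messydata", top_k=100):
    """Run one More Like This query per probe document and sum the scores."""
    score_sums = defaultdict(float)
    for probe_id in probe_ids:                       # the m probe documents
        resp = es.search(
            index=index,
            query={
                "more_like_this": {
                    "fields": text_fields,
                    "like": [{"_index": index, "_id": probe_id}],
                    "min_term_freq": 1,
                    "min_doc_freq": 1,
                }
            },
            size=top_k,
        )
        for hit in resp["hits"]["hits"]:
            score_sums[hit["_id"]] += hit["_score"]
    # In the end only a (partial) sort of the accumulated sums is needed.
    return sorted(score_sums.items(), key=lambda kv: kv[1], reverse=True)
```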
Are there algorithms other than More Like This or Fuzzy Query for this problem? Links to scientific papers are also appreciated!
References
- https://en.wikipedia.org/wiki/Data_deduplication just as an introduction
- https://discuss.elastic.co/t/finding-documents--almost--the-same/66089/2
- https://discuss.elastic.co/t/using-fuzzy-query-to-find-near-duplicates/39075 - a question on the forum without any answer
- https://www.compose.com/articles/how-scoring-works-in-elasticsearch/
- https://betterexplained.com/articles/sorting-algorithms/ for the asymptotic order of the different standard sorting algorithms
I would suggest the quick and dirty approach of dividing your fields into 4 groups and computing a hash of each group. Unless two documents have identical hashes for at least one of these four groups, they cannot be near-duplicates (if at most three fields differ, at least one of the four groups must be completely untouched).
With luck, this trick means that you only need to compare any given document against a relatively small number of others that are an exact match on a quarter of the fields.
If a clump of "matches on the same hash" is too big, you can repeat the trick on the fields that were not part of that group's hash, in the hope of reducing how much work needs to be done.
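A minimal sketch of that blocking trick; the choice of four groups, the round-robin assignment of fields to groups, and the use of Python's built-in hash are all just assumptions for illustration:

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs(docs, fields, num_groups=4):
    """docs: {doc_id: {field: value}}. Returns pairs that collide on at least
    one group hash; only these need the expensive pairwise comparison."""
    groups = [fields[i::num_groups] for i in range(num_groups)]   # round-robin split
    buckets = defaultdict(list)
    for doc_id, doc in docs.items():
        for g, group in enumerate(groups):
            key = (g, hash(tuple(doc.get(f) for f in group)))
            buckets[key].append(doc_id)
    pairs = set()
    for ids in buckets.values():
        # A very large bucket is the "clump" case above: repeat the trick
        # on the remaining fields before comparing everything pairwise.
        for a, b in combinations(sorted(ids), 2):
            pairs.add((a, b))
    return pairs
```

Each document then only has to be compared in full against the members of its buckets rather than against all N-1 others.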
Source: https://stackoverflow.com/questions/57146791/fuzzy-duplicate-search-with-elasticsearch