I had an interview last week and got stuck on one of the questions in the algorithm round. I answered it, but the interviewer did not seem convinced. That's why I am sharing it here.
For designing a really capable, scalable document-similarity system, I'd suggest reading Chapter 3 of Mining of Massive Datasets, which is freely available online. One approach presented there is to 'shingle' each document into a set of overlapping k-grams, compress each set into a short MinHash signature, and then compare signatures to estimate the Jaccard similarity between all pairs of documents (with locality-sensitive hashing used to prune the candidate pairs). Done right, this scales to petabytes of files with high precision. Explicit details with good diagrams are in Stanford's CS246 slides on Locality Sensitive Hashing. Simpler approaches like word-frequency counting are described in the book as well.
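To make the shingle-and-MinHash idea concrete, here is a minimal sketch in Python. It shingles documents into overlapping word k-grams and simulates a family of hash functions by salting a single hash; the function names (`shingles`, `minhash_signature`, `estimated_jaccard`) and the salting trick are my own illustrative choices, not the book's exact construction, and a production system would use proper universal hashing and LSH banding on top of this.

```python
import hashlib
import random

def shingles(text, k=3):
    """Break text into the set of overlapping k-word shingles."""
    words = text.split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def minhash_signature(shingle_set, num_hashes=100, seed=42):
    """Compress a shingle set into num_hashes min-hash values.

    Each salt stands in for an independent hash function; the minimum
    hash value over the set is kept for each one.
    """
    rng = random.Random(seed)
    salts = [str(rng.random()) for _ in range(num_hashes)]
    return [
        min(int(hashlib.md5((salt + s).encode()).hexdigest(), 16)
            for s in shingle_set)
        for salt in salts
    ]

def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching signature slots estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

The key property is that the probability two sets share a min-hash value equals their true Jaccard similarity, so comparing two 100-slot signatures stands in for comparing the full shingle sets, at a fraction of the cost.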