How to detect duplicates among text documents and return the duplicates' similarity?

爱一瞬间的悲伤 2020-12-02 01:16

I'm writing a crawler to get content from some websites, but the content can be duplicated, and I want to avoid that. So I need a function that can return the percentage of similarity between two

3 answers
  •  感情败类
    2020-12-02 01:36

    A good approach for comparing two texts is tf-idf weighting combined with cosine similarity. It gives a similarity score between two documents.

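    1. Calculate the tf-idf vector for each document.
    2. Calculate the cosine similarity between the two resulting vectors.
    3. The cosine similarity indicates how closely the two documents match: values near 1 mean near-duplicates, values near 0 mean little shared vocabulary.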
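
The three steps can be sketched in a few lines of Python using only the standard library. This is a minimal illustration, not the linked tutorial's code: the function names (`tokenize`, `tfidf_vectors`, `cosine_similarity`, `similarity_percent`) are made up for this example, the tokenizer is a naive ASCII word splitter, and the idf term uses a smoothed form, log((1 + N) / (1 + df)) + 1, which is an assumption chosen so that words shared by both documents do not get a weight of zero when only two documents are compared.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into ASCII alphanumeric tokens (naive)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(documents):
    """Step 1: return one {term: weight} dict per document."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: the number of documents each term appears in.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid dividing by zero for empty text
        vectors.append({
            # Term frequency times smoothed idf (assumed variant, see lead-in).
            term: (count / total)
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(vec_a, vec_b):
    """Step 2: cosine of the angle between two sparse vectors."""
    dot = sum(weight * vec_b.get(term, 0.0) for term, weight in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document matches nothing
    return dot / (norm_a * norm_b)


def similarity_percent(text_a, text_b, corpus=()):
    """Step 3: similarity of two texts as a percentage (0 to 100).

    `corpus` is an optional iterable of other crawled texts; including
    them makes the idf weights more meaningful.
    """
    vectors = tfidf_vectors([text_a, text_b, *corpus])
    return 100.0 * cosine_similarity(vectors[0], vectors[1])


if __name__ == "__main__":
    page_a = "The quick brown fox jumps over the lazy dog"
    page_b = "The quick brown fox leaps over the lazy dog"
    print(f"{similarity_percent(page_a, page_b):.1f}% similar")
```

In a crawler, a new page would be compared against previously stored pages and skipped when the percentage exceeds some threshold. The right threshold depends on the content and has to be tuned by hand.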
    

    This is a very good tutorial for calculating tf-idf and cosine similarity in Java. It would be simple to extend it to C#.
