I'm writing a crawler to get content from some websites, but the content can be duplicated, and I want to avoid that. So I need a function that can return the similarity percentage between two
A good approach for comparing two texts is tf-idf combined with cosine similarity. It gives a similarity score between two documents.
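1. Calculate the tf-idf vector for each document.
2. Calculate the cosine similarity between the two vectors.
3. The cosine similarity indicates how closely the two documents match: since tf-idf weights are non-negative, it ranges from 0 (no terms in common) to 1 (same term distribution).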
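The steps above can be sketched roughly as follows. This is a minimal illustration in Python rather than Java or C#, and the names (`tokenize`, `tfidf_vectors`, `cosine_similarity`, `similarity_percent`) are hypothetical, not taken from the tutorial. It assumes a simple lowercase alphanumeric tokenizer and a smoothed idf, `log((1 + N) / (1 + df)) + 1`, because with only two documents the plain `log(N / df)` would zero out every term the two texts share.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Assumption: lowercase alphanumeric runs are good enough as terms.
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(documents):
    """Return one sparse tf-idf vector (dict of term -> weight) per document."""
    token_lists = [tokenize(doc) for doc in documents]
    n_docs = len(token_lists)

    # Document frequency: in how many documents each term appears.
    df = Counter()
    for tokens in token_lists:
        df.update(set(tokens))

    # Smoothed idf (an assumption of this sketch) so shared terms keep a weight.
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1
           for term, count in df.items()}

    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)
        total = len(tokens)
        vectors.append({term: (count / total) * idf[term]
                        for term, count in counts.items()})
    return vectors


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse vectors."""
    dot = sum(weight * vec_b.get(term, 0.0) for term, weight in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty document matches nothing
    return dot / (norm_a * norm_b)


def similarity_percent(text_a, text_b):
    """Similarity of two texts as a percentage between 0 and 100."""
    vec_a, vec_b = tfidf_vectors([text_a, text_b])
    return 100.0 * cosine_similarity(vec_a, vec_b)
```

In a crawler, one might call `similarity_percent(new_page, stored_page)` and skip the new page when the result exceeds some threshold. The threshold has to be tuned on real data. Computing the idf over the whole set of crawled pages instead of just the pair would probably give more meaningful weights.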
This is a very good tutorial for calculating tf-idf and cosine similarity in Java. It would be simple to port it to C#.