If you are trying to detect the documents that are talking about the same topic, you could try collecting the most frequently used words, throw away the stop words . Documents that have a similar distribution of the most frequently used words are probably talking about similar things. You may need to do some stemming and extend the concept to n-grams if you want higher accuracy. For more advanced techniques, look into machine learning.