How does Google find relevant content when it's parsing the web?
Let's say, for instance, Google uses the PHP native DOM Library to parse content. What methods would t
There are some good answers on here, but it sounds like they don't answer your question. Perhaps this one will.
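What you're looking for is called Information Retrieval.
It usually uses the Bag of Words model, which treats each document as an unordered collection of words and ignores grammar and word order.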
Say you have two documents:
DOCUMENT A
Seize the time, Meribor. Live now; make now always the most precious time. Now will never come again
and this one
DOCUMENT B
Worf, it was what it was glorious and wonderful and all that, but it doesn't mean anything
and you have a query, or something you want to find other relevant documents for
QUERY aka DOCUMENT C
precious wonderful life
So how do you work out which of the two documents is the most "relevant"? Here's how:
First, count how often each word occurs in each document to get the "keywords".
Then, make one column for each word and calculate the word's importance to the document relative to its importance across all the documents. This is called the TF-IDF metric (term frequency times inverse document frequency): a word scores high when it is frequent in one document but rare in the others.
Now you have this:
Doc  precious  worf  life  ...
A    0.5       0.0   0.2
B    0.0       0.9   0.0
C    0.7       0.0   0.9
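To make the weighting step concrete, here is a minimal sketch in plain Python. The helper names `tokenize` and `tf_idf` are made up for this example, and it assumes the simplest variant of the formula (term frequency divided by document length, times the log of N over document frequency), with no stemming or stop-word removal. Real systems usually add smoothing and normalization, so it will not reproduce the numbers in the table exactly.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (hypothetical helper)."""
    return re.findall(r"[a-z']+", text.lower())


def tf_idf(docs):
    """Return one {word: weight} dict per document.

    Assumed variant:
      tf  = occurrences of the word in the document / document length
      idf = log(number of documents / documents containing the word)
    """
    counts = [Counter(tokenize(d)) for d in docs]
    n_docs = len(docs)

    # Document frequency: in how many documents does each word appear?
    doc_freq = Counter()
    for c in counts:
        doc_freq.update(c.keys())

    weights = []
    for c in counts:
        total = sum(c.values())
        weights.append(
            {w: (n / total) * math.log(n_docs / doc_freq[w]) for w, n in c.items()}
        )
    return weights


doc_a = ("Seize the time, Meribor. Live now; make now always the most "
         "precious time. Now will never come again")
doc_b = ("Worf, it was what it was glorious and wonderful and all that, "
         "but it doesn't mean anything")
query = "precious wonderful life"

a, b, c = tf_idf([doc_a, doc_b, query])
for name, row in (("A", a), ("B", b), ("C", c)):
    # Words missing from a document simply get weight 0.0.
    print(name, [round(row.get(w, 0.0), 3) for w in ("precious", "worf", "life")])
```

Note that without stemming, "Live" in document A and "life" in the query count as different words, so A gets no weight in the "life" column here.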
Then, you calculate the similarity between each document's row and the query's row, using the Cosine Similarity measure (the cosine of the angle between the two vectors). The document with the highest similarity to DOCUMENT C is the most relevant.
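As a rough sketch of that ranking step, the snippet below computes cosine similarity over the three columns shown in the table. The function name `cosine_similarity` and the dict-based vector layout are choices made for this example, and the columns hidden behind the "..." are simply ignored.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors stored as {word: weight}."""
    dot = sum(weight * v.get(word, 0.0) for word, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an all-zero vector is treated as similar to nothing
    return dot / (norm_u * norm_v)


# Rows copied from the table (only the three visible columns).
docs = {
    "A": {"precious": 0.5, "worf": 0.0, "life": 0.2},
    "B": {"precious": 0.0, "worf": 0.9, "life": 0.0},
}
query = {"precious": 0.7, "worf": 0.0, "life": 0.9}

scores = {name: cosine_similarity(row, query) for name, row in docs.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
print(scores)
print("most relevant:", ranked[0])
```

With these numbers document A scores roughly 0.86 and document B scores 0, so A is ranked as the most relevant to the query.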
Now, you seem to want to find the most similar paragraphs, so just treat each paragraph as a document, or consider using Sliding Windows over the document instead.
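Both options come down to cutting the text into smaller "documents" before weighting and comparing them. A minimal sketch, assuming paragraphs are separated by blank lines and windows are measured in words; the helper names `paragraphs` and `sliding_windows` and the window sizes are hypothetical:

```python
import re


def paragraphs(text):
    """Split text on blank lines and drop empty pieces."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def sliding_windows(words, size, step):
    """Return overlapping windows of `size` words, moving `step` words each time.

    Trailing words that do not fit a full step are left out of the last window.
    """
    if size <= 0 or step <= 0:
        raise ValueError("size and step must be positive")
    if not words:
        return []
    last_start = max(len(words) - size, 0)
    return [words[start:start + size] for start in range(0, last_start + 1, step)]


text = "first paragraph here\n\nsecond paragraph here\n\n\nthird one"
print(paragraphs(text))

words = "seize the time live now make now always the most".split()
for window in sliding_windows(words, size=4, step=2):
    print(" ".join(window))
```

Each paragraph or window can then be scored against the query exactly like a whole document. Overlapping windows help when a relevant passage straddles a paragraph boundary.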
You can see my video here. It uses a graphical Java tool, but explains the concepts:
http://vancouverdata.blogspot.com/2010/11/text-analytics-with-rapidminer-part-4.html
Here is a decent IR book:
http://nlp.stanford.edu/IR-book/pdf/irbookonlinereading.pdf