How do search engines find relevant content?

无人共我 2021-01-29 20:03

How does Google find relevant content when it's parsing the web?

Let's say, for instance, Google uses the PHP native DOM Library to parse content. What methods would t

13 answers
  •  渐次进展
    2021-01-29 20:28

    There are some good answers on here, but it sounds like they don't answer your question. Perhaps this one will.

    What you're looking for is called Information Retrieval.

    It usually uses the Bag of Words model.

    Say you have two documents:

    DOCUMENT A  
    Seize the time, Meribor. Live now; make now always the most precious time. Now will never come again
    

    and this one

    DOCUMENT B  
    Worf, it was what it was glorious and wonderful and all that, but it doesn't mean anything
    

    and you have a query, or something you want to find other relevant documents for

    QUERY aka DOCUMENT C
    precious wonderful life
    

    Anyways, how do you calculate the most "relevant" of the two documents? Here's how:

    1. tokenize each document (break it into words, removing all non-letters)
    2. lowercase everything
    3. remove stopwords (and, the, etc.)
    4. consider stemming (removing the suffix; see the Porter or Snowball stemming algorithms)
    5. consider using n-grams
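
    The steps above can be sketched in a few lines of Python. This is only a minimal illustration: the `preprocess` helper and the tiny `STOPWORDS` set are made up for this example (real stopword lists are much longer), and stemming is left out because it needs a proper stemmer implementation.

```python
import re

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"a", "all", "and", "but", "it", "that", "the", "was", "what", "will"}

def preprocess(text, n=1):
    """Tokenize, lowercase, drop stopwords, and build n-grams of size n."""
    # Steps 1 and 2: lowercase, then keep only runs of letters.
    tokens = re.findall(r"[a-z]+", text.lower())
    # Step 3: remove stopwords.
    tokens = [t for t in tokens if t not in STOPWORDS]
    # Step 4 (stemming) is skipped here; a Porter or Snowball stemmer
    # from an NLP library would be applied to each token at this point.
    # Step 5: join each run of n consecutive tokens into one n-gram.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

doc_a = ("Seize the time, Meribor. Live now; make now always "
         "the most precious time. Now will never come again")
unigrams = preprocess(doc_a)       # single words
bigrams = preprocess(doc_a, n=2)   # pairs of adjacent words
```

    With `n=1` the function returns plain words; with `n=2` it returns pairs such as "seize time", which keep a little of the word order that a pure bag of words throws away.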

    You can then count word frequencies to get the "keywords".

    Then, you make one column for each word, and calculate the word's importance to the document, with respect to its importance in all the documents. This is called the TF-IDF metric.
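
    As a rough sketch of that calculation, here is one common TF-IDF variant: term frequency is the count divided by the document length, and inverse document frequency is log(N / df). The `tf_idf` function and the shortened token lists are assumptions for illustration, and other weighting variants exist.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs maps a name to a list of tokens; returns name -> {term: weight}."""
    n_docs = len(docs)
    # Document frequency: how many documents contain each term at least once.
    df = Counter(term for tokens in docs.values() for term in set(tokens))
    weights = {}
    for name, tokens in docs.items():
        counts = Counter(tokens)
        weights[name] = {
            # tf = count / document length, idf = log(N / df)
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        }
    return weights

# Hand-shortened token lists, assumed to be already preprocessed.
docs = {
    "A": ["seize", "time", "live", "now", "precious", "time", "now"],
    "B": ["worf", "glorious", "wonderful", "mean", "anything"],
    "C": ["precious", "wonderful", "life"],
}
weights = tf_idf(docs)
```

    A word that appears often in one document but in few others gets a high weight. A word that appears in every document gets log(1) = 0, so it contributes nothing.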

    Now you have this:

    Doc precious worf life...
    A   0.5      0.0  0.2 
    B   0.0      0.9  0.0
    C   0.7      0.0  0.9
    

    Then, you calculate the similarity between the documents, using the Cosine Similarity measure. The document with the highest similarity to DOCUMENT C is the most relevant.
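
    Here is a small sketch of that comparison, using only the three columns shown in the table above (precious, worf, life). The `cosine_similarity` helper is written out by hand for illustration; the numbers are the made-up weights from the table, not real TF-IDF output.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an all-zero vector is treated as similar to nothing
    return dot / (norm_u * norm_v)

# Columns: precious, worf, life
vectors = {
    "A": [0.5, 0.0, 0.2],
    "B": [0.0, 0.9, 0.0],
}
query = [0.7, 0.0, 0.9]  # DOCUMENT C

scores = {name: cosine_similarity(vec, query) for name, vec in vectors.items()}
best = max(scores, key=scores.get)
```

    On these three columns, DOCUMENT A scores about 0.86 and DOCUMENT B scores 0.0 because it shares no weighted terms with the query, so A comes out as the most relevant.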

    Now, you seem to want to find the most similar paragraphs, so just treat each paragraph as a document, or consider using Sliding Windows over the document instead.
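
    One possible reading of the sliding-window idea is sketched below: cut the token list into overlapping chunks and score each chunk as if it were its own document. The `sliding_windows` helper and the `size` and `step` values are assumptions chosen for the example.

```python
def sliding_windows(tokens, size, step):
    """Return overlapping windows of up to `size` tokens, moving `step` each time."""
    if size <= 0 or step <= 0:
        raise ValueError("size and step must be positive")
    if not tokens:
        return []
    windows = []
    start = 0
    while True:
        windows.append(tokens[start:start + size])
        # Stop once a window has reached the end of the token list;
        # the final window may be shorter than `size`.
        if start + size >= len(tokens):
            break
        start += step
    return windows

tokens = "seize the time meribor live now make now always".split()
windows = sliding_windows(tokens, size=4, step=2)
```

    Each window can then be turned into a TF-IDF vector and compared with the query, in the same way as a whole document.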

    You can see my video here. It uses a graphical Java tool, but explains the concepts:

    http://vancouverdata.blogspot.com/2010/11/text-analytics-with-rapidminer-part-4.html

    Here is a decent IR book:

    http://nlp.stanford.edu/IR-book/pdf/irbookonlinereading.pdf
