How do search engines find relevant content?

无人共我 2021-01-29 20:03

How does Google find relevant content when it's parsing the web?

Let's say, for instance, Google uses the PHP native DOM Library to parse content. What methods would t

13 Answers

渐次进展 (OP)

2021-01-29 20:28
There are some good answers on here, but it sounds like they don't answer your question. Perhaps this one will.

What you're looking for is called Information Retrieval.

It usually uses the bag-of-words model, which treats each document as an unordered collection of its words.

Say you have two documents:
```
DOCUMENT A  
Seize the time, Meribor. Live now; make now always the most precious time. Now will never come again
```
and this one
```
DOCUMENT B  
Worf, it was what it was glorious and wonderful and all that, but it doesn't mean anything
```
and you have a query, or something you want to find other relevant documents for
```
QUERY aka DOCUMENT C
precious wonderful life
```
So how do you calculate which of the two documents is more "relevant"? Here's how:
1. Tokenize each document (break it into words, removing all non-letters).
2. Lowercase everything.
3. Remove stopwords ("and", "the", etc.).
4. Consider stemming (removing suffixes; see the Porter or Snowball stemming algorithms).
5. Consider using n-grams.

You can then count word frequencies to get the "keywords".
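The steps above can be sketched in a few lines of Python. This is only an illustration: the `STOPWORDS` set is a tiny made-up list, the `preprocess` and `ngrams` helpers are hypothetical names, and stemming is left out because it normally comes from a library.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"a", "and", "the", "it", "was", "what", "that", "but", "all"}

def preprocess(text):
    """Tokenize, lowercase, and remove stopwords (steps 1 to 3)."""
    # Lowercase first, then keep only runs of letters as tokens.
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

def ngrams(tokens, n):
    """Return word n-grams as space-joined strings (step 5)."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

doc_a = ("Seize the time, Meribor. Live now; make now always the most "
         "precious time. Now will never come again")

# Word frequencies give a rough list of "keywords" for the document.
counts = Counter(preprocess(doc_a))
print(counts.most_common(3))
print(ngrams(preprocess("precious wonderful life"), 2))
```

A stemmer (step 4) would be applied to each token before counting, so that different forms of a word share one entry.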

Then make one column for each word and weight it by how often the word occurs in the document, discounted by how common the word is across all the documents. This is called the TF-IDF metric (term frequency times inverse document frequency).
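As a rough sketch of that weighting, the snippet below uses one common textbook variant: term frequency normalised by document length, multiplied by the natural log of (number of documents / number of documents containing the term). Other variants exist, and the token lists here are shortened, hand-made stand-ins for the three documents, so the numbers will not match any particular tool.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs maps a name to a list of tokens; returns name -> {term: weight}."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))
    weights = {}
    for name, tokens in docs.items():
        counts = Counter(tokens)
        weights[name] = {
            # tf = count / document length, idf = ln(N / df)
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        }
    return weights

# Hand-made token lists (assumed to be already tokenized and filtered).
docs = {
    "A": ["seize", "time", "precious", "time"],
    "B": ["worf", "glorious", "wonderful"],
    "C": ["precious", "wonderful", "life"],
}
w = tf_idf(docs)
for name, row in w.items():
    print(name, {term: round(value, 3) for term, value in row.items()})
```

With this formula, a word that appears in every document gets weight 0, which is why very common words contribute little to relevance.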

Now you have this:
```
Doc  precious  worf  life  ...
A    0.5       0.0   0.2
B    0.0       0.9   0.0
C    0.7       0.0   0.9
```
Then, you calculate the similarity between the documents, using the Cosine Similarity measure. The document with the highest similarity to DOCUMENT C is the most relevant.
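To make the comparison concrete, here is a minimal sketch that computes cosine similarity on the three columns shown in the table (precious, worf, life). The elided columns are ignored, so this is an illustration rather than a full calculation, and the `cosine` helper is a hypothetical name.

```python
import math

def cosine(u, v):
    """Cosine similarity: dot(u, v) / (|u| * |v|), or 0.0 for a zero vector."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Columns: precious, worf, life (values taken from the table above).
vectors = {
    "A": [0.5, 0.0, 0.2],
    "B": [0.0, 0.9, 0.0],
    "C": [0.7, 0.0, 0.9],  # the query
}

scores = {name: cosine(vec, vectors["C"])
          for name, vec in vectors.items() if name != "C"}
best = max(scores, key=scores.get)
print(scores, best)
```

On these three columns, DOCUMENT B shares no non-zero term with the query, so its similarity is 0 and DOCUMENT A ranks first.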

Now, you seem to want to find the most similar paragraphs, so just treat each paragraph as a document, or consider using sliding windows over the document instead.
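One way to read "sliding windows" is as fixed-size, overlapping runs of tokens, each of which is then scored like a small document. The sketch below assumes that interpretation; the `sliding_windows` helper and its `size` and `step` parameters are hypothetical.

```python
def sliding_windows(tokens, size, step=1):
    """Return overlapping windows of `size` tokens, advancing `step` each time.

    A text shorter than `size` yields a single window with all its tokens.
    Trailing tokens that do not fill a full step are dropped in this sketch.
    """
    if size <= 0 or step <= 0:
        raise ValueError("size and step must be positive")
    last_start = max(len(tokens) - size, 0)
    return [tokens[i:i + size] for i in range(0, last_start + 1, step)]

tokens = "seize the time meribor live now".split()
for window in sliding_windows(tokens, size=3, step=1):
    print(window)
```

Each window can then be turned into a weighted vector and compared with the query in the same way as a whole document.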

You can see my video here. It uses a graphical Java tool, but explains the concepts:

http://vancouverdata.blogspot.com/2010/11/text-analytics-with-rapidminer-part-4.html

Here is a decent IR book:

http://nlp.stanford.edu/IR-book/pdf/irbookonlinereading.pdf