Question
In focused web crawling (a.k.a. topical web crawling), the evaluation metric called the harvest ratio is defined as: after crawling t pages, harvest ratio = number_of_relevant_pages / pages_crawled(t).
So for example, if after crawling 100 pages I get 80 true positives, the harvest ratio of the crawler at that point is 0.8. But the crawler might have skipped pages that are totally relevant to the crawling domain, and these missed pages are not accounted for in this ratio. What is this problem called? Can we improve the evaluation metric to include the missed pages that are totally relevant? Is this consideration important?
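For concreteness, here is a minimal sketch of the metric as defined above (the function name and the 80/100 counts are illustrative only):

    def harvest_ratio(relevant_count: int, crawled_count: int) -> float:
        """Harvest ratio after crawling t pages: relevant / crawled."""
        return relevant_count / crawled_count if crawled_count else 0.0

    # 80 relevant pages out of 100 crawled
    print(harvest_ratio(80, 100))  # 0.8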
Answer 1:
The most basic evaluation for a focused crawl is precision and recall, which can be aggregated into the F-measure.
http://en.wikipedia.org/wiki/Precision_and_recall
http://en.wikipedia.org/wiki/F1_score
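In crawling terms, precision is exactly the harvest ratio above, while recall is what accounts for the relevant pages the crawler skipped. A minimal sketch, assuming hypothetical counts (the 40 false negatives stand in for the missed relevant pages):

    def precision(tp: int, fp: int) -> float:
        # Fraction of crawled pages that are relevant (= harvest ratio)
        return tp / (tp + fp) if tp + fp else 0.0

    def recall(tp: int, fn: int) -> float:
        # Fraction of all relevant pages that were actually crawled
        return tp / (tp + fn) if tp + fn else 0.0

    def f1(p: float, r: float) -> float:
        # Harmonic mean of precision and recall
        return 2 * p * r / (p + r) if p + r else 0.0

    p = precision(80, 20)  # 0.8
    r = recall(80, 40)     # ~0.667: penalizes the 40 missed relevant pages
    print(p, r, f1(p, r))

Note that computing recall requires knowing (or estimating) the total number of relevant pages on the web, which is the hard part for a live crawl.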
If you are more interested in how relevant a page is to a specific keyword, you want to use tf-idf (term frequency–inverse document frequency).
http://en.wikipedia.org/wiki/Tf*idf
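A minimal sketch of one common tf-idf variant (the toy corpus and whitespace tokenization are assumptions for illustration):

    import math

    def tf_idf(term: str, doc: list, corpus: list) -> float:
        """Relative term frequency times log-scaled inverse document frequency."""
        tf = doc.count(term) / len(doc)
        df = sum(1 for d in corpus if term in d)         # docs containing the term
        idf = math.log(len(corpus) / df) if df else 0.0  # rarer terms weigh more
        return tf * idf

    docs = [
        "focused crawler harvest ratio".split(),
        "web crawler frontier".split(),
        "cooking recipes".split(),
    ]
    print(tf_idf("harvest", docs[0], docs))  # 0.25 * ln(3) ≈ 0.275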
Source: https://stackoverflow.com/questions/11184726/web-crawling-evaluation