Web crawling evaluation?

Posted by ≯℡__Kan透↙ on 2019-12-25 17:07:46

Question


In focused web crawling (a.k.a. topical web crawling), I have seen the evaluation metric "harvest ratio" defined as follows: after crawling t pages, harvest ratio = number_of_relevant_pages / pages_crawled(t).

So, for example, if I get 80 true positives after crawling 100 pages, the harvest ratio of the crawler at that point is 0.8. But the crawler might have skipped some pages that are entirely relevant to the crawling domain, and those pages are not accounted for in the ratio. What is this? Can we improve the evaluation metric to include the missed pages that are relevant? Is this consideration important?
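To make the arithmetic concrete, here is a minimal sketch of the harvest ratio as defined above. The function name `harvest_ratio` and the list of boolean relevance flags are assumptions made for illustration; how a page is judged relevant is left out entirely.

```python
def harvest_ratio(relevance_flags):
    """Return the harvest ratio after each crawled page.

    relevance_flags: booleans in crawl order, True when the page was
    judged relevant (the judging step itself is assumed, not shown).
    """
    ratios = []
    relevant_so_far = 0
    for pages_crawled, is_relevant in enumerate(relevance_flags, start=1):
        relevant_so_far += bool(is_relevant)
        ratios.append(relevant_so_far / pages_crawled)
    return ratios


# 100 crawled pages, 80 of them relevant (the order is made up).
flags = [True] * 80 + [False] * 20
final_ratio = harvest_ratio(flags)[-1]
print(final_ratio)
```

Note that the denominator only counts pages the crawler actually fetched, so relevant pages it never visited cannot change this number.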


Answer 1:


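The most basic evaluation for a focused crawl is precision and recall, which can be combined into the F-measure. Harvest ratio corresponds to precision; recall is the metric that accounts for relevant pages the crawler missed.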

http://en.wikipedia.org/wiki/Precision_and_recall

http://en.wikipedia.org/wiki/F1_score
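As a rough sketch of how these metrics fit a crawl, the snippet below computes precision, recall, and F1 from two sets of page identifiers. The function name `precision_recall_f1` and the sample sets are hypothetical, and it assumes the full set of relevant pages is known, which in practice usually has to be estimated from a labeled sample.

```python
def precision_recall_f1(crawled, relevant):
    """Compute precision, recall and F1 for a crawl.

    crawled:  identifiers of pages the crawler fetched
    relevant: identifiers of all relevant pages (assumed known here)
    """
    crawled, relevant = set(crawled), set(relevant)
    true_positives = len(crawled & relevant)
    precision = true_positives / len(crawled) if crawled else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical crawl: 5 pages fetched, 4 of them relevant,
# while 4 other relevant pages were never visited.
crawled_pages = {"a", "b", "c", "d", "e"}
relevant_pages = {"a", "b", "c", "d", "w", "x", "y", "z"}
precision, recall, f1 = precision_recall_f1(crawled_pages, relevant_pages)
print(precision, recall, f1)
```

Here precision matches the harvest ratio of the crawl, while recall drops because of the relevant pages that were missed.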

If you are more interested in how relevant a page is to a specific keyword, you can use tf-idf (term frequency–inverse document frequency):

http://en.wikipedia.org/wiki/Tf*idf
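For illustration, here is a minimal tf-idf sketch using one common variant: term frequency as count divided by document length, and inverse document frequency as the natural log of corpus size over the number of documents containing the term. The function name `tf_idf` and the tiny corpus are made up for the example; other weighting and smoothing schemes exist.

```python
import math
from collections import Counter


def tf_idf(term, document, corpus):
    """Score a term for one document.

    document: list of tokens
    corpus:   list of token lists (assumed to include the document)
    """
    counts = Counter(document)
    tf = counts[term] / len(document) if document else 0.0
    docs_with_term = sum(1 for doc in corpus if term in doc)
    if docs_with_term == 0:
        return 0.0
    idf = math.log(len(corpus) / docs_with_term)
    return tf * idf


# Tiny made-up corpus of already-tokenized pages.
corpus = [
    "focused web crawler harvest ratio".split(),
    "web page ranking".split(),
    "cooking recipes".split(),
]
score_crawler = tf_idf("crawler", corpus[0], corpus)
score_web = tf_idf("web", corpus[0], corpus)
print(score_crawler, score_web)
```

In this corpus "crawler" scores higher than "web" for the first page, because "web" also appears in another document and is therefore less distinctive.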



Source: https://stackoverflow.com/questions/11184726/web-crawling-evaluation
