Generate only unfetched URLs instead of scored ones in Nutch 2.3

Posted by 半世苍凉 on 2019-12-12 03:27:28

Question


Is there a way in Nutch 2.x to generate only the unfetched URLs, rather than selecting URLs by score?


Answer 1:


Well, for Nutch 1.x you could use the JEXL support that has shipped since Nutch 1.12 (I think):

$ bin/nutch generate -expr "status == db_unfetched" 

With this command you ensure that only URLs with a db_unfetched status are considered when generating the segments you want to crawl.

This feature is still not available on the 2.x branch, but writing a custom GeneratorJob could do the trick.

On the other hand, since the GeneratorJob already uses the score to sort the list of URLs to fetch, an easier way might be to write a custom ScoringFilter.

For instance, take a look at https://github.com/apache/nutch/blob/2.x/src/java/org/apache/nutch/scoring/ScoringFilter.java#L69-L81: the ScoringFilter interface already provides a generatorSortValue method whose only purpose is to produce a sort value for the generator job, so you could write your own implementation that boosts URLs with an unfetched status.
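The boosting idea can be sketched in plain Java without any Nutch dependency. Everything in this sketch is a hypothetical stand-in chosen for illustration: the `Page` class, the `Status` enum, the `UNFETCHED_BOOST` constant and the `topN` helper are not Nutch classes, and the simplified `generatorSortValue` here only imitates the role of the method of the same name in ScoringFilter. A real filter would have to implement the actual interface from the 2.x branch and read the status from the stored page.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class UnfetchedBoostSketch {

    // Hypothetical stand-in for the crawl status stored with each page.
    enum Status { UNFETCHED, FETCHED }

    // Hypothetical stand-in for a stored page row (not Nutch's WebPage).
    static final class Page {
        final String url;
        final Status status;
        final float score;

        Page(String url, Status status, float score) {
            this.url = url;
            this.status = status;
            this.score = score;
        }
    }

    // Assumed boost, chosen to be larger than any regular score in this sketch.
    static final float UNFETCHED_BOOST = 1000f;

    // Imitates the role of a generator sort value: unfetched pages get a
    // large additive boost so they sort ahead of every fetched page.
    static float generatorSortValue(Page page, float initSort) {
        float sort = page.score * initSort;
        return page.status == Status.UNFETCHED ? sort + UNFETCHED_BOOST : sort;
    }

    // Imitates the generator picking the n URLs with the highest sort value.
    static List<String> topN(List<Page> pages, int n) {
        List<Page> sorted = new ArrayList<>(pages);
        sorted.sort(Comparator
                .comparingDouble((Page p) -> generatorSortValue(p, 1.0f))
                .reversed());
        List<String> urls = new ArrayList<>();
        for (Page p : sorted.subList(0, Math.min(n, sorted.size()))) {
            urls.add(p.url);
        }
        return urls;
    }

    public static void main(String[] args) {
        List<Page> pages = new ArrayList<>();
        pages.add(new Page("http://example.com/a", Status.FETCHED, 5.0f));
        pages.add(new Page("http://example.com/b", Status.UNFETCHED, 0.2f));
        pages.add(new Page("http://example.com/c", Status.UNFETCHED, 0.9f));
        pages.add(new Page("http://example.com/d", Status.FETCHED, 1.0f));
        // The two unfetched pages are selected first despite their low scores.
        System.out.println(topN(pages, 2));
    }
}
```

Note that a boost like this only reorders the candidates. Already-fetched URLs are not excluded, so they would still be selected once the requested number of URLs exceeds the number of unfetched ones.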



Source: https://stackoverflow.com/questions/43993032/generate-only-unfetched-urls-instead-of-scored-nutch-2-3
