Question
Is there any way in Nutch 2.x to generate only the unfetched URLs, instead of selecting them by score?
Answer 1:
Well, for Nutch 1.x you could use the JEXL support that has shipped since Nutch 1.12 (I think):
$ bin/nutch generate -expr "status == db_unfetched"
With this command, only the URLs with a db_unfetched status are considered
when generating the segments that you want to crawl.
This feature is not yet available on the 2.x branch, but writing a custom GeneratorJob could do the trick.
On the other hand, since the GeneratorJob already uses the score to sort the list of URLs to fetch, perhaps an easier way would be to write a custom ScoringFilter.
For instance, if you take a look at https://github.com/apache/nutch/blob/2.x/src/java/org/apache/nutch/scoring/ScoringFilter.java#L69-L81, the ScoringFilter interface already provides a generatorSortValue
method whose only purpose is to produce a sort value for the generator job, so you could write your own implementation that boosts URLs with an unfetched status.
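As a rough illustration of the boosting idea, here is a minimal standalone sketch of the logic such a generatorSortValue implementation might contain. It deliberately imports no Nutch classes: the class name UnfetchedBoostSketch, the status constants, and the UNFETCHED_BOOST value are all hypothetical stand-ins, and a real plugin would instead implement the ScoringFilter interface and read the status from the page it is given. It also assumes that the generator picks higher sort values first.

```java
// Standalone sketch of the boosting logic only. Nothing here is Nutch API:
// the status codes and the boost value are hypothetical placeholders.
final class UnfetchedBoostSketch {

    // Hypothetical stand-ins for the crawl status codes stored on a page.
    static final int STATUS_UNFETCHED = 1;
    static final int STATUS_FETCHED = 2;

    // Hypothetical offset, chosen to be larger than any ordinary score so
    // that every unfetched URL outranks every already-fetched one.
    static final float UNFETCHED_BOOST = 1_000_000f;

    private UnfetchedBoostSketch() {
    }

    /**
     * Returns the value the generator would sort on.
     *
     * @param status   crawl status of the page (stand-in constant above)
     * @param initSort sort value computed so far, e.g. the page score
     */
    static float generatorSortValue(int status, float initSort) {
        if (status == STATUS_UNFETCHED) {
            // An additive boost keeps the relative order among unfetched
            // URLs and still works when the initial score is zero.
            return initSort + UNFETCHED_BOOST;
        }
        // Any other status keeps its original sort value.
        return initSort;
    }
}
```

Note that this only pushes unfetched URLs to the front of the sorted list. It does not exclude other URLs, so whether already-fetched pages are still selected depends on how many URLs you ask the generator for.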
Source: https://stackoverflow.com/questions/43993032/generate-only-unfetched-urls-instead-of-scored-nutch-2-3