crawl URLs based on their priorities in StormCrawler

一曲冷凌霜, submitted on 2021-02-17 06:52:04

Question


I am working on a crawler based on the StormCrawler project. I have a requirement to crawl URLs based on their priorities. For example, there are two priority levels: HIGH and LOW, and HIGH priority URLs should be crawled as soon as possible, before LOW priority ones. How can I handle this requirement in Apache Storm and StormCrawler?


Answer 1:


With Elasticsearch as a backend, you can configure the spouts to sort the URLs within a bucket by whichever field you want. The fields are sorted in ascending order, so you could store a value of 0 for HIGH and 1 for LOW in the URL metadata and give the key name in es.status.bucket.sort.field. (The literal strings HIGH and LOW would also work, since HIGH sorts before LOW alphabetically.)
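For the sort to pick the priority up, the value has to make it into the status index. A minimal sketch of how that could look, assuming file-based seed injection; the key name priority and the example URLs are arbitrary choices, not part of the original answer:

```
# seeds.txt -- metadata as tab-separated key=value pairs after the URL
https://example.com/news/	priority=0
https://example.com/archive/	priority=1
```

and in the crawler configuration, so the key is persisted to the status index (metadata.transfer would be needed instead if outlinks should inherit it):

```yaml
metadata.persist:
  - "priority"
```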

The default values in the ES archetype are:

```yaml
es.status.bucket.sort.field:
  - "nextFetchDate"
  - "url"
```

You should keep nextFetchDate so that URLs with the same priority are still sorted by it, and use for instance:

```yaml
es.status.bucket.sort.field:
  - "metadata.priority"
  - "nextFetchDate"
  - "url"
```

Note that this won't affect how the buckets themselves are ordered, just the order of the URLs within each bucket.
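If the priority can be derived from the URL itself rather than set at injection time, one way to assign it is a custom ParseFilter that tags outlinks as they are discovered. A rough sketch, assuming StormCrawler's Java API; the class name PriorityParseFilter and the /news/ rule are hypothetical, for illustration only:

```java
import org.w3c.dom.DocumentFragment;

import com.digitalpebble.stormcrawler.Outlink;
import com.digitalpebble.stormcrawler.parse.ParseFilter;
import com.digitalpebble.stormcrawler.parse.ParseResult;

/** Tags discovered outlinks with a priority value used by the ES spout sort. */
public class PriorityParseFilter extends ParseFilter {

    @Override
    public void filter(String url, byte[] content, DocumentFragment doc, ParseResult parse) {
        for (Outlink outlink : parse.getOutlinks()) {
            // hypothetical rule: news pages are HIGH (0), everything else LOW (1)
            String priority = outlink.getTargetURL().contains("/news/") ? "0" : "1";
            outlink.getMetadata().setValue("priority", priority);
        }
    }
}
```

The filter would then be registered in parsefilters.json, and the priority key listed under metadata.persist so it survives into the status index.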



Source: https://stackoverflow.com/questions/65790338/crawl-urls-based-on-their-priorities-in-stormcrawler
