What sequence of steps does crawler4j follow to fetch data?

Submitted by 荒凉一梦 on 2019-12-24 08:39:05

Question


I'd like to learn:

  1. How does crawler4j work?
  2. Does it fetch a web page, then download its content and extract it?
  3. What about the .db and .csv files and their structures?

Generally, what sequence of steps does it follow?

I'd appreciate a descriptive answer.

Thanks


Answer 1:


General Crawler Process

The process for a typical multi-threaded crawler is as follows:

  1. We have a queue data structure, which is called the frontier. Newly discovered URLs (or starting points, so-called seeds) are added to this data structure. In addition, every URL is assigned a unique ID in order to determine whether it was previously visited.

  2. Crawler threads then obtain URLs from the frontier and schedule them for later processing.

  3. The actual processing starts:

    • The robots.txt for the given URL is fetched and parsed to honour exclusion criteria and be a polite web crawler (configurable).
    • Next, the thread checks for politeness, i.e. the time to wait before visiting the same host again.
    • The actual URL is visited by the crawler and the content is downloaded (this can be literally anything).
    • If we have HTML content, it is parsed and potential new URLs are extracted and added to the frontier (in crawler4j this can be controlled via shouldVisit(...)).
  4. The whole process is repeated until no new URLs are added to the frontier.
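The frontier loop above can be sketched in plain Java. This is a single-threaded illustration, not crawler4j itself: the in-memory `WEB` map stands in for real HTTP fetching and HTML link extraction, and `shouldVisit` here is just an analogue of crawler4j's method of the same name.

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

public class FrontierSketch {

    // Toy "web": each URL maps to the URLs found on its page.
    // A real crawler would fetch over HTTP and parse HTML instead.
    static final Map<String, List<String>> WEB = Map.of(
            "http://a.example/", List.of("http://b.example/", "http://c.example/"),
            "http://b.example/", List.of("http://c.example/"),
            "http://c.example/", List.of("http://a.example/"));

    public static Set<String> crawl(String seed) {
        Queue<String> frontier = new ArrayDeque<>(); // the frontier
        Set<String> visited = new HashSet<>();       // "was this URL seen before?" check
        frontier.add(seed);
        visited.add(seed);
        while (!frontier.isEmpty()) {                // repeat until no new URLs remain
            String url = frontier.poll();
            // "download" the page and extract its outgoing links
            for (String link : WEB.getOrDefault(url, List.of())) {
                // visited.add(...) returns false if the link was already known
                if (shouldVisit(link) && visited.add(link)) {
                    frontier.add(link);              // newly discovered URL
                }
            }
        }
        return visited;
    }

    // Analogue of crawler4j's shouldVisit(...): filter which URLs enter the frontier.
    static boolean shouldVisit(String url) {
        return url.endsWith(".example/");
    }

    public static void main(String[] args) {
        System.out.println(crawl("http://a.example/").size()); // 3 pages visited
    }
}
```

The loop terminates because `visited` guarantees each URL enters the frontier at most once.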

General (Focused) Crawler Architecture

Besides the implementation details of crawler4j, a more or less general (focused) crawler architecture (on a single server/PC) looks like this:
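The multi-threaded variant of this architecture can be sketched with standard Java concurrency primitives: a thread-safe frontier shared by worker threads, a concurrent visited set, and a pending-work counter for termination detection. Again, this is a hypothetical stdlib-only sketch (the class name, `WEB` map, and counter scheme are illustrative), not crawler4j's actual internals.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class ThreadedFrontier {

    // Same toy link graph as before, standing in for HTTP fetching + parsing.
    static final Map<String, List<String>> WEB = Map.of(
            "http://a.example/", List.of("http://b.example/", "http://c.example/"),
            "http://b.example/", List.of("http://c.example/"),
            "http://c.example/", List.of("http://a.example/"));

    public static Set<String> crawl(String seed, int threads) throws InterruptedException {
        BlockingQueue<String> frontier = new LinkedBlockingQueue<>(); // shared frontier
        Set<String> visited = ConcurrentHashMap.newKeySet();          // concurrent seen-set
        AtomicInteger pending = new AtomicInteger(0);                 // URLs enqueued, not yet processed
        visited.add(seed);
        pending.incrementAndGet();
        frontier.add(seed);

        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < threads; i++) {
            pool.submit(() -> {
                // Workers run until no URL is queued or being processed anywhere.
                while (pending.get() > 0) {
                    String url;
                    try {
                        url = frontier.poll(50, TimeUnit.MILLISECONDS);
                    } catch (InterruptedException e) {
                        return;
                    }
                    if (url == null) continue; // timed out; re-check termination condition
                    for (String link : WEB.getOrDefault(url, List.of())) {
                        if (visited.add(link)) {       // only enqueue unseen URLs
                            pending.incrementAndGet(); // count new work before finishing old
                            frontier.add(link);
                        }
                    }
                    pending.decrementAndGet(); // this URL is fully processed
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
        return visited;
    }
}
```

New work is counted before the current URL is marked done, so `pending` can only reach zero once every discovered URL has been processed.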

Disclaimer: Image is my own work. Please respect this by referencing this post.



Source: https://stackoverflow.com/questions/53351712/what-sequence-of-steps-does-crawler4j-follow-to-fetch-data
