How to write a crawler?

感情败类 2020-12-02 03:47

I've been thinking about trying to write a simple crawler that might crawl our NPO's websites and content and produce a list of its findings.

Does anybody have any thoughts on where to begin?

10 Answers
  •  南方客 · 2020-12-02 04:28

    Crawlers are simple in concept.

    You get a root page via an HTTP GET, parse it to find URLs, and put them on a queue unless they have already been parsed (so you need a global record of pages you've already visited).
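    Here is a minimal single-threaded sketch of that loop in Python, using only the standard library; the root URL and the page limit are placeholders you'd swap for your own:

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag it sees."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(root, max_pages=50):
    queue = deque([root])
    seen = {root}                    # global record of URLs already queued
    fetched = 0
    while queue and fetched < max_pages:
        url = queue.popleft()
        try:
            with urlopen(url, timeout=10) as resp:
                body = resp.read().decode("utf-8", errors="replace")
        except (OSError, ValueError):
            continue                 # skip anything that fails to fetch
        fetched += 1
        parser = LinkParser()
        parser.feed(body)
        for href in parser.links:
            absolute = urljoin(url, href)          # resolve relative links
            if absolute.startswith(("http://", "https://")) and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
        yield url

if __name__ == "__main__":
    for page in crawl("https://example.org/"):    # placeholder root page
        print(page)
```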

    You can use the Content-Type response header to find out what kind of content you got back, and limit your crawler to parsing only the HTML types.
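    As a sketch, the fetch step might check that header before committing to a full parse (`fetch_if_html` is just an illustrative name):

```python
from urllib.request import urlopen

def fetch_if_html(url):
    """Return the page body only when the server reports an HTML type."""
    with urlopen(url, timeout=10) as resp:
        if "text/html" not in resp.headers.get("Content-Type", ""):
            return None              # skip PDFs, images, other binaries
        return resp.read().decode("utf-8", errors="replace")
```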

    You can strip out the HTML tags to get the plain text, which you can run text analysis on (to extract keywords and so on, the meat of the page). If you want to get that advanced, you can do the same with the alt/title attributes of images.
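    One way to do that with the standard-library parser, assuming you also want to skip script/style content and pick up image alt/title text, is a sketch like this:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Keeps visible text content, plus the alt/title text of images."""
    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0               # depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag == "img":
            for name, value in attrs:
                if name in ("alt", "title") and value:
                    self.chunks.append(value)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.chunks.append(data)

    def text(self):
        return " ".join(c.strip() for c in self.chunks if c.strip())

extractor = TextExtractor()
extractor.feed("<p>Hello <b>world</b> <img src='x.png' alt='a picture'></p>")
print(extractor.text())              # Hello world a picture
```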

    And in the background you can have a pool of threads pulling URLs off the queue and doing the same work. Of course, you'll want to limit the number of threads.
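    A bounded worker pool over a thread-safe queue might look like the sketch below; `process` is a hypothetical stand-in for the fetch-and-parse step, and the URLs are placeholders:

```python
import queue
import threading

def process(url):
    """Hypothetical fetch-and-parse step; here it just reports the URL.
    A real worker would fetch the page, extract links, and enqueue any
    URL not already in a shared, lock-guarded `seen` set."""
    print(f"[{threading.current_thread().name}] {url}")

def worker(url_queue):
    while True:
        url = url_queue.get()
        if url is None:              # sentinel: shut this worker down
            url_queue.task_done()
            break
        process(url)
        url_queue.task_done()

NUM_WORKERS = 4                      # keep the thread count bounded
url_queue = queue.Queue()

for _ in range(NUM_WORKERS):
    threading.Thread(target=worker, args=(url_queue,), daemon=True).start()

for url in ("https://example.org/a", "https://example.org/b"):
    url_queue.put(url)

url_queue.join()                     # wait until every queued URL is handled
for _ in range(NUM_WORKERS):
    url_queue.put(None)              # stop the workers
```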
