Which web crawler for extracting and parsing data from about a thousand websites?

庸人自扰 2021-02-06 15:45

I'm trying to crawl about a thousand websites, and I'm interested only in their HTML content.

Then I transform the HTML into XML to be parsed with XPath to ex
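
For readers who want to see the shape of the HTML-to-XML-to-XPath step described above, here is a minimal sketch using only the Python standard library. It is an illustration under stated assumptions, not the asker's actual code: `TreeBuildingParser`, `html_to_tree`, and the sample page are hypothetical names invented for this example, and `xml.etree.ElementTree` supports only a limited subset of XPath. Real-world pages are usually messier than this sample, so a dedicated HTML parser would likely be needed in practice.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# HTML elements that never take a closing tag.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TreeBuildingParser(HTMLParser):
    """Hypothetical helper: turns HTML parser events into an XML element tree."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.builder = ET.TreeBuilder()
        self.open_tags = []  # tags opened but not yet closed

    def handle_starttag(self, tag, attrs):
        # Valueless attributes (e.g. "disabled") arrive as None; store "".
        self.builder.start(tag, {k: (v or "") for k, v in attrs})
        if tag in VOID_TAGS:
            self.builder.end(tag)  # close immediately so the XML is well formed
        else:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            return  # stray closing tag: ignore it
        # Close any children the page left open, then the tag itself.
        while self.open_tags:
            top = self.open_tags.pop()
            self.builder.end(top)
            if top == tag:
                break

    def handle_data(self, data):
        if self.open_tags:  # skip text outside the root element
            self.builder.data(data)

    def finish(self):
        self.close()
        while self.open_tags:  # close whatever is still open at end of input
            self.builder.end(self.open_tags.pop())
        return self.builder.close()


def html_to_tree(html: str) -> ET.Element:
    """Parse an HTML string into an ElementTree root element."""
    parser = TreeBuildingParser()
    parser.feed(html)
    return parser.finish()


# Hypothetical sample page with an unclosed <br> and unclosed <li> elements.
SAMPLE_HTML = """<!DOCTYPE html>
<html><head><title>Example</title></head>
<body>
<p class="intro">Hello<br>world</p>
<ul><li><a href="/a">A</a><li><a href="/b">B</a></ul>
</body></html>"""

root = html_to_tree(SAMPLE_HTML)
title = root.findtext(".//title")                       # XPath-style lookup
links = [a.get("href") for a in root.findall(".//a")]   # all link targets
print(title, links)
```

The sketch closes unclosed elements naively (the second `<li>` ends up nested inside the first), which is enough for descendant queries such as `.//a` but would not match what a browser builds.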

3 answers
  •  無奈伤痛
    2021-02-06 16:18

    Wow. State-of-the-art crawlers, like the ones search engines use, crawl and index 1 million URLs a day on a single box. Sure, the HTML-to-XML rendering step takes a bit, but I agree with you on the performance. I've only used private crawlers, so I can't recommend one you'll be able to use, but I hope these performance numbers help in your evaluation.
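
To put the figure of 1 million URLs per day in perspective, the arithmetic can be sketched as follows. The daily rate comes from the answer above; the `pages_per_site` value is purely an assumption chosen for illustration, since the question does not say how many pages each site has.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def urls_per_second(urls_per_day: float) -> float:
    """Average sustained fetch rate implied by a daily URL count."""
    return urls_per_day / SECONDS_PER_DAY


def days_needed(sites: int, pages_per_site: int, urls_per_day: float) -> float:
    """Days to crawl every page once at the given daily rate."""
    return sites * pages_per_site / urls_per_day


rate = urls_per_second(1_000_000)  # figure quoted in the answer
# pages_per_site=1000 is a hypothetical value, not from the question.
duration = days_needed(sites=1000, pages_per_site=1000, urls_per_day=1_000_000)
print(f"{rate:.1f} URLs/s, {duration:.1f} day(s)")
```

So the quoted rate works out to roughly 11.6 URLs per second on average, and under the assumed page count a thousand sites would take about one day on one machine.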
