Websites that are particularly challenging to crawl and scrape? [closed]

Submitted by 萝らか妹 on 2019-12-03 12:35:16
flyer

Here are some:

  • Content loaded via AJAX in the form of onclicks or infinite scrolling
    • pinterest
    • comments in such a page
      This is a Chinese product page whose comments are loaded by AJAX, triggered either by scrolling down in the browser or depending on the height of the browser window. I had to use PhantomJS and xvfb to trigger these actions.
  • Anti-scraping measures (but not banning crawlers via robots.txt)
    • amazon next page
      I have crawled the amazon site in China. When I tried to crawl the next page from such pages, the site sometimes modified the requests, so what came back was not the real next page.
    • stackoverflow
      It limits how frequently you can visit. A few days ago, I wanted to get all of the tags on stackoverflow and set the spider's visit frequency to 10, but I got a warning from stackoverflow. After that I had to use proxies to crawl stackoverflow.
  • and anything else that generally makes crawling a website a headache
    • yihaodian
      This is a Chinese e-commerce site. When you visit it in a browser, it shows your location and offers products according to that location.
    • etc.
      There are many sites like the above that serve different content according to your location. When you crawl such sites, what you get is not the same as what you see in a browser. You often need to set a cookie when sending a request from a spider.
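
For sites with a visit-frequency limit like the stackoverflow case above, one thing to try before reaching for proxies is simply spacing requests out. Here is a minimal sketch of that idea, not what I actually ran back then. The `Throttle` class and the `min_interval` value are my own assumptions for illustration; I don't know the real limit any particular site enforces. The `clock` and `sleep` parameters are only there so the behaviour can be checked without really waiting.

```python
import time


class Throttle:
    """Enforce a minimum delay between consecutive requests (hypothetical helper)."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval  # seconds between requests (assumed value)
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request; None before the first one

    def wait(self):
        """Block until at least min_interval has passed since the previous call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

In a spider you would create one `Throttle` per site and call `wait()` right before each request, so the first request goes out immediately and later ones are delayed only as much as needed.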

Last year I encountered a site that required specific HTTP request headers and some cookies when sending requests, but I don't remember which site it was.
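
To illustrate what sending headers and cookies from a spider can look like, here is a rough sketch using only Python's standard `urllib.request`. The `build_request` helper, the example URL, the cookie names and the `Accept-Language` value are all made up for illustration; a real site will expect its own cookie names and values, which you would normally copy from a browser session.

```python
import urllib.request


def build_request(url, cookies, user_agent):
    """Build a request with browser-like headers and a Cookie header (hypothetical helper)."""
    # Join the cookies into the "name=value; name=value" form used by the Cookie header.
    cookie_header = "; ".join(f"{name}={value}" for name, value in cookies.items())
    headers = {
        "User-Agent": user_agent,
        "Accept-Language": "zh-CN,zh;q=0.9",  # assumed value, adjust per site
        "Cookie": cookie_header,
    }
    return urllib.request.Request(url, headers=headers)


# Example with placeholder values; nothing is sent until urlopen() is called.
request = build_request(
    "https://example.com/item/1",
    {"region": "shanghai", "session": "abc"},
    "Mozilla/5.0",
)
# response = urllib.request.urlopen(request)
```

If the page you get back still differs from what the browser shows, comparing the spider's headers and cookies against the ones the browser sends is usually the first thing to check.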
