Websites that are particularly challenging to crawl and scrape? [closed]

I'm interested in public-facing sites (nothing behind a login or authentication) that have things like:

- Heavy use of internal 301 and 302 redirects
- Anti-scraping measures (but not banning crawlers via robots.txt)
- Non-semantic or invalid markup
- Content loaded via AJAX in the form of onclicks or infinite scrolling
- Lots of parameters used in URLs
- Canonical problems
- Convoluted internal link structure
- and anything else that generally makes crawling a website a headache!

I have built a crawler / spider that performs a range of analyses on a website, and I'm on the lookout for sites that will make it struggle.

flyer

Here are some:

Content loaded via AJAX in the form of onclicks or infinite scrolling
- Pinterest
- comments on a product page
  This is a Chinese product page whose comments are loaded via AJAX, triggered by scrolling down in a browser or depending on the browser window's height. I had to use PhantomJS and xvfb to trigger these actions.
Anti-scraping measures (but not banning crawlers via robots.txt)
- Amazon next page
  I have crawled the Amazon site in China. When I tried to crawl the next page of such listings, the site could modify the request, so the spider did not get the real next page.
- Stack Overflow
  It limits how frequently you can visit. A few days ago I wanted to fetch all of the tags on Stack Overflow and set the spider's visit frequency to 10, but I got a warning from Stack Overflow. Here's the screenshot. After that I had to use proxies to crawl Stack Overflow.
and anything else that generally makes crawling a website a headache
- yihaodian
  This is a Chinese e-commerce site. When you visit it in a browser, it shows your location and offers products based on that location.
- etc.
  There are many sites like the ones above that serve different content depending on your location. When you crawl such sites, what you get is not the same as what you see in a browser. You often need to set a cookie when sending a request from a spider.
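
For the visit-frequency limit mentioned above, one thing to try before reaching for proxies is throttling the spider per host. Here is a minimal sketch, not what I actually ran: the `Throttle` class, its `min_interval` parameter and the 2-second value are all assumptions for illustration, and the right interval depends on the site.

```python
import time


class Throttle:
    """Enforce a minimum delay between consecutive requests to one host.

    Hypothetical helper for illustration; not taken from any library.
    """

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval  # seconds between hits on a host
        self._clock = clock               # injectable so it can be tested
        self._sleep = sleep
        self._last = {}                   # host -> time of the latest request

    def wait(self, host):
        """Sleep if needed, then record the request. Returns seconds slept."""
        now = self._clock()
        last = self._last.get(host)
        delay = 0.0
        if last is not None:
            # Time still owed before this host may be hit again.
            delay = max(0.0, self.min_interval - (now - last))
        if delay > 0:
            self._sleep(delay)
        self._last[host] = now + delay
        return delay


# Assumed interval; call throttle.wait(host) right before each fetch.
throttle = Throttle(min_interval=2.0)
```

Each host gets its own timer, so a crawl that spans several sites is only slowed down on the site that is being hit repeatedly.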

Last year I encountered a site that required specific HTTP request headers and some cookies on every request, but I don't remember which site it was.
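
Sending headers and cookies from a spider can be sketched with the standard library alone. Everything specific here is an assumption: the `build_request` helper, the example.com URL, the `province_id` cookie name and the header values are placeholders, since the real names depend on the site.

```python
from urllib.request import Request


def build_request(url, cookies, user_agent):
    """Return a Request carrying browser-like headers and a Cookie header.

    Hypothetical helper; the header set is a guess at what a site may check.
    """
    # Cookies go out as a single "name=value; name=value" header.
    cookie_header = "; ".join(f"{name}={value}" for name, value in cookies.items())
    headers = {
        "User-Agent": user_agent,
        "Accept-Language": "zh-CN,zh;q=0.9",
        "Cookie": cookie_header,
    }
    return Request(url, headers=headers)


req = build_request(
    "https://example.com/list",        # placeholder URL
    cookies={"province_id": "1"},      # hypothetical location cookie
    user_agent="Mozilla/5.0 (compatible; example-crawler)",
)
# Passing req to urllib.request.urlopen would send it; left out here
# so the sketch does not touch the network.
```

In practice I would copy the cookie names and header values from what a real browser sends to the site, because that is what the site compares against.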

Source: https://stackoverflow.com/questions/18762334/websites-that-are-particularly-challenging-to-crawl-and-scrape

Tags

web-scraping

screen-scraping

web-crawler