Websites that are particularly challenging to crawl and scrape? [closed]
I'm interested in public-facing sites (nothing behind a login / authentication) that have things like:

- Heavy use of internal 301 and 302 redirects
- Anti-scraping measures (but not banning crawlers via robots.txt)
- Non-semantic or invalid markup
- Content loaded via AJAX, triggered by onclick handlers or infinite scrolling
- Lots of parameters used in URLs
- Canonicalization problems
- Convoluted internal link structure

and anything else that generally makes crawling a website a headache!

I have built a crawler / spider that performs a range of analyses on a website, and I'm on the lookout for sites that will make it