Question
I need to scrape data from a website that automatically loads the next page when you scroll to the bottom (i.e. the page extends endlessly), using Python and Beautiful Soup. How can I do that? Is there a general approach, or does it need to be tailored to each website?
Example of website: http://statigr.am/tag/cat/#/list
Answer 1:
If the page loads additional content dynamically through AJAX calls (as statigr.am does here), you have two options: drive a real browser with Selenium, or tailor your scraper to the specific website and simulate the AJAX calls yourself.
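For the real-browser route, here is a minimal sketch of the usual "scroll until the page stops growing" loop. It is an illustration under assumptions rather than a tested recipe for this site: `scroll_to_bottom`, `pause` and `max_rounds` are hypothetical names, and it only assumes that `driver` is a Selenium WebDriver (anything with an `execute_script` method will do).

```python
import time


def scroll_to_bottom(driver, pause=1.0, max_rounds=20):
    """Scroll down repeatedly until the page height stops growing.

    `driver` is assumed to be a Selenium WebDriver. Returns the number of
    scrolls performed. `max_rounds` is a safety cap for truly endless pages.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    rounds = 0
    while rounds < max_rounds:
        # Jump to the bottom, which triggers the site's "load next page" AJAX call.
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        rounds += 1
        time.sleep(pause)  # crude fixed wait for the new content to arrive
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new was appended, so assume we reached the end
        last_height = new_height
    return rounds
```

After creating a driver (e.g. `webdriver.Firefox()` from the `selenium` package), calling `driver.get(url)` and then `scroll_to_bottom(driver)`, you could pass `driver.page_source` to Beautiful Soup as usual. The fixed sleep is the simplest choice; a slow connection may need a longer `pause`.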
To tailor a scraper for statigr.am, use your browser's developer tools to see which requests are made after the page loads. You will notice that this XHR request is made first:
http://statigr.am/controller_nl.php?action=nlGetMethod&method=mediasTag&value=cat&max_id=1371516699343
It returns JSON with all the data you need. The response also has a next_max_tag_id key in its pagination dictionary, which is used for the next AJAX request to controller_nl.php.
So I'd simulate these requests with urllib2 or requests and parse the responses with the json module. It looks like there is no need to parse HTML with BeautifulSoup at all.
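A rough sketch of that paging loop follows. It uses `urllib.request` (the Python 3 successor to urllib2) so it has no third-party dependencies. Several things here are assumptions, not confirmed behavior of the site: that the `next_max_tag_id` value is passed back as the `max_id` query parameter, that the first request can omit `max_id`, and that the endpoint still responds at all. `fetch_page`, `iter_pages` and `max_pages` are hypothetical names.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint taken from the XHR request shown above.
BASE_URL = "http://statigr.am/controller_nl.php"


def fetch_page(max_id=None):
    """Fetch one page of results and return the decoded JSON (network call)."""
    params = {"action": "nlGetMethod", "method": "mediasTag", "value": "cat"}
    if max_id is not None:
        params["max_id"] = max_id  # assumption: this is where the paging id goes
    with urlopen(BASE_URL + "?" + urlencode(params), timeout=10) as resp:
        return json.loads(resp.read().decode("utf-8"))


def iter_pages(fetch=fetch_page, max_pages=5):
    """Yield JSON pages, following pagination['next_max_tag_id'].

    Stops when the key is missing or empty, or after `max_pages` pages.
    """
    max_id = None
    for _ in range(max_pages):
        page = fetch(max_id)
        yield page
        max_id = (page.get("pagination") or {}).get("next_max_tag_id")
        if not max_id:
            break
```

Iterating with `for page in iter_pages(): ...` would then give you one decoded dictionary per "scroll". Passing the fetch function in as a parameter keeps the paging logic separate from the HTTP call, so the same loop can be reused with requests or checked offline with canned responses.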
Hope that helps.
Source: https://stackoverflow.com/questions/17144536/scrape-data-from-website-that-turned-next-page-when-scrolled-to-bottom-using-pyt