Scraping site that uses AJAX

筅森魡賤 提交于 2019-12-25 03:34:37

问题


I've read some relevant posts here but couldn't figure an answer.

I'm trying to crawl a web page with reviews. When site is visited there are only 10 reviews at first and a user should press "Show more" to get 10 more reviews (that also adds #add10 to the end of site's address) every time when he scrolls down to the end of reviews list. Actually, a user can get full review list by adding #add1000 (where 1000 is a number of additional reviews) to the end of the site's address. The problem is that I get only first 10 reviews using site_url#add1000 in my spider just like with site_url so this approach doesn't work.

I also can't find a way to make an appropriate Request imitating the origin one from the site. Origin AJAX url is of the form 'domain/ajaxlst?par1=x&par2=y' and I tried all of this:

Request(url='domain/ajaxlst?par1=x&par2=y', callback=self.parse_all) 
Request(url='domain/ajaxlst?par1=x&par2=y', callback=self.parse_all,
        headers={all_headers})
Request(url='domain/ajaxlst?par1=x&par2=y', callback=self.parse_all,
        headers={all_headers}, cookies={all_cookies})

But every time I'm getting a 404 Error. Can anyone explain what I'm doing wrong?


回答1:


What you need is a headless browser for this since request module can not handle AJAX well.

One of such headless browser is selenium.

i.e.)

driver.find_element_by_id("show more").click() # This is just an example case



回答2:


Normally, when you scroll down the page, Ajax will send request to the server, and the server will then response a json/xml file back to your browser to refresh the page.

You need to figure out the url linked to this json/xml file. Normally, you can open your firefox browser and open tools/web dev/web console. monitor the network activities and you can easily catch this json/xml file.

Once you find this file, then you can directly parse reviews from them (I recommend Python Module requests and bs4 to do this work) and decrease a huge amount of time. Remember to use some different clients and IPs. Be nice to the server and it won't block you.



来源:https://stackoverflow.com/questions/34764815/scraping-site-that-uses-ajax

标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!