Scrapy can't crawl all links in a page

℡╲_俬逩灬. Submitted on 2019-12-20 05:16:33

Question


I am trying to use Scrapy to crawl an AJAX website: http://play.google.com/store/apps/category/GAME/collection/topselling_new_free

I want to get all the links directing to each game.

I inspected the elements of the page (screenshot: "how the page looks like"), so I want to extract all links matching the pattern /store/apps/details?id=.

But when I ran commands in the shell, it returned nothing (screenshot: "shell command").

I've also tried //a/@href, which didn't work either, and I don't know what is going wrong.
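For reference, the link filtering I'm after is equivalent to the XPath //a[contains(@href, '/store/apps/details?id=')]/@href. A minimal stdlib-only sketch of that match (the HTML snippet and class name below are made up for illustration, and this only works once the AJAX-loaded HTML is actually in hand):

```python
from html.parser import HTMLParser

class DetailLinkExtractor(HTMLParser):
    """Collects hrefs of <a> tags containing '/store/apps/details?id='."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href") or ""
            if "/store/apps/details?id=" in href:
                self.links.append(href)

# Made-up snippet mimicking the Play Store markup:
html = ('<a class="title" href="/store/apps/details?id=com.example.game">Game</a>'
        '<a href="/store/apps/category/GAME">Category</a>')
parser = DetailLinkExtractor()
parser.feed(html)
# parser.links -> ['/store/apps/details?id=com.example.game']
```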

  • Now I can crawl the first 120 links, with the start URL modified and 'formdata' added as someone suggested, but no more links after that.

Can someone help me with this?


Answer 1:


It's actually an AJAX POST request that populates the data on that page. You won't see it in the Scrapy shell; instead of inspecting elements, open the Network tab in your browser's developer tools, where you will find the request.

Make a POST request to the URL https://play.google.com/store/apps/category/GAME/collection/topselling_new_free?authuser=0 with formdata={'start': '0', 'num': '60', 'numChildren': '0', 'ipf': '1', 'xhr': '1'}.

Increment start by 60 on each request to get the paginated result.



来源:https://stackoverflow.com/questions/35304470/scrapy-cant-crawl-all-links-in-a-page
