How to use CrawlSpider from Scrapy to click a link with a JavaScript onclick?

Submitted by 不打扰是莪最后的温柔 on 2019-12-03 14:44:56

Question


I want Scrapy to crawl pages where the link to the next page looks like this:

<a href="#" onclick="return gotoPage('2');"> Next </a>

Will Scrapy be able to interpret the JavaScript in that link?

Using the LiveHTTPHeaders extension, I found that clicking Next generates a POST request with a really huge chunk of "garbage" starting like this:

encoded_session_hidden_map=H4sIAAAAAAAAALWZXWwj1RXHJ9n
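Incidentally, the H4sI prefix is a strong hint about what that "garbage" is: it is the base64 encoding of the gzip magic bytes (1f 8b 08), so the hidden field is most likely gzip-compressed, base64-encoded session state. A quick check with the standard library (the payload below is made up):

```python
import base64
import gzip

# "H4sI" is simply the base64 encoding of the gzip magic bytes 1f 8b 08,
# so any base64-encoded gzip stream starts with it, regardless of content.
payload = gzip.compress(b"session-state-goes-here")  # arbitrary stand-in data
encoded = base64.b64encode(payload).decode("ascii")
print(encoded[:4])  # -> H4sI
```

Decoding the real parameter the same way (base64-decode, then gunzip) would reveal what session state the site is round-tripping.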

I am trying to build my spider on the CrawlSpider class, but I can't figure out how to code it. With BaseSpider I used the parse() method to process the first URL (which happens to be a login form), where I did a POST with:

def logon(self, response):
    login_form_data = {
        'email': 'user@example.com',
        'password': 'mypass22',
        'action': 'sign-in',
    }
    return [FormRequest.from_response(
        response,
        formnumber=0,
        formdata=login_form_data,
        callback=self.submit_next,
    )]

I then defined submit_next() to say what to do next. How do I tell CrawlSpider which method to use on the first URL?

All requests in my crawl, except the first one, are POST requests. They alternate between two types: posting some data, and clicking "Next" to go to the next page.


Answer 1:


The actual methodology is as follows:

  1. POST your request to reach the page (as you are doing).
  2. Extract the link to the next page from that response.
  3. Issue a plain Request for the next page if possible, or use FormRequest again if applicable.
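Step 2 here amounts to pulling the target page number out of the onclick attribute, since the href is just "#". A minimal sketch using a plain regex (the gotoPage pattern comes from the question; an XPath/CSS selector on the real response would do the same job):

```python
import re

html = '<a href="#" onclick="return gotoPage(\'2\');"> Next </a>'

# Capture the page number passed to gotoPage(...) in the onclick handler.
match = re.search(r"gotoPage\('(\d+)'\)", html)
next_page = match.group(1) if match else None
print(next_page)  # -> 2
```

That page number then becomes a parameter of the next POST, rather than a URL to follow.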

All of this has to be aligned with the server's response mechanism, e.g.:

  • You can try using dont_click=True in FormRequest.from_response
  • Or you may need to handle the 302 redirect coming from the server (in which case you will have to indicate in the request meta that the redirect response should also be delivered to your callback).

Now, how to figure it all out: use a web debugger like Fiddler, the Firefox plugin Firebug, or simply hit F12 in IE 9, and check that the requests a user actually makes on the website match the way you are crawling it.




Answer 2:


I built a quick crawler that executes JS via Selenium. Feel free to copy/modify: https://github.com/rickysahu/seleniumjscrawl



Source: https://stackoverflow.com/questions/2454998/how-to-use-crawlspider-from-scrapy-to-click-a-link-with-javascript-onclick
