How to combine scrapy and htmlunit to crawl urls with javascript


Question


I'm working with Scrapy to crawl pages, but I can't handle pages that use JavaScript. People suggested I use HtmlUnit, so I installed it, but I don't know how to use it at all. Can anyone give me an example (Scrapy + HtmlUnit)? Thanks very much.


Answer 1:


To handle pages with JavaScript you can use WebKit or Selenium.

Here are some snippets from snippets.scrapy.org:

Rendered/interactive javascript with gtk/webkit/jswebkit

Rendered Javascript Crawler With Scrapy and Selenium RC
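As a minimal illustration of the Selenium route (the driver choice and URL are placeholders, and Selector(text=...) assumes a reasonably modern Scrapy), you can render a page in a real browser and then run Scrapy's selectors over the rendered DOM:

from selenium import webdriver
from scrapy.selector import Selector

# Render the page in a real browser so its JavaScript runs, then parse
# the resulting DOM with Scrapy's selector machinery.
driver = webdriver.Firefox()                    # placeholder driver
driver.get('http://example.com/js-page')        # placeholder URL
sel = Selector(text=driver.page_source)
print(sel.xpath('//title/text()').extract())
driver.quit()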




Answer 2:


Here is a working example using Selenium and the PhantomJS headless webdriver in a downloader middleware:

from selenium import webdriver
from scrapy.http import HtmlResponse

class JsDownload(object):

    @check_spider_middleware
    def process_request(self, request, spider):
        # Let PhantomJS execute the page's JavaScript, then hand the
        # rendered DOM back to Scrapy as an ordinary HtmlResponse.
        driver = webdriver.PhantomJS(executable_path=r'D:\phantomjs.exe')
        driver.get(request.url)
        body = driver.page_source.encode('utf-8')
        driver.quit()
        return HtmlResponse(request.url, encoding='utf-8', body=body)
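Because process_request returns a Response, Scrapy short-circuits the rest of the download chain: the rendered body goes straight to the spider, and the default downloader never fetches the URL itself.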

I wanted the ability to tell different spiders which middleware to use, so I implemented this wrapper:

import functools
from scrapy import log

def check_spider_middleware(method):
    @functools.wraps(method)
    def wrapper(self, request, spider):
        # The doubled %% survives the first substitution, leaving a
        # template such as '%s JsDownload middleware step'.
        msg = '%%s %s middleware step' % (self.__class__.__name__,)
        if self.__class__ in spider.middleware:
            spider.log(msg % 'executing', level=log.DEBUG)
            return method(self, request, spider)
        else:
            spider.log(msg % 'skipping', level=log.DEBUG)
            return None

    return wrapper

settings.py:

DOWNLOADER_MIDDLEWARES = {'MyProj.middleware.MiddleWareModule.MiddleWareClass': 500}
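The value 500 is the middleware's priority; it controls where this process_request is called relative to Scrapy's built-in downloader middlewares.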

For the wrapper to work, all spiders must declare at minimum:

middleware = set([])

To opt in to a middleware:

middleware = set([MyProj.middleware.ModuleName.ClassName])
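For instance, a hypothetical spider that opts in might look like this (the spider name, URL, and import path are illustrative; very old Scrapy versions used BaseSpider instead of Spider):

from scrapy.spider import Spider
from MyProj.middleware.MiddleWareModule import MiddleWareClass

class MySpider(Spider):
    name = 'js_spider'                    # illustrative name
    start_urls = ['http://example.com']   # illustrative URL
    middleware = set([MiddleWareClass])   # opt in to the JS middleware

    def parse(self, response):
        # response.body is already the JavaScript-rendered HTML
        self.log('rendered page is %d bytes' % len(response.body))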

The main advantage of implementing it this way, rather than in the spider, is that you only end up making one request. In the solution at reclosedev's second link, for example, the download handler processes the request and then hands the response off to the spider. The spider then makes a brand-new request in its parse_page function; that's two requests for the same content.

Another example: https://github.com/scrapinghub/scrapyjs

Cheers!



Source: https://stackoverflow.com/questions/8047666/how-to-combine-scrapy-and-htmlunit-to-crawl-urls-with-javascript
