I have recently been learning Python and am dipping my hand into building a web-scraper. It\'s nothing fancy at all; its only purpose is to get the data off of a betting we
Another solution would be to implement a download handler or download handler middleware. (see scrapy docs for more information on downloader middleware) The following is an example class using selenium with headless phantomjs webdriver:
1) Define class within the middlewares.py
script.
from selenium import webdriver
from scrapy.http import HtmlResponse
class JsDownload(object):
@check_spider_middleware
def process_request(self, request, spider):
driver = webdriver.PhantomJS(executable_path='D:\phantomjs.exe')
driver.get(request.url)
return HtmlResponse(request.url, encoding='utf-8', body=driver.page_source.encode('utf-8'))
2) Add JsDownload()
class to variable DOWNLOADER_MIDDLEWARE
within settings.py
:
DOWNLOADER_MIDDLEWARES = {'MyProj.middleware.MiddleWareModule.MiddleWareClass': 500}
3) Integrate the HTMLResponse
within your_spider.py
. Decoding the response body will get you the desired output.
class Spider(CrawlSpider):
# define unique name of spider
name = "spider"
start_urls = ["https://www.url.de"]
def parse(self, response):
# initialize items
item = CrawlerItem()
# store data as items
item["js_enabled"] = response.body.decode("utf-8")
Optional Addon:
I wanted the ability to tell different spiders which middleware to use so I implemented this wrapper:
def check_spider_middleware(method):
@functools.wraps(method)
def wrapper(self, request, spider):
msg = '%%s %s middleware step' % (self.__class__.__name__,)
if self.__class__ in spider.middleware:
spider.log(msg % 'executing', level=log.DEBUG)
return method(self, request, spider)
else:
spider.log(msg % 'skipping', level=log.DEBUG)
return None
return wrapper
for wrapper to work all spiders must have at minimum:
middleware = set([])
to include a middleware:
middleware = set([MyProj.middleware.ModuleName.ClassName])
Advantage:
The main advantage to implementing it this way rather than in the spider is that you only end up making one request. In A T's solution for example: The download handler processes the request and then hands off the response to the spider. The spider then makes a brand new request in it's parse_page function -- That's two requests for the same content.