I'm using Scrapy + Splash to crawl webpages and try to extract data from Google ad banners and other ads, and I'm having difficulty getting Scrapy to follow the xpath into
The problem is that iframe content is not returned as part of the page HTML. You can either fetch the iframe content directly (by its src), or use the render.json endpoint with the iframes=1 option:
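import parsel
from scrapy_splash import SplashRequest

# ...
    # html=1 returns the rendered HTML, iframes=1 adds a 'childFrames' list
    yield SplashRequest(url, self.parse_result, endpoint='render.json',
                        args={'html': 1, 'iframes': 1})

def parse_result(self, response):
    # response.data is the decoded JSON; this takes the first iframe only
    iframe_html = response.data['childFrames'][0]['html']
    sel = parsel.Selector(iframe_html)
    item = {
        'my_field': sel.xpath(...),
        # ...
    }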
The /execute endpoint doesn't support fetching iframe content as of Splash 2.3.3.
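The snippet above reads only the first child frame. A page with several ad slots will usually have several iframes, and frames can be nested, so it may help to walk the whole `childFrames` tree. Here is a minimal sketch, assuming each frame object in the render.json output carries `url`, `html` and its own nested `childFrames` list. `iter_frames` is a hypothetical helper, not part of Splash or scrapy-splash, and `sample` is hand-written stand-in data rather than real Splash output:

```python
def iter_frames(frame):
    """Yield (url, html) for every frame nested under `frame`, depth-first."""
    for child in frame.get('childFrames', []):
        # 'html' is assumed to be present because the request used html=1
        yield child.get('url'), child.get('html', '')
        # a frame can contain further iframes, so recurse into it
        yield from iter_frames(child)


# Hand-written stand-in for response.data (not real Splash output)
sample = {
    'url': 'https://example.com/',
    'html': '<html>main page</html>',
    'childFrames': [
        {
            'url': 'https://ads.example.net/slot1',
            'html': '<html>ad 1</html>',
            'childFrames': [
                {
                    'url': 'https://ads.example.net/inner',
                    'html': '<html>nested ad</html>',
                    'childFrames': [],
                },
            ],
        },
        {
            'url': 'https://ads.example.net/slot2',
            'html': '<html>ad 2</html>',
            'childFrames': [],
        },
    ],
}

frames = list(iter_frames(sample))
```

In a spider you would call `iter_frames(response.data)` inside `parse_result` and pass each `html` string to `parsel.Selector`, as in the answer above.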
An alternative way to deal with iframes (here response is the main page):
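urls = response.css('iframe::attr(src)').extract()
for url in urls:
    # request each iframe like a normal page; parse_iframe is a hypothetical callback
    yield response.follow(url, callback=self.parse_iframe)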
This way the iframe is parsed as if it were a normal page. At the moment, though, I cannot send the cookies from the main page along with the request for the HTML inside the iframe, and that's a problem.
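One practical detail with this approach: the collected src values are not always directly fetchable. They can be protocol-relative (`//host/path`), relative to the page, or placeholders such as `about:blank`. Here is a minimal sketch for cleaning the list before requesting, using only the standard library. `iframe_urls` is a hypothetical helper name, and in a spider `page_url` would be `response.url`:

```python
from urllib.parse import urljoin, urlparse


def iframe_urls(page_url, srcs):
    """Return absolute http(s) URLs for the given iframe src values."""
    result = []
    for src in srcs:
        if not src:
            # skip iframes with an empty src attribute
            continue
        # resolve relative and protocol-relative values against the page URL
        absolute = urljoin(page_url, src.strip())
        # keep only URLs that can actually be requested over HTTP(S)
        if urlparse(absolute).scheme in ('http', 'https'):
            result.append(absolute)
    return result


cleaned = iframe_urls(
    'https://example.com/news/index.html',
    ['//ads.example.net/slot?id=1', 'frame.html', 'about:blank', ''],
)
```

This does not address the cookie issue mentioned above; it only makes sure each URL handed to the crawler is absolute and requestable.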