Scrapy + Splash: scraping element inside inner html

长情又很酷 2020-12-18 16:44

I'm using Scrapy + Splash to crawl webpages and trying to extract data from Google ad banners and other ads, and I'm having difficulty getting Scrapy to follow the XPath into

2 answers
  •  -上瘾入骨i
    2020-12-18 16:58

    The problem is that iframe content is not returned as part of the page HTML. You can either fetch the iframe content directly (by its src), or use the render.json endpoint with the iframes=1 option:

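    import parsel
    from scrapy_splash import SplashRequest

    # ...
        # html=1 returns each frame's HTML; iframes=1 adds child frame info
        yield SplashRequest(url, self.parse_result, endpoint='render.json',
                            args={'html': 1, 'iframes': 1})

    def parse_result(self, response):
        # response.data is the decoded render.json output;
        # 'childFrames' has one entry per iframe, [0] is the first one
        iframe_html = response.data['childFrames'][0]['html']
        sel = parsel.Selector(iframe_html)
        item = {
            'my_field': sel.xpath(...),
            # ...
        }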
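
The snippet above reads only the first child frame. Ad banners are often nested several iframes deep, so a minimal sketch of walking the whole frame tree may help. It assumes each frame dict can carry its own 'childFrames' list, in the same shape as the top-level render.json output. The helper names iter_frames and find_frames and the sample data are hypothetical, not part of Splash or scrapy-splash.

```python
def iter_frames(frame):
    """Yield every child frame dict below `frame`, depth-first."""
    for child in frame.get('childFrames', []):
        yield child
        yield from iter_frames(child)


def find_frames(data, needle):
    """Return the frames whose HTML contains `needle` (plain substring)."""
    return [f for f in iter_frames(data) if needle in (f.get('html') or '')]


# Hand-written stand-in for response.data; real output has more keys.
sample = {
    'html': '<html><iframe src="a"></iframe></html>',
    'childFrames': [
        {'url': 'http://example.com/a',
         'html': '<html><iframe src="b"></iframe></html>',
         'childFrames': [
             {'url': 'http://example.com/b',
              'html': '<html><div id="ad">banner</div></html>',
              'childFrames': []},
         ]},
    ],
}

ad_frames = find_frames(sample, 'id="ad"')
print([f['url'] for f in ad_frames])
```

Inside parse_result, one could call find_frames(response.data, ...) and then build a parsel.Selector from each matching frame's 'html', as in the answer's code.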
    

    The /execute endpoint doesn't support fetching iframe content as of Splash 2.3.3.
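
For the first option in the answer (fetching the iframe by its src), a rough sketch of collecting the iframe URLs from a page is shown here. It uses only the standard library so it runs without Scrapy. IframeSrcParser, iframe_urls, and the sample page are hypothetical names for illustration. In a spider, response.xpath('//iframe/@src').getall() together with response.urljoin would likely do the same job, with one SplashRequest yielded per URL.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class IframeSrcParser(HTMLParser):
    """Collect the src attribute of every <iframe> start tag."""

    def __init__(self):
        super().__init__()
        self.srcs = []

    def handle_starttag(self, tag, attrs):
        if tag == 'iframe':
            src = dict(attrs).get('src')
            if src:  # skip iframes with a missing or empty src
                self.srcs.append(src)


def iframe_urls(page_url, html):
    """Return absolute URLs for all iframes with a src in `html`."""
    parser = IframeSrcParser()
    parser.feed(html)
    return [urljoin(page_url, src) for src in parser.srcs]


page = (
    '<html><body>'
    '<iframe src="/ads/banner.html"></iframe>'
    '<iframe src="https://ads.example.net/slot?id=1"></iframe>'
    '<iframe></iframe>'
    '</body></html>'
)
print(iframe_urls('http://example.com/news/', page))
```

This only helps when the iframe has a usable src. Iframes that are filled in by script have nothing to fetch directly, which is where the render.json approach with iframes=1 is the better fit.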
