Scrapy + Splash: scraping element inside inner html

长情又很酷 2020-12-18 16:44

I'm using Scrapy + Splash to crawl webpages and extract data from Google ad banners and other ads, and I'm having difficulty getting Scrapy to follow the XPath into

2 Answers
  • 2020-12-18 16:58

    The problem is that iframe content is not returned as part of the page HTML. You can either fetch the iframe content directly (by its src) or use the render.json endpoint with the iframes=1 option:

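    import parsel
    from scrapy_splash import SplashRequest

    # ...
        # Ask Splash for a JSON result that also contains the HTML of each iframe
        yield SplashRequest(url, self.parse_result, endpoint='render.json',
                            args={'html': 1, 'iframes': 1})

    def parse_result(self, response):
        # response.data is the decoded JSON; take the first child frame's HTML
        iframe_html = response.data['childFrames'][0]['html']
        sel = parsel.Selector(iframe_html)
        item = {
            'my_field': sel.xpath(...),
            # ...
        }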
    

    The /execute endpoint doesn't support fetching iframe content as of Splash 2.3.3.
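
Ad banners are often nested more than one iframe deep, so taking only childFrames[0] may miss the frame you want. Below is a minimal sketch of walking the whole frame tree. It assumes each entry in childFrames carries its own 'html' and 'childFrames' keys, as in the render.json result above; iter_frame_html is a hypothetical helper name, and sample is a hand-written stand-in for response.data, not real Splash output.

```python
def iter_frame_html(frame):
    """Yield the 'html' of every frame nested under `frame`, depth first.

    Assumes each frame dict may have 'html' and 'childFrames' keys.
    """
    for child in frame.get("childFrames", []):
        html = child.get("html")
        if html:  # skip frames with empty or missing HTML
            yield html
        # Descend into frames nested inside this one
        yield from iter_frame_html(child)


# Hand-written stand-in for response.data (an assumption, not real output)
sample = {
    "html": "<html><iframe></iframe></html>",
    "childFrames": [
        {
            "html": "<html>outer ad</html>",
            "childFrames": [
                {"html": "<html>inner ad</html>", "childFrames": []},
            ],
        },
        {"html": "", "childFrames": []},
    ],
}

frames = list(iter_frame_html(sample))
```

In a spider callback you would pass response.data instead of sample and build a parsel.Selector from each yielded string.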

  • 2020-12-18 17:19

    An alternative way to deal with iframes (here response is the main page):

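        urls = response.css('iframe::attr(src)').extract()
        for url in urls:
            # parse_iframe is a placeholder name for your own callback;
            # response.follow resolves relative src values against the page URL
            yield response.follow(url, callback=self.parse_iframe)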
    

    This way the iframe is parsed as if it were a normal page. However, at the moment I cannot send the main page's cookies along with the request for the iframe's HTML, and that's a problem.
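
When following src values as described above, some of them are usually not worth requesting: placeholders such as about:blank or javascript: URLs, empty attributes, and duplicates. The sketch below shows one possible way to resolve and filter them using only the standard library. usable_iframe_urls is a hypothetical helper name, and the example URLs are made up for illustration.

```python
from urllib.parse import urljoin, urlparse


def usable_iframe_urls(base_url, srcs):
    """Resolve iframe src values against base_url and keep unique http(s) URLs."""
    seen = set()
    result = []
    for src in srcs:
        src = (src or "").strip()
        if not src:
            continue  # iframe without a usable src attribute
        absolute = urljoin(base_url, src)
        # Drop about:blank, javascript:, data: and similar non-fetchable schemes
        if urlparse(absolute).scheme not in ("http", "https"):
            continue
        if absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result


# Made-up example values, for illustration only
urls = usable_iframe_urls(
    "https://example.com/page/index.html",
    ["//ads.example.net/banner?id=1", "about:blank", "", "frames/ad.html", "frames/ad.html"],
)
```

In the spider, base_url would be response.url and srcs the list returned by response.css('iframe::attr(src)').extract().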
