How do I merge results from a target page into the current page in Scrapy?

难免孤独 2020-12-13 04:48

I need an example in Scrapy of how to get a link from one page, follow that link, get more information from the linked page, and merge it back with some data from the first page.

4 Answers
  • 2020-12-13 05:20

    Passing extra data along with request objects through meta is described in this part of the documentation:

    http://readthedocs.org/docs/scrapy/en/latest/topics/request-response.html#passing-additional-data-to-callback-functions

    This question is also related to: Scrapy: Follow link to get additional Item data?

  • 2020-12-13 05:27

    Partially fill your item on the first page, and then put it in your request's meta. When the callback for the next page is called, it can take the partially filled item, put more data into it, and then return it.
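To make the hand-off described above concrete without running a crawl, here is a minimal sketch in plain Python. `FakeRequest`, `FakeResponse`, and `run` are hypothetical stand-ins written for this illustration; they are not Scrapy classes and only imitate the one behaviour that matters here, namely that a response exposes the meta dictionary of the request that produced it. A plain dict stands in for the item.

```python
# Toy stand-ins for illustration only; these are NOT Scrapy's classes.
class FakeRequest:
    def __init__(self, url, callback, meta=None):
        self.url = url
        self.callback = callback
        self.meta = dict(meta or {})


class FakeResponse:
    def __init__(self, request):
        self.url = request.url
        # The response exposes the meta dict of the request that produced it.
        self.meta = request.meta


def parse_page1(response):
    # Partially fill the item with data from the first page.
    item = {"main_url": response.url}
    # Hand the item to the next callback through the request's meta.
    return FakeRequest("http://www.example.com/some_page.html",
                       callback=parse_page2, meta={"item": item})


def parse_page2(response):
    # Pick the partially filled item back up and complete it.
    item = response.meta["item"]
    item["other_url"] = response.url
    return item


def run(start_url, callback):
    """Keep following requests until a callback returns a finished item."""
    result = FakeRequest(start_url, callback)
    while isinstance(result, FakeRequest):
        result = result.callback(FakeResponse(result))
    return result


merged = run("http://www.example.com/main_page.html", parse_page1)
print(merged)
```

The merged item ends up holding fields from both pages. In a real spider the two callbacks would be methods on the spider class, and Scrapy's scheduler would play the role of `run`.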

  • 2020-12-13 05:31

    An example from the Scrapy documentation:

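    def parse_page1(self, response):
        # Start the item with data from the first page.
        item = MyItem()
        item['main_url'] = response.url
        request = scrapy.Request("http://www.example.com/some_page.html",
                                 callback=self.parse_page2)
        # Attach the partially filled item so the next callback can read it.
        request.meta['item'] = item
        return request

    def parse_page2(self, response):
        # Retrieve the item passed along from parse_page1 and complete it.
        item = response.meta['item']
        item['other_url'] = response.url
        return item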
    
  • 2020-12-13 05:37

    A slightly expanded illustration of the Scrapy documentation code:

    def start_requests(self):
        yield scrapy.Request("http://www.example.com/main_page.html",
                             callback=self.parse_page1)

    def parse_page1(self, response):
        item = MyItem()
        item['main_url'] = response.url  # http://www.example.com/main_page.html
        request = scrapy.Request("http://www.example.com/some_page.html",
                                 callback=self.parse_page2)
        request.meta['my_meta_item'] = item  # pass the item in the meta dictionary
        # Alternatively, pass meta directly to the constructor:
        # request = scrapy.Request("http://www.example.com/some_page.html",
        #                          meta={'my_meta_item': item},
        #                          callback=self.parse_page2)
        return request

    def parse_page2(self, response):
        item = response.meta['my_meta_item']
        item['other_url'] = response.url  # http://www.example.com/some_page.html
        return item
    