Scraping data without having to explicitly define each field to be scraped

情深已故 2021-02-04 15:06

I want to scrape a page of data (using the Python Scrapy library) without having to define each individual field on the page. Instead I want to dynamically generate fields using

4 answers
  •  Happy的楠姐
    2021-02-04 15:41

    Update:

    The old method didn't work with item loaders and was complicating things unnecessarily. Here's a better way of achieving a flexible item:

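    from scrapy.item import BaseItem  # deprecated in newer Scrapy releases, which also accept plain dicts as items
    from scrapy.loader import ItemLoader  # was scrapy.contrib.loader in old Scrapy versions
    
    class FlexibleItem(dict, BaseItem):
        # A dict accepts any key, so no fields have to be declared up front.
        pass
    
    if __name__ == '__main__':
        item = FlexibleItem()
        loader = ItemLoader(item)
    
        loader.add_value('foo', 'bar')
        loader.add_value('baz', 123)
        loader.add_value('baz', 'test')  # a second value for the same field is appended
        loader.add_value(None, {'abc': 'xyz', 'foo': 555})  # None as field name: the dict supplies the field/value pairs
    
        print(loader.load_item())
    
        if 'meow' not in item:
            print("it's not a cat!")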
    

    Result:

    {'foo': ['bar', 555], 'baz': [123, 'test'], 'abc': ['xyz']}
    it's not a cat!
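
    To make the result above easier to read, here is a minimal sketch of how the values accumulate. It is plain Python, not Scrapy's implementation: ValueCollector and its methods are hypothetical names chosen for illustration, and the sketch only assumes that each field collects a list of values and that passing None as the field name spreads a dict across fields.

```python
from collections import defaultdict


class ValueCollector:
    """Toy collector: every field name maps to a list of the values added to it."""

    def __init__(self):
        self._values = defaultdict(list)

    def add_value(self, field_name, value):
        if field_name is None:
            # No field name given: treat `value` as a mapping of field names to values.
            for name, single in value.items():
                self._values[name].append(single)
        else:
            self._values[field_name].append(value)

    def load(self):
        # Return a plain dict so the result prints like an ordinary item.
        return dict(self._values)


collector = ValueCollector()
collector.add_value("foo", "bar")
collector.add_value("baz", 123)
collector.add_value("baz", "test")
collector.add_value(None, {"abc": "xyz", "foo": 555})
result = collector.load()
print(result)
```

    Under those assumptions, 'foo' ends up with two values because the dict passed in the last call also contains a 'foo' key.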
    

    Old solution:

    Okay, I've found a solution. It's a bit of a hack, but it works.

    A Scrapy Item stores its field names in a dict called fields. When you add data to an Item, it checks whether the field exists, and if it doesn't, it raises an error:

    def __setitem__(self, key, value):
        if key in self.fields:
            self._values[key] = value
        else:
            raise KeyError("%s does not support field: %s" %\
                  (self.__class__.__name__, key))
    

    What you can do is override this __setitem__ function to be less strict:

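    from scrapy.item import Item, Field
    
    class FlexItem(Item):
        def __setitem__(self, key, value):
            # Register unknown fields on the fly instead of raising KeyError.
            if key not in self.fields:
                self.fields[key] = Field()
    
            self._values[key] = value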
    

    And there you go.

    Now when you add data to an Item, if the item doesn't have that field defined, it will be added, and then the data will be added as normal.
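
    The difference between the strict and the relaxed behaviour can be tried out without Scrapy installed. The following is only a sketch under stated assumptions: StrictItem is a hypothetical stand-in for a declared-fields item (it is not Scrapy's Item class), the "title" field is made up, and try_set is a helper added for the demonstration.

```python
class StrictItem:
    """Stand-in for a declared-fields item: only known field names are accepted."""

    fields = {"title": None}  # hypothetical declared field

    def __init__(self):
        self._values = {}

    def __setitem__(self, key, value):
        if key not in self.fields:
            raise KeyError(f"{type(self).__name__} does not support field: {key}")
        self._values[key] = value

    def __getitem__(self, key):
        return self._values[key]


class FlexItem(StrictItem):
    """Registers unknown field names on first assignment instead of raising."""

    def __init__(self):
        super().__init__()
        # Copy so new fields stay on this instance rather than on the class.
        self.fields = dict(type(self).fields)

    def __setitem__(self, key, value):
        if key not in self.fields:
            self.fields[key] = None
        self._values[key] = value


def try_set(item, key, value):
    """Return True if the assignment was accepted, False on KeyError."""
    try:
        item[key] = value
    except KeyError:
        return False
    return True


print(try_set(StrictItem(), "price", 1))  # undeclared field is rejected
print(try_set(FlexItem(), "price", 1))    # undeclared field is added on the fly
```

    Copying the fields dict per instance is a choice made in this sketch so that one item's ad hoc fields do not leak into other items.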
