Scrapy: use Item and save data in a JSON file

Submitted by 这一生的挚爱 on 2021-01-29 08:01:49

Question


I want to use a Scrapy Item, manipulate the data, and save everything to a JSON file (using the JSON file like a DB).

# Spider Class

import json

import scrapy


class Spider(scrapy.Spider):
    name = 'productpage'
    start_urls = ['https://www.productpage.com']

    def parse(self, response):
        for product in response.css('article'):
            link = product.css('a::attr(href)').get()
            id = link.split('/')[-1]
            title = product.css('a > span::attr(content)').get()
            # build the Product with an empty size list;
            # sizes are filled in later by parse_product
            product = Product(self.name, id, title, [], link)
            yield scrapy.Request('{}.json'.format(link),
                                 callback=self.parse_product,
                                 meta={'product': product})

        # re-request the main page so the spider keeps polling for updates
        yield scrapy.Request(url=response.url, callback=self.parse, dont_filter=True)

    def parse_product(self, response):
        product = response.meta['product']
        for size in json.loads(response.text):
            product.size.append(size['name'])

        if self.storage.update(product.__dict__):
            product.send('url')

# Storage Class

class Storage:

    def __init__(self, name):
        self.name = name
        self.path = '{}.json'.format(self.name)
        self.load()  # load the JSON database

    def update(self, new_item):
        # .... do things and update data ...
        return True

# Product Class

class Product:

    def __init__(self, name, id, title, size, link):
        self.name = name
        self.id = id
        self.title = title
        self.size = size  # list of size names, filled in by parse_product
        self.link = link

    def send(self, url):
        return  # send a notification...


The Spider class searches for products on the main page of start_urls, then parses each product page to also capture the sizes. Finally, it checks for updates with self.storage.update(product.__dict__) and, if that returns true, sends a notification.

How can I implement Item in my code? I thought I could build it into the Product class, but then I can't include the send method...


Answer 1:


You should define the Item you want and yield it after parsing.

Then run the command: scrapy crawl [spider] -o xx.json

PS: Scrapy supports exporting to a JSON file out of the box.
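
As a minimal sketch of that approach (the ProductItem name and its fields are assumptions based on the question's Product class):

import json

import scrapy


class ProductItem(scrapy.Item):
    name = scrapy.Field()
    id = scrapy.Field()
    title = scrapy.Field()
    size = scrapy.Field()
    link = scrapy.Field()

The spider then builds a ProductItem in parse() and yields it from parse_product instead of calling methods on it:

    def parse_product(self, response):
        item = response.meta['product']  # ProductItem created in parse()
        item['size'] = [s['name'] for s in json.loads(response.text)]
        yield item

Every item yielded this way ends up in xx.json via Scrapy's built-in feed exporter.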




Answer 2:


@Jadian's answer will get you a file with JSON in it, but not quite DB-like access to it. To do this properly from a design standpoint, I would follow the instructions below. You don't have to use Mongo either; there are plenty of other NoSQL DBs available that store JSON.

What I would recommend in this situation is that you build out the items properly using scrapy.Item classes and then store them in MongoDB. You will need to assign a primary key (PK) to each item, but Mongo is basically made to be a non-relational JSON store. What you would do then is create an item pipeline that checks the item's PK: if the PK is found and no details have changed, raise DropItem(); otherwise, update/store the new data in MongoDB. You could probably even pipe into the JSON exporter if you wanted to, but I think just dumping the Python object as JSON into Mongo is the way to go; Mongo will then present you with JSON to work with on the front end.

I hope this answer makes sense. From a design point of view it is a much easier solution, since Mongo is basically a non-relational data store based on JSON, and you will be keeping your item pipeline logic in its own area instead of cluttering your spider with it.

I would provide a code sample, but most of mine use an ORM for a SQL DB. Mongo is actually easier to use than that...
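
For what it's worth, here is a minimal sketch of such a pipeline, assuming pymongo is installed; the connection URI, the database and collection names, and the MongoUpdatePipeline class name are all placeholders:

from pymongo import MongoClient
from scrapy.exceptions import DropItem


class MongoUpdatePipeline:

    def open_spider(self, spider):
        self.client = MongoClient('mongodb://localhost:27017')
        self.collection = self.client['products_db']['products']

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        data = dict(item)
        data['_id'] = data.pop('id')  # use the product id as the PK
        stored = self.collection.find_one({'_id': data['_id']})
        if stored == data:
            # PK found and no details changed: discard the item
            raise DropItem('Unchanged product: {}'.format(data['_id']))
        # insert the document, or replace the stored version if it changed
        self.collection.replace_one({'_id': data['_id']}, data, upsert=True)
        return item

Enable it by adding the pipeline class to ITEM_PIPELINES in settings.py (the exact module path depends on your project layout).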



Source: https://stackoverflow.com/questions/56005678/scrapy-use-item-and-save-data-in-a-json-file
