Scrapy: overwrite JSON file instead of appending to it

Asked by 长情又很酷, 2020-12-28 10:57

Is there a way to make Scrapy overwrite the output file instead of appending to it?

Example:

    scrapy crawl myspider -o "/path/to/json/my.json" -t json

6 Answers
  • 2020-12-28 11:18

    There is a flag that makes Scrapy overwrite the output file instead of appending to it: pass the file via the -O option instead of -o:

    scrapy crawl myspider -O /path/to/json/my.json
    

    More information:

    $ scrapy crawl --help
    Usage
    =====
      scrapy crawl [options] <spider>
    
    Run a spider
    
    Options
    =======
    --help, -h              show this help message and exit
    -a NAME=VALUE           set spider argument (may be repeated)
    --output=FILE, -o FILE  append scraped items to the end of FILE (use - for
                            stdout)
    --overwrite-output=FILE, -O FILE
                            dump scraped items into FILE, overwriting any existing
                            file
    --output-format=FORMAT, -t FORMAT
                            format to use for dumping items
    
    Global Options
    --------------
    --logfile=FILE          log file. if omitted stderr will be used
    --loglevel=LEVEL, -L LEVEL
                            log level (default: DEBUG)
    --nolog                 disable logging completely
    --profile=FILE          write python cProfile stats to FILE
    --pidfile=FILE          write process ID to FILE
    --set=NAME=VALUE, -s NAME=VALUE
                            set/override setting (may be repeated)
    --pdb                   enable pdb on failure
    
  • 2020-12-28 11:19

    To overcome this problem I created a subclass of scrapy.extensions.feedexport.FileFeedStorage in my project directory.

    This is my customexport.py:

    """Custom Feed Exports extension."""
    import os
    
    from scrapy.extensions.feedexport import FileFeedStorage
    
    
    class CustomFileFeedStorage(FileFeedStorage):
        """
        A File Feed Storage extension that overwrites existing files.
    
        See: https://github.com/scrapy/scrapy/blob/master/scrapy/extensions/feedexport.py#L79
        """
    
        def open(self, spider):
            """Return the opened file."""
            dirname = os.path.dirname(self.path)
            if dirname and not os.path.exists(dirname):
                os.makedirs(dirname)
            # changed from 'ab' to 'wb' to truncate file when it exists
            return open(self.path, 'wb')
    

    Then I added the following to my settings.py (see: https://doc.scrapy.org/en/1.2/topics/feed-exports.html#feed-storages-base):

    FEED_STORAGES_BASE = {
        '': 'myproject.customexport.CustomFileFeedStorage',
        'file': 'myproject.customexport.CustomFileFeedStorage',
    }
    

    Now, every time a feed is written to a file, the file is overwritten instead of appended to.
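
    For example (spider and file name as in the question), running:

    scrapy crawl myspider -o /path/to/json/my.json

    now truncates my.json on each run instead of appending to it.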

  • 2020-12-28 11:20
    This writes the items to stdout (-o -) and lets the shell's output redirection truncate the file on each run:

    scrapy crawl myspider -t json --nolog -o - > "/path/to/json/my.json"
    
  • 2020-12-28 11:20

    Or you can add:

    import os

    if "filename.json" in os.listdir('..'):
        os.remove('../filename.json')
    

    at the beginning of your code. Very easy.

  • 2020-12-28 11:24

    This is an old, well-known "problem" of Scrapy. Every time you start a crawl and do not want to keep the results of previous runs, you have to delete the output file. The idea behind this is that you may want to crawl different sites, or the same site at different time frames, so appending protects you from accidentally losing already-gathered results, which could be bad.

    A solution is to write your own item pipeline that opens the target file in 'w' mode instead of 'a'.

    To see how to write such a pipeline, look at the docs: http://doc.scrapy.org/en/latest/topics/item-pipeline.html#writing-your-own-item-pipeline (specifically, for JSON exports: http://doc.scrapy.org/en/latest/topics/item-pipeline.html#write-items-to-a-json-file). A minimal sketch follows.
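
    A minimal sketch of such a pipeline, adapted from the docs' JSON example; the file name items.jl and the class name are placeholders, and it writes JSON lines (one object per line) rather than a single JSON array:

    import json

    class JsonWriterPipeline:
        """Write items to a JSON-lines file, truncating it on each run."""

        def open_spider(self, spider):
            # 'w' truncates the file when the spider starts,
            # unlike the feed exporter's default append behaviour
            self.file = open('items.jl', 'w')

        def close_spider(self, spider):
            self.file.close()

        def process_item(self, item, spider):
            self.file.write(json.dumps(dict(item)) + "\n")
            return item

    Enable it in settings.py, e.g. ITEM_PIPELINES = {'myproject.pipelines.JsonWriterPipeline': 300} (the module path is an assumption about your project layout).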

  • 2020-12-28 11:44

    Since the accepted answer gave me problems with invalid JSON, this could work:

    find "/path/to/json/" -name "my.json" -exec rm {} \; && scrapy crawl myspider -t json -o "/path/to/json/my.json"
    