Is there a way to overwrite the output file instead of appending to it?
Example:
scrapy crawl myspider -o "/path/to/json/my.json" -t json
scrapy cra
There is a flag that makes Scrapy overwrite the output file: pass the file reference via the -O option instead of -o. So you can use this instead:
scrapy crawl myspider -O /path/to/json/my.json
More information:
$ scrapy crawl --help
Usage
=====
  scrapy crawl [options] <spider>

Run a spider

Options
=======
--help, -h              show this help message and exit
-a NAME=VALUE           set spider argument (may be repeated)
--output=FILE, -o FILE  append scraped items to the end of FILE (use - for
                        stdout)
--overwrite-output=FILE, -O FILE
                        dump scraped items into FILE, overwriting any existing
                        file
--output-format=FORMAT, -t FORMAT
                        format to use for dumping items

Global Options
--------------
--logfile=FILE          log file. if omitted stderr will be used
--loglevel=LEVEL, -L LEVEL
                        log level (default: DEBUG)
--nolog                 disable logging completely
--profile=FILE          write python cProfile stats to FILE
--pidfile=FILE          write process ID to FILE
--set=NAME=VALUE, -s NAME=VALUE
                        set/override setting (may be repeated)
--pdb                   enable pdb on failure
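If you prefer to configure this in the project instead of on the command line, the same overwrite behaviour can also be expressed through the FEEDS setting. A minimal sketch, assuming Scrapy 2.4+ (the release that introduced both -O and the overwrite key) and the path from the question:

# settings.py
FEEDS = {
    '/path/to/json/my.json': {
        'format': 'json',      # same effect as -t json
        'overwrite': True,     # truncate the file instead of appending
    },
}

With this in place, a plain scrapy crawl myspider writes a fresh file on every run.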
To overcome this problem I created a subclass of scrapy.extensions.feedexport.FileFeedStorage in the myproject directory. This is my customexport.py:
"""Custom Feed Exports extension."""
import os
from scrapy.extensions.feedexport import FileFeedStorage
class CustomFileFeedStorage(FileFeedStorage):
"""
A File Feed Storage extension that overwrites existing files.
See: https://github.com/scrapy/scrapy/blob/master/scrapy/extensions/feedexport.py#L79
"""
def open(self, spider):
"""Return the opened file."""
dirname = os.path.dirname(self.path)
if dirname and not os.path.exists(dirname):
os.makedirs(dirname)
# changed from 'ab' to 'wb' to truncate file when it exists
return open(self.path, 'wb')
Then I added the following to my settings.py
(see: https://doc.scrapy.org/en/1.2/topics/feed-exports.html#feed-storages-base):
FEED_STORAGES_BASE = {
    '': 'myproject.customexport.CustomFileFeedStorage',
    'file': 'myproject.customexport.CustomFileFeedStorage',
}
Because of this, the output file is now overwritten on every run instead of appended to.
You can also write the items to stdout with -o - and let the shell redirection create or truncate the target file:
scrapy crawl myspider -t json --nolog -o - > "/path/to/json/my.json"
Or you can add the following at the beginning of your code:
import os

if "filename.json" in os.listdir('..'):
    os.remove('../filename.json')
Very easy.
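For context, "the beginning of your code" can simply be the top of the spider module, so the stale file is removed once when the spider is loaded. A minimal sketch, assuming a spider named myspider and keeping the placeholder path from above:

# sketch only: remove the previous output when the spider module is imported
# ('../filename.json' is the placeholder path from the snippet above)
import os

import scrapy

if os.path.exists('../filename.json'):
    os.remove('../filename.json')


class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['https://example.com']

    def parse(self, response):
        yield {'url': response.url}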
This is an old, well-known "problem" of Scrapy: every time you start a crawl and do not want to keep the results of previous runs, you have to delete the output file yourself. The idea behind this is that you might crawl different sites, or the same site at different points in time, and overwriting by default could make you accidentally lose results you have already gathered. Which could be bad.
A solution would be to write your own item pipeline where you open the target file with mode 'w' instead of 'a'.
To see how to write such a pipeline, look at the docs: http://doc.scrapy.org/en/latest/topics/item-pipeline.html#writing-your-own-item-pipeline (specifically for JSON exports: http://doc.scrapy.org/en/latest/topics/item-pipeline.html#write-items-to-a-json-file)
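A minimal sketch of such a pipeline, along the lines of the JsonWriterPipeline from those docs (the items.jl file name and the module path are placeholders; it writes one JSON object per line):

import json


class JsonWriterPipeline:
    """Write items to a fresh file on every crawl: 'w' truncates, 'a' would append."""

    def open_spider(self, spider):
        self.file = open('items.jl', 'w')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        # one JSON object per line (JSON Lines), as in the docs example
        self.file.write(json.dumps(dict(item)) + '\n')
        return item

Enable it in settings.py with something like ITEM_PIPELINES = {'myproject.pipelines.JsonWriterPipeline': 300}.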
Since the accepted answer gave me problems with invalid JSON, this could work:
find "/path/to/json/" -name "my.json" -exec rm {} \; && scrapy crawl myspider -t json -o "/path/to/json/my.json"