Is there a way to overwrite the output file instead of appending to it?
Example:
scrapy crawl myspider -o "/path/to/json/my.json" -t json
scrapy cra
There is a flag that makes Scrapy overwrite the output file: pass the file reference via the -O option instead of -o. So you can use this instead:
scrapy crawl myspider -O /path/to/json/my.json
More information:
$ scrapy crawl --help
Usage
=====
  scrapy crawl [options] <spider>

Run a spider

Options
=======
--help, -h              show this help message and exit
-a NAME=VALUE           set spider argument (may be repeated)
--output=FILE, -o FILE  append scraped items to the end of FILE (use - for
                        stdout)
--overwrite-output=FILE, -O FILE
                        dump scraped items into FILE, overwriting any existing
                        file
--output-format=FORMAT, -t FORMAT
                        format to use for dumping items

Global Options
--------------
--logfile=FILE          log file. if omitted stderr will be used
--loglevel=LEVEL, -L LEVEL
                        log level (default: DEBUG)
--nolog                 disable logging completely
--profile=FILE          write python cProfile stats to FILE
--pidfile=FILE          write process ID to FILE
--set=NAME=VALUE, -s NAME=VALUE
                        set/override setting (may be repeated)
--pdb                   enable pdb on failure
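If you prefer to configure this in the project instead of on the command line, the same overwrite behaviour can also be expressed through the FEEDS setting. A minimal sketch, assuming Scrapy 2.4+ (the release that introduced both -O and the overwrite key) and the path from the question:

# settings.py
FEEDS = {
    '/path/to/json/my.json': {
        'format': 'json',      # same effect as -t json
        'overwrite': True,     # truncate the file instead of appending
    },
}

With this in place, a plain scrapy crawl myspider writes a fresh file on every run.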
To overcome this problem I created a subclass of scrapy.extensions.feedexport.FileFeedStorage in the myproject directory. This is my customexport.py:
"""Custom Feed Exports extension."""
import os
from scrapy.extensions.feedexport import FileFeedStorage
class CustomFileFeedStorage(FileFeedStorage):
"""
A File Feed Storage extension that overwrites existing files.
See: https://github.com/scrapy/scrapy/blob/master/scrapy/extensions/feedexport.py#L79
"""
def open(self, spider):
"""Return the opened file."""
dirname = os.path.dirname(self.path)
if dirname and not os.path.exists(dirname):
os.makedirs(dirname)
# changed from 'ab' to 'wb' to truncate file when it exists
return open(self.path, 'wb')
Then I added the following to my settings.py
(see: https://doc.scrapy.org/en/1.2/topics/feed-exports.html#feed-storages-base):
FEED_STORAGES_BASE = {
    '': 'myproject.customexport.CustomFileFeedStorage',
    'file': 'myproject.customexport.CustomFileFeedStorage',
}
Because of this, the output file is now overwritten on every run instead of appended to.
You can also write the items to stdout with -o - and let the shell redirection create or truncate the target file:
scrapy crawl myspider -t json --nolog -o - > "/path/to/json/my.json"
Or you can add the following at the beginning of your code:
import os

if "filename.json" in os.listdir('..'):
    os.remove('../filename.json')
Very easy.
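For context, "the beginning of your code" can simply be the top of the spider module, so the stale file is removed once when the spider is loaded. A minimal sketch, assuming a spider named myspider and keeping the placeholder path from above:

# sketch only: remove the previous output when the spider module is imported
# ('../filename.json' is the placeholder path from the snippet above)
import os

import scrapy

if os.path.exists('../filename.json'):
    os.remove('../filename.json')


class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['https://example.com']

    def parse(self, response):
        yield {'url': response.url}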
This is an old, well-known "problem" of Scrapy: every time you start a crawl and do not want to keep the results of previous runs, you have to delete the output file yourself. The idea behind this is that you might crawl different sites, or the same site at different points in time, and overwriting by default could make you accidentally lose results you have already gathered. Which could be bad.
A solution would be to write your own item pipeline where you open the target file with mode 'w' instead of 'a'.
To see how to write such a pipeline, look at the docs: http://doc.scrapy.org/en/latest/topics/item-pipeline.html#writing-your-own-item-pipeline (specifically for JSON exports: http://doc.scrapy.org/en/latest/topics/item-pipeline.html#write-items-to-a-json-file)
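A minimal sketch of such a pipeline, along the lines of the JsonWriterPipeline from those docs (the items.jl file name and the module path are placeholders; it writes one JSON object per line):

import json


class JsonWriterPipeline:
    """Write items to a fresh file on every crawl: 'w' truncates, 'a' would append."""

    def open_spider(self, spider):
        self.file = open('items.jl', 'w')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        # one JSON object per line (JSON Lines), as in the docs example
        self.file.write(json.dumps(dict(item)) + '\n')
        return item

Enable it in settings.py with something like ITEM_PIPELINES = {'myproject.pipelines.JsonWriterPipeline': 300}.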
Since the accepted answer gave me problems with invalid JSON, this could work:
find "/path/to/json/" -name "my.json" -exec rm {} \; && scrapy crawl myspider -t json -o "/path/to/json/my.json"