Question
I implemented a pipeline based on an example I found here. I'm trying to export all the items from multiple spiders (launched from a single script rather than the command line) into a single CSV file.
However, around 10% of the data shown in the shell never makes it into the CSV. Is this because the spiders are writing at the same time?
How can I fix my script so that all the data ends up in a single CSV? I'm using CrawlerProcess to launch the spiders.
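For context, this is roughly how I launch them; FooSpider, BarSpider and the myproject paths are placeholders for my real project:

from scrapy.crawler import CrawlerProcess

from myproject.spiders import FooSpider, BarSpider  # placeholders for my actual spiders

process = CrawlerProcess(settings={
    'ITEM_PIPELINES': {'myproject.pipelines.ScrapybotPipeline': 300},
})
process.crawl(FooSpider)
process.crawl(BarSpider)
process.start()  # blocks until every spider has finished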
from scrapy import signals
from scrapy.exporters import CsvItemExporter  # scrapy.contrib.exporter in old Scrapy versions


class ScrapybotPipeline(object):

    def __init__(self):
        self.files = {}

    @classmethod
    def from_crawler(cls, crawler):
        pipeline = cls()
        crawler.signals.connect(pipeline.spider_opened, signals.spider_opened)
        crawler.signals.connect(pipeline.spider_closed, signals.spider_closed)
        return pipeline

    def spider_opened(self, spider):
        file = open('result_extract.csv', 'w+b')
        self.files[spider] = file
        self.exporter = CsvItemExporter(file)
        self.exporter.fields_to_export = ['ean', 'price', 'desc', 'company']
        self.exporter.start_exporting()

    def spider_closed(self, spider):
        self.exporter.finish_exporting()
        file = self.files.pop(spider)
        file.close()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item
Answer 1:
I gather from your description that you're handling multiple spiders. Just to confirm: are you running them at the same time, within the same crawl process?

According to the code you shared, you keep one output file object per spider, but every spider opens the same path. In spider_opened:

file = open('result_extract.csv', 'w+b')
self.files[spider] = file

Opening with mode 'w+b' truncates the file, so each spider that starts wipes out whatever the other spiders have already written. That is the root cause of the missing data.
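You can reproduce the effect outside Scrapy. This is a simplified sequential sketch (demo.csv is just a scratch file); in your run the opens overlap, but the truncation is the same:

with open('demo.csv', 'w+b') as f:    # first "spider" opens and writes
    f.write(b'ean,price\n111,9.99\n')
with open('demo.csv', 'w+b') as f:    # second "spider" opens the same path...
    f.write(b'222,4.50\n')            # ...and the earlier rows are already gone
print(open('demo.csv', 'rb').read())  # b'222,4.50\n'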
Since there is really only one file on the filesystem to write to, you can simply open it once. Modified version of your code:
class ScrapybotPipeline(object):

    def __init__(self):
        self.file = None

    @classmethod
    def from_crawler(cls, crawler):
        pipeline = cls()
        crawler.signals.connect(pipeline.spider_opened, signals.spider_opened)
        crawler.signals.connect(pipeline.spider_closed, signals.spider_closed)
        return pipeline

    def spider_opened(self, spider):
        self.file = open('result_extract.csv', 'w+b')
        self.exporter = CsvItemExporter(self.file)
        self.exporter.fields_to_export = ['ean', 'price', 'desc', 'company']
        self.exporter.start_exporting()

    def spider_closed(self, spider):
        self.exporter.finish_exporting()
        self.file.close()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item
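One caveat: CrawlerProcess builds a separate pipeline instance for each crawler, and spider_opened/spider_closed fire once per spider, so this version can still reopen (and truncate) the file for each spider and close it as soon as the first spider finishes. Here is a sketch of one way to guard against that, sharing the file at class level and closing it only after the last spider finishes; the class-level attributes and the open_spiders counter are my own additions, not Scrapy API, and this relies on all crawlers running in the same reactor thread:

from scrapy import signals
from scrapy.exporters import CsvItemExporter


class SharedCsvPipeline(object):
    # Shared by all pipeline instances (CrawlerProcess creates one per crawler).
    file = None
    exporter = None
    open_spiders = 0

    @classmethod
    def from_crawler(cls, crawler):
        pipeline = cls()
        crawler.signals.connect(pipeline.spider_opened, signals.spider_opened)
        crawler.signals.connect(pipeline.spider_closed, signals.spider_closed)
        return pipeline

    def spider_opened(self, spider):
        cls = SharedCsvPipeline
        if cls.file is None:  # first spider to start creates the file once
            cls.file = open('result_extract.csv', 'w+b')
            cls.exporter = CsvItemExporter(cls.file)
            cls.exporter.fields_to_export = ['ean', 'price', 'desc', 'company']
            cls.exporter.start_exporting()
        cls.open_spiders += 1

    def spider_closed(self, spider):
        cls = SharedCsvPipeline
        cls.open_spiders -= 1
        if cls.open_spiders == 0:  # last spider to finish closes the file
            cls.exporter.finish_exporting()
            cls.file.close()
            cls.file = None

    def process_item(self, item, spider):
        SharedCsvPipeline.exporter.export_item(item)
        return item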
Source: https://stackoverflow.com/questions/53223801/missing-data-with-export-csv-for-multiple-spiders-scrapy-pipeline