Missing data with Export CSV for multiple spiders / Scrapy / Pipeline


Question


I implemented a pipeline based on some examples from around here. I'm trying to export all the information from multiple spiders (launched from a single script, not from the command line) into a single CSV file.

However, it appears that some of the data (around 10%) shown in the shell is not recorded in the CSV. Is this because the spiders are writing at the same time?

How can I fix my script so that all the data is collected in a single CSV? I'm using CrawlerProcess to launch the spiders, as sketched below.
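For reference, a minimal sketch of that kind of launcher script; the spider classes, names, and URLs below are placeholders and are not part of the question:

from scrapy import Spider
from scrapy.crawler import CrawlerProcess

class SpiderA(Spider):
    name = 'spider_a'
    start_urls = ['https://example.com/a']

    def parse(self, response):
        # placeholder item; the real extraction logic is not shown in the question
        yield {'ean': '', 'price': '', 'desc': '', 'company': ''}

class SpiderB(Spider):
    name = 'spider_b'
    start_urls = ['https://example.com/b']

    def parse(self, response):
        yield {'ean': '', 'price': '', 'desc': '', 'company': ''}

process = CrawlerProcess()
process.crawl(SpiderA)
process.crawl(SpiderB)
process.start()  # blocks until both spiders have finished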

from scrapy import signals
from scrapy.contrib.exporter import CsvItemExporter


class ScrapybotPipeline(object):

    def __init__(self):
        self.files = {}

    @classmethod
    def from_crawler(cls, crawler):
        pipeline = cls()
        crawler.signals.connect(pipeline.spider_opened, signals.spider_opened)
        crawler.signals.connect(pipeline.spider_closed, signals.spider_closed)
        return pipeline

    def spider_opened(self, spider):
        file = open('result_extract.csv', 'w+b')
        self.files[spider] = file
        self.exporter = CsvItemExporter(file)
        self.exporter.fields_to_export = ['ean', 'price', 'desc', 'company']
        self.exporter.start_exporting()

    def spider_closed(self, spider):
        self.exporter.finish_exporting()
        file = self.files.pop(spider)
        file.close()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item

Answer 1:


From your description, I understand you're handling multiple spiders. Just to confirm: are you running them at the same time (within the same crawl process)?

Judging from the code you shared, you're trying to maintain one output file object per spider, but they all write to the same path. In spider_opened:

file = open('result_extract.csv', 'w+b')
self.files[spider] = file

This is very likely the root cause of the issue: each spider that starts re-opens the same path in 'w+b' mode, which truncates the file and discards whatever the other spiders' exporters have already written.
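A minimal standalone illustration of the effect (nothing Scrapy-specific, just Python file modes):

# Re-opening the same path in 'w+b' mode truncates it, losing earlier writes.
f1 = open('result_extract.csv', 'w+b')
f1.write(b'first spider header and rows\n')
f1.flush()

f2 = open('result_extract.csv', 'w+b')  # second open truncates the file to 0 bytes

f2.close()
f1.close()

print(open('result_extract.csv', 'rb').read())  # b'' -- the first spider's output is gone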

Since there is only one file (on your filesystem) to write to, you can open it just once. Here is a modified version of your code:

from scrapy import signals
from scrapy.exporters import CsvItemExporter  # scrapy.contrib.exporter on older Scrapy versions


class ScrapybotPipeline(object):

    def __init__(self):
        self.file = None

    @classmethod
    def from_crawler(cls, crawler):
        pipeline = cls()
        crawler.signals.connect(pipeline.spider_opened, signals.spider_opened)
        crawler.signals.connect(pipeline.spider_closed, signals.spider_closed)
        return pipeline

    def spider_opened(self, spider):
        # open the single shared output file once, when the spider starts
        self.file = open('result_extract.csv', 'w+b')
        self.exporter = CsvItemExporter(self.file)
        self.exporter.fields_to_export = ['ean', 'price', 'desc', 'company']
        self.exporter.start_exporting()

    def spider_closed(self, spider):
        self.exporter.finish_exporting()
        self.file.close()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item
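For completeness: the pipeline also has to be enabled in the Scrapy settings. Assuming the class lives in a module called scrapybot.pipelines (the module path here is an assumption; adjust it to your project), that looks like:

# settings.py, or the settings dict passed to CrawlerProcess
# 'scrapybot.pipelines' is a hypothetical module path
ITEM_PIPELINES = {
    'scrapybot.pipelines.ScrapybotPipeline': 300,
}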


Source: https://stackoverflow.com/questions/53223801/missing-data-with-export-csv-for-multiple-spiders-scrapy-pipeline
