How to sort the scrapy item info in customized order?

╄→гoц情女王★ 提交于 2020-08-23 07:48:27

问题


The default order in scrapy is alphabet,i have read some post to use OrderedDict to output item in customized order.
I write a spider follow the webpage.
How to get order of fields in Scrapy item

My items.py.

import scrapy
from collections import OrderedDict


class OrderedItem(scrapy.Item):
    def __init__(self, *args, **kwargs):
        self._values = OrderedDict()
        if args or kwargs:  
            for k, v in six.iteritems(dict(*args, **kwargs)):
                self[k] = v

class StockinfoItem(OrderedItem):
    name = scrapy.Field()
    phone = scrapy.Field()
    address = scrapy.Field()

The simple spider file.

import scrapy
from info.items import InfoItem

class InfoSpider(scrapy.Spider):
    name = 'Info'
    allowed_domains = ['quotes.money.163.com']
    start_urls = [ "http://quotes.money.163.com/f10/gszl_600023.html"]
    def parse(self, response):
        item = InfoItem()
        item["name"] = response.xpath('/html/body/div[2]/div[4]/table/tr[2]/td[2]/text()').extract()
        item["phone"] = response.xpath('/html/body/div[2]/div[4]/table/tr[7]/td[4]/text()').extract()
        item["address"] = response.xpath('/html/body/div[2]/div[4]/table/tr[2]/td[4]/text()').extract()
        item.items()
        yield  item

The scrapy info when to run the spider.

2019-04-25 13:45:01 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.money.163.com/f10/gszl_600023.html>
{'address': ['浙江省杭州市天目山路152号浙能大厦'],'name': ['浙能电力'],'phone': ['0571-87210223']}

Why i can't get such desired order as below?

{'name': ['浙能电力'],'phone': ['0571-87210223'],'address': ['浙江省杭州市天目山路152号浙能大厦']}

Thank for Gallaecio's advice, to add the following in settings.py.

FEED_EXPORT_FIELDS=['name','phone','address']

Execute the spider and output to csv file.

scrapy crawl  info -o  info.csv

The field order is in my customized order.

cat info.csv
name,phone,address
浙能电力,0571-87210223,浙江省杭州市天目山路152号浙能大

Look at the scrapy's debug info :

2019-04-26 00:16:38 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.money.163.com/f10/gszl_600023.html>
{'address': ['浙江省杭州市天目山路152号浙能大厦'],
 'name': ['浙能电力'],
 'phone': ['0571-87210223']}

How can i make the debug info in customized order?How to get the following debug output?

2019-04-26 00:16:38 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.money.163.com/f10/gszl_600023.html>
{'name': ['浙能电力'],
 'phone': ['0571-87210223'],
 'address': ['浙江省杭州市天目山路152号浙能大厦'],}

回答1:


Problem is in __repr__ function of Item. Originally its code is:

def __repr__(self):
    return pformat(dict(self))

So even if you convert your item to OrderedDict and expect fields to be saved in the same order, this function applies dict() to it and breaks the order.

So, I propose you to overload it in the way you like, for example:

import json

class OrderedItem(scrapy.Item):
    def __init__(self, *args, **kwargs):
        self._values = OrderedDict()
        if args or kwargs:
            for k, v in six.iteritems(dict(*args, **kwargs)):
                self[k] = v

    def __repr__(self):
        return json.dumps(OrderedDict(self), ensure_ascii = False)  # it should return some string

And now you can get this output:

2019-04-30 18:56:20 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.money.163.com/f10/gszl_600023.html>
{"name": ["\u6d59\u80fd\u7535\u529b"], "phone": ["0571-87210223"], "address": ["\u6d59\u6c5f\u7701\u676d\u5dde\u5e02\u5929\u76ee\u5c71\u8def152\u53f7\u6d59\u80fd\u5927\u53a6"]}



回答2:


you can define a custom string representation of your item

class InfoItem:
    def __repr__(self):
      return 'name: {}, phone: {}, address: {}'.format(self['name'], self.['phone'], self.['address'])



回答3:


In your spider replace item.items() with self.log(item.items()), log msg should be list of tuples in order you assigned them in your spider.

Another way is to combine answer you mentioned in your post with this answer




回答4:


The whole items.py which can output customized dubug info in cjk apperance is as below.

import scrapy
import json    
from collections import OrderedDict

class OrderedItem(scrapy.Item):
    def __init__(self, *args, **kwargs):
        self._values = OrderedDict()
        if args or kwargs:
            for k, v in six.iteritems(dict(*args, **kwargs)):
                self[k] = v

    def __repr__(self):
        return json.dumps(OrderedDict(self),ensure_ascii = False)  
        #ensure_ascii = False ,it make characters show in cjk appearance.

class StockinfoItem(OrderedItem):
    name = scrapy.Field()
    phone = scrapy.Field()
    address = scrapy.Field()


来源:https://stackoverflow.com/questions/55851125/how-to-sort-the-scrapy-item-info-in-customized-order

标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!