How to read and map CSV's multi line header rows using python

十年热恋 提交于 2021-02-04 18:30:08

问题


I have a CSV file which is downloaded from database(as it is in CSV) and now I have to parse into JSON Schema. Don't worry this link just github gist

Problem I am facing is its Multi line Header check CSV File Here

If you take notice in the file:

  1. On 1st line of CSV it has 1st line of headers then next line has all the values for those headers.

  2. On 3rd line of CSV file it has 2nd line of headers then next line has all the values for those headers.

  3. On 5th line of CSV file it has 3rd line of headers then next line has all the values for those headers.

Also you can notice the pattern here,

  • 1st line of headers hasn't any tab
  • 2nd line of headers has only one tab
  • 3rd line of headers has two tabs

This goes for all the records.

Now 1st problem is this multi line of headers. And 2nd problem is how to parse it into nested json as I have. one of the solution I have tried Create nested JSON from CSV. and noticed the 1st problem with my csv.

My look like this. Where I am only trying to parse initial fields of schema

import csv
import json


def csvParse(csvfile):
    # Open the CSV
    f = open(csvfile, 'r')
    # Change each fieldname to the appropriate field name.
    reader = csv.DictReader(f, fieldnames=("Order Ref", "Order 
Status", "Affiliate", "Source", "Agent", "Customer Name", "Customer Name", "Email 
Address", "Telephone", "Mobile", "Address 1", "Address 2", "City", "County/State",
"Postal Code", "Country", "Voucher Code", " Voucher Amount", "Order Date", "Item ID", 
"Type", "Supplier Code", "Supplier Name", "Booking Ref", "Supplier Price", "Currency", "Selling Price", "Currency", "Depart", "Arrive", "Origin", 
"Destination", "Carrier", "Flight No", "Class", "Pax Type", "Title", 
"Firstname", "Surname", "DOB", "Gender", "FOID Type"))

customer = []
data = []
# data frame names in a list
for row in reader:
    frame = {"orderRef": row["Order Ref"],
             "orderStatus": row["Order Status"],
             "affiliate": row["Affiliate"],
             "source": row["Source"],
             "customers": []}

    data.append(frame)

回答1:


This is not csv parser friendly, because the file contains multiple csv, and at least one contains 2 columns with same name, which prevents from using a DictReader.

I would first build a wrapper that allows to parse each csv fragment and kindly raises a stop iteration when it reaches a different fragment. I would use the re module to find the different headers.

Here is what the code for the wrapper could be:

class multi_csv:
    levels = [re.compile('Order Ref,Order Status,Affiliate,Source,Agent,'
                       '.*,Country,Voucher Code,Voucher Amount,Order Date'),
              re.compile('\s*,Item ID,Type,Supplier Code,Supplier Name,'
                      '.*,Arrive,Origin,Destination,Carrier,Flight No,Class,'),
              re.compile('\s*,\s*,Pax Type,Title,Firstname,Surname,DOB,Gender,'
                        'FOID Type,*')
              ]
    def __init__(self, fd):
        self.fd = fd
        self.level = 0
        self.end = False
    def __iter__(self):
        return self
    def __next__(self):
        try:
            line = next(self.fd)
        except StopIteration:
            self.end = True
            raise
        for lev, rx in enumerate(self.levels):
            if rx.match(line):
                self.level = lev
                raise StopIteration('New level')
        return line

It can then be used to build a Python object according to your Json schema:

mc = multi_csv(open(csvfile, 'r')
orders = []
while not mc.end:
    rd = csv.reader(mc)
    for row in rd:
        if mc.level == 0:
            items = []
            orders.append({
                'orderRef': int(row[0]),
                'orderStatus': row[1],
                'affiliate': row[2],
                'source': row[3],
                'agent': row[4],
                'customer': {
                    'name': row[5],
                    'email': row[6],
                    'telephone': row[7],
                    'mobile': row[8],
                    'address': {
                        'address1': row[9],
                        'address2': row[10],
                        'city': row[11],
                        'county': row[12],
                        'postCode': row[13],
                        'country': row[14],
                        },
                    },
                'voucherCode': row[15],
                'voucherAmount': int(row[16]),
                'orderDate': row[17],
                'items': items,
                })
        elif mc.level == 1:
            if len(row[1].strip()) != 0:
                legs = []
                passengers = []
                items.append({
                    'itemId': int(row[1]),
                    'type': row[2],
                    'supplierCode': row[3],
                    'supplierName': row[4],
                    'bookingReference': row[5],
                    'supplierPrice': row[6],
                    'supplierPriceCurrency': row[7],
                    'sellingPrice': row[8],
                    'sellingPriceCurrency': row[9],
                    'legs': legs,
                    'passengers': passengers,
                    })
            legs.append({
                'departureDate': row[10],
                'arrivalDate': row[11],
                'origin': row[12],
                'destination': row[13],
                'carrier': row[14],
                'flightNumber': row[15],
                'class': row[16],
                })
        else:    # mc.level == 2
            passengers.append({
                'passengerType': row[2],
                'title': row[3],
                'firstName': row[4],
                'surName': row[5],
                'dob': row[6],
                'gender': row[7],
                'foidType': row[8],
                })

With your file, it gives the expected result:

pprint.pprint(orders)

[{'affiliate': ' ',
  'agent': 'akjsd@ad.com',
  'customer': {'address': {'address1': ' ',
                           'address2': ' ',
                           'city': ' ',
                           'country': ' ',
                           'county': ' ',
                           'postCode': ' '},
               'email': 'asd@asd.com',
               'mobile': ' ',
               'name': 'Mr Kas Iasd',
               'telephone': '3342926655'},
  'items': [{'bookingReference': 'ABC123',
             'itemId': 125,
             'legs': [{'arrivalDate': 'ONEWAY',
                       'carrier': 'PK',
                       'class': 'Economy',
                       'departureDate': '12/01/2018 13:15',
                       'destination': 'LHE',
                       'flightNumber': '354',
                       'origin': 'KHI'}],
             'passengers': [{'dob': '19/09/1995',
                             'firstName': 'Aasdsa',
                             'foidType': 'None',
                             'gender': 'Male',
                             'passengerType': 'Adult',
                             'surName': 'Mas',
                             'title': 'Mr'},
                            {'dob': '07/12/1995',
                             'firstName': 'Asdad',
                             'foidType': 'None',
                             'gender': 'Male',
                             'passengerType': 'Adult',
                             'surName': 'Dasd',
                             'title': 'Mr'}],
             'sellingPrice': '5002',
             'sellingPriceCurrency': 'PKR',
             'supplierCode': 'SB',
             'supplierName': 'Sabre',
             'supplierPrice': '5002',
             'supplierPriceCurrency': 'PKR',
             'type': 'Flight'}],
  'orderDate': '11/01/2018 18:51',
  'orderRef': 1234,
  'orderStatus': 'PayOfflineConfirmedManual',
  'source': ' ',
  'voucherAmount': 0,
  'voucherCode': ' '},
 {'affiliate': ' ',
  'agent': 'asdss@asda.com',
  'customer': {'address': {'address1': ' ',
                           'address2': ' ',
                           'city': ' ',
                           'country': ' ',
                           'county': ' ',
                           'postCode': ' '},
               'email': 'ads@ads.com',
               'mobile': '3332784342',
               'name': 'Mr Asdsd Asdsd',
               'telephone': '3332324252'},
  'items': [{'bookingReference': 'FAILED',
             'itemId': 123,
             'legs': [{'arrivalDate': '18/01/2018 14:25',
                       'carrier': 'PA',
                       'class': 'Economy',
                       'departureDate': '18/01/2018 11:40',
                       'destination': 'DXB',
                       'flightNumber': '210',
                       'origin': 'KHI'},
                      {'arrivalDate': '25/01/2018 10:40',
                       'carrier': 'PA',
                       'class': 'Economy',
                       'departureDate': '25/01/2018 6:25',
                       'destination': 'LHE',
                       'flightNumber': '211',
                       'origin': 'DXB'}],
             'passengers': [{'dob': '11/08/1991',
                             'firstName': 'Asd',
                             'foidType': 'None',
                             'gender': 'Male',
                             'passengerType': 'Adult',
                             'surName': 'Azam',
                             'title': 'Mr'},
                            {'dob': '01/07/1974',
                             'firstName': 'Aziz',
                             'foidType': 'None',
                             'gender': 'Male',
                             'passengerType': 'Adult',
                             'surName': 'Asdsd',
                             'title': 'Mr'},
                            {'dob': '28/05/1995',
                             'firstName': 'mureed',
                             'foidType': 'None',
                             'gender': 'Male',
                             'passengerType': 'Adult',
                             'surName': 'ahmed',
                             'title': 'Mr'},
                            {'dob': '14/04/2012',
                             'firstName': 'abdullah',
                             'foidType': 'None',
                             'gender': 'Female',
                             'passengerType': 'Child',
                             'surName': 'Cdsd',
                             'title': 'Mr'},
                            {'dob': '17/12/1999',
                             'firstName': 'Asdsd',
                             'foidType': 'None',
                             'gender': 'Male',
                             'passengerType': 'Adult',
                             'surName': 'Ahmed',
                             'title': 'Mr'}],
             'sellingPrice': '154340',
             'sellingPriceCurrency': 'PKR',
             'supplierCode': 'PITCH',
             'supplierName': 'Kicker',
             'supplierPrice': '154340',
             'supplierPriceCurrency': 'PKR',
             'type': 'Flight'}],
  'orderDate': '11/01/2018 17:06',
  'orderRef': 1235,
  'orderStatus': 'PayOfflinePendingManualProcessing',
  'source': ' ',
  'voucherAmount': 100,
  'voucherCode': 'ABC123'},
 {'affiliate': ' ',
  'agent': 'asda@asdad.com',
  'customer': {'address': {'address1': ' ',
                           'address2': ' ',
                           'city': ' ',
                           'country': ' ',
                           'county': ' ',
                           'postCode': ' '},
               'email': 'asd@asdsd.com',
               'mobile': '3067869234',
               'name': 'Mr Asds Sdsd',
               'telephone': '3067869234'},
  'items': [{'bookingReference': ' ',
             'itemId': 124,
             'legs': [{'arrivalDate': 'ONEWAY',
                       'carrier': 'PK',
                       'class': 'Economy',
                       'departureDate': '23/01/2018 2:00',
                       'destination': 'SHJ',
                       'flightNumber': '812',
                       'origin': 'KHI'}],
             'passengers': [{'dob': '01/12/1994',
                             'firstName': 'Asds',
                             'foidType': 'Passport',
                             'gender': 'Male',
                             'passengerType': 'Adult',
                             'surName': 'raza',
                             'title': 'Mr'}],
             'sellingPrice': '20134',
             'sellingPriceCurrency': 'PKR',
             'supplierCode': 'PITCH',
             'supplierName': 'Kicker',
             'supplierPrice': '20134',
             'supplierPriceCurrency': 'PKR',
             'type': 'Flight'}],
  'orderDate': '11/01/2018 16:23',
  'orderRef': 1236,
  'orderStatus': 'PayOfflinePendingManualProcessing',
  'source': ' ',
  'voucherAmount': 0,
  'voucherCode': ' '}]



回答2:


This can be parsed with some recursion. The code below uses the blank records at the beginning of the line to determine that a sub-record needs to be parsed.

This code does not map the data into the desired schema as that should be done in a second step. Would suggest the ETL pattern, where this code would be part of the Extract step. The Transform step should be a separate routine which maps each of these records into the desired schema.

Code:

def get_records_from_csv(csv_filename):

    with open(csv_filename, 'r') as f:

        csv_file = iter(csv.reader(f))
        last_line = []

        def rows():
            """wrapper to csv iterator which allows a read of last_line"""
            while True:
                if last_line:
                    yield last_line.pop()
                else:
                    yield next(csv_file)

        def row_level(row):
            """how many empty fields on start of row"""
            for i, col in enumerate(row):
                if col.strip():
                    return i

        def get_records(level=0):
            header = next(rows())
            records = []
            for row in rows():
                this_level = row_level(row)
                if this_level == level:
                    records.append(dict(zip(header[level:], row[level:])))
                elif this_level < level:
                    break
                elif this_level == level + 1:
                    last_line.append(row)
                    record_type, sub_records = get_records(this_level)
                    if record_type:
                        records[-1].setdefault(
                            record_type, []).extend(sub_records)
                    if not level:
                        last_line.append(header)
                    break
                else:
                    sub_record = [
                        records[-1][k] for k in header[level:this_level]]
                    this_row = sub_record + row[this_level:]
                    records.append(dict(zip(header[level:], this_row)))

            return header[level], records

        record = get_records(0)[1]
        while record:
            yield record
            record = get_records(0)[1]

Test Code:

import csv
import json
print(json.dumps(list(get_records_from_csv('csvfile.csv')), indent=2))

Results:

[
  [
    {
      "Order Ref": "1234",
      "Order Status": "PayOfflineConfirmedManual",
      "Affiliate": " ",
      "Source": " ",
      "Agent": "akjsd@ad.com",
      "Customer Name": "Mr Kas Iasd",
      "Email Address": "asd@asd.com",
      "Telephone": "3342926655",
      "Mobile": " ",
      "Address 1": " ",
      "Address 2": " ",
      "City": " ",
      "County/State": " ",
      "Postal Code": " ",
      "Country": " ",
      "Voucher Code": " ",
      "Voucher Amount": "0",
      "Order Date": "11/01/2018 18:51",
      "Item ID": [
        {
          "Item ID": "125",
          "Type": "Flight",
          "Supplier Code": "SB",
          "Supplier Name": "Sabre",
          "Booking Ref": "ABC123",
          "Supplier Price": "5002",
          "Currency": "PKR",
          "Selling Price": "5002",
          "Depart": "12/01/2018 13:15",
          "Arrive": "ONEWAY",
          "Origin": "KHI",
          "Destination": "LHE",
          "Carrier": "PK",
          "Flight No": "354",
          "Class": "Economy",
          "": "",
          "Pax Type": [
            {
              "Pax Type": "Adult",
              "Title": "Mr",
              "Firstname": "Aasdsa",
              "Surname": "Mas",
              "DOB": "19/09/1995",
              "Gender": "Male",
              "FOID Type": "None",
              "": ""
            },
            {
              "Pax Type": "Adult",
              "Title": "Mr",
              "Firstname": "Asdad",
              "Surname": "Dasd",
              "DOB": "07/12/1995",
              "Gender": "Male",
              "FOID Type": "None",
              "": ""
            }
          ]
        }
      ]
    }
  ],
  [
    {
      "Order Ref": "1235",
      "Order Status": "PayOfflinePendingManualProcessing",
      "Affiliate": " ",
      "Source": " ",
      "Agent": "asdss@asda.com",
      "Customer Name": "Mr Asdsd Asdsd",
      "Email Address": "ads@ads.com",
      "Telephone": "3332324252",
      "Mobile": "3332784342",
      "Address 1": " ",
      "Address 2": " ",
      "City": " ",
      "County/State": " ",
      "Postal Code": " ",
      "Country": " ",
      "Voucher Code": "ABC123",
      "Voucher Amount": "100",
      "Order Date": "11/01/2018 17:06",
      "Item ID": [
        {
          "Item ID": "123",
          "Type": "Flight",
          "Supplier Code": "PITCH",
          "Supplier Name": "Kicker",
          "Booking Ref": "FAILED",
          "Supplier Price": "154340",
          "Currency": "PKR",
          "Selling Price": "154340",
          "Depart": "18/01/2018 11:40",
          "Arrive": "18/01/2018 14:25",
          "Origin": "KHI",
          "Destination": "DXB",
          "Carrier": "PA",
          "Flight No": "210",
          "Class": "Economy",
          "": ""
        },
        {
          "Item ID": "123",
          "Type": "Flight",
          "Supplier Code": "PITCH",
          "Supplier Name": "Kicker",
          "Booking Ref": "FAILED",
          "Supplier Price": "154340",
          "Currency": "PKR",
          "Selling Price": "154340",
          "Depart": "25/01/2018 6:25",
          "Arrive": "25/01/2018 10:40",
          "Origin": "DXB",
          "Destination": "LHE",
          "Carrier": "PA",
          "Flight No": "211",
          "Class": "Economy",
          "": "",
          "Pax Type": [
            {
              "Pax Type": "Adult",
              "Title": "Mr",
              "Firstname": "Asd",
              "Surname": "Azam",
              "DOB": "11/08/1991",
              "Gender": "Male",
              "FOID Type": "None",
              "": ""
            },
            {
              "Pax Type": "Adult",
              "Title": "Mr",
              "Firstname": "Aziz",
              "Surname": "Asdsd",
              "DOB": "01/07/1974",
              "Gender": "Male",
              "FOID Type": "None",
              "": ""
            },
            {
              "Pax Type": "Adult",
              "Title": "Mr",
              "Firstname": "mureed",
              "Surname": "ahmed",
              "DOB": "28/05/1995",
              "Gender": "Male",
              "FOID Type": "None",
              "": ""
            },
            {
              "Pax Type": "Child",
              "Title": "Mr",
              "Firstname": "abdullah",
              "Surname": "Cdsd",
              "DOB": "14/04/2012",
              "Gender": "Female",
              "FOID Type": "None",
              "": ""
            },
            {
              "Pax Type": "Adult",
              "Title": "Mr",
              "Firstname": "Asdsd",
              "Surname": "Ahmed",
              "DOB": "17/12/1999",
              "Gender": "Male",
              "FOID Type": "None",
              "": ""
            }
          ]
        }
      ]
    }
  ],
  [
    {
      "Order Ref": "1236",
      "Order Status": "PayOfflinePendingManualProcessing",
      "Affiliate": " ",
      "Source": " ",
      "Agent": "asda@asdad.com",
      "Customer Name": "Mr Asds Sdsd",
      "Email Address": "asd@asdsd.com",
      "Telephone": "3067869234",
      "Mobile": "3067869234",
      "Address 1": " ",
      "Address 2": " ",
      "City": " ",
      "County/State": " ",
      "Postal Code": " ",
      "Country": " ",
      "Voucher Code": " ",
      "Voucher Amount": "0",
      "Order Date": "11/01/2018 16:23",
      "Item ID": [
        {
          "Item ID": "124",
          "Type": "Flight",
          "Supplier Code": "PITCH",
          "Supplier Name": "Kicker",
          "Booking Ref": " ",
          "Supplier Price": "20134",
          "Currency": "PKR",
          "Selling Price": "20134",
          "Depart": "23/01/2018 2:00",
          "Arrive": "ONEWAY",
          "Origin": "KHI",
          "Destination": "SHJ",
          "Carrier": "PK",
          "Flight No": "812",
          "Class": "Economy",
          "": "",
          "Pax Type": [
            {
              "Pax Type": "Adult",
              "Title": "Mr",
              "Firstname": "Asds",
              "Surname": "raza",
              "DOB": "01/12/1994",
              "Gender": "Male",
              "FOID Type": "Passport",
              "": ""
            }
          ]
        }
      ]
    }
  ]
]

csvfile.csv

Order Ref,Order Status,Affiliate,Source,Agent,Customer Name,Email Address,Telephone,Mobile,Address 1,Address 2,City,County/State,Postal Code,Country,Voucher Code,Voucher Amount,Order Date
1234,PayOfflineConfirmedManual, , ,akjsd@ad.com,Mr Kas Iasd,asd@asd.com,3342926655, , , , , , , , ,0,11/01/2018 18:51
 ,Item ID,Type,Supplier Code,Supplier Name,Booking Ref,Supplier Price,Currency,Selling Price,Currency,Depart,Arrive,Origin,Destination,Carrier,Flight No,Class,
 ,125,Flight,SB,Sabre,ABC123,5002,PKR,5002,PKR,12/01/2018 13:15,ONEWAY,KHI,LHE,PK,354,Economy,
 , ,Pax Type,Title,Firstname,Surname,DOB,Gender,FOID Type,,,,,,,,,
 , ,Adult,Mr,Aasdsa,Mas,19/09/1995,Male,None,,,,,,,,,
 , ,Adult,Mr,Asdad,Dasd,07/12/1995,Male,None,,,,,,,,,
Order Ref,Order Status,Affiliate,Source,Agent,Customer Name,Email Address,Telephone,Mobile,Address 1,Address 2,City,County/State,Postal Code,Country,Voucher Code,Voucher Amount,Order Date
1235,PayOfflinePendingManualProcessing, , ,asdss@asda.com,Mr Asdsd Asdsd,ads@ads.com,3332324252,3332784342, , , , , , ,ABC123,100,11/01/2018 17:06
 ,Item ID,Type,Supplier Code,Supplier Name,Booking Ref,Supplier Price,Currency,Selling Price,Currency,Depart,Arrive,Origin,Destination,Carrier,Flight No,Class,
 ,123,Flight,PITCH,Kicker,FAILED,154340,PKR,154340,PKR,18/01/2018 11:40,18/01/2018 14:25,KHI,DXB,PA,210,Economy,
 , , , , , , , , , ,25/01/2018 6:25,25/01/2018 10:40,DXB,LHE,PA,211,Economy,
 , ,Pax Type,Title,Firstname,Surname,DOB,Gender,FOID Type,,,,,,,,,
 , ,Adult,Mr,Asd,Azam,11/08/1991,Male,None,,,,,,,,,
 , ,Adult,Mr,Aziz,Asdsd,01/07/1974,Male,None,,,,,,,,,
 , ,Adult,Mr,mureed,ahmed,28/05/1995,Male,None,,,,,,,,,
 , ,Child,Mr,abdullah,Cdsd,14/04/2012,Female,None,,,,,,,,,
 , ,Adult,Mr,Asdsd,Ahmed,17/12/1999,Male,None,,,,,,,,,
Order Ref,Order Status,Affiliate,Source,Agent,Customer Name,Email Address,Telephone,Mobile,Address 1,Address 2,City,County/State,Postal Code,Country,Voucher Code,Voucher Amount,Order Date
1236,PayOfflinePendingManualProcessing, , ,asda@asdad.com,Mr Asds Sdsd,asd@asdsd.com,3067869234,3067869234, , , , , , , ,0,11/01/2018 16:23
 ,Item ID,Type,Supplier Code,Supplier Name,Booking Ref,Supplier Price,Currency,Selling Price,Currency,Depart,Arrive,Origin,Destination,Carrier,Flight No,Class,
 ,124,Flight,PITCH,Kicker, ,20134,PKR,20134,PKR,23/01/2018 2:00,ONEWAY,KHI,SHJ,PK,812,Economy,
 , ,Pax Type,Title,Firstname,Surname,DOB,Gender,FOID Type,,,,,,,,,
 , ,Adult,Mr,Asds,raza,01/12/1994,Male,Passport,,,,,,,,,


来源:https://stackoverflow.com/questions/54589321/how-to-read-and-map-csvs-multi-line-header-rows-using-python

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!