Convert Multiline Excel Data into columns and rows using Python

Submitted by …衆ロ難τιáo~ on 2021-02-11 13:54:30

Question


I am looking to reshape the data from an Excel sheet using Python. This is how my data looks:

AuditDate      Fields                     ModifiedBy
1/1/2019 7:58  Status: Assigned  (0)                
               Site Group: XXX                      
               Region: xxx                          
               Site: xxxxx                          
               Summary: xxxx                        
               Location Company: xxx                
               Support Organization: XXXX           
               Support Group Name: xxxxx            
               Last Name: xxxx                      
               First Name: xxxx                     
               Categorization Tier 1:               
               Categorization Tier 2:               
               Categorization Tier 3:               
               Company: xxxx                        
               Priority: xxx                        
               Work Order Type: xxx                 
               Company3: xxxxx                      
               Request Manager:                     
               Product Cat Tier 1(2):               
               Product Cat Tier 2 (2):              
               Product Cat Tier 3 (2):              
               ASORG: IT Shoreside                  
               ASCPY: xxxx                          
               ASGRP: xxx                           
               Request Assignee:                    
               Status History: XXXX       XXXX           
1/1/2019 8:31  Request Assignee: XXXX     XXXX      
1/1/2019 15:02 Status: Pending  (1)       XXXX      
1/3/2019 13:00 Status: Completed  (5)     XXXX      
1/9/2019 2:46  Status: Closed  (8)        XXXX      

As shown above, the first row is a multiline cell in which the text before each colon (:) should be converted into a column name.

From the Fields column I am only interested in Status, Priority, Request Assignee and ASGRP, which I want to convert into columns. The output should look like this:

AuditDate       Status     Priority RequestAssignee ASGRP ModifiedBy
1/1/2019 7:58   Assigned   XX       XXX             XXX   XXXX
1/1/2019 8:31                       XXXX                  XXXX
1/1/2019 15:02  Pending                                   XXXX
1/3/2019 13:00  Completed                                 XXXX
1/9/2019 2:46   Closed                                    XXXX

The same fields can appear in other rows as well. After reshaping the data, this is how the Excel sheet should look.

I would greatly appreciate it if someone could help.


Answer 1:


I will assume that the sheet has been converted to a CSV file. You can then use the csv module to parse the rows first and the Fields column of each row next, and use the same module to write the resulting CSV file.

Assuming the input CSV file is as follows (note the quotes around the multiline field):

AuditDate,Fields,ModifiedBy
1/1/2019 7:58,"Status: Assigned (0)
Site Group: XXX
Region: xxx
Site: xxxxx
Summary: xxxx
Location Company: xxx
Support Organization: XXXX
Support Group Name: xxxxx
Last Name: xxxx
First Name: xxxx
Categorization Tier 1:
Categorization Tier 2:
Categorization Tier 3:
Company: xxxx
Priority: xxx
Work Order Type: xxx
Company3: xxxxx
Request Manager:
Product Cat Tier 1(2):
Product Cat Tier 2 (2):
Product Cat Tier 3 (2):
ASORG: IT Shoreside
ASCPY: xxxx
ASGRP: xxx
Request Assignee:
Status History: XXXX",XXXX
1/1/2019 8:31,Request Assignee: XXXX,XXXX
1/1/2019 15:02,Status: Pending (1),XXXX
1/3/2019 13:00,Status: Completed (5),XXXX
1/9/2019 2:46,Status: Closed (8),XXXX

You can process it like this:

import csv
import io

with open('input.csv', newline='') as fd, open('output.csv', 'w', newline='') as fdout:
    rd = csv.DictReader(fd)       # directly use a DictReader for reading
    # declare a DictWriter for the required fields, ignoring any additional field (extrasaction)
    wr = csv.DictWriter(fdout, ['AuditDate', 'Status', 'Priority', 'Request Assignee',
                                'ASGRP', 'ModifiedBy'], extrasaction='ignore')
    wr.writeheader()               # write the headers
    for row in rd:
        with io.StringIO(row['Fields']) as ffd:     # process Fields
            # split each "name: value" line of the cell on the colon
            frd = csv.reader(ffd, delimiter=':', skipinitialspace=True)
            row.update(dict(frd))  # update the row dictionary with the "sub-fields"
        wr.writerow(row)           # and directly use that

This should produce the expected output:

AuditDate,Status,Priority,Request Assignee,ASGRP,ModifiedBy
1/1/2019 7:58,Assigned (0),xxx,,xxx,XXXX
1/1/2019 8:31,,,XXXX,,XXXX
1/1/2019 15:02,Pending (1),,,,XXXX
1/3/2019 13:00,Completed (5),,,,XXXX
1/9/2019 2:46,Closed (8),,,,XXXX
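
The output above keeps the numeric code after each status ("Assigned (0)"), whereas the question asks for "Assigned" alone, and dict(frd) would fail on a line whose value itself contains a colon. A minimal sketch of one way to handle both, using only the standard library, is shown below. The helper names parse_fields and strip_status_code are hypothetical, not part of the answer, and the sketch assumes the code is always a trailing number in parentheses.

```python
import re

# Columns requested in the question; everything else in the cell is ignored.
WANTED = ("Status", "Priority", "Request Assignee", "ASGRP")

def parse_fields(text):
    """Split a multiline 'Name: value' cell into a dict of the wanted sub-fields."""
    result = {}
    for line in text.splitlines():
        # partition splits on the first colon only, so values may contain colons
        name, sep, value = line.partition(":")
        if not sep:
            continue  # skip blank lines and lines without a colon
        name = name.strip()
        if name in WANTED:
            result[name] = value.strip()
    return result

def strip_status_code(status):
    """Drop a trailing numeric code such as ' (0)' from a status value."""
    return re.sub(r"\s*\(\d+\)\s*$", "", status)

fields = parse_fields("Status: Assigned  (0)\nSite: xxxxx\nPriority: xxx\nRequest Assignee:")
fields["Status"] = strip_status_code(fields["Status"])
print(fields)
```

In the loop above, row.update(parse_fields(row['Fields'])) could stand in for the io.StringIO block, with strip_status_code applied to row['Status'] when that key is present.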



Answer 2:


I would suggest using the pandas library. It works with an intuitive table-style format (similar to Excel).

import pandas as pd
# read the sheet into a DataFrame, using the first column (AuditDate) as the index
df = pd.read_excel('tmp.xlsx', index_col=0)

You can then filter and reshape the resulting DataFrame (table) as required, or drop rows with NA values (e.g. based on the audit date column).
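
The reshaping step is not spelled out in this answer, so here is one possible sketch of it. It assumes each Fields cell holds the whole multiline text (as in the CSV of the first answer) and builds a small DataFrame in memory as a stand-in for the result of pd.read_excel. The helper parse_fields and the variable names are illustrative only.

```python
import pandas as pd

# Stand-in for pd.read_excel('tmp.xlsx'); the real sheet is assumed to have these columns.
df = pd.DataFrame({
    "AuditDate": ["1/1/2019 7:58", "1/1/2019 8:31", "1/1/2019 15:02"],
    "Fields": [
        "Status: Assigned  (0)\nPriority: xxx\nASGRP: xxx\nRequest Assignee:",
        "Request Assignee: XXXX",
        "Status: Pending  (1)",
    ],
    "ModifiedBy": ["XXXX", "XXXX", "XXXX"],
})

wanted = ["Status", "Priority", "Request Assignee", "ASGRP"]

def parse_fields(text):
    """Turn a multiline 'Name: value' cell into a dict of sub-fields."""
    pairs = (line.partition(":") for line in str(text).splitlines())
    return {name.strip(): value.strip() for name, sep, value in pairs if sep}

# One dict per row becomes one column per sub-field; reindex keeps only the wanted ones.
parsed = pd.DataFrame(df["Fields"].map(parse_fields).tolist(), index=df.index)
parsed = parsed.reindex(columns=wanted).fillna("")

# Put the new columns between AuditDate and ModifiedBy, as in the desired output.
result = pd.concat([df[["AuditDate"]], parsed, df[["ModifiedBy"]]], axis=1)
print(result)
```

If the sheet instead spreads the sub-fields over separate rows with an empty AuditDate, those rows would first need to be grouped under the preceding audit date. That case is not covered here.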



Source: https://stackoverflow.com/questions/54789589/convert-multiline-excel-data-into-column-and-rows-using-python
