How to loop through a complicated XML structure in order to transform it to a pandas data frame

China☆狼群 submitted on 2019-12-13 10:18:18

Question


I am trying to extract information from an XML file with the following structure and transform it into a pandas dataframe:

<change user="123" timestamp="2017-09-04T13:58:46.190Z">
    <log id="333" action="create">
        <property id="52122">
            <old/>
            <new>
                <item id="562622" toString="Test"/>
                <item id="033362" toString="Test2"/>
            </new>
        </property>
        <property id="33563">
            <new>
                <item id="44322" toString="Test3"/>
            </new>
        </property>
        <property id="21733">
            <old/>
            <new id="12341212" toString="Test4"/>
        </property>
    </log>
</change>

The following are the expected headers for the columns in the dataframe:

Change_User|Timestamp|Log_id|Action|property_ID|New_Property_ID|Item_ID|To_String

I tried this before with minidom, but it was horrible. Now I'm trying to do it with xml.etree.ElementTree.

How can I loop through all the change elements, down to the item id, without producing duplicates?

I need something like that:

for test in root.iter('change'):
    change_user_id.append(test.attrib['user'])
    timestamp.append(test.attrib['timestamp'])
    for log in test:
        log_id.append(log.attrib['id'])
        action.append(log.attrib['action'])
        # now comes the part where I get duplicates and the wrong order of the following values...

# after some logic...

d = {'change_user': change_user_id, 'timestamp': timestamp, 'log_id': log_id, 'action': action}  # and so on...

a = pd.DataFrame.from_dict(d, orient='index')

Answer 1:


Not sure what you're after, but this should get you started:

import xmltodict

with open('change_user.xml') as fd:
    doc = xmltodict.parse(fd.read())  

doc['change']['log']  # use the tag names as keys to navigate the nested dicts

Prints:

OrderedDict([('@id', '333'),
             ('@action', 'create'),
             ('property',
              [OrderedDict([('@id', '52122'),
                            ('old', None),
                            ('new',
                             OrderedDict([('item',
                                           [OrderedDict([('@id', '562622'),
                                                         ('@toString',
                                                          'Test')]),
                                            OrderedDict([('@id', '033362'),
                                                     ('@toString',
                                                      'Test2')])])]))]),
           OrderedDict([('@id', '33563'),
                        ('new',
                         OrderedDict([('item',
                                       OrderedDict([('@id', '44322'),
                                                    ('@toString',
                                                     'Test3')]))]))]),
           OrderedDict([('@id', '21733'),
                        ('old', None),
                        ('new',
                         OrderedDict([('@id', '12341212'),
                                      ('@toString', 'Test4')]))])])])

Source: http://docs.python-guide.org/en/latest/scenarios/xml/
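
One detail in the output above is easy to miss: 'item' is a list for property 52122 but a single mapping for property 33563, and the 'new' of property 21733 has no 'item' at all. A loop over the parsed result has to cope with all three shapes. The following is only a sketch of one way to do that; as_list is a hypothetical helper (not part of xmltodict), and properties is a hand-written stand-in for doc['change']['log']['property'] so that the snippet runs without the XML file. xmltodict's parse also accepts a force_list argument that may serve the same purpose.

```python
def as_list(value):
    """Wrap a single parsed element in a list; map a missing element to []."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


# Hand-written stand-in for doc['change']['log']['property'] as printed above
properties = [
    {'@id': '52122', 'old': None,
     'new': {'item': [{'@id': '562622', '@toString': 'Test'},
                      {'@id': '033362', '@toString': 'Test2'}]}},
    {'@id': '33563',
     'new': {'item': {'@id': '44322', '@toString': 'Test3'}}},
    {'@id': '21733', 'old': None,
     'new': {'@id': '12341212', '@toString': 'Test4'}},
]

# Collect (property id, item id) pairs regardless of how 'item' was parsed
item_ids = []
for prop in as_list(properties):
    new = prop.get('new') or {}
    for item in as_list(new.get('item')):
        item_ids.append((prop['@id'], item['@id']))

print(item_ids)
```

Property 21733 contributes no pair here because its values sit on 'new' itself; those would need to be read from new['@id'] and new['@toString'] separately.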




Answer 2:


Here is one way to proceed. The example covers only two columns; the rest can be added following the same pattern.

Step 1

Parse the XML with ElementTree

import xml.etree.ElementTree as ET
import datetime as date

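def output_xml_parsing(xml):
    # Read the whole file; the with-block closes it again afterwards
    with open(xml) as fd:
        xml_data = fd.read()
    # The root element of the document is <change>
    root = ET.XML(xml_data)
    Change_User = root.attrib.get('user')
    timestamp = root.attrib.get('timestamp')
    return Change_User, timestamp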

Step 2

Create a dataframe and add the values to it. This example uses only two columns, but it can be expanded further.

def add_data_to_dataframe(xml):
    import pandas as pd
    # The values returned by output_xml_parsing are stored in Change_User and timestamp
    Change_User, timestamp = output_xml_parsing(xml)

    # Dictionary that populates the data frame: each key is a column name,
    # each value is a one-element list holding the value returned above
    data = {
        'Change_User': [Change_User],
        'timestamp': [timestamp]
    }
    # Build the single-row data frame from the dictionary
    report_dataframe = pd.DataFrame(data)
    return report_dataframe

Step 3

Calling the function

ab=add_data_to_dataframe(r'D:\Users\pankaj-m\Desktop\Stack overflow questions\xml\data.xml')
print(ab)
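
To extend the two-column example to all the columns listed in the question, the nested loops can build one row per item element and repeat the change, log and property values on each row, so every value lands in exactly one row and in document order. This is a rough sketch and not a definitive solution: change_to_rows, COLUMNS and SAMPLE_XML are names made up for illustration, and it assumes that New_Property_ID means the id attribute on new and that a new element without item children should still yield one row.

```python
import xml.etree.ElementTree as ET

# Column names taken from the question
COLUMNS = ['Change_User', 'Timestamp', 'Log_id', 'Action',
           'property_ID', 'New_Property_ID', 'Item_ID', 'To_String']

# The sample document from the question
SAMPLE_XML = """
<change user="123" timestamp="2017-09-04T13:58:46.190Z">
    <log id="333" action="create">
        <property id="52122">
            <old/>
            <new>
                <item id="562622" toString="Test"/>
                <item id="033362" toString="Test2"/>
            </new>
        </property>
        <property id="33563">
            <new>
                <item id="44322" toString="Test3"/>
            </new>
        </property>
        <property id="21733">
            <old/>
            <new id="12341212" toString="Test4"/>
        </property>
    </log>
</change>
"""


def change_to_rows(root):
    """Return one dict per <item>, or one per <new> that has no <item> children."""
    rows = []
    # iter() also yields root itself when root is a <change> element
    for change in root.iter('change'):
        for log in change.findall('log'):
            for prop in log.findall('property'):
                new = prop.find('new')
                if new is None:
                    continue  # assumption: a property without <new> yields no row
                # Values shared by every row that comes from this property
                base = {
                    'Change_User': change.get('user'),
                    'Timestamp': change.get('timestamp'),
                    'Log_id': log.get('id'),
                    'Action': log.get('action'),
                    'property_ID': prop.get('id'),
                    'New_Property_ID': new.get('id'),  # None when <new> has no id
                }
                items = new.findall('item')
                if items:
                    for item in items:
                        rows.append({**base,
                                     'Item_ID': item.get('id'),
                                     'To_String': item.get('toString')})
                else:
                    # assumption: fall back to the toString attribute on <new>
                    rows.append({**base,
                                 'Item_ID': None,
                                 'To_String': new.get('toString')})
    return rows


rows = change_to_rows(ET.fromstring(SAMPLE_XML))
for row in rows:
    print(row)
```

With pandas installed, pd.DataFrame(rows, columns=COLUMNS) should then give a data frame with one row per item. For a file on disk, ET.parse(path).getroot() can be passed instead of ET.fromstring(SAMPLE_XML).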


Source: https://stackoverflow.com/questions/47453372/how-to-loop-through-a-complicated-xml-structure-in-order-to-transform-it-to-a-pa
