Question
I am trying to extract information from an XML file and transform it into a pandas DataFrame. The XML has the following structure:
<change user="123" timestamp="2017-09-04T13:58:46.190Z">
  <log id="333" action="create">
    <property id="52122">
      <old/>
      <new>
        <item id="562622" toString="Test"/>
        <item id="033362" toString="Test2"/>
      </new>
    </property>
    <property id="33563">
      <new>
        <item id="44322" toString="Test3"/>
      </new>
    </property>
    <property id="21733">
      <old/>
      <new id="12341212" toString="Test4"/>
    </property>
  </log>
</change>
The following are the expected headers for the columns in the dataframe:
Change_User|Timestamp|Log_id|Action|property_ID|New_Property_ID|Item_ID|To_String
I tried it before with minidom, but it was horrible. Now I'm trying to do this with xml.etree.ElementTree.
How can I loop through all the change elements, down to the item id, without getting duplicates?
I need something like this:
for test in root.iter('change'):
    change_user_id.append(test.attrib['user'])
    timestamp.append(test.attrib['timestamp'])
    for log in test:
        log_id.append(log.attrib['id'])
        action.append(log.attrib['action'])
        # now comes the part where I get duplicates and the wrong order of the following values...
# after some logic...
d = {'change_user': change_user_id, 'timestamp': timestamp, 'log_id': log_id, 'action': action}  # and so on...
a = pd.DataFrame.from_dict(d, orient='index')
Answer 1:
Not sure what you're after, but this should get you started:
import xmltodict

with open('change_user.xml') as fd:
    doc = xmltodict.parse(fd.read())

doc['change']['log']  # use the tag names as keys to walk through the nested dicts
Output (as displayed in an interactive session):
OrderedDict([('@id', '333'),
('@action', 'create'),
('property',
[OrderedDict([('@id', '52122'),
('old', None),
('new',
OrderedDict([('item',
[OrderedDict([('@id', '562622'),
('@toString',
'Test')]),
OrderedDict([('@id', '033362'),
('@toString',
'Test2')])])]))]),
OrderedDict([('@id', '33563'),
('new',
OrderedDict([('item',
OrderedDict([('@id', '44322'),
('@toString',
'Test3')]))]))]),
OrderedDict([('@id', '21733'),
('old', None),
('new',
OrderedDict([('@id', '12341212'),
('@toString', 'Test4')]))])])])
Source: http://docs.python-guide.org/en/latest/scenarios/xml/
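One detail visible in the output above: a tag that occurs once (the single item under property 33563) comes back as a single mapping, while a repeated tag comes back as a list. Code that walks the parsed result probably needs to normalize the two cases. The sketch below shows one way to do that; the helper name as_list is hypothetical (not part of xmltodict), and the literal dict is a hand-trimmed copy of the output above so that the example runs without the library installed.

```python
def as_list(value):
    """Wrap a single parsed child in a list; map a missing child (None) to []."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


# Hand-written stand-in for doc['change']['log'], trimmed to two properties.
log = {
    "@id": "333",
    "@action": "create",
    "property": [
        {"@id": "52122", "old": None,
         "new": {"item": [{"@id": "562622", "@toString": "Test"},
                          {"@id": "033362", "@toString": "Test2"}]}},
        {"@id": "33563",
         "new": {"item": {"@id": "44322", "@toString": "Test3"}}},
    ],
}

# One (property id, item id, toString) tuple per item, whether the parser
# produced a list of items or a single item mapping.
pairs = [
    (prop["@id"], item["@id"], item["@toString"])
    for prop in as_list(log.get("property"))
    for new in as_list(prop.get("new"))
    for item in as_list(new.get("item"))
]
```

With the normalization in place, the same loop handles both the two-item property and the one-item property.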
Answer 2:
Here is one way to proceed. I show it for two columns only; the remaining columns follow the same pattern.
Step 1
Parse the XML with ElementTree
import xml.etree.ElementTree as ET
import datetime as date  # the datetime module, aliased as "date"; used for the DataFrame index in Step 2

def output_xml_parsing(xml):
    # Parse the file and get the root <change> element
    root = ET.parse(xml).getroot()
    Change_User = root.attrib.get('user')
    timestamp = root.attrib.get('timestamp')
    return Change_User, timestamp
Step 2
Create a DataFrame and add the values to it. This example uses only two columns, but you can expand it further.
def add_data_to_dataframe(xml):
    import pandas as pd
    # The values returned by the function above are stored in Change_User and timestamp
    Change_User, timestamp = output_xml_parsing(xml)
    # Dictionary that populates the DataFrame: each key is a column name,
    # each value is a one-element list holding the value parsed in Step 1
    data = {
        'Change_User': [Change_User],
        'timestamp': [timestamp]
    }
    # Build the DataFrame; the single row is indexed by today's date
    # (date is the datetime module imported in Step 1)
    report_dataframe = pd.DataFrame(data, index=[date.date.today()])
    return report_dataframe
Step 3
Calling the function
ab=add_data_to_dataframe(r'D:\Users\pankaj-m\Desktop\Stack overflow questions\xml\data.xml')
print(ab)
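To cover all eight columns from the question, the two-column example could be extended along the following lines. This is only a sketch under stated assumptions, not a definitive mapping: it emits one row per item element, and for a new element that has no item children it emits one row using the id and toString attributes of new itself. The function name change_rows and the choice of None for missing values are hypothetical. Building each row inside the innermost loop is what avoids the duplicates and misaligned columns that come from appending to separate lists at different nesting levels.

```python
import xml.etree.ElementTree as ET

XML = """<change user="123" timestamp="2017-09-04T13:58:46.190Z">
  <log id="333" action="create">
    <property id="52122">
      <old/>
      <new>
        <item id="562622" toString="Test"/>
        <item id="033362" toString="Test2"/>
      </new>
    </property>
    <property id="33563">
      <new>
        <item id="44322" toString="Test3"/>
      </new>
    </property>
    <property id="21733">
      <old/>
      <new id="12341212" toString="Test4"/>
    </property>
  </log>
</change>"""

COLUMNS = ["Change_User", "Timestamp", "Log_id", "Action",
           "property_ID", "New_Property_ID", "Item_ID", "To_String"]


def change_rows(root):
    """Return one dict per <item>, or per <new> that has no <item> children."""
    rows = []
    # root.iter() also yields root itself, so this works whether <change>
    # is the document root or sits below a wrapper element.
    for change in root.iter("change"):
        for log in change.findall("log"):
            for prop in log.findall("property"):
                for new in prop.findall("new"):
                    base = {
                        "Change_User": change.get("user"),
                        "Timestamp": change.get("timestamp"),
                        "Log_id": log.get("id"),
                        "Action": log.get("action"),
                        "property_ID": prop.get("id"),
                        "New_Property_ID": new.get("id"),  # None if absent
                    }
                    items = new.findall("item")
                    if items:
                        for item in items:
                            rows.append({**base,
                                         "Item_ID": item.get("id"),
                                         "To_String": item.get("toString")})
                    else:
                        # Assumption: fall back to the attributes on <new>.
                        rows.append({**base,
                                     "Item_ID": None,
                                     "To_String": new.get("toString")})
    return rows


rows = change_rows(ET.fromstring(XML))
```

The resulting list of dicts can then be passed to pandas, for example pd.DataFrame(rows, columns=COLUMNS), which gives one DataFrame row per entry with the columns in the requested order.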
Source: https://stackoverflow.com/questions/47453372/how-to-loop-through-a-complicated-xml-structure-in-order-to-transform-it-to-a-pa