Question
I have thousands of XML files to process. They share a similar format but have different parent names and different numbers of parents. Through books, Google, tutorials, and trial and error, I've been able to pull out all of this data. See, for example, "Parsing xml to pandas data frame throws memory error" and "Dynamic search through xml attributes using lxml and xpath in python".
However, I realized that I was extracting the data poorly, with a child "Time" repeated for each parent.
Here is what I am trying to get.
Time  blah  abc
1200   100    2
1300    30    4
1400    70    2
Here is what I know how to get, but my current method is clunky (I show it below, after the example XML):
     child  Time  grandchild
0     blah  1200         100
1     blah  1300          30
...
n-2    abc  1200           2
n-1    abc  1300           4
n      abc  1400           2
Example XML format
<outer>
  <inner>
    <parent name = "blah" id = "1">
      <child Time = "1200">
        <grandchild>100</grandchild>
      </child>
      <child Time = "1300">
        <grandchild>30</grandchild>
      </child>
      <child Time = "1400">
        <grandchild>70</grandchild>
      </child>
    </parent>
    <parent name = "abc" id = "2">
      <child Time = "1200">
        <grandchild>2</grandchild>
      </child>
      <child Time = "1300">
        <grandchild>4</grandchild>
      </child>
      <child Time = "1400">
        <grandchild>2</grandchild>
      </child>
    </parent>
    <parent name = "1234" id = "7734">
      <other> 12 </other>
    </parent>
  </inner>
</outer>
Here is how I can get my output:
from lxml import etree
from pandas import DataFrame

# Assumption: file_name is the path to one of the XML files
root = etree.parse(file_name)

dTime = []
dparent = []
dgrandchild = []
for df in root.xpath('/*/*/*/parent/child'):
    dparent.append(df.getparent().attrib['name'])
    ## Iterate over attributes of time for specific parent
    for attrib in df.attrib:
        dTime.append(df.attrib[attrib])
    ## grandchild is a child of time, and iterate
    subfields = df.getchildren()
    for subfield in subfields:
        dgrandchild.append(subfield.text)
df = DataFrame({'Parent': dparent, 'Time': dTime, 'grandchild': dgrandchild})
I could just take this output and re-shape it, but that seems inefficient and a very clunky approach.
I think I need something of the flavor:
# this does not work
data = []
for elem in root.xpath('/*/*/*/parent/child'):
    elem_data = {}
    for attrib in elem.attrib:
        elem_data['Time'] = elem.attrib[attrib]
    for child in elem.getchildren():
        elem_data[elem.getparent().attrib['name']] = child.text
    data.append(elem_data)
ndata = DataFrame(data)
Answer 1:
I recommend just parsing to a DataFrame first, much as you already do (see below for my implementation), and then reshaping it to your requirements.
Then you're looking for a pivot:
In [11]: df
Out[11]:
  child  Time  grandchild
0  blah  1200         100
1  blah  1300          30
2   abc  1200           2
3   abc  1300           4
4   abc  1400           2
In [12]: df.pivot(index='Time', columns='child', values='grandchild')
Out[12]:
child  abc  blah
Time
1200     2   100
1300     4    30
1400     2   NaN
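As a minimal sketch of the pivot step, the snippet below rebuilds the five-row frame shown above and reshapes it with keyword arguments, which pandas 2.0 and later require for `pivot`. The duplicated row is a made-up example (not from the question) of what can happen once rows from many files are combined: `pivot` raises `ValueError` on a repeated (Time, child) pair, while `pivot_table` aggregates instead. Picking `"first"` as the aggregation is an assumption; another rule may suit the real data better.

```python
import pandas as pd

# The five rows from the frame shown above.
df = pd.DataFrame(
    [("blah", 1200, 100), ("blah", 1300, 30),
     ("abc", 1200, 2), ("abc", 1300, 4), ("abc", 1400, 2)],
    columns=["child", "Time", "grandchild"],
)

# One row per Time, one column per parent name.
wide = df.pivot(index="Time", columns="child", values="grandchild")

# Hypothetical duplicate (Time, child) pair, e.g. the same reading in two files.
dup = pd.concat([df, df.iloc[[0]]], ignore_index=True)
try:
    dup.pivot(index="Time", columns="child", values="grandchild")
    pivot_failed = False
except ValueError:
    # pivot cannot reshape when an index/column pair occurs twice
    pivot_failed = True

# pivot_table aggregates duplicates instead of raising.
wide_dup = dup.pivot_table(index="Time", columns="child",
                           values="grandchild", aggfunc="first")
```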
I recommend first parsing the file and pulling the values you want into a list of tuples:
from lxml import etree
root = etree.parse(file_name).getroot()  # parse() returns a tree; getroot() gives <outer>
parents = root.getchildren()[0].getchildren()  # the <parent> elements under <inner>
In [21]: elems = [(p.attrib['name'], int(c.attrib['Time']), int(gc.text))
                  for p in parents
                  for c in p.findall('child')  # skips parents that have no <child>
                  for gc in c]
In [22]: elems
Out[22]:
[('blah', 1200, 100),
('blah', 1300, 30),
('blah', 1400, 70),
('abc', 1200, 2),
('abc', 1300, 4),
('abc', 1400, 2)]
For multiple files you could just put it in an even longer list comprehension, which shouldn't be too slow unless you have a huge number of XML files (here files is the list of XML files):
elems = [(p.attrib['name'], int(c.attrib['Time']), int(gc.text))
         for f in files
         for p in etree.parse(f).getroot().getchildren()[0].getchildren()
         for c in p.findall('child')  # skips parents that have no <child>
         for gc in c]
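The multi-file comprehension above can also be split into a small generator per file, which may be easier to debug across thousands of files. This is only a sketch: `rows_from_file` and `rows_from_files` are hypothetical names, the two temporary files are made-up sample data, and the import falls back to the standard-library `xml.etree.ElementTree` (which offers the same `parse`, `iter`, `findall` and `get` calls) if lxml is not installed.

```python
import os
import tempfile

try:
    from lxml import etree
except ImportError:  # the standard library exposes the same calls used here
    import xml.etree.ElementTree as etree


def rows_from_file(path):
    """Yield (parent name, Time, grandchild value) tuples from one XML file."""
    root = etree.parse(path).getroot()
    for p in root.iter("parent"):
        # findall("child") returns nothing for parents without <child> elements
        for c in p.findall("child"):
            for gc in c:
                yield (p.get("name"), int(c.get("Time")), int(gc.text))


def rows_from_files(paths):
    """Chain the rows of several files into one stream."""
    for path in paths:
        yield from rows_from_file(path)


# Made-up sample files, written only to exercise the helpers.
with tempfile.TemporaryDirectory() as tmp:
    files = []
    for i, value in enumerate([100, 2]):
        path = os.path.join(tmp, f"sample_{i}.xml")
        with open(path, "w") as fh:
            fh.write(
                f'<outer><inner><parent name="p{i}"><child Time="1200">'
                f"<grandchild>{value}</grandchild></child></parent>"
                f'<parent name="x{i}"><other> 12 </other></parent>'
                f"</inner></outer>"
            )
        files.append(path)
    elems = list(rows_from_files(files))
```

The resulting `elems` list can be passed to `DataFrame` exactly like the output of the comprehension.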
Put them in a DataFrame (with pandas imported as pd):
In [23]: pd.DataFrame(elems, columns=['child', 'Time', 'grandchild'])
Out[23]:
  child  Time  grandchild
0  blah  1200         100
1  blah  1300          30
2  blah  1400          70
3   abc  1200           2
4   abc  1300           4
5   abc  1400           2
then do the pivot. :)
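Putting the steps together, here is a hedged end-to-end sketch that runs on the example XML from the question, embedded as a string so nothing needs to be on disk. It uses `root.iter("parent")` instead of indexing into the tree, which is a small departure from the code above, and it falls back to the standard-library parser if lxml is missing. The final column reordering is an optional extra: `pivot` sorts column names alphabetically, and the reordering restores the order in which parents appear in the file.

```python
import pandas as pd

try:
    from lxml import etree
except ImportError:  # the standard library exposes the same calls used here
    import xml.etree.ElementTree as etree

XML = """<outer><inner>
<parent name="blah" id="1">
  <child Time="1200"><grandchild>100</grandchild></child>
  <child Time="1300"><grandchild>30</grandchild></child>
  <child Time="1400"><grandchild>70</grandchild></child>
</parent>
<parent name="abc" id="2">
  <child Time="1200"><grandchild>2</grandchild></child>
  <child Time="1300"><grandchild>4</grandchild></child>
  <child Time="1400"><grandchild>2</grandchild></child>
</parent>
<parent name="1234" id="7734"><other> 12 </other></parent>
</inner></outer>"""

root = etree.fromstring(XML)

# One tuple per grandchild: (parent name, Time, value).
elems = [(p.get("name"), int(c.get("Time")), int(gc.text))
         for p in root.iter("parent")
         for c in p.findall("child")   # parent "1234" has no <child>, so it is skipped
         for gc in c]

df = pd.DataFrame(elems, columns=["child", "Time", "grandchild"])

# Long to wide: Time becomes the index, each parent name becomes a column.
wide = df.pivot(index="Time", columns="child", values="grandchild")

# Optional: restore the order in which the parents appear in the file.
wide = wide[list(dict.fromkeys(df["child"]))]

print(wide)
```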
Source: https://stackoverflow.com/questions/16991691/extracting-xml-into-data-frame-with-parent-attribute-as-column-title