Extracting XML into data frame with parent attribute as column title

与世无争的帅哥 submitted on 2019-11-29 08:09:19

I recommend parsing to a DataFrame first, much as you already do (see below for my implementation), and then reshaping it to fit your requirements.

Then you're looking for a pivot:

In [11]: df
Out[11]:
  child  Time  grandchild
0  blah  1200         100
1  blah  1300          30
2   abc  1200           2
3   abc  1300           4
4   abc  1400           2

In [12]: df.pivot(index='Time', columns='child', values='grandchild')
Out[12]:
child  abc  blah
Time
1200     2   100
1300     4    30
1400     2   NaN
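To make the pivot step reproducible, here is a minimal sketch that rebuilds the frame shown above by hand and reshapes it. The variable name `wide` is just an illustration, and the keyword form of `pivot` is used because recent pandas versions no longer accept these arguments positionally.

```python
import pandas as pd

# Same rows as the frame shown above; note there is no ('blah', 1400) row.
df = pd.DataFrame({
    'child': ['blah', 'blah', 'abc', 'abc', 'abc'],
    'Time': [1200, 1300, 1200, 1300, 1400],
    'grandchild': [100, 30, 2, 4, 2],
})

# One row per Time, one column per child, cells taken from grandchild.
wide = df.pivot(index='Time', columns='child', values='grandchild')
print(wide)
```

The missing ('blah', 1400) combination comes out as NaN. If the same (Time, child) pair appears more than once, `pivot` raises a ValueError; `pivot_table` with an aggregation function is the usual alternative in that case.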

I recommend first parsing the file and pulling the values you want into a list of tuples:

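import pandas as pd
from lxml import etree

# etree.parse() returns an ElementTree, so take its root element first
root = etree.parse(file_name).getroot()

# the elements nested under the root's first child
parents = list(root[0])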

In [21]: elems = [(p.attrib['name'], int(c.attrib['Time']), int(gc.text))
                      for p in parents
                      for c in p
                      for gc in c]

In [22]: elems
Out[22]:
[('blah', 1200, 100),
 ('blah', 1300, 30),
 ('blah', 1400, 70),
 ('abc', 1200, 2),
 ('abc', 1300, 4),
 ('abc', 1400, 2)]
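For anyone who wants to try the extraction without the original file, here is a small self-contained sketch. The XML below is hypothetical: the tag names (`group`, `parent`, `child`, `grandchild`) are my assumptions, and only the nesting and the `name` and `Time` attributes mirror what the comprehension expects. It uses the standard library's `xml.etree.ElementTree` instead of lxml so it runs without extra installs; the element API used here is the same in both.

```python
import xml.etree.ElementTree as ET

# Hypothetical input; tag names are assumptions, only the nesting matters.
XML = """\
<root>
  <group>
    <parent name="blah">
      <child Time="1200"><grandchild>100</grandchild></child>
      <child Time="1300"><grandchild>30</grandchild></child>
    </parent>
    <parent name="abc">
      <child Time="1200"><grandchild>2</grandchild></child>
    </parent>
  </group>
</root>
"""

root = ET.fromstring(XML)   # fromstring() returns the root element directly
parents = list(root[0])     # the <parent> elements under the first child

# One tuple per grandchild: (parent name, child Time, grandchild value)
elems = [(p.attrib['name'], int(c.attrib['Time']), int(gc.text))
         for p in parents
         for c in p
         for gc in c]
print(elems)
```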

For multiple files you can extend this into a longer list comprehension, which shouldn't be too slow unless you have a huge number of XML files (here files is the list of XML file paths):

elems = [(p.attrib['name'], int(c.attrib['Time']), int(gc.text))
            for f in files
            for p in etree.parse(f).getroot()[0]
            for c in p
            for gc in c]
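If the one-liner gets hard to read, the per-file work can be moved into a small generator. This is only a sketch: `extract_rows` and `TEMPLATE` are names I made up, the XML layout is the same hypothetical one as before, and it again uses the standard library parser rather than lxml. The two temporary files stand in for the real list of paths.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

def extract_rows(path):
    """Yield (name, Time, value) tuples from one file (hypothetical helper)."""
    root = ET.parse(path).getroot()
    for p in root[0]:
        for c in p:
            for gc in c:
                yield p.attrib['name'], int(c.attrib['Time']), int(gc.text)

# Hypothetical single-parent document used to create the sample files.
TEMPLATE = ('<root><group><parent name="{name}">'
            '<child Time="1200"><grandchild>{value}</grandchild></child>'
            '</parent></group></root>')

with tempfile.TemporaryDirectory() as tmp:
    files = []
    for name, value in [('blah', 100), ('abc', 2)]:
        path = os.path.join(tmp, name + '.xml')
        with open(path, 'w') as fh:
            fh.write(TEMPLATE.format(name=name, value=value))
        files.append(path)

    # Rows from every file, in the order the files are listed.
    elems = [row for f in files for row in extract_rows(f)]

print(elems)
```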

Put them in a DataFrame:

In [23]: pd.DataFrame(elems, columns=['child', 'Time', 'grandchild'])
Out[23]:
  child  Time grandchild
0  blah  1200        100
1  blah  1300         30
2  blah  1400         70
3   abc  1200          2
4   abc  1300          4
5   abc  1400          2

Then do the pivot. :)
