Extracting data from XML tree into pandas/csv with Python

Question

I have an issue with some XML files. I can't say much about the data because it is for work. From a huge XML file (123,091 lines), I only need the data from 7 tags. I am trying to extract that data, but I am having trouble storing it in pandas or a CSV file. I have found a way to pull some information out, like:

for info in root.iter('ArtistName'):
   print(info.text)

The code above gives me the artist names from that XML tag. Here is part of the output from my Jupyter Notebook:

Various Artists
Various Artists
Various Artists
Various Artists
Various Artists
Cream
Various Artists
Various Artists
Various Artists
Various Artists
Various Artists
Fleetwood Mac
Fleetwood Mac
Linkin Park
Lynyrd Skynyrd
Fleetwood Mac
Eric Clapton
The Black Keys
Tegan And Sara

Then I ran into a problem: I don't know how to loop over each tag in the XML to extract the data. Below is an attempt:

import xml.etree.ElementTree as ET
import pandas as pd

tree = ET.parse("filename.xml")
root = tree.getroot()
dfcols = ['IRC', 'IRC2', 'Artist', 'Song', 'Units', 'PPD', 'TerritoryCode']
df_xml = pd.DataFrame(columns = dfcols)

for i in root.iter(tree):
   df_xml = df_xml.append(pd.Series(index=dfcols), ignore_index=True)

df_xml.head()

The result of the above code is:

 IRC IRC2 Artist Song Units PPD TerritoryCode

That is only the header of the file I want to create. I cannot find a way to bring the information I need into these columns.

I have also tried this:

def getValOfNode(node):
    return node.text if node is not None else None


def main():

    dfcols = ['IRC', 'IRC2', 'Artist', 'Song', 'Units', 'PPD', 'TerritoryCode']
    df_xml = pd.DataFrame(columns = dfcols)

    for node in tree:
        IRC = node.find('IRC')
        IRC2 = node.find('ICPN')
        Artist = node.find('rtistName')
        Song = node.find('Title')
        Units = node.find('ConsumerSales')
        PPD = node.find('Amount')
        TerritoryCode = node.find('TerritoryCode')

        df_xml = df_xml.append(
            pd.Series([getValOfNode(IRC), getValOfNode(IRC2), getValOfNode(Artist), getValOfNode(Song), getValOfNode(Units), getValOfNode(PPD), getValOfNode(TerritoryCode)], index=dfcols), ignore_index=True)

    print(df_xml)


main()

And I get this error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-5-1f816143f9e4> in <module>()
     23 
     24 
---> 25 main()

<ipython-input-5-1f816143f9e4> in main()
      8     df_xml = pd.DataFrame(columns = dfcols)
      9 
---> 10     for node in tree:
     11         IRC = node.find('IRC')
     12         IRC2 = node.find('ICPN')

TypeError: 'ElementTree' object is not iterable

There is also an issue with the territory code, when I run:

for info in root.iter('TerritoryCode'):
   print(info.text)

it prints each territory only once, in order. The codes should repeat (I don't know how to explain it better): I need one for every record, not just one of each. This is what I get:

AE
AR
AT
AU
AW
BE
BG
BO
BR
BY
CA
CH
CL
CN
CO
CR
CY
CZ
DE
DK
DO
DZ
EC
EE
EG
ES
FI
FR
GB
GL
GR
GT
HK
HN

This is what I need:

AD
AD
AE
AE
AE
AE
AE
AE,

and so forth.

Can anyone help me with this? Much appreciated.

Have a great day :)

Answer 1:

The nodes you need sit at different levels of the XML, so the path expression differs for each data item. You also need to traverse two repeating levels: SalesToRecordCompanyByTerritory and ReleaseTransactionsToRecordCompany.

Therefore, consider parsing in nested for loops. Rather than growing a data frame inside the loop, build a list of dictionaries and pass it to pandas' DataFrame() constructor outside the loop. With this approach, the dictionary keys become the columns and the values become the data.

The code below uses chained find() calls, long relative paths, or short descendant (.//) paths to navigate down the nested levels and retrieve the corresponding element text. Note that all parsing is relative to the looped nodes: the parent terr and child rls objects.

import xml.etree.ElementTree as ET
import pandas as pd

tree = ET.parse("file.xml")

data = []
for terr in tree.findall('.//SalesToRecordCompanyByTerritory'):

    for rls in terr.findall('.//ReleaseTransactionsToRecordCompany'):

        inner = {}

        # DESCENDANTS
        inner['IRC'] = rls.find('./ReleaseId/ISRC').text    
        inner['IRC2'] = rls.find('./ReleaseId/ICPN').text

        # CHILDREN
        inner['Artist'] = rls.find('WMGArtistName').text
        inner['Song'] = rls.find('WMGTitle').text

        # DESCENDANTS
        inner['Units'] = rls.find('./SalesTransactionToRecordCompany/SalesDataToRecordCompany/GrossNumberOfConsumerSales').text    
        inner['PPD'] = rls.find('Deal').find('AmountPayableInCurrencyOfAccounting').text

        # PARENT
        inner['TerritoryCode'] = terr.find('./TerritoryCode').text

        data.append(inner)

df = pd.DataFrame(data)
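
To show why the nested loops also solve the repeated TerritoryCode problem, here is a minimal self-contained sketch. The XML sample is invented for illustration: the SalesReport root tag and every value in it are assumptions, and only the element names used in the answer above are taken from the thread. Because the territory code is read from the outer node and written once per inner release, it repeats for every row. This also avoids DataFrame.append, which was removed in pandas 2.0.

```python
import xml.etree.ElementTree as ET
import pandas as pd

# Hypothetical sample data: root tag and values are made up for illustration.
SAMPLE = """
<SalesReport>
  <SalesToRecordCompanyByTerritory>
    <TerritoryCode>AD</TerritoryCode>
    <ReleaseTransactionsToRecordCompany>
      <WMGArtistName>Cream</WMGArtistName>
      <WMGTitle>Song A</WMGTitle>
    </ReleaseTransactionsToRecordCompany>
    <ReleaseTransactionsToRecordCompany>
      <WMGArtistName>Fleetwood Mac</WMGArtistName>
      <WMGTitle>Song B</WMGTitle>
    </ReleaseTransactionsToRecordCompany>
  </SalesToRecordCompanyByTerritory>
  <SalesToRecordCompanyByTerritory>
    <TerritoryCode>AE</TerritoryCode>
    <ReleaseTransactionsToRecordCompany>
      <WMGArtistName>Linkin Park</WMGArtistName>
      <WMGTitle>Song C</WMGTitle>
    </ReleaseTransactionsToRecordCompany>
  </SalesToRecordCompanyByTerritory>
</SalesReport>
"""

root = ET.fromstring(SAMPLE.strip())

rows = []
for terr in root.findall('.//SalesToRecordCompanyByTerritory'):
    # Read the parent-level value once per territory...
    code = terr.findtext('TerritoryCode')
    for rls in terr.findall('.//ReleaseTransactionsToRecordCompany'):
        # ...and repeat it on every release row under that territory.
        rows.append({
            'Artist': rls.findtext('WMGArtistName'),
            'Song': rls.findtext('WMGTitle'),
            'TerritoryCode': code,
        })

# Build the frame once, outside the loops.
df = pd.DataFrame(rows)

# With no path argument, to_csv() returns the CSV text;
# pass a filename instead to write a file.
csv_text = df.to_csv(index=False)
```

With real data, replace ET.fromstring(SAMPLE.strip()) with ET.parse("file.xml") as in the answer; the loop body stays the same.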

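You can shorten the find() chains and long relative paths with descendant searches using .// (these are still relative to the current node, but match at any depth, and find() returns only the first match):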

inner['IRC'] = rls.find('.//ISRC').text    
inner['IRC2'] = rls.find('.//ICPN').text

inner['PPD'] = rls.find('.//AmountPayableInCurrencyOfAccounting').text
inner['Units'] = rls.find('.//GrossNumberOfConsumerSales').text
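
One caveat with the find(...).text pattern: find() returns None when no element matches, so .text then raises AttributeError. If some records might lack a tag (an assumption, since the real file is not shown), findtext() is a possible alternative because it returns a default value instead. The sample below is hypothetical: its Root tag and values are invented, and the second release deliberately has no Deal element.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: the second release has no <Deal> element.
SAMPLE = """
<Root>
  <ReleaseTransactionsToRecordCompany>
    <WMGArtistName>Cream</WMGArtistName>
    <Deal>
      <AmountPayableInCurrencyOfAccounting>1.25</AmountPayableInCurrencyOfAccounting>
    </Deal>
  </ReleaseTransactionsToRecordCompany>
  <ReleaseTransactionsToRecordCompany>
    <WMGArtistName>Fleetwood Mac</WMGArtistName>
  </ReleaseTransactionsToRecordCompany>
</Root>
"""

root = ET.fromstring(SAMPLE.strip())

ppd_values = []
for rls in root.findall('.//ReleaseTransactionsToRecordCompany'):
    # findtext() returns the default (None here) when nothing matches,
    # where find(...).text would raise AttributeError.
    ppd = rls.findtext('.//AmountPayableInCurrencyOfAccounting', default=None)
    ppd_values.append(ppd)
```

Missing values then arrive in the DataFrame as None (shown as NaN or None by pandas) rather than stopping the whole parse.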

Source: https://stackoverflow.com/questions/53427905/extracting-data-from-xml-tree-into-pandas-csv-with-python