How to open this XML file to create dataframe in Python?

不思量自难忘° 2020-12-09 20:46

Does anyone have a suggestion for the best way to open the XML data on the site below and load it into a DataFrame in Python (I prefer working with pandas)? The file is on the

3 Answers
  •  离开以前
    2020-12-09 21:35

    XML is a tree-like structure, while a Pandas DataFrame is a 2D table-like structure. So there is no automatic way to convert between the two. You have to understand the XML structure and know how you want to map its data onto a 2D table. Thus, every XML-to-DataFrame problem is different.

    Your XML has 2 DataSets, each containing a number of Series. Each Series contains a number of Obs elements.

    Each Series has a SERIES_NAME attribute, and each Obs has OBS_STATUS, TIME_PERIOD and OBS_VALUE attributes. So it would be reasonable to create a table with NAME (taken from SERIES_NAME), OBS_STATUS, TIME_PERIOD, and OBS_VALUE columns.
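
    As a rough illustration of that mapping, here is a minimal sketch that uses the standard library's xml.etree.ElementTree on a tiny made-up document. The namespace URIs, the `root` wrapper element, the sample values and the `flatten` helper are all assumptions for illustration; only the element and attribute names come from the description above.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature document: namespace URIs and values are made up.
SAMPLE = """\
<root xmlns:kf="http://example.com/kf" xmlns:c="http://example.com/common">
  <c:DataSet>
    <kf:Series SERIES_NAME="S1">
      <c:Obs OBS_STATUS="A" TIME_PERIOD="2020-01-02" OBS_VALUE="1.5"/>
      <c:Obs OBS_STATUS="A" TIME_PERIOD="2020-01-03" OBS_VALUE="1.6"/>
    </kf:Series>
    <kf:Series SERIES_NAME="S2">
      <c:Obs OBS_STATUS="A" TIME_PERIOD="2020-01-02" OBS_VALUE="0.7"/>
    </kf:Series>
  </c:DataSet>
</root>
"""

# Prefix-to-URI map used only to resolve the prefixes in the search paths.
NS = {"kf": "http://example.com/kf", "c": "http://example.com/common"}
OBS_KEYS = ["OBS_STATUS", "TIME_PERIOD", "OBS_VALUE"]
COLUMNS = ["NAME"] + OBS_KEYS


def flatten(xml_text):
    """Return one row per Obs, prefixed with the name of its parent Series."""
    root = ET.fromstring(xml_text)
    rows = []
    for series in root.findall(".//kf:Series", NS):
        name = series.get("SERIES_NAME")
        for obs in series.findall("c:Obs", NS):
            rows.append([name] + [obs.get(key) for key in OBS_KEYS])
    return rows


rows = flatten(SAMPLE)
```

    Passing `rows` and `COLUMNS` to `pd.DataFrame(rows, columns=COLUMNS)` would then give the 2D table. This version loads the whole tree into memory, so it is only a sketch for small files.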

    I found pulling the desired data out of the XML a bit complicated, which makes me doubtful that I've found the best way to do it. But here is one way (PS. Thomas Maloney's idea of starting with the 2D table-like XLS data should be way simpler):

    import lxml.etree as ET
    import pandas as pd
    
    path = 'feds200628.xml'
    
    def fast_iter(context, func, *args, **kwargs):
        """
        http://lxml.de/parsing.html#modifying-the-tree
        Based on Liza Daly's fast_iter
        http://www.ibm.com/developerworks/xml/library/x-hiperfparse/
        See also http://effbot.org/zone/element-iterparse.htm
        http://stackoverflow.com/a/7171543/190597 (unutbu)
        """
        for event, elem in context:
            func(elem, *args, **kwargs)
            # It's safe to call clear() here because no descendants will be
            # accessed
            elem.clear()
            # Also eliminate now-empty references from the root node to elem
            for ancestor in elem.xpath('ancestor-or-self::*'):
                while ancestor.getprevious() is not None:
                    del ancestor.getparent()[0]
        del context
    
    data = list()
    obs_keys = ['OBS_STATUS', 'TIME_PERIOD', 'OBS_VALUE']
    columns = ['NAME'] + obs_keys
    
    def process_obs(elem, name):
        dct = elem.attrib
        # print(dct)
        data.append([name] + [dct[key] for key in obs_keys])
    
    def process_series(elem):
        dct = elem.attrib
        # print(dct)
        context = ET.iterwalk(
            elem, events=('end', ),
            tag='{http://www.federalreserve.gov/structure/compact/common}Obs'
            )
        fast_iter(context, process_obs, dct['SERIES_NAME'])
    
    def process_dataset(elem):
        # Look up the namespace URI bound to the 'kf' prefix in this document
        nsmap = elem.nsmap
        # print(nsmap)
        context = ET.iterwalk(
            elem, events=('end', ),
            tag='{{{uri}}}Series'.format(uri=nsmap['kf'])
            )
        fast_iter(context, process_series)
    
    with open(path, 'rb') as f:
        context = ET.iterparse(
            f, events=('end', ),
            tag='{http://www.federalreserve.gov/structure/compact/common}DataSet'
            )
        fast_iter(context, process_dataset)
        df = pd.DataFrame(data, columns=columns)
    

    yields

                NAME OBS_STATUS TIME_PERIOD   OBS_VALUE
    0        SVENY01          A  1961-06-14      2.9825
    1        SVENY01          A  1961-06-15      2.9941
    2        SVENY01          A  1961-06-16      3.0012
    3        SVENY01          A  1961-06-19      2.9949
    4        SVENY01          A  1961-06-20      2.9833
    5        SVENY01          A  1961-06-21      2.9993
    6        SVENY01          A  1961-06-22      2.9837
    ...
    1029410     TAU2          A  2014-09-19  3.72896779
    1029411     TAU2          A  2014-09-22  3.12836171
    1029412     TAU2          A  2014-09-23  3.20146575
    1029413     TAU2          A  2014-09-24  3.29972110
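
    If lxml is not available, a similar streaming pass can probably be done with the standard library alone. The sketch below is an assumption-laden variant, not the code above: it uses xml.etree.ElementTree.iterparse (which has no `tag` filter, so the tags are compared by hand), remembers the current series name on each Series `start` event, and clears each Series once it has ended. The sample document, its namespace URIs and the `stream_rows` helper are hypothetical.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample: namespace URIs and values are made up.
SAMPLE = b"""\
<root xmlns:kf="http://example.com/kf" xmlns:c="http://example.com/common">
  <c:DataSet>
    <kf:Series SERIES_NAME="X1">
      <c:Obs OBS_STATUS="A" TIME_PERIOD="2021-03-01" OBS_VALUE="2.0"/>
    </kf:Series>
    <kf:Series SERIES_NAME="X2">
      <c:Obs OBS_STATUS="A" TIME_PERIOD="2021-03-01" OBS_VALUE="3.5"/>
    </kf:Series>
  </c:DataSet>
</root>
"""

SERIES_TAG = "{http://example.com/kf}Series"
OBS_TAG = "{http://example.com/common}Obs"
OBS_KEYS = ["OBS_STATUS", "TIME_PERIOD", "OBS_VALUE"]


def stream_rows(source, series_tag, obs_tag, obs_keys):
    """Yield [series name, *obs attributes] while parsing incrementally."""
    name = None
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start" and elem.tag == series_tag:
            # Attributes are already available on a 'start' event.
            name = elem.get("SERIES_NAME")
        elif event == "end" and elem.tag == obs_tag:
            yield [name] + [elem.get(key) for key in obs_keys]
        elif event == "end" and elem.tag == series_tag:
            # Drop the processed Obs children to keep memory use down.
            elem.clear()


rows = list(stream_rows(io.BytesIO(SAMPLE), SERIES_TAG, OBS_TAG, OBS_KEYS))
```

    For a real file, an open binary file object would take the place of `io.BytesIO(SAMPLE)`, and the two tag constants would need the file's actual namespace URIs. The emptied Series elements stay attached to the root here, so memory is reduced rather than held constant.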
    
