DATEXII XML file to DataFrame in Python

Question

For the last couple of days I have been trying to open and read a certain XML file (in DATEXII format), without success so far. It concerns traffic data from the NDW Open Data website (Dutch Databank for Road and Traffic Data), which is the source of the XML files. The snippet below shows the top of the tree; it is only a very small part of the data.

<?xml version="1.0"?>
<soapenv:Envelope xmlns:_0="http://datex2.eu/schema/2/2_0" xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/">
  <soapenv:Header/>
  <soapenv:Body>
    <d2LogicalModel modelBaseVersion="2" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
      <exchange xmlns="http://datex2.eu/schema/2/2_0">
        <supplierIdentification>
          <country>nl</country>
          <nationalIdentifier>NLNDW</nationalIdentifier>
        </supplierIdentification>
      </exchange>
      <payloadPublication lang="nl" xmlns="http://datex2.eu/schema/2/2_0" xsi:type="MeasuredDataPublication">
        <publicationTime>2017-10-30T05:00:40.007Z</publicationTime>
        <publicationCreator>
          <country>nl</country>
          <nationalIdentifier>NLNDW</nationalIdentifier>
        </publicationCreator>
        <measurementSiteTableReference targetClass="MeasurementSiteTable" version="955" id="NDW01_MT" />
        <headerInformation>
          <confidentiality>noRestriction</confidentiality>
          <informationStatus>real</informationStatus>
        </headerInformation>
        <siteMeasurements>
          <measurementSiteReference targetClass="MeasurementSiteRecord" version="1" id="PZH01_MST_0690_00" />
          <measurementTimeDefault>2017-10-30T04:59:00Z</measurementTimeDefault>
          <measuredValue index="1">
            <measuredValue>
              <basicData xsi:type="TrafficFlow">
                <vehicleFlow>
                  <vehicleFlowRate>60</vehicleFlowRate>
                </vehicleFlow>
              </basicData>
            </measuredValue>
          </measuredValue>
          <measuredValue index="2">
            <measuredValue>
              <basicData xsi:type="TrafficFlow">
                <vehicleFlow>
                  <vehicleFlowRate>0</vehicleFlowRate>
                </vehicleFlow>
              </basicData>
            </measuredValue>
          </measuredValue>
          <measuredValue index="3">
            <measuredValue>
              <basicData xsi:type="TrafficFlow">
                <vehicleFlow>
                  <vehicleFlowRate>0</vehicleFlowRate>
                </vehicleFlow>
              </basicData>
            </measuredValue>
          </measuredValue>
          <measuredValue index="4">
            <measuredValue>
              <basicData xsi:type="TrafficFlow">
                <vehicleFlow>
                  <vehicleFlowRate>60</vehicleFlowRate>
                </vehicleFlow>
              </basicData>
            </measuredValue>
          </measuredValue>
          <measuredValue index="5">
            <measuredValue>
              <basicData xsi:type="TrafficSpeed">
                <averageVehicleSpeed numberOfInputValuesUsed="1">
                  <speed>38</speed>
                </averageVehicleSpeed>
              </basicData>
            </measuredValue>
          </measuredValue>
          <measuredValue index="6">
            <measuredValue>
              <basicData xsi:type="TrafficSpeed">
                <averageVehicleSpeed numberOfInputValuesUsed="0">
                  <speed>-1</speed>
                </averageVehicleSpeed>
              </basicData>
            </measuredValue>
          </measuredValue>
          <measuredValue index="7">

Ideally, I want to load the information with Python in a Jupyter Notebook as a DataFrame, so I can perform some predictive analytics if the data allows. I have tried it with ElementTree and lxml like this, inspired by numerous other threads:

# Standard packages
import pandas as pd
import numpy as np

# Packages for XML handling and for setting the working directory
import os
import xml.etree.ElementTree as ET
import lxml

os.chdir("C:/.../Intensiteiten en snelheden/30-10-2017")

xml_file = open('0600_Trafficspeed.xml').read()  # Unzipped the file manually

def xml2df(xml_data):
    root = ET.XML(xml_data)  # Parse the XML string into an element tree
    all_records = []  # Record list that will be converted into a DataFrame
    for i, child in enumerate(root):  # Loop over the direct children of the root
        record = {}  # Placeholder for this record
        for subchild in child:  # Iterate through the subchildren
            record[subchild.tag] = subchild.text  # Store tag/text as a key/value pair
        all_records.append(record)  # Append this record to all_records
    return pd.DataFrame(all_records)  # Return the records as a DataFrame

print(xml2df(xml_file))

However, this returns only a single entry for the first line: column name d2LogicalModel, row 0, entry None.

I was able to view the tree structure in Microsoft Edge, with difficulty and at a high CPU cost (Notepad++ with the XML Tools plugin also works, but crashes on larger files, i.e. > 20 MB). Even so, I find the structure hard to comprehend. There are so many layers that I do not know how to write xml2df() so that it reaches the correct sub-subchildren.

My question thus boils down to two parts. First, how can I identify the variables/columns that hold data, so that I get an overview of the relevant data I want to import? Second, how do I import this into a DataFrame?

Note: Since the DATEXII format is the standard for traffic data in Europe, I was hoping their guides would help (see documents), but they haven't made sense to me yet. Maybe they will to any of you :)

Any help is greatly appreciated!

Answer 1:

Consider transforming your nested XML source into a flatter structure using XSLT, the special-purpose language designed to transform XML files into other XML, HTML, or even text (CSV/TAB). The XSLT below transforms the original XML into comma-separated values in tabular format for import into pandas with read_csv():

XSLT (save as .xsl file, a special xml file)

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                              xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/"
                              xmlns:pub="http://datex2.eu/schema/2/2_0"
                              xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <xsl:output method="text"/>
  <xsl:strip-space elements="*"/>

  <xsl:template match="/soapenv:Envelope">
    <xsl:text>publicationTime,country,nationalIdentifier,msmtSiteTableRef_targetClass,msmtSiteTableRef_version,msmtSiteTableRef_id,</xsl:text>
    <xsl:text>msmtSiteRef_targetClass,msmtSiteRef_version,msmtSiteRef_id,measurementTimeDefault,</xsl:text>
    <xsl:text>measuredValue_index,basicData_type,vehicleFlowRate,averageVehicleSpeed_numberOfInputValues,averageVehicleSpeed_value</xsl:text>
    <xsl:text>&#xa;</xsl:text>
    <xsl:apply-templates select="soapenv:Body"/>
  </xsl:template>

  <xsl:template match="soapenv:Body">
    <xsl:apply-templates select="d2LogicalModel"/>
  </xsl:template>

  <xsl:template match="d2LogicalModel">
    <xsl:apply-templates select="pub:payloadPublication"/>
  </xsl:template>

  <xsl:template match="pub:payloadPublication">
    <xsl:apply-templates select="pub:siteMeasurements"/>
  </xsl:template>

  <xsl:template match="pub:siteMeasurements">
    <xsl:apply-templates select="pub:measuredValue"/>
  </xsl:template>

  <xsl:template match="pub:measuredValue">
    <xsl:value-of select="concat(ancestor::pub:payloadPublication/pub:publicationTime,',',
                                 ancestor::pub:payloadPublication/pub:publicationCreator/pub:country,',',
                                 ancestor::pub:payloadPublication/pub:publicationCreator/pub:nationalIdentifier,',',
                                 ancestor::pub:payloadPublication/pub:measurementSiteTableReference/@targetClass,',',
                                 ancestor::pub:payloadPublication/pub:measurementSiteTableReference/@version,',',
                                 ancestor::pub:payloadPublication/pub:measurementSiteTableReference/@id,',',
                                 ancestor::pub:siteMeasurements/pub:measurementSiteReference/@targetClass,',',
                                 ancestor::pub:siteMeasurements/pub:measurementSiteReference/@version,',',
                                 ancestor::pub:siteMeasurements/pub:measurementSiteReference/@id,',',
                                 ancestor::pub:siteMeasurements/pub:measurementTimeDefault,',',
                                 @index,',',
                                 pub:measuredValue/pub:basicData/@xsi:type,',',
                                 descendant::pub:vehicleFlowRate,',',
                                 descendant::pub:averageVehicleSpeed/@numberOfInputValuesUsed,',',
                                 descendant::pub:speed)"/><xsl:text>&#xa;</xsl:text>    
  </xsl:template>

</xsl:stylesheet>
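
On the first part of the question (finding out which elements a file contains): the paths in the stylesheet above were read off the snippet, but for a file too large to open in a viewer, one option is to tally the distinct element paths with a streaming parser. The following is a minimal sketch using the standard library's ElementTree.iterparse. The helper names count_paths and local_name and the tiny sample document are made up for illustration; on a real file you would pass its path instead of the in-memory sample.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter


def local_name(tag):
    """Drop the '{namespace}' prefix that ElementTree puts on tags."""
    return tag.rsplit("}", 1)[-1]


def count_paths(source):
    """Count how often each element path occurs in an XML file or file object."""
    counts = Counter()
    stack = []  # local names from the root down to the current element
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            stack.append(local_name(elem.tag))
            counts["/".join(stack)] += 1
        else:
            stack.pop()
            elem.clear()  # drop text and children that were already counted
    return counts


# Hypothetical stand-in document, not real NDW data.
sample = b"""<Envelope xmlns="urn:example"><Body>
  <site><value>1</value><value>2</value></site>
  <site><value>3</value></site>
</Body></Envelope>"""

paths = count_paths(io.BytesIO(sample))
for path, n in paths.items():
    print(n, path)
```

Paths that occur many times (such as the repeated measuredValue elements in the question) are the likely row candidates, while paths that occur once are more likely header fields.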

Python

from io import StringIO
import lxml.etree as et
import pandas as pd

# LOAD XML AND XSL FILES
doc = et.parse('/path/to/Input.xml')
xsl = et.parse('/path/to/XSLT.xsl')

# INITIALIZE AND RUN TRANSFORMATION
transform = et.XSLT(xsl)
# CONVERT RESULT TO STRING 
result = str(transform(doc))

# IMPORT INTO DATAFRAME
df = pd.read_csv(StringIO(result))

Output (parent node values become repeated indicators with different numeric data)

print(df)

#             publicationTime country nationalIdentifier msmtSiteTableRef_targetClass  msmtSiteTableRef_version msmtSiteTableRef_id msmtSiteRef_targetClass  msmtSiteRef_version     msmtSiteRef_id measurementTimeDefault  measuredValue_index basicData_type  vehicleFlowRate  averageVehicleSpeed_numberOfInputValues  averageVehicleSpeed_value
# 0  2017-10-30T05:00:40.007Z      nl              NLNDW         MeasurementSiteTable                       955            NDW01_MT   MeasurementSiteRecord                    1  PZH01_MST_0690_00   2017-10-30T04:59:00Z                    1    TrafficFlow             60.0                                      NaN                        NaN
# 1  2017-10-30T05:00:40.007Z      nl              NLNDW         MeasurementSiteTable                       955            NDW01_MT   MeasurementSiteRecord                    1  PZH01_MST_0690_00   2017-10-30T04:59:00Z                    2    TrafficFlow              0.0                                      NaN                        NaN
# 2  2017-10-30T05:00:40.007Z      nl              NLNDW         MeasurementSiteTable                       955            NDW01_MT   MeasurementSiteRecord                    1  PZH01_MST_0690_00   2017-10-30T04:59:00Z                    3    TrafficFlow              0.0                                      NaN                        NaN
# 3  2017-10-30T05:00:40.007Z      nl              NLNDW         MeasurementSiteTable                       955            NDW01_MT   MeasurementSiteRecord                    1  PZH01_MST_0690_00   2017-10-30T04:59:00Z                    4    TrafficFlow             60.0                                      NaN                        NaN
# 4  2017-10-30T05:00:40.007Z      nl              NLNDW         MeasurementSiteTable                       955            NDW01_MT   MeasurementSiteRecord                    1  PZH01_MST_0690_00   2017-10-30T04:59:00Z                    5   TrafficSpeed              NaN                                      1.0                       38.0
# 5  2017-10-30T05:00:40.007Z      nl              NLNDW         MeasurementSiteTable                       955            NDW01_MT   MeasurementSiteRecord                    1  PZH01_MST_0690_00   2017-10-30T04:59:00Z                    6   TrafficSpeed              NaN                                      0.0                       -1.0
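
If you would rather not maintain a stylesheet, the same flattening can probably be done directly in Python. The function in the question only walks two levels below the root (Header and Body), which is why it never reaches the measurements; the sketch below searches by namespace-qualified name instead. It is a sketch under assumptions: datex_to_df, the chosen column names, and the shortened sample document are hypothetical, and it assumes every siteMeasurements block has the layout shown in the question.

```python
import xml.etree.ElementTree as ET

import pandas as pd

# Namespace URIs as they appear in the question's snippet.
NS = {
    "d2": "http://datex2.eu/schema/2/2_0",
    "xsi": "http://www.w3.org/2001/XMLSchema-instance",
}
XSI_TYPE = "{%s}type" % NS["xsi"]


def datex_to_df(xml_text):
    """Flatten siteMeasurements into one row per indexed measuredValue."""
    root = ET.fromstring(xml_text)
    rows = []
    for site in root.iter("{%s}siteMeasurements" % NS["d2"]):
        site_id = site.find("d2:measurementSiteReference", NS).get("id")
        time = site.findtext("d2:measurementTimeDefault", namespaces=NS)
        # findall only returns direct children: the outer, indexed elements.
        for mv in site.findall("d2:measuredValue", NS):
            basic = mv.find("d2:measuredValue/d2:basicData", NS)
            if basic is None:  # layout differs from the assumed one
                continue
            rows.append({
                "site_id": site_id,
                "measurementTimeDefault": time,
                "index": int(mv.get("index")),
                "type": basic.get(XSI_TYPE),
                "vehicleFlowRate": basic.findtext(
                    "d2:vehicleFlow/d2:vehicleFlowRate", namespaces=NS),
                "speed": basic.findtext(
                    "d2:averageVehicleSpeed/d2:speed", namespaces=NS),
            })
    df = pd.DataFrame(rows)
    for col in ("vehicleFlowRate", "speed"):
        df[col] = pd.to_numeric(df[col])  # missing values become NaN
    return df


# Shortened copy of the question's snippet, for illustration only.
sample = """<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/">
 <soapenv:Body>
  <d2LogicalModel xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
   <payloadPublication xmlns="http://datex2.eu/schema/2/2_0">
    <siteMeasurements>
     <measurementSiteReference version="1" id="PZH01_MST_0690_00" />
     <measurementTimeDefault>2017-10-30T04:59:00Z</measurementTimeDefault>
     <measuredValue index="1"><measuredValue><basicData xsi:type="TrafficFlow">
      <vehicleFlow><vehicleFlowRate>60</vehicleFlowRate></vehicleFlow>
     </basicData></measuredValue></measuredValue>
     <measuredValue index="5"><measuredValue><basicData xsi:type="TrafficSpeed">
      <averageVehicleSpeed numberOfInputValuesUsed="1"><speed>38</speed></averageVehicleSpeed>
     </basicData></measuredValue></measuredValue>
    </siteMeasurements>
   </payloadPublication>
  </d2LogicalModel>
 </soapenv:Body>
</soapenv:Envelope>"""

df = datex_to_df(sample)
print(df)
```

In the question's snippet a speed of -1 appears alongside numberOfInputValuesUsed="0". Whether -1 marks "no data" is an assumption worth checking against the NDW documentation before replacing it with NaN.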

Source: https://stackoverflow.com/questions/47331175/datexii-xml-file-to-dataframe-in-python

Tags

python

xml

dataframe

lxml

elementtree