Question
For the last couple of days I have been trying to open and read an XML file in DATEXII format, without success so far. The file contains traffic data from the NDW Open Data website (Dutch Databank for Road and Traffic Data). The top of the tree looks like the snippet below, which is only a very small part of the data.
<?xml version="1.0"?>
<soapenv:Envelope xmlns:_0="http://datex2.eu/schema/2/2_0" xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/">
  <soapenv:Header/>
  <soapenv:Body>
    <d2LogicalModel modelBaseVersion="2" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
      <exchange xmlns="http://datex2.eu/schema/2/2_0">
        <supplierIdentification>
          <country>nl</country>
          <nationalIdentifier>NLNDW</nationalIdentifier>
        </supplierIdentification>
      </exchange>
      <payloadPublication lang="nl" xmlns="http://datex2.eu/schema/2/2_0" xsi:type="MeasuredDataPublication">
        <publicationTime>2017-10-30T05:00:40.007Z</publicationTime>
        <publicationCreator>
          <country>nl</country>
          <nationalIdentifier>NLNDW</nationalIdentifier>
        </publicationCreator>
        <measurementSiteTableReference targetClass="MeasurementSiteTable" version="955" id="NDW01_MT" />
        <headerInformation>
          <confidentiality>noRestriction</confidentiality>
          <informationStatus>real</informationStatus>
        </headerInformation>
        <siteMeasurements>
          <measurementSiteReference targetClass="MeasurementSiteRecord" version="1" id="PZH01_MST_0690_00" />
          <measurementTimeDefault>2017-10-30T04:59:00Z</measurementTimeDefault>
          <measuredValue index="1">
            <measuredValue>
              <basicData xsi:type="TrafficFlow">
                <vehicleFlow>
                  <vehicleFlowRate>60</vehicleFlowRate>
                </vehicleFlow>
              </basicData>
            </measuredValue>
          </measuredValue>
          <measuredValue index="2">
            <measuredValue>
              <basicData xsi:type="TrafficFlow">
                <vehicleFlow>
                  <vehicleFlowRate>0</vehicleFlowRate>
                </vehicleFlow>
              </basicData>
            </measuredValue>
          </measuredValue>
          <measuredValue index="3">
            <measuredValue>
              <basicData xsi:type="TrafficFlow">
                <vehicleFlow>
                  <vehicleFlowRate>0</vehicleFlowRate>
                </vehicleFlow>
              </basicData>
            </measuredValue>
          </measuredValue>
          <measuredValue index="4">
            <measuredValue>
              <basicData xsi:type="TrafficFlow">
                <vehicleFlow>
                  <vehicleFlowRate>60</vehicleFlowRate>
                </vehicleFlow>
              </basicData>
            </measuredValue>
          </measuredValue>
          <measuredValue index="5">
            <measuredValue>
              <basicData xsi:type="TrafficSpeed">
                <averageVehicleSpeed numberOfInputValuesUsed="1">
                  <speed>38</speed>
                </averageVehicleSpeed>
              </basicData>
            </measuredValue>
          </measuredValue>
          <measuredValue index="6">
            <measuredValue>
              <basicData xsi:type="TrafficSpeed">
                <averageVehicleSpeed numberOfInputValuesUsed="0">
                  <speed>-1</speed>
                </averageVehicleSpeed>
              </basicData>
            </measuredValue>
          </measuredValue>
          <measuredValue index="7">
Ideally I would like to load the information with Python in a Jupyter Notebook as a DataFrame, so that I can perform some predictive analytics if the data allows. I have tried ElementTree and lxml as follows, inspired by numerous other threads:
# Standard Packages
import pandas as pd
import numpy as np

# Necessary Packages for XML and setting Working Directory
import os
import xml.etree.ElementTree as ET
import lxml

os.chdir("C:/.../Intensiteiten en snelheden/30-10-2017")
xml_file = open('0600_Trafficspeed.xml').read()  # Unzipped the file manually

def xml2df(xml_data):
    root = ET.XML(xml_data)  # element tree
    all_records = []  # This is our record list which we will convert into a dataframe
    for i, child in enumerate(root):  # Begin looping through our root tree
        record = {}  # Place holder for our record
        for subchild in child:  # iterate through the subchildren
            record[subchild.tag] = subchild.text  # Extract the text, create a new dictionary key, value pair
        all_records.append(record)  # Append this record to all_records.
    return pd.DataFrame(all_records)  # return records as DataFrame

print(xml2df(xml_file))
However, this returns only a single entry for the first element: column name d2LogicalModel, row 0, value None.
I was able to view the tree structure in Microsoft Edge, with difficulty and at a high CPU cost (Notepad++ with the XML Tools plugin also worked, but it crashes on "bigger" files, i.e. > 20 MB). Even so, I found the structure difficult to comprehend. There are so many layers that I do not know how to define xml2df() with the correct sub-subchildren and so on.
My question thus boils down to two parts. First, how can I identify the variables/columns that contain data, so that I get an overview of the relevant data I want to import? Second, how do I import this into a DataFrame?
Note: Since DATEXII is the standard format for traffic data in Europe, I was hoping its guides would help, but they haven't made sense to me yet. Maybe they will to one of you :)
Any help is greatly appreciated!
Answer 1:
Consider transforming your nested XML input into a flatter structure using XSLT, the special-purpose language designed to transform XML files into other XML, HTML, or even text (CSV/TAB). The XSLT below transforms the original XML into comma-separated values in tabular format, ready for import into pandas with read_csv():
XSLT (save as .xsl file, a special xml file)
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/"
                xmlns:pub="http://datex2.eu/schema/2/2_0"
                xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <xsl:output method="text"/>
  <xsl:strip-space elements="*"/>

  <!-- Header row, then walk down to the individual measurements -->
  <xsl:template match="/soapenv:Envelope">
    <xsl:text>publicationTime,country,nationalIdentifier,msmtSiteTableRef_targetClass,msmtSiteTableRef_version,msmtSiteTableRef_id,</xsl:text>
    <xsl:text>msmtSiteRef_targetClass,msmtSiteRef_version,msmtSiteRef_id,measurementTimeDefault,</xsl:text>
    <xsl:text>measuredValue_index,basicData_type,vehicleFlowRate,averageVehicleSpeed_numberOfInputValues,averageVehicleSpeed_value</xsl:text>
    <xsl:text>&#xa;</xsl:text>
    <xsl:apply-templates select="soapenv:Body"/>
  </xsl:template>

  <xsl:template match="soapenv:Body">
    <xsl:apply-templates select="d2LogicalModel"/>
  </xsl:template>

  <xsl:template match="d2LogicalModel">
    <xsl:apply-templates select="pub:payloadPublication"/>
  </xsl:template>

  <xsl:template match="pub:payloadPublication">
    <xsl:apply-templates select="pub:siteMeasurements"/>
  </xsl:template>

  <xsl:template match="pub:siteMeasurements">
    <xsl:apply-templates select="pub:measuredValue"/>
  </xsl:template>

  <!-- One CSV row per top-level measuredValue; site fields are read from the enclosing siteMeasurements -->
  <xsl:template match="pub:measuredValue">
    <xsl:value-of select="concat(ancestor::pub:payloadPublication/pub:publicationTime,',',
                                 ancestor::pub:payloadPublication/pub:publicationCreator/pub:country,',',
                                 ancestor::pub:payloadPublication/pub:publicationCreator/pub:nationalIdentifier,',',
                                 ancestor::pub:payloadPublication/pub:measurementSiteTableReference/@targetClass,',',
                                 ancestor::pub:payloadPublication/pub:measurementSiteTableReference/@version,',',
                                 ancestor::pub:payloadPublication/pub:measurementSiteTableReference/@id,',',
                                 ancestor::pub:siteMeasurements/pub:measurementSiteReference/@targetClass,',',
                                 ancestor::pub:siteMeasurements/pub:measurementSiteReference/@version,',',
                                 ancestor::pub:siteMeasurements/pub:measurementSiteReference/@id,',',
                                 ancestor::pub:siteMeasurements/pub:measurementTimeDefault,',',
                                 @index,',',
                                 pub:measuredValue/pub:basicData/@xsi:type,',',
                                 descendant::pub:vehicleFlowRate,',',
                                 descendant::pub:averageVehicleSpeed/@numberOfInputValuesUsed,',',
                                 descendant::pub:speed)"/>
    <xsl:text>&#xa;</xsl:text>
  </xsl:template>
</xsl:stylesheet>
Python
from io import StringIO
import lxml.etree as et
import pandas as pd
# LOAD XML AND XSL FILES
doc = et.parse('/path/to/Input.xml')
xsl = et.parse('/path/to/XSLT.xsl')
# INITIALIZE AND RUN TRANSFORMATION
transform = et.XSLT(xsl)
# CONVERT RESULT TO STRING
result = str(transform(doc))
# IMPORT INTO DATAFRAME
df = pd.read_csv(StringIO(result))
Output (parent node values become repeated indicators with different numeric data)
print(df)
# publicationTime country nationalIdentifier msmtSiteTableRef_targetClass msmtSiteTableRef_version msmtSiteTableRef_id msmtSiteRef_targetClass msmtSiteRef_version msmtSiteRef_id measurementTimeDefault measuredValue_index basicData_type vehicleFlowRate averageVehicleSpeed_numberOfInputValues averageVehicleSpeed_value
# 0 2017-10-30T05:00:40.007Z nl NLNDW MeasurementSiteTable 955 NDW01_MT MeasurementSiteRecord 1 PZH01_MST_0690_00 2017-10-30T04:59:00Z 1 TrafficFlow 60.0 NaN NaN
# 1 2017-10-30T05:00:40.007Z nl NLNDW MeasurementSiteTable 955 NDW01_MT MeasurementSiteRecord 1 PZH01_MST_0690_00 2017-10-30T04:59:00Z 2 TrafficFlow 0.0 NaN NaN
# 2 2017-10-30T05:00:40.007Z nl NLNDW MeasurementSiteTable 955 NDW01_MT MeasurementSiteRecord 1 PZH01_MST_0690_00 2017-10-30T04:59:00Z 3 TrafficFlow 0.0 NaN NaN
# 3 2017-10-30T05:00:40.007Z nl NLNDW MeasurementSiteTable 955 NDW01_MT MeasurementSiteRecord 1 PZH01_MST_0690_00 2017-10-30T04:59:00Z 4 TrafficFlow 60.0 NaN NaN
# 4 2017-10-30T05:00:40.007Z nl NLNDW MeasurementSiteTable 955 NDW01_MT MeasurementSiteRecord 1 PZH01_MST_0690_00 2017-10-30T04:59:00Z 5 TrafficSpeed NaN 1.0 38.0
# 5 2017-10-30T05:00:40.007Z nl NLNDW MeasurementSiteTable 955 NDW01_MT MeasurementSiteRecord 1 PZH01_MST_0690_00 2017-10-30T04:59:00Z 6 TrafficSpeed NaN 0.0 -1.0
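On the first part of the question (identifying which elements actually carry data), one option is to stream through the file once and count every element path that has text or attributes. The following is only a minimal sketch using the standard library's xml.etree.ElementTree.iterparse. The helper names local and tag_path_counts and the small SAMPLE document are made up for illustration, and SAMPLE is a cut-down stand-in for the snippet in the question, not a real NDW file.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical, cut-down stand-in for the DATEXII snippet in the question.
SAMPLE = b"""<?xml version="1.0"?>
<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/">
<soapenv:Body>
<d2LogicalModel modelBaseVersion="2" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<payloadPublication lang="nl" xmlns="http://datex2.eu/schema/2/2_0" xsi:type="MeasuredDataPublication">
<siteMeasurements>
<measuredValue index="1"><measuredValue><basicData xsi:type="TrafficFlow">
<vehicleFlow><vehicleFlowRate>60</vehicleFlowRate></vehicleFlow>
</basicData></measuredValue></measuredValue>
<measuredValue index="5"><measuredValue><basicData xsi:type="TrafficSpeed">
<averageVehicleSpeed numberOfInputValuesUsed="1"><speed>38</speed></averageVehicleSpeed>
</basicData></measuredValue></measuredValue>
</siteMeasurements>
</payloadPublication>
</d2LogicalModel>
</soapenv:Body>
</soapenv:Envelope>"""


def local(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag and attribute names."""
    return tag.rsplit("}", 1)[-1]


def tag_path_counts(source):
    """Count element paths with non-blank text and attribute paths ('/@name').

    `source` is a filename or a binary file object. The file is streamed,
    so the whole tree is never held in memory at once.
    """
    counts = Counter()
    path = []  # local names from the root down to the current element
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            path.append(local(elem.tag))
        else:
            here = "/".join(path)
            if elem.text and elem.text.strip():
                counts[here] += 1
            for name in elem.attrib:
                counts[here + "/@" + local(name)] += 1
            path.pop()
            elem.clear()  # drop text, attributes and children that were already counted
    return counts


# Print every data-carrying path with the number of times it occurs.
for path, n in sorted(tag_path_counts(io.BytesIO(SAMPLE)).items()):
    print(n, path)
```

Paths that occur once per file are candidates for constant columns, while paths that occur many times (such as the measuredValue branch) suggest what one row of the DataFrame should be. For a real file, pass the file name instead of io.BytesIO(SAMPLE).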
Source: https://stackoverflow.com/questions/47331175/datexii-xml-file-to-dataframe-in-python