Question
I am currently parsing an XML file and filling a dataframe from it. Suppose we have this toy XML:
<A>
  <AA>
    <AAA1 period='march'>ONE</AAA1>
    <AAA2>TWO</AAA2>
    <AAA3>THREE</AAA3>
    <AAA4>
      <B semester='4'>FOUR</B>
      <C>FIVE</C>
      <D>SIX</D>
    </AAA4>
  </AA>
</A>
What I am trying to get is something like:
[{'A.AA.AAA1.period-march': 'ONE'}, {'A.AA.AAA2': 'TWO'}, {'A.AA.AAA3': 'THREE'}, {'A.AA.AAA4.B.semester-4': 'FOUR'}, {'A.AA.AAA4.C': 'FIVE'}, {'A.AA.AAA4.D': 'SIX'}]
which would be much easier to work with.
I have already parsed the XML and transformed it into this form: [{'A': 'empty'}, {'AA': 'empty'}, {'AAA1': 'ONE'}, {'AAA2': 'TWO'}, {'AAA3': 'THREE'}, {'AAA4': 'empty'}, {'B': 'FOUR'}, {'C': 'FIVE'}, {'D': 'SIX'}]
Here the values of the parent tags are set to 'empty' as a marker. The idea is that whenever an 'empty' value is found, its key is saved so it can be concatenated with the keys that follow, and so on.
I would appreciate all the help, guys. Thank you very much in advance.
Answer 1:
The tricky part is getting the path to the element you are interested in. One way to do this with XSLT is a named template that calls itself recursively on the parent element.
The following uses this method to assemble string versions of the dictionaries and hand those to Python.
Here's the XSLT part, dataframe.xsl:
<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text" />
  <xsl:strip-space elements="*" />

  <!-- match all elements that have text -->
  <xsl:template match="//*[text()]">
    <xsl:text>{'</xsl:text>
    <xsl:call-template name="pwd" />
    <xsl:text>': "</xsl:text>
    <xsl:value-of select="normalize-space(.)" />
    <xsl:text>"}&#10;</xsl:text>
  </xsl:template>

  <!-- recursive template that prints parent element names -->
  <xsl:template name="pwd">
    <xsl:for-each select="parent::*">
      <xsl:call-template name="pwd" />
    </xsl:for-each>
    <xsl:if test="count(ancestor::*) > 0">
      <xsl:text>.</xsl:text>
    </xsl:if>
    <xsl:value-of select="name()" />
    <xsl:for-each select="@*">
      <xsl:value-of select="concat('.', name(), '-', .)" />
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
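If the recursion in the pwd template is hard to follow, here is a rough Python equivalent of the same idea using only the standard library. It is a sketch, not part of the original answer: the names element_path and flatten are made up for illustration, and it only looks at each element's leading text (elem.text), which is a simplification compared with the text() and normalize-space(.) tests in the stylesheet.

```python
import xml.etree.ElementTree as ET

# The toy XML from the question.
SOURCE = """<A>
  <AA>
    <AAA1 period='march'>ONE</AAA1>
    <AAA2>TWO</AAA2>
    <AAA3>THREE</AAA3>
    <AAA4>
      <B semester='4'>FOUR</B>
      <C>FIVE</C>
      <D>SIX</D>
    </AAA4>
  </AA>
</A>"""


def element_path(elem, parents):
    """Build the dotted path of elem, ancestors first (hypothetical helper)."""
    parent = parents.get(elem)
    # Recurse to the parent before emitting this element, like the pwd template.
    prefix = element_path(parent, parents) + "." if parent is not None else ""
    # Append each attribute as ".name-value", as the stylesheet does.
    attrs = "".join(f".{name}-{value}" for name, value in elem.attrib.items())
    return prefix + elem.tag + attrs


def flatten(root):
    """Return one single-key dict per element that carries non-blank text."""
    # ElementTree elements have no parent pointer, so build a child -> parent map.
    parents = {child: parent for parent in root.iter() for child in parent}
    return [
        {element_path(elem, parents): elem.text.strip()}
        for elem in root.iter()
        if elem.text and elem.text.strip()
    ]


records = flatten(ET.fromstring(SOURCE))
print(records)
```

Container elements such as A, AA and AAA4 only hold whitespace, so they are skipped, which plays the same role as xsl:strip-space in the stylesheet.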
To test the XSLT transformation with libxslt's xsltproc utility:
xsltproc dataframe.xsl source.xml
{'A.AA.AAA1.period-march': "ONE"}
{'A.AA.AAA2': "TWO"}
{'A.AA.AAA3': "THREE"}
{'A.AA.AAA4.B.semester-4': "FOUR"}
{'A.AA.AAA4.C': "FIVE"}
{'A.AA.AAA4.D': "SIX"}
Put it all together in Python, dataframe.py:
#!/usr/bin/env python3
import ast
from lxml import etree

# Compile the stylesheet into a callable transform.
with open('dataframe.xsl') as stylesheet:
    transform = etree.XSLT(etree.XML(stylesheet.read()))

# Apply it to the source document; the result is one dict literal per line.
with open('source.xml') as xml:
    dataframe_str = str(transform(etree.parse(xml))).rstrip('\n')

# Safely evaluate each line into a real dict.
dataframe_array = [ast.literal_eval(line)
                   for line in dataframe_str.split('\n')]
print(dataframe_array)
Results:
./dataframe.py
[{'A.AA.AAA1.period-march': 'ONE'}, {'A.AA.AAA2': 'TWO'}, {'A.AA.AAA3': 'THREE'}, {'A.AA.AAA4.B.semester-4': 'FOUR'}, {'A.AA.AAA4.C': 'FIVE'}, {'A.AA.AAA4.D': 'SIX'}]
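Since the goal in the question is to use these paths as dataframe column names, one possible next step is to merge the single-key dicts into one flat record. The sketch below is an addition rather than part of the original answer; merge_records is a hypothetical helper, and it assumes every path occurs only once (repeated sibling tags would need an index or a list per column).

```python
# Output of dataframe.py, copied from the results above.
dataframe_array = [
    {'A.AA.AAA1.period-march': 'ONE'},
    {'A.AA.AAA2': 'TWO'},
    {'A.AA.AAA3': 'THREE'},
    {'A.AA.AAA4.B.semester-4': 'FOUR'},
    {'A.AA.AAA4.C': 'FIVE'},
    {'A.AA.AAA4.D': 'SIX'},
]


def merge_records(records):
    """Merge single-key dicts into one dict (hypothetical helper)."""
    row = {}
    for record in records:
        for key, value in record.items():
            # Assumption: paths are unique; fail loudly instead of overwriting.
            if key in row:
                raise ValueError(f"duplicate column name: {key}")
            row[key] = value
    return row


row = merge_records(dataframe_array)
print(list(row))  # the dotted paths, usable as column names
```

If pandas is installed, passing [row] to pandas.DataFrame should then give a one-row frame with one column per path.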
Source: https://stackoverflow.com/questions/58423635/concatenate-xml-tags-to-become-a-dataframe-column-name