How to take preceding element when iterating over XML in Python?

Posted by 三世轮回 on 2020-04-18 06:10:15

Question


I have an XML structured like this:

<?xml version="1.0" encoding="utf-8"?>
<pages>
    <page id="1" bbox="0.000,0.000,462.047,680.315" rotate="0">
        <textbox id="0" bbox="179.739,592.028,261.007,604.510">
            <textline bbox="179.739,592.028,261.007,604.510">
                <text font="NUMPTY+ImprintMTnum"  bbox="191.745,592.218,199.339,603.578" ncolour="0" size="12.482">C</text>
                <text font="NUMPTY+ImprintMTnum-it"  bbox="191.745,592.218,199.339,603.578" ncolour="0" size="12.333">A</text>
                <text font="NUMPTY+ImprintMTnum-it"  bbox="192.745,592.218,199.339,603.578" ncolour="0" size="12.333">P</text>
                <text font="NUMPTY+ImprintMTnum-it"  bbox="191.745,592.218,199.339,603.578" ncolour="0" size="12.333">I</text>
                <text font="NUMPTY+ImprintMTnum"  bbox="191.745,592.218,199.339,603.578" ncolour="0" size="12.482">T</text>
                <text font="NUMPTY+ImprintMTnum"  bbox="191.745,592.218,199.339,603.578" ncolour="0" size="12.482">O</text>
                <text font="NUMPTY+ImprintMTnum"  bbox="191.745,592.218,199.339,603.578" ncolour="0" size="12.482">L</text>
                <text font="NUMPTY+ImprintMTnum"  bbox="191.745,592.218,199.339,603.578" ncolour="0" size="12.482">O</text>
                <text></text>
                <text font="NUMPTY+ImprintMTnum"  bbox="191.745,592.218,199.339,603.578" ncolour="0" size="12.482">I</text>
                <text font="NUMPTY+ImprintMTnum"  bbox="191.745,592.218,199.339,603.578" ncolour="0" size="12.482">I</text>
                <text font="NUMPTY+ImprintMTnum"  bbox="191.745,592.218,199.339,603.578" ncolour="0" size="12.482">I</text>
                <text></text>
            </textline>
        </textbox>
    </page>
</pages>

The bbox attribute of each text tag holds four values. For each element I need the difference between its first bbox value and the first bbox value of the preceding element. For example, the second and third text elements above (191.745 and 192.745) are 1 apart.

So far my code is:

def wrap(line, idxList):
    if len(idxList) == 0:
        return    # No elements to wrap
    # Take the first element from the original location
    idx = idxList.pop(0)     # Index of the first element
    elem = removeByIdx(line, idx) # The indicated element
    # Create "newline" element with "elem" inside
    nElem = E.newline(elem)
    line.insert(idx, nElem)  # Put it in place of "elem"
    while len(idxList) > 0:  # Process the rest of index list
        # Value not used, but must be removed
        idxList.pop(0)
        # Remove the current element from the original location
        currElem = removeByIdx(line, idx + 1)
        nElem.append(currElem)  # Append it to "newline"

for line in root.iter('textline'):
    idxList = []
    for elem in line:
        bbox = elem.attrib.get('bbox')
        if bbox is not None:
            tbl = bbox.split(',')

            distance = float(tbl[2]) - float(tbl[0])
        else:
            distance = 100  # "Too big" value
        if distance > 10:
            par = elem.getparent()
            idx = par.index(elem)
            idxList.append(idx)
        else:  # "Wrong" element, wrap elements "gathered" so far
            wrap(line, idxList)
            idxList = []
    # Process "good" elements without any "bad" after them, if any
    wrap(line, idxList)

The part relevant to the problem is:

for line in root.iter('textline'):
    idxList = []
    for elem in line:
        bbox = elem.attrib.get('bbox')
        if bbox is not None:
            tbl = bbox.split(',')

            distance = float(tbl[2]) - float(tbl[0])

I tried a lot and really don't know how to do it.


Answer 1:


If I understand your needs correctly, you want to select the text nodes that satisfy the following condition:

(bbox value of the text node) - (bbox value of the preceding text node) is not greater than 10.
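Stated in plain Python, that condition might look like the sketch below. The helper names `first_bbox_value` and `within_limit` are made up for illustration, and the sketch assumes "bbox value" means the first of the four comma-separated numbers, as described in the question.

```python
def first_bbox_value(bbox: str) -> float:
    """Return the first of the four comma-separated bbox numbers."""
    return float(bbox.split(",")[0])


def within_limit(current_bbox: str, preceding_bbox: str, limit: float = 10.0) -> bool:
    """True if current minus preceding first bbox value is not greater than limit."""
    return first_bbox_value(current_bbox) - first_bbox_value(preceding_bbox) <= limit


# Two bbox strings taken from the sample XML in the question
print(within_limit("192.745,592.218,199.339,603.578",
                   "191.745,592.218,199.339,603.578"))
```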

You could try XSL and XPath. First the XSL code. It truncates each bbox attribute to its first three characters, which is needed so that XPath can compare the bbox values as numbers in the next step:

<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output omit-xml-declaration="no" indent="yes"/>

<xsl:template match="@bbox">
  <xsl:attribute name="{name()}">
  <xsl:value-of select="substring(.,1,3)" />
  </xsl:attribute>
</xsl:template>

<xsl:template match="@font">
  <xsl:attribute name="{name()}">
  <xsl:text>NUMPTY+ImprintMTnum</xsl:text>
  </xsl:attribute>
</xsl:template>

<xsl:template match="*[not(node())]"/> 
<xsl:strip-space  elements="*"/>

 <xsl:template match="@*|node()">
  <xsl:copy>
  <xsl:apply-templates select="@*|node()"/>
  </xsl:copy>
 </xsl:template>

</xsl:stylesheet>

Then load both files and build the transform:

import lxml.etree as IP

xml = IP.parse(xml_filename)
xslt = IP.parse(xsl_filename)
transform = IP.XSLT(xslt)

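Then apply the transform and query the result:

# Apply the stylesheet to the parsed document; the result tree supports xpath()
tree = transform(xml)
for nodes in tree.xpath("//text[@bbox<preceding::text[1]/@bbox+11]"):
    print(nodes)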

To test with your sample data, replace //text[@bbox<preceding::text[1]/@bbox+11] with //text[@bbox>preceding::text[1]/@bbox]. This selects the text nodes whose bbox value is greater than the bbox value of the preceding text node.
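If you would rather stay in Python, the same comparison can be made by remembering the previous value while iterating. This is only a sketch and not part of the XSLT approach above. It uses the standard-library `xml.etree.ElementTree` (the same loop should work with lxml), the names `first_x` and `diffs_to_previous` are hypothetical, and it assumes that text elements without a bbox are simply skipped, so the last element that had a bbox stays the "preceding" one.

```python
import xml.etree.ElementTree as ET

# Shortened version of the sample document from the question
SAMPLE = """<pages>
  <page id="1">
    <textbox id="0">
      <textline bbox="179.739,592.028,261.007,604.510">
        <text bbox="191.745,592.218,199.339,603.578">C</text>
        <text bbox="191.745,592.218,199.339,603.578">A</text>
        <text bbox="192.745,592.218,199.339,603.578">P</text>
        <text></text>
        <text bbox="191.745,592.218,199.339,603.578">I</text>
      </textline>
    </textbox>
  </page>
</pages>"""


def first_x(elem):
    """First bbox value of an element as a float, or None if it has no bbox."""
    bbox = elem.get("bbox")
    return float(bbox.split(",")[0]) if bbox else None


def diffs_to_previous(line):
    """Return (text, difference) pairs for each text element that has a predecessor."""
    result = []
    prev_x = None
    for elem in line.iter("text"):
        x = first_x(elem)
        if x is None:
            continue  # assumption: elements without bbox are skipped
        if prev_x is not None:
            result.append((elem.text, round(x - prev_x, 3)))
        prev_x = x
    return result


root = ET.fromstring(SAMPLE)
for line in root.iter("textline"):
    for text, diff in diffs_to_previous(line):
        print(text, diff)
```

Whether an element without a bbox should instead reset the comparison depends on how you want to group the characters; setting `prev_x = None` in that branch would give that behaviour.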



Source: https://stackoverflow.com/questions/61213788/how-to-take-preceding-element-when-iterating-over-xml-in-python
