Regular Expressions to parse template tags in XML

Submitted by ╄→尐↘猪︶ㄣ on 2019-12-01 22:55:46
Mike Pennington

Please don't use regular expressions for this problem.

I'm serious: parsing XML with a regex is hard, and it makes your code 50x less maintainable for anyone else.

lxml is the de facto tool that Pythonistas use to parse XML. Take a look at this article on Stack Overflow for sample usage, or consider this answer, which should have been the accepted one.

I hacked this up as a quick demo. It searches for <w:tc> elements that have non-empty <w:t> children and prints "Good" next to each child element.

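import lxml.etree as ET

def local_name(elem):
    # Drop a "{namespace-uri}" or "w:" prefix so that <w:t> compares equal to 't'
    return elem.tag.rpartition('}')[2].rpartition(':')[2]

def worthy(elem):
    # A cell is worth printing if it has a <w:t> child that contains text
    for child in elem.iterchildren(ET.Element):
        if local_name(child) == 't' and child.text is not None:
            return True
    return False

def dump(elem):
    for child in elem.iterchildren(ET.Element):
        print("Good", local_name(child), child.text)

# recover=True lets lxml keep going when the input is slightly broken
parser = ET.XMLParser(ns_clean=True, recover=True)
tree = ET.parse('regex_trial.xml', parser)
# Passing ET.Element to iter() skips comments and processing instructions
for thing in tree.iter(ET.Element):
    if local_name(thing) == 'tc' and worthy(thing):
        dump(thing)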

Yields...

Good t Header 1
Good t Header 2
Good t Header 3
Good t {% for i in items %}
Good t {{ i.field1 }}
Good t {{ i.field2 }}
Good t {{ i.field3 }}
Good t {% endfor %}
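
If you want to try the idea without a file on disk, here is a minimal self-contained sketch. Some assumptions that are mine and not part of the demo: the sample XML is made up, the w: prefix is bound to the WordprocessingML namespace, the name good_cells is hypothetical, and I'm using the standard library's xml.etree.ElementTree so it runs without installing anything (lxml.etree accepts the same fromstring, iter and findall calls).

```python
import xml.etree.ElementTree as ET

# WordprocessingML namespace that the w: prefix is normally bound to
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Made-up sample: three cells, one of them with an empty <w:t>
SAMPLE = """
<w:tbl xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:tc><w:t>Header 1</w:t></w:tc>
  <w:tc><w:t/></w:tc>
  <w:tc><w:t>{{ i.field1 }}</w:t></w:tc>
</w:tbl>
"""

def good_cells(xml_text):
    """Return the text of every non-empty <w:t> directly inside a <w:tc>."""
    root = ET.fromstring(xml_text)
    found = []
    # Namespaced tags are spelled "{uri}localname" once the document is parsed
    for cell in root.iter(f"{{{W}}}tc"):
        for t in cell.findall(f"{{{W}}}t"):
            if t.text:
                found.append(t.text)
    return found

for text in good_cells(SAMPLE):
    print("Good t", text)
```

The point of spelling out the namespace is that a parsed tag is never just 't' when the prefix is properly declared, so comparing against the full "{uri}t" form (or stripping the namespace first) is what makes the lookup reliable.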

Never ever parse HTML or XML or SGML with regular expressions.
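
To see why, here is a small sketch. The sample XML and the names NAIVE, texts_via_regex and texts_via_parser are hypothetical, and I'm again using the standard library's xml.etree.ElementTree rather than lxml so it runs as-is. A naive pattern looks fine until whatever wrote the XML adds an attribute such as xml:space="preserve" or escapes a character, both of which are perfectly legal.

```python
import re
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Made-up cell: the second <w:t> carries an attribute and an escaped "<"
SAMPLE = (
    '<w:tc xmlns:w="' + W + '">'
    '<w:t>{{ i.field1 }}</w:t>'
    '<w:t xml:space="preserve">{% if a &lt; b %}</w:t>'
    '</w:tc>'
)

# Naive pattern: assumes <w:t> never has attributes or escaped characters
NAIVE = re.compile(r"<w:t>(.*?)</w:t>")

def texts_via_regex(xml_text):
    return NAIVE.findall(xml_text)

def texts_via_parser(xml_text):
    root = ET.fromstring(xml_text)
    return [t.text for t in root.iter("{" + W + "}t")]

print(texts_via_regex(SAMPLE))   # misses the second <w:t> entirely
print(texts_via_parser(SAMPLE))  # finds both, with &lt; decoded to <
```

You can keep patching the pattern for each new case, but every patch is another thing the next maintainer has to understand, and the parser already handles all of them.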

Always use tools like lxml, libxml2 or Beautiful Soup. They will always do a smarter and better job than your code.
