Using Python and lxml to strip only the tags that have certain attributes/values

我怕爱的太早我们不能终老 submitted on 2019-12-12 08:25:55

Question


I'm familiar with etree's strip_tags and strip_elements methods, but I'm looking for a straightforward way of stripping tags (and leaving their contents) that only contain particular attributes/values.

For instance: I'd like to strip all span or div tags (or other elements) from a tree (xhtml) that have a class='myclass' attribute/value (preserving the element's contents like strip_tags would do). Meanwhile, those same elements that don't have class='myclass' should remain untouched.

Conversely: I'd like a way to strip all "naked" spans or divs from a tree. Meaning only those spans/divs (or any other elements for that matter) that have absolutely no attributes. Leaving those same elements that have attributes (any) untouched.

I feel I'm missing something obvious, but I've been searching without any luck for quite some time.


Answer 1:


HTML

lxml's HTML elements have a drop_tag() method, which you can call on any element in a tree parsed by lxml.html.

It acts like strip_tags in that it removes the element's tag but retains its text and children. Because it is called on an individual element, you can select the elements you're not interested in with an XPath expression, then loop over them and drop each one:

doc.html

<html>
    <body>
        <div>This is some <span attr="foo">Text</span>.</div>
        <div>Some <span>more</span> text.</div>
        <div>Yet another line <span attr="bar">of</span> text.</div>
        <div>This span will get <span attr="foo">removed</span> as well.</div>
        <div>Nested elements <span attr="foo">will <b>be</b> left</span> alone.</div>
        <div>Unless <span attr="foo">they <span attr="foo">also</span> match</span>.</div>
    </body>
</html>

strip.py

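from lxml import etree
from lxml import html

# html.parse() accepts a filename and returns an ElementTree
doc = html.parse('doc.html')
spans_with_attrs = doc.xpath("//span[@attr='foo']")

for span in spans_with_attrs:
    # Remove the tag itself but keep its text and children
    span.drop_tag()

print(etree.tostring(doc, encoding='unicode'))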

Output:

<html>
    <body>
        <div>This is some Text.</div>
        <div>Some <span>more</span> text.</div>
        <div>Yet another line <span attr="bar">of</span> text.</div>
        <div>This span will get removed as well.</div>
        <div>Nested elements will <b>be</b> left alone.</div>
        <div>Unless they also match.</div>
    </body>
</html>

In this case, the XPath expression //span[@attr='foo'] selects all the span elements with an attribute attr of value foo. See this XPath tutorial for more details on how to construct XPath expressions.

XML / XHTML

Edit: I just noticed you specifically mention XHTML in your question, which according to the docs is better parsed as XML. Unfortunately, the drop_tag() method is only available for elements in an HTML document parsed by lxml.html.

So for XML it's a bit more complicated:

doc.xml

<document>
    <node>This is <span>some</span> text.</node>
    <node>Only this <span attr="foo">first <b>span</b></span> should <span>be</span> removed.</node>
</document>

strip.py

from lxml import etree


def strip_nodes(nodes):
    for node in nodes:
        text_content = node.xpath('string()')

        # Include tail in full_text because it will be removed with the node
        full_text = text_content + (node.tail or '')

        parent = node.getparent()
        prev = node.getprevious()
        if prev is not None:
            # There is a previous node, append text to its tail
            prev.tail = (prev.tail or '') + full_text
        else:
            # It's the first node in <parent/>, append to parent's text
            parent.text = (parent.text or '') + full_text
        parent.remove(node)


doc = etree.parse('doc.xml')
nodes = doc.xpath("//span[@attr='foo']")
strip_nodes(nodes)

print(etree.tostring(doc, encoding='unicode'))

Output:

<document>
    <node>This is <span>some</span> text.</node>
    <node>Only this first span should <span>be</span> removed.</node>
</document>

As you can see, this will replace node and all its children with the recursive text content. I really hope that's what you want, otherwise things get even more complicated ;-)
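If the child elements should survive, as they do with drop_tag(), one possible approach for plain lxml.etree trees is sketched here. The helper name unwrap is made up for illustration, and the function assumes the element has a parent (it is not the root).

```python
from lxml import etree


def unwrap(el):
    """Remove el's tag but keep its text, children and tail (hypothetical helper)."""
    parent = el.getparent()
    prev = el.getprevious()
    # Leading text moves to the previous sibling's tail, or to the parent's text
    if el.text:
        if prev is None:
            parent.text = (parent.text or "") + el.text
        else:
            prev.tail = (prev.tail or "") + el.text
    # Trailing text moves after the last child, or to where the leading text went
    if el.tail:
        if len(el):
            el[-1].tail = (el[-1].tail or "") + el.tail
        elif prev is None:
            parent.text = (parent.text or "") + el.tail
        else:
            prev.tail = (prev.tail or "") + el.tail
    # Replace el with its own children, in place
    index = parent.index(el)
    parent[index:index + 1] = el[:]


root = etree.fromstring(
    '<node>Only this <span attr="foo">first <b>span</b></span>'
    ' should <span>be</span> removed.</node>'
)
for span in root.xpath(".//span[@attr='foo']"):
    unwrap(span)

result = etree.tostring(root, encoding="unicode")
print(result)
```

Unlike strip_nodes, this keeps the nested b element instead of flattening it to text.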

NOTE: The last edit changed the code in question.




Answer 2:


I just had the same problem and, after some consideration, came up with this rather hacky idea, borrowed from regex-ing markup in Perl one-liners: how about first catching all the unwanted elements with all the power that element.iterfind brings, renaming those elements to something unlikely, and then stripping all elements with that name?

Yes, this isn't absolutely clean and robust, since a document might actually use the "unlikely" tag name you've chosen, but the resulting code is rather clean and easily maintainable. If you really need to be sure that the "unlikely" name you've picked doesn't already exist in the document, you can check for its existence first and do the renaming only if you can't find any pre-existing tags of that name.

doc.xml

<document>
    <node>This is <span>some</span> text.</node>
    <node>Only this <span attr="foo">first <b>span</b></span> should <span>be</span> removed.</node>
</document>

strip.py

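from lxml import etree

xml = etree.parse("doc.xml")
deltag = "xxyyzzdelme"
# Rename every matching element to the placeholder tag
for el in xml.iterfind(".//span[@attr='foo']"):
    el.tag = deltag
# strip_tags() removes the placeholder tags but keeps their text and children
etree.strip_tags(xml, deltag)
print(etree.tostring(xml, encoding="unicode", pretty_print=True))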

Output

<document>
    <node>This is <span>some</span> text.</node>
    <node>Only this first <b>span</b> should <span>be</span> removed.</node>
</document>
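
The existence check suggested in this answer could be sketched roughly as follows. The helper name strip_matching and its parameters are made up for illustration; the only lxml calls involved are find, iterfind and etree.strip_tags.

```python
from lxml import etree


def strip_matching(tree, path, placeholder="xxyyzzdelme"):
    """Rename elements matched by path to placeholder, then strip that tag.

    Hypothetical helper: refuses to run if the placeholder is already in use.
    """
    root = tree.getroot()
    if root.tag == placeholder or root.find(".//" + placeholder) is not None:
        raise ValueError("placeholder tag already in use: " + placeholder)
    # Materialize the matches before renaming anything
    for el in list(root.iterfind(path)):
        el.tag = placeholder
    etree.strip_tags(tree, placeholder)


tree = etree.ElementTree(etree.fromstring(
    '<document><node>Only this <span attr="foo">first <b>span</b></span>'
    ' should <span>be</span> removed.</node></document>'
))
strip_matching(tree, ".//span[@attr='foo']")

result = etree.tostring(tree, encoding="unicode")
print(result)
```

Raising an error is only one option; generating a different placeholder name and retrying would work as well.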



Answer 3:


I have the same problem, but my scenario is a little simpler: I don't have to remove the tags, only empty them. Our users see the rendered HTML, so if I have, for example:

<div>Hello <strong>awesome</strong> World!</div>

I want to clear the strong tag matched by the CSS selector div > strong while keeping its tail text. In lxml, strip_tags cannot be applied by selector, only by tag name, which drives me crazy. Moreover, if you simply remove the <strong>awesome</strong> node, you also remove its tail, " World!", the text that follows the strong tag. The output will look like:

<div>Hello</div>

For me, this is fine:

<div>Hello <strong></strong> World!</div>

The user no longer sees "awesome".

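import lxml.html
import lxml.cssselect  # requires the separate cssselect package

markup = '<div>Hello <strong>awesome</strong> World!</div>'
doc = lxml.html.fromstring(markup)
selector = lxml.cssselect.CSSSelector('div > strong')
for el in list(selector(doc)):
    if el.tail:
        # clear() also resets the tail, so save and restore it
        tail = el.tail
        el.clear()
        el.tail = tail
    else:
        # If there is no tail, we can safely just remove the node
        el.getparent().remove(el)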

You can adapt the code to physically delete the strong tag by calling element.remove(child) and attaching its tail to the parent, but for my case that was unnecessary overhead.



Source: https://stackoverflow.com/questions/21685795/using-python-and-lxml-to-strip-only-the-tags-that-have-certain-attributes-values
