Question
I'm familiar with etree's strip_tags and strip_elements methods, but I'm looking for a straightforward way of stripping tags (and leaving their contents) that only contain particular attributes/values.
For instance: I'd like to strip all span or div tags (or other elements) from a tree (xhtml) that have a class='myclass' attribute/value (preserving the element's contents like strip_tags would do). Meanwhile, those same elements that don't have class='myclass' should remain untouched.
Conversely: I'd like a way to strip all "naked" spans or divs from a tree. Meaning only those spans/divs (or any other elements for that matter) that have absolutely no attributes. Leaving those same elements that have attributes (any) untouched.
I feel I'm missing something obvious, but I've been searching without any luck for quite some time.
Answer 1:
HTML
lxml's HTML elements have a method drop_tag() which you can call on any element in a tree parsed by lxml.html.
It acts similarly to strip_tags in that it removes the element but retains the text, and it can be called on the element itself, which means you can easily select the elements you're not interested in with an XPath expression, and then loop over them and remove them:
doc.html
<html>
<body>
<div>This is some <span attr="foo">Text</span>.</div>
<div>Some <span>more</span> text.</div>
<div>Yet another line <span attr="bar">of</span> text.</div>
<div>This span will get <span attr="foo">removed</span> as well.</div>
<div>Nested elements <span attr="foo">will <b>be</b> left</span> alone.</div>
<div>Unless <span attr="foo">they <span attr="foo">also</span> match</span>.</div>
</body>
</html>
strip.py
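from lxml import etree
from lxml import html

doc = html.parse('doc.html')
# Select every <span> whose attr attribute equals "foo"
spans_with_attrs = doc.xpath("//span[@attr='foo']")
for span in spans_with_attrs:
    # Remove the tag itself but keep its text and children
    span.drop_tag()
print(etree.tostring(doc, encoding='unicode'))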
Output:
<html>
<body>
<div>This is some Text.</div>
<div>Some <span>more</span> text.</div>
<div>Yet another line <span attr="bar">of</span> text.</div>
<div>This span will get removed as well.</div>
<div>Nested elements will <b>be</b> left alone.</div>
<div>Unless they also match.</div>
</body>
</html>
In this case, the XPath expression //span[@attr='foo'] selects all the span elements with an attribute attr of value foo. See this XPath tutorial for more details on how to construct XPath expressions.
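For the two cases in the question, the same drop_tag() loop could be driven by different XPath expressions. The following is a minimal sketch, not a definitive solution: the markup and the unwrap helper are made up for illustration, and it assumes the class attribute holds exactly one class name.

```python
from lxml import html

# Hypothetical sample markup, not taken from the question
markup = (
    '<html><body>'
    '<div class="myclass">A <span class="myclass">B</span> C</div>'
    '<div class="other">D <span>E</span> <span id="x">F</span></div>'
    '</body></html>'
)

def unwrap(root, xpath_expr):
    # xpath() returns a list, so the tree can be modified while looping
    for el in root.xpath(xpath_expr):
        el.drop_tag()

root = html.document_fromstring(markup)
# Case 1: span/div elements whose class attribute is exactly "myclass"
unwrap(root, "//*[self::span or self::div][@class='myclass']")
# Case 2: "naked" span/div elements, i.e. those with no attributes at all
unwrap(root, "//*[self::span or self::div][not(@*)]")

result = html.tostring(root, encoding="unicode")
print(result)
```

If elements can carry several classes, the predicate contains(concat(' ', normalize-space(@class), ' '), ' myclass ') is a common replacement for @class='myclass'.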
XML / XHTML
Edit: I just noticed you specifically mention XHTML in your question, which according to the docs is better parsed as XML. Unfortunately, the drop_tag() method is only available for elements in an HTML document.
So for XML it's a bit more complicated:
doc.xml
<document>
<node>This is <span>some</span> text.</node>
<node>Only this <span attr="foo">first <b>span</b></span> should <span>be</span> removed.</node>
</document>
strip.py
from lxml import etree

def strip_nodes(nodes):
    for node in nodes:
        text_content = node.xpath('string()')
        # Include tail in full_text because it will be removed with the node
        full_text = text_content + (node.tail or '')
        parent = node.getparent()
        prev = node.getprevious()
        # Compare against None: an element without children is falsy
        if prev is not None:
            # There is a previous node, append text to its tail
            prev.tail = (prev.tail or '') + full_text
        else:
            # It's the first node in <parent/>, append to parent's text
            parent.text = (parent.text or '') + full_text
        parent.remove(node)

doc = etree.parse('doc.xml')
nodes = doc.xpath("//span[@attr='foo']")
strip_nodes(nodes)
print(etree.tostring(doc, encoding='unicode'))
Output:
<document>
<node>This is <span>some</span> text.</node>
<node>Only this first span should <span>be</span> removed.</node>
</document>
As you can see, this will replace node and all its children with the recursive text content. I really hope that's what you want, otherwise things get even more complicated ;-)
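If the child elements should survive rather than being flattened to text, one possible approach is to move the node's text, children and tail into the parent by hand. This is only a sketch under that assumption; the unwrap helper is a hypothetical name, not part of lxml.

```python
from lxml import etree

def unwrap(node):
    """Remove node from its parent but keep its text, children and tail."""
    parent = node.getparent()
    if parent is None:
        raise ValueError("cannot unwrap the root element")
    prev = node.getprevious()
    children = list(node)

    # The leading text joins the previous sibling's tail or the parent's text
    if node.text:
        if prev is not None:
            prev.tail = (prev.tail or "") + node.text
        else:
            parent.text = (parent.text or "") + node.text

    # The tail follows the last child, or goes where the text went
    if node.tail:
        if children:
            children[-1].tail = (children[-1].tail or "") + node.tail
        elif prev is not None:
            prev.tail = (prev.tail or "") + node.tail
        else:
            parent.text = (parent.text or "") + node.tail

    # Replace the node with its children at the same position
    index = parent.index(node)
    parent[index:index + 1] = children

doc = etree.fromstring(
    '<document><node>Only this <span attr="foo">first <b>span</b></span>'
    ' should <span>be</span> removed.</node></document>'
)
for node in doc.xpath("//span[@attr='foo']"):
    unwrap(node)
result = etree.tostring(doc, encoding="unicode")
print(result)
```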
NOTE: The last edit changed the code in question.
Answer 2:
I just had the same problem, and after some consideration had this rather hacky idea, which is borrowed from regex-ing markup in Perl one-liners: how about first catching all unwanted elements with all the power that element.iterfind brings, renaming those elements to something unlikely, and then stripping all those elements?
Yes, this isn't absolutely clean and robust, as you always might have a document that actually uses the "unlikely" tag name you've chosen, but the resulting code IS rather clean and easily maintainable. If you really need to be sure that whatever "unlikely" name you've picked doesn't already exist in the document, you can always check for its existence first, and do the renaming only if you can't find any pre-existing tags of that name.
doc.xml
<document>
<node>This is <span>some</span> text.</node>
<node>Only this <span attr="foo">first <b>span</b></span> should <span>be</span> removed.</node>
</document>
strip.py
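from lxml import etree

xml = etree.parse("doc.xml")
deltag = "xxyyzzdelme"
# Collect the matches first, then rename them to the placeholder tag
for el in list(xml.iterfind(".//span[@attr='foo']")):
    el.tag = deltag
# strip_tags removes the placeholder tags but keeps their text and children
etree.strip_tags(xml, deltag)
print(etree.tostring(xml, encoding="unicode", pretty_print=True))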
Output
<document>
<node>This is <span>some</span> text.</node>
<node>Only this first <b>span</b> should <span>be</span> removed.</node>
</document>
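The pre-check for an already existing placeholder name, mentioned above, might look roughly like this. It is a sketch only: strip_matching is a hypothetical helper name, and raising ValueError on a collision is an assumption about how one might want to handle it.

```python
from lxml import etree

def strip_matching(root, path, placeholder="xxyyzzdelme"):
    # Refuse to continue if the placeholder tag name is already in use
    if next(root.iter(placeholder), None) is not None:
        raise ValueError("placeholder tag %r already exists" % placeholder)
    # Rename the matches, then strip the placeholder tags in one pass
    for el in list(root.iterfind(path)):
        el.tag = placeholder
    etree.strip_tags(root, placeholder)

root = etree.fromstring(
    '<document><node>Only this <span attr="foo">first <b>span</b></span>'
    ' should <span>be</span> removed.</node></document>'
)
strip_matching(root, ".//span[@attr='foo']")
result = etree.tostring(root, encoding="unicode")
print(result)
```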
Answer 3:
I had the same problem, but my scenario is a little easier: I don't have to remove the tag, I only need to clear it, because our users see the rendered HTML. If I have, for example,
<div>Hello <strong>awesome</strong> World!</div>
I want to clear the strong tag matched by the CSS selector div > strong and keep its tail text. In lxml you can't use strip_tags or strip_elements (keeping the tail) with a selector; you can strip only by tag name, which drives me crazy. Moreover, if you just remove the <strong>awesome</strong> node, you also remove its tail, "World!", the text that follows the strong tag.
The output will be like:
<div>Hello</div>
For me this is fine:
<div>Hello <strong></strong> World!</div>
No more "awesome" for the user.
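import lxml.html
import lxml.cssselect  # requires the cssselect package

markup = '<div>Hello <strong>awesome</strong> World!</div>'
doc = lxml.html.fromstring(markup)
selector = lxml.cssselect.CSSSelector('div > strong')
for el in list(selector(doc)):
    if el.tail:
        # clear() also drops the tail, so save and restore it
        tail = el.tail
        el.clear()
        el.tail = tail
    else:
        # If there is no tail, we can safely just remove the node
        el.getparent().remove(el)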
You can adapt the code to physically delete the strong tag by calling element.remove(child) and attaching its tail to the parent, but in my case that was unnecessary overhead.
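A rough sketch of that adaptation follows. The remove_keep_tail helper is a hypothetical name, and an XPath expression stands in for the CSS selector here so that the example does not depend on the cssselect package.

```python
import lxml.html

def remove_keep_tail(el):
    """Delete el and its content, but move its tail text to a neighbour."""
    parent = el.getparent()
    prev = el.getprevious()
    if el.tail:
        if prev is not None:
            # Append to the tail of the previous sibling
            prev.tail = (prev.tail or "") + el.tail
        else:
            # No previous sibling: append to the parent's text
            parent.text = (parent.text or "") + el.tail
    parent.remove(el)

doc = lxml.html.fromstring("<div>Hello <strong>awesome</strong> World!</div>")
# "//div/strong" matches the same elements as the selector div > strong
for el in doc.xpath("//div/strong"):
    remove_keep_tail(el)
result = lxml.html.tostring(doc, encoding="unicode")
print(result)
```

Elements parsed with lxml.html also have a drop_tree() method that removes the element and joins its tail to the previous sibling or parent, so the helper is mainly useful for trees parsed with plain lxml.etree.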
Source: https://stackoverflow.com/questions/21685795/using-python-and-lxml-to-strip-only-the-tags-that-have-certain-attributes-values