Question
This must be an absolute classic, but I can't find the answer here. I'm parsing the following tag with lxml cssselect:
<li><a href="/stations/1"><span class="num">3</span> Detroit</a></li>
I want to get the content of the <li> tag without the content of the <span> tag.
Currently I have:
stop_list = doc.cssselect('ol#stations li a')
start = stop_list[0].text_content().strip()
But that gives me 3 Detroit. How can I just get Detroit?
Answer 1:
The itertext method of an element returns an iterator over the node's text data. For your <a> tag, ' Detroit' would be the 2nd value returned by the iterator. If the structure of your document always conforms to a known specification, you could skip specific text elements to get what you need.
from lxml import html
doc = html.fromstring("""<li><a href="/stations/1"><span class="num">3</span> Detroit</a></li>""")
stop_nodes = doc.cssselect('li a')
stop_names = []
for start in stop_nodes:
    node_text = start.itertext()
    next(node_text)  # Skip '3' (the text inside <span>)
    stop_names.append(next(node_text).lstrip())  # ' Detroit' -> 'Detroit'
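To see why ' Detroit' comes second, it can help to look at how lxml stores the text. The following is a minimal sketch, using the tag from the question: the text after </span> is kept as the tail of the <span> element rather than as text of <a>. It uses xpath('//a') instead of cssselect only to keep the example free of the extra cssselect dependency.

```python
from lxml import html

doc = html.fromstring(
    '<li><a href="/stations/1"><span class="num">3</span> Detroit</a></li>'
)
link = doc.xpath('//a')[0]
span = link[0]  # the only child element of <a>

# itertext() walks the subtree in document order and yields each text piece.
pieces = list(link.itertext())

link_text = link.text  # text before the first child of <a>: None here
span_text = span.text  # text inside <span>: '3'
span_tail = span.tail  # text after </span>, still inside <a>: ' Detroit'

print(pieces)     # ['3', ' Detroit']
print(span_tail)  # ' Detroit'
```

Skipping by position works only as long as every link has exactly one <span> before the name; if that can vary, selecting the direct text nodes is less fragile.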
You can combine a CSS selector with the XPath text() function mentioned in Zachary's answer like this (if you're more comfortable with CSS selectors than XPath). Note that xpath('text()') returns a list of strings, so join it before stripping:
stop_names = [''.join(a.xpath('text()')).strip() for a in doc.cssselect('li a')]
Answer 2:
I'm not very familiar with lxml but this is working in IDLE (v2.7.2). I think going with XPath is a better bet than CSS:
>>> from lxml import etree
>>> xml = '<li><a href="/stations/1"><span class="num">3</span> Detroit</a></li>'
>>> root = etree.fromstring(xml)
>>> print(root.xpath('/li/a/text()'))
[' Detroit']
This appears to need less finagling after selection.
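For a whole list of stations, the same idea might look roughly like the sketch below. The markup is a hypothetical fragment modeled on the question's selector 'ol#stations li a' (the second station is made up for illustration), and the XPath expression is one possible equivalent of that selector, not the only one.

```python
from lxml import html

# Hypothetical fragment shaped like the page the question describes.
page = html.fromstring("""
<ol id="stations">
  <li><a href="/stations/1"><span class="num">3</span> Detroit</a></li>
  <li><a href="/stations/2"><span class="num">7</span> Ann Arbor</a></li>
</ol>
""")

# Rough XPath counterpart of the CSS selector 'ol#stations li a'.
links = page.xpath('//ol[@id="stations"]//li//a')

# text() selects only the direct text nodes of each <a>, so the <span> is skipped.
stop_names = [''.join(a.xpath('text()')).strip() for a in links]

print(stop_names)  # ['Detroit', 'Ann Arbor']
```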
EDIT 1
Here's a slightly different example which may affect your decision:
>>> xml = '<li><a href="/stations/1">I <span>FooBar!</span> love <span class="num">3</span> Detroit</a></li>'
>>> root = etree.fromstring(xml)
>>> print( root.xpath('/li/a/text()'))
['I ', ' love ', ' Detroit']
>>> ' '.join([x.strip() for x in root.xpath('/li/a/text()')])
'I love Detroit'
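If this comes up in more than one place, the join-and-strip step could be wrapped in a small helper. This is only a sketch: direct_text is a hypothetical name, not an lxml function, and it assumes that runs of whitespace should collapse to single spaces.

```python
from lxml import etree


def direct_text(element):
    """Return the element's own text, excluding text inside child elements."""
    pieces = element.xpath('text()')
    # Join first, then split on whitespace, so runs of spaces collapse to one.
    return ' '.join(''.join(pieces).split())


root = etree.fromstring(
    '<li><a href="/stations/1">I <span>FooBar!</span> love '
    '<span class="num">3</span> Detroit</a></li>'
)
link = root.find('a')

print(direct_text(link))  # I love Detroit
```

Joining the pieces before splitting means no space is inserted where a child element sits in the middle of a word, which stripping each piece and joining with ' ' would do.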
I hope this helps,
Zachary
Source: https://stackoverflow.com/questions/8141956/lxml-classic-get-text-content-except-for-that-of-nested-tags