lxml classic: Get text content except for that of nested tags?

Question

This must be an absolute classic, but I can't find the answer here. I'm parsing the following tag with lxml cssselect:

<li><a href="/stations/1"><span class="num">3</span> Detroit</a></li>

I want to get the content of the <li> tag without the content of the <span> tag.

Currently I have:

stop_list = doc.cssselect('ol#stations li a')
start = stop_list[0].text_content().strip()

But that gives me 3 Detroit. How can I just get Detroit?

Answer 1:

The itertext() method of an element returns an iterator over the node's text data. For your <a> tag, ' Detroit' is the second value the iterator returns. If the structure of your document always conforms to a known specification, you can skip specific text items to get what you need.

from lxml import html

doc = html.fromstring("""<li><a href="/stations/1"><span class="num">3</span> Detroit</a></li>""")
stop_nodes = doc.cssselect('li a')
stop_names = []
for stop in stop_nodes:
    node_text = stop.itertext()
    next(node_text)  # Skip '3', the text inside <span>
    stop_names.append(next(node_text).lstrip())

If you are more comfortable with CSS selectors than XPath, you can combine a CSS selector with the XPath text() function mentioned in Zachary's answer. Note that xpath('text()') returns a list of strings, so join it before stripping:

stop_names = [''.join(a.xpath('text()')).lstrip() for a in doc.cssselect('li a')]
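The reason text() skips the <span> content is that, in the ElementTree model lxml follows, text after a child's closing tag is stored on that child's tail attribute, while text before the first child is stored on the parent's text attribute. A minimal sketch of collecting those pieces by hand follows; the helper name own_text is made up for illustration, and the stdlib fallback import is only there so the snippet also runs where lxml is not installed.

```python
try:
    from lxml import etree  # the library discussed in this thread
except ImportError:
    # Stdlib fallback; it uses the same text/tail model.
    import xml.etree.ElementTree as etree


def own_text(element):
    """Return the text directly inside `element`, skipping text inside child tags.

    `own_text` is a hypothetical helper name, not part of lxml.
    """
    # Text before the first child (None when the element starts with a child).
    pieces = [element.text or ""]
    # Text that follows each child's closing tag lives on that child's tail.
    for child in element:
        pieces.append(child.tail or "")
    return "".join(pieces)


root = etree.fromstring(
    '<li><a href="/stations/1"><span class="num">3</span> Detroit</a></li>'
)
link = root.find("a")
name = own_text(link).strip()
print(name)
```

This should give the same pieces as xpath('text()') for simple markup like the example, without depending on the position of the text inside the iterator.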

Answer 2:

I'm not very familiar with lxml, but this works in IDLE (v2.7.2). I think XPath is a better bet than CSS here:

>>> from lxml import etree
>>> xml = '<li><a href="/stations/1"><span class="num">3</span> Detroit</a></li>'
>>> root = etree.fromstring(xml)
>>> print(root.xpath('/li/a/text()'))
[' Detroit']

This appears to need less finagling after selection.

EDIT 1

Here's a slightly different example which may affect your decision:

>>> xml = '<li><a href="/stations/1">I <span>FooBar!</span> love <span class="num">3</span> Detroit</a></li>'
>>> root = etree.fromstring(xml)
>>> print(root.xpath('/li/a/text()'))
['I ', ' love ', ' Detroit']
>>> ' '.join([x.strip() for x in root.xpath('/li/a/text()')])
'I love Detroit'
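One caveat with stripping each piece and joining with a space: if a nested tag sits in the middle of a word, a space gets inserted where the original text had none. A possible alternative, sketched below, is to concatenate the pieces first and then collapse runs of whitespace. The helper name join_text_nodes is made up for illustration, and the input list is the text() result shown above.

```python
def join_text_nodes(pieces):
    """Concatenate text nodes, then collapse whitespace runs to single spaces.

    `join_text_nodes` is a hypothetical helper name used only in this sketch.
    """
    # str.split() with no arguments splits on any whitespace run and
    # drops leading/trailing whitespace, so no separate strip() is needed.
    return " ".join("".join(pieces).split())


# The list returned by root.xpath('/li/a/text()') in the example above.
pieces = ["I ", " love ", " Detroit"]
print(join_text_nodes(pieces))

# A word split by a nested tag stays in one piece.
print(join_text_nodes(["Det", "roit"]))
```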

I hope this helps,
Zachary

Source: https://stackoverflow.com/questions/8141956/lxml-classic-get-text-content-except-for-that-of-nested-tags

Tags

python

web-scraping

lxml