Question
I'm working in Python with HTML that looks like this. I'm parsing with lxml, but could equally happily use pyquery:
<p><span class="Title">Name</span>Dave Davies</p>
<p><span class="Title">Address</span>123 Greyfriars Road, London</p>
Pulling out 'Name' and 'Address' is dead easy, whatever library I use, but how do I get the remainder of the text - i.e. 'Dave Davies'?
Answer 1:
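Each Element can have a text and a tail attribute. text is the content before the element's first child; tail is the text that follows the element's closing tag, up to the next sibling or the end of the parent: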
import lxml.etree

content = '''\
<p><span class="Title">Name</span>Dave Davies</p>
<p><span class="Title">Address</span>123 Greyfriars Road, London</p>'''

# HTMLParser wraps the fragment in <html><body>, so search the whole tree
root = lxml.etree.fromstring(content, parser=lxml.etree.HTMLParser())
for elt in root.findall('.//span'):
    print(elt.text, elt.tail)
# Name Dave Davies
# Address 123 Greyfriars Road, London
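If the goal is a lookup from label to value, the text/tail pattern can be wrapped in a small helper. This is only a sketch: the function name title_pairs is made up for illustration, and it assumes every label sits in a span with class "Title" and that the value is the text directly after that span.

```python
import lxml.etree


def title_pairs(markup):
    """Map each <span class="Title"> label to the text that follows it."""
    root = lxml.etree.fromstring(markup, parser=lxml.etree.HTMLParser())
    pairs = {}
    for span in root.iter('span'):
        if span.get('class') != 'Title':
            continue
        # text and tail are None when empty, so fall back to ''
        label = (span.text or '').strip()
        value = (span.tail or '').strip()
        pairs[label] = value
    return pairs


content = '''\
<p><span class="Title">Name</span>Dave Davies</p>
<p><span class="Title">Address</span>123 Greyfriars Road, London</p>'''

result = title_pairs(content)
print(result)
```

Note that tail stops at the next sibling element, so a value that contains inline markup (a link, say) would be cut short by this sketch.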
Answer 2:
Another method, using XPath:
>>> from lxml import html
>>> doc = html.parse(file)  # file: path or open file object for the HTML document
>>> doc.xpath( '//span[@class="Title"][text()="Name"]/../self::p/text()' )
['Dave Davies']
>>> doc.xpath( '//span[@class="Title"][text()="Address"]/../self::p/text()' )
['123 Greyfriars Road, London']
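A variation on the XPath approach is to select the text node that follows the span instead of going back up to the parent. The sketch below is one way to do that; value_for is a hypothetical helper name, and the label is passed as an XPath variable ($label), which lxml's xpath() accepts as a keyword argument, so the expression does not have to be built by string formatting.

```python
from lxml import html


def value_for(doc, label):
    """Return the text right after the Title span with this label, or None."""
    hits = doc.xpath(
        '//span[@class="Title"][text()=$label]/following-sibling::text()[1]',
        label=label,
    )
    return hits[0].strip() if hits else None


content = '''\
<p><span class="Title">Name</span>Dave Davies</p>
<p><span class="Title">Address</span>123 Greyfriars Road, London</p>'''

doc = html.fromstring(content)
print(value_for(doc, 'Name'))
print(value_for(doc, 'Address'))
```

The difference from the /../self::p/text() form is that this picks only text after the span, so any text placed before the span inside the same p is ignored.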
Answer 3:
Have a look at BeautifulSoup. I've just started using it, so I'm no expert. Off the top of my head:
from bs4 import BeautifulSoup  # BeautifulSoup 4, installed as the bs4 package

text = '''<p><span class="Title">Name</span>Dave Davies</p>
<p><span class="Title">Address</span>123 Greyfriars Road, London</p>'''
soup = BeautifulSoup(text, 'html.parser')
paras = soup.find_all('p')
for para in paras:
    spantext = para.span.text
    # next_sibling is the text node right after </span>
    othertext = para.span.next_sibling
    print(spantext, othertext)
[Out]: Name Dave Davies
Address 123 Greyfriars Road, London
Source: https://stackoverflow.com/questions/3302248/python-parsing-lxml-to-get-just-part-of-a-tags-text