I need help parsing out some text from a page with lxml. I tried BeautifulSoup, but the HTML of the page I am parsing is so broken that it wouldn't work. So I have moved on to lxml:
import lxml.html as lh
import urllib2

def text_tail(node):
    # Yield both the text inside the element and the text that follows its closing tag.
    yield node.text
    yield node.tail

url = 'http://bit.ly/bf1T12'
doc = lh.parse(urllib2.urlopen(url))

for elt in doc.iter('td'):
    text = elt.text_content()
    if text.startswith('Additional Info'):
        # Collect the text/tail of every element in the following <td> siblings,
        # skipping empty strings and non-breaking spaces.
        blurb = [text for node in elt.itersiblings('td')
                 for subnode in node.iter()
                 for text in text_tail(subnode) if text and text != u'\xa0']
        break

print('\n'.join(blurb))
yields
For over 65 years, Carl Stirn's Marine has been setting new standards of excellence and service for boating enjoyment. Because we offer quality merchandise, caring, conscientious, sales and service, we have been able to make our customers our good friends.
Our 26,000 sq. ft. facility includes a complete parts and accessories department, full service department (Merc. Premier dealer with 2 full time Mercruiser Master Tech's), and new, used, and brokerage sales.
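Note that lxml keeps the text inside an element in .text and the text that comes after its closing tag in .tail, which is why the text_tail helper yields both. A minimal sketch with made-up markup, just to illustrate:

import lxml.html as lh

# Hypothetical fragment; the real page's markup is far messier.
frag = lh.fromstring('<div><b>Additional Info</b> trailing text</div>')
b = frag.find('b')
print(repr(b.text))   # 'Additional Info'  -- text inside <b>
print(repr(b.tail))   # ' trailing text'   -- text between </b> and </div>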
Edit: Here is an alternate solution based on Steven D. Majewski's XPath, which addresses the OP's comment that the number of tags separating 'Additional Info' from the blurb can be unknown:
import lxml.html as lh
import urllib2

url = 'http://bit.ly/bf1T12'
doc = lh.parse(urllib2.urlopen(url))

# Select the text of every <td> that follows the <td> containing 'Additional Info'.
blurb = doc.xpath('//td[child::*[text()="Additional Info"]]/following-sibling::td/text()')
blurb = [text for text in blurb if text != u'\xa0']
print('\n'.join(blurb))
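If you want to sanity-check the XPath without fetching the URL, here is a rough sketch against a made-up table fragment (the real page's structure may differ):

import lxml.html as lh

# Hypothetical stand-in for the real page's table.
snippet = '''<table><tr>
    <td><b>Additional Info</b></td>
    <td>First line of the blurb</td>
    <td>Second line of the blurb</td>
</tr></table>'''
doc = lh.fromstring(snippet)
print(doc.xpath('//td[child::*[text()="Additional Info"]]/following-sibling::td/text()'))
# ['First line of the blurb', 'Second line of the blurb']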