Parsing HTML with lxml

失恋的感觉 2020-12-28 18:36

I need help parsing out some text from a page with lxml. I tried BeautifulSoup, but the HTML of the page I am parsing is so broken that it wouldn't work. So I have moved on to

1 answer
  • 2020-12-28 18:52
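    import lxml.html as lh
    from urllib.request import urlopen  # Python 3 replacement for urllib2
    
    def text_tail(node):
        # Text inside the element, then the text that follows its closing tag.
        yield node.text
        yield node.tail
    
    url = 'http://bit.ly/bf1T12'
    doc = lh.parse(urlopen(url))
    blurb = []  # stays empty if the label cell is never found
    for elt in doc.iter('td'):
        if elt.text_content().startswith('Additional  Info'):
            # Gather every text fragment from the cells that follow the label,
            # skipping empty strings and lone non-breaking spaces.
            blurb = [piece for node in elt.itersiblings('td')
                     for subnode in node.iter()
                     for piece in text_tail(subnode) if piece and piece != '\xa0']
            break
    print('\n'.join(blurb))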
    

    yields

    For over 65 years, Carl Stirn's Marine has been setting new standards of excellence and service for boating enjoyment. Because we offer quality merchandise, caring, conscientious, sales and service, we have been able to make our customers our good friends.

    Our 26,000 sq. ft. facility includes a complete parts and accessories department, full service department (Merc. Premier dealer with 2 full time Mercruiser Master Tech's), and new, used, and brokerage sales.
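
To see why the answer's `text_tail` helper yields both `.text` and `.tail`, here is a minimal offline sketch. It uses the standard library's `xml.etree.ElementTree` rather than lxml, on the assumption that only the shared `.text`/`.tail` element model matters here. The `SAMPLE` fragment is hypothetical and is not taken from the page in the question.

```python
import xml.etree.ElementTree as ET

# Hypothetical cell content (an assumption for illustration only).
SAMPLE = "<td>First line<br/>Second <b>bold</b> line<br/>&#160;</td>"

def text_tail(node):
    # .text is the text before the first child; .tail is the text that
    # follows the element's closing tag, up to the next sibling.
    yield node.text
    yield node.tail

cell = ET.fromstring(SAMPLE)

# Walk the cell and its descendants in document order, keeping every
# non-empty fragment that is not a lone non-breaking space.
pieces = [piece
          for sub in cell.iter()
          for piece in text_tail(sub)
          if piece and piece != '\xa0']

print(pieces)
```

Text that comes after a `<br/>` is stored as that element's `.tail`, not as part of the parent's `.text`, which is why collecting `.text` alone would lose most of the blurb.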

    Edit: Here is an alternative solution, based on Steven D. Majewski's XPath expression, which addresses the OP's comment that the number of tags separating 'Additional Info' from the blurb may be unknown:

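    import lxml.html as lh
    from urllib.request import urlopen  # Python 3 replacement for urllib2
    
    url = 'http://bit.ly/bf1T12'
    doc = lh.parse(urlopen(url))
    
    # Select the text nodes of every <td> that follows the cell whose child
    # element has a text node equal to "Additional  Info".
    blurb = doc.xpath('//td[child::*[text()="Additional  Info"]]/following-sibling::td/text()')
    
    # Drop text nodes that are only a non-breaking space.
    blurb = [text for text in blurb if text != '\xa0']
    print('\n'.join(blurb))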
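
The XPath approach can be tried without fetching the page. The sketch below runs the same expression against a small inline table; the `SAMPLE` markup is a hypothetical stand-in and may not match the structure of the real page.

```python
import lxml.html as lh

# Hypothetical markup (an assumption): a label cell followed by a cell
# whose text is split up by <br> tags and ends with a non-breaking space.
SAMPLE = (
    "<table><tr>"
    "<td><b>Additional  Info</b></td>"
    "<td>First paragraph.<br>Second paragraph.<br>&nbsp;</td>"
    "</tr></table>"
)

doc = lh.fromstring(SAMPLE)

# Same expression as in the answer: text nodes of the <td> elements that
# follow the cell whose child element reads "Additional  Info".
cells = doc.xpath(
    '//td[child::*[text()="Additional  Info"]]/following-sibling::td/text()'
)

# Filter out the lone non-breaking space, as the answer does.
blurb = [text for text in cells if text != '\xa0']
print('\n'.join(blurb))
```

Because `text()` matches only the direct text nodes of the sibling cells, text nested inside further child elements would not be picked up by this expression; the loop-based version above handles that case.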
    