Parsing HTML with BeautifulSoup

遇见更好的自我 2020-12-20 01:36

[image]

(Picture is small, here is another link: http://i.imgur.com/OJC0A.png)

2 Answers
  • 2020-12-20 02:08

    The documentation at http://www.crummy.com/software/BeautifulSoup/bs4/doc/#strings-and-stripped-strings suggests that .strings is what you want: it is a generator that yields each string within the object. So if you turn it into a list and take the last item, you should get what you want. For example:

    $ python
    >>> import bs4
    >>> text = '<div style="mine"><div>unwanted</div>wanted</div>'
    >>> soup = bs4.BeautifulSoup(text, "html.parser")  # name the parser explicitly to avoid the "no parser specified" warning
    >>> soup.find_all("div", style="mine")[0].text
    u'unwantedwanted'
    >>> list(soup.find_all("div", style="mine")[0].strings)[-1]
    u'wanted'
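
One caveat with taking the last item of .strings: if the markup has whitespace or newlines before the closing tag, the last string may be blank or padded. A minimal sketch using .stripped_strings instead, which skips whitespace-only strings and strips the rest. The helper name `last_text` and the sample markup are made up for illustration:

```python
from bs4 import BeautifulSoup


def last_text(html, style):
    """Return the last non-blank string inside the first <div> with the given style.

    Hypothetical helper, not part of BeautifulSoup.
    """
    soup = BeautifulSoup(html, "html.parser")
    div = soup.find("div", style=style)
    if div is None:
        return None
    # stripped_strings yields every string in the subtree, whitespace-stripped,
    # and drops strings that are whitespace only.
    strings = list(div.stripped_strings)
    return strings[-1] if strings else None


sample = '<div style="mine"><div>unwanted</div>\n wanted \n</div>'
print(last_text(sample, "mine"))  # wanted
```

Note that this still returns the last string anywhere in the subtree, so it only gives the trailing text when nothing nested comes after it.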
    
  • 2020-12-20 02:08

    To get the text in the tail of div.tiny:

    review = soup.find("div", "tiny").find_next_sibling(string=True)
    

    Full example:

    #!/usr/bin/env python
    from bs4 import BeautifulSoup
    
    html = """<div style="margin-left:0.5em;">
    <div style="margin-bottom:0.5em;">
       9 of 35 people found the following review helpful </div>
    <div style="margin-bottom:0.5em;">
    <div style="margin-bottom:0.5em;">
    <div class="tiny" style="margin-bottom:0.5em;">
    <b>
        <span class="h3color tiny">This review is from: </span>
        <a href="http://rads.stackoverflow.com/amzn/click/B005C7QVUE">
         A Dance with Dragons: A Song of Ice and Fire: Book 5 (Audible Audio Edition)</a>
    </b>
    </div>
    That is true. I tried it myself this morning. There's a little note on the Audible site that says "a few titles will require two credits" or something like that. A Dance with Dragons is one of those few."""
    
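    soup = BeautifulSoup(html, "html.parser")  # name the parser explicitly
    # find_next_sibling(string=True) returns the first text node that follows div.tiny
    review = soup.find("div", "tiny").find_next_sibling(string=True)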
    print(review)
    

    Output

    
    That is true. I tried it myself this morning. There's a little note on the Audible site that says "a few titles will require two credits" or something like that. A Dance with Dragons is one of those few.
    

    Here's equivalent lxml code that produces the same output:

    import lxml.html
    
    doc = lxml.html.fromstring(html)
    print(doc.find(".//div[@class='tiny']").tail)
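
The two approaches differ slightly in one case: lxml's .tail is only the text directly after the element, while a sibling search for a string will, as far as I can tell, skip over intervening tags and return later text. A rough sketch of a BeautifulSoup helper that mimics .tail by stopping at the first following tag. The name `tail_text` and the sample markup are hypothetical:

```python
from bs4 import BeautifulSoup, NavigableString


def tail_text(tag):
    """Collect the text that directly follows `tag`, stopping at the next tag.

    Hypothetical helper that approximates lxml's .tail; returns "" when the
    element is immediately followed by another tag or by nothing.
    """
    parts = []
    for sibling in tag.next_siblings:
        if not isinstance(sibling, NavigableString):
            break  # reached the next element, so the tail ends here
        parts.append(str(sibling))
    return "".join(parts).strip()


sample = '<div><div class="tiny"><b>title</b></div> review text <p>next</p></div>'
soup = BeautifulSoup(sample, "html.parser")
print(tail_text(soup.find("div", "tiny")))  # review text
```

Comments in the markup are also NavigableString instances, so this sketch would include their contents; filter them out if that matters for your pages.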
    