Suggestions on get_text() in BeautifulSoup

遇见更好的自我 2020-12-30 08:54

I am using BeautifulSoup to parse some content from an HTML page.

I can extract from the html the content I want (i.e. the text contained in a span defin

3 Answers
  •  無奈伤痛
    2020-12-30 09:20

    Use 'contents', then replace <br>?

    Here is a full (working, tested) example:

    # Python 2 example (urllib2 and print statements).
    from bs4 import BeautifulSoup
    import urllib2

    # Fetch and parse the test page.
    url = "http://www.floris.us/SO/bstest.html"
    page = urllib2.urlopen(url)
    soup = BeautifulSoup(page.read())

    # Find the first element whose class is 'myclass'.
    result = soup.find(attrs={'class': 'myclass'})
    print "The result of soup.find:"
    print result

    print "\nresult.contents:"
    print result.contents
    print "\nresult.get_text():"
    print result.get_text()

    # Children whose .string is None (such as an empty tag)
    # get a single space as their text.
    for r in result:
        if r.string is None:
            r.string = ' '

    print "\nAfter replacing all the 'None' with ' ':"
    print result.get_text()
    

    Result:

    The result of soup.find:
    Lorem ipsum
    dolor sit amet,
    consectetur...

    result.contents:
    [u'Lorem ipsum', <br/>, u'dolor sit amet,', <br/>, u'consectetur...']

    result.get_text():
    Lorem ipsumdolor sit amet,consectetur...

    After replacing all the 'None' with ' ':
    Lorem ipsum dolor sit amet, consectetur...

    This is more elaborate than Sean's very compact solution, but since I had said I would create and test a solution along the lines I had indicated when I could, I decided to follow through on my promise. You can see a little better what is going on here: the <br> is its own element in the result.contents list, but when converted to a string there's "nothing left".
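
For readers on Python 3, the same idea can be sketched without a network request. This is only an illustration under stated assumptions: the HTML string below is a hypothetical stand-in for the test page (a span with class 'myclass' whose lines are separated by <br/> tags), and the helper names text_with_spaces and text_with_separator are made up for this sketch. The first helper replaces each line-break tag with a space before calling get_text(); the second leaves the tree untouched and passes a separator to get_text() instead.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the page fetched in the answer above.
HTML = '<span class="myclass">Lorem ipsum<br/>dolor sit amet,<br/>consectetur...</span>'


def text_with_spaces(html):
    """Return the span's text with each <br> replaced by a single space."""
    soup = BeautifulSoup(html, "html.parser")
    span = soup.find(attrs={"class": "myclass"})
    # find_all returns a list, so replacing tags while looping over it is safe.
    for br in span.find_all("br"):
        br.replace_with(" ")
    return span.get_text()


def text_with_separator(html):
    """Alternative: leave the tree alone and let get_text() join the strings."""
    soup = BeautifulSoup(html, "html.parser")
    span = soup.find(attrs={"class": "myclass"})
    return span.get_text(separator=" ")


if __name__ == "__main__":
    print(text_with_spaces(HTML))
    print(text_with_separator(HTML))
```

Both helpers give the same text for this input. The separator variant is shorter, but it inserts the separator between every pair of adjacent strings, not only where a line break was, so the replace-based variant may be the better fit when the span also contains inline tags such as bold or links.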
