BeautifulSoup 4 + Python: .string returns 'None'

Submitted by …衆ロ難τιáo~ on 2019-12-24 10:47:17

Question


I'm trying to parse some HTML with BeautifulSoup 4 and Python 2.7.6, but .string is returning None. The HTML I'm trying to parse is:

<div class="booker-booking">
    2&nbsp;rooms
    &#0183;
    USD&nbsp;0
    <!-- Commission: USD  -->
</div>

The Python snippet I have is:

data = soup.find('div', class_='booker-booking').string

I've also tried the following two:

data = soup.find('div', class_='booker-booking').text
data = soup.find('div', class_='booker-booking').contents[0]

Both of these return:

u'\n\t\t2\xa0rooms \n\t\t\xb7\n\t\tUSD\xa00\n\t\t\n

I'm ultimately trying to get the first line into a variable just saying "2 Rooms", and the third line into another variable just saying "USD 0".


Answer 1:


.string returns None because the text is not the div's only child: the div also contains a comment node. A tag's .string is only set when the tag has exactly one child.
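A minimal sketch of that behavior, assuming bs4 is installed and using two made-up divs (the class names "a" and "b" are illustrative, not from the question):

```python
from bs4 import BeautifulSoup

# Two illustrative divs: one with a single text child,
# one with a text child followed by a comment.
html = '<div class="a">only text</div><div class="b">text<!-- note --></div>'
soup = BeautifulSoup(html, "html.parser")

single = soup.find("div", class_="a")
mixed = soup.find("div", class_="b")

print(single.string)        # the lone text child
print(mixed.string)         # None, because there are two children
print(len(mixed.contents))  # text node + comment node
```

The second div mirrors the situation in the question: as soon as a comment sits next to the text, .string has no single child to return.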

from bs4 import BeautifulSoup, Comment

# html holds the markup from the question
soup = BeautifulSoup(html, 'html.parser')
div = soup.find('div', 'booker-booking')
# keep the text nodes, skip the comment node
text = " ".join(div.find_all(text=lambda t: not isinstance(t, Comment)))
# -> u'\n    2\xa0rooms\n    \xb7\n    USD\xa00\n     \n'

To collapse all whitespace, including the non-breaking spaces (\xa0), into single spaces:

text = " ".join(text.split())
# -> u'2 rooms \xb7 USD 0'
print(text)
# -> 2 rooms · USD 0

To get your final variables:

var1, var2 = [s.strip() for s in text.split(u"\xb7")]
# -> u'2 rooms', u'USD 0'
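The unpacking above raises a ValueError if the separator is missing or appears more than once. As a hedged alternative, here is a small sketch built on str.partition; the helper name split_booking is hypothetical and not part of the original answer:

```python
def split_booking(text, sep=u"\xb7"):
    """Return (rooms, price), or None if the separator is missing."""
    # partition() always returns three parts; the middle one is empty
    # when the separator does not occur in the text.
    left, found, right = text.partition(sep)
    if not found:
        return None
    return left.strip(), right.strip()


print(split_booking(u"2 rooms \xb7 USD 0"))
print(split_booking(u"no separator here"))
```

Whether returning None or raising is the better failure mode depends on how the scraped values are used downstream.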



Answer 2:


Once you have run data = soup.find('div', class_='booker-booking').text, you have already extracted the data you need from the HTML. Now you just need to format it to get "2 Rooms" and "USD 0". The first step is to split the data by line:

lines = data.split('\n')

This gives [u'', u'\t\t2\xa0rooms ', u'\t\t\xb7', u'\t\tUSD\xa00', u'\t\t', u'']

Now strip the surrounding whitespace and drop the lines that carry no data (the separator line and the blank lines are all 3 characters or shorter). BeautifulSoup has already decoded the HTML entities, so the unescape call is a safeguard rather than a requirement:

import HTMLParser  # Python 2 module; in Python 3 use html.unescape instead
h = HTMLParser.HTMLParser()
formatted_lines = [h.unescape(line).strip() for line in lines if len(line) > 3]

You will be left with the data you want:

print(formatted_lines[0])
# -> 2 rooms
print(formatted_lines[1])
# -> USD 0
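The length check above works for this input but would also drop any short data line. A sketch of a variant that filters explicitly instead, using only the standard library; the sample string is an assumption reconstructed from the output shown in the question, and the name fields is illustrative:

```python
# Assumed sample, reconstructed from the .text output in the question.
data = u"\n\t\t2\xa0rooms \n\t\t\xb7\n\t\tUSD\xa00\n\t\t\n"

# Strip tabs and spaces from each line.
lines = [line.strip() for line in data.splitlines()]

# Drop empty lines and the middle-dot separator, then turn the
# non-breaking spaces into ordinary spaces.
fields = [
    line.replace(u"\xa0", u" ")
    for line in lines
    if line and line != u"\xb7"
]

print(fields[0])
print(fields[1])
```

Replacing \xa0 here also means the results compare equal to plain strings such as "USD 0", which the stripped lines alone would not.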


Source: https://stackoverflow.com/questions/20750852/beautifulsoup-4-python-string-returns-none
