How to get the option text using BeautifulSoup

Question

I want to use BeautifulSoup to get the option text from the following HTML. For example, I'd like to get 2002/12, 2003/12, etc.

<select id="start_dateid">
<option value="0">2002/12</option>
<option value="1">2003/12</option>
<option value="2">2004/12</option>
<option value="3">2005/12</option>
<option value="4">2006/12</option>
<option value="5" selected="">2007/12</option>
<option value="6">2008/12</option>
<option value="7">2009/12</option>
<option value="8">2010/12</option>
<option value="9">2011/12</option>
</select>

What's the best way to get the contents? I'm currently using the code below, but I don't know how to restrict it to this one <select> with BeautifulSoup. If the HTML file contains more than one <select> element, the result will be incorrect. Here is what I have so far:

    import urllib2
    from bs4 import BeautifulSoup
    import lxml

    soup = BeautifulSoup(urllib2.urlopen("./test.html").read(),"lxml");
    for item in soup.find_all('option'):
            print(''.join(str(item.find(text=True))));

Answer 1:

You don't have to use lxml here. I have trouble installing it on my machine, so my answer does not make use of it.

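    from bs4 import BeautifulSoup as BS
    import urllib2

    # "html.parser" is the parser bundled with Python, so lxml is not required
    soup = BS(urllib2.urlopen("./test.html").read(), "html.parser")
    # Limit the search to the <select> with this id, then collect each option's text
    contents = [str(x.text) for x in soup.find(id="start_dateid").find_all('option')]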

This avoids the problem of multiple <select> elements in the HTML file, because the search is first limited to id="start_dateid". An id attribute must be unique within an HTML document, so in valid HTML this identifies the right <select>. We then search for <option> tags only within that <select> and take the text of each one.
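To illustrate the difference that scoping by id makes, here is a minimal sketch in Python 3. It parses an inline string instead of a file, uses the built-in "html.parser", and the second <select> with id "end_dateid" is a made-up element added only to show the problem described in the question.

```python
from bs4 import BeautifulSoup

# Hypothetical document with two <select> elements.
# "end_dateid" is invented for this illustration; it is not in the question.
HTML = """
<select id="start_dateid">
<option value="0">2002/12</option>
<option value="1" selected="">2003/12</option>
</select>
<select id="end_dateid">
<option value="0">2010/12</option>
<option value="1">2011/12</option>
</select>
"""

soup = BeautifulSoup(HTML, "html.parser")

# Unscoped search: picks up the options of every <select> in the document.
all_options = [opt.get_text(strip=True) for opt in soup.find_all("option")]

# Scoped search: locate the one <select> by id, then look only inside it.
start_select = soup.find("select", id="start_dateid")
start_options = [opt.get_text(strip=True) for opt in start_select.find_all("option")]

print(all_options)    # options from both <select> elements
print(start_options)  # options from "start_dateid" only
```

If the id might be missing from the page, soup.find returns None, so a real script would probably want to check for that before calling find_all on the result.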

Answer 2:

Just select the select tag instead, then loop over the contained string elements:

    import urllib2
    from bs4 import BeautifulSoup
    import lxml

    soup = BeautifulSoup(urllib2.urlopen("./test.html").read(),"lxml");
    select = soup.find('select', id="start_dateid")
    for value in select.stripped_strings:
        print value

This is a slight shortcut. You could instead loop over select.find_all('option') and get the .text property from each, but since no other elements are present anyway, why not go straight for the string iterable and be done with it? After all, only <option> and <optgroup> tags are permitted in a <select> tag, and only <option> tags hold text.

Output from the interactive interpreter:

    >>> select = soup.find('select', id="start_dateid")
    >>> for value in select.stripped_strings:
    ...     print value
    ... 
    2002/12
    2003/12
    2004/12
    2005/12
    2006/12
    2007/12
    2008/12
    2009/12
    2010/12
    2011/12

If you need to turn this into a list, simply use:

    values = list(select.stripped_strings)
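As a rough check that the shortcut and the explicit loop agree, here is a small Python 3 sketch. It assumes an inline HTML string rather than a file, uses the built-in "html.parser", and adds a hypothetical <optgroup> (with a made-up label) to show that a group label is an attribute and does not appear among the strings.

```python
from bs4 import BeautifulSoup

# Inline sample; the <optgroup> and its label are invented for illustration.
HTML = """
<select id="start_dateid">
  <optgroup label="Early">
    <option value="0">2002/12</option>
    <option value="1">2003/12</option>
  </optgroup>
  <option value="2" selected="">2004/12</option>
</select>
"""

soup = BeautifulSoup(HTML, "html.parser")
select = soup.find("select", id="start_dateid")

# Shortcut: every non-blank string inside the <select>, whitespace stripped.
values = list(select.stripped_strings)

# Explicit version: visit each <option> and take its text.
texts_via_options = [opt.get_text(strip=True) for opt in select.find_all("option")]

print(values)
print(values == texts_via_options)
```

The two lists match here because only the <option> tags contain text. If you also need the value attribute of each option, the explicit loop is the better fit, since stripped_strings yields strings only.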

Source: https://stackoverflow.com/questions/13555307/how-to-get-the-option-text-using-beautifulsoup

Tags

python

html-parsing

beautifulsoup