Question
I want to use BeautifulSoup to get the option text from the following HTML. For example, I'd like to get 2002/12, 2003/12, etc.
<select id="start_dateid">
<option value="0">2002/12</option>
<option value="1">2003/12</option>
<option value="2">2004/12</option>
<option value="3">2005/12</option>
<option value="4">2006/12</option>
<option value="5" selected="">2007/12</option>
<option value="6">2008/12</option>
<option value="7">2009/12</option>
<option value="8">2010/12</option>
<option value="9">2011/12</option>
</select>
What's the best way to get the contents? I'm currently using the code below, but I don't know how to restrict BeautifulSoup to this one select element. If there is more than one select element in the HTML file, the result will be incorrect. Here is what I have so far:
import urllib2
from bs4 import BeautifulSoup
import lxml
soup = BeautifulSoup(urllib2.urlopen("./test.html").read(), "lxml")
for item in soup.find_all('option'):
    print(''.join(str(item.find(text=True))))
Answer 1:
You don't have to use lxml here. I have trouble installing it on my machine, so my answer does not make use of it.
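from bs4 import BeautifulSoup as BS
import urllib2
# Name the built-in parser explicitly so that lxml is not required
soup = BS(urllib2.urlopen("./test.html").read(), "html.parser")
# Find the <select> by id first, then collect the text of each <option> inside it
contents = [str(x.text) for x in soup.find(id="start_dateid").find_all('option')]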
With this, you avoid the issue of multiple select elements in the HTML file, because the search is first limited by id='start_dateid'. An id value must be unique within an HTML document, so in a valid document this gives you the right <select>. We then search for <option> tags only within that <select> tag and take the text of each <option>.
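As a rough Python 3 sketch of the same idea, the snippet below compares an unscoped search with one scoped by id. The inline HTML string and the second select with id `end_dateid` are made up for illustration, and `html.parser` is used so that lxml is not needed.

```python
from bs4 import BeautifulSoup

# Hypothetical document with two <select> elements; only the first is wanted.
html = """
<select id="start_dateid">
<option value="0">2002/12</option>
<option value="1">2003/12</option>
</select>
<select id="end_dateid">
<option value="0">2010/12</option>
</select>
"""

soup = BeautifulSoup(html, "html.parser")

# Unscoped search: picks up options from every <select> in the document.
all_options = [opt.get_text() for opt in soup.find_all("option")]

# Scoped search: locate the <select> by id first, then search only inside it.
select = soup.find(id="start_dateid")
contents = [opt.get_text() for opt in select.find_all("option")]

print(all_options)
print(contents)
```

The unscoped list mixes in the option from the second select, which is the incorrect result described in the question; the scoped list contains only the options under start_dateid.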
Answer 2:
Just find the <select> tag itself, then loop over the strings it contains:
import urllib2
from bs4 import BeautifulSoup
import lxml
soup = BeautifulSoup(urllib2.urlopen("./test.html").read(), "lxml")
select = soup.find('select', id="start_dateid")
for value in select.stripped_strings:
    print value
It is a slight shortcut; you could instead loop over select.find_all('option') and get the .text property from each, but since no other elements are present anyway, why not go straight for the string iterable and be done with it? After all, only <option> and <optgroup> tags are permitted in a <select> tag, and only <option> tags hold text.
Output from the interactive interpreter:
>>> select = soup.find('select', id="start_dateid")
>>> for value in select.stripped_strings:
...     print value
...
2002/12
2003/12
2004/12
2005/12
2006/12
2007/12
2008/12
2009/12
2010/12
2011/12
If you need to turn this into a list, simply use:
values = list(select.stripped_strings)
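To illustrate why the shortcut and the explicit loop give the same result, here is a small Python 3 sketch. The inline HTML with an `<optgroup>` wrapper is a made-up example, and `html.parser` is an assumed parser choice; the names `via_strings` and `via_options` are only illustrative.

```python
from bs4 import BeautifulSoup

# Made-up snippet: an <optgroup> wraps the options but carries no text itself.
html = """
<select id="start_dateid">
<optgroup label="Year end">
<option value="0">2002/12</option>
<option value="1" selected="">2003/12</option>
</optgroup>
</select>
"""

soup = BeautifulSoup(html, "html.parser")
select = soup.find("select", id="start_dateid")

# Shortcut: every non-whitespace string found anywhere inside the <select>.
via_strings = list(select.stripped_strings)

# Explicit route: the stripped text of each <option>.
via_options = [opt.get_text(strip=True) for opt in select.find_all("option")]

print(via_strings)
print(via_options)
```

Both lists should match here because the optgroup label is an attribute rather than text, so the only strings inside the select belong to the option tags.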
Source: https://stackoverflow.com/questions/13555307/how-to-get-the-option-text-using-beautifulsoup