Question
I am using Beautiful Soup in Python.
Here is an example URL:
http://www.locationary.com/place/en/US/Ohio/Middletown/McDonald%27s-p1013254580.jsp
In the HTML, there are a bunch of tags and the only way I can specify which ones to find is with their id. The only thing I want to find is the telephone number. The tag looks like this:
<td class="dispTxt" id="value_xxx_c_1_f_8_a_134242498">5134231582</td>
I have gone to other URLs on the same website and found almost the same id for the telephone number tag every time. The part that always stays the same is:
'value_xxx_c_1_f_8_a_'
However, the numbers that come after that always change. Is there a way to tell Beautiful Soup to match the fixed part of the id and accept any digits after it, the way a regular expression could?
Also, once I get the tag, I was wondering...how can I extract the phone number without using regular expressions? I don't know if Beautiful Soup can do that but it would probably be simpler than regex.
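Answer 1:
You can use regular expressions. Pass the compiled pattern as the id keyword argument so that it is matched against each element's id attribute; passed as the first positional argument, it would be matched against tag names instead:
import re
for tag in soup.find_all(id=re.compile("^value_xxx_c_1_f_8_a_")):
    print(tag.name, tag["id"])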
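As a minimal, self-contained sketch of matching on the id attribute, the snippet below parses a small hand-written table and collects every tag whose id is the fixed prefix followed by digits. Only the first cell comes from the question; the other two cells are made up for illustration. It assumes Beautiful Soup 4 and Python's built-in html.parser.

```python
import re

from bs4 import BeautifulSoup

# Sample markup: the first cell is from the question, the others are hypothetical.
HTML = """
<table><tr>
  <td class="dispTxt" id="value_xxx_c_1_f_8_a_134242498">5134231582</td>
  <td class="dispTxt" id="value_xxx_c_1_f_9_a_134242499">Fast food</td>
  <td class="dispTxt" id="other_134242500">ignored</td>
</tr></table>
"""

soup = BeautifulSoup(HTML, "html.parser")

# A keyword argument filters on the attribute of that name, so the
# pattern is tested against each tag's id rather than its tag name.
id_pattern = re.compile(r"^value_xxx_c_1_f_8_a_\d+$")
matches = soup.find_all(id=id_pattern)

for tag in matches:
    print(tag.name, tag["id"])
```

Anchoring the pattern with ^ and $ and requiring \d+ keeps similar-looking ids (such as the hypothetical f_9 cell) out of the result.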
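Answer 2:
This is covered in the Beautiful Soup documentation, which shows how to match an attribute against a regular expression:
http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html
# Beautiful Soup 3 spelling; Beautiful Soup 4 names this method find_all
soup.findAll(id=re.compile("para$"))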
# [<p id="firstpara" align="center">This is paragraph <b>one</b>.</p>,
# <p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>]
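If you would rather avoid regular expressions altogether, Beautiful Soup 4 also accepts a function as an attribute filter. The sketch below is one possible way to use that for the id from the question; the names ID_PREFIX and has_phone_id are hypothetical, and the surrounding markup is a trimmed-down stand-in for the real page.

```python
from bs4 import BeautifulSoup

ID_PREFIX = "value_xxx_c_1_f_8_a_"  # fixed part of the id, from the question


def has_phone_id(value):
    """Return True when an id value is the prefix followed by digits only."""
    # The filter may be called with None for tags that have no id attribute.
    return (
        value is not None
        and value.startswith(ID_PREFIX)
        and value[len(ID_PREFIX):].isdigit()
    )


html = '<tr><td class="dispTxt" id="value_xxx_c_1_f_8_a_134242498">5134231582</td></tr>'
soup = BeautifulSoup(html, "html.parser")

# find returns the first tag whose id makes the filter return True.
tag = soup.find(id=has_phone_id)
print(tag["id"])
```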
Answer 3:
You can use a CSS selector to match on an attribute value prefix:
soup.select('td[id^="value_xxx_c_1_f_8_a_"]')
This matches only <td> tags whose id attribute starts with the string value_xxx_c_1_f_8_a_.
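A short runnable sketch of the prefix selector, assuming Beautiful Soup 4 with its CSS selector support installed. The second cell in the sample markup is hypothetical and is there only to show that non-matching ids are skipped.

```python
from bs4 import BeautifulSoup

html = """
<table><tr>
  <td class="dispTxt" id="value_xxx_c_1_f_8_a_134242498">5134231582</td>
  <td class="dispTxt" id="unrelated_1">n/a</td>
</tr></table>
"""
soup = BeautifulSoup(html, "html.parser")

# [id^="..."] is the CSS attribute selector for "value starts with".
cells = soup.select('td[id^="value_xxx_c_1_f_8_a_"]')
phone_numbers = [cell.get_text(strip=True) for cell in cells]
print(phone_numbers)
```

Note that the selector checks only the prefix; unlike a regular expression, it does not require the rest of the id to be digits.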
If you are willing to switch to lxml instead, you can use an XPath 1.0 expression to find these:
from lxml import etree
# openfile is an open file object (or a filename) for the saved page
doc = etree.parse(openfile, etree.HTMLParser())
for elem in doc.xpath('//td[starts-with(@id, "value_xxx_c_1_f_8_a_")]'):
    print(elem.text)
An lxml XPath expression can be around an order of magnitude faster than a BeautifulSoup regular-expression match.
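The same lookup can be tried on an in-memory string rather than a file. This is a sketch that assumes lxml is installed; the page variable holds a hand-written fragment built around the tag from the question, not the real page.

```python
from lxml import html as lxml_html

page = (
    '<table><tr>'
    '<td class="dispTxt" id="value_xxx_c_1_f_8_a_134242498">5134231582</td>'
    '</tr></table>'
)

# fromstring parses an HTML string and returns an element.
root = lxml_html.fromstring(page)

# starts-with() is the XPath 1.0 function for prefix matching.
cells = root.xpath('//td[starts-with(@id, "value_xxx_c_1_f_8_a_")]')
phone_numbers = [cell.text_content().strip() for cell in cells]
print(phone_numbers)
```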
Answer 4:
To get the phone number, read the .text attribute of the tag once you have found it:
tag = soup.find("foo")  # "foo" is a placeholder for your actual lookup
phone_number = tag.text
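Putting the lookup and the text extraction together might look like the sketch below. The helper name extract_phone is hypothetical, and the sample strings are hand-written. It also guards against find returning None when no tag matches, which would otherwise raise an AttributeError on .text.

```python
import re

from bs4 import BeautifulSoup


def extract_phone(html):
    """Return the text of the first phone-number cell, or None if absent."""
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.find("td", id=re.compile(r"^value_xxx_c_1_f_8_a_"))
    if tag is None:  # find returns None when nothing matches
        return None
    # get_text(strip=True) returns the tag's text without surrounding whitespace.
    return tag.get_text(strip=True)


sample = '<tr><td class="dispTxt" id="value_xxx_c_1_f_8_a_134242498"> 5134231582 </td></tr>'
print(extract_phone(sample))
print(extract_phone("<p>no phone here</p>"))
```

No regular expression is needed to pull the number out of the tag; the pattern here is used only to locate the tag by its id.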
Source: https://stackoverflow.com/questions/11924135/how-to-use-beautiful-soup-to-find-a-tag-with-changing-id