Python BeautifulSoup Extract specific URLs

Submitted by Deadly on 2020-05-26 12:28:49

Question


Is it possible to get only specific URLs?

Like:

<a href="http://www.iwashere.com/washere.html">next</a>
<span class="class">...</span>
<a href="http://www.heelo.com/hello.html">next</a>
<span class="class">...</span>
<a href="http://www.iwashere.com/wasnot.html">next</a>
<span class="class">...</span>

The output should contain only URLs from http://www.iwashere.com/

For example, the output URLs would be:

http://www.iwashere.com/washere.html
http://www.iwashere.com/wasnot.html

I did it by string logic. Is there any direct method using BeautifulSoup?


Answer 1:


find_all() can filter on several aspects of a tag at once, including a regular expression for an attribute value:

import re
soup.find_all('a', href=re.compile(r'^http://www\.iwashere\.com/'))

which matches (for your example):

[<a href="http://www.iwashere.com/washere.html">next</a>, <a href="http://www.iwashere.com/wasnot.html">next</a>]

That is, any <a> tag whose href attribute value starts with the string http://www.iwashere.com/.

You can loop over the results and pick out just the href attribute:

>>> for elem in soup.find_all('a', href=re.compile(r'^http://www\.iwashere\.com/')):
...     print(elem['href'])
... 
http://www.iwashere.com/washere.html
http://www.iwashere.com/wasnot.html
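
For reference, here is a minimal self-contained sketch of the same approach, runnable on Python 3 with the bs4 package installed. The markup is the snippet from the question. The built-in "html.parser" and the names html, pattern and urls are assumptions made for illustration, not part of the original answer.

```python
import re

from bs4 import BeautifulSoup

# Sample markup taken from the question.
html = """
<a href="http://www.iwashere.com/washere.html">next</a>
<span class="class">...</span>
<a href="http://www.heelo.com/hello.html">next</a>
<span class="class">...</span>
<a href="http://www.iwashere.com/wasnot.html">next</a>
<span class="class">...</span>
"""

# "html.parser" ships with Python; any parser supported by bs4 would do.
soup = BeautifulSoup(html, "html.parser")

# Anchor with ^ so the prefix must appear at the start of the href value.
pattern = re.compile(r"^http://www\.iwashere\.com/")

# Keep only the href values of the matching <a> tags.
urls = [a["href"] for a in soup.find_all("a", href=pattern)]
print(urls)
```

Beautiful Soup applies the compiled pattern with search(), so without the leading ^ the prefix could match anywhere inside the attribute value.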

To match all relative paths instead, use a negative look-ahead assertion that tests that the value does not start with a scheme (e.g. http: or mailto:) or a double slash (//hostname/path); any value that passes this test must be a relative path:

soup.find_all('a', href=re.compile(r'^(?!(?:[a-zA-Z][a-zA-Z0-9+.-]*:|//))'))
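
To see what that pattern accepts, it can be tried on its own with the standard re module. This is only a sketch: the sample href values below are hypothetical and were chosen to exercise each branch of the look-ahead.

```python
import re

# Same pattern as above: reject values that start with a scheme or with "//".
relative = re.compile(r'^(?!(?:[a-zA-Z][a-zA-Z0-9+.-]*:|//))')

# Hypothetical href values, one per case of interest.
samples = [
    "washere.html",                          # plain relative path
    "/docs/index.html",                      # root-relative path
    "../up.html",                            # parent-relative path
    "http://www.iwashere.com/washere.html",  # has a scheme
    "mailto:someone@example.com",            # has a scheme
    "//www.iwashere.com/washere.html",       # protocol-relative URL
]

# Keep only the values the pattern treats as relative.
matches = [href for href in samples if relative.search(href)]
print(matches)
```

Only the first three samples should pass. Note that the pattern also accepts fragment-only and query-only values such as #top or ?page=2, since those start with neither a scheme nor a double slash.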



Answer 2:


If you're using BeautifulSoup 4.0.0 or greater:

soup.select('a[href^="http://www.iwashere.com/"]')
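
select() returns the matching tags rather than the URLs, so the href still has to be read from each result. A minimal sketch, assuming Python 3 with bs4 installed, reusing the markup from the question (the parser choice and variable names are illustrative):

```python
from bs4 import BeautifulSoup

# Sample markup taken from the question.
html = """
<a href="http://www.iwashere.com/washere.html">next</a>
<span class="class">...</span>
<a href="http://www.heelo.com/hello.html">next</a>
<span class="class">...</span>
<a href="http://www.iwashere.com/wasnot.html">next</a>
<span class="class">...</span>
"""

soup = BeautifulSoup(html, "html.parser")

# The CSS attribute selector ^= means "attribute value begins with".
links = soup.select('a[href^="http://www.iwashere.com/"]')

# Pull the URL out of each matching tag.
urls = [a["href"] for a in links]
print(urls)
```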


Source: https://stackoverflow.com/questions/15313250/python-beautifulsoup-extract-specific-urls
