Find specific link text with bs4

Submitted by 生来就可爱ヽ(ⅴ<●) on 2020-01-02 05:23:46

Question


I am trying to scrape a website and find all the headings of a feed. I am having trouble getting just the text of the a tags that I need. Here is an example of the HTML:

<td class="m" id="b1"><a href="/QSYcfT" id="c1" target="_blank" onClick="vPI('https://www.youtube.com/watch?v=BFNH-6K10Ic', 'QSYcfT', this.id); this.blur(); return false;">TF4 - Oreos</a> <a href="#" onClick="return lkP('1', 'QSYcfT');" id="x1"><font class="bp">(0)</font></a>
<td class="m" id="b2"><a href="/zXHNvp" id="c2" target="_blank" onClick="vPI('https://www.youtube.com/watch?v=0vjcGwZGBYI', 'zXHNvp', this.id); this.blur(); return false;">Awesome Game Boy Facts</a> <a href="#" onClick="return lkP('2', 'zXHNvp');" id="x2"><font class="bp">(0)</font></a>

I am trying to get the text of every a tag whose id starts with c (c1, c2, and so on) and print each one on a new line.

My output should look like this:

TF4 - Oreos
Awesome Game Boy Facts

So far I have tried:

soup = bs4.BeautifulSoup(html)
links = soup.find_all('a',{'id' : 'c'})
for link in links:
    print link.text

But it doesn't find or print anything.


Answer 1:


You can pass a regular expression in place of an attribute value:

links = soup.find_all('a', {'id': re.compile(r'^c\d+')})

^ anchors the match at the beginning of the string, and \d+ matches one or more digits, so the pattern matches ids such as c1 and c2 but not x1 or a bare c.
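To see what the pattern accepts without involving any HTML, here is a minimal sketch using only the re module. The id values in the list are made up for illustration, and the looser ^c pattern is included for comparison. Beautiful Soup applies a compiled pattern with its search() method, so the sketch does the same.

```python
import re

# Hypothetical id values, chosen only to illustrate the two patterns.
ids = ["c1", "c22", "x1", "b1", "c", "cx", "abc3"]

digits_pattern = re.compile(r"^c\d+")  # "c" followed by at least one digit
prefix_pattern = re.compile(r"^c")     # anything that starts with "c"

# Beautiful Soup filters with the pattern's search() method, so mirror that here.
matches_digits = [i for i in ids if digits_pattern.search(i)]
matches_prefix = [i for i in ids if prefix_pattern.search(i)]

print(matches_digits)  # ['c1', 'c22']
print(matches_prefix)  # ['c1', 'c22', 'c', 'cx']
```

The stricter pattern is the safer choice if the page might also contain ids that begin with c but are not part of the feed.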

Demo:

>>> import re
>>> from bs4 import BeautifulSoup
>>> 
>>> html = """
... <tr>
...     <td class="m" id="b1"><a href="/QSYcfT" id="c1" target="_blank" onClick="vPI('https://www.youtube.com/watch?v=BFNH-6K10Ic', 'QSYcfT', this.id); this.blur(); return false;">TF4 - Oreos</a> <a href="#" onClick="return lkP('1', 'QSYcfT');" id="x1"><font class="bp">(0)</font></a></td>
...     <td class="m" id="b2"><a href="/zXHNvp" id="c2" target="_blank" onClick="vPI('https://www.youtube.com/watch?v=0vjcGwZGBYI', 'zXHNvp', this.id); this.blur(); return false;">Awesome Game Boy Facts</a> <a href="#" onClick="return lkP('2', 'zXHNvp');" id="x2"><font class="bp">(0)</font></a></td>
... </tr>
... """
>>> soup = BeautifulSoup(html, 'html.parser')
>>> links = soup.find_all('a', {'id': re.compile(r'^c\d+')})
>>> for link in links:
...     print(link.text)
... 
TF4 - Oreos
Awesome Game Boy Facts



Answer 2:


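There is no a tag whose id is exactly c; the ids are c1 and c2. A plain string is compared against the whole attribute value, so this finds only the first link: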

links = soup.find_all('a',{'id' : 'c1'})

If you want to find every a tag whose id starts with c, you need to pass a regular expression:

import re

links = soup.find_all('a', {'id': re.compile(r'^c')})
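The difference between the exact-string filter and the regular-expression filter can be checked side by side. The following is a rough sketch, assuming Beautiful Soup 4 is installed. The markup is a trimmed version of the question's HTML (the onClick attributes are dropped and a table wrapper is added), and the variable names are illustrative only.

```python
import re
from bs4 import BeautifulSoup

# Trimmed-down version of the question's markup (onClick attributes omitted).
html = """
<table><tr>
<td class="m" id="b1"><a href="/QSYcfT" id="c1">TF4 - Oreos</a> <a href="#" id="x1">(0)</a></td>
<td class="m" id="b2"><a href="/zXHNvp" id="c2">Awesome Game Boy Facts</a> <a href="#" id="x2">(0)</a></td>
</tr></table>
"""

soup = BeautifulSoup(html, "html.parser")

# A plain string must equal the whole id value, so "c" matches nothing.
exact_c = soup.find_all("a", {"id": "c"})
# "c1" matches exactly one tag.
exact_c1 = soup.find_all("a", {"id": "c1"})
# A compiled pattern matches every id that starts with "c".
by_prefix = soup.find_all("a", {"id": re.compile(r"^c")})

print(len(exact_c))                         # 0
print([a.get_text() for a in exact_c1])     # ['TF4 - Oreos']
print([a.get_text() for a in by_prefix])    # ['TF4 - Oreos', 'Awesome Game Boy Facts']
```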



Answer 3:


You can pass a compiled regular expression object in the call to find_all():

import re
import bs4

html = '''
<td class="m" id="b1"><a href="/QSYcfT" id="c1" target="_blank" onClick="vPI('https://www.youtube.com/watch?v=BFNH-6K10Ic', 'QSYcfT', this.id); this.blur(); return false;">TF4 - Oreos</a> <a href="#" onClick="return lkP('1', 'QSYcfT');" id="x1"><font class="bp">(0)</font></a>
<td class="m" id="b2"><a href="/zXHNvp" id="c2" target="_blank" onClick="vPI('https://www.youtube.com/watch?v=0vjcGwZGBYI', 'zXHNvp', this.id); this.blur(); return false;">Awesome Game Boy Facts</a> <a href="#" onClick="return lkP('2', 'zXHNvp');" id="x2"><font class="bp">(0)</font></a>
'''

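# Name the parser explicitly so the result does not depend on which parsers are installed
soup = bs4.BeautifulSoup(html, 'html.parser')
for link in soup.find_all('a', {'id': re.compile(r'^c')}):
    # get_text() concatenates every text node inside the matched tag
    print(link.get_text())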

Output:

TF4 - Oreos
Awesome Game Boy Facts


Source: https://stackoverflow.com/questions/23762464/find-specific-link-text-with-bs4
