Print URL from two different BeautifulSoup outputs

Submitted by 时光毁灭记忆、已成空白 on 2019-12-25 03:30:43

Question


I am scraping a few URLs in batch using BeautifulSoup.

Here is my script (only relevant stuff):

import urllib2
from bs4 import BeautifulSoup
quote_page = 'https://example.com/foo/bar'
page = urllib2.urlopen(quote_page)
soup = BeautifulSoup(page, 'html.parser')
url_box = soup.find('div', attrs={'class': 'player'})
print url_box

This prints one of two kinds of output depending on the page's HTML (about half the pages give the first kind and the rest give the second).

Here's the first kind of output:

<div class="player">
<video class="video-js vjs-fluid video-player" height="100%" id="some-player" poster="https://example.com/path/to/jpg/random.jpg" width="100%"></video>
<span data-type="trailer-src" data-url="https://example.com/path/to/mp4/random.mp4"></span>
</div>

And here's the other:

<div class="player">
<img alt="Image description here" src="https://example.com/path/to/jpg/random.jpg"/>
</div>

I want to extract the image URL, which is the poster attribute in the first and the src attribute in the second.

Any ideas how I can do that so the same script extracts that URL from either kind of output?

P.S. The first output also has an mp4 link, which I do not need.


Answer 1:


You can use the get() method to read the value of an attribute from the targeted tag.

You should be able to do something like this:

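# Video pages: the image URL is the poster attribute of the <video> tag.
if url_box.find('video'):
    url = url_box.find('video').get('poster')
    # Trailer URL; not needed for the image, shown only for completeness.
    mp4 = url_box.find('span').get('data-url')
# Image pages: the image URL is the src attribute of the <img> tag.
if url_box.find('img'):
    url = url_box.find('img').get('src')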
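
The same checks can be wrapped in a small helper that takes the HTML and returns the image URL, or None when neither tag is present. This is only a minimal sketch, written for Python 3 and assuming the markup looks like the two samples in the question. The name extract_image_url is hypothetical and not part of BeautifulSoup, and the sample strings are shortened copies of the question's output.

```python
from bs4 import BeautifulSoup


def extract_image_url(html):
    """Return the poster/src URL inside <div class="player">, or None.

    Hypothetical helper for illustration; assumes the two layouts
    shown in the question.
    """
    soup = BeautifulSoup(html, 'html.parser')
    url_box = soup.find('div', attrs={'class': 'player'})
    if url_box is None:
        return None

    # First layout: <video poster="...">
    video = url_box.find('video')
    if video is not None and video.get('poster'):
        return video.get('poster')

    # Second layout: <img src="...">
    img = url_box.find('img')
    if img is not None:
        return img.get('src')

    return None


first = '''<div class="player">
<video class="video-js" poster="https://example.com/path/to/jpg/random.jpg"></video>
<span data-type="trailer-src" data-url="https://example.com/path/to/mp4/random.mp4"></span>
</div>'''

second = '''<div class="player">
<img alt="Image description here" src="https://example.com/path/to/jpg/random.jpg"/>
</div>'''

print(extract_image_url(first))
print(extract_image_url(second))
```

Comparing against None avoids raising AttributeError on pages where the player div or one of the tags is missing, which can happen when scraping in batch.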



Answer 2:


Decide which version you are dealing with and split accordingly:


firstVersion = '''<div class="player">
<video class="video-js vjs-fluid video-player" height="100%" id="some-player" poster="https://example.com/path/to/jpg/random.jpg" width="100%"></video>
<span data-type="trailer-src" data-url="https://example.com/path/to/mp4/random.mp4"></span>
</div>'''

secondVersion = '''<div class="player">
<img alt="Image description here" src="https://example.com/path/to/jpg/random.jpg"/>
</div>'''

def extractImageUrl(htmlInput):
    imageUrl = ""
    if "poster" in htmlInput:
        imageUrl = htmlInput.split('poster="')[1].split('"')[0]
    elif "src" in htmlInput:
        imageUrl = htmlInput.split('src="')[1].split('"')[0]
    return imageUrl
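
One caveat with the substring checks: "poster" in htmlInput is true whenever the word appears anywhere in the markup. For example, an img tag whose alt text contains "poster" takes the first branch and then raises IndexError, because poster=" is not actually present. As a hedged alternative, the sketch below uses Python 3's standard-library html.parser so that only real attributes are inspected. The names ImageUrlParser and extract_image_url_stdlib are hypothetical, and the tricky sample string is made up for illustration.

```python
from html.parser import HTMLParser


class ImageUrlParser(HTMLParser):
    """Remember the first <video poster> or <img src> value seen."""

    def __init__(self):
        super().__init__()
        self.image_url = None

    def handle_starttag(self, tag, attrs):
        # Keep only the first match.
        if self.image_url is not None:
            return
        attrs = dict(attrs)
        if tag == 'video' and attrs.get('poster'):
            self.image_url = attrs['poster']
        elif tag == 'img' and attrs.get('src'):
            self.image_url = attrs['src']


def extract_image_url_stdlib(html_input):
    """Return the image URL, or None if no matching attribute exists."""
    parser = ImageUrlParser()
    parser.feed(html_input)
    parser.close()
    return parser.image_url


# Made-up sample: the word "poster" appears only inside the alt text.
tricky = ('<div class="player">'
          '<img alt="movie poster" src="https://example.com/a.jpg"/>'
          '</div>')
print(extract_image_url_stdlib(tricky))
```

Self-closing tags such as the img above are still handled, because HTMLParser routes them through handle_starttag by default.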



Source: https://stackoverflow.com/questions/55400733/print-url-from-two-different-beautifulsoap-outputs
