Elegant way to try/except a series of BeautifulSoup commands?

Submitted by 不想你离开。 on 2020-02-03 05:09:10

Question


I'm parsing webpages on a site displaying item data. These items have about 20 fields which may or may not occur -- say: price, quantity, last purchased, high, low, etc.

I'm currently using a series of commands; about 20 lines of soup.find('div',{'class':SOME_FIELD_OF_INTEREST}) to look for each separate field of interest. (Some are in div, span, dd, and so on, so it's difficult to just do a soup.find_all('div') command.)

My question: Is there an elegant way to wrap all of these lookups in try/except so that the code is more compact? Right now a sample line looks like:

try:
    soup.find('div', {'id':'item-pic'}).img["src"]
except:
    ""

I was hoping to combine everything in one line. I don't think I can syntactically run try: <line of code> except: <code>, and I'm not sure how I'd write a function that goes try_command(soup.find('div',{'id':'item-pic'}).img["src"]) without actually running the command.

I'd love to hear if anybody has any advice (including: "this isn't possible/practical, move on"). :)

EDIT: After talking a bit, I guess I wanted to see what is good practice for inline exception handling, and if that's the right route to take.


Answer 1:


Maybe something like this:

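def try_these(start_obj, *args):
    # Walk a chain of lookups; give up and return None at the first failure.
    obj = start_obj
    for trythat in args:
        if obj is None:
            return None
        try:
            if isinstance(trythat, str):
                # a plain string means "get this attribute"
                obj = getattr(obj, trythat)
            else:
                # a (method name, params) tuple means "call this method"
                method, opts = trythat
                obj = getattr(obj, method)(*opts)
        except Exception:
            return None
    return obj

# same lookup as in the question: soup.find('div', {'id':'item-pic'}).img["src"]
src = try_these(soup, ('find', ('div', {'id': 'item-pic'})),
                      'img',
                      ('get', ('src',)))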

Here you can pass either a str, to get an attribute from the object, or a tuple (method name as str, tuple of params), to call a method; in the end you get either the result or None. I'm not familiar with soup, so I'm not sure whether get('src') is a good approach (as it's probably not a dict); anyway, you can easily modify the snippet to accept something more than only 'call or attr'.
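To see how the chain short-circuits, here is a small self-contained sketch. It assumes nothing about BeautifulSoup itself: FakeTag and page are hypothetical stand-ins written for illustration, mimicking only the find, .img and get calls used above, and try_these is repeated so the block runs on its own.

```python
class FakeTag(object):
    """Hypothetical stand-in for a parsed tag; not part of BeautifulSoup."""

    def __init__(self, attrs=None, img=None, children=None):
        self.attrs = attrs or {}
        self.img = img                  # nested <img> stand-in, or None
        self.children = children or {}  # maps an id to a child FakeTag

    def find(self, name, attrs):
        # Look a child up by id only; return None when nothing matches.
        return self.children.get(attrs.get('id'))

    def get(self, key):
        return self.attrs.get(key)


def try_these(start_obj, *args):
    obj = start_obj
    for step in args:
        if obj is None:
            return None
        try:
            if isinstance(step, str):
                obj = getattr(obj, step)            # attribute lookup
            else:
                method, opts = step
                obj = getattr(obj, method)(*opts)   # method call
        except Exception:
            return None
    return obj


page = FakeTag(children={
    'item-pic': FakeTag(img=FakeTag(attrs={'src': 'pic.jpg'})),
    'no-img': FakeTag(),
})

steps_for = lambda div_id: (('find', ('div', {'id': div_id})), 'img', ('get', ('src',)))

found = try_these(page, *steps_for('item-pic'))   # every step succeeds
no_img = try_these(page, *steps_for('no-img'))    # div exists, but has no img
no_div = try_these(page, *steps_for('missing'))   # find() itself returns None
print(found, no_img, no_div)
```

In both failing cases the loop stops as soon as an intermediate result is None, so the later steps are never attempted.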


Inspired by your question, I wrote a simple Python module that helps deal with such situations; you can find it here

import silentcrawler    

wrapped = silentcrawler.wrap(soup)
# just return None on failure
print wrapped.find('div', {'id':'item-pic'}).img["src"].value_

# or
def on_success(value) :
    print 'found value:', value
wrapped = silentcrawler.wrap(soup, success=on_success)
# calls on_success if everything goes OK
wrapped.find('div', {'id':'item-pic'}).img["src"].value_ 

There are more possibilities.
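One further possibility, sketched here rather than taken from the answer: wrap each lookup in a lambda so that it is not evaluated until a helper calls it inside a single try/except. The helper name attempt, the FIELDS table and the dict-based page stand-in are all hypothetical, chosen for illustration; with a real soup object each lambda would hold an expression such as soup.find('div', {'id': 'item-pic'}).img['src'].

```python
LOOKUP_ERRORS = (AttributeError, KeyError, TypeError, IndexError)


def attempt(func, default=None, exceptions=LOOKUP_ERRORS):
    """Call func() and return its result, or default if a listed error occurs."""
    try:
        return func()
    except exceptions:
        return default


# Hypothetical stand-in for a parsed page: nested dicts instead of a soup object.
page = {'item-pic': {'img': {'src': 'pic.jpg'}}, 'price': {'text': '9.99'}}

# One row per field of interest; each lambda defers its lookup
# until attempt() runs it.
FIELDS = {
    'image': lambda: page['item-pic']['img']['src'],
    'price': lambda: page['price']['text'],
    'quantity': lambda: page['quantity']['text'],   # absent on this page
}

item = {name: attempt(getter, default='') for name, getter in FIELDS.items()}
print(item)
```

Listing the exception types explicitly, instead of using a bare except, keeps unrelated mistakes (a NameError from a typo, say) visible rather than silently turning them into the default value.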




Answer 2:


If I understand right, you want to find some fields based on an interesting class name, but they are not necessarily the same element (not all <div>).

If so, with BeautifulSoup you can pass a compiled regex (from re.compile) in place of a string in many cases. For example:

print soup.findAll(re.compile(".*"), {'class': 'blah'})
# [<div class="blah"></div>, <span class="blah"></span>]

We can use this to tidily loop over all the relevant-looking DOM elements which might contain the image:

import re

from BeautifulSoup import BeautifulSoup as BS


html = """
<html>
<body>
<div class="blah"></div>
<span class="blah"><img src="yay.jpg"></span>
<span class="other"></span>

</body>
</html>
"""

def get_img_src(soup, cssclass):
    for item in soup.findAll(re.compile(".*"), {'class': cssclass}):
        if item.img is not None and 'src' in dict(item.img.attrs):
            return item.img['src']


soup = BS(html)
img = get_img_src(soup, cssclass = "blah")
print img # outputs yay.jpg, or None if nothing was found

Debatable, but I think using the if checks is more appropriate in this case, because item.img['src']

It could equally be written like this:

def get_img_src(soup, cssclass):
    for item in soup.findAll(re.compile(".*"), {'class': cssclass}):
        try:
            return item.img['src']
        except TypeError:
            pass

...but it's strange to catch TypeError here, as "'NoneType' object has no attribute '__getitem__'" isn't really the exception you are trying to catch; it's a byproduct of the syntax used by BeautifulSoup to access attributes.
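The failure modes can be told apart in a short sketch. A plain dict stands in for an <img> tag here (an assumption for illustration; a real BeautifulSoup tag is not a dict), and classify and get_src are hypothetical helper names. Subscripting None raises TypeError, while a missing key raises KeyError, so a handler for TypeError alone covers only the "no image at all" case.

```python
def classify(img):
    """Report which exception, if any, img['src'] raises."""
    try:
        img['src']
    except (TypeError, KeyError) as exc:
        return type(exc).__name__
    return 'ok'


def get_src(img):
    """Return img['src'], or None when img is None or has no 'src' entry."""
    try:
        return img['src']
    except TypeError:
        # img was None: there was nothing to subscript
        return None
    except KeyError:
        # img exists but carries no 'src' entry
        return None


print(classify(None))                   # no image at all
print(classify({'alt': 'no source'}))   # image without a src
print(classify({'src': 'yay.jpg'}))     # image with a src
```

Seen this way, the explicit if checks state the two conditions directly, whereas the try/except version relies on knowing which exception each condition happens to produce.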



Source: https://stackoverflow.com/questions/13783933/elegant-way-to-try-except-a-series-of-beautifulsoup-commands
