Working around repeated try-except blocks by wrapping many single-line statements in a separate function

Question

I am scraping dictionary data from the https://www.dictionary.com/ website. The purpose is to remove the unwanted elements from the dictionary pages and save them offline for further processing. Because the webpages are somewhat unstructured, the elements that the code below tries to remove may or may not be present, and a missing element raises an exception (in snippet 2). Since the actual code removes many such elements, any of which may be absent, wrapping every such statement in its own try-except would increase the number of lines drastically.

Thus I am working on a workaround for this problem by creating a separate function for the try-except (in snippet 3), the idea for which I got from here. But I am unable to get the code in snippet 3 working: a command such as soup.find_all('style') returns None, whereas it should return the list of all the style tags, as in snippet 2. I cannot apply the referenced solution directly because I sometimes have to reach the element to remove indirectly, by referring to its parent or sibling, as in soup.find('h2',{'class':'css-1iltn77 e17deyx90'}).parent

Snippet 1 is used to set the environment for code execution.

It would be great if you could provide some suggestions to get snippet 3 working.

Snippet 1 (Setting the environment for executing code):

import urllib.request
import requests
from bs4 import BeautifulSoup
import re

headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
           'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',}

folder = "dictionary_com"

Snippet 2 (working):

def makedefinition(url):
    success = False
    while success==False:
        try:
            request=urllib.request.Request(url,headers=headers)
            final_url = urllib.request.urlopen(request, timeout=5).geturl()
            r = requests.get(final_url, headers=headers, timeout=5)
            success=True
        except:
            success=False

    soup = BeautifulSoup(r.text, 'lxml')

    soup = soup.find("section",{'class':'css-1f2po4u e1hj943x0'})

    # there are many more elements to remove; only 2 are shown for brevity
    remove = soup.find_all("style") # style tags
    remove.extend(soup.find('h2',{'class':'css-1iltn77 e17deyx90'}).parent) # related content in the page; raises AttributeError if the h2 is absent

    for x in remove: x.decompose()

    return(soup)

# testing code on multiple urls
#url = "https://www.dictionary.com/browse/a"
#url = "https://www.dictionary.com/browse/a--christmas--carol"
#url = "https://www.dictionary.com/brdivowse/affection"
#url = "https://www.dictionary.com/browse/hot"
#url = "https://www.dictionary.com/browse/move--on"
url = "https://www.dictionary.com/browse/cuckold"
#url = "https://www.dictionary.com/browse/fear"
maggi = makedefinition(url)

with open(folder+"/demo.html", "w") as file:
    file.write(str(maggi))

Snippet 3 (not working):

soup = None

def safe_execute(command):
    global soup
    try:
        print(soup) # correct soup is printed
        print(exec(command)) # this should print the list of style tags but printing None, and for related content this should throw some exception
        return exec(command) # None is being returned for style
    except Exception:
        print(Exception.with_traceback())
        return []

def makedefinition(url):
    global soup
    success = False
    while success==False:
        try:
            request=urllib.request.Request(url,headers=headers)
            final_url = urllib.request.urlopen(request, timeout=5).geturl()
            r = requests.get(final_url, headers=headers, timeout=5)
            success=True
        except:
            success=False

    soup = BeautifulSoup(r.text, 'lxml')

    soup = soup.find("section",{'class':'css-1f2po4u e1hj943x0'})

    # there are many more elements to remove; only 2 are shown for brevity
    remove = safe_execute("soup.find_all('style')") # style tags
    remove.extend(safe_execute("soup.find('h2',{'class':'css-1iltn77 e17deyx90'}).parent")) # related content in the page

    for x in remove: x.decompose()

    return(soup)

# testing code on multiple urls
#url = "https://www.dictionary.com/browse/a"
#url = "https://www.dictionary.com/browse/a--christmas--carol"
#url = "https://www.dictionary.com/brdivowse/affection"
#url = "https://www.dictionary.com/browse/hot"
#url = "https://www.dictionary.com/browse/move--on"
url = "https://www.dictionary.com/browse/cuckold"
#url = "https://www.dictionary.com/browse/fear"
maggi = makedefinition(url)

with open(folder+"/demo.html", "w") as file:
    file.write(str(maggi))

Answer 1:

In snippet 3 you use the built-in function exec, which always returns None regardless of what its argument does. For details see this SO thread.
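The difference can be seen in a minimal sketch that does not involve BeautifulSoup at all. The list numbers and the two result variables are made up for illustration; only exec and eval themselves come from the question.

```python
# exec() runs statements and always returns None;
# eval() evaluates a single expression and returns its value.
numbers = [3, 1, 2]

exec_result = exec("sorted(numbers)")  # the sorted list is computed, then discarded
eval_result = eval("sorted(numbers)")  # the sorted list is returned

print(exec_result)  # None
print(eval_result)  # [1, 2, 3]
```

This is why print(exec(command)) in snippet 3 prints None even though the command itself runs without error.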

Remedy:

Use exec to assign the result to a variable and return that variable, instead of returning the result of exec itself.

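import traceback

def safe_execute(command):
    d = {}  # namespace in which the command runs
    try:
        exec(command, d)
        return d['output']  # the command must assign its result to `output`
    except Exception:
        traceback.print_exc()  # show what went wrong
        return []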

Then call it as something like this:

remove = safe_execute("output = soup.find_all('style')")

EDIT:

Running this code still does not give the expected list. On debugging, print(soup) inside the try block prints the correct soup, but exec(command, d) raises NameError: name 'soup' is not defined, because the command runs with d as its global namespace and d does not contain soup.
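The namespace behaviour can be reproduced with the sketch below. Both helper functions and the list standing in for soup are hypothetical and exist only to show the effect; passing the needed names into the dictionary is one possible fix, not necessarily the one used in the original code.

```python
def run_with_empty_namespace(command):
    namespace = {}  # the command can see builtins, but nothing else
    exec(command, namespace)
    return namespace["output"]

def run_with_supplied_names(command, **names):
    namespace = dict(names)  # make the given names visible to the command
    exec(command, namespace)
    return namespace["output"]

# Stand-in for a parsed page (an assumption, not a real BeautifulSoup object).
soup = ["<style>a</style>", "<style>b</style>"]

try:
    run_with_empty_namespace("output = len(soup)")
    error_name = None
except NameError as exc:
    error_name = type(exc).__name__  # 'soup' is unknown inside the empty namespace

count = run_with_supplied_names("output = len(soup)", soup=soup)

print(error_name)
print(count)
```

The first call fails because the dictionary given to exec replaces the module globals, while the second succeeds because soup is handed over explicitly.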

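The NameError was overcome by using eval() instead of exec(): eval returns the value of the expression, and when it is called without an explicit namespace it evaluates the expression in the calling scope, where the global soup is visible. The function is defined as: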

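def safe_execute(command):
    global soup  # the parsed page set by makedefinition
    try:
        # eval returns the value of the expression held in command
        return eval(command)
    except Exception:
        # a missing element (or any other error) yields an empty list
        return []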

And the call looks like:

remove = safe_execute("soup.find_all('style')")
remove.extend(safe_execute("soup.find('h2',{'class':'css-1iltn77 e17deyx90'}).parent"))
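As a possible alternative that avoids evaluating strings, the lookup can be wrapped in a lambda so that it only runs inside the try block. This is a different technique from the eval approach above, sketched under assumptions: safe_call is a hypothetical helper, and styles and missing are stand-ins for what find_all() and an unsuccessful find() would return.

```python
def safe_call(func):
    """Call func() and return its result, or [] if it raises."""
    try:
        return func()
    except Exception:
        return []

# Stand-ins for illustration only: `styles` plays the role of a
# find_all() result, `missing` the None that find() gives when
# no element matches.
styles = ["<style>a</style>", "<style>b</style>"]
missing = None

remove = safe_call(lambda: list(styles))
remove.extend(safe_call(lambda: [missing.parent]))  # AttributeError is caught

print(remove)
```

With the soup from the question, the calls would presumably look like safe_call(lambda: soup.find_all('style')) and safe_call(lambda: [soup.find('h2',{'class':'css-1iltn77 e17deyx90'}).parent]). Because the lambda closes over soup, no global statement or namespace dictionary is needed.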

Source: https://stackoverflow.com/questions/56916092/making-workaround-of-try-except-to-apply-on-many-statement-in-single-line-by-cre

Tags

python-3.x

web-scraping

error-handling

beautifulsoup

try-except