Silent error handling in Python?

Question

I have a CSV file with numerous URLs. I read it into a pandas DataFrame for convenience, since I need to do some statistical work later and pandas is handy for that. It looks a little like this:

import pandas as pd
csv = [{"URLs" : "www.mercedes-benz.de", "electric" : 1}, {"URLs" : "www.audi.de", "electric" : 0}, {"URLs" : "ww.audo.e", "electric" : 0}, {"URLs" : "NaN", "electric" : 0}]
df = pd.DataFrame(csv)

My task is to check whether the websites contain certain strings and to add an extra column with 1 if so, and 0 otherwise. For example, I want to check whether www.mercedes-benz.de contains the string car. I do the following:

for i, row in df.iterrows():
    page_content = requests.get(row['URLs'])
    if "car" in page_content.text:
        df.loc[i, 'car'] = '1'
    else:
        df.loc[i, 'car'] = '0'

The problem is that sometimes the URL is wrong or missing, and then my little script fails with an error.

How can I handle or suppress the error if the URL is wrong or missing? And how can I use, e.g., df.loc[i, 'url_wrong'] = '1' in these cases to indicate that the URL is wrong or missing?

Answer 1:

Try defining a function that does the "car" check first, and then use the .apply method of a pandas Series to get your 1, 0 or "Wrong/Missing URL". The following should help:

import pandas as pd
import requests


data = [{"URLs" : "https://www.mercedes-benz.de", "electric" : 1},
        {"URLs" : "https://www.audi.de", "electric" : 0}, 
        {"URLs" : "https://ww.audo.e", "electric" : 0}, 
        {"URLs" : "NaN", "electric" : 0}]


def contains_car(link):
    try:
        return int('car' in requests.get(link).text)
    except requests.exceptions.RequestException:
        # Covers malformed URLs (e.g. "NaN") and unreachable hosts
        return "Wrong/Missing URL"


df = pd.DataFrame(data)

df['extra_column'] = df.URLs.apply(contains_car)


#                            URLs  electric       extra_column
# 0  https://www.mercedes-benz.de         1                  1
# 1           https://www.audi.de         0                  1
# 2             https://ww.audo.e         0  Wrong/Missing URL
# 3                           NaN         0  Wrong/Missing URL
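If you would rather keep the flag in its own column, as the question's df.loc[i, 'url_wrong'] idea suggests, something along these lines might work. This is only a sketch: the names check_url and add_flags, the 10-second timeout, the raise_for_status() call and the injectable fetch parameter are my own assumptions for illustration, not something from the question.

```python
import pandas as pd
import requests


def check_url(link, keyword="car", fetch=requests.get):
    """Return a (keyword_found, url_wrong) pair.

    keyword_found is 1 or 0, or None when the page could not be fetched;
    url_wrong is 1 when fetching failed and 0 otherwise.
    """
    try:
        # Assumed timeout so that a dead host cannot hang the whole run
        response = fetch(link, timeout=10)
        # Assumption: HTTP 4xx/5xx answers should count as "wrong" too
        response.raise_for_status()
    except requests.exceptions.RequestException:
        # Malformed URLs such as "NaN", DNS failures, timeouts, HTTP errors
        return None, 1
    return int(keyword in response.text), 0


def add_flags(df, keyword="car", fetch=requests.get):
    """Return a copy of df with a keyword column and a 'url_wrong' column."""
    results = [check_url(link, keyword, fetch) for link in df["URLs"]]
    out = df.copy()
    out[keyword] = [found for found, _ in results]      # missing if fetch failed
    out["url_wrong"] = [wrong for _, wrong in results]  # 1 marks a bad URL
    return out
```

Calling add_flags(df) on the DataFrame above performs the real HTTP requests; the fetch argument is only there so that the function can be tried out with a stand-in instead of the network.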

Edit:

You can search for more than one keyword in the text returned by your HTTP request. Depending on the condition you want, use either the built-in function any or the built-in function all: with any, finding at least one of the keywords returns 1, while with all, every keyword has to be found to return 1. In the following example, I am using any with the keywords 'car', 'automobile' and 'vehicle':

import pandas as pd
import requests


data = [{"URLs" : "https://www.mercedes-benz.de", "electric" : 1},
        {"URLs" : "https://www.audi.de", "electric" : 0}, 
        {"URLs" : "https://ww.audo.e", "electric" : 0}, 
        {"URLs" : "NaN", "electric" : 0}]


def contains_keywords(link, keywords):
    try:
        output = requests.get(link).text
        return int(any(x in output for x in keywords))
    except requests.exceptions.RequestException:
        # Covers malformed URLs (e.g. "NaN") and unreachable hosts
        return "Wrong/Missing URL"


df = pd.DataFrame(data)
mykeywords = ('car', 'vehicle', 'automobile')
df['extra_column'] = df.URLs.apply(lambda l: contains_keywords(l, mykeywords))

Should yield:

#                            URLs  electric       extra_column
# 0  https://www.mercedes-benz.de         1                  1
# 1           https://www.audi.de         0                  1
# 2             https://ww.audo.e         0  Wrong/Missing URL
# 3                           NaN         0  Wrong/Missing URL
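To see the difference between any and all without touching the network, here is a small sketch that runs the same check on a plain string. The helper name match_keywords, its require_all flag and the sample sentence are made up for illustration.

```python
def match_keywords(text, keywords, require_all=False):
    """Return 1 if text matches the keywords, else 0.

    With require_all=False one matching keyword is enough (any);
    with require_all=True every keyword has to occur (all).
    """
    check = all if require_all else any
    return int(check(word in text for word in keywords))


page = "Our new vehicle range includes every car we build."
keywords = ("car", "vehicle", "automobile")

print(match_keywords(page, keywords))                    # 1: 'car' and 'vehicle' occur
print(match_keywords(page, keywords, require_all=True))  # 0: 'automobile' is missing
```

Note that the in operator does plain substring matching, so 'car' would also match inside a longer word such as 'carpet'.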

I hope this helps.

Answer 2:

I hope I understand you correctly that 'NaN' stands for a "wrong/missing" URL. In that case you can simply check for it. There are endless ways to indicate a missing URL; I'd prefer a missing value in the car column. Try this:

import pandas as pd
csv = [{"URLs" : "www.mercedes-benz.de", "electric" : 1}, {"URLs" : "www.audi.de", "electric" : 0}, {"URLs" : "ww.audo-car.e", "electric" : 0}, {"URLs" : "NaN", "electric" : 0}]
df = pd.DataFrame(csv)

print(df)

for i, row in df.iterrows():
    page_content = row['URLs']
    if pd.isna(page_content) or page_content == "NaN":
        df.loc[i, 'car'] = None
    elif "car" in page_content:
        df.loc[i, 'car'] = True
    else:
        df.loc[i, 'car'] = False
    print(df.loc[i, 'car'])

print(df)

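I also changed a few things in your code that did not work. For example, in the line page_content = requests.get(row['URLs']), requests is not defined, because it is never imported in your snippet. I guessed you meant row, so I use row['URLs'] directly; note that this checks the URL string itself rather than the downloaded page.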

Source: https://stackoverflow.com/questions/44590079/silent-erroer-handling-in-python

Tags

python-3.x

loops

pandas

error-handling

get-request