Validating URLs in Python


Question


I've been trying to figure out what the best way to validate a URL is (specifically in Python) but haven't really been able to find an answer. It seems like there isn't one known way to validate a URL, and it depends on what URLs you think you may need to validate. As well, I found it difficult to find an easy to read standard for URL structure. I did find the RFCs 3986 and 3987, but they contain much more than just how it is structured.

Am I missing something, or is there no one standard way to validate a URL?


Answer 1:


This looks like it might be a duplicate of How do you validate a URL with a regular expression in Python?

You should be able to use the urlparse function described there (it lives in urllib.parse on Python 3, urlparse on Python 2).

>>> from urllib.parse import urlparse # python2: from urlparse import urlparse
>>> urlparse('actually not a url')
ParseResult(scheme='', netloc='', path='actually not a url', params='', query='', fragment='')
>>> urlparse('http://google.com')
ParseResult(scheme='http', netloc='google.com', path='', params='', query='', fragment='')

Call urlparse on the string you want to check, then verify that the resulting ParseResult has non-empty scheme and netloc attributes.
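That check can be wrapped in a small helper; this is a rough sketch (the function name is made up for illustration), and note it only tests for a plausible shape, not for an actually reachable URL:

```python
from urllib.parse import urlparse

def is_probable_url(s):
    """Rough check: a URL needs at least a scheme and a network location."""
    try:
        parts = urlparse(s)
    except ValueError:
        # urlparse can raise ValueError for some malformed inputs,
        # e.g. an invalid IPv6 literal like 'http://['
        return False
    return bool(parts.scheme) and bool(parts.netloc)

print(is_probable_url('http://google.com'))   # True
print(is_probable_url('actually not a url'))  # False
```

Strings like 'google.com' fail this test because they lack a scheme, which may or may not be what you want.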




Answer 2:


The original question is a bit old, but you might also want to look at the Validator-Collection library I released a few months back. It includes high-performing regex-based validation of URLs for compliance against the RFC standard. Some details:

  • Tested against Python 2.7, 3.4, 3.5, 3.6
  • No dependencies on Python 3.x, one conditional dependency in Python 2.x (drop-in replacement for Python 2.x's buggy re module)
  • Unit tests that cover ~80 different succeeding/failing URL patterns, including non-standard characters and the like. As close to covering the whole spectrum of the RFC standard as I've been able to find.

It's also very easy to use:

from validator_collection import validators, checkers

checkers.is_url('http://www.stackoverflow.com')
# Returns True

checkers.is_url('not a valid url')
# Returns False

value = validators.url('http://www.stackoverflow.com')
# value set to 'http://www.stackoverflow.com'

value = validators.url('not a valid url')
# raises a validator_collection.errors.InvalidURLError (which is a ValueError)

In addition, Validator-Collection includes 60+ other validators, covering domains and email addresses among others, so folks might find it useful for more than just URLs.




Answer 3:


I would use the validators package. Here is the link to the documentation and installation instructions.

It is just as simple as

import validators
url = 'YOUR URL'
validators.url(url)

It returns True if the URL is valid; for an invalid URL it returns a falsy failure object rather than a literal False, so the result can still be used directly in a boolean test.




Answer 4:


You can also validate by actually requesting the URL: pass it to urlopen and catch the resulting exception. Note that URLError lives in urllib.error, not urllib.request, and that this approach hits the network, so it checks reachability rather than just syntax.

from urllib.request import urlopen
from urllib.error import URLError

def validate_web_url(url="http://google"):
    try:
        urlopen(url)
        return True
    except (URLError, ValueError):
        # ValueError covers strings with no scheme, e.g. 'google.com'
        return False

This returns False for the default URL above, since the host cannot be resolved.




Answer 5:


Assuming you are using Python 3, you could use urllib. The code would go something like this:

import urllib.request as req
from urllib.error import URLError

def foo():
    url = 'http://bar.com'
    request = req.Request(url)
    try:
        response = req.urlopen(request)
        # response is an http.client.HTTPResponse; call response.read()
        # to get the page's HTML as bytes
        return True
    except (URLError, ValueError):
        # The URL wasn't valid (or couldn't be fetched)
        return False

If no exception is raised on the "response = ..." line, the URL is valid and reachable.



来源:https://stackoverflow.com/questions/22238090/validating-urls-in-python
