Distinguish a filename from a URL

Question

In WeasyPrint’s public API I accept either filenames or URLs (among other types) for the HTML input:

document = HTML(filename='/foo/bar/baz.html')
document = HTML(url='http://example.net/bar/baz.html')

There is also the option not to name the argument and let WeasyPrint guess its type:

document = HTML(sys.argv[1])

Some cases are easy: if it starts with a / on Unix it’s a filename, and if it starts with http:// it’s probably a URL. But we need a general algorithm that gives an answer for any string.

Currently I try to match this regexp: ^([a-z][a-z0-9.+-]*):. A string that matches starts with a valid URI scheme according to RFC 3986 (URI). This is not bad on Unix, but it fails utterly on Windows: C:\foo\bar.html matches and is treated like a URL.

I could change the * to + in the regexp and only match URI schemes that are at least two characters long. Apparently there is no known URI scheme shorter than that.
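A minimal sketch of that variant, assuming Python's re module; the names URL_SCHEME_RE and looks_like_url are made up for illustration and are not part of WeasyPrint:

```python
import re

# Scheme grammar from RFC 3986: a letter, then letters, digits, "+", "-" or ".".
# The "+" quantifier requires at least two characters in total, which
# excludes Windows drive letters such as "C:".
URL_SCHEME_RE = re.compile(r'^[a-z][a-z0-9.+-]+:', re.IGNORECASE)


def looks_like_url(string):
    """Return True if the string starts with a URI scheme of two or more characters."""
    return URL_SCHEME_RE.match(string) is not None
```

With this, looks_like_url('http://example.net/bar/baz.html') is true and looks_like_url(r'C:\foo\bar.html') is false. It is still a guess: a Unix filename with an early colon, such as notes:draft.html, would be taken for a URL.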

Or is there a better criterion? Maybe I should just restrict "guessed" URLs to a handful of schemes. More exotic cases can still use HTML(url=foo).

url.startswith(('http:', 'https:', 'ftp:', 'data:'))  # startswith needs a tuple, not a list

Answer 1:

If you really must guess between filenames and URLs, I'd treat a string that starts with two or more word characters followed by a colon as a URL and anything else as a file, just as you suggest.

Another option: try to open it as a file. If it fails, try to open it as a URL.
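A rough sketch of that fallback, assuming Python 3's urllib.request; open_guess is a hypothetical name, not a WeasyPrint function:

```python
from urllib.request import urlopen


def open_guess(string):
    """Open `string` as a local file; fall back to treating it as a URL."""
    try:
        return open(string, 'rb')
    except OSError:
        # Not a readable local file, so assume the string is a URL.
        # urlopen raises ValueError if the string has no usable scheme.
        return urlopen(string)
```

One thing to weigh with this approach: a mistyped filename ends up reported as a URL error, which may confuse the caller.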

Better might be to listen to the Zen of Python: "In the face of ambiguity, refuse the temptation to guess." Doesn't the caller know whether they're passing a filename or a URL? Have them specify it.

Answer 2:

The correct thing is to accept file-like objects, not paths.

Then I can pass you a file, a retrieved URL, or some other thing you haven't thought of.
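As an illustration only, an entry point along these lines would need nothing more than a read() method. read_html is a hypothetical helper, and the UTF-8 decoding is an assumption, since real HTML may declare another encoding:

```python
import io


def read_html(fileobj):
    """Hypothetical helper: return the HTML source from any object with read()."""
    data = fileobj.read()
    # Binary streams return bytes; decode them so callers always get text.
    # UTF-8 is assumed here for simplicity.
    if isinstance(data, bytes):
        data = data.decode('utf-8')
    return data


# The caller decides where the data comes from.
from_memory = read_html(io.BytesIO(b'<p>in memory</p>'))
from_text = read_html(io.StringIO('<p>text stream</p>'))
```

The same function would accept the result of open(path, 'rb') or urllib.request.urlopen(url), so no guessing is needed.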

Answer 3:

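You could check the scheme with urlparse. A missing scheme or a file scheme means a local file.

from urllib.parse import urlparse  # Python 3; on Python 2: from urlparse import urlparse

# url is the string passed in by the caller
scheme = urlparse(url).scheme
if not scheme or scheme == 'file':
    pass  # treat it as a file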
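Note that urlparse does not solve the Windows case from the question by itself: for C:\foo\bar.html it reports the scheme 'c'. A possible sketch that combines it with the two-character rule, using the made-up name is_filename:

```python
from urllib.parse import urlparse


def is_filename(string):
    """Guess whether `string` names a local file rather than a URL."""
    scheme = urlparse(string).scheme
    # An empty scheme means a plain path; a one-letter scheme is assumed
    # to be a Windows drive letter; "file" points at a local file.
    return len(scheme) < 2 or scheme == 'file'
```

Here is_filename(r'C:\foo\bar.html') and is_filename('/foo/bar/baz.html') are both true, while is_filename('http://example.net/bar/baz.html') is false.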

Source: https://stackoverflow.com/questions/11687916/distinguish-a-filename-from-an-url

Tags

python

url

filenames