Question
In WeasyPrint’s public API I accept either filenames or URLs (among other types) for the HTML input:
document = HTML(filename='/foo/bar/baz.html')
document = HTML(url='http://example.net/bar/baz.html')
There is also the option not to name the argument and let WeasyPrint guess its type:
document = HTML(sys.argv[1])
Some cases are easy: if it starts with a / on Unix it’s a filename, if it starts with http:// it’s probably a URL. But we need a general algorithm that gives an answer for any string.
Currently I try to match this regexp: ^([a-z][a-z0-9.+-]*): . A string that matches starts with a valid URI scheme according to RFC 3986 (URI). This is not bad on Unix, but utterly fails on Windows: C:\foo\bar.html matches and is treated like a URL.
I could change the * to + in the regexp and only match URI schemes that are at least two characters long. Apparently there is no known URI scheme shorter than that.
Or is there a better criterion? Maybe I should just restrict "guessed" URLs to a handful of schemes. More exotic cases can still use HTML(url=foo).
url.startswith(('http:', 'https:', 'ftp:', 'data:'))
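To make the two candidate heuristics concrete, here is a minimal sketch of both. The function names and the GUESSED_SCHEMES tuple are made up for illustration and are not part of WeasyPrint, and matching case-insensitively is an assumption (RFC 3986 treats schemes as case-insensitive).

```python
import re

# RFC 3986 scheme syntax, but with "+" instead of "*" so that at least two
# characters are required. A Windows drive letter such as "C:" no longer
# matches. Case-insensitive matching is an assumption of this sketch.
SCHEME_RE = re.compile(r'^[a-z][a-z0-9.+-]+:', re.IGNORECASE)

# Hypothetical whitelist of schemes that are worth guessing.
GUESSED_SCHEMES = ('http:', 'https:', 'ftp:', 'data:')


def looks_like_url_by_regex(string):
    """Heuristic 1: any scheme of two or more characters counts as a URL."""
    return SCHEME_RE.match(string) is not None


def looks_like_url_by_whitelist(string):
    """Heuristic 2: only a handful of well-known schemes count as a URL."""
    # str.startswith accepts a tuple of prefixes, not a list.
    return string.lower().startswith(GUESSED_SCHEMES)
```

The two differ on less common schemes: something like mailto:someone@example.net is a URL for the regexp but a filename for the whitelist.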
Answer 1:
If you really must guess well between filenames and URLs, I'd say a string that starts with two or more word characters followed by a colon is a URL, and anything else is a file, just as you suggest.
Another option: try to open it as a file. If it fails, try to open it as a URL.
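A rough sketch of that fallback, assuming Python 3 and the standard library only. open_guess is a hypothetical helper name, not a WeasyPrint function.

```python
from urllib.request import urlopen


def open_guess(string):
    """Return a binary file-like object for a filename or, failing that, a URL."""
    try:
        return open(string, 'rb')
    except OSError:
        # Not a readable local file, so fall back to treating it as a URL.
        # urlopen raises ValueError if the string has no usable scheme.
        return urlopen(string)
```

One thing to keep in mind: a mistyped filename ends up reported as a URL error, which can be confusing for the caller.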
Better might be to listen to the Zen of Python, "resist the temptation to guess". Doesn't the caller know if he's talking about a filename or a URL? Have them specify it.
Answer 2:
The correct thing is to accept file-like objects, not paths.
Then I can pass you a file, a retrieved URL, or some other thing you haven't thought of.
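Something along these lines, as a minimal sketch. read_html is a hypothetical name, and the only assumption about the argument is that it has a read() method.

```python
import io


def read_html(fileobj):
    """Read HTML from any object that has a read() method."""
    return fileobj.read()


# The caller decides where the bytes come from:
#   read_html(open('/foo/bar/baz.html', 'rb'))                       # a local file
#   read_html(urllib.request.urlopen('http://example.net/bar/baz.html'))  # a fetched URL
in_memory = read_html(io.BytesIO(b'<p>in memory</p>'))  # or something else entirely
```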
Answer 3:
You could check the scheme with urlparse if you want.
# Python 3; on Python 2 use: from urlparse import urlparse
from urllib.parse import urlparse

scheme = urlparse(url).scheme
if not scheme or scheme == 'file':
    pass  # treat it as a file
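Note that urlparse reports a one-letter scheme for a Windows path such as C:\foo\bar.html, so this check alone has the same problem as the regexp in the question. A possible variant, combining it with the two-character rule (guess_kind is a made-up name, and the length check is an addition to what this answer proposes):

```python
from urllib.parse import urlparse


def guess_kind(string):
    """Return 'file' or 'url' based on the scheme that urlparse finds."""
    scheme = urlparse(string).scheme
    # An empty scheme means a plain path. A one-letter "scheme" is most
    # likely a Windows drive letter (C:\...), so treat it as a file too.
    if len(scheme) < 2 or scheme == 'file':
        return 'file'
    return 'url'
```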
Source: https://stackoverflow.com/questions/11687916/distinguish-a-filename-from-an-url