Which characters make a URL invalid?
Are these valid URLs?
example.com/file[/].html
http://example.com/file[/].html
<
To add some clarification and directly address the question above, there are several classes of characters that cause problems for URLs and URIs.
There are some characters that are disallowed and should never appear in a URL/URI, reserved characters (described below), and other characters that may cause problems in some cases, but are marked as "unwise" or "unsafe". Explanations for why the characters are restricted are clearly spelled out in RFC-1738 (URLs) and RFC-2396 (URIs). Note the newer RFC-3986 (update to RFC-1738) defines the construction of what characters are allowed in a given context but the older spec offers a simpler and more general description of which characters are not allowed with the following rules.
Excluded US-ASCII Characters disallowed within the URI syntax:
control =
space =
delims = "<" | ">" | "#" | "%" | <">
The character "#" is excluded because it is used to delimit a URI from a fragment identifier. The percent character "%" is excluded because it is used for the encoding of escaped characters. In other words, the "#" and "%" are reserved characters that must be used in a specific context.
List of unwise characters are allowed but may cause problems:
unwise = "{" | "}" | "|" | "\" | "^" | "[" | "]" | "`"
Characters that are reserved within a query component and/or have special meaning within a URI/URL:
reserved = ";" | "/" | "?" | ":" | "@" | "&" | "=" | "+" | "$" | ","
The "reserved" syntax class above refers to those characters that are allowed within a URI, but which may not be allowed within a particular component of the generic URI syntax. Characters in the "reserved" set are not reserved in all contexts. The hostname, for example, can contain an optional username so it could be something like ftp://user@hostname/
where the '@' character has special meaning.
Here is an example of a URL that has invalid and unwise characters (e.g. '$', '[', ']') and should be properly encoded:
http://mw1.google.com/mw-earth-vectordb/kml-samples/gp/seattle/gigapxl/$[level]/r$[y]_c$[x].jpg
Some of the character restrictions for URIs/URLs are programming language dependent. For example, the '|' (0x7C) character although only marked as "unwise" in the URI spec will throw a URISyntaxException in the Java java.net.URI constructor so a URL like http://api.google.com/q?exp=a|b
is not allowed and must be encoded instead as http://api.google.com/q?exp=a%7Cb
if using Java with a URI object instance.