I have tried finding a full list of patterns to use for verifying input via HTML5 form validation for various types, specifically url, email,
These patterns aren't necessarily simple, but here's what I think works best in most situations. Keep in mind that (quite recently) Internationalized Domain Names (IDNs) have become available too. With them, far too many characters are allowed in URLs to test exhaustively (lots of characters are still not allowed in domain names, but the list of allowed characters is so big, and changes so often between Top-Level Domains, that it's not practical to keep up with it). If you want to support internationalized domain names, use the second URL pattern; otherwise, use the first.
Here's a live demo to see the following patterns in action. Scroll down for an explanation, reasoning and analysis of these patterns.
URLs
https?:\/\/(?![^\/]{253}[^\/])((?!-.*|.*-\.)([a-zA-Z0-9-]{1,63}\.)+[a-zA-Z]{2,15}|((1[0-9]{2}|[1-9]?[0-9]|2([0-4][0-9]|5[0-5]))\.){3}(1[0-9]{2}|[1-9]?[0-9]|2([0-4][0-9]|5[0-5])))(\/.*)?
https?:\/\/(?!.{253}.+$)((?!-.*|.*-\.)([^ !-,\.\/:-@\[-`{-~]{1,63}\.)+([^ !-\/:-@\[-`{-~]{2,15}|xn--[a-zA-Z0-9]{4,30})|(([01]?[0-9]{2}|2([0-4][0-9]|5[0-5])|[0-9])\.){3}([01]?[0-9]{2}|2([0-4][0-9]|5[0-5])|[0-9]))(\/.*)?
Emails
(?!(^[.-].*|[^@]*[.-]@|.*\.{2,}.*)|^.{254}.)([a-zA-Z0-9!#$%&'*+\/=?^_`{|}~.-]+@)(?!-.*|.*-\.)([a-zA-Z0-9-]{1,63}\.)+[a-zA-Z]{2,15}
Phone numbers
((\+|00)?[1-9]{2}|0)[1-9]( ?[0-9]){8}
((\+|00)?[1-9]{2}|0)[1-9]([0-9]){8}
Western-style names
([A-ZΆ-ΫÀ-ÖØ-Þ][A-ZΆ-ΫÀ-ÖØ-Þa-zά-ώß-öø-ÿ]{1,19} ?){1,10}
https?:\/\/(?![^\/]{253}[^\/])((?!-.*|.*-\.)([a-zA-Z0-9-]{1,63}\.)+[a-zA-Z]{2,15}|((1[0-9]{2}|[1-9]?[0-9]|2([0-4][0-9]|5[0-5]))\.){3}(1[0-9]{2}|[1-9]?[0-9]|2([0-4][0-9]|5[0-5])))(\/.*)?
Explanation:
-
.international"), which most likely won't change any time soon.
0.0.0.0, 127.0.0.1, etc. are not checked for
01.1.1.1) [4].
Note that the default http:.* pattern built into modern browsers will always be enforced, so even if you remove the https?:// at the start of this pattern, it will still be enforced. Use type="text" to avoid it.
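To see what this first pattern accepts and rejects, here is a minimal sketch in JavaScript (the regex flavour the pattern attribute uses). The helper name matchesUrlPattern and the sample URLs are my own, not part of the original answer. The helper assumes the browser wraps the pattern in ^(?:...)$, and it leaves out the Unicode flag that browsers also apply.

```javascript
// First URL pattern from this answer, as a regex literal so no extra escaping is needed.
const urlPattern = /https?:\/\/(?![^\/]{253}[^\/])((?!-.*|.*-\.)([a-zA-Z0-9-]{1,63}\.)+[a-zA-Z]{2,15}|((1[0-9]{2}|[1-9]?[0-9]|2([0-4][0-9]|5[0-5]))\.){3}(1[0-9]{2}|[1-9]?[0-9]|2([0-4][0-9]|5[0-5])))(\/.*)?/;

// Hypothetical helper: mimics the implicit whole-value anchoring of the pattern attribute.
function matchesUrlPattern(value) {
  return new RegExp("^(?:" + urlPattern.source + ")$").test(value);
}

const urlSamples = [
  "http://example.com",           // true
  "https://sub.example.org/path", // true
  "http://192.168.0.1",           // true
  "http://256.1.1.1",             // false: octet out of range
  "http://01.1.1.1",              // false: leading zero in an octet
  "http://-example.com",          // false: host starts with a hyphen
  "http://example-.com",          // false: label ends with a hyphen
  "ftp://example.com",            // false: only http and https are accepted
];
for (const sample of urlSamples) {
  console.log(sample, matchesUrlPattern(sample));
}
```

The same helper also shows the length limits: a 64-character label or a host longer than 253 characters is rejected.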
https?:\/\/(?!.{253}.+$)((?!-.*|.*-\.)([^ !-,\.\/:-@\[-`{-~]{1,63}\.)+([^ !-\/:-@\[-`{-~]{2,15}|xn--[a-zA-Z0-9]{4,30})|(([01]?[0-9]{2}|2([0-4][0-9]|5[0-5])|[0-9])\.){3}([01]?[0-9]{2}|2([0-4][0-9]|5[0-5])|[0-9]))(\/.*)?
Explanation:
Since a huge number of characters are allowed in IDNs, it's not practical to list every possible combination in an HTML attribute (you'd get a huge pattern, so in that case it's much better to test it by some method other than regex) [5].
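The second URL pattern gets around this by excluding characters instead of listing the allowed ones. As a rough check, the sketch below takes the label character class from that pattern and lists which printable ASCII characters it still accepts. The names labelChar and allowedPrintableAscii are made up for this example.

```javascript
// Label character class copied from the second (IDN) URL pattern,
// anchored here so that it tests exactly one character.
const labelChar = /^[^ !-,\.\/:-@\[-`{-~]$/;

// Collect every printable ASCII character (0x20..0x7E) the class accepts.
function allowedPrintableAscii() {
  let allowed = "";
  for (let code = 0x20; code <= 0x7e; code++) {
    const ch = String.fromCharCode(code);
    if (labelChar.test(ch)) allowed += ch;
  }
  return allowed;
}

console.log(allowedPrintableAscii());
// -0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz

// Non-ASCII characters pass, which is what lets IDN labels through.
console.log(labelChar.test("é"), labelChar.test("日")); // true true
```

So within ASCII the class boils down to letters, digits and the hyphen. Because it is a negated class, it also accepts everything outside the listed ranges, control characters included, so it is a lot more permissive than the first pattern.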
!"#$%&'()*+, ./ :;<=>?@ [\]^_` {|}~
with the exception of a period as domain separator.
[!-,]
[\.\/]
[:-@]
[\[-`]
[{-~]
.xn--* with * being an encoded version of the actual TLD. This encoding uses 2 Latin letters or Arabic numerals per original character, so the arbitrary limit here is doubled to 30.
(?!(^[.-].*|[^@]*[.-]@|.*\.{2,}.*)|^.{254}.)([a-zA-Z0-9!#$%&'*+\/=?^_`{|}~.-]+@)(?!-.*|.*-\.)([a-zA-Z0-9-]{1,63}\.)+[a-zA-Z]{2,15}
Explanation:
This pattern isn't 100% foolproof, since fully validating email addresses requires a whole lot more, but it should cover nearly all addresses you'll see in practice. A 100% complete pattern does exist, but it relies on PCRE (PHP)-only regex features, so it won't work in HTML forms.
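As a quick check of what the email pattern lets through, here is a small sketch in JavaScript. The helper matchesEmailPattern and the sample addresses are my own additions. As with the pattern attribute, the helper assumes the whole value has to match.

```javascript
// Email pattern from this answer, as a regex literal.
const emailPattern = /(?!(^[.-].*|[^@]*[.-]@|.*\.{2,}.*)|^.{254}.)([a-zA-Z0-9!#$%&'*+\/=?^_`{|}~.-]+@)(?!-.*|.*-\.)([a-zA-Z0-9-]{1,63}\.)+[a-zA-Z]{2,15}/;

// Hypothetical helper: mimics the implicit whole-value anchoring of the pattern attribute.
function matchesEmailPattern(value) {
  return new RegExp("^(?:" + emailPattern.source + ")$").test(value);
}

const emailSamples = [
  "user@example.com",              // true
  "first.last+tag@sub.example.co", // true
  ".user@example.com",             // false: local part starts with a dot
  "user.@example.com",             // false: local part ends with a dot
  "us..er@example.com",            // false: two consecutive dots
  "user@-example.com",             // false: domain starts with a hyphen
  "user@example",                  // false: no top-level domain
];
for (const sample of emailSamples) {
  console.log(sample, matchesEmailPattern(sample));
}
```

An address longer than 254 characters is rejected by the ^.{254}. part of the first lookahead, whatever it contains.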
!#$%&'*+\/=?^_`{|}~.-
[6].
@ can only be 63 characters long, and the total address can only be 254 characters long [8].
- or ., and no two dots may appear consecutively [8].
((\+|00)?[1-9]{2}|0)[1-9]( ?[0-9]){8}
((\+|00)?[1-9]{2}|0)[1-9]([0-9]){8}
Explanation:
[CTRY] stands for the country code, and X stands for the first non-zero digit (such as 6 in mobile numbers),
00[CTRY]X
+[CTRY]X
0X
[CTRY]X (This is not officially correct syntax, but Chrome Autofill seems to like it for some reason.)
This regex is just for 10-digit phone numbers. Since phone number lengths may vary between countries, it's best to use a less strict version of this pattern, or modify it to work for the desired countries. So, this pattern should generally be used as a kind of template pattern.
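To make the four accepted formats concrete, here is a short JavaScript sketch that runs both phone patterns against some made-up numbers. The names phoneSpaced, phoneStrict and matchesPhone are hypothetical, and 31 is only used as an example country code.

```javascript
// Both phone patterns from this answer: the first allows single spaces between digits.
const phoneSpaced = /((\+|00)?[1-9]{2}|0)[1-9]( ?[0-9]){8}/;
const phoneStrict = /((\+|00)?[1-9]{2}|0)[1-9]([0-9]){8}/;

// Hypothetical helper: mimics the implicit whole-value anchoring of the pattern attribute.
function matchesPhone(pattern, value) {
  return new RegExp("^(?:" + pattern.source + ")$").test(value);
}

console.log(matchesPhone(phoneStrict, "0031612345678"));  // true  (00[CTRY]X)
console.log(matchesPhone(phoneStrict, "+31612345678"));   // true  (+[CTRY]X)
console.log(matchesPhone(phoneStrict, "0612345678"));     // true  (0X)
console.log(matchesPhone(phoneStrict, "31612345678"));    // true  ([CTRY]X)
console.log(matchesPhone(phoneSpaced, "06 12 34 56 78")); // true
console.log(matchesPhone(phoneStrict, "06 12 34 56 78")); // false: spaces not allowed
console.log(matchesPhone(phoneStrict, "06123456789"));    // false: one digit too many
console.log(matchesPhone(phoneStrict, "+306912345678"));  // false: country code contains a 0
```

The last line shows one reason to treat this as a template: [1-9]{2} only accepts country codes of exactly two non-zero digits, so that part needs adjusting for other countries.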
([A-ZΆ-ΫÀ-ÖØ-Þ][A-ZΆ-ΫÀ-ÖØ-Þa-zά-ώß-öø-ÿ]{1,19} ?){1,10}
Yes, I know, this is very western-centric, but it may still be useful, since a pattern like this can be difficult to put together yourself. If you're making a site for a western audience, it should work in most cases (Asian names have a representation in exactly this format too).
ÐÞ ðþ
):
A-Z matches all uppercase Latin letters: ABCDEFGHIJKLMNOPQRSTUVWXYZ
Ά-Ϋ matches all uppercase Greek letters, including the accented ones: Ά·ΈΉΊΌΎΏΐ ΑΒΓΔΕΖΗΘΙΚΛΜΝΞΟΠΡΣΤΥΦΧΨΩ ΪΫ.
À-ÖØ-Þ matches all uppercase accented Latin letters, and the Ð and Þ: ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞ. In between there's also the character × (between Ö and Ø), which is left out this way.
a-z matches all lowercase Latin letters: abcdefghijklmnopqrstuvwxyz
ά-ώ matches all lowercase Greek letters, including the accented ones: άέήίΰαβγδεζηθικλμνξοπρςστυφχψωϊϋόύώ
ß-öø-ÿ matches all lowercase accented Latin letters, and the ß, ð and þ: ßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ. In between there's also the character ÷ (between ö and ø), which is left out this way.
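The ranges above can be tried out with a small JavaScript sketch. The helper matchesNamePattern and the sample names are my own. The helper assumes, as with the pattern attribute, that the whole value has to match.

```javascript
// Name pattern from this answer, as a regex literal (source file assumed to be UTF-8).
const namePattern = /([A-ZΆ-ΫÀ-ÖØ-Þ][A-ZΆ-ΫÀ-ÖØ-Þa-zά-ώß-öø-ÿ]{1,19} ?){1,10}/;

// Hypothetical helper: mimics the implicit whole-value anchoring of the pattern attribute.
function matchesNamePattern(value) {
  return new RegExp("^(?:" + namePattern.source + ")$").test(value);
}

console.log(matchesNamePattern("John Smith")); // true
console.log(matchesNamePattern("Åsa Öberg"));  // true
console.log(matchesNamePattern("Νίκος"));      // true
console.log(matchesNamePattern("john smith")); // false: words must start uppercase
console.log(matchesNamePattern("A×"));         // false: × is left out of the range
console.log(matchesNamePattern("A÷"));         // false: ÷ is left out of the range
console.log(matchesNamePattern("O'Brien"));    // false: apostrophe is not in the class
console.log(matchesNamePattern("Mary-Jane"));  // false: hyphen is not in the class
```

The last two results are worth knowing before using the pattern as-is: names with apostrophes or hyphens are rejected, and so are single-letter words, since every word needs at least two letters.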