Converting regex to account for international characters

Submitted by 喜欢而已 on 2019-12-13 09:24:28

Question


I currently use the following regex to validate a company name entered into a form:

$regexpRange = $min.','.$max;
$regexpPattern = '/^(?=[A-Za-z\d\'\s\,\.]{'.$regexpRange.'}$)(?=.*[a-z\d])[a-zA-Z\d]+[A-Za-z\d\'\s\,\.]+$/m';

I need to update this to allow international characters, but I have no experience with this.

Can someone help me understand how to solve this?


Answer 1:


Here are the required steps:

  • Use the u pattern option. This turns on PCRE_UTF8 and PCRE_UCP (the PHP docs forget to mention that one):

    PCRE_UTF8

    This option causes PCRE to regard both the pattern and the subject as strings of UTF-8 characters instead of single-byte strings. However, it is available only when PCRE is built to include UTF support. If not, the use of this option provokes an error. Details of how this option changes the behaviour of PCRE are given in the pcreunicode page.

    PCRE_UCP

    This option changes the way PCRE processes \B, \b, \D, \d, \S, \s, \W, \w, and some of the POSIX character classes. By default, only ASCII characters are recognized, but if PCRE_UCP is set, Unicode properties are used instead to classify characters. More details are given in the section on generic character types in the pcrepattern page. If you set PCRE_UCP, matching one of the items it affects takes much longer. The option is available only if PCRE has been compiled with Unicode property support.

  • \d will do just fine with PCRE_UCP (it then matches any Unicode decimal digit, i.e. \p{Nd}), but you have to replace the [a-z] ranges to account for accented characters:

    • Replace [a-zA-Z] with \p{L}
    • Replace [a-z] with \p{Ll}
    • Replace [A-Z] with \p{Lu}

    \p{X} means: a character from Unicode category X, where L means letter, Ll means lowercase letter and Lu means uppercase letter. You can get a list from the docs.

    Note that you can use \p{X} inside a character class: [\p{L}\d\s] for instance.

  • And make sure your strings in PHP are UTF-8 encoded. Also, make sure you use Unicode-aware functions (for example the mb_* functions from the mbstring extension rather than strlen or substr) to handle these strings.
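Putting the steps above together, the pattern from the question could be converted roughly as follows. This is only a sketch of a literal translation: the function name isValidCompanyName and its parameters are made up for illustration, and it assumes the input string and the source file are both UTF-8 encoded.

```php
<?php
// Hypothetical helper (not from the question): a literal translation of the
// original pattern, with [a-zA-Z] replaced by \p{L}, [a-z] by \p{Ll},
// and the u modifier added so pattern and subject are treated as UTF-8.
function isValidCompanyName(string $name, int $min, int $max): bool
{
    $range = $min . ',' . $max;

    // Allowed characters: any letter, digit, apostrophe, whitespace, comma, period.
    $allowed = '\p{L}\d\'\s,.';

    $pattern = '/^(?=[' . $allowed . ']{' . $range . '}$)' // length and allowed set
             . '(?=.*[\p{Ll}\d])'                          // at least one lowercase letter or digit
             . '[\p{L}\d]+[' . $allowed . ']+$/mu';        // must start with a letter or digit

    // preg_match returns 1 on a match, 0 on no match, false on error.
    return preg_match($pattern, $name) === 1;
}

var_dump(isValidCompanyName('Café Müller', 2, 50)); // bool(true)
var_dump(isValidCompanyName('Acme!', 2, 50));       // bool(false)
```

Because this keeps the original rule requiring at least one lowercase letter or digit, a name written entirely in a script without letter case (for example 東京株式会社, whose characters are in category Lo) would be rejected. Whether that lookahead should be relaxed to \p{L} depends on the form's requirements.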



Source: https://stackoverflow.com/questions/30112917/converting-regex-to-account-for-international-characters
