Should we consider using range [a-z] as a bug?

余生分开走 2020-12-15 05:17

In my locale (et_EE) [a-z] means:

abcdefghijklmnopqrsšz

So six ASCII characters (tuvwxy) are missing from the range, and one letter from the Estonian alphabet (š) is included.

3 answers
  •  误落风尘
    2020-12-15 05:38

    Possible Locale Bugs

    The problem you're facing is not with POSIX character classes per se, but with the fact that the classes are dependent on locale. For example, regex(7) says:

    Within a bracket expression, the name of a character class enclosed in "[:" and ":]" stands for the list of all characters belonging to that class...These stand for the character classes defined in wctype(3). A locale may provide others.

    The emphasis is mine, but the manual page is clearly saying that the character classes are dependent on locale. Further, wctype(3) says:

    The behavior of wctype() depends on the LC_CTYPE category of the current locale.

    In other words, if your locale incorrectly defines a character class, then it's a bug that should be filed against the specific locale. On the other hand, if the character class simply defines the character set in a way that you are not expecting, then it may not be a bug; it may just be a problem that needs to be coded around.
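    To make the LC_CTYPE dependence concrete, here is a rough sketch in Python (my choice of language; the question does not name one). It relies on the `re.LOCALE` flag, which makes `\w` on bytes patterns follow the current locale. The helper name `is_word_byte` is made up for illustration, and the only locale I assume is installed is "C".

```python
import locale
import re

# With re.LOCALE, \w on a bytes pattern is decided by the C library's
# classification for the LC_CTYPE locale in effect at match time.
WORD_BYTE = re.compile(rb"\w", re.LOCALE)


def is_word_byte(byte_value, locale_name="C"):
    """Hypothetical helper: is this single byte a word character under locale_name?"""
    previous = locale.setlocale(locale.LC_CTYPE)  # query the current setting
    try:
        locale.setlocale(locale.LC_CTYPE, locale_name)
        return WORD_BYTE.fullmatch(bytes([byte_value])) is not None
    finally:
        # Put the process-wide locale back the way we found it.
        locale.setlocale(locale.LC_CTYPE, previous)
```

    Under the "C" locale, byte 0x74 (t) counts as a word character and byte 0xE4 does not. A locale with a Latin-1-family encoding, if one is installed on your system, would be expected to classify 0xE4 (ä) differently. Same pattern, different answer, purely because of the locale.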

    Character Classes as Shortcuts

    Character classes are shortcuts for defining sets. You certainly aren't restricted to the pre-defined sets for your locale, and you are free to use the Unicode character sets defined by perlre(1), or simply create the sets explicitly if that provides greater accuracy.

    You already know this, so I'm not trying to be pedantic. I'm just pointing out that if you can't or won't fix the locale (which is the source of the problem here) then you should use an explicit set, as you have done.

    A convenience class is only convenient if it works for your use case. If it doesn't, toss it overboard!
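    As a small illustration of the explicit-set approach, here is a sketch in Python (again my assumption; the names `ASCII_LOWER_ONLY` and `is_ascii_lower` are invented for the example). Python's `re` compares ranges by code point, so it would not show the et_EE behavior in the first place; the point is only the shape of the workaround, which is to spell out the 26 letters rather than trust a range.

```python
import re
import string

# string.ascii_lowercase is the fixed constant 'abcdefghijklmnopqrstuvwxyz';
# it does not change with the locale.
ASCII_LOWER_CLASS = "[" + string.ascii_lowercase + "]"
ASCII_LOWER_ONLY = re.compile(ASCII_LOWER_CLASS + "+")


def is_ascii_lower(text):
    """Hypothetical helper: True if text is one or more ASCII lowercase letters."""
    return ASCII_LOWER_ONLY.fullmatch(text) is not None
```

    With this, "tuvwxy" is accepted and "š" is rejected, which is the reverse of what the et_EE range in the question does.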
