Regex [A-Z] Does Not Recognize Local Characters

Question

I've looked at similar questions and read their solutions, but they do not work for me. I've tested the regular expression, and it works on non-locale characters. The code simply finds any capital letter in a string and does some processing on it. For example, minikŞeker bir kedi should return kŞe, but my code does not recognize Ş as a letter within [A-Z]. When I try re.LOCALE, as some people suggest, I get the error ValueError: cannot use LOCALE flag with a str pattern. This is the code I use with re.UNICODE:

import re
corp = "minikŞeker bir kedi"
pattern = re.compile(r"([\w]{1})()([A-Z]{1})", re.U)
corp = re.sub(pattern, r"\1 \3", corp)
print(corp)

This works for minikSeker bir kedi, but it doesn't work for minikŞeker bir kedi, and it throws an error with re.L. The error I'm getting is ValueError: cannot use LOCALE flag with a str pattern. Searching for it turned up some Git discussions but nothing useful.

Answer 1:

The problem is that Ş is not in the range [A-Z]. That range is the class of all characters whose code points lie between U+0041 and U+005A (inclusive). (If you were using bytes mode, it would be all bytes between 0x41 and 0x5A.) Ş is U+015E (or, e.g., 0xAA in bytes, assuming latin2), which isn't in that range.
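To see this concretely, here is a small check of the code points involved. The variable name s_cedilla is just for illustration, and the character is written as an escape so the example does not depend on how the source file is encoded:

```python
import re

s_cedilla = "\u015e"  # Ş, LATIN CAPITAL LETTER S WITH CEDILLA

# The ASCII range [A-Z] spans exactly these two code points.
print(hex(ord("A")), hex(ord("Z")))

# Ş sits well outside that range.
print(hex(ord(s_cedilla)))

# So the character class cannot match it.
match = re.search(r"[A-Z]", s_cedilla)
print(match)

# In an 8-bit encoding such as latin2 (ISO 8859-2) it is a single byte,
# which is also outside 0x41..0x5A.
encoded = s_cedilla.encode("latin2")
print(encoded)
```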

And using a locale won't change that. As re.LOCALE explains, all it does is:

Make \w, \W, \b, \B and case-insensitive matching dependent on the current locale.

Also, you almost never want to use re.LOCALE. As the docs say:

The use of this flag is discouraged as the locale mechanism is very unreliable, it only handles one “culture” at a time, and it only works with 8-bit locales.
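As a quick illustration of the error from the question, the sketch below (assuming Python 3.6 or later, where this check exists) shows that re.LOCALE is rejected for str patterns and only accepted for bytes patterns. The names error_message and bytes_pattern are made up for the example:

```python
import re

# A str pattern combined with re.LOCALE is rejected at compile time.
try:
    re.compile(r"[A-Z]", re.LOCALE)
except ValueError as exc:
    error_message = str(exc)
else:
    error_message = None
print(error_message)

# The flag is only accepted together with a bytes pattern.
bytes_pattern = re.compile(rb"\w", re.LOCALE)
print(bool(bytes_pattern.flags & re.LOCALE))
```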

If you only care about a single script, you can build a class of the appropriate ranges for that script.
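For instance, assuming the text is Turkish (a guess based on the sample string), a hand-written class could add the six Turkish capitals that fall outside A-Z. This is only a sketch; the constant name TURKISH_UPPER is hypothetical and the list would need adjusting for any other language:

```python
import re

# A-Z plus the Turkish capitals that are not in ASCII: Ç Ğ İ Ö Ş Ü
TURKISH_UPPER = "[A-Z\u00c7\u011e\u0130\u00d6\u015e\u00dc]"

corp = "minik\u015eeker bir kedi"  # "minikŞeker bir kedi"
pattern = re.compile(r"(\w)(" + TURKISH_UPPER + r")")

# Insert a space between a word character and the capital that follows it.
result = pattern.sub(r"\1 \2", corp)
print(result)
```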

If you want to work with all scripts, you need to build a class out of a Unicode character class like Lu for "all uppercase letters". Unfortunately, Python's re doesn't have a mechanism for doing this directly. You can build a giant class out of the information in unicodedata, but that's pretty annoying:

import unicodedata
# range(0x110000) covers every code point up to and including U+10FFFF
Lu = '[' + ''.join(chr(c) for c in range(0x110000)
                   if unicodedata.category(chr(c)) == 'Lu') + ']'

And then:

pattern = re.compile(r"([\w]{1})()(" + Lu + r"{1})", re.U)

… or maybe:

pattern = re.compile(rf"([\w]{{1}})()({Lu}{{1}})", re.U)
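Put together, the unicodedata approach might look like the following self-contained sketch. The pattern is simplified here to two groups (the empty group and the {1} quantifiers from the question are not needed), and the name uppercase_chars is introduced only for this example:

```python
import re
import sys
import unicodedata

# Collect every code point whose general category is Lu (uppercase letter).
uppercase_chars = "".join(
    chr(cp)
    for cp in range(sys.maxunicode + 1)
    if unicodedata.category(chr(cp)) == "Lu"
)
Lu = "[" + uppercase_chars + "]"

pattern = re.compile(r"(\w)(" + Lu + r")")

corp = "minik\u015eeker bir kedi"  # "minikŞeker bir kedi"
result = pattern.sub(r"\1 \2", corp)
print(result)
```

Building the class scans the whole Unicode range once, so it is worth doing at import time and reusing the compiled pattern.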

Part of the reason re doesn't have any way to specify Unicode classes is that for a long time, the plan was to replace re with a new module, so many suggested new features for re were rejected. The good news is that the intended new module is available as a third-party library, regex. It works just fine, and is a near drop-in replacement for re; it was just improving too quickly to lock it down to the slower Python release schedule. If you install it, then you can write your code this way:

import regex
corp = "minikŞeker bir kedi"
pattern = regex.compile(r"([\w]{1})()(\p{Lu}{1})", regex.U)
corp = regex.sub(pattern, r"\1 \3", corp)
print(corp)

The only change I made was to replace re with regex, and then use \p{Lu} instead of [A-Z].

There are, of course, lots of other regex engines out there, and many of them also support Unicode character classes. Most of those that do follow some variation on the same \p syntax. (They all copied it from Perl, but the details differ—e.g., regex's idea of Unicode classes comes from the unicodedata module, while PCRE and PCRE2 attempt to be as close to Perl as possible, and so on.)

Answer 2:

abarnert's answer is great, but if all you want to do is find uppercase characters, str.isupper() works without the need for an extra module.

>>> foo = "minikŞeker bir kedi"
>>> for i, c in enumerate(foo):
...     if c.isupper():
...         print(foo[i-1:i+2])
...         break
... 
kŞe

or perhaps

>>> foo = "minikŞeker bir kedi"
>>> ''.join((' ' if c.isupper() else '') + c for c in foo)
'minik Şeker bir kedi'
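Note that the one-liner above also puts a space in front of a capital at the start of the string or after an existing space. If that matters, one possible variant only inserts the space when the previous character is a word character, roughly like \w in the regex. The function name split_before_capitals is hypothetical, and its behaviour on runs of consecutive capitals differs from the re.sub version in the question:

```python
def split_before_capitals(text):
    """Insert a space before each uppercase character that directly
    follows a letter, digit, or underscore."""
    out = []
    # Pair every character with its predecessor; the first one gets a blank.
    for prev, cur in zip(" " + text, text):
        if cur.isupper() and (prev.isalnum() or prev == "_"):
            out.append(" ")
        out.append(cur)
    return "".join(out)


print(split_before_capitals("minik\u015eeker bir kedi"))
```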

Source: https://stackoverflow.com/questions/50302910/regex-a-z-do-not-recognize-local-characters

Tags

python

regex

nlp