Detecting words that start with an accented uppercase using regular expressions

夙愿已清 提交于 2019-12-01 21:10:33

Start with http://download.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html

\p{javaUpperCase}   Equivalent to java.lang.Character.isUpperCase()

To match an uppercase letter at the beginning of the string, you need the pattern ^\p{Lu}.

Unfortunately, Java does not support the mandatory \p{Uppercase} property, necessary for meeting UTS#18’s RL1.2.

That’s hardly the only thing missing from Java regular expressions to meet even Level 1, the most bareboned Basic Unicode Functionality. Without Level 1, you really can’t work with Unicode test using regular expressions. Too much is broken or absent.

UTS#18’s RL1.1 will finally be met with JDK7, but I do not believe there are currently any plans to meet RL1.2, RL1.2a, or any of the others that it’s currently lacking, nor even meeting the two Strong Recommendations. Alas!

Indeed, of the very short list of mandatory properties required by RL1.2, Java is missing the \p{Alphabetic}, \p{Uppercase}, \p{Lowercase}, \p{White_Space}, \p{Noncharacter_Code_Point}, \p{Default_Ignorable_Code_Point}, \p{ANY}, and \p{ASSIGNED} properties. Those are all mandatory but either completely missing or else fail to obey The Unicode Standard with respect to their definitions. This is also the problem with the POSIX compatible properties in Java: they’re all broken with respect to UTS#18.

Prior to JDK7, it is also missing the mandatory Script properties. JDK7 does get script properties at long last, but that’s all — nothing else. Java is still light years away from meeting even RL1.2a, which is a daily gotcha for zillions of programmers.

In JDK7, you can finally also two-part properties in the form \p{name=value} if they’re block, script, or general categories. That means these are all the same in JDK7’s Pattern class:

  • \p{Block=Number_Forms}, \p{blk=Number_Forms}, and \p{InNumber_Forms}.
  • \p{Script=Latin}, \p{sc=Latin}, \p{IsLatin}, and \p{Latin}.
  • \p{General_Category=Lu}, \p{GC=Lu}, and \p{Lu}.

However, you still cannot use the the long forms like \p{Lowercase_Letter} and \p{Letter_Number}, and the POSIX-looking properties are all broken from RL1.2a’s perspective. Plus super-basic properties from RL1.2 like \p{White_Space} and \p{Alphabetic} are still missing.

There was some talk of trying to fix \b and \B, which are miserably broken with respect to \w and \W, but I don't know how they’re going to fix all that without fully complying with RL1.2a. And no, I have no idea when they will add those basic properties to Java. You can’t get by without them, either.

To fully work with Unicode using regexes in Java at even Level 1, you really cannot use the standard Pattern class that Java comes with. The easiest way to do so is to instead use JNI to connect up with ICU regex libraries using the Google Android code, which is available. There do exist other languages that are at least Level-1 compliant (or better) with UTS#18, but if you want to stay within Java, ICU is currently your own real option.

java has an method java.lang.Character.isUpperCase, its not exactly a regular expression, but might satisfy.

http://download.oracle.com/javase/1.5.0/docs/api/java/lang/Character.html#isUpperCase(int)

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!