Detecting words that start with an accented uppercase using regular expressions

I want to extract the words that begin with a capital — including accented capitals — using regular expressions in Java.

This is my conditional for words beginning with capital A through Z:

if (link.text().matches("^[A-Z].+") == true)

But I also want words that begin with an accented uppercase character, too.

Do you have any ideas?

Start with http://download.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html

\p{javaUpperCase}   Equivalent to java.lang.Character.isUpperCase()

To match an uppercase letter at the beginning of the string, you need the pattern ^\p{Lu}.

Unfortunately, Java does not support the mandatory \p{Uppercase} property, necessary for meeting UTS#18’s RL1.2.

That’s hardly the only thing missing from Java regular expressions to meet even Level 1, the most bareboned Basic Unicode Functionality. Without Level 1, you really can’t work with Unicode test using regular expressions. Too much is broken or absent.

UTS#18’s RL1.1 will finally be met with JDK7, but I do not believe there are currently any plans to meet RL1.2, RL1.2a, or any of the others that it’s currently lacking, nor even meeting the two Strong Recommendations. Alas!

Indeed, of the very short list of mandatory properties required by RL1.2, Java is missing the \p{Alphabetic}, \p{Uppercase}, \p{Lowercase}, \p{White_Space}, \p{Noncharacter_Code_Point}, \p{Default_Ignorable_Code_Point}, \p{ANY}, and \p{ASSIGNED} properties. Those are all mandatory but either completely missing or else fail to obey The Unicode Standard with respect to their definitions. This is also the problem with the POSIX compatible properties in Java: they’re all broken with respect to UTS#18.

Prior to JDK7, it is also missing the mandatory Script properties. JDK7 does get script properties at long last, but that’s all — nothing else. Java is still light years away from meeting even RL1.2a, which is a daily gotcha for zillions of programmers.

In JDK7, you can finally also two-part properties in the form \p{name=value} if they’re block, script, or general categories. That means these are all the same in JDK7’s Pattern class:

\p{Block=Number_Forms}, \p{blk=Number_Forms}, and \p{InNumber_Forms}.
\p{Script=Latin}, \p{sc=Latin}, \p{IsLatin}, and \p{Latin}.
\p{General_Category=Lu}, \p{GC=Lu}, and \p{Lu}.

However, you still cannot use the the long forms like \p{Lowercase_Letter} and \p{Letter_Number}, and the POSIX-looking properties are all broken from RL1.2a’s perspective. Plus super-basic properties from RL1.2 like \p{White_Space} and \p{Alphabetic} are still missing.

There was some talk of trying to fix \b and \B, which are miserably broken with respect to \w and \W, but I don't know how they’re going to fix all that without fully complying with RL1.2a. And no, I have no idea when they will add those basic properties to Java. You can’t get by without them, either.

To fully work with Unicode using regexes in Java at even Level 1, you really cannot use the standard Pattern class that Java comes with. The easiest way to do so is to instead use JNI to connect up with ICU regex libraries using the Google Android code, which is available. There do exist other languages that are at least Level-1 compliant (or better) with UTS#18, but if you want to stay within Java, ICU is currently your own real option.

java has an method java.lang.Character.isUpperCase, its not exactly a regular expression, but might satisfy.

http://download.oracle.com/javase/1.5.0/docs/api/java/lang/Character.html#isUpperCase(int)

来源：https://stackoverflow.com/questions/5442875/detecting-words-that-start-with-an-accented-uppercase-using-regular-expressions

标签

java

regex

unicode

diacritics