How to know if a string contains accents?
The right way to do this is to use Normalizer.normalize(str, Normalizer.Form.NFD) from java.text.Normalizer, and then look for (or delete) the characters of general category Mark, \pM, or Non-Spacing Mark, \p{Mn}. Java's regex engine does not support the standard Unicode property \p{Diacritic}; otherwise you could use that. Note that not all Diacritics are Non-Spacing Marks, nor vice versa.
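As a minimal sketch of the normalize-then-match approach described above: the class name AccentCheck and its two helper methods are hypothetical names chosen for illustration, and only the JDK's own java.text.Normalizer and java.util.regex.Pattern are used. Accented letters are written as Unicode escapes so the example does not depend on the source file encoding.

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

class AccentCheck {
    // General category Mark (Mn, Mc, Me); use \\p{Mn} for non-spacing marks only.
    private static final Pattern MARKS = Pattern.compile("\\p{M}");

    // True if the string contains at least one mark after canonical decomposition.
    static boolean containsAccents(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return MARKS.matcher(decomposed).find();
    }

    // Returns the decomposed string with every mark deleted.
    static String stripAccents(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return MARKS.matcher(decomposed).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(containsAccents("caf\u00e9"));                  // "café"  -> true
        System.out.println(containsAccents("cafe"));                       // no marks -> false
        System.out.println(stripAccents("cr\u00e8me br\u00fbl\u00e9e"));   // "crème brûlée" -> creme brulee
    }
}
```

Letters such as ø or ł have no canonical decomposition, so this check does not treat them as accented.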
However, this is probably the wrong thing to do. If you are trying to do accent-insensitive string searches and comparisons, the right way to do that is to leave the strings as they are. You need to create a UCA collation object with the level set to 1 (or rather, PRIMARY), then use that to compare your strings. A comparison at the primary strength disregards things like accent marks, so strings that differ only in their accents compare equal.
Here are examples in Java of how to do that using ICU’s Collator class. If you’re using proper UCA collators, then you don’t have to normalize; they take care of this for you.
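A rough sketch of the collator approach follows. To stay runnable without extra dependencies it uses the JDK's java.text.Collator rather than ICU's com.ibm.icu.text.Collator; that substitution is an assumption of this example, and the JDK collator is not a full UCA implementation. Canonical decomposition is switched on explicitly here because the JDK collator does not decompose by default. The class name AccentInsensitive and its helpers are hypothetical.

```java
import java.text.Collator;
import java.util.Locale;

class AccentInsensitive {
    // Builds a collator at the given strength (Collator.PRIMARY, Collator.SECONDARY, ...).
    static Collator collator(int strength) {
        Collator c = Collator.getInstance(Locale.ENGLISH);
        c.setStrength(strength);
        // The JDK default is NO_DECOMPOSITION; enable canonical decomposition
        // so precomposed and decomposed accented letters are handled alike.
        c.setDecomposition(Collator.CANONICAL_DECOMPOSITION);
        return c;
    }

    // Primary strength: accent (and case) differences are ignored.
    static boolean sameIgnoringAccents(String a, String b) {
        return collator(Collator.PRIMARY).compare(a, b) == 0;
    }

    // Secondary strength: accent differences are significant.
    static boolean sameWithAccents(String a, String b) {
        return collator(Collator.SECONDARY).compare(a, b) == 0;
    }

    public static void main(String[] args) {
        String plain = "resume";
        String accented = "r\u00e9sum\u00e9"; // "résumé"
        System.out.println(sameIgnoringAccents(plain, accented)); // true
        System.out.println(sameWithAccents(plain, accented));     // false
    }
}
```

The original strings are never modified; only the comparison strength changes. With ICU4J on the classpath, the same pattern should carry over by swapping the import, since its Collator offers the same getInstance, setStrength, and compare calls.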
This answer in Perl uses two UCA collator objects, one at the primary strength to completely ignore accents for string searches and comparisons, and another that allows diacritics to be distinguished at the secondary strength as is normal for Unicode.