character-properties

How do I match only fully-composed characters in a Unicode string in Perl?

怎甘沉沦 submitted on 2021-02-18 22:10:27
Question: I'm looking for a way to match only fully composed characters in a Unicode string. Is [:print:] dependent upon locale in any regular expression implementation that incorporates this character class? For example, will it match Japanese character 'あ', since it is not a control character, or is [:print:] always going to be ASCII codes 0x20 to 0x7E? Is there any character class, including Perl REs, that can be used to match anything other than a control character? If [:print:] includes only
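
As an illustration of the "anything other than a control character" part of the question, here is a minimal sketch in Java rather than Perl. It assumes the standard java.util.regex behaviour, where \p{Print} is ASCII-only unless Pattern.UNICODE_CHARACTER_CLASS is set, and \P{Cc} matches any code point outside the Control general category. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical class name; only java.util.regex from the JDK is used.
class PrintableCheckSketch {
    // POSIX-style Print class: ASCII-only by default in Java.
    static final Pattern ASCII_PRINT = Pattern.compile("\\p{Print}");
    // Same class, but with Unicode-aware definitions switched on.
    static final Pattern UNICODE_PRINT =
            Pattern.compile("\\p{Print}", Pattern.UNICODE_CHARACTER_CLASS);
    // Negated general category Cc: any code point that is not a control character.
    static final Pattern NOT_CONTROL = Pattern.compile("\\P{Cc}");

    // True when the whole string matches the given pattern.
    static boolean matches(Pattern p, String s) {
        return p.matcher(s).matches();
    }

    public static void main(String[] args) {
        String hiraganaA = "\u3042"; // HIRAGANA LETTER A
        String bell = "\u0007";      // BEL, a control character
        System.out.println("ASCII Print matches hiragana:   " + matches(ASCII_PRINT, hiraganaA));
        System.out.println("Unicode Print matches hiragana: " + matches(UNICODE_PRINT, hiraganaA));
        System.out.println("Not-control matches hiragana:   " + matches(NOT_CONTROL, hiraganaA));
        System.out.println("Not-control matches BEL:        " + matches(NOT_CONTROL, bell));
    }
}
```

Other engines, Perl included, may define [:print:] differently, so this only shows how one implementation behaves.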

Matching Unicode letter characters in PCRE/PHP

不打扰是莪最后的温柔 submitted on 2021-01-29 06:36:11
Question: I'm trying to write a reasonably permissive validator for names in PHP, and my first attempt consists of the following pattern: // unicode letters, apostrophe, hyphen, space $namePattern = "/^([\\p{L}'\\- ])+$/"; This is eventually passed to a call to preg_match(). As far as I can tell, this works with your vanilla ASCII alphabet, but seems to trip up on spicier characters like Ă or 张. Is there something wrong with the pattern itself? Perhaps I'm expecting \p{L} to do more work than I think
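
For comparison only, the same character set can be sketched in Java, where \p{L} operates on Unicode code points without any extra flag. This is a Java analogue under that assumption, not a fix for the PHP code; in PHP's PCRE functions, Unicode-aware matching generally depends on the u pattern modifier. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical class name; a Java analogue of the validator in the question.
class NameValidatorSketch {
    // Unicode letters, apostrophe, hyphen, space: the set described in the question.
    static final Pattern NAME = Pattern.compile("^[\\p{L}'\\- ]+$");

    // True when every character of the input belongs to the permitted set.
    static boolean isValidName(String candidate) {
        return NAME.matcher(candidate).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValidName("O'Brien-Smith")); // ASCII letters plus ' and -
        System.out.println(isValidName("\u0102"));        // LATIN CAPITAL LETTER A WITH BREVE
        System.out.println(isValidName("\u5F20"));        // a CJK ideograph
        System.out.println(isValidName("R2D2"));          // digits are not letters
    }
}
```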

Matching Unicode Dashes in Java Regular Expressions?

折月煮酒 submitted on 2020-01-21 07:05:44
Question: I'm trying to craft a Java regular expression to split strings of the general format "foo - bar" into "foo" and "bar" using Pattern.split(). The "-" character may be one of several dashes: the ASCII '-', the em-dash, the en-dash, etc. I've constructed the following regular expression: private static final Pattern titleSegmentSeparator = Pattern.compile("\\s(\\x45|\\u8211|\\u8212|\\u8213|\\u8214)\\s"); which, if I'm reading the Pattern documentation correctly, should capture any of the unicode
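
One thing worth checking in the pattern above: Java's \x and \u escapes take hexadecimal digits, so \x45 denotes 'E' (the hyphen-minus is \x2D), and the en dash is \u2013, not \u8211 (8211 is its decimal value). A minimal sketch of one possible alternative follows. It assumes the Unicode general category Pd (dash punctuation) is an acceptable definition of "dash"; the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical class name; one possible approach, not necessarily the accepted answer.
class DashSplitSketch {
    // The Pd category (dash punctuation) covers the ASCII hyphen-minus U+002D,
    // the en dash U+2013, the em dash U+2014 and several other dashes.
    static final Pattern SEPARATOR = Pattern.compile("\\s\\p{Pd}\\s");

    // Splits "foo - bar" style titles on a dash surrounded by whitespace.
    static String[] splitTitle(String title) {
        return SEPARATOR.split(title);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitTitle("foo - bar")));      // ASCII hyphen-minus
        System.out.println(Arrays.toString(splitTitle("foo \u2013 bar"))); // en dash
        System.out.println(Arrays.toString(splitTitle("foo \u2014 bar"))); // em dash
        System.out.println(Arrays.toString(splitTitle("foo-bar")));        // no surrounding spaces
    }
}
```

Because the pattern requires whitespace on both sides, a hyphenated word such as "foo-bar" is left unsplit.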

Regex word-breaker in unicode

人走茶凉 submitted on 2020-01-15 10:57:10
Question: How do I convert the regular expression \w+ to give me whole words in Unicode, not just ASCII? I use .NET. Answer 1: In .NET, \w will match Unicode characters that are Unicode letters or digits. For example, it would match ì and Æ. To just match ASCII characters, you could use [a-zA-Z0-9]. Answer 2: This works as expected for me: string foo = "Hola, la niña está gritando en alemán: Maüschen raus!"; Regex r = new Regex(@"\w+"); MatchCollection mc = r.Matches(foo); foreach (Match ma in mc) { Console
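
The answers above concern .NET. As a point of contrast, here is a minimal Java sketch of the same word extraction. It assumes java.util.regex, where \w is ASCII-only by default and becomes Unicode-aware with Pattern.UNICODE_CHARACTER_CLASS. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class name; a Java counterpart to the .NET snippet in Answer 2.
class UnicodeWordsSketch {
    // With UNICODE_CHARACTER_CLASS, \w follows the Unicode definition of word characters.
    static final Pattern WORD = Pattern.compile("\\w+", Pattern.UNICODE_CHARACTER_CLASS);

    // Collects every maximal run of word characters, in order of appearance.
    static List<String> words(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // "Hola, la nina esta gritando" with n-tilde and a-acute written as escapes.
        System.out.println(words("Hola, la ni\u00f1a est\u00e1 gritando"));
    }
}
```

Without the flag, the same pattern would break each word at its accented letters.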

POSIX character equivalents in Java regular expressions

耗尽温柔 submitted on 2020-01-13 10:33:27
Question: I would like to use a regular expression like this in Java: [[=a=][=e=][=i=]]. But Java doesn't support the POSIX classes [=a=], [=e=], etc. How can I do this? More precisely, is there a way to not use US-ASCII? Answer 1: Java does support POSIX character classes. The syntax is just different, for instance: \p{Lower} \p{Upper} \p{ASCII} \p{Alpha} \p{Digit} \p{Alnum} \p{Punct} \p{Graph} \p{Print} \p{Blank} \p{Cntrl} \p{XDigit} \p{Space} Answer 2: Quoting from http://download.oracle.com/javase/1.6.0
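
The classes listed in Answer 1 are POSIX character classes, not the equivalence classes ([=a=]) the question asks about. A rough approximation of an equivalence class can be sketched with java.text.Normalizer: decompose the input, drop the combining marks, then match the base letter. This swaps in Unicode normalization for locale collation, so it is an assumption-laden stand-in, not what POSIX specifies. The class and method names are hypothetical.

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

// Hypothetical class name; approximates [[=a=][=e=][=i=]] via Unicode decomposition.
class EquivalenceClassSketch {
    // Combining marks (general category M) left over after NFD decomposition.
    static final Pattern MARKS = Pattern.compile("\\p{M}+");
    static final Pattern BASE_VOWEL = Pattern.compile("[aei]");

    // Decomposes to NFD and removes combining marks, e.g. e-acute becomes plain "e".
    static String stripMarks(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return MARKS.matcher(decomposed).replaceAll("");
    }

    // True when the input reduces to exactly one of the base letters a, e or i.
    static boolean isAeiVariant(String s) {
        return BASE_VOWEL.matcher(stripMarks(s)).matches();
    }

    public static void main(String[] args) {
        System.out.println(isAeiVariant("\u00e9")); // e with acute
        System.out.println(isAeiVariant("\u00e0")); // a with grave
        System.out.println(isAeiVariant("\u00f4")); // o with circumflex
    }
}
```

Letters with no canonical decomposition (ø, for example) are not reduced to a base letter by this approach.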
