character-properties


Regex word-breaker in unicode

人走茶凉 posted on 2020-01-15 10:57:10
Question: How do I convert the regular expression \w+ so that it gives me whole words in Unicode, not just ASCII? I use .NET. Answer 1: In .NET, \w matches Unicode characters that are Unicode letters or digits. For example, it would match ì and Æ . To match only ASCII characters, you could use [a-zA-Z0-9] . Answer 2: This works as expected for me: string foo = "Hola, la niña está gritando en alemán: Maüschen raus!"; Regex r = new Regex(@"\w+"); MatchCollection mc = r.Matches(foo); foreach (Match ma in mc) { Console
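For comparison only, the same distinction can be sketched in Python 3 rather than .NET: \w on a str pattern is Unicode-aware by default, and the re.ASCII flag restricts it to ASCII. The variable names are illustrative, and the NFC normalization step is an assumption added so that accented letters are single code points.

```python
import re
import unicodedata

# Normalize to NFC so that "ñ" is one code point rather than "n" + combining tilde.
text = unicodedata.normalize(
    "NFC", "Hola, la niña está gritando en alemán: Maüschen raus!"
)

# Default behaviour for str patterns in Python 3: \w matches Unicode letters and digits.
unicode_words = re.findall(r"\w+", text)

# With re.ASCII, \w only matches [a-zA-Z0-9_], so accented words are split apart.
ascii_words = re.findall(r"\w+", text, flags=re.ASCII)

print(unicode_words)
print(ascii_words)
```

The first list keeps words such as "niña" whole, while the second breaks them at every non-ASCII letter.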

POSIX character equivalents in Java regular expressions

耗尽温柔 posted on 2020-01-13 10:33:27
Question: I would like to use a regular expression like this in Java: [[=a=][=e=][=i=]] . But Java doesn't support the POSIX classes [=a=], [=e=], etc. How can I do this? More precisely, is there a way to not use US-ASCII? Answer 1: Java does support POSIX character classes. The syntax is just different, for instance: \p{Lower} \p{Upper} \p{ASCII} \p{Alpha} \p{Digit} \p{Alnum} \p{Punct} \p{Graph} \p{Print} \p{Blank} \p{Cntrl} \p{XDigit} \p{Space} Answer 2: Quoting from http://download.oracle.com/javase/1.6.0
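The classes listed in Answer 1 are character classes, not the equivalence classes ([=a=]) the question asks about. One common approximation, sketched here in Python rather than Java, is to decompose each character with Unicode NFD normalization and compare its base letter. The function names base_letter and in_equivalence_class are hypothetical, and the result only approximates POSIX equivalence classes, which are locale-dependent.

```python
import unicodedata


def base_letter(ch: str) -> str:
    """Return ch with combining marks removed (e.g. 'é' -> 'e')."""
    decomposed = unicodedata.normalize("NFD", ch)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def in_equivalence_class(ch: str, bases: str = "aei") -> bool:
    """Approximate [[=a=][=e=][=i=]]: is the base letter of ch one of bases?"""
    return base_letter(ch).lower() in set(bases)


for ch in "aéïoÀ":
    print(ch, in_equivalence_class(ch))
```

Java offers the same decomposition through java.text.Normalizer, so the idea carries over, although the details would differ.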

Matching only a unicode letter in Python re

北城余情 posted on 2019-12-28 06:18:34
Question: I have a string from which I want to extract 3 groups: '19 janvier 2012' -> '19', 'janvier', '2012' The month name may contain non-ASCII characters, so [A-Za-z] does not work for me: >>> import re >>> re.search(ur'(\d{,2}) ([A-Za-z]+) (\d{4})', u'20 janvier 2012', re.UNICODE).groups() (u'20', u'janvier', u'2012') >>> re.search(ur'(\d{,2}) ([A-Za-z]+) (\d{4})', u'20 février 2012', re.UNICODE).groups() Traceback (most recent call last): File "<stdin>", line 1, in <module> AttributeError:
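One way to match only Unicode letters with the standard re module, which has no \p{L}, is the class [^\W\d_]: everything \w matches, minus digits and the underscore. The sketch below assumes Python 3 (where str patterns are Unicode-aware by default); the name split_date, the NFC normalization step, and the use of \d{1,2} instead of the question's \d{,2} are illustrative choices, not part of the original post.

```python
import re
import unicodedata

# [^\W\d_] = word characters that are neither digits nor underscore, i.e. letters.
DATE_RE = re.compile(r"(\d{1,2}) ([^\W\d_]+) (\d{4})")


def split_date(text):
    """Return (day, month, year) as strings, or None if the text does not match."""
    # Compose accents first so "é" is one letter, not "e" + a combining mark.
    text = unicodedata.normalize("NFC", text)
    match = DATE_RE.search(text)
    return match.groups() if match else None


print(split_date("20 janvier 2012"))
print(split_date("20 février 2012"))
```

Returning None instead of calling .groups() directly avoids the AttributeError shown in the question when nothing matches.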

Javascript + Unicode regexes

ⅰ亾dé卋堺 posted on 2019-12-24 05:50:53
Question: How can I use Unicode-aware regular expressions in JavaScript? For example, there should be something akin to \w that can match any code point in the Letters or Marks categories (not just the ASCII ones), and hopefully have filters like [[P*]] for punctuation, etc. Answer 1: Situation for ES 6: The upcoming ECMAScript language specification, edition 6, includes Unicode-aware regular expressions. Support must be enabled with the u modifier on the regex. See Unicode-aware regular expressions in ES6. Until

Regex Not Matching Unicode

混江龙づ霸主 posted on 2019-12-21 21:36:52
Question: How would I go about using Regex to match Unicode strings? I'm loading a couple of keywords from a text file and using them with Regex on another file. Both keywords contain Unicode characters (such as á , etc.). I'm not sure where the problem is. Is there some option I have to set? Code: foreach (string currWord in _keywordList) { MatchCollection mCount = Regex.Matches( nSearch.InnerHtml, "\\b" + @currWord + "\\b", RegexOptions.IgnoreCase); if (mCount.Count > 0) { wordFound.Add(currWord); MessageBox

Unicode regexp to match line-breaks?

只愿长相守 posted on 2019-12-20 03:03:56
Question: I have a form from which I want to submit data to a database. The data is UTF-8. I am having trouble matching line breaks. The pattern I am using is something like this: ~^[\p{L}\p{M}\p{N} ]+$~u This pattern works fine until the user puts a new line in the text box. I have tried using \p{Z} inside the class, but with no success. I also tried "s", but it didn't work. Any help is much appreciated. Thanks! Answer 1: A Unicode linebreak is either a carriage return immediately followed by a line
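As a rough illustration in Python (not the PCRE syntax used in the question), a Unicode line break can be spelled out as an explicit alternation. The character list below follows the line-boundary definition in Unicode Technical Standard #18 as I understand it, not the truncated answer above; LINEBREAK and split_lines are hypothetical names.

```python
import re

# CRLF is tried first so it counts as a single break; the class then covers
# LF, VT, FF, CR, NEL (U+0085), LINE SEPARATOR and PARAGRAPH SEPARATOR.
LINEBREAK = re.compile("\r\n|[\n\x0b\x0c\r\x85\u2028\u2029]")


def split_lines(text):
    """Split text on any single Unicode line break."""
    return LINEBREAK.split(text)


print(split_lines("first\r\nsecond\u2028third\nfourth"))
```

Listing the two-character CRLF alternative before the single-character class is what keeps "\r\n" from being counted as two separate breaks.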

Python: Split unicode string on word boundaries

懵懂的女人 posted on 2019-12-18 13:16:39
Question: I need to take a string and shorten it to 140 characters. Currently I am doing: if len(tweet) > 140: tweet = re.sub(r"\s+", " ", tweet) #normalize space footer = "… " + utils.shorten_urls(post['url']) avail = 140 - len(footer) words = tweet.split() result = "" for word in words: word += " " if len(word) > avail: break result += word avail -= len(word) tweet = (result + footer).strip() assert len(tweet) <= 140 So this works great for English and English-like strings, but fails for a Chinese
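A minimal sketch of one possible approach, assuming Python 3: cut at the last space that fits, and fall back to a hard cut when the text has no spaces at all, as is typical for Chinese. The function name shorten and its parameters are hypothetical, the footer here is a plain string rather than the shortened URL from the question, and len() counts code points rather than user-perceived characters.

```python
import re


def shorten(text, limit=140, footer="…"):
    """Shorten text to at most `limit` code points, preferring a word boundary."""
    text = re.sub(r"\s+", " ", text).strip()  # normalize whitespace
    if len(text) <= limit:
        return text
    avail = limit - len(footer)
    if avail <= 0:
        return footer[:limit]
    cut = text[:avail]
    # If the cut lands inside a word, back up to the previous space.
    # Text without any spaces (e.g. Chinese) is simply cut at `avail`.
    if not text[avail].isspace() and " " in cut:
        cut = cut[: cut.rfind(" ")]
    return cut.rstrip() + footer


print(shorten("hello world foo bar", limit=10))
print(len(shorten("你好" * 100)))
```

Splitting unspaced scripts on real word boundaries would need a segmentation library; the hard cut above only guarantees the length limit.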