character-properties

POSIX character equivalents in Java regular expressions

淺唱寂寞╮ posted on 2020-01-13 10:33:10
Question: I would like to use a regular expression like this in Java: [[=a=][=e=][=i=]]. But Java doesn't support the POSIX classes [=a=], [=e=], etc. How can I do this? More precisely, is there a way to not use US-ASCII?
Answer 1: Java does support POSIX character classes. The syntax is just different, for instance: \p{Lower} \p{Upper} \p{ASCII} \p{Alpha} \p{Digit} \p{Alnum} \p{Punct} \p{Graph} \p{Print} \p{Blank} \p{Cntrl} \p{XDigit} \p{Space}
Answer 2: Quoting from http://download.oracle.com/javase/1.6.0
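
The \p{...} names listed in the first answer are POSIX character classes, whereas [=a=] in the question is a POSIX equivalence class. As a rough illustration in Python (not the asker's Java), one way to approximate an equivalence class is to compare characters after Unicode canonical decomposition. This is only a sketch: the function names are hypothetical, and NFD decomposition is an assumption standing in for locale-defined equivalence, which it does not reproduce exactly.

```python
import unicodedata

def base_char(ch):
    """Return the first code point of ch's canonical decomposition (NFD)."""
    return unicodedata.normalize("NFD", ch)[0]

def find_equivalents(text, bases="aei"):
    """Return the characters of text whose decomposed base letter is in `bases`.

    Hypothetical helper: approximates [[=a=][=e=][=i=]] for lowercase bases.
    """
    return [ch for ch in text if base_char(ch) in bases]

# Accented and plain forms of a, e, i are all collected.
print(find_equivalents("d\u00e9j\u00e0 vu, na\u00efve"))
```

The comparison is case-sensitive as written; passing bases="aeiAEI" would widen it.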

Matching only a unicode letter in Python re

北城余情 posted on 2019-12-28 06:18:34
Question: I have a string from which I want to extract 3 groups:

    '19 janvier 2012' -> '19', 'janvier', '2012'

The month name could contain non-ASCII characters, so [A-Za-z] does not work for me:

    >>> import re
    >>> re.search(ur'(\d{,2}) ([A-Za-z]+) (\d{4})', u'20 janvier 2012', re.UNICODE).groups()
    (u'20', u'janvier', u'2012')
    >>> re.search(ur'(\d{,2}) ([A-Za-z]+) (\d{4})', u'20 février 2012', re.UNICODE).groups()
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    AttributeError:
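
The session in the question is Python 2 (ur'' literals). A minimal Python 3 sketch of one common workaround follows, assuming the standard re module only. The class [^\W\d_] means "a word character that is neither a digit nor an underscore", which in practice leaves letters (it can also admit a few numeric characters such as superscripts that \d does not cover). The names DATE_RE and split_date are hypothetical, and the day is tightened to \d{1,2}, which differs from the \d{,2} in the question.

```python
import re

# Python 3 str patterns are Unicode-aware by default, so \w covers
# non-ASCII letters; subtracting \d and _ leaves (essentially) letters.
DATE_RE = re.compile(r"(\d{1,2}) ([^\W\d_]+) (\d{4})")

def split_date(text):
    """Return (day, month, year) as strings, or None if nothing matches."""
    match = DATE_RE.search(text)
    return match.groups() if match else None

print(split_date("20 janvier 2012"))
print(split_date("20 f\u00e9vrier 2012"))
```

Returning None instead of calling .groups() directly avoids the AttributeError shown in the question when the pattern does not match.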

Javascript + Unicode regexes

ⅰ亾dé卋堺 posted on 2019-12-24 05:50:53
Question: How can I use Unicode-aware regular expressions in JavaScript? For example, there should be something akin to \w that can match any code point in the Letter or Mark categories (not just the ASCII ones), and hopefully have filters like [[P*]] for punctuation, etc.
Answer 1: Situation for ES 6: The upcoming ECMAScript language specification, edition 6, includes Unicode-aware regular expressions. Support must be enabled with the u modifier on the regex. See Unicode-aware regular expressions in ES6. Until

Regex Not Matching Unicode

混江龙づ霸主 posted on 2019-12-21 21:36:52
Question: How would I go about using Regex to match Unicode strings? I'm loading a couple of keywords from a text file and using them with Regex on another file. Both keywords contain Unicode characters (such as á). I'm not sure where the problem is. Is there some option I have to set? Code:

    foreach (string currWord in _keywordList)
    {
        MatchCollection mCount = Regex.Matches(
            nSearch.InnerHtml, "\\b" + @currWord + "\\b", RegexOptions.IgnoreCase);
        if (mCount.Count > 0)
        {
            wordFound.Add(currWord);
            MessageBox

Unicode regexp to match line-breaks?

只愿长相守 posted on 2019-12-20 03:03:56
Question: I have a form from which I want to submit data to a database. The data is UTF-8. I am having trouble matching line breaks. The pattern I am using is something like this: ~^[\p{L}\p{M}\p{N} ]+$~u This pattern works fine until the user puts a new line in the text box. I have tried using \p{Z} inside the class, but with no success. I also tried "s", but it didn't work. Any help is much appreciated. Thanks!
Answer 1: A Unicode linebreak is either a carriage return immediately followed by a line
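
As an illustration in Python (the question's pattern uses PCRE delimiters, so this is a swapped-in language, not the asker's code), the sketch below accepts letters, digits, spaces and line breaks. It assumes the usual set of Unicode line terminators: CRLF, LF, VT, FF, CR, NEL (U+0085), LS (U+2028) and PS (U+2029). The names LINEBREAK, FIELD_RE and is_valid are hypothetical, and [^\W_] is only an approximation of \p{L}\p{N}: it does not cover standalone combining marks (\p{M}).

```python
import re

# CRLF is tried first so that it counts as a single line break.
LINEBREAK = r"(?:\r\n|[\n\x0b\x0c\r\x85\u2028\u2029])"

# One or more of: a letter or digit ([^\W_]), a space, or a line break.
FIELD_RE = re.compile(r"(?:[^\W_]| |" + LINEBREAK + r")+")

def is_valid(text):
    """Return True if text consists only of letters, digits, spaces and line breaks."""
    return FIELD_RE.fullmatch(text) is not None

print(is_valid("caf\u00e9 au lait\r\nligne deux"))
print(is_valid("not allowed!"))
```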

Python: Split unicode string on word boundaries

懵懂的女人 posted on 2019-12-18 13:16:39
Question: I need to take a string and shorten it to 140 characters. Currently I am doing:

    if len(tweet) > 140:
        tweet = re.sub(r"\s+", " ", tweet)  # normalize space
        footer = "… " + utils.shorten_urls(post['url'])
        avail = 140 - len(footer)
        words = tweet.split()
        result = ""
        for word in words:
            word += " "
            if len(word) > avail:
                break
            result += word
            avail -= len(word)
        tweet = (result + footer).strip()
        assert len(tweet) <= 140

So this works great for English and English-like strings, but fails for a Chinese

Is There a Way to Match Any Unicode non-Alphabetic Character?

女生的网名这么多〃 posted on 2019-12-18 05:47:48
Question: I have some documents that went through OCR conversion from PDF into HTML. Because of that, they wound up with lots of random Unicode punctuation where the converter messed up (e.g. ellipses). They also correctly contain a number of non-English but still alphabetic characters, like é, Russian characters, etc. Is there any way to write a regex that will match any Unicode alphabetic character (from alphabets of any language)? Or one that will match only non-alphabetic characters?
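
The question does not name a regex flavor. As a hedged sketch in Python, which has no \p{L} in its standard re module, the same split can be made with the Unicode general category of each character: categories starting with "L" are letters, everything else is non-alphabetic. The function names are hypothetical, and whitespace is skipped here purely as an assumption about what counts as OCR debris.

```python
import unicodedata

def is_alphabetic(ch):
    """True if ch is in one of the Unicode letter categories (Lu, Ll, Lt, Lm, Lo)."""
    return unicodedata.category(ch).startswith("L")

def non_alphabetic_chars(text):
    """Return the non-letter, non-whitespace characters of text, in order."""
    return [ch for ch in text if not is_alphabetic(ch) and not ch.isspace()]

# An accented Latin word, a stray ellipsis, a Cyrillic word, an exclamation mark.
sample = "caf\u00e9\u2026 \u043c\u0438\u0440!"
print(non_alphabetic_chars(sample))
```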

Replace Unicode Control Characters

Deadly posted on 2019-12-17 18:34:01
Question: I need to replace all special control characters in a string in Java. I want to query the Google Maps API v3, and Google doesn't seem to like these characters. Example: http://www.google.com/maps/api/geocode/json?sensor=false&address=NEW%20YORK%C2%8F This URL contains this character: http://www.fileformat.info/info/unicode/char/008f/index.htm So I receive some data, and I need to geocode this data. I know some characters would not pass the geocoding, but I don't know the exact list. I was not
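
The %C2%8F at the end of the example URL is the UTF-8 encoding of U+008F, a C1 control character. As an illustration in Python (not the asker's Java), the sketch below strips every character in the Unicode Cc (control) category, which is the ranges U+0000 to U+001F and U+007F to U+009F. The names CONTROL_RE and strip_controls are hypothetical, and whether removing all of Cc is enough for the geocoding service is an assumption the question leaves open.

```python
import re

# The Cc general category: C0 controls, DEL, and C1 controls.
CONTROL_RE = re.compile(r"[\x00-\x1f\x7f-\x9f]")

def strip_controls(text, replacement=""):
    """Replace every control character in text with `replacement`."""
    return CONTROL_RE.sub(replacement, text)

print(repr(strip_controls("NEW YORK\u008f")))
```

Note that this also removes tabs and newlines, which are control characters too; pass replacement=" " if they should become spaces instead.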

What is the {L} Unicode category?

一世执手 posted on 2019-12-17 09:48:09
Question: I came across some regular expressions that contain [^\\p{L}]. I understand that this is using some form of a Unicode category, but when I checked the documentation, I found only the following "L" categories:

    Lu  Uppercase letter  UPPERCASE_LETTER
    Ll  Lowercase letter  LOWERCASE_LETTER
    Lt  Titlecase letter  TITLECASE_LETTER
    Lm  Modifier letter   MODIFIER_LETTER
    Lo  Other letter      OTHER_LETTER

What is L in this context?
Answer 1: Taken from this link: http://www.regular-expressions.info/unicode.html Check the
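
In Unicode regex flavors, the one-letter property L conventionally stands for the union of the two-letter letter categories listed in the question. A small Python sketch of that relationship follows, using the standard unicodedata module; the names LETTER_SUBCATEGORIES, SAMPLES and is_category_l are hypothetical and chosen only for illustration.

```python
import unicodedata

# The five subcategories that together make up the major category "L".
LETTER_SUBCATEGORIES = ("Lu", "Ll", "Lt", "Lm", "Lo")

def is_category_l(ch):
    """True if ch falls in any letter subcategory, i.e. would match \\p{L}."""
    return unicodedata.category(ch) in LETTER_SUBCATEGORIES

# One sample per subcategory, followed by a digit and a punctuation mark.
SAMPLES = ["A", "a", "\u01c5", "\u02b0", "\u05d0", "1", "!"]

for ch in SAMPLES:
    print(repr(ch), unicodedata.category(ch), is_category_l(ch))
```

A negated class such as [^\p{L}] therefore matches any character for which this check is false.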