Regex “punct” character class matches different characters depending on Ruby version

温柔的废话

Ruby's character classes for punctuation characters, i.e. [:punct:], \p{Punct}, or \p{P}, seem to match different characters depending on the Ruby version.

1 answer

梦如初夏

2021-01-19 07:23
Ruby 1.9.3 used US-ASCII as its default source encoding, under which all ASCII punctuation was matched properly. Ruby 2.0 switched the default to UTF-8, which exposed the bug you discovered: some punctuation characters are no longer matched. Ruby 2.4 fixed this bug.

The correct behavior is to match all punctuation, as Ruby 1.9.3 and 2.4 do. This is consistent with the POSIX regex definition of punctuation.
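
To see which characters your own interpreter treats as punctuation, a minimal sketch along these lines may help. It assumes the POSIX C-locale definition of punct (printable ASCII that is neither alphanumeric nor a space); the names `POSIX_PUNCT`, `matched`, and `unmatched` are illustrative only.

```ruby
# ASCII punctuation per the POSIX C locale, built from code point ranges
# so the list itself does not depend on the regex engine.
# The characters are created as UTF-8 strings, since that is where the
# version differences were reported.
POSIX_PUNCT = [0x21..0x2F, 0x3A..0x40, 0x5B..0x60, 0x7B..0x7E]
              .flat_map(&:to_a)
              .map { |cp| cp.chr(Encoding::UTF_8) }

# Split the list by whether this interpreter's [[:punct:]] accepts each character.
matched, unmatched = POSIX_PUNCT.partition { |c| c =~ /[[:punct:]]/ }

puts "Ruby #{RUBY_VERSION}"
puts "matched:   #{matched.join}"
puts "unmatched: #{unmatched.join}"
```

Which characters land in the unmatched list depends on the Ruby version you run it under, so no output is shown here.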

One way to make your code behave consistently is to encode all strings as US-ASCII, or as another encoding that doesn't have the UTF-8 bug:

matched, unmatched = chars.partition { |c| c.encode(Encoding::US_ASCII) =~ /[[:punct:]]/ }

But that's probably not desirable because it forces you to use a restrictive encoding for your strings.
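
To illustrate the restriction: converting a string that contains non-ASCII characters to US-ASCII raises `Encoding::UndefinedConversionError`, so the check has to rescue that error or skip such characters. The sketch below is one way to do it; the helper name `ascii_punct?` is hypothetical and not part of Ruby.

```ruby
# Hypothetical helper: true if the single character is ASCII punctuation.
# Characters that cannot be represented in US-ASCII count as non-matches.
def ascii_punct?(char)
  !!(char.encode(Encoding::US_ASCII) =~ /[[:punct:]]/)
rescue Encoding::UndefinedConversionError
  false
end

chars = ["!", "a", "é", "~"]
matched, unmatched = chars.partition { |c| ascii_punct?(c) }
```

Treating unconvertible characters as non-matches is a design choice made for this sketch; it also means non-ASCII punctuation such as curly quotes will never be matched this way.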

The other option is to manually define the punctuation:
```
/[!"\#$%&'()*+,\-.\/:;<=>?@\[\\\]^_`{|}~]/
```
It's somewhat inelegant, but you can throw it into a variable and add it to regexes that way:
```
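# Store a Regexp rather than a double-quoted string: a string literal would
# drop the backslashes before "[", "]" and "-" and break the character class.
punctuation = /[!"\#$%&'()*+,\-.\/:;<=>?@\[\\\]^_`{|}~]/
my_regex = /#{punctuation}/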
```
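
As an alternative sketch that is not part of the original answer, the same 32 ASCII characters can be written as code point ranges, which avoids most of the escaping. The names `ASCII_PUNCT` and `trailing_punct` are assumptions made for illustration.

```ruby
# The four ASCII punctuation ranges: !-/  :-@  [-`  {-~
ASCII_PUNCT = /[\x21-\x2F\x3A-\x40\x5B-\x60\x7B-\x7E]/

# Interpolating a Regexp into another regex embeds it as a group,
# so the class can be reused inside larger patterns.
trailing_punct = /#{ASCII_PUNCT}+\z/

sample = "Hello, world!?"
stripped = sample.sub(trailing_punct, "")
puts stripped  # => Hello, world
```

The ranges are harder to read than the spelled-out list, so a comment naming the characters they cover is worth keeping next to the constant.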