Regex “punct” character class matches different characters depending on Ruby version

温柔的废话 2021-01-19 06:39

Ruby's character classes for punctuation characters, i.e. [:punct:], \p{Punct}, or \p{P}, seem to match different characters depending on the Ruby version.

1 answer
  • 2021-01-19 07:23

    Ruby 1.9.3 used US-ASCII as its default source encoding, under which all ASCII punctuation was matched properly. Ruby 2.0 switched the default to UTF-8, introducing the bug you discovered: some punctuation characters are no longer matched. Ruby 2.4 fixed this bug.

    The correct behavior is to match all punctuation, as Ruby 1.9.3 and 2.4 do. This is consistent with the POSIX regex definition of punctuation.
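
To see which characters are affected on a given interpreter, a small probe script can help. This is only a sketch: the helper names `ascii_punctuation` and `unmatched_punctuation` are made up for illustration, and the list of "all punctuation" is assumed to be the 32 printable ASCII characters that are neither letters nor digits.

```ruby
# Hypothetical helper: the 32 ASCII punctuation characters, built as
# "printable ASCII that is not a letter or digit", as UTF-8 strings.
def ascii_punctuation
  (0x21..0x7E).map { |code| code.chr(Encoding::UTF_8) }
              .reject { |char| char =~ /[A-Za-z0-9]/ }
end

# Hypothetical helper: the characters from the list above that +pattern+
# fails to match on the interpreter running this script.
def unmatched_punctuation(pattern)
  ascii_punctuation.reject { |char| char =~ pattern }
end

patterns = {
  "[[:punct:]]" => /[[:punct:]]/,
  "\\p{Punct}"  => /\p{Punct}/,
  "\\p{P}"      => /\p{P}/
}

# The report depends on the Ruby version, so no expected output is shown.
puts "Ruby #{RUBY_VERSION}"
patterns.each do |label, pattern|
  missed = unmatched_punctuation(pattern)
  puts "#{label} misses #{missed.length} of #{ascii_punctuation.length}: #{missed.join(' ')}"
end
```

On UTF-8 strings, \p{P} follows the Unicode Punctuation category, which classifies $ + < = > ^ ` | ~ as symbols rather than punctuation, so those are the characters to watch for in the report.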

    One way to make your code behave consistently is to encode each string as US-ASCII (or another encoding that doesn't have the UTF-8 bug) before matching:

    matched, unmatched = chars.partition { |c| c.encode(Encoding::US_ASCII) =~ /[[:punct:]]/ }

    But that's probably not desirable because it forces you to use a restrictive encoding for your strings.
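
The restriction is easy to demonstrate: `String#encode` raises `Encoding::UndefinedConversionError` for any character that has no US-ASCII equivalent. The sketch below wraps the approach in a hypothetical helper, `ascii_punct?`, and assumes that non-ASCII characters should simply count as "not punctuation"; that policy is an assumption, not something the answer prescribes.

```ruby
# Hypothetical helper: true if +char+ is ASCII punctuation.
# Assumption: characters that cannot be converted to US-ASCII
# are treated as "not punctuation" instead of raising.
def ascii_punct?(char)
  !!(char.encode(Encoding::US_ASCII) =~ /[[:punct:]]/)
rescue Encoding::UndefinedConversionError
  false
end

puts ascii_punct?("$")       # ASCII symbol
puts ascii_punct?("a")       # ASCII letter
puts ascii_punct?("\u00E9")  # e with acute accent, not representable in US-ASCII

# Without the rescue, the conversion itself fails:
begin
  "\u00E9".encode(Encoding::US_ASCII)
rescue Encoding::UndefinedConversionError => e
  puts "conversion failed: #{e.class}"
end
```

Swallowing the error this way hides non-ASCII punctuation entirely, which is the practical cost of the encoding workaround.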

    The other option is to manually define the punctuation:

    /[!"\#$%&'()*+,\-.\/:;<=>?@\[\\\]^_`{|}~]/

    It's somewhat inelegant, but you can store it in a variable and interpolate it into other regexes:

    # A Regexp (rather than a double-quoted string) keeps the escapes intact when interpolated.
    punctuation = /[!"\#$%&'()*+,\-.\/:;<=>?@\[\\\]^_`{|}~]/
    my_regex = /#{punctuation}/
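
As a rough illustration of how the explicit class composes with larger patterns, here is a sketch. The helper names `ascii_punct_class`, `strip_trailing_punct`, and `split_on_punct` are hypothetical and exist only for this example.

```ruby
# Hypothetical helper returning the explicit class as a Regexp.
# The "/" inside the class is escaped so it does not end the literal.
def ascii_punct_class
  /[!"\#$%&'()*+,\-.\/:;<=>?@\[\\\]^_`{|}~]/
end

# Strip any run of punctuation at the end of a string.
def strip_trailing_punct(text)
  text.sub(/#{ascii_punct_class}+\z/, "")
end

# Split a string on single punctuation characters.
def split_on_punct(text)
  text.split(/#{ascii_punct_class}/)
end

puts strip_trailing_punct("Hello, world!?")  # prints Hello, world
p split_on_punct("a+b=c")                    # prints ["a", "b", "c"]
```

Because the class spells out each character, it should behave the same regardless of string encoding or Ruby version, at the cost of covering ASCII punctuation only.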
    