Why does Ruby /[[:punct:]]/ miss some punctuation characters?

北恋 2020-12-07 01:11

Ruby /[[:punct:]]/ is supposed to match all "punctuation characters". According to Wikipedia, this means /[\]\[!"#$%&'()*+,./:;<=>?@\^_`{

2 answers
  •  失恋的感觉
    2020-12-07 01:27

    The less-than and greater-than signs are in the Unicode "Symbol, Math" category (Sm), not in a punctuation category. You can see this by forcing the regex's encoding to UTF-8 with the /u flag (it defaults to the source encoding; presumably your source is UTF-8 encoded, while my default source encoding is something else):

    2.1.2 :004 > /[[:punct:]]/u =~ '<'
     => nil 
    2.1.2 :005 > /[[:punct:]]/ =~ '<'
     => 0 
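
    To see which ASCII characters this affects, you can ask Ruby for each character's category directly. A rough sketch, not part of the original session: the constant names are made up for illustration, and \p{S} and \p{P} are the Unicode Symbol and Punctuation categories.

```ruby
# The 32 printable ASCII characters that are neither letters nor digits.
ASCII_NON_ALNUM = (33..126).map(&:chr).reject { |c| c =~ /[A-Za-z0-9]/ }.freeze

# Split by Unicode general category: Symbol (S*) versus Punctuation (P*).
ASCII_SYMBOLS     = ASCII_NON_ALNUM.select { |c| c =~ /\p{S}/u }.freeze
ASCII_PUNCTUATION = ASCII_NON_ALNUM.select { |c| c =~ /\p{P}/u }.freeze

puts "Symbol:      #{ASCII_SYMBOLS.join}"      # Symbol:      $+<=>^`|~
puts "Punctuation: #{ASCII_PUNCTUATION.join}"  # the remaining 23 characters
```

    '<' lands in the Symbol group, together with eight other ASCII characters.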
    

    If you force the regex to ASCII encoding with the /n flag, '<' is classified as punct, which I think is what you want. However, this will probably cause problems if your text contains characters outside the ASCII subset of UTF-8.

    2.1.2 :009 > /[[:punct:]]/n =~ '<'
     => 0 
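
    If ASCII semantics are what you want but the text may contain non-ASCII characters, one way to sidestep the encoding flags is to spell out the ASCII set explicitly, much like the Wikipedia class quoted in the question. A sketch under that assumption: POSIX_PUNCT is a name invented here, and Regexp.union is used only to avoid hand-escaping the metacharacters.

```ruby
# Hypothetical constant: the 32 ASCII characters of POSIX [:punct:], listed explicitly.
# Regexp.union escapes metacharacters such as "[" and "^" for us.
POSIX_PUNCT = Regexp.union((33..126).map(&:chr).reject { |c| c =~ /[A-Za-z0-9]/ })

puts POSIX_PUNCT.match?("<")                   # true
puts "caf\u00E9 <ole>".gsub(POSIX_PUNCT, "")   # non-ASCII letters are left untouched
```

    Because the pattern contains only literal characters, its result does not depend on how [[:punct:]] is interpreted for a given encoding.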
    

    A better solution would be to use the 'Symbol' category (\p{S}) in your regex instead of 'punct'; it matches '<' under UTF-8 encoding:

    2.1.2 :012 > /\p{S}/u =~ '<'
     => 0 
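
    Keep in mind that \p{S} on its own does not match characters such as '!' or ','. If the goal is everything POSIX calls punctuation, one option is to combine both categories. A minimal sketch, assuming UTF-8 text; the constant and helper names are hypothetical.

```ruby
# Hypothetical constant: any character in a Unicode Punctuation (P) or Symbol (S) category.
PUNCT_OR_SYMBOL = /[\p{P}\p{S}]/u

# Hypothetical helper: remove every punctuation or symbol character from text.
def strip_punct_and_symbols(text)
  text.gsub(PUNCT_OR_SYMBOL, "")
end

puts strip_punct_and_symbols("a < b, c >= d!")        # ASCII punctuation and symbols
puts strip_punct_and_symbols("\u00BFa < b? \u20AC5")  # also non-ASCII ones such as U+00BF and U+20AC
```

    Unlike the /n approach, this class also covers non-ASCII punctuation and symbols, which may or may not be what you want.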
    

    There's a longer list of categories here.
