Does multibyte character interfere with end-line character within a regex?

Submitted by 两盒软妹~` on 2019-12-20 11:52:44

Question


With this regex:

regex1 = /\z/

the following strings match:

"hello" =~ regex1 # => 5
"こんにちは" =~ regex1 # => 5

but with these regexes:

regex2 = /#$/?\z/
regex3 = /\n?\z/

the results differ:

"hello" =~ regex2 # => 5
"hello" =~ regex3 # => 5
"こんにちは" =~ regex2 # => nil
"こんにちは" =~ regex3 # => nil

What is interfering? The string encoding is UTF-8, and the OS is Linux (i.e., $/ is "\n"). Are the multibyte characters interfering with $/? How?


Answer 1:


The problem you reported is definitely a bug in the Regexp engine of Ruby 2.0.0 (RUBY_VERSION #=> "2.0.0"), and it already existed in 1.9. It shows up when the encoding allows multi-byte characters, such as __ENCODING__ #=> #<Encoding:UTF-8>.

It does not depend on Linux; the same behavior can be reproduced on OS X and Windows too.

Until bug 8210 is fixed, we can help by isolating and understanding the cases in which the problem occurs. This can also be useful for finding a workaround where one applies to a specific case.
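As one possible workaround for the specific case in the question, the position that /\n?\z/ is meant to find can be computed without the regex engine at all. This is only a sketch: the helper name optional_newline_end_index is made up for illustration, and it assumes the goal is to locate the index just before a single optional trailing "\n".

```ruby
# Hypothetical helper (not part of Ruby): returns the index that
# `str =~ /\n?\z/` is expected to return on an unaffected interpreter,
# using only String methods so the regex engine is never involved.
def optional_newline_end_index(str)
  # String#length counts characters, not bytes, so multi-byte text is safe.
  str.end_with?("\n") ? str.length - 1 : str.length
end

puts optional_newline_end_index("hello")       # 5
puts optional_newline_end_index("hello\n")     # 5
puts optional_newline_end_index("こんにちは")   # 5
```

Because it never builds a Regexp, this sidesteps the bug rather than fixing it, and it only covers this one pattern.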

As far as I can tell, the problem occurs when:

  • a pattern is matched immediately before the end-of-string anchor \z,
  • and the last character of the string is multi-byte,
  • and that pattern uses the zero-or-one quantifier ?,
  • but the optional pattern spans fewer bytes than the last character of the string.

The bug may be caused by the regular expression engine confusing the number of bytes with the number of characters it actually checks.
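To make the byte/character distinction concrete, here is a small sketch that prints both counts for the characters used in the tests. The constant name SAMPLES is an illustrative choice, and ç and は are written as \u escapes so the result does not depend on how the file is saved.

```ruby
# Characters that appear in the tests, keyed by a descriptive label.
SAMPLES = {
  "newline"     => "\n",      # U+000A
  "c_cedilla"   => "\u00E7",  # ç
  "hiragana_ha" => "\u306F"   # は
}.freeze

SAMPLES.each do |label, ch|
  # String#length counts characters; String#bytesize counts encoded bytes.
  puts format("%-12s chars=%d bytes=%d", label, ch.length, ch.bytesize)
end
```

In UTF-8 each of these is a single character, but they occupy 1, 2 and 3 bytes respectively, which are the byte counts quoted in the tests.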

A few examples may help:

TEST 1: the last character, "は", is 3 bytes:

s = "んにちは"

Testing for zero or one ん [3 bytes] before the end of the string:

s =~ /ん?\z/u   #=> 4        # OK it works 3 == 3

When we try with ç [2 bytes]:

s =~ /ç?\z/u   #=> nil       # KO: BUG when 3 > 2
s =~ /x?ç?\z/u #=> 4         # OK it works 3 == ( 1+2 )

When we test for zero or one \n [1 byte]:

s =~ /\n?\z/u #=> nil       # KO: BUG when 3 > 1
s =~ /\n?\n?\z/u #=> nil    # KO: BUG when 3 > 2
s =~ /\n?\n?\n?\z/u #=> 4   # OK it works 3 == ( 1+1+1)

From the results of TEST 1 we can assert: if the last multi-byte character of the string is 3 bytes, then the 'zero or one before' test only works when we test for at least 3 bytes (not 3 characters) before \z.

TEST 2: the last character, "ç", is 2 bytes:

s = "in French there is the ç" 

Check for zero or one ん [3 bytes]:

s =~ /ん?\z/u #=> 24        # OK 2 <= 3

Check for zero or one é [2 bytes]:

s =~ /é?\z/u #=> 24         # OK 2 == 2
s =~ /x?é?\z/u #=> 24       # OK 2 < (2+1)

Test for zero or one \n [1 byte]:

s =~ /\n?\z/u    #=> nil    # KO 2 > 1  ( the BUG occurs )
s =~ /\n?\n?\z/u #=> 24     # OK 2 == (1+1)
s =~ /\n?\n?\n?\z/u #=> 24  # OK 2 < (1+1+1)

From the results of TEST 2 we can assert: if the last multi-byte character of the string is 2 bytes, then the 'zero or one before' test only works when we check for at least 2 bytes (not 2 characters) before \z.
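The rule suggested by TEST 1 and TEST 2 can be written down as a small predicate and checked against the results listed above. This is a sketch of the observed pattern only, not of what the engine does internally; predicted_to_fail? and OBSERVED are hypothetical names, and ç and é are written as \u escapes to pin them to their 2-byte forms.

```ruby
# Hypothetical helper: true when the observations predict a failed match,
# i.e. the optional atoms in front of \z cover fewer bytes than the last
# (multi-byte) character of the string.
def predicted_to_fail?(string, optional_atoms)
  return false if string.empty? || optional_atoms.empty?
  last_char_bytes = string[-1].bytesize
  covered_bytes = optional_atoms.sum(&:bytesize)
  last_char_bytes > 1 && covered_bytes < last_char_bytes
end

TEST1 = "んにちは"                         # last character is 3 bytes
TEST2 = "in French there is the \u00E7"   # last character (ç) is 2 bytes

# [string, optional atoms before \z, did the match fail in the tests above?]
OBSERVED = [
  [TEST1, ["ん"], false],
  [TEST1, ["\u00E7"], true],
  [TEST1, ["x", "\u00E7"], false],
  [TEST1, ["\n"], true],
  [TEST1, ["\n", "\n"], true],
  [TEST1, ["\n", "\n", "\n"], false],
  [TEST2, ["ん"], false],
  [TEST2, ["\u00E9"], false],
  [TEST2, ["x", "\u00E9"], false],
  [TEST2, ["\n"], true],
  [TEST2, ["\n", "\n"], false],
  [TEST2, ["\n", "\n", "\n"], false]
].freeze

OBSERVED.each do |string, atoms, failed|
  verdict = predicted_to_fail?(string, atoms) == failed ? "agrees" : "differs"
  puts "last char #{string[-1].inspect}, atoms #{atoms.inspect}: #{verdict}"
end
```

The predicate only compares byte counts, so it says nothing about why the engine behaves this way; it simply restates the two assertions in one place.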

When the multi-byte character is not at the end of the string, I found that it works correctly.
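A minimal way to see this contrast is to run the same pattern on a string that ends in a multi-byte character and on one that ends in an ASCII character. The names EOS_PATTERN and match_position are illustrative, and the trailing "x" is just an arbitrary single-byte character.

```ruby
# The pattern from the question: an optional newline before end of string.
EOS_PATTERN = /\n?\z/

# Hypothetical helper: returns the match index, or nil when nothing matches.
def match_position(string)
  string =~ EOS_PATTERN
end

# Ends in a 3-byte character: the question reports nil on affected versions,
# while an interpreter without the bug returns 5.
puts match_position("こんにちは").inspect

# Ends in a single-byte character, so the multi-byte text is no longer last:
# the match is found at index 6.
puts match_position("こんにちはx").inspect
```

Running this on a given interpreter also serves as a quick check of whether that build is affected.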

A public gist with my test code is available here.




Answer 2:


In Ruby trunk, the issue has now been accepted as a bug. Hopefully, it will be fixed.

Update: Two patches have been posted in Ruby trunk.



Source: https://stackoverflow.com/questions/15779859/does-multibyte-character-interfere-with-end-line-character-within-a-regex
