Scanning for Unicode Numbers in a string with \d

傲寒

According to the Oniguruma documentation, the \d character type matches:

decimal digit char
Unicode: General_Category -- Decimal_N

3 Answers

离开以前

2020-12-19 05:25
Noted by Brian Candler on ruby-talk:
- \w only matches ASCII letters and digits, while [[:alpha:]] matches the full set of Unicode letters.
- \d only matches ASCII digits, while [[:digit:]] matches the full set of Unicode numbers.
The behavior is thus 'consistent', and we have a simple workaround for Unicode numbers. Reading up on \w in the same Oniguruma doc we see the text:
```
\w  word character  
    Not Unicode: alphanumeric, "_" and multibyte char.  
    Unicode: General_Category -- (Letter|Mark|Number|Connector_Punctuation)
```
In light of the real behavior of Ruby and the "Not Unicode" text above, it would appear that the documentation is describing two modes—a Unicode mode and a Not Unicode mode—and that Ruby is operating in the Not Unicode mode.

This would explain why \d does not match the full Unicode set: although the Oniguruma documentation fails to describe exactly what is matched when in Not Unicode mode, we now know that the behavior documented as "Unicode" is not to be expected.
```
p "abç".scan(/\w/), "abç".scan(/[[:alpha:]]/)
#=> ["a", "b"]
#=> ["a", "b", "\u00E7"]
```
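The same contrast can be sketched for digits. This is a minimal example, assuming Ruby 1.9 or later and a UTF-8 string; the sample string and the variable names are illustrative and not taken from the original post.

```ruby
# ASCII "1", ARABIC-INDIC DIGIT ONE (U+0661), DEVANAGARI DIGIT TWO (U+0968)
digits = "1\u0661\u0968"

# \d is ASCII-only by default, so it should see only "1".
ascii_only = digits.scan(/\d/)

# The POSIX bracket expression is Unicode-aware and should see all three.
posix = digits.scan(/[[:digit:]]/)

p ascii_only
p posix
```

Note that [[:digit:]] is expected to cover decimal digits only, so characters such as fractions or Roman numerals are not matched by it.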
It is left as an exercise to the reader to discover how (if at all) to enable Unicode mode in Ruby regexps, as the /u flag (e.g. /\w/u) does not do it. (Perhaps Ruby must be recompiled with a special flag for Oniguruma.)

Update: It would appear that the Oniguruma document I have linked to is not accurate for Ruby 1.9. See this ticket discussion, including these posts:

[Yui NARUSE] "RE.txt is for original Oniguruma, not for Ruby 1.9's regexp. We may need our own document."
[Matz] "Our Oniguruma is forked one. The original Oniguruma found in geocities.jp has not been changed."

Better Reference: Here is official documentation on Ruby 1.9's regexp syntax:
https://github.com/ruby/ruby/blob/trunk/doc/re.rdoc

甜味超标

2020-12-19 05:25

\d matches only ASCII digits by default. You can manually turn on Unicode matching inside a regex with the (counter-intuitive) (?u) syntax, for example /(?u)\d/.
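The code that originally accompanied this answer is missing from the page, so the following is only a sketch of what the (?u) option does. It assumes Ruby 2.0 or later, where the inline u option is available; the sample string and variable names are illustrative.

```ruby
# ASCII "1", ARABIC-INDIC DIGIT ONE (U+0661), DEVANAGARI DIGIT TWO (U+0968)
digits = "1\u0661\u0968"

# Default behavior: \d matches ASCII digits only.
ascii_only = digits.scan(/\d/)

# With the inline (?u) option, \d follows Unicode rules.
unicode = digits.scan(/(?u)\d/)

p ascii_only
p unicode
```

The option applies from the point where it appears in the pattern, so placing it at the start affects the whole regex.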

后悔当初

2020-12-19 05:32

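Try the Unicode property class \p{N} instead. It matches any character in the Unicode Number category, which covers decimal digits as well as letter-like numerals and fractions; use \p{Nd} if you want decimal digits only. No idea why \d isn't working.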
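A short sketch of the difference between the two property classes, assuming Ruby 1.9 or later and a UTF-8 string (the sample string and variable names are illustrative):

```ruby
# "7", ARABIC-INDIC DIGIT THREE (U+0663), ROMAN NUMERAL EIGHT (U+2167),
# VULGAR FRACTION ONE HALF (U+00BD), separated by spaces
sample = "7 \u0663 \u2167 \u00BD"

# \p{N}: any character in the Unicode Number category.
any_number = sample.scan(/\p{N}/)

# \p{Nd}: decimal digits only.
decimal_digits = sample.scan(/\p{Nd}/)

p any_number
p decimal_digits
```

If the goal is to parse the matches as integers afterwards, \p{Nd} is probably the safer choice, since numerals and fractions have no single-digit value.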