Maximum Hex value in regex

南旧 2021-02-12 13:27
Without the u flag, the hex range that can be used is [\\x{00}-\\x{ff}], but with the u flag it goes up to a 4-byte value, \\x{7fffffff}.

      
      
        
5 answers

没有蜡笔的小新 (original poster) 2021-02-12 13:48

I'm not sure about PHP, but there really is no governor on code points, so it doesn't matter that there are only some 1.1 million valid ones. That is subject to change at any time, but it's not really up to engines to enforce that. There are reserved code points that are holes in the valid range, and there are surrogates in the valid range; the reasons are endless for there to be no restriction other than the word size.
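
To put a number on the "some 1.1 million" figure, here is a minimal C sketch. It assumes the current Unicode limits (code space 0 to 0x10FFFF, surrogates 0xD800 to 0xDFFF); the function names are made up for illustration and are not part of any regex engine.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true if cp is a Unicode scalar value, i.e. inside
   the code space 0..0x10FFFF and outside the surrogate hole 0xD800..0xDFFF. */
bool is_scalar_value(uint32_t cp)
{
    if (cp > 0x10FFFFu)
        return false;
    return cp < 0xD800u || cp > 0xDFFFu;
}

/* Counts the scalar values by brute force over the whole code space. */
uint32_t count_scalar_values(void)
{
    uint32_t n = 0;
    for (uint32_t cp = 0; cp <= 0x10FFFFu; ++cp) {
        if (is_scalar_value(cp))
            ++n;
    }
    /* Cross-check: 0x110000 code points minus 0x800 surrogates. */
    assert(n == 0x110000u - 0x800u);
    return n;
}
```

The count comes out at 1,112,064, which is the full 0x110000 code points minus the 2,048 surrogates. A regex engine that only limits \x{...} by word size performs no check of this kind.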

For UTF-32, you can't go over 31 bits because bit 32 is the sign bit:

0x00000000 - 0x7FFFFFFF

That makes sense, since a signed 32-bit int is the natural size of 32-bit hardware registers.

For UTF-16 it's even clearer: you can see the same limitation masked to 16 bits. Bit 32 is still the sign bit, leaving 0x0000 - 0xFFFF as the valid range.

Usually, if you use an engine that supports ICU, you should be able to use these values, since ICU converts both the source and the regex into UTF-32. Boost Regex is one such engine.

edit:   

Regarding UTF-16  

I guess when Unicode outgrew 16 bits, they punched a hole in the 16-bit range for surrogate pairs. But that left only 20 usable bits in total between the pair: 10 bits in each surrogate, with the other 6 used to determine hi or lo.

It looks like this left the Unicode folks with a limit of 20 bits plus the extra 0xFFFF of the original 16-bit range, for a top code point of 0x10FFFF, with unusable holes.
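
The arithmetic behind 0x10FFFF can be spelled out in a few lines of C. This is only a sketch of the reasoning above; the function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Largest code point reachable through a surrogate pair: both 10-bit
   payloads set to all ones, plus the 0x10000 offset that skips the
   original 16-bit range. */
uint32_t max_surrogate_code_point(void)
{
    const uint32_t payload_mask = 0x3FFu;                              /* 10 bits */
    const uint32_t max_payload  = (payload_mask << 10) | payload_mask; /* 20 bits */

    assert(max_payload == 0xFFFFFu);
    return 0x10000u + max_payload;
}
```

The result is 0x10FFFF: 0xFFFFF from the 20 payload bits, plus 0x10000 because the pairs start counting where the 16-bit range ends.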

To be able to convert to a different encoding (8/16/32), all the code points must actually be convertible. Thus the forever backward-compatible 20-bit limit is the trap they ran into early, but now must live with.

Regardless, regex engines won't be enforcing this limit anytime soon, probably never.

As for surrogates, they are the hole, and a malformed literal surrogate can't be converted between modes. That only pertains to a literal encoded character during conversion, not to a hex representation of one. For instance, it's easy to search a text in UTF-16 (only) mode for unpaired surrogates, or even paired ones.
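
Outside of a regex engine, the same search for unpaired surrogates is a short loop. A minimal sketch, assuming the text is already available as an array of 16-bit code units; the function name is invented for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns the number of unpaired surrogates in a buffer of UTF-16 code units. */
size_t count_unpaired_surrogates(const uint16_t *s, size_t len)
{
    size_t unpaired = 0;

    assert(s != NULL || len == 0);
    for (size_t i = 0; i < len; ++i) {
        uint16_t u = s[i];
        if ((u & 0xFC00u) == 0xD800u) {          /* high surrogate */
            if (i + 1 < len && (s[i + 1] & 0xFC00u) == 0xDC00u)
                ++i;                             /* well-formed pair: skip the low half */
            else
                ++unpaired;                      /* high with no low after it */
        } else if ((u & 0xFC00u) == 0xDC00u) {
            ++unpaired;                          /* low with no high before it */
        }
    }
    return unpaired;
}
```

For the units 0xD844 0x0041 0xDCC1 0xD800 this reports three unpaired surrogates, while 0x0041 0xD844 0xDCC1 0x0042 reports none.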

But I guess regex engines don't really care about holes or limits; they only care about what mode the subject string is in. No, the engine is not going to say:

'Hey wait, the mode is UTF-16, I'd better convert \x{210C1} to \x{D844}\x{DCC1}. Wait, if I did that, what do I do if it's quantified, \x{210C1}+? Start injecting regex constructs around it? Worse yet, what if it's in a class, [\x{210C1}]? Nah, better limit it to \x{FFFF}.'
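
For reference, the conversion the engine declines to do is easy to check by hand. Below is a minimal C sketch of splitting a code point into its surrogate pair; the function name and the 0/1 return convention are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Splits a supplementary code point (0x10000..0x10FFFF) into a UTF-16
   surrogate pair. Returns 1 on success, 0 if cp is outside that range. */
int split_into_surrogates(uint32_t cp, uint16_t *hi, uint16_t *lo)
{
    assert(hi != NULL && lo != NULL);
    if (cp < 0x10000u || cp > 0x10FFFFu)
        return 0;
    cp -= 0x10000u;                              /* 20-bit payload */
    *hi = (uint16_t)(0xD800u + (cp >> 10));      /* top 10 bits */
    *lo = (uint16_t)(0xDC00u + (cp & 0x3FFu));   /* bottom 10 bits */
    return 1;
}
```

Calling it with 0x210C1 gives hi = 0xD844 and lo = 0xDCC1: the payload is 0x110C1, whose top 10 bits are 0x44 and bottom 10 bits are 0xC1.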

Some handy-dandy pseudo-code surrogate conversions I use:

 Definitions:
 ====================
 10-bits
  3FF = 000000  1111111111

 Hi Surrogate
 D800 = 110110  0000000000
 DBFF = 110110  1111111111 

 Lo Surrogate
 DC00 = 110111  0000000000
 DFFF = 110111  1111111111


 Conversions:
 ====================
 UTF-16 Surrogates to UTF-32
 if ( TESTFOR_SURROGATE_PAIR(hi,lo) )
 {
    u32Out = 0x10000 + (  ((hi & 0x3FF) << 10) | (lo & 0x3FF)  );
 }

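 UTF-32 to UTF-16 Surrogates
 if ( u32In >= 0x10000 && u32In <= 0x10FFFF )   // only this range maps to a pair
 {
    u32In -= 0x10000;                           // 20-bit payload
    hi = (0xD800 + ((u32In & 0xFFC00) >> 10));  // top 10 bits
    lo = (0xDC00 + (u32In & 0x3FF));            // bottom 10 bits
 }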

 Macros:
 ====================
 #define TESTFOR_SURROGATE_HI(hs) ( ((hs) & 0xFC00) == 0xD800 )
 #define TESTFOR_SURROGATE_LO(ls) ( ((ls) & 0xFC00) == 0xDC00 )
 #define TESTFOR_SURROGATE_PAIR(hs,ls) ( (((hs) & 0xFC00) == 0xD800) && (((ls) & 0xFC00) == 0xDC00) )
 // Same tests through a pointer to UTF-16 code units
 #define PTR_TESTFOR_SURROGATE_HI(ptr) ( (*(ptr) & 0xFC00) == 0xD800 )
 #define PTR_TESTFOR_SURROGATE_LO(ptr) ( (*(ptr) & 0xFC00) == 0xDC00 )
 #define PTR_TESTFOR_SURROGATE_PAIR(ptr) ( ((*(ptr) & 0xFC00) == 0xD800) && ((*((ptr)+1) & 0xFC00) == 0xDC00) )
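
As a compilable counterpart to the pseudo-code, here is a hedged sketch that decodes a whole buffer of UTF-16 code units into UTF-32. The macro and function names are invented for this example, and passing unpaired surrogates through unchanged is an assumption of the sketch rather than something the conversions above prescribe.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define IS_SURROGATE_HI(u) ((((uint32_t)(u)) & 0xFC00u) == 0xD800u)
#define IS_SURROGATE_LO(u) ((((uint32_t)(u)) & 0xFC00u) == 0xDC00u)

/* Decodes len UTF-16 code units into UTF-32 and returns the number of
   values written. out must have room for len entries. Unpaired
   surrogates are copied through as-is (a choice made for this sketch). */
size_t utf16_to_utf32(const uint16_t *in, size_t len, uint32_t *out)
{
    size_t n = 0;

    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < len; ++i) {
        if (IS_SURROGATE_HI(in[i]) && i + 1 < len && IS_SURROGATE_LO(in[i + 1])) {
            uint32_t hi = in[i];
            uint32_t lo = in[i + 1];
            out[n++] = 0x10000u + (((hi & 0x3FFu) << 10) | (lo & 0x3FFu));
            ++i;                                 /* consumed the low half too */
        } else {
            out[n++] = in[i];
        }
    }
    return n;
}
```

Checking the length before looking at in[i + 1] is the one thing the pointer-based pair test cannot do on its own, so a caller of that style of macro has to guarantee a readable unit after the current one.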

    
             
                                                        
            
            
              
                