Without the u flag, the hex range that can be used is [\\x{00}-\\x{ff}], but with the u flag it goes up to a 4-byte value, \\x{7fffffff}.
I'm not sure about PHP, but there really is no governor on code points,
so it doesn't matter that there are only some 1.1 million valid ones.
That is subject to change at any time, but it's not really up to engines
to enforce that. There are reserved code points that are holes in the valid range,
there are surrogates in the valid range; the reasons are endless for there
to be no restriction other than the word size.
For UTF-32, you can't go over 31 bits because bit 32 is the sign bit.
0x00000000 - 0x7FFFFFFF
That makes sense, since a signed int
as a data type is the natural size of 32-bit hardware registers.
For UTF-16 it's even more true: you see the same limitation, masked to 16 bits.
A single code unit is only 16 bits wide, leaving 0x0000 - 0xFFFF
as a valid range.
Usually, if you use an engine that supports ICU, you should be able to use that instead;
it converts both the source and the regex into UTF-32. Boost Regex is one such engine.
edit:
Regarding UTF-16
I guess when Unicode outgrew 16 bits, they punched a hole in the 16-bit range for surrogate pairs. But that left only 20 usable bits in total between the pair:
10 bits in each surrogate, with the other 6 used to mark it as hi or lo.
Looks like this left the Unicode folks with a limit of 20 bits (0x100000 code points) on top of the original 16-bit range (0x10000), for a maximum code point of 0x10FFFF, with unusable holes.
To be able to convert to a different encoding (8/16/32), all the code points
must actually be convertible. Thus the forever backward-compatible 20 bits is
the trap they ran into early, but now must live with.
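The arithmetic behind that limit can be sketched in a few lines of C. This is only an illustration of the bit counting described above; the function names are made up for this example and don't come from any library.

```c
#include <assert.h>
#include <stdint.h>

/* Payload bits carried by each surrogate (the low 10 bits of the unit). */
enum { SURROGATE_PAYLOAD_BITS = 10 };

/* Code points reachable through a hi/lo surrogate pair: 2^20. */
uint32_t supplementary_count(void)
{
    return (uint32_t)1 << (2 * SURROGATE_PAYLOAD_BITS);    /* 0x100000 */
}

/* Code points a single 16-bit unit can express: 2^16. */
uint32_t bmp_count(void)
{
    return (uint32_t)1 << 16;                               /* 0x10000 */
}

/* Highest code point UTF-16 can express: total count minus one. */
uint32_t max_code_point(void)
{
    uint32_t total = bmp_count() + supplementary_count();   /* 0x110000 */
    assert(total == 0x110000u);
    return total - 1;                                       /* 0x10FFFF */
}
```

Note that the 0x110000 total still includes the 0x800 surrogate values themselves, which is one of the "holes" mentioned above.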
Regardless, regex engines won't be enforcing this limit anytime soon, probably never.
As for surrogates, they are the hole, and a malformed literal surrogate can't be converted between modes. That only pertains to a literal encoded character during conversion, not to a hex representation of one. For instance, it's easy to search a text in UTF-16 (only) mode for unpaired surrogates, or even paired ones.
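Outside of a regex, the same search for unpaired surrogates is a short loop over the 16-bit code units. A minimal sketch, assuming the text is already in a uint16_t buffer; find_unpaired_surrogate is a hypothetical helper name, not something from an existing library:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns the index of the first unpaired surrogate in a UTF-16 buffer,
 * or len if every surrogate is part of a well-formed hi/lo pair. */
size_t find_unpaired_surrogate(const uint16_t *s, size_t len)
{
    assert(s != NULL || len == 0);
    size_t i = 0;
    while (i < len) {
        uint16_t u = s[i];
        if ((u & 0xFC00) == 0xD800) {              /* hi surrogate */
            if (i + 1 < len && (s[i + 1] & 0xFC00) == 0xDC00) {
                i += 2;                            /* well-formed pair, skip both */
                continue;
            }
            return i;                              /* hi with no lo after it */
        }
        if ((u & 0xFC00) == 0xDC00)
            return i;                              /* lo with no hi before it */
        i++;
    }
    return len;
}
```

A return value equal to len means the buffer can be converted to another encoding without hitting a malformed surrogate.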
But I guess regex engines don't really care about holes or limits; they only care about what mode the subject string is in. No, the engine is not going to say:
'Hey wait, the mode is UTF-16, I'd better convert \x{210C1} to \x{D844}\x{DCC1}. Wait, if I did that, what do I do if it's quantified, \x{210C1}+, start injecting regex constructs around it? Worse yet, what if it's in a class, [\x{210C1}]? Nah... better limit it to \x{FFFF}.'
Some handy dandy, pseudo-code surrogate conversions I use:
Definitions:
====================
10-bits
3FF = 000000 1111111111
Hi Surrogate
D800 = 110110 0000000000
DBFF = 110110 1111111111
Lo Surrogate
DC00 = 110111 0000000000
DFFF = 110111 1111111111
Conversions:
====================
UTF-16 Surrogates to UTF-32
if ( TESTFOR_SURROGATE_PAIR(hi,lo) )
{
u32Out = 0x10000 + ( ((hi & 0x3FF) << 10) | (lo & 0x3FF) );
}
UTF-32 to UTF-16 Surrogates
if ( u32In >= 0x10000)
{
u32In -= 0x10000;
hi = (0xD800 + ((u32In & 0xFFC00) >> 10));
lo = (0xDC00 + (u32In & 0x3FF));
}
Macros:
====================
#define TESTFOR_SURROGATE_HI(hs) ( ((hs) & 0xFC00) == 0xD800 )
#define TESTFOR_SURROGATE_LO(ls) ( ((ls) & 0xFC00) == 0xDC00 )
#define TESTFOR_SURROGATE_PAIR(hs,ls) ( (((hs) & 0xFC00) == 0xD800) && (((ls) & 0xFC00) == 0xDC00) )
//
#define PTR_TESTFOR_SURROGATE_HI(ptr) ( (*(ptr) & 0xFC00) == 0xD800 )
#define PTR_TESTFOR_SURROGATE_LO(ptr) ( (*(ptr) & 0xFC00) == 0xDC00 )
#define PTR_TESTFOR_SURROGATE_PAIR(ptr) ( ((*(ptr) & 0xFC00) == 0xD800) && ((*((ptr)+1) & 0xFC00) == 0xDC00) )
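For anyone who wants to compile it, here is a rough, self-contained C version of the two conversions. It follows the same bit arithmetic as the pseudo-code; the function names and the assert-based range checks are additions for this sketch only.

```c
#include <assert.h>
#include <stdint.h>

/* Splits a supplementary code point (0x10000..0x10FFFF) into a surrogate pair. */
void utf32_to_surrogates(uint32_t cp, uint16_t *hi, uint16_t *lo)
{
    assert(cp >= 0x10000u && cp <= 0x10FFFFu);
    cp -= 0x10000u;                                /* now a 20-bit value */
    *hi = (uint16_t)(0xD800u + (cp >> 10));        /* top 10 bits */
    *lo = (uint16_t)(0xDC00u + (cp & 0x3FFu));     /* bottom 10 bits */
}

/* Joins a well-formed hi/lo surrogate pair back into one code point. */
uint32_t surrogates_to_utf32(uint16_t hi, uint16_t lo)
{
    assert((hi & 0xFC00u) == 0xD800u && (lo & 0xFC00u) == 0xDC00u);
    return 0x10000u + ((((uint32_t)hi & 0x3FFu) << 10) | ((uint32_t)lo & 0x3FFu));
}
```

Feeding it 0x210C1 gives hi = 0xD844 and lo = 0xDCC1, and joining that pair again gives back 0x210C1.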