Delphi Unicode String Length in Bytes

Submitted by 折月煮酒 on 2019-12-03 12:05:12

My understanding of Delphi's UnicodeString type is that it's UTF-16 internally.

You are correct that Delphi's UnicodeString is UTF-16 encoded. This means that one 16-bit Char is wide enough to represent any code point in the Basic Multilingual Plane as exactly one element of the string.

But my general understanding of Unicode is that not all Unicode characters can be represented even in 2 bytes, and that some corner-case foreign characters will take 4 bytes.

However, you have a slight misconception here. The Length function does not inspect the characters; it simply returns the number of 16-bit WideChar elements, without taking surrogate pairs into account. This means that if you assign a single character from any of the Supplementary Planes to a UnicodeString, Length will return 2.

program Egyptian;

{$APPTYPE CONSOLE}

var
  S: UnicodeString;

begin
  S := #$1304E;        // a single code point, U+1304E, outside the BMP
  Writeln(Length(S));  // prints 2: the code point is stored as a surrogate pair
  Readln;
end.

Conclusion: the byte size of the string data is always Length(S) * SizeOf(Char), whether or not S contains characters that need a surrogate pair.
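To see why Length returns 2 for U+1304E, it may help to work through the UTF-16 surrogate arithmetic by hand. The following is a minimal sketch; ToSurrogates is a hypothetical helper written for illustration and is not part of the RTL.

```pascal
program SurrogateSketch;

{$APPTYPE CONSOLE}

uses
  SysUtils;  // System.SysUtils in versions that use unit scope names

// Hypothetical helper: splits a supplementary-plane code point
// (U+10000..U+10FFFF) into its UTF-16 high and low surrogates.
procedure ToSurrogates(CodePoint: Cardinal; out Hi, Lo: Word);
begin
  Dec(CodePoint, $10000);                       // leaves a 20-bit value
  Hi := Word($D800 + (CodePoint shr 10));       // top 10 bits
  Lo := Word($DC00 + (CodePoint and $3FF));     // bottom 10 bits
end;

var
  Hi, Lo: Word;

begin
  ToSurrogates($1304E, Hi, Lo);
  Writeln(IntToHex(Hi, 4), ' ', IntToHex(Lo, 4));  // D80C DC4E
end.
```

The two 16-bit values are the two Char elements that Length counts.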

Delphi's UnicodeString is encoded with UTF-16. UTF-16 is a variable-length encoding, just like UTF-8. In other words, a single Unicode code point may require multiple character elements to encode it. As a point of interest, the only fixed-length Unicode encoding is UTF-32. The UTF-16 encoding uses 16-bit character elements, hence the name.

In a Unicode Delphi, Char is an alias for WideChar which is a UTF-16 character element. And string is an alias for UnicodeString, which is an array of WideChar elements. The Length() function returns the number of elements in the array.

So, SizeOf(Char) is always 2 in a Unicode Delphi. Some Unicode code points are encoded with multiple character elements, or Chars. But Length() returns the number of character elements, not the number of code points. The character elements all have the same size. So

memorystream1.WriteBuffer(Pointer(rawHtml)^, Length(rawHtml) * SizeOf(Char));

is correct.
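The difference between character elements, bytes, and code points can be made concrete with a short sketch. CodePointCount below is a hypothetical helper, not an RTL function, and the code assumes 1-based string indexing as on the desktop compilers.

```pascal
program CountCodePoints;

{$APPTYPE CONSOLE}

// Hypothetical helper: counts code points by treating each valid
// high/low surrogate pair as a single unit.
function CodePointCount(const S: UnicodeString): Integer;
var
  I: Integer;
begin
  Result := 0;
  I := 1;  // assumes 1-based string indexing
  while I <= Length(S) do
  begin
    // A high surrogate followed by a low surrogate encodes one code point.
    if (Ord(S[I]) >= $D800) and (Ord(S[I]) <= $DBFF) and (I < Length(S)) and
       (Ord(S[I + 1]) >= $DC00) and (Ord(S[I + 1]) <= $DFFF) then
      Inc(I, 2)
    else
      Inc(I);
    Inc(Result);
  end;
end;

var
  S: UnicodeString;

begin
  S := 'A' + #$D80C + #$DC4E;         // 'A' followed by U+1304E
  Writeln(Length(S));                 // 3 character elements
  Writeln(Length(S) * SizeOf(Char));  // 6 bytes
  Writeln(CodePointCount(S));         // 2 code points
end.
```

Only the last figure depends on the content of the string; the byte size follows directly from Length.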

Others have explained how UnicodeString is encoded and how to calculate its byte length. I just want to mention that the RTL already has such a function - SysUtils.ByteLength():

memorystream1.WriteBuffer(PChar(rawHtml)^, ByteLength(rawHtml));
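As a rough check, the sketch below writes a string to a TMemoryStream with ByteLength and compares the result with the manual calculation. The variable names and sample text are made up for illustration, and the empty-string guard is a cautious choice rather than something the answers require.

```pascal
program ByteLengthDemo;

{$APPTYPE CONSOLE}

uses
  SysUtils, Classes;  // System.SysUtils, System.Classes with unit scope names

var
  Stream: TMemoryStream;
  RawHtml: string;

begin
  RawHtml := '<p>hello</p>';  // 12 Chars
  Stream := TMemoryStream.Create;
  try
    // Skip the write entirely when there is nothing to write.
    if RawHtml <> '' then
      Stream.WriteBuffer(PChar(RawHtml)^, ByteLength(RawHtml));
    Writeln(Stream.Size);  // 24
    // ByteLength agrees with the manual calculation.
    Writeln(ByteLength(RawHtml) = Length(RawHtml) * SizeOf(Char));  // TRUE
  finally
    Stream.Free;
  end;
end.
```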

What you are doing (multiplying by SizeOf(Char)) is correct.

What you refer to is that one Char does not always correspond to one code point (because of surrogate pairs, for example). But the string is stored as a sequence of fixed-size 16-bit UTF-16 code units, so it takes up exactly Length( Str ) * SizeOf( Char ) bytes.

Note that the Unicode encoding used in Delphi is the same one that all Windows API calls expect in their ....W variants.
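One way to illustrate this is to pass a Delphi string straight to a wide Windows function. The sketch below uses lstrlenW, chosen here only because it needs no window or user interaction; it is Windows-only and the sample string is made up.

```pascal
program WideApiSketch;

{$APPTYPE CONSOLE}

uses
  Windows;  // Winapi.Windows in versions that use unit scope names

var
  S: UnicodeString;

begin
  S := 'A' + #$D80C + #$DC4E;  // 'A' followed by U+1304E as a surrogate pair
  // PWideChar(S) points at the same null-terminated UTF-16 data that the
  // wide API reads, so no conversion is needed. lstrlenW counts 16-bit
  // elements, just as Length does.
  Writeln(lstrlenW(PWideChar(S)) = Length(S));  // TRUE
end.
```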
