Understanding Text Encoding (In .Net)

感情败类 2020-12-14 09:00

I have done very little with text encoding. Truthfully, I don't really even know exactly what it means.

For example, if I have something like:

D         


        
3 Answers
  •  佛祖请我去吃肉
    2020-12-14 09:27

    First and foremost: do not despair, you are not alone. Awareness of how character encoding and text representation work is unfortunately uncommon, but there is no better time to start learning than right now!

    In modern systems, including .NET, text strings are represented in memory by some encoding of Unicode code points. Code points are just numbers: the code point for the character A is 65, the code point for the copyright sign (©) is 169, and the code point for the Thai digit six (๖) is 3670.
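    To make the "just numbers" point concrete, here is a minimal sketch that reads the three code points mentioned above back out of a .NET string. The sample string and variable names are my own, and the snippet assumes C# 9 or later so that top-level statements compile.

```csharp
using System;

// Sample string chosen for illustration: 'A', the copyright sign, Thai digit six.
string text = "A\u00A9\u0E56";

// char.ConvertToUtf32 returns the Unicode code point found at the given index.
int codeA = char.ConvertToUtf32(text, 0);
int codeCopyright = char.ConvertToUtf32(text, 1);
int codeThaiSix = char.ConvertToUtf32(text, 2);

Console.WriteLine($"{codeA} {codeCopyright} {codeThaiSix}"); // 65 169 3670
```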

    The term "encoding" refers to how these numbers are represented as bytes in memory. A number of standard encodings exist so that text keeps the same meaning as data is transmitted from one system to another.
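    As a rough illustration of "same code points, different bytes", the sketch below encodes one short string with three of the encodings that ship in System.Text. The sample string is an assumption of mine; the byte values in the comments follow from the encoding rules, not from anything specific to the question.

```csharp
using System;
using System.Text;

// 'A' followed by the copyright sign (code points 65 and 169).
string text = "A\u00A9";

// GetBytes does not prepend a byte order mark.
byte[] utf8 = Encoding.UTF8.GetBytes(text);
byte[] utf16 = Encoding.Unicode.GetBytes(text); // Encoding.Unicode is UTF-16, little-endian
byte[] utf32 = Encoding.UTF32.GetBytes(text);   // Encoding.UTF32 is little-endian

Console.WriteLine(BitConverter.ToString(utf8));  // 41-C2-A9
Console.WriteLine(BitConverter.ToString(utf16)); // 41-00-A9-00
Console.WriteLine(BitConverter.ToString(utf32)); // 41-00-00-00-A9-00-00-00
```

    The receiving system has to decode with the same encoding that the sender used, otherwise the bytes map back to different characters.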

    A simple encoding standard is UCS-2, in which each code point is stored directly as a single 16-bit word. Its limitation is that it can only represent code points 0000-FFFF, a range that does not cover the full breadth of Unicode code points.

    UTF-16 is the encoding used internally by the .NET String class. Most characters fit into a single 16-bit word here, but code points larger than FFFF are encoded using surrogate pairs, that is, two 16-bit words (see the Wiki). Because the values D800-DFFF are reserved for those surrogates, code points D800-DFFF cannot be encoded by UTF-16.
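    Here is a small sketch of what a surrogate pair looks like from inside a .NET string. The code point U+1F600 is only an example I picked because it lies above FFFF; the arithmetic in the middle is the standard UTF-16 surrogate calculation, written out by hand for comparison with what the runtime stores.

```csharp
using System;

// Build a string holding one code point above FFFF (example value).
string emoji = char.ConvertFromUtf32(0x1F600);

int length = emoji.Length;                     // counts 16-bit words, not code points
bool isPair = char.IsSurrogatePair(emoji, 0);
int high = emoji[0];                           // first 16-bit word (high surrogate)
int low = emoji[1];                            // second 16-bit word (low surrogate)

// Manual calculation: subtract 0x10000, then split the remaining 20 bits in half.
int offset = 0x1F600 - 0x10000;
int expectedHigh = 0xD800 + (offset >> 10);
int expectedLow = 0xDC00 + (offset & 0x3FF);

// Recombine the pair into the original code point.
int codePoint = char.ConvertToUtf32(emoji, 0);

Console.WriteLine($"{length} {isPair} {high:X4} {low:X4} {codePoint:X}");
// 2 True D83D DE00 1F600
```

    One practical consequence is that String.Length and the indexer work in 16-bit words, so a single visible character can occupy two positions.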

    UTF-8 is perhaps the most popular encoding used today, for a number of reasons which are outlined in the Wiki article.
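    One of those reasons is that UTF-8 is variable-length and leaves plain ASCII untouched. The sketch below, using example code points of my own choosing, prints how many bytes each one takes and then checks that decoding returns the original string.

```csharp
using System;
using System.Text;

// Example code points: 'A', copyright sign, Thai digit six, and one above FFFF.
int[] codePoints = { 0x41, 0xA9, 0x0E56, 0x1F600 };
int[] byteCounts = new int[codePoints.Length];

for (int i = 0; i < codePoints.Length; i++)
{
    string s = char.ConvertFromUtf32(codePoints[i]);
    byte[] bytes = Encoding.UTF8.GetBytes(s);
    byteCounts[i] = bytes.Length;
    Console.WriteLine($"U+{codePoints[i]:X4}: {bytes.Length} byte(s) -> {BitConverter.ToString(bytes)}");
}
// U+0041: 1 byte(s) -> 41
// U+00A9: 2 byte(s) -> C2-A9
// U+0E56: 3 byte(s) -> E0-B9-96
// U+1F600: 4 byte(s) -> F0-9F-98-80

// Decoding with the same encoding restores the original text.
string original = "A\u00A9\u0E56";
string roundTrip = Encoding.UTF8.GetString(Encoding.UTF8.GetBytes(original));
Console.WriteLine(roundTrip == original); // True
```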
