How many bytes do we need to store an Arabic character?

Submitted by 吃可爱长大的小学妹 on 2019-12-09 07:05:31

Question


I'm a little confused about the storage needed to represent an Arabic character.

Please let me know if this is true:

  • in ISO/IEC 8859-6 encoding it takes 2 bytes (http://en.wikipedia.org/wiki/ISO/IEC_8859-6)
  • in UNICODE it takes 4 bytes (http://en.wikipedia.org/wiki/Arabic_Unicode)

What are the advantages of each encoding? When should we prefer one over the other?


Answer 1:


Well first, Unicode is not an encoding. It is a standard for assigning code points to every character in every language. These code points are integers; how many bytes they take up depends on the specific encoding. The most common Unicode encodings are UTF-8 and UTF-16.

To summarise:

  • ISO 8859-6 uses 1 byte for each Arabic character, but doesn't support "Arabic presentation forms", nor characters from any script other than ASCII (basic Latin) and Arabic.
  • UTF-8 uses 2 bytes for each Arabic character, and 3 bytes for "Arabic presentation forms".
  • UTF-16 uses 2 bytes for each Arabic character, including "Arabic presentation forms".

I will use two examples: 'ح' (U+062D) and 'ﻰ' (U+FEF0). Those numbers are hexadecimal codes representing the Unicode code point of each of those characters.

In ISO 8859-6, every supported character takes up just a single byte, since that encoding is dedicated to Arabic. For example, the character 'ح' (U+062D) is encoded as the single byte "CD", as you can see from the table on the Wikipedia article. The character 'ﻰ' (U+FEF0) is an "Arabic Presentation Form", and ISO 8859-6 does not include presentation forms, so you can't encode this character in that encoding at all.
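
To check this yourself, here is a minimal Python sketch. It assumes Python's built-in "iso8859_6" codec matches the Wikipedia table; the helper `to_hex` and the variable names are made up for illustration.

```python
def to_hex(data: bytes) -> str:
    """Format bytes as space-separated uppercase hex pairs."""
    return " ".join(f"{b:02X}" for b in data)

hah = "\u062D"                # 'ح', a basic Arabic letter
presentation_form = "\uFEF0"  # 'ﻰ', an Arabic presentation form

# A basic Arabic letter fits in one byte in ISO 8859-6.
iso_hah = hah.encode("iso8859_6")
print(to_hex(iso_hah), len(iso_hah))

# The presentation form has no slot in ISO 8859-6, so encoding fails.
try:
    presentation_form.encode("iso8859_6")
    presentation_form_ok = True
except UnicodeEncodeError:
    presentation_form_ok = False
print("presentation form encodable:", presentation_form_ok)
```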

There are two very common Unicode encodings which let you encode all characters: UTF-8 and UTF-16. They have slightly different strengths. UTF-8 uses 1 byte for ASCII characters, 2 or 3 bytes for the other characters of the Basic Multilingual Plane (including the Arabic blocks discussed here), and 4 bytes for everything else. UTF-16 uses 2 bytes for characters of the Basic Multilingual Plane, and 4 bytes for everything else. So if your text contains lots of ASCII, UTF-8 is more compact. For text dominated by characters above U+07FF, such as Chinese or Japanese, UTF-16 is more compact. For basic Arabic, the two take the same space.
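
As a rough illustration of that size trade-off, the sketch below compares encoded lengths for three sample strings. The samples (an ASCII word, a five-letter Arabic word, and three CJK characters) and the helper `encoded_sizes` are my own choices, not from the question, and UTF-16 is measured without a byte order mark.

```python
def encoded_sizes(text: str) -> dict:
    """Return the encoded length in bytes for UTF-8 and UTF-16 (no BOM)."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
    }

# Illustrative samples, written as escapes to avoid editor/bidi issues.
samples = {
    "ascii": "hello",
    "arabic": "\u0645\u0631\u062D\u0628\u0627",  # five basic Arabic letters
    "cjk": "\u65E5\u672C\u8A9E",                 # three CJK characters
}

sizes = {name: encoded_sizes(text) for name, text in samples.items()}
for name, result in sizes.items():
    print(name, result)
```

With these samples, ASCII is smaller in UTF-8, the CJK text is smaller in UTF-16, and the Arabic word comes out the same in both.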

In UTF-8, 'ح' (U+062D) is encoded as the 2-byte sequence "D8 AD", while 'ﻰ' (U+FEF0) is encoded as the 3-byte sequence "EF BB B0". Characters between U+0080 and U+07FF use 2 bytes, and characters between U+0800 and U+FFFF use 3 bytes. So all the basic Arabic and Arabic Supplement characters use 2 bytes, whereas the Arabic Presentation Forms use 3 bytes.
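
The range rule can be written down as a small function and compared against what Python's UTF-8 encoder actually produces. This is only a sketch: `utf8_length` is a hypothetical helper, and it ignores the surrogate range U+D800 to U+DFFF, which cannot be encoded on its own.

```python
def utf8_length(code_point: int) -> int:
    """Number of bytes UTF-8 needs for one code point (surrogates not handled)."""
    if code_point <= 0x7F:
        return 1
    if code_point <= 0x7FF:
        return 2
    if code_point <= 0xFFFF:
        return 3
    return 4

# Cross-check the rule against the real encoder for the two example characters
# and for the code points on either side of the 2-byte/3-byte boundary.
for cp in (0x062D, 0xFEF0, 0x07FF, 0x0800):
    encoded = chr(cp).encode("utf-8")
    print(f"U+{cp:04X}", encoded.hex(" ").upper() if hasattr(bytes, "hex") else encoded)
    assert len(encoded) == utf8_length(cp)
```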

In UTF-16, all of these Arabic characters are two bytes, but the byte order depends on endianness. In little-endian UTF-16 (UTF-16LE), 'ح' (U+062D) is encoded as the 2-byte sequence "2D 06", while 'ﻰ' (U+FEF0) is encoded as "F0 FE": the two bytes of the code point with the low byte first. In big-endian UTF-16 (UTF-16BE), the same characters are "06 2D" and "FE F0". Both are equally valid; a byte order mark (BOM) at the start of the text, or an explicit encoding label, tells the reader which one is in use.
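
A short Python sketch of the byte-order issue, using the explicit "utf-16-le" and "utf-16-be" codecs so the result does not depend on the machine it runs on. The variable names are illustrative only.

```python
import codecs

hah = "\u062D"  # 'ح'

# Same code point, two byte orders.
le_bytes = hah.encode("utf-16-le")
be_bytes = hah.encode("utf-16-be")
print(le_bytes.hex(), be_bytes.hex())

# Decoding with the wrong byte order silently yields a different character.
misread = le_bytes.decode("utf-16-be")
print(f"U+{ord(misread):04X}")

# With a BOM in front, the generic "utf-16" codec picks the right order
# and strips the BOM from the result.
with_bom = codecs.BOM_UTF16_BE + be_bytes
decoded = with_bom.decode("utf-16")
print(decoded == hah)
```

This is why UTF-16 data usually travels either with a BOM or with an explicit "LE"/"BE" label.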

In summary, I would usually recommend UTF-8, as it has no byte-order ambiguity and supports ASCII text very well. Arabic characters are 2 bytes in both UTF-8 and UTF-16 (unless you use "presentation forms", which take 3 bytes in UTF-8). You can use ISO 8859-6 if you are only using ASCII and Arabic characters, and nothing else, and that will save you some space, but it usually isn't worth it, as it will break as soon as some other characters come along. UTF-8 and UTF-16 support all characters in Unicode.




Answer 2:


There are several different Unicode encodings, and the amount of space used depends on which one you're using: http://unicode.org/faq/utf_bom.html



Source: https://stackoverflow.com/questions/4322191/how-many-bytes-do-we-need-to-store-an-arabic-character
