What character encoding should I use for a web page containing mostly Arabic text? Is utf-8 okay?

孤独总比滥情好 2020-11-28 13:55

What character encoding should I use for a web page containing mostly Arabic text?

Is utf-8 okay?

5 answers
  •  萌比男神i
    2020-11-28 14:06

    UTF-8 is fine, yes. It can encode any code point in the Unicode standard.


    Edited to add

    To make the answer more complete, your realistic choices are:

    • UTF-8
    • UTF-16
    • UTF-32

    Each comes with tradeoffs and advantages.

    UTF-8

    As Joe Gauterin points out, UTF-8 is very efficient for European texts but gets less efficient the "farther" from the Latin alphabet you get. For Arabic the cost is smaller than you might expect: letters in the main Arabic block (U+0600 to U+06FF) take 2 bytes each in UTF-8, the same as in UTF-16, while spaces, digits, ASCII punctuation and HTML markup take 1 byte each, so a typical Arabic web page is no larger in UTF-8 and is often smaller. Only the Arabic presentation forms (U+FB50 to U+FDFF and U+FE70 to U+FEFF) take 3 bytes. Size is rarely a problem in practice anyway, in these days of cheap and plentiful RAM, unless you have a lot of text to deal with. More of a problem is that the variable length of the encoding makes some string operations difficult and slow. For example, you can't jump straight to the fifth character of an Arabic string, because some characters are 1 byte long (ASCII punctuation, say) while others are 2 or 3. Processing the raw bytes directly is therefore slower and more error-prone.
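To make the indexing cost concrete, here is a minimal Python sketch that finds the n-th code point in raw UTF-8 bytes by scanning from the start. The helper names `utf8_char_width` and `nth_code_point` and the sample string are made up for illustration, and the sketch assumes well-formed UTF-8 input.

```python
def utf8_char_width(lead_byte: int) -> int:
    """Length in bytes of the UTF-8 sequence that starts with lead_byte."""
    if lead_byte < 0x80:            # 0xxxxxxx: ASCII
        return 1
    if lead_byte >> 5 == 0b110:     # 110xxxxx: 2 bytes (Arabic letters land here)
        return 2
    if lead_byte >> 4 == 0b1110:    # 1110xxxx: 3 bytes
        return 3
    if lead_byte >> 3 == 0b11110:   # 11110xxx: 4 bytes
        return 4
    raise ValueError("not a UTF-8 lead byte")


def nth_code_point(data: bytes, n: int) -> str:
    """Return the n-th (0-based) code point; cost grows with n."""
    offset = 0
    for _ in range(n):
        # Every earlier character must be inspected to learn its width.
        offset += utf8_char_width(data[offset])
    width = utf8_char_width(data[offset])
    return data[offset:offset + width].decode("utf-8")


# Four Arabic letters (seen, lam, alef, meem) followed by ASCII text.
SAMPLE = "\u0633\u0644\u0627\u0645, world".encode("utf-8")
print(nth_code_point(SAMPLE, 4))  # the fifth character: ","
```

In Python itself, a decoded `str` is already indexed by code point, so the scan above only matters when you work on the encoded bytes directly.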

    On the other hand, UTF-8 is likely your best choice if you're doing a lot of mixed European/Arabic text. The more European text in your documents, the better the UTF-8 choice will be.
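Rather than guessing at sizes, it is easy to measure them. The following is a small Python sketch, with a hypothetical helper `encoded_sizes` and two made-up sample strings, that counts the bytes each encoding needs (without a byte order mark).

```python
# "marhaban bil-alam": 12 Arabic letters plus one ASCII space.
ARABIC = "\u0645\u0631\u062d\u0628\u0627 \u0628\u0627\u0644\u0639\u0627\u0644\u0645"
# Seven ASCII characters followed by five Arabic letters.
MIXED = "Hello, \u0645\u0631\u062d\u0628\u0627"


def encoded_sizes(text: str) -> dict:
    """Byte counts for text in UTF-8, UTF-16 and UTF-32, no BOM."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
        "utf-32": len(text.encode("utf-32-le")),
    }


print(encoded_sizes(ARABIC))  # {'utf-8': 25, 'utf-16': 26, 'utf-32': 52}
print(encoded_sizes(MIXED))   # {'utf-8': 17, 'utf-16': 24, 'utf-32': 48}
```

Running the same function over a representative sample of your own pages, markup included, is a more reliable guide than any rule of thumb.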

    UTF-16

    UTF-16 uses 2 bytes for each Arabic letter, the same as UTF-8, so it gives no real space saving for predominantly Arabic text; it only wins for scripts that need 3 bytes per character in UTF-8, such as Chinese. The Arabic code points used for ordinary text are all in the Basic Multilingual Plane, so each one fits in a single 16-bit unit and you don't hit the variable-length case. If you do have characters outside that plane (emoji, for example), each is encoded as two 16-bit units, and all the string processing problems of UTF-8 apply here as well. If not, no problems.

    On the other hand, if you have mixed European and Arabic texts, UTF-16 will be less space-efficient. Also, if you find yourself expanding to other scripts like, say, Chinese, you may go back to variable-length forms and the associated problems: the common Chinese characters fit in one 16-bit unit, but the rarer ones need two.
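Whether UTF-16 is variable-length for a given text comes down to whether any code point lies above U+FFFF. Here is a minimal Python sketch of that check; the function names `needs_surrogates` and `utf16_units` are illustrative, not part of any library.

```python
def needs_surrogates(text: str) -> bool:
    """True if any code point is outside the Basic Multilingual Plane."""
    return any(ord(ch) > 0xFFFF for ch in text)


def utf16_units(text: str) -> int:
    """Number of 16-bit code units the text occupies in UTF-16."""
    return len(text.encode("utf-16-le")) // 2


arabic = "\u0645\u0631\u062d\u0628\u0627"   # five Arabic letters
rare_cjk = "\U00020000"                     # a CJK Extension B ideograph

print(needs_surrogates(arabic), utf16_units(arabic))      # False 5
print(needs_surrogates(rare_cjk), utf16_units(rare_cjk))  # True 2
```

If `needs_surrogates` is false for all the text you handle, each code point occupies exactly one 16-bit unit and indexing by unit is safe.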

    UTF-32

    UTF-32 will basically double your space requirements compared with UTF-16. On the other hand, it's constant-sized for every code point in all known (and, likely, unknown ;) scripts. For raw string processing it's your fastest, best option without the problems that variable-length encoding will cause you, although a character as the reader sees it can still span several code points (an Arabic letter plus a vowel mark, for instance). (This presupposes you have a string library that knows about 32-bit characters, naturally.)
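With a fixed width of 4 bytes, the n-th code point sits at byte offset 4 × n, so no scanning is needed. A short Python sketch, using a hypothetical helper `nth_code_point_utf32` and a made-up sample:

```python
def nth_code_point_utf32(data: bytes, n: int) -> str:
    """Return the n-th (0-based) code point of little-endian UTF-32 bytes."""
    start = 4 * n
    if n < 0 or start + 4 > len(data):
        raise IndexError("code point index out of range")
    # Direct slice: the cost does not depend on n.
    return data[start:start + 4].decode("utf-32-le")


# Four Arabic letters followed by ASCII text, 11 code points in total.
SAMPLE = "\u0633\u0644\u0627\u0645, world".encode("utf-32-le")
print(len(SAMPLE))                      # 44
print(nth_code_point_utf32(SAMPLE, 4))  # ","

# Fixed width applies to code points, not to what a reader sees as one
# character: beh followed by the combining mark fatha is two code points.
print(len("\u0628\u064e"))              # 2
```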

    Recommendation

    My own recommendation is that you use UTF-8 as your external format (because everybody supports it) for storage, transmission, etc. unless you really see a size benefit with UTF-16. So any time you read a string from the outside world it would be UTF-8, and any time you send one to the outside world it, too, would be UTF-8. Within your software, though, unless you're in the habit of manipulating massive strings (in which case I'd recommend different data structures anyway!), I'd recommend using UTF-16 or UTF-32 instead (depending on whether there are any variable-length encoding issues in your UTF-16 data) for the speed, efficiency and simplicity of code.
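The "UTF-8 outside, fixed-width inside" pattern can be sketched in Python as follows. The function names `from_wire`, `to_wire` and `truncate` are hypothetical, and Python's `str` (indexed by code point) stands in for the fixed-width internal form.

```python
def from_wire(raw: bytes) -> str:
    """Decode external UTF-8 bytes; raises UnicodeDecodeError if invalid."""
    return raw.decode("utf-8")


def to_wire(text: str) -> bytes:
    """Encode internal text as UTF-8 for storage or transmission."""
    return text.encode("utf-8")


def truncate(raw: bytes, limit: int) -> bytes:
    """Keep the first `limit` code points of a UTF-8 payload."""
    text = from_wire(raw)       # decode once at the boundary
    shortened = text[:limit]    # slicing works on code points, not bytes
    return to_wire(shortened)   # encode once on the way out


payload = "\u0633\u0644\u0627\u0645, world".encode("utf-8")
print(truncate(payload, 4) == "\u0633\u0644\u0627\u0645".encode("utf-8"))  # True
```

Slicing the raw bytes at offset 4 instead would keep only two Arabic letters, and an odd offset would cut one in half, which is why the decoding happens first.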
