Why do those Thai characters display on the web page with a long tail?

余生长醉 submitted on 2019-12-02 18:57:19

There are two problems: one in the output system (the font renderer), which is not Thai-aware, and one in the input system that generated this text in the first place.

If you had done your homework, you would know that mai tho and maitaikhu (Unicode names) are what Unicode refers to as nonspacing marks (NSM). This means that the font renderer should not advance to the next character cell when displaying such a glyph.

To avoid the mess you see above, the Thai API Consortium (TAPIC) created the WTT 2.0 standard, which describes both how the font rendering algorithm should handle the Thai letter order it receives as input and how the input method should handle such characters when you attempt to type them.

Standardization and Implementations of Thai Language Overview

libthai includes both input and output methods.

thaicheck is a small program that can detect letter sequence problems and fix them.

By the way, you cannot have a sequence (word) of do dek, mai tho and maitaikhu; the input sequence is noise.

Bear in mind that some editors have broken input methods that allow typing multiple NSM that cannot be combined, while the output method renders only legal sequences. The result is an illegal input string that looks OK to the user on their own system.

The codes you mention are all in UTF-8, which is why each character needs 3 bytes. The respective Unicode code points are:

U+0E14 THAI CHARACTER DO DEK
U+0E49 THAI CHARACTER MAI THO
U+0E47 THAI CHARACTER MAITAIKHU
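To see the three-bytes-per-character claim concretely, here is a minimal sketch that encodes each of the three code points to UTF-8 and prints the bytes. `Utf8Demo` and `Utf8Hex` are illustrative names, not part of any library.

```csharp
using System;
using System.Text;

static class Utf8Demo
{
    // Hypothetical helper: hex dump of the UTF-8 encoding of a string.
    public static string Utf8Hex(string s) =>
        BitConverter.ToString(Encoding.UTF8.GetBytes(s));

    static void Main()
    {
        // DO DEK, MAI THO, MAITAIKHU
        foreach (char c in "\u0e14\u0e49\u0e47")
        {
            // Code points in U+0800..U+FFFF take three bytes in UTF-8.
            Console.WriteLine($"U+{(int)c:X4} -> {Utf8Hex(c.ToString())}");
        }
        // U+0E14 -> E0-B8-94
        // U+0E49 -> E0-B9-89
        // U+0E47 -> E0-B9-87
    }
}
```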

MAI THO and MAITAIKHU are both in the category Mark, Nonspacing (Mn), meaning that they are combined with the preceding code point in rendering. MAI THO also has its Canonical_Combining_Class set to 107 (MAITAIKHU has class 0).
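The categories can be checked from code. The sketch below uses `CharUnicodeInfo.GetUnicodeCategory` from .NET; `CategoryDemo` and `IsNonspacingMark` are illustrative names. It only reports the general category, not the combining class.

```csharp
using System;
using System.Globalization;

static class CategoryDemo
{
    // Hypothetical helper: true if the character is in category Mn.
    public static bool IsNonspacingMark(char c) =>
        CharUnicodeInfo.GetUnicodeCategory(c) == UnicodeCategory.NonSpacingMark;

    static void Main()
    {
        // DO DEK, MAI THO, MAITAIKHU
        foreach (char c in "\u0e14\u0e49\u0e47")
        {
            Console.WriteLine($"U+{(int)c:X4} {CharUnicodeInfo.GetUnicodeCategory(c)}");
        }
        // U+0E14 OtherLetter
        // U+0E49 NonSpacingMark
        // U+0E47 NonSpacingMark
    }
}
```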

Your example starts with a single character and adds lots of nonspacing marks on top of it.

Compare with this C# code:

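// The three code points from the question: a consonant followed by two nonspacing marks.
char DODEK = (char)0x0e14;
char MAITHO = (char)0x0e49;
char MAITAIKHU = (char)0x0e47;

string thai = new string(new char[] { DODEK, MAITHO, MAITAIKHU });
// Length counts UTF-16 code units; all three are BMP characters, so it equals the number of code points.
Console.WriteLine("number of code points: " + thai.Length);

// StringInfo groups a base character and its combining marks into one text element.
var si = new System.Globalization.StringInfo(thai);
Console.WriteLine("number of text elements: " + si.LengthInTextElements);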

Output:

number of code points: 3
number of text elements: 1

See also the .NET StringInfo class.

You are never supposed to combine hundreds of Unicode characters into one single graphical character, although the Unicode formats technically allow it; you usually combine no more than 2 or 3 characters.

In Thai, you have vowels and tone marks, which are displayed above the consonant character (sometimes vowels appear below, or even around, the consonant characters). It's a bit like accents over vowels in French (é, è...) or umlauts in German. It's not normal to have more than two such signs in Thai (or more than one in French or German). This means your input is illegal Thai text (maybe written to produce some funny graphical effect, like "ASCII art"). I'm not surprised that such illegal text is interpreted differently depending on the browser.
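A rough way to flag such input is to measure the longest run of consecutive nonspacing marks. This is only a sketch: `MarkRunDemo` and `LongestMarkRun` are illustrative names, the check is not a WTT 2.0 validator, and it assumes BMP text (it walks UTF-16 code units).

```csharp
using System;
using System.Globalization;

static class MarkRunDemo
{
    // Hypothetical helper: length of the longest run of consecutive
    // nonspacing marks (category Mn) in the string.
    public static int LongestMarkRun(string s)
    {
        int longest = 0, current = 0;
        foreach (char c in s)
        {
            bool isMark =
                CharUnicodeInfo.GetUnicodeCategory(c) == UnicodeCategory.NonSpacingMark;
            current = isMark ? current + 1 : 0;
            longest = Math.Max(longest, current);
        }
        return longest;
    }

    static void Main()
    {
        // DO DEK, MAI THO, SARA AA, NO NU: one tone mark on the first consonant.
        string normal = "\u0e14\u0e49\u0e32\u0e19";
        // DO DEK followed by forty copies of MAI THO.
        string noisy = "\u0e14" + new string('\u0e49', 40);

        Console.WriteLine(LongestMarkRun(normal)); // 1
        Console.WriteLine(LongestMarkRun(noisy));  // 40
    }
}
```

Following the rule of thumb above, a run longer than two would be a reasonable signal that the text is not ordinary Thai.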

Matas Vaitkevicius

What you have found are called combining characters or, as common folk call it, Zalgo.

It works because Unicode allows characters to be combined by adding diacritic marks after a base character.

Any system that uses Unicode will work with these characters.
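If you would rather tame such text than display it, one option is to cap the number of marks kept after each base character. The sketch below is an assumption-laden illustration: `ZalgoDemo` and `CapMarks` are made-up names, only category Mn is treated as a mark (enclosing and spacing marks are ignored), and it assumes BMP text.

```csharp
using System;
using System.Globalization;
using System.Text;

static class ZalgoDemo
{
    // Hypothetical helper: keep at most maxMarks nonspacing marks
    // after each base character and drop the rest.
    public static string CapMarks(string s, int maxMarks)
    {
        var sb = new StringBuilder();
        int run = 0;
        foreach (char c in s)
        {
            bool isMark =
                CharUnicodeInfo.GetUnicodeCategory(c) == UnicodeCategory.NonSpacingMark;
            run = isMark ? run + 1 : 0;
            if (!isMark || run <= maxMarks)
            {
                sb.Append(c);
            }
        }
        return sb.ToString();
    }

    static void Main()
    {
        // 'a' with thirty combining acute accents, then 'b'.
        string zalgo = "a" + new string('\u0301', 30) + "b";
        string cleaned = CapMarks(zalgo, 2);
        Console.WriteLine(cleaned.Length); // 4
    }
}
```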
