Arabic: 'source' Unicode to final display Unicode

吃可爱长大的小学妹 提交于 2019-12-03 09:14:18

The joining and RTL conversion don't happen at the level of Unicode characters.

In other words: the order of the characters and the actual unicode codepoints are not changed during this process.

In fact, the merging and handling RTL/LTR transitions is handled by the text rendering engine.

This quote from the Wikipedia article on the Arabic alphabet explains it quite nicely:

Finally, the Unicode encoding of Arabic is in logical order, that is, the characters are entered, and stored in computer memory, in the order that they are written and pronounced without worrying about the direction in which they will be displayed on paper or on the screen. Again, it is left to the rendering engine to present the characters in the correct direction, using Unicode's bi-directional text features. In this regard, if the Arabic words on this page are written left to right, it is an indication that the Unicode rendering engine used to display them is out-of-date.

The processing you're looking for is called ligature. Unlike many latin-based languages, where you can simply put one character after another to render the text, ligatures are fundamental in arabic. The substitution is done in the text rendering engine, and the ligature infos are generally stored in font files.

note how they are NOT the same characters

They are the same for an Arabic reader. It is still readable. There is no transform to do on your Unicode16 source text. You must provide the whole string to your text renderer. In C/C++, and as you are going the platform independent way, you can use Pango for rendering.

Note : Perhaps you wanted to write لعبة جديدة (i.e. new game) ? Because what you give as an example has no meaning in Arabic.

I realise this is an old question, but what you're looking for is FriBidi, the GNU implementation of the Unicode bidirectional algorithm.

This program does the glyph selection that was asked about in the question, as well as handling bidirectional text (mixture of right-to-left and left-to-right text).

What you are looking for is an Arabic script synthesis algorithm. I'm not aware one exists as open source. If you arrive at one please post.

Some points:

At the storage level, there is no Unicode transform. There is an abstract representation of the string as pointed out by other answers.

At the rendering level, you could choose to use Unicode Presentation Forms, but you could also choose to use other forms. Unicode Presentation Forms are not a standard for what presentation output encoding should be - rather they are just one example of presentation codes that can be output by the rendering engine using script synthesis.

To make it clearer: There wouldn't be a single standard transform (ie synthesis algorithm) that would transform from A to B, where A is standard Unicode Arabic page, and B is standard Unicode Arabic Presentation Forms. Rather, there would be different transformations that can vary in complexity and can have different encoding systems for B, but one of the encodings that can be used for B is the Unicode Presentation Forms. For example, a simple typewriter style would require a simple rendering algorithm that would not require Presentation Forms. Indeed there does exist modern writing styles (not in common usage though) where A and B are actually identical, only that a different font page would be used to do the rendering. On the other hand, the transform to render typesetting or traditional calligraphic forms would be more complex and require something similar to the Unicode Presentation Forms.

Here are a couple of pointers for more information on the topic:

标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!