Parsing through Arabic / RTL text from left to right

浪尽此生 submitted on 2019-11-27 05:20:29

As your string currently stands, the word لطيفة is stored prior to the word اليوم; the fact that اليوم is displayed "first" (that is, further to the left), is just a (correct) result of the Unicode Bidirectional Algorithm in displaying the text.

That is: the string you start with ("Test:لطيفة;اليوم;a;b") is the result of the user entering "Test:", then لطيفة, then ";", then اليوم, and then ";a;b". Thus, the way C# is splitting it does in fact mirror the way that the string is created. It's just that the way it is created is not reflected in the display of the string, because the two consecutive Arabic words are treated as a single unit when they are displayed.
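A minimal sketch of how to check the storage order for yourself. The class and method names (`StorageOrderDemo`, `CodePoints`) are made up for illustration, and the Arabic words are written as `\u` escapes so that the source code shows storage order no matter how an editor lays the text out:

```csharp
using System;
using System.Linq;

static class StorageOrderDemo
{
    // Returns the UTF-16 code units of s in storage order, e.g. "U+0054 U+0065".
    public static string CodePoints(string s)
    {
        return string.Join(" ", s.Select(c => "U+" + ((int)c).ToString("X4")));
    }

    public static void Main()
    {
        // The two words from the question, spelled out as escapes.
        string first = "\u0644\u0637\u064A\u0641\u0629";  // the word typed first
        string second = "\u0627\u0644\u064A\u0648\u0645"; // the word typed second
        string s = "Test:" + first + ";" + second + ";a;b";

        string[] spl = s.Split(';');
        Console.WriteLine(spl[0] == "Test:" + first); // True
        Console.WriteLine(spl[1] == second);          // True
        Console.WriteLine(CodePoints(s));
    }
}
```

The dump printed by `CodePoints` lists the characters in the order they were typed, which is the order `Split` works with.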

If you'd like a string to display Arabic words in left-to-right order with semicolons in between, while also storing the words in that same order, then you should put a Left-to-Right mark (U+200E) after the semicolon. This will effectively section off each Arabic word as its own unit, and the Bidirectional Algorithm will then treat each word separately.

For instance, the following code begins with a string that displays the same way as the one you use (with the addition of a single Left-to-Right mark), yet it splits the way you are expecting: spl[0] is "Test:" followed by the word shown on the left (اليوم), and spl[1] is the other word (لطيفة), preceded by the invisible mark:

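class Program {
    static void Main(string[] args) {
        // \u200E is the Left-to-Right mark, placed right after the first semicolon.
        string s = "Test:اليوم;\u200Eلطيفة;a;b";
        // Split works on storage order, so spl[1] still starts with the mark.
        string[] spl = s.Split(';');
    }
}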
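If you go this route, the mark stays in the data: it is invisible, but it is still a character, and `Trim()` with no arguments will not remove it because it is not whitespace. Here is a small sketch of two helpers, one that adds the mark after every separator for display and one that strips it again after splitting. `LrmHelper`, `MarkForDisplay` and `SplitAndStrip` are hypothetical names, and the display effect assumes the text is laid out in a left-to-right paragraph:

```csharp
using System;
using System.Linq;

static class LrmHelper
{
    const char Lrm = '\u200E'; // LEFT-TO-RIGHT MARK

    // Adds a mark after every separator so each field is laid out on its own.
    public static string MarkForDisplay(string s, char sep = ';')
    {
        return s.Replace(sep.ToString(), sep.ToString() + Lrm);
    }

    // Splits on the separator and removes leading/trailing marks from each
    // field, so the fields compare equal to the words as originally typed.
    public static string[] SplitAndStrip(string s, char sep = ';')
    {
        return s.Split(sep).Select(part => part.Trim(Lrm)).ToArray();
    }
}
```

For example, `LrmHelper.SplitAndStrip(LrmHelper.MarkForDisplay("x;y"))` gives back the two fields "x" and "y" with no marks attached.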

You can also use Microsoft's Uniscribe library. Its ScriptItemize function breaks the string into runs of characters and gives you each run's start index in the original string and whether it is RTL. Using this information you can find consecutive runs that contain only Arabic. Splitting those on ';' and reversing the order of the pieces will give you what you need.
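The sketch below illustrates the same idea without calling Uniscribe at all. It is a managed approximation that swaps ScriptItemize for a regular expression, and it assumes that "Arabic" means the basic Arabic block U+0600 to U+06FF and that the separator is always ';'. `VisualOrder` and `SplitInDisplayOrder` are made-up names:

```csharp
using System;
using System.Text.RegularExpressions;

static class VisualOrder
{
    // Two or more Arabic-only words joined by semicolons. Such a stretch is
    // displayed right to left as a single unit.
    static readonly Regex ArabicRun =
        new Regex(@"[\u0600-\u06FF]+(?:;[\u0600-\u06FF]+)+");

    // Returns the fields in the order they appear on screen, left to right.
    public static string[] SplitInDisplayOrder(string s)
    {
        string reordered = ArabicRun.Replace(s, m =>
        {
            // Reverse the order of the words inside each Arabic stretch.
            string[] words = m.Value.Split(';');
            Array.Reverse(words);
            return string.Join(";", words);
        });
        return reordered.Split(';');
    }
}
```

Applied to the string from the question, this puts the word that is displayed on the left into the first field. It does not handle digits, spaces, or other neutral characters inside Arabic text, which is where a real itemizer earns its keep.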

The strings are not reversed; they are actually split in the correct order. RTL languages are RTL when displayed, but in memory they are kept "left to right", just like English. I'll try to demonstrate, which is a bit hard since I don't have an Arabic keyboard installed.

Say your string, written out in Latin letters, is s = "Arbi/Arbi, Alarbia". s[0] is A (the Arabic A'in), s[1] is R, and so forth; s[4] is "/" and s[9] is ",". So when splitting on the comma, you get s[0] through s[8] in the first part and everything from s[10] onward in the second.

This is the correct way of handling RTL strings. If you want the reverse, you need to reverse the array yourself.
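Reversing the array is a one-liner with `Array.Reverse`. A minimal sketch, with the hypothetical helper name `SplitReversed`:

```csharp
using System;

static class ReverseFields
{
    // Splits in storage order, then flips the array so that the field stored
    // last comes first.
    public static string[] SplitReversed(string s, char separator)
    {
        string[] parts = s.Split(separator);
        Array.Reverse(parts);
        return parts;
    }
}
```

With the stand-in string from above, `ReverseFields.SplitReversed("Arbi/Arbi, Alarbia", ',')` returns " Alarbia" followed by "Arbi/Arbi". Note that this flips every field, including any English ones, so it only fits strings that are entirely RTL.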

Keep in mind that switching between RTL and LTR is one of the most frustrating tasks out there. You have no idea how long you'll spend figuring out what to do with numbers or English words inside RTL strings. The best thing you can do is to avoid the problem altogether, and just try to get Excel to show the strings as RTL.

According to Reflector, Split internally uses Substring, which in turn uses an internal function that just copies characters in storage order without any consideration of culture. Because of that, I don't see any way around just reversing the array that Split returns.
