Parsing through Arabic / RTL text from left to right

Let's say I have a string in an RTL language such as Arabic with some English chucked in:

string s = "Test:لطيفة;اليوم;a;b";

Notice there are semicolons in the string. When I split it with string[] spl = s.Split(';');, some of the strings are saved in reverse order. This is what happens:

‏‏‏‏‏spl[0] = "‏Test:لطيفة"
spl[1] = "اليوم"
spl[2] = "a"
spl[3] = "b"

The above is out of order compared to the original. Instead, I expect to get this:

‏‏spl[0] = ‏"Test:اليوم"
spl[1] = "لطيفة"
spl[2] = "a"
spl[3] = "b"

I'm prepared to write my own split function. However, the chars in the string also parse in reverse order, so I'm back to square one. I just want to go through each character as it's shown on the screen.

As your string currently stands, the word لطيفة is stored prior to the word اليوم; the fact that اليوم is displayed "first" (that is, further to the left), is just a (correct) result of the Unicode Bidirectional Algorithm in displaying the text.

That is: the string you start with ("Test:لطيفة;اليوم;a;b") is the result of the user entering "Test:", then لطيفة, then ";", then اليوم, and then ";a;b". Thus, the way C# is splitting it does in fact mirror the way that the string is created. It's just that the way it is created is not reflected in the display of the string, because the two consecutive Arabic words are treated as a single unit when they are displayed.
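To check the stored (logical) order yourself, you can print each code unit next to its index. This is only a minimal sketch: the class and the Dump helper are made-up names, and the Arabic words are written as \u escapes so the source reads the same however an editor chooses to render it.

```csharp
using System;
using System.Linq;

static class LogicalOrderDemo
{
    // Hypothetical helper: formats each UTF-16 code unit as "index:U+XXXX".
    public static string[] Dump(string s)
    {
        return s.Select((c, i) => $"{i}:U+{(int)c:X4}").ToArray();
    }

    public static void Main()
    {
        // "Test:" + لطيفة + ";" + اليوم + ";a;b", in the order it was typed.
        string s = "Test:\u0644\u0637\u064A\u0641\u0629;\u0627\u0644\u064A\u0648\u0645;a;b";

        foreach (string entry in Dump(s))
            Console.WriteLine(entry);

        // Index 5 holds U+0644, the first letter of لطيفة, and index 11 holds
        // U+0627, the first letter of اليوم, even though اليوم is drawn further left.
    }
}
```

Indexing with s[i] or a foreach over the string always walks this stored order, never the on-screen order.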

If you'd like a string to display Arabic words in left-to-right order with semicolons in between, while also storing the words in that same order, then you should put a Left-to-Right mark (U+200E) after the semicolon. This will effectively section off each Arabic word as its own unit, and the Bidirectional Algorithm will then treat each word separately.

For instance, the following code begins with a string identical to the one you use (with the addition of a single Left-to-Right mark), yet it will split it up according to the way that you are expecting it to (that is, spl[0] = ‏"Test:اليوم", and spl[1] = "‏لطيفة"):

static void Main(string[] args) {
    string s = "Test:اليوم;\u200Eلطيفة;a;b";
    string[] spl = s.Split(';');
}
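One detail to watch with the code above: after the split, spl[1] still starts with the U+200E character itself. A possible way to handle that, sketched with two made-up helpers (JoinForDisplay and SplitFields are illustrative names, not framework methods), is to add the mark when building the string and trim it again after splitting:

```csharp
using System;
using System.Linq;

static class LrmSplitDemo
{
    const char Lrm = '\u200E'; // LEFT-TO-RIGHT MARK

    // Hypothetical helper: joins fields with ";" followed by an LRM,
    // so every field is displayed as its own unit.
    public static string JoinForDisplay(params string[] fields)
    {
        return string.Join(";" + Lrm, fields);
    }

    // Hypothetical helper: splits on ';' and removes the leading mark from each field.
    public static string[] SplitFields(string s)
    {
        return s.Split(';').Select(f => f.TrimStart(Lrm)).ToArray();
    }

    public static void Main()
    {
        string s = JoinForDisplay(
            "Test:\u0627\u0644\u064A\u0648\u0645", // "Test:" + اليوم
            "\u0644\u0637\u064A\u0641\u0629",      // لطيفة
            "a",
            "b");

        string[] spl = SplitFields(s);
        Console.WriteLine(spl.Length); // 4
    }
}
```

This sketch puts a mark after every semicolon, including the ones before the Latin fields, where it has no visible effect.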

You can also use Microsoft's Uniscribe library. Its ScriptItemize function breaks the string into runs (items) and gives you each run's start index in the original string along with its direction (RTL or LTR). Using this information you can find consecutive runs that contain only Arabic. Splitting them on ';' and reversing the order of the pieces will give you what you need.
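The same idea can be sketched without calling Uniscribe at all. The code below is not Uniscribe: it swaps in a regular expression as a rough stand-in for run detection, and it assumes that "Arabic" means the basic block U+0600..U+06FF, that the surrounding text is left-to-right, and that ';' is the only neutral character between Arabic words. SplitAsDisplayed is a made-up name.

```csharp
using System;
using System.Text.RegularExpressions;

static class VisualOrderSplit
{
    // Assumption: an "Arabic run" is two or more Arabic words separated only by
    // semicolons. Real bidi analysis handles many more cases (spaces, digits, marks).
    static readonly Regex ArabicRun =
        new Regex(@"[\u0600-\u06FF]+(?:;+[\u0600-\u06FF]+)+");

    // Hypothetical helper: reverses the field order inside each Arabic run,
    // then splits the whole string on ';'.
    public static string[] SplitAsDisplayed(string s)
    {
        string reordered = ArabicRun.Replace(s, m =>
        {
            string[] words = m.Value.Split(';');
            Array.Reverse(words); // reverse the words, not the letters inside them
            return string.Join(";", words);
        });
        return reordered.Split(';');
    }

    public static void Main()
    {
        // Stored order: "Test:" + لطيفة + ";" + اليوم + ";a;b"
        string s = "Test:\u0644\u0637\u064A\u0641\u0629;\u0627\u0644\u064A\u0648\u0645;a;b";

        foreach (string part in SplitAsDisplayed(s))
            Console.WriteLine(part);
        // The first field is "Test:" followed by اليوم, the second is لطيفة.
    }
}
```

The semicolon after the second Arabic word sits between an Arabic and a Latin letter, so it is left outside the run and keeps its position.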

The strings are not reversed; they are actually split in the correct order. RTL languages are RTL when displayed, but in memory they are stored "left to right", just like English. I'll try to demonstrate with a transliteration, which is a bit hard since I don't have an Arabic keyboard installed.

Your string is s = "Arbi/Arbi, Alarbia". s[0] is A (the Arabic A'in), s[1] is R, and so forth. s[4] is the slash and s[9] is the comma. So when splitting on the comma, you get s[0:8] (indices 0 through 8) in the first part and s[10:] (index 10 onward) in the second.

This is the correct way of handling RTL strings. If you want the reverse, you need to reverse the array yourself.

Keep in mind that switching between RTL and LTR is one of the most frustrating tasks out there. You have no idea how long you'll spend figuring out what to do with numbers or English words inside RTL strings. The best thing you can do is to avoid the problem altogether, and just try to get Excel to show the strings as RTL.

According to Reflector, Split internally uses Substring, which in turn uses an internal function that just copies characters left to right without any consideration of culture. Because of that, I don't see any way around reversing the array that Split returns yourself.
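Reversing the returned array is a one-liner with Array.Reverse. A minimal sketch follows (SplitReversed is a made-up helper name); note that it flips the order of all fields, Latin ones included, and leaves the characters inside each field untouched.

```csharp
using System;

static class ReverseSplitDemo
{
    // Hypothetical helper: splits on the separator, then reverses the field order.
    public static string[] SplitReversed(string s, char separator)
    {
        string[] parts = s.Split(separator);
        Array.Reverse(parts); // in place; only the order of the fields changes
        return parts;
    }

    public static void Main()
    {
        string[] parts = SplitReversed("x;y;z", ';');
        Console.WriteLine(string.Join(" | ", parts)); // prints "z | y | x"
    }
}
```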

Source: https://stackoverflow.com/questions/12630566/parsing-through-arabic-rtl-text-from-left-to-right
