I'm trying to parse an RTF-formatted message. I need to keep the formatting tags, so I can't use the trick where you just paste into a RichTextBox and get the .PlainText out.
Take the RTF code for the string a基bমূcΟιd pasted straight into WordPad:
{\rtf1\ansi\ansicpg1252\deff0\deflang2057{\fonttbl{\f0\fnil\fcharset0 Calibri;}{\f1\fswiss\fcharset128 MS PGothic;}{\f2\fnil\fcharset1 Shonar Bangla;}{\f3\fswiss\fcharset161{\*\fname Arial;}Arial Greek;}}
{\*\generator Msftedit 5.41.21.2510;}\viewkind4\uc1\pard\sa200\sl276\slmult1\lang9\f0\fs22 a\f1\fs24\'8a\'ee\f0\fs22 b\f2\fs24\u2478?\u2498?\f0\fs22 c\f3\fs24\'cf\'e9\f0\fs22 d\par
}
It's difficult to make out if you've not had much to do with RTF, so here's the bit I'm looking at:
\'8a\'ee\f0\fs22 b\f2\fs24\u2478?\u2498?\f0\fs22 c\f3\fs24\'cf\'e9
Notice that the 基 (U+57FA) is \'8a\'ee, and the মূ, which is actually two characters, ম (\u2478?) and ূ (\u2498?), is \u2478?\u2498?, which is fine. But the Οι, which is two separate characters, Ο and ι, is \'cf\'e9.
Is there a way to determine whether I'm looking at something that should be one character, such as 基 = \'8a\'ee, or two characters, such as Ο and ι = \'cf\'e9?
I was thinking that maybe the \lang was it, but that isn't the case at all, because the \lang does not change from when it's first set. I am already accounting for the different code pages implied by the different \fcharset values in the fonts, but that doesn't seem to tell me whether I should treat two \'xx escapes next to each other as one double-byte character or as two single-byte characters.
How can I tell if the character I'm looking at should be double-byte (or multi-byte) or single byte?
\'xx escapes represent bytes and should be interpreted using the encoding given by the font's \fcharset (or potentially \cchs), falling back to \ansicpg if neither is present.
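As a rough illustration of that lookup, here is a minimal Python sketch. The table FCHARSET_TO_CODEC and the helper decode_hex_bytes are names made up for this example, and the table only covers the charsets that appear in the question; a real parser would need the full list from the RTF specification.

```python
# Hypothetical lookup table: RTF \fcharsetN value -> Python codec name.
# Only the charsets from the question's font table are listed.
FCHARSET_TO_CODEC = {
    0: "cp1252",    # ANSI; assumed here to match the document's \ansicpg1252
    128: "cp932",   # Japanese (Shift-JIS family)
    161: "cp1253",  # Greek
}


def decode_hex_bytes(hex_bytes, fcharset, ansicpg=1252):
    """Decode the byte values taken from \\'xx escapes with the font's codec.

    Falls back to the document-wide \\ansicpg code page when the charset
    is not in the (deliberately incomplete) table above.
    """
    codec = FCHARSET_TO_CODEC.get(fcharset, f"cp{ansicpg}")
    return bytes(hex_bytes).decode(codec)


japanese = decode_hex_bytes([0x8A, 0xEE], fcharset=128)
greek = decode_hex_bytes([0xCF, 0xE9], fcharset=161)
print(japanese, len(japanese))
print(greek, len(greek))
```

The same two-escape pattern comes out as one character under charset 128 and as two characters under charset 161, which is the whole difference the question is asking about.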
You need to know that encoding intimately to be able to decide whether a single \'xx sequence represents a character on its own or is only a part of a multi-byte character; typically you will be consuming each section of text as a unit before converting that byte string into a Unicode string using whatever library or OS interface you have available, to avoid having to write byte-by-byte parsers for every code page supported by RTF.
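If you do want to see where the character boundaries fall without writing a parser per code page, one option is to let the library tell you. The sketch below uses Python's incremental decoders for that; chars_per_byte is a hypothetical helper name, and the approach assumes your platform's decoder buffers incomplete multi-byte sequences, as CPython's cp932 decoder does.

```python
import codecs


def chars_per_byte(raw, codec):
    """Feed bytes one at a time and record what each byte produces.

    An empty string means the decoder is still waiting for the rest of a
    multi-byte character.
    """
    decoder = codecs.getincrementaldecoder(codec)()
    return [decoder.decode(bytes([b])) for b in raw]


# Shift-JIS lead byte 0x8a yields nothing until its trail byte arrives.
shift_jis_steps = chars_per_byte(b"\x8a\xee", "cp932")
# Each cp1253 byte is a complete character on its own.
greek_steps = chars_per_byte(b"\xcf\xe9", "cp1253")
print(shift_jis_steps)
print(greek_steps)
```

In practice, decoding the whole run of bytes in one call is simpler; this is only meant to show that the decision lives in the codec, not in the RTF markup.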
\uxxxx? escapes represent UTF-16 code units, written in decimal (so \u2478 is U+09AE). This is much simpler, but Word[pad] only produces this form of encoding as a last resort, because it's not compatible with earlier RTF versions. (The ? is the fallback character for when the receiver can't cope with the Unicode.)
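A minimal Python sketch of decoding such escapes follows. It assumes \uc1 is in effect, so exactly one fallback character follows each escape, and it treats negative numbers as signed 16-bit values, which is how RTF writes code units above 32767. U_ESCAPE and decode_u_escapes are names invented for this example.

```python
import re

# \uN followed by exactly one fallback character (assumes \uc1): either a
# literal character or a \'xx escape.
U_ESCAPE = re.compile(r"\\u(-?\d+) ?(?:\\'[0-9a-fA-F]{2}|[^\\{}])")


def decode_u_escapes(rtf_fragment):
    """Collect the UTF-16 code units from \\uN escapes and decode them."""
    units = []
    for match in U_ESCAPE.finditer(rtf_fragment):
        # Negative values are signed 16-bit; wrap them into 0..65535.
        units.append(int(match.group(1)) & 0xFFFF)
    raw = b"".join(unit.to_bytes(2, "little") for unit in units)
    return raw.decode("utf-16-le")


bengali = decode_u_escapes(r"\u2478?\u2498?")
# A surrogate pair (U+1F600) written as two negative code units.
emoji = decode_u_escapes(r"\u-10179?\u-8704?")
print(bengali, len(bengali))
print(len(emoji))
```

Going through UTF-16 rather than calling chr() on each number means surrogate pairs are combined into a single character.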
So:
The two characters Οι are represented as two byte-escapes because the font associated with that stretch of text is using a Greek single-byte encoding (charset 161 = cp1253).
The one character 基 is represented as two byte-escapes because the font associated with that stretch of text is using a Japanese multibyte encoding (charset 128 = cp932 ≈ Shift-JIS). In Shift-JIS the leading \'8a byte signals a further byte to come, as do various others in the top-bit-set range (but not all of them).
The two characters মূ are represented as Unicode code unit escapes, because there's no other option: there isn't any RTF-compatible code page that contains Bengali characters. (Code page 57003 for ISCII came much later.)
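Putting the three cases together, here is a small Python sketch that walks the fragment from the question. It is not a real RTF parser: groups, \uc values other than 1, and code units outside the BMP are ignored. FONT_CODECS is copied by hand from the question's \fonttbl, and mapping \f2 (charset 1) to cp1252 is just a placeholder assumption, since that font only carries \u escapes here.

```python
import re

# Font number -> codec, hand-copied from the question's \fonttbl.
FONT_CODECS = {0: "cp1252", 1: "cp932", 2: "cp1252", 3: "cp1253"}

TOKEN = re.compile(
    r"\\'([0-9a-fA-F]{2})"                          # 1: hex byte escape
    r"|\\u(-?\d+) ?(?:\\'[0-9a-fA-F]{2}|[^\\{}])"   # 2: \uN plus one fallback
    r"|\\([a-z]+)(-?\d+)? ?"                        # 3, 4: other control word
    r"|([^\\{}])"                                   # 5: literal character
)


def decode_fragment(fragment):
    out = []
    pending = bytearray()   # bytes from consecutive \'xx escapes
    codec = FONT_CODECS[0]

    def flush():
        # Decode the accumulated byte run as one unit with the current font.
        if pending:
            out.append(bytes(pending).decode(codec))
            pending.clear()

    for m in TOKEN.finditer(fragment):
        hex_byte, unit, word, param, literal = m.groups()
        if hex_byte is not None:
            pending.append(int(hex_byte, 16))
            continue
        flush()  # anything other than \'xx ends the current byte run
        if unit is not None:
            out.append(chr(int(unit) & 0xFFFF))  # BMP only in this sketch
        elif word == "f" and param is not None:
            codec = FONT_CODECS[int(param)]
        elif literal is not None:
            out.append(literal)
    flush()
    return "".join(out)


fragment = (r"a\f1\fs24\'8a\'ee\f0\fs22 b\f2\fs24\u2478?\u2498?"
            r"\f0\fs22 c\f3\fs24\'cf\'e9\f0\fs22 d")
text = decode_fragment(fragment)
print(text)
```

The key design choice is that byte escapes are buffered and only decoded when the run ends, so the codec decides how many characters the run contains.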
RTF has tags for specifying the code page/encoding used to encode non-ASCII characters. The actual hex codes for the characters are the byte octets used by the specified encoding. In this case, \ansicpg1252 means ANSI code page 1252.
Source: https://stackoverflow.com/questions/8257366/detect-multibyte-and-chinese-characters-in-rtf-markup