Writing unicode to rtf file

♀尐吖头ヾ submitted on 2019-12-18 09:23:09

Question


I'm trying to write strings in different languages to an RTF file. I have tried a few different things. I use Japanese here as an example, but it's the same for the other languages I have tried.

public void writeToFile(){

    String strJapanese = "日本語";
    DataOutputStream outStream;
    File file = new File("C:\\file.rtf");

    try{

        outStream = new DataOutputStream(new FileOutputStream(file));
        outStream.writeBytes(strJapanese);
        outStream.close();

    }catch (Exception e){
        System.out.println(e.toString());
    }
}

I have also tried:

byte[] b = strJapanese.getBytes("UTF-8");
String output = new String(b);

Or, more specifically:

byte[] b = strJapanese.getBytes("Shift-JIS");
String output = new String(b);

The output stream also has the writeUTF method:

outStream.writeUTF(strJapanese);

You can also pass the byte[] directly to the output stream's write method. All of the above give me garbled characters for everything except Western European languages. To see if it works, I have tried opening the resulting document in Notepad++ and setting the appropriate encoding. I have also used OpenOffice, which lets you choose the encoding and font when opening the document.

If it does work but my computer can't open it properly, is there a way to check that?


Answer 1:


Strings in Java are Unicode internally (UTF-16, not UTF-8), but when you write one out you need to specify the encoding explicitly:

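import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

String str = "日本語";
// try-with-resources closes the writer even if write() throws
try (Writer out = new OutputStreamWriter(new FileOutputStream("test.txt"), StandardCharsets.UTF_8)) {
    out.write(str);
} catch (IOException e) {
    e.printStackTrace();
}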

ref: http://download.oracle.com/javase/tutorial/i18n/text/stream.html




Answer 2:


DataOutputStream outStream;

You probably don't want a DataOutputStream for writing an RTF file. DataOutputStream is for writing binary structures to a file, but RTF is text-based. Typically an OutputStreamWriter, with the appropriate charset set in the constructor, would be the way to write to text files.

outStream.writeBytes(strJapanese);

In particular this fails because writeBytes really does write bytes, even though you pass it a String. A much more appropriate datatype would have been byte[], but that's just one of the places where Java's handling of bytes vs chars is confusing. The way it converts your string to bytes is simply by taking the lower eight bits of each UTF-16 code unit, and throwing the rest away. This results in ISO-8859-1 encoding with garbled nonsense for all the characters that don't exist in ISO-8859-1.
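The truncation is easy to observe without touching the file system. The following is a minimal sketch, assuming a hypothetical helper class WriteBytesDemo that captures the output in a ByteArrayOutputStream and prints it as hex; only DataOutputStream.writeBytes comes from the question.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical demo class, not part of the original question.
final class WriteBytesDemo {
    /** Returns the bytes that DataOutputStream.writeBytes emits for s, as spaced hex. */
    static String writeBytesHex(String s) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeBytes(s); // keeps only the low eight bits of each char
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : buffer.toByteArray()) {
            hex.append(String.format("%02x ", b & 0xff));
        }
        return hex.toString().trim();
    }

    public static void main(String[] args) {
        // U+65E5 U+672C U+8A9E, the three characters from the question
        System.out.println(writeBytesHex("\u65e5\u672c\u8a9e"));
    }
}
```

For the question's string this prints e5 2c 9e: one byte per character, each being the low half of the UTF-16 code unit, so the original text cannot be recovered from the file.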

byte[] b = strJapanese.getBytes("UTF-8");
String output = new String(b);

This doesn't really do anything useful. You encode to UTF-8 bytes and then decode them back to a String using the default charset. It's almost always a mistake to rely on the default charset, as it is unpredictable across different machines.
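To illustrate why the decoding charset matters, here is a small sketch, assuming a hypothetical helper decodeWith that encodes to UTF-8 and decodes with a charset named explicitly instead of the platform default.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original answer.
final class RoundTripDemo {
    /** Encodes text as UTF-8, then decodes those bytes with the given charset. */
    static String decodeWith(String text, Charset decodeCharset) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, decodeCharset);
    }

    public static void main(String[] args) {
        String japanese = "\u65e5\u672c\u8a9e"; // the string from the question
        // Same charset on both sides: the round trip is lossless but pointless.
        System.out.println(decodeWith(japanese, StandardCharsets.UTF_8).equals(japanese));
        // Mismatched charset: each of the nine UTF-8 bytes becomes its own character.
        System.out.println(decodeWith(japanese, StandardCharsets.ISO_8859_1).length());
    }
}
```

With a matching charset you get the original String back unchanged, so nothing is gained; with a mismatched one (which is what the default charset amounts to on many machines) the three characters turn into nine unrelated ones.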

outStream.writeUTF(strJapanese);

This would be a better stab at writing UTF-8, but it's still not quite right as it uses Java's bogus “modified UTF-8” encoding, and more importantly RTF files don't actually support UTF-8, and shouldn't really directly include any non-ASCII characters at all.
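The differences from plain UTF-8 show up in the raw bytes. A minimal sketch, assuming a hypothetical helper class WriteUtfDemo that captures the output of writeUTF in memory:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical demo class, not part of the original answer.
final class WriteUtfDemo {
    /** Returns the bytes that DataOutputStream.writeUTF emits for s, as spaced hex. */
    static String writeUtfHex(String s) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeUTF(s); // two-byte length prefix, then modified UTF-8
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : buffer.toByteArray()) {
            hex.append(String.format("%02x ", b & 0xff));
        }
        return hex.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println(writeUtfHex("\u65e5\u672c\u8a9e")); // the question's string
        System.out.println(writeUtfHex("A\0"));                // 'A' followed by U+0000
    }
}
```

The first line printed is 00 09 e6 97 a5 e6 9c ac e8 aa 9e: the leading 00 09 is a byte count that no text format expects. The second is 00 03 41 c0 80, where U+0000 is encoded as the two bytes c0 80 rather than a single zero byte, which is one of the ways modified UTF-8 departs from the standard.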

Traditionally, non-ASCII characters from 128 upwards should be written as hex byte escapes like \'80, and the encoding for them is specified, if it is specified at all, in font \fcharset and \cpg escapes that are very, very annoying to deal with, and don't offer UTF-8 as one of the options.
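As a rough sketch of the traditional approach, the hypothetical class below encodes text in a single-byte code page and writes every byte from 128 upwards as a \'xx escape. It assumes the document header declares the same code page (for example \ansicpg1252 for windows-1252), that the JDK provides that charset, and that multi-byte code pages such as Shift-JIS are out of scope, since their trail bytes need extra care.

```java
import java.nio.charset.Charset;

// Hypothetical helper, limited to single-byte code pages by assumption.
final class RtfHexEscaper {
    /** Encodes text in codePage and escapes bytes >= 0x80 as backslash-quote-hex. */
    static String escape(String text, Charset codePage) {
        StringBuilder out = new StringBuilder();
        // Characters the code page cannot represent are replaced by getBytes, typically with '?'.
        for (byte b : text.getBytes(codePage)) {
            int value = b & 0xff;
            if (value >= 0x80) {
                out.append(String.format("\\'%02x", value));
            } else if (value == '\\' || value == '{' || value == '}') {
                out.append('\\').append((char) value); // RTF syntax characters
            } else {
                out.append((char) value);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // "cafe" with an acute accent on the last letter
        System.out.println(escape("caf\u00e9", Charset.forName("windows-1252")));
    }
}
```

For the sample word this yields caf\'e9. Japanese text would come out as question marks here, which is exactly the limitation that makes this route painful.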

In more modern RTF, you get \u1234x escapes as in Dabbler's answer (+1). Each escape encodes one UTF-16 code unit, which corresponds to a Java char, so it's not too difficult to regex-replace all non-ASCII characters with their escaped variants.

This is supported by Word 97 and later but some other tools may ignore the Unicode and fall back to the x replacement character.
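A minimal sketch of the Unicode-escape approach follows. The class and method names (RtfUnicodeWriter, escape, document, write) and the bare {\rtf1\ansi\uc1 ...} header are assumptions made for illustration; a real document would normally also carry a font table. It uses a plain loop rather than a regex, writes one escape per UTF-16 code unit with ? as the fallback character, and does not handle paragraph breaks.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper; names and header are illustrative assumptions.
final class RtfUnicodeWriter {
    /** Leaves ASCII alone and turns every other UTF-16 code unit into an RTF Unicode escape. */
    static String escape(String text) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '\\' || c == '{' || c == '}') {
                out.append('\\').append(c); // RTF syntax characters
            } else if (c < 0x80) {
                out.append(c);
            } else {
                // The escape takes a signed 16-bit decimal, so values above 32767 become negative.
                out.append("\\u").append((int) (short) c).append('?');
            }
        }
        return out.toString();
    }

    /** Wraps the escaped body in a bare-bones RTF group; uc1 means one fallback char per escape. */
    static String document(String body) {
        return "{\\rtf1\\ansi\\uc1 " + escape(body) + "}";
    }

    /** Writes the document as pure ASCII, which is all the file should contain after escaping. */
    static void write(Path target, String body) {
        try {
            Files.write(target, document(body).getBytes(StandardCharsets.US_ASCII));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

For the question's string, escape returns \u26085?\u26412?\u-30050?. The third value is negative because U+8A9E (35486) does not fit in a signed 16-bit number. Calling write(Paths.get("file.rtf"), strJapanese) would then produce a file that contains only ASCII bytes.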

RTF is not a very nice format.




Answer 3:


You can write any Unicode character expressed as its decimal number by using the \u control word. E.g. \u1234? would represent the character whose Unicode code point is decimal 1234 (U+04D2), and ? is the replacement character for cases where the character cannot be adequately represented (e.g. because the font doesn't contain it).
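One way to check what such escapes stand for is to decode them again. The sketch below is a hypothetical helper, not a general RTF parser: it assumes every escape is followed by exactly one fallback character and ignores all other control words.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for inspecting escaped text; not a full RTF reader.
final class RtfUnicodeDecoder {
    // A backslash, the letter u, a signed decimal number, then one fallback character.
    private static final Pattern ESCAPE = Pattern.compile("\\\\u(-?\\d+).");

    /** Replaces each Unicode escape (plus its fallback character) with the code unit it names. */
    static String decode(String rtfText) {
        Matcher m = ESCAPE.matcher(rtfText);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Narrowing to char keeps the low 16 bits, which undoes the signed encoding.
            char unit = (char) Integer.parseInt(m.group(1));
            m.appendReplacement(out, Matcher.quoteReplacement(String.valueOf(unit)));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

With this, decode applied to \u1234? gives the single character U+04D2, and a negative value such as \u-30050? maps back to U+8A9E.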



Source: https://stackoverflow.com/questions/7894772/writing-unicode-to-rtf-file
