How to parse a string that is in a different encoding from Java


I have a string that I have read in from a Word document. I think it is in "Cp1252" encoding. Java uses UTF8.

How do I search that string for those special charact

3 Answers
  • 2020-12-20 07:35

    Conversion is generally done by something like this:

    String properlyEncoded = 
        new String(original.getBytes(originalEncoding), newEncoding);
    

    Note that some information may well be lost during the conversion: characters that have no mapping in the target encoding are replaced with a substitute.
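
    To make the risk of loss concrete, here is a minimal sketch, assuming the target charset is ISO-8859-1 (which has no en dash). The class and method names are made up for illustration; the only library behaviour relied on is that String.getBytes(Charset) substitutes the charset's replacement byte for unmappable characters.

    ```java
    import java.nio.charset.StandardCharsets;

    public class LossyConversionDemo {
        // Hypothetical helper: encodes the text to ISO-8859-1 bytes and decodes them again.
        static String roundTripLatin1(String text) {
            byte[] bytes = text.getBytes(StandardCharsets.ISO_8859_1);
            return new String(bytes, StandardCharsets.ISO_8859_1);
        }

        public static void main(String[] args) {
            String original = "2019\u20132020"; // contains an en dash (U+2013)
            String roundTripped = roundTripLatin1(original);
            // The en dash has no ISO-8859-1 mapping, so it does not survive.
            System.out.println(original.equals(roundTripped)); // false
            System.out.println(roundTripped);                  // 2019?2020
        }
    }
    ```

    Once the character has been replaced by '?', nothing downstream can recover which character it was.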

  • 2020-12-20 07:51

    First you need to make sure that you properly convert from CP1252 bytes to Java's character representation (which is UTF-16). Since you're using a library for parsing .docx files, this has probably already happened.

    Now all you need to do is call projDateString.replace('\u2013', '-') and do something with the return value. No need for replaceAll(), since you're not working with regular expressions.
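
    As a rough illustration of why replace() is enough here, the sketch below contrasts the literal replace() with the regex-based replaceAll(). The class name, method name and sample strings are assumptions made up for the example.

    ```java
    public class ReplaceVsReplaceAllDemo {
        // Hypothetical helper: swaps every en dash (U+2013) for a plain hyphen.
        static String normalizeDashes(String text) {
            return text.replace('\u2013', '-');
        }

        public static void main(String[] args) {
            // replace() treats its argument literally.
            System.out.println("a.b".replace(".", "-"));    // a-b
            // replaceAll() treats its first argument as a regex, where "." matches any character.
            System.out.println("a.b".replaceAll(".", "-")); // ---
            System.out.println(normalizeDashes("1 Jan \u2013 5 Feb")); // 1 Jan - 5 Feb
        }
    }
    ```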

  • 2020-12-20 07:54

    Java strings are always in UTF-16, at least as far as the API is concerned... but you can generally just think of them as being "Unicode". The fact that they're UTF-16 is only really relevant when it comes to characters outside the Basic Multilingual Plane, i.e. with Unicode values above U+FFFF. They have to be represented as surrogate pairs in Java. But I don't think you need to worry about this in your case. So just think of the values in Strings as "Unicode text" without a specific encoding... in particular, definitely not in UTF-8 or CP1252. Those are the encodings used to convert binary data (e.g. a byte array) into text data (e.g. a string).
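
    If you want to see where the UTF-16 detail does show up, here is a small sketch comparing an en dash (inside the BMP) with a character outside it. The class and method names are invented for the example; length() and codePointCount() are the standard String methods.

    ```java
    public class SurrogatePairDemo {
        // Number of UTF-16 code units, which is what String.length() reports.
        static int utf16Units(String s) {
            return s.length();
        }

        // Number of Unicode code points in the string.
        static int codePoints(String s) {
            return s.codePointCount(0, s.length());
        }

        public static void main(String[] args) {
            String enDash = "\u2013";      // U+2013, inside the BMP
            String clef = "\uD834\uDD1E";  // U+1D11E, stored as a surrogate pair
            System.out.println(utf16Units(enDash) + " " + codePoints(enDash)); // 1 1
            System.out.println(utf16Units(clef) + " " + codePoints(clef));     // 2 1
        }
    }
    ```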

    You shouldn't be using String.getBytes() or new String(byte[]) without specifying the encoding - that's the problem. Those always use the platform default encoding - which is almost always the wrong choice.

    You say you "have a string that I have read in from a Word document" - how did you read it in? How did it start off life?

    If you have the bytes and you know the relevant encoding, you should use:

    String text = new String(bytes, encoding);
    

    You should never have to deal with a string which has been created using the wrong encoding - if you get to that stage, you're almost bound to be risking information loss. Tackle the problem as early as you possibly can, rather than trying to fix the data up later on.
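
    For the CP1252 case specifically, decoding at the source might look like the sketch below. It assumes you really do have the raw bytes and that they really are windows-1252; the class name, method name and sample bytes are made up for illustration.

    ```java
    import java.nio.charset.Charset;

    public class Cp1252DecodeDemo {
        // Hypothetical helper: decodes raw bytes that are assumed to be windows-1252 (Cp1252).
        static String decodeCp1252(byte[] bytes) {
            return new String(bytes, Charset.forName("windows-1252"));
        }

        public static void main(String[] args) {
            // '1', then byte 0x96 (the en dash in windows-1252), then '2'
            byte[] raw = { 0x31, (byte) 0x96, 0x32 };
            String text = decodeCp1252(raw);
            // After decoding, the middle character is U+2013, not 0x96.
            System.out.println(Integer.toHexString(text.charAt(1))); // 2013
        }
    }
    ```

    Note that after a correct decode there is no 0x96 left in the string to search for; the character to look for is U+2013.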

    The next thing to understand is that the String class in Java is immutable. Calling replaceAll on a string won't change the existing string. It will instead return a new string with the replacements made.

    So this statement:

    projDateString2.replaceAll("\0x96", "\u2013");
    

    will never do what you want. Even if everything else is correct, you should be using:

    projDateString2 = projDateString2.replaceAll("\0x96", "\u2013");
    

    (or something similar). I don't think that actually will do what you want anyway, but you need to be aware of it for when everything else is sorted out.
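
    A tiny sketch of the immutability point, using replace() on an en dash rather than your exact call. The class and method names are hypothetical.

    ```java
    public class ImmutableStringDemo {
        // The result of replace() is discarded, so the caller gets the text back unchanged.
        static String discardResult(String text) {
            text.replace('\u2013', '-');
            return text;
        }

        // The result is assigned back, so the caller sees the replacement.
        static String keepResult(String text) {
            text = text.replace('\u2013', '-');
            return text;
        }

        public static void main(String[] args) {
            String dates = "2019\u20132020";
            System.out.println(discardResult(dates).equals(dates)); // true
            System.out.println(keepResult(dates));                  // 2019-2020
        }
    }
    ```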
