Converting UTF-8 to ISO-8859-1 in Java

挽巷 2020-12-08 11:45

I am reading an XML document (UTF-8) and ultimately displaying the content on a Web page using ISO-8859-1. As expected, there are a few characters that are not displayed correctly.

4 answers
  •  甜味超标
    2020-12-08 12:26

    I'm not sure if there is a normalization routine in the standard library that will do this. I do not think conversion of "smart" quotes is handled by the standard Unicode normalizer routines - but don't quote me.
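    The doubt about the normalizer can be checked with a small sketch. It assumes java.text.Normalizer with the NFKC form (the most aggressive of the standard forms); the class name SmartQuoteNormalizerCheck and the helper nfkc are made up for illustration. A ligature, which does have a compatibility mapping, is included for contrast.

```java
import java.text.Normalizer;

final class SmartQuoteNormalizerCheck {
  // Hypothetical helper: returns the NFKC form of the input.
  static String nfkc(String s) {
    return Normalizer.normalize(s, Normalizer.Form.NFKC);
  }

  public static void main(String[] args) {
    String smartQuote = "\u201C"; // LEFT DOUBLE QUOTATION MARK
    String ligature = "\uFB01";   // LATIN SMALL LIGATURE FI

    // The smart quote has no decomposition, so normalization leaves it alone.
    System.out.println(nfkc(smartQuote).equals(smartQuote));
    // The ligature is folded to the two plain letters "fi".
    System.out.println(nfkc(ligature).equals("fi"));
  }
}
```

    So normalization on its own will not turn a smart quote into a plain ASCII quote; a separate mapping or escaping step is still needed.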

    The smart thing to do is to dump ISO-8859-1 and start using UTF-8. That said, it is possible to encode any normally allowed Unicode code point into an HTML page encoded as ISO-8859-1. You can encode them as numeric character references, as shown here:

    public final class HtmlEncoder {
      private HtmlEncoder() {}
    
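      // T is any Appendable (StringBuilder, Writer, ...); it is returned for chaining.
      public static <T extends Appendable> T escapeNonLatin(CharSequence sequence,
          T out) throws java.io.IOException {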
        for (int i = 0; i < sequence.length(); i++) {
          char ch = sequence.charAt(i);
          if (Character.UnicodeBlock.of(ch) == Character.UnicodeBlock.BASIC_LATIN) {
            out.append(ch);
          } else {
            int codepoint = Character.codePointAt(sequence, i);
            // handle supplementary range chars
            i += Character.charCount(codepoint) - 1;
            // emit entity
            out.append("&#x");
            out.append(Integer.toHexString(codepoint));
            out.append(";");
          }
        }
        return out;
      }
    }
    

    Example usage:

    String foo = "This is Cyrillic Ya: \u044F\n"
        + "This is fraktur G: \uD835\uDD0A\n" + "This is a smart quote: \u201C";
    
    // escapeNonLatin declares IOException, so the enclosing method must catch or declare it
    StringBuilder sb = HtmlEncoder.escapeNonLatin(foo, new StringBuilder());
    System.out.println(sb.toString());
    

    Above, the character LEFT DOUBLE QUOTATION MARK (U+201C) is encoded as &#x201c;. A couple of other arbitrary code points are likewise encoded.
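    HtmlEncoder above escapes everything outside Basic Latin, including characters such as é that ISO-8859-1 can represent directly. A possible variant, sketched here as an assumption rather than as part of the original answer, asks a CharsetEncoder for ISO-8859-1 whether each character is encodable and only escapes the rest. The class name Latin1Escaper and the method escapeUnencodable are hypothetical.

```java
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

final class Latin1Escaper {
  // Hypothetical helper: keeps characters ISO-8859-1 can encode, escapes the rest.
  static String escapeUnencodable(CharSequence text) {
    CharsetEncoder latin1 = StandardCharsets.ISO_8859_1.newEncoder();
    StringBuilder out = new StringBuilder();
    for (int i = 0; i < text.length(); ) {
      int codePoint = Character.codePointAt(text, i);
      int length = Character.charCount(codePoint);
      // Supplementary code points (length 2) can never be Latin-1.
      if (length == 1 && latin1.canEncode((char) codePoint)) {
        out.append((char) codePoint);
      } else {
        out.append("&#x").append(Integer.toHexString(codePoint)).append(';');
      }
      i += length;
    }
    return out.toString();
  }

  public static void main(String[] args) {
    // e-acute stays as is; the smart quote and the fraktur G become entities.
    System.out.println(escapeUnencodable("caf\u00E9 \u201C \uD835\uDD0A"));
  }
}
```

    This keeps the page source closer to the original text, at the cost of relying on the page really being served as ISO-8859-1.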

    Care needs to be taken with this approach. If your text also needs to be escaped for HTML (ampersands, angle brackets and so on), do that before running the code above; otherwise the ampersands in the generated entities end up being escaped as well.
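    The ordering issue can be seen in a minimal sketch. Both helpers are hypothetical stand-ins: escapeHtml covers only the four usual metacharacters, and escapeNonAscii is a String-to-String reduction of the escapeNonLatin idea above.

```java
final class EscapeOrderDemo {
  // Hypothetical helper: escapes HTML metacharacters only.
  static String escapeHtml(String s) {
    return s.replace("&", "&amp;")
        .replace("<", "&lt;")
        .replace(">", "&gt;")
        .replace("\"", "&quot;");
  }

  // Hypothetical helper: replaces every non-ASCII code point with a hex entity.
  static String escapeNonAscii(String s) {
    StringBuilder out = new StringBuilder();
    for (int i = 0; i < s.length(); ) {
      int codePoint = s.codePointAt(i);
      if (codePoint < 0x80) {
        out.append((char) codePoint);
      } else {
        out.append("&#x").append(Integer.toHexString(codePoint)).append(';');
      }
      i += Character.charCount(codePoint);
    }
    return out.toString();
  }

  public static void main(String[] args) {
    String text = "a < b & \u201Cquoted\u201D";
    // HTML escaping first: the entities survive intact.
    System.out.println(escapeNonAscii(escapeHtml(text)));
    // Entities first: their ampersands get escaped again and the browser shows them literally.
    System.out.println(escapeHtml(escapeNonAscii(text)));
  }
}
```

    The first line prints a &lt; b &amp; &#x201c;quoted&#x201d; while the second prints a &lt; b &amp; &amp;#x201c;quoted&amp;#x201d;, which a browser would render as the literal text of the entity rather than a quotation mark.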
