How to establish the codepoint of encoded characters?

帅比萌擦擦* submitted on 2019-12-13 02:22:31

Question


Given a stream of bytes (that represent characters) and the encoding of the stream, how would I obtain the code points of the characters?

InputStreamReader r = new InputStreamReader(bla, Charset.forName("UTF-8"));
int whatIsThis = r.read(); 

What does read() return in the snippet above? Is it the Unicode code point?


Answer 1:


Reader.read() returns a value that can be cast to char or -1 if no more data is available.

A char is (implicitly) a 16-bit UTF-16 code unit. Characters in the basic multilingual plane are represented by a single char. Characters in the supplementary range are represented by two-char sequences (surrogate pairs).
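To make the distinction between code units and code points concrete, here is a minimal sketch that counts both for a string. The class name CodeUnitDemo and its helper methods are made up for illustration; only String.length, String.codePointCount and Character.toChars are standard library calls.

```java
public class CodeUnitDemo {
    // Number of UTF-16 code units (chars) in the string.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points in the string.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String bmp = "A"; // U+0041, basic multilingual plane
        String supplementary = new String(Character.toChars(0x20000)); // U+20000

        System.out.println(codeUnits(bmp) + " code unit(s), "
                + codePoints(bmp) + " code point(s)");
        System.out.println(codeUnits(supplementary) + " code unit(s), "
                + codePoints(supplementary) + " code point(s)");
    }
}
```

The supplementary character occupies two chars but counts as one code point, which is why a single read() call cannot return it whole.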

The Character type contains methods for translating UTF-16 code units to Unicode code points. When a code point requires two chars, the first char satisfies isHighSurrogate and the second satisfies isLowSurrogate. The codePointAt methods can be used to extract code points from code unit sequences. There are similar methods (such as toChars) for converting code points to UTF-16 code units.
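As a rough illustration of the Character methods mentioned above, the following sketch converts in both directions. The class name SurrogateDemo and its two helpers are hypothetical; the Character calls themselves (codePointAt, toChars, isHighSurrogate, isLowSurrogate) are part of java.lang.

```java
public class SurrogateDemo {
    // Extracts the code point that starts at index 0 of the text.
    static int firstCodePoint(CharSequence text) {
        return Character.codePointAt(text, 0);
    }

    // Converts a code point to its UTF-16 code units (one or two chars).
    static char[] toUtf16(int codePoint) {
        return Character.toChars(codePoint);
    }

    public static void main(String[] args) {
        char[] units = toUtf16(0x20000);
        System.out.println("code units: " + units.length);
        System.out.println("high surrogate first: " + Character.isHighSurrogate(units[0]));
        System.out.println("low surrogate second: " + Character.isLowSurrogate(units[1]));

        // Round trip: code units back to the original code point.
        int codePoint = firstCodePoint(new String(units));
        System.out.println("code point: U+" + Integer.toHexString(codePoint).toUpperCase());
    }
}
```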


A sample implementation of a code point stream reader:

import java.io.*;

public class CodePointReader implements Closeable {
  private final Reader charSource;
  // The next UTF-16 code unit to be consumed, or -1 at end of stream.
  private int codeUnit;

  public CodePointReader(Reader charSource) throws IOException {
    this.charSource = charSource;
    // Read one code unit ahead so hasNext() can answer without blocking later.
    codeUnit = charSource.read();
  }

  public boolean hasNext() { return codeUnit != -1; }

  public int nextCodePoint() throws IOException {
    try {
      char high = (char) codeUnit;
      if (Character.isHighSurrogate(high)) {
        // A high surrogate must be followed by a low surrogate.
        int next = charSource.read();
        if (next == -1) { throw new IOException("malformed character"); }
        char low = (char) next;
        if (!Character.isLowSurrogate(low)) {
          throw new IOException("malformed sequence");
        }
        return Character.toCodePoint(high, low);
      } else {
        // A code unit outside the high surrogate range is returned as-is.
        return codeUnit;
      }
    } finally {
      // Advance the look-ahead for the next call.
      codeUnit = charSource.read();
    }
  }

  public void close() throws IOException { charSource.close(); }
}



Answer 2:


It does not read Unicode code points, but UTF-16 code units. There is no difference for code points up to 0xFFFF, but code points above 0xFFFF are represented by two code units each, because a 16-bit value cannot exceed 0xFFFF.

For example:

byte[] a = {-16, -96, -128, -128}; //UTF-8 for 𠀀 U+20000

ByteArrayInputStream is = new ByteArrayInputStream(a);
InputStreamReader r = new InputStreamReader(is, Charset.forName("UTF-8"));
int whatIsThis = r.read();
int whatIsThis2 = r.read();
System.out.println(whatIsThis); //55360 (0xD840), a high surrogate, not a valid standalone code point
System.out.println(whatIsThis2); //56320 (0xDC00), a low surrogate, not a valid standalone code point

With the surrogate values, we put them together to get 0x20000:

((55360 - 0xD800) * 0x400) + (56320 - 0xDC00) + 0x10000 == 0x20000
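The arithmetic above can be checked against the standard library. This is a small sketch, with SurrogateMath and combine as made-up names, that applies the same formula and compares the result with Character.toCodePoint.

```java
public class SurrogateMath {
    // Combines a high and a low surrogate using the formula from the answer.
    // Assumes both arguments are already valid surrogate values.
    static int combine(int high, int low) {
        return ((high - 0xD800) * 0x400) + (low - 0xDC00) + 0x10000;
    }

    public static void main(String[] args) {
        int manual = combine(55360, 56320);
        int library = Character.toCodePoint((char) 55360, (char) 56320);
        System.out.println("manual:  0x" + Integer.toHexString(manual));
        System.out.println("library: 0x" + Integer.toHexString(library));
        System.out.println("equal:   " + (manual == library));
    }
}
```

In real code, Character.toCodePoint is the simpler choice; the manual version is only here to show where the numbers come from.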

More about how UTF-16 works: http://en.wikipedia.org/wiki/UTF-16



Source: https://stackoverflow.com/questions/14222473/how-to-establish-the-codepoint-of-encoded-characters
