Map supplementary Unicode characters to BMP (if possible)

Posted by 主宰稳场 on 2019-12-07 23:38:22

Question


I ran into the issue that my XML parser (VTD-XML) doesn't seem to be able to handle Unicode supplementary characters (please correct me if I'm already wrong here). It seems the parser only uses the lower 16 bits of such characters.

I cannot switch to another parser in the project I'm working on. I am parsing Medline abstracts (https://www.ncbi.nlm.nih.gov/pubmed), and it seems that documents containing supplementary characters have been added over the last year (e.g. https://www.ncbi.nlm.nih.gov/pubmed/?term=26855708, end of the results section).

As a quick and dirty fix, I would just delete all characters above 0xFFFF from the documents. Obviously, that would destroy some expressions in the document texts, so I'm not really happy with that solution.
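For reference, a minimal sketch of that quick fix, assuming Java 8 or later. The class and method names are made up for illustration; only standard `String` and `Character` methods are used.

```java
public final class SupplementaryStripper {
    private SupplementaryStripper() {}

    /** Returns a copy of the input with every code point above U+FFFF removed. */
    public static String stripSupplementary(String text) {
        StringBuilder out = new StringBuilder(text.length());
        text.codePoints()                                          // iterate by code point, not by char
            .filter(cp -> !Character.isSupplementaryCodePoint(cp)) // keep BMP code points only
            .forEach(out::appendCodePoint);
        return out.toString();
    }
}
```

Iterating by code point rather than by `char` means a surrogate pair is dropped as a whole, so no lone surrogate is left behind.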

Since I can't change the parser, I was wondering whether there is a way to map supplementary characters to BMP characters that are likely to have a similar-looking glyph, where such a character exists.
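One possible way to get such a mapping, sketched here under the assumption that many of the offending characters are compatibility characters (for example the Mathematical Alphanumeric Symbols block), is Unicode compatibility normalization (NFKC) via `java.text.Normalizer`. The class name `BmpMapper` and the choice of U+FFFD as a fallback are illustrative assumptions, not an established solution.

```java
import java.text.Normalizer;

public final class BmpMapper {
    private BmpMapper() {}

    /**
     * Replaces each supplementary code point with its NFKC form if that form
     * lies entirely in the BMP, and with U+FFFD otherwise. BMP text is left as-is.
     */
    public static String toBmp(String text) {
        StringBuilder out = new StringBuilder(text.length());
        text.codePoints().forEach(cp -> {
            if (!Character.isSupplementaryCodePoint(cp)) {
                out.appendCodePoint(cp);
                return;
            }
            String original = new String(Character.toChars(cp));
            String folded = Normalizer.normalize(original, Normalizer.Form.NFKC);
            boolean bmpOnly = folded.codePoints()
                                    .noneMatch(Character::isSupplementaryCodePoint);
            out.append(bmpOnly ? folded : "\uFFFD"); // fallback when no BMP equivalent exists
        });
        return out.toString();
    }
}
```

With this sketch, U+1D400 (MATHEMATICAL BOLD CAPITAL A) becomes a plain `A`, while a character without a compatibility mapping, such as U+10400, falls back to U+FFFD.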

Of course, I welcome any other ideas. It would even be fine to replace the supplementary characters with some kind of placeholder and then put the original characters back in afterwards, but this seems error-prone. Better ideas?
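A rough sketch of what such a placeholder round trip could look like, assuming the documents never contain the private-use characters U+E000 and U+E001, which are used here as delimiters. The class name and the token format are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public final class SupplementaryPlaceholder {
    // Assumption: U+E000 and U+E001 (private use) never occur in the input.
    private static final char OPEN = '\uE000';
    private static final char CLOSE = '\uE001';
    private static final Pattern TOKEN = Pattern.compile("\uE000([0-9a-f]{5,6})\uE001");

    private SupplementaryPlaceholder() {}

    /** Replaces each supplementary code point with OPEN + lowercase hex + CLOSE. */
    public static String encode(String text) {
        StringBuilder out = new StringBuilder(text.length());
        text.codePoints().forEach(cp -> {
            if (Character.isSupplementaryCodePoint(cp)) {
                out.append(OPEN).append(Integer.toHexString(cp)).append(CLOSE);
            } else {
                out.appendCodePoint(cp);
            }
        });
        return out.toString();
    }

    /** Restores the original characters from tokens written by encode. */
    public static String decode(String text) {
        Matcher m = TOKEN.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            int cp = Integer.parseInt(m.group(1), 16);
            // Leave anything that is not a valid supplementary code point untouched.
            String restored = Character.isSupplementaryCodePoint(cp)
                    ? new String(Character.toChars(cp))
                    : m.group();
            m.appendReplacement(out, Matcher.quoteReplacement(restored));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

The error-prone part remains: offsets and lengths reported by the parser refer to the encoded text, so they no longer line up with the original bytes.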

Edit: Here is a (hopefully) minimal example of how this issue comes up with VTD-XML:

@Test
public void parseUnicodeBeyondBMP() throws NavException, FileNotFoundException, IOException, EncodingException, EOFException, EntityException, ParseException {
    // character code point 0x10400
    String unicode = "<supplementary>\uD801\uDC00</supplementary>";
    // encode explicitly as UTF-8 so the test does not depend on the platform default charset
    byte[] unicodeBytes = unicode.getBytes("UTF-8");
    assertEquals(unicode, new String(unicodeBytes, "UTF-8"));

    VTDGen vg = new VTDGen();
    vg.setDoc(unicodeBytes);
    vg.parse(false);
    VTDNav vn = vg.getNav();
    long fragment = vn.getContentFragment();
    int offset = (int) fragment;
    int length = (int) (fragment >> 32);
    String originalBytePortion = new String(Arrays.copyOfRange(unicodeBytes, offset, offset+length), "UTF-8");
    String vtdString = vn.toRawString(offset, length);
    // this actually succeeds
    assertEquals("\uD801\uDC00", originalBytePortion);
    // this fails ;-( the returned character is Ѐ, code point 0x400, so only the lower 16 bits of the code point seem to be kept
    assertEquals("\uD801\uDC00", vtdString);
}
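The numbers in the test can be checked with plain Java, independently of VTD-XML. The sketch below only reproduces the arithmetic that matches the observed result; it makes no claim about what the parser does internally. The class and method names are illustrative.

```java
public final class SurrogateArithmetic {
    private SurrogateArithmetic() {}

    /** Combines a surrogate pair into the code point it encodes. */
    public static int decodePair(char high, char low) {
        return Character.toCodePoint(high, low);
    }

    /** Keeps only the lower 16 bits of a code point. */
    public static int lower16Bits(int codePoint) {
        return codePoint & 0xFFFF;
    }

    public static void main(String[] args) {
        int codePoint = decodePair('\uD801', '\uDC00');
        System.out.printf("pair      -> U+%04X%n", codePoint);              // U+10400
        System.out.printf("truncated -> U+%04X%n", lower16Bits(codePoint)); // U+0400
    }
}
```

U+0400 is the Cyrillic letter Ѐ, which is exactly the character the failing assertion reports.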

Source: https://stackoverflow.com/questions/41808207/map-supplementary-unicode-characters-to-bmp-if-possible
