Replacing Unicode punctuation with ASCII approximations

梦谈多话 2020-12-01 16:23

I am reading some text files in a Java program and would like to replace some Unicode characters with ASCII approximations. These files will eventually be broken into sente

6 answers
  •  野趣味 (OP)
    2020-12-01 16:38

    While this does not exactly answer your question, you can convert your Unicode text to US-ASCII, replacing each non-ASCII character with a '?' symbol.

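    import java.nio.ByteBuffer;
    import java.nio.CharBuffer;
    import java.nio.charset.CharacterCodingException;
    import java.nio.charset.Charset;
    import java.nio.charset.CharsetEncoder;
    import java.nio.charset.CodingErrorAction;
    import java.nio.charset.StandardCharsets;

    String input = "aáeéiíoóuú"; // 10 chars.

    Charset ch = StandardCharsets.US_ASCII;
    CharsetEncoder enc = ch.newEncoder();
    // Substitute a replacement byte for unmappable characters instead of throwing.
    enc.onUnmappableCharacter(CodingErrorAction.REPLACE);
    enc.replaceWith(new byte[]{'?'});

    ByteBuffer out;
    try {
        out = enc.encode(CharBuffer.wrap(input));
    } catch (CharacterCodingException e) {
        // Only malformed input (e.g. an unpaired surrogate) can reach here;
        // fail loudly instead of continuing with a null buffer.
        throw new IllegalStateException(e);
    }

    String outStr = ch.decode(out).toString();

    // Prints "a?e?i?o?u?"
    System.out.println(outStr);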
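
The encoder above discards information, since every non-ASCII character becomes '?'. If approximations are wanted instead, one possible approach is to substitute common punctuation from a small lookup table, decompose the rest with java.text.Normalizer, and drop the combining marks. The sketch below is only an illustration: the class name AsciiApprox, the method toAscii, and the contents of the mapping table are assumptions, not something from the question or the answer above, and the table would need extending for real input.

```java
import java.text.Normalizer;
import java.util.Map;

// Hypothetical helper for illustration; not part of the original answer.
class AsciiApprox {

    // Hand-picked substitutions (an assumption; extend to suit your data).
    private static final Map<Character, String> PUNCTUATION = Map.of(
        '\u2018', "'",    // left single quotation mark
        '\u2019', "'",    // right single quotation mark
        '\u201C', "\"",   // left double quotation mark
        '\u201D', "\"",   // right double quotation mark
        '\u2013', "-",    // en dash
        '\u2014', "--",   // em dash
        '\u2026', "...",  // horizontal ellipsis
        '\u00A0', " "     // no-break space
    );

    static String toAscii(String input) {
        // Step 1: apply the explicit punctuation table.
        StringBuilder mapped = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            String replacement = PUNCTUATION.get(c);
            if (replacement != null) {
                mapped.append(replacement);
            } else {
                mapped.append(c);
            }
        }
        // Step 2: compatibility-decompose, so an accented letter becomes
        // a base letter followed by combining marks.
        String decomposed = Normalizer.normalize(mapped, Normalizer.Form.NFKD);
        // Step 3: drop the combining marks, then fall back to '?' for
        // anything that is still outside the ASCII range.
        return decomposed
            .replaceAll("\\p{M}+", "")
            .replaceAll("[^\\x00-\\x7F]", "?");
    }

    public static void main(String[] args) {
        // Curly quotes, accented letters, an en dash and an ellipsis.
        String sample = "\u201CCaf\u00E9\u201D \u2013 d\u00E9j\u00E0 vu\u2026";
        // Prints "Cafe" - deja vu...
        System.out.println(toAscii(sample));
    }
}
```

Map.of requires Java 9 or later. Characters with no decomposition and no table entry (CJK text, for example) still end up as '?', so this narrows the loss rather than removing it.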
    
