I am reading some text files in a Java program and would like to replace some Unicode characters with ASCII approximations. These files will eventually be broken into sente
While this does not exactly answer your question, you can convert your Unicode text to US-ASCII, replacing non-ASCII characters with '?' symbols.
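import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

String input = "aáeéiíoóuú"; // 10 chars.
Charset ch = StandardCharsets.US_ASCII;
CharsetEncoder enc = ch.newEncoder();
// Replace unmappable characters and malformed input (e.g. unpaired surrogates).
enc.onUnmappableCharacter(CodingErrorAction.REPLACE);
enc.onMalformedInput(CodingErrorAction.REPLACE);
enc.replaceWith(new byte[]{'?'}); // explicit; '?' is usually the default already
ByteBuffer out;
try {
    out = enc.encode(CharBuffer.wrap(input));
} catch (CharacterCodingException e) {
    // Not expected with both actions set to REPLACE; fail loudly instead of decoding null.
    throw new IllegalStateException(e);
}
String outStr = ch.decode(out).toString();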
// Prints "a?e?i?o?u?"
System.out.println(outStr);
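
If you want approximations rather than '?' for accented letters, one possible sketch is to decompose the text with java.text.Normalizer and drop the combining marks before falling back to '?'. The class name AsciiApprox and the helper toAsciiApprox are made up for illustration, and this only helps for characters that decompose into an ASCII base letter plus marks; something like 'ß' or 'ø' still ends up as '?'.

```java
import java.text.Normalizer;

public class AsciiApprox {
    // Hypothetical helper, not part of any library.
    static String toAsciiApprox(String s) {
        // NFD splits e.g. 'á' into 'a' followed by a combining acute accent.
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        // Drop the combining marks, keeping the base letters.
        String stripped = decomposed.replaceAll("\\p{M}+", "");
        // Whatever is still outside ASCII has no simple approximation here.
        return stripped.replaceAll("[^\\x00-\\x7F]", "?");
    }

    public static void main(String[] args) {
        // Same input as above, written with escapes to avoid source-encoding issues.
        String input = "a\u00e1e\u00e9i\u00edo\u00f3u\u00fa";
        System.out.println(toAsciiApprox(input)); // Prints "aaeeiioouu"
    }
}
```

Doing the normalization first and the '?' replacement last means the fallback only applies to characters that really have no base-letter decomposition.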