Java Replace Unicode Characters in a String

邮差的信 submitted on 2019-12-04 19:23:54
Paul

The correct way to do this is to use a regex that matches the entire unicode escape sequence and then apply a group replacement.

The regex to match the unicode-string:

A unicode escape looks like \uABCD: a \u followed by four hexadecimal digits. These can be matched using

\\u[A-Fa-f\d]{4}
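As a quick check, the pattern can be compiled in Java and run over a sample string. This is only a minimal sketch: the class name UnicodeEscapeFinder and the helper findEscapes are made up for illustration, and every backslash of the regex is doubled once more for the Java string literal.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnicodeEscapeFinder {
    // Regex: one literal backslash, the letter u, then four hex digits.
    static final Pattern ESCAPE = Pattern.compile("\\\\u[A-Fa-f\\d]{4}");

    // Returns every substring of the input that has the shape of a unicode escape.
    static List<String> findEscapes(String input) {
        List<String> found = new ArrayList<>();
        Matcher m = ESCAPE.matcher(input);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // The doubled backslashes in the literal are single backslashes at runtime.
        System.out.println(findEscapes("caf\\u00e9 and \\u4F60")); // prints the two escapes found
    }
}
```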

But there's a problem with this:
In a String like "just some \\uabcd arbitrary text" the \u would still get matched, even though its backslash is itself escaped by the backslash in front of it. So we need to make sure the \u is preceded by an even number of backslashes:

(?<!\\)(\\\\)*\\u[A-Fa-f\d]{4}
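A small sketch of how the lookbehind version behaves with one, two and three backslashes in front of the u. EscapedBackslashCheck and containsEscape are hypothetical names used only for this illustration.

```java
import java.util.regex.Pattern;

class EscapedBackslashCheck {
    // Negative lookbehind for a backslash, any number of backslash pairs,
    // then backslash + u + four hex digits (all doubled for the Java literal).
    static final Pattern ESCAPE =
            Pattern.compile("(?<!\\\\)(\\\\\\\\)*\\\\u[A-Fa-f\\d]{4}");

    // True if the input contains an escape whose backslash is not itself escaped.
    static boolean containsEscape(String input) {
        return ESCAPE.matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(containsEscape("one \\uabcd"));       // one backslash: true
        System.out.println(containsEscape("two \\\\uabcd"));     // two backslashes: false
        System.out.println(containsEscape("three \\\\\\uabcd")); // three backslashes: true
    }
}
```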

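Now as output, we want a backslash followed by the hex part. This can be done with group replacement, so let's start by grouping the characters. The run of backslash pairs goes into one capturing group wrapped around a non-capturing repetition, because a repeated capturing group such as (\\\\)* only keeps its last iteration and would drop all but one pair:

(?<!\\)((?:\\\\)*)(\\u)([A-Fa-f\d]{4})

As a replacement we want all backslashes from the first group (the escaped pairs), followed by a backslash and the hex part of the unicode literal:

$1\\$3

Now for the actual code:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// test is the input string to process
String pattern = "(?<!\\\\)((?:\\\\\\\\)*)(\\\\u)([A-Fa-f\\d]{4})";
String replace = "$1\\\\$3";

Matcher match = Pattern.compile(pattern).matcher(test);
String result = match.replaceAll(replace);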

That's a lot of backslashes! The reason is that a backslash has to be escaped twice: once for the Java string literal and once for the regex. So "\\\\" as a pattern string in Java source matches a single \ in the input.
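To try the replacement end to end, here is a self-contained sketch. The names UnicodeEscapeRewriter and dropU are invented for this example. It assumes the run of backslash pairs is captured as a whole by putting a non-capturing repetition inside group 1, so that $1 restores every pair and not just the last one.

```java
import java.util.regex.Pattern;

class UnicodeEscapeRewriter {
    // Group 1: all escaped backslash pairs, group 2: backslash + u, group 3: hex digits.
    static final Pattern ESCAPE =
            Pattern.compile("(?<!\\\\)((?:\\\\\\\\)*)(\\\\u)([A-Fa-f\\d]{4})");

    // Keeps group 1, then emits a single backslash followed by the hex digits.
    static String dropU(String input) {
        return ESCAPE.matcher(input).replaceAll("$1\\\\$3");
    }

    public static void main(String[] args) {
        System.out.println(dropU("caf\\u00e9"));        // prints caf\00e9
        System.out.println(dropU("escaped \\\\uabcd")); // escaped backslash: input is printed unchanged
    }
}
```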

EDIT:
On actual strings, where the characters are already decoded rather than written out as escape sequences, the characters need to be filtered out and replaced by their hexadecimal code:

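// in is the input string
StringBuilder sb = new StringBuilder();
for (char c : in.toCharArray()) {
    if (c > 127) {
        // non-ASCII: emit a backslash followed by the four-digit hex code
        sb.append("\\").append(String.format("%04x", (int) c));
    } else {
        sb.append(c); // ASCII, including control characters, is kept as is
    }
}
String result = sb.toString();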

This assumes that by "unicode-character" you mean non-ASCII characters. The code prints any ASCII character as is and outputs every other character as a backslash followed by its four-digit hex code. The term "unicode-character" is rather vague though, since every char in Java is a unicode (UTF-16) code unit. This approach preserves control characters like "\n", "\r", etc., which is why I chose it over other definitions.
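Wrapped in a method, the loop can be tried on a sample string. This is a sketch under the same assumption (everything above 127 counts as a "unicode-character"). NonAsciiEscaper and escapeNonAscii are hypothetical names. Note that a character outside the Basic Multilingual Plane is stored as two chars, so it comes out as two four-digit codes.

```java
class NonAsciiEscaper {
    // Replaces every char above 127 with a backslash and its four-digit hex code.
    static String escapeNonAscii(String in) {
        StringBuilder sb = new StringBuilder();
        for (char c : in.toCharArray()) {
            if (c > 127) {
                sb.append('\\').append(String.format("%04x", (int) c));
            } else {
                sb.append(c); // ASCII, including control characters, is copied as is
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The input is "caf", an e with acute accent (code point 00e9), and a newline.
        System.out.print(escapeNonAscii("caf\u00e9\n")); // prints caf\00e9 and keeps the newline
    }
}
```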

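Alternatively, try the String.replaceAll() method. Both arguments need their backslashes escaped for the Java string literal as well as for the regex engine:

s = s.replaceAll("\\\\u", "\\\\");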
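If no regex features are needed, String.replace() takes both arguments as plain text, which cuts down on the escaping. The sketch below uses the made-up names LiteralReplaceDemo and dropU. Unlike the lookbehind pattern, it does not check for hex digits and it also rewrites a \u whose backslash is itself escaped.

```java
class LiteralReplaceDemo {
    // Literal replacement: every backslash + u becomes a single backslash.
    static String dropU(String s) {
        return s.replace("\\u", "\\");
    }

    public static void main(String[] args) {
        System.out.println(dropU("caf\\u00e9"));  // prints caf\00e9
        // With an escaped backslash in front, the second backslash and the u are still replaced.
        System.out.println(dropU("x \\\\uabcd"));
    }
}
```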
