How to remove surrogate characters in Java?

时光说笑 2020-12-14 04:16

I am facing a situation where I get surrogate characters in text that I am saving to MySQL 5.1. As UTF-16 is not supported there, I want to remove these surrogate pai

5 answers
  •  慢半拍i (original poster)
    2020-12-14 04:25

    Here are a couple of things:

    • Character.isSurrogate(char c):

      A char value is a surrogate code unit if and only if it is either a low-surrogate code unit or a high-surrogate code unit.

    • Checking for pairs seems pointless. Why not just remove all surrogates, paired or not?

    • x == false is equivalent to !x

    • StringBuilder is a better choice than StringBuffer when you don't need synchronization (for example, a variable that never leaves local scope).

    I suggest this:

    public static String removeSurrogates(String query) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < query.length(); i++) {
            char c = query.charAt(i);
            // On Java 7+ this condition can be written as !Character.isSurrogate(c)
            if (!(Character.isHighSurrogate(c) || Character.isLowSurrogate(c))) {
                // Keep only code units that are not surrogates
                sb.append(c);
            }
        }
        return sb.toString();
    }
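
As a quick check, the method above can be exercised on a string that contains a supplementary character. This is only a minimal sketch: the class name SurrogateDemo and the sample strings are assumptions made for illustration, and it uses Character.isSurrogate, which requires Java 7 or later.

```java
// Hypothetical demo class; the name and the sample inputs are illustrative only.
class SurrogateDemo {

    // Drops every UTF-16 surrogate code unit, whether it is part of a pair or not.
    static String removeSurrogates(String query) {
        StringBuilder sb = new StringBuilder(query.length());
        for (int i = 0; i < query.length(); i++) {
            char c = query.charAt(i);
            if (!Character.isSurrogate(c)) { // Java 7+
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+1F600 is encoded in UTF-16 as the surrogate pair D83D DE00.
        String withEmoji = "a\uD83D\uDE00b";
        System.out.println(withEmoji.length());           // 4 (code units, not characters)
        System.out.println(removeSurrogates(withEmoji));  // ab
        // A lone, unpaired high surrogate is removed as well.
        System.out.println(removeSurrogates("x\uD83Dy")); // xy
    }
}
```

Note that the whole supplementary character disappears, since both halves of the pair are surrogates; nothing is substituted in its place.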
    

    Breaking down the if statement

    You asked about this statement:

    if (!(Character.isHighSurrogate(c) || Character.isLowSurrogate(c))) {
        sb.append(c);
    }
    

    One way to understand it is to break each operation into its own function, so you can see that the combination does what you'd expect:

    static boolean isSurrogate(char c) {
        return Character.isHighSurrogate(c) || Character.isLowSurrogate(c);
    }
    
    static boolean isNotSurrogate(char c) {
        return !isSurrogate(c);
    }
    
    ...
    
    if (isNotSurrogate(c)) {
        sb.append(c);
    }
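
Another way to read the condition is through De Morgan's law: !(a || b) is the same as !a && !b, so a char is kept only when it is neither a high surrogate nor a low surrogate. The sketch below compares the two forms over every possible char value; the class and method names are assumptions made for illustration.

```java
// Hypothetical helper class; all names here are illustrative only.
class SurrogateCheck {

    // The form used in the answer: not (high or low).
    static boolean keepOriginal(char c) {
        return !(Character.isHighSurrogate(c) || Character.isLowSurrogate(c));
    }

    // De Morgan rewrite: (not high) and (not low).
    static boolean keepDeMorgan(char c) {
        return !Character.isHighSurrogate(c) && !Character.isLowSurrogate(c);
    }

    // Counts the char values on which the two forms disagree.
    static int countMismatches() {
        int mismatches = 0;
        // An int counter avoids the overflow a char loop variable would hit at 0xFFFF.
        for (int i = Character.MIN_VALUE; i <= Character.MAX_VALUE; i++) {
            char c = (char) i;
            if (keepOriginal(c) != keepDeMorgan(c)) {
                mismatches++;
            }
        }
        return mismatches;
    }

    // Counts the char values that the condition would drop.
    static int countDropped() {
        int dropped = 0;
        for (int i = Character.MIN_VALUE; i <= Character.MAX_VALUE; i++) {
            if (!keepOriginal((char) i)) {
                dropped++;
            }
        }
        return dropped;
    }

    public static void main(String[] args) {
        System.out.println(countMismatches()); // 0
        System.out.println(countDropped());    // 2048 (U+D800 through U+DFFF)
    }
}
```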
    
