How to remove surrogate characters in Java?

时光说笑 2020-12-14 04:16

I am facing a situation where I get surrogate characters in text that I am saving to MySQL 5.1. As UTF-16 is not supported there, I want to remove these surrogate pai

5 answers
  •  南方客 (original poster)
    2020-12-14 04:27

    Java strings are stored as sequences of 16-bit chars, but what they represent is sequences of Unicode characters. In Unicode terminology, they are stored as code units but model code points. Thus, it is somewhat meaningless to talk about removing surrogates, which don't exist in the character / code point representation (unless you have rogue unpaired surrogates, in which case you have other problems).

    Rather, what you want to do is remove any characters that will require surrogates when encoded in UTF-16, that is, any character that lies beyond the Basic Multilingual Plane (code points above U+FFFF). You can do that with a simple regular expression:

    return query.replaceAll("[^\u0000-\uffff]", "");
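
For context, here is a minimal, self-contained sketch of how that one-liner might be wrapped and tried out. The class name `SupplementaryStripper` and both method names are hypothetical, chosen only for illustration. The first method uses the same regular expression, written with regex-level escapes (`\\u0000`) rather than source-level ones. The second is an alternative that was not part of the answer: it walks the string by code point and, as an assumption about what you might want, also drops unpaired surrogates, which the regex leaves in place.

```java
// Hypothetical helper class; the names are illustrative and not from the answer.
final class SupplementaryStripper {

    // Same idea as the one-liner: java.util.regex matches by code point, so a
    // valid surrogate pair is seen as one character above U+FFFF and removed.
    // Unpaired surrogates fall inside the BMP range and are left untouched.
    static String stripWithRegex(String text) {
        return text.replaceAll("[^\\u0000-\\uFFFF]", "");
    }

    // Alternative sketch: keep only BMP code points that are not surrogates,
    // which additionally removes any unpaired surrogate code units.
    static String stripByCodePoint(String text) {
        StringBuilder out = new StringBuilder(text.length());
        text.codePoints()
            .filter(cp -> Character.isBmpCodePoint(cp) && !Character.isSurrogate((char) cp))
            .forEach(out::appendCodePoint);
        return out.toString();
    }

    public static void main(String[] args) {
        // 'a', then U+1F600 written as a surrogate pair, then 'b'
        String input = "a\uD83D\uDE00b";
        System.out.println(stripWithRegex(input));
        System.out.println(stripByCodePoint(input));
    }
}
```

Which variant fits depends on whether stray unpaired surrogates can appear in your input. If they can, the code point loop is probably the safer choice before writing to the database.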
    
