remove non-UTF-8 characters from xml with declared encoding=utf-8 - Java

再見小時候 2020-12-13 14:42

I have to handle this scenario in Java:

I'm getting a request in XML form from a client with declared encoding=utf-8. Unfortunately it may contain non-UTF-8 characters...

6 answers
  •  刺人心 (original poster)
    2020-12-13 15:23

    1) I get xml as java String with £ in it (I don't have access to interface right now, but I probably get xml as a java String). Can I use replaceAll(£, "") to get rid of this character?

    I assume you actually want to get rid of all non-ASCII characters, since you mention a "legacy" side. You can remove anything outside the printable ASCII range with the following regex (note that it also strips tabs and line breaks, which fall outside \x20-\x7e):

    string = string.replaceAll("[^\\x20-\\x7e]", "");
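
To make the effect of that regex concrete, here is a minimal sketch. The class and method names (`AsciiFilter`, `stripNonPrintableAscii`, `stripKeepingWhitespace`) are hypothetical and not part of the original answer; the second method is an assumed variant for cases where tabs and line breaks should survive.

```java
class AsciiFilter {
    // Hypothetical helper: removes every char outside printable ASCII (0x20-0x7E).
    static String stripNonPrintableAscii(String s) {
        return s.replaceAll("[^\\x20-\\x7e]", "");
    }

    // Assumed variant: additionally keeps tab, line feed and carriage return.
    static String stripKeepingWhitespace(String s) {
        return s.replaceAll("[^\\x09\\x0a\\x0d\\x20-\\x7e]", "");
    }

    public static void main(String[] args) {
        // \u00a3 is the pound sign; the input also contains a line break.
        String xml = "<price>\u00a3 10</price>\n<ok/>";
        System.out.println(stripNonPrintableAscii(xml));
        System.out.println(stripKeepingWhitespace(xml));
    }
}
```

The first call drops both the pound sign and the line break, while the second drops only the pound sign.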
    

    2) I get xml as an array of bytes - how to handle this operation safely in that case?

    You need to wrap the byte[] in a ByteArrayInputStream so that you can read it as a UTF-8 encoded character stream using InputStreamReader, specifying the encoding explicitly, and then use a BufferedReader to read it line by line.

    E.g.

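    // Needs java.io.* and java.nio.charset.StandardCharsets; the enclosing method must handle IOException.
    try (BufferedReader reader = new BufferedReader(
            new InputStreamReader(new ByteArrayInputStream(bytes), StandardCharsets.UTF_8))) {
        for (String line; (line = reader.readLine()) != null;) {
            // Drop everything outside the printable ASCII range.
            line = line.replaceAll("[^\\x20-\\x7e]", "");
            // ...
        }
        // ...
    }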
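
If the goal is only to drop bytes that are not valid UTF-8 while keeping legitimate non-ASCII characters, a different technique from the one above is a `CharsetDecoder` configured to ignore malformed input. This is a sketch under that assumption, not the approach of the answer; `Utf8Sanitizer` and `decodeDroppingInvalid` are hypothetical names.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class Utf8Sanitizer {
    // Hypothetical helper: decodes bytes as UTF-8 and silently drops
    // byte sequences that are not valid UTF-8.
    static String decodeDroppingInvalid(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.IGNORE)
                .onUnmappableCharacter(CodingErrorAction.IGNORE);
        try {
            return decoder.decode(ByteBuffer.wrap(bytes)).toString();
        } catch (CharacterCodingException e) {
            // Not expected with IGNORE, but decode() declares it.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // 'a', a lone 0xA3 byte (the pound sign in ISO-8859-1, invalid on its own in UTF-8), 'b'
        byte[] bytes = {'a', (byte) 0xA3, 'b'};
        System.out.println(decodeDroppingInvalid(bytes));
    }
}
```

Unlike the ASCII regex, this keeps a correctly encoded pound sign (bytes 0xC2 0xA3) intact and removes only the bytes that cannot be decoded.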
    
