remove non-UTF-8 characters from xml with declared encoding=utf-8 - Java

再見小時候 2020-12-13 14:42

I have to handle this scenario in Java:

I'm getting a request in XML form from a client with declared encoding=utf-8. Unfortunately, it may contain non-UTF-8 characters...

6 answers
  •  庸人自扰
    2020-12-13 15:29

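    A Java String is always UTF-16 internally, so once the byte array has been decoded as UTF-8 the malformed bytes are already gone (the decoder replaces them with U+FFFD). To filter at the byte level with a regular expression, decode the bytes as ISO-8859-1 first, so that every char stands for exactly one byte, and keep only what matches the UTF-8 patterns. The patterns check only the lead-byte/continuation-byte structure, so an overlong form such as F0 80 80 80 still passes and ends up as U+FFFD:

    import java.nio.charset.StandardCharsets;

    // Each char stands for one raw byte, as after new String(bytes, StandardCharsets.ISO_8859_1).
    String[] values = {"\u00F0\u009F\u0098\u0095", "\u00F0\u009F\u0091\u008C", "/*", "look into my eyes \u00E3\u0080\u00A0.\u00E3\u0080\u00A0", "fkdjsf ksdjfslk", "\u00F0\u0080\u0080\u0080", "aa \u00F0\u009F\u0098\u0095 aa", "aa \u00FF\u0098 aa"};
    for (int i = 0; i < values.length; i++) {
        // Keep well-formed sequences (group 1); drop every other byte.
        String kept = values[i].replaceAll("(" +
                        "[\\x00-\\x7F]|" + //single-byte sequences   0xxxxxxx
                        "[\\xC0-\\xDF][\\x80-\\xBF]|" + //double-byte sequences   110xxxxx 10xxxxxx
                        "[\\xE0-\\xEF][\\x80-\\xBF]{2}|" + //triple-byte sequences   1110xxxx 10xxxxxx * 2
                        "[\\xF0-\\xF7][\\x80-\\xBF]{3}" + //quadruple-byte sequence 11110xxx 10xxxxxx * 3
                        ")|[\\s\\S]", "$1");
        // Turn the surviving bytes back into text.
        System.out.println(new String(kept.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8));
    }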
    

    or, if you want to check whether the input consists only of well-formed UTF-8 sequences, use Pattern.matches on a string that holds one char per byte (bytes decoded as ISO-8859-1):

    import java.util.regex.Pattern;

    // Each char stands for one raw byte (bytes decoded as ISO-8859-1).
    String[] values = {"\u00F0\u009F\u0098\u0095", "\u00F0\u009F\u0091\u008C", "/*", "look into my eyes \u00E3\u0080\u00A0.\u00E3\u0080\u00A0", "fkdjsf ksdjfslk", "\u00F0\u0080\u0080\u0080", "aa \u00F0\u009F\u0098\u0095 aa", "aa \u00FF\u0098 aa"};
    for (int i = 0; i < values.length; i++) {
        // Prints true only if the whole input is a run of well-formed sequences.
        System.out.println(Pattern.matches(
                        "(?:" +
                        "[\\x00-\\x7F]|" + //single-byte sequences   0xxxxxxx
                        "[\\xC0-\\xDF][\\x80-\\xBF]|" + //double-byte sequences   110xxxxx 10xxxxxx
                        "[\\xE0-\\xEF][\\x80-\\xBF]{2}|" + //triple-byte sequences   1110xxxx 10xxxxxx * 2
                        "[\\xF0-\\xF7][\\x80-\\xBF]{3}" //quadruple-byte sequence 11110xxx 10xxxxxx * 3
                        + ")*"
                , values[i]));
    }
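
For a stricter check than the structural patterns, one option is to let the JDK's own UTF-8 decoder report malformed input. The sketch below is only an illustration: the class name Utf8Check and the helper isValidUtf8 are hypothetical and not part of the original answer. Unlike the regular expression, the decoder also rejects overlong forms such as F0 80 80 80.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class Utf8Check {
    // Hypothetical helper: returns true only if the bytes are well-formed UTF-8.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            // The decoder hit a byte sequence that is not valid UTF-8.
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] wellFormed = "look into my eyes \u3020.\u3020".getBytes(StandardCharsets.UTF_8);
        byte[] overlong = {(byte) 0xF0, (byte) 0x80, (byte) 0x80, (byte) 0x80};
        System.out.println(isValidUtf8(wellFormed));
        System.out.println(isValidUtf8(overlong));
    }
}
```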
    

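    If you have the byte array available, then you can apply the same filter while reading it, before anything is decoded as UTF-8:

    import java.io.*;
    import java.nio.charset.StandardCharsets;

    // bytes is the raw request body; cleaned collects the filtered text (line breaks become \n).
    StringBuilder cleaned = new StringBuilder();
    // ISO-8859-1 maps every byte to one char, so nothing is lost before filtering.
    try (BufferedReader bufferedReader = new BufferedReader(new InputStreamReader(new ByteArrayInputStream(bytes), StandardCharsets.ISO_8859_1))) {
        for (String currentLine; (currentLine = bufferedReader.readLine()) != null;) {
            // Keep well-formed sequences (group 1); drop every other byte.
            currentLine = currentLine.replaceAll("(" +
                            "[\\x00-\\x7F]|" + //single-byte sequences   0xxxxxxx
                            "[\\xC0-\\xDF][\\x80-\\xBF]|" + //double-byte sequences   110xxxxx 10xxxxxx
                            "[\\xE0-\\xEF][\\x80-\\xBF]{2}|" + //triple-byte sequences   1110xxxx 10xxxxxx * 2
                            "[\\xF0-\\xF7][\\x80-\\xBF]{3}" + //quadruple-byte sequence 11110xxx 10xxxxxx * 3
                            ")|[\\s\\S]", "$1");
            // Decode the surviving bytes as UTF-8.
            cleaned.append(new String(currentLine.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8)).append('\n');
        }
    } catch (IOException e) {
        throw new UncheckedIOException(e);
    }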
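
With the raw bytes in hand, a different technique from the regular expression is to let the JDK's UTF-8 decoder drop malformed input itself. This is a minimal sketch under that assumption: the class name Utf8Sanitizer and the helper stripMalformed are hypothetical, and whether silently discarding bytes is acceptable depends on the application.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class Utf8Sanitizer {
    // Hypothetical helper: decodes bytes as UTF-8 and silently drops malformed sequences.
    static String stripMalformed(byte[] bytes) {
        try {
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.IGNORE)
                    .onUnmappableCharacter(CodingErrorAction.IGNORE)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
        } catch (CharacterCodingException e) {
            // Not expected: with IGNORE the decoder skips bad input instead of reporting it.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // "aa ", a stray 0xFF, a valid four-byte sequence, " ", a stray 0xC0, "a"
        byte[] dirty = {'a', 'a', ' ', (byte) 0xFF,
                (byte) 0xF0, (byte) 0x9F, (byte) 0x98, (byte) 0x95,
                ' ', (byte) 0xC0, 'a'};
        System.out.println(stripMalformed(dirty));
    }
}
```

The cleaned String can then be re-encoded with getBytes(StandardCharsets.UTF_8) before it is handed to the XML parser.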
    

    To make a whole web app UTF-8 compatible, read here:
    How to get UTF-8 working in Java webapps
    More on Byte Encodings and Strings.
    You can check your pattern here.
    The same in PHP here.
