Guessing the encoding of text represented as byte[] in Java

感动是毒 2020-11-28 07:45

Given an array of bytes representing text in some unknown encoding (usually UTF-8 or ISO-8859-1, but not necessarily so), what is the best way to obtain a guess for the most

7 Answers
  • 2020-11-28 08:21

    The following method solves the problem using juniversalchardet, which is a Java port of Mozilla's encoding detection library.

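    // Requires juniversalchardet on the classpath.
    public static String guessEncoding(byte[] bytes) {
        // Fallback used when the detector cannot decide.
        final String DEFAULT_ENCODING = "UTF-8";
        org.mozilla.universalchardet.UniversalDetector detector =
            new org.mozilla.universalchardet.UniversalDetector(null);
        // Feed the whole buffer, then signal end of input so the detector finalizes its guess.
        detector.handleData(bytes, 0, bytes.length);
        detector.dataEnd();
        // getDetectedCharset() returns null when no charset could be determined.
        String encoding = detector.getDetectedCharset();
        detector.reset();
        if (encoding == null) {
            encoding = DEFAULT_ENCODING;
        }
        return encoding;
    }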
    

    The code above has been tested and works as intended. Simply add juniversalchardet-1.0.3.jar to the classpath.

    I've tested both juniversalchardet and jchardet. My general impression is that juniversalchardet provides the better detection accuracy and the nicer API of the two libraries.
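
For the common case named in the question (input that is usually UTF-8 or ISO-8859-1), a dependency-free fallback can be sketched with the JDK alone. This is not what juniversalchardet does internally. It is a simpler technique: attempt a strict UTF-8 decode and treat any malformed byte sequence as a sign that the data is ISO-8859-1. The class name `Utf8OrLatin1` and its methods are hypothetical names chosen for illustration, and the sketch assumes no other encodings need to be distinguished.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: distinguishes only UTF-8 from ISO-8859-1.
class Utf8OrLatin1 {

    /** Returns UTF-8 if the bytes are well-formed UTF-8, otherwise ISO-8859-1. */
    static Charset guess(byte[] bytes) {
        // REPORT makes the decoder throw instead of silently substituting U+FFFD.
        CharsetDecoder strictUtf8 = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            strictUtf8.decode(ByteBuffer.wrap(bytes));
            return StandardCharsets.UTF_8;
        } catch (CharacterCodingException e) {
            // Not valid UTF-8; every byte sequence is valid ISO-8859-1, so fall back to it.
            return StandardCharsets.ISO_8859_1;
        }
    }

    /** Decodes the bytes using the charset chosen by guess(). */
    static String decode(byte[] bytes) {
        return new String(bytes, guess(bytes));
    }
}
```

Pure ASCII input is valid in both encodings and is reported as UTF-8 here, which is harmless because the decoded text is identical. Text in other single-byte encodings (for example windows-1252) will be mislabelled as ISO-8859-1, so a statistical detector such as the one above remains the better choice when the set of possible encodings is open.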
