Java : How to determine the correct charset encoding of a stream

花落未央 2020-11-22 02:06

With reference to the following thread: Java App : Unable to read iso-8859-1 encoded file correctly

What is the best way to programmatically determine the correct cha

15 answers
  •  夕颜
    夕颜 (original poster)
    2020-11-22 02:18

    If you use ICU4J (http://icu-project.org/apiref/icu4j/), its CharsetDetector class can guess the encoding for you.

    Here is my code:

    import com.ibm.icu.text.CharsetDetector;
    import com.ibm.icu.text.CharsetMatch;
    import java.nio.file.Files;

    String charset = "ISO-8859-1"; // Default charset, put whatever you want

    /*
     * Read the whole file into a byte array (file is a java.io.File).
     * A single FileInputStream.read(byte[]) call is not guaranteed to fill
     * the array and leaves the stream open, so Files.readAllBytes is used.
     */
    byte[] data = Files.readAllBytes(file.toPath());

    CharsetDetector detector = new CharsetDetector();
    detector.setText(data);

    // detect() returns the best match, or null if nothing matched
    CharsetMatch cm = detector.detect();

    if (cm != null) {
        int confidence = cm.getConfidence(); // 0 to 100
        System.out.println("Encoding: " + cm.getName() + " - Confidence: " + confidence + "%");
        // Here you have the encoding name and the confidence.
        // In my case, if the confidence is > 50 I return the detected encoding; otherwise I return the default value.
        if (confidence > 50) {
            charset = cm.getName();
        }
    }
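
Once a charset name has been chosen, the bytes still have to be decoded with it. The following is a minimal sketch of that last step using only the JDK. The class name `DetectedCharsetDecoder` and the helpers `resolve` and `decode` are hypothetical names introduced for illustration, and the fallback to the default charset when the JVM does not recognize the reported name is an assumption about how you might want to handle that case, not something the answer above specifies.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.StandardCharsets;

final class DetectedCharsetDecoder {

    // Hypothetical helper: turns a charset name into a Charset, returning the
    // fallback when the name is null, malformed, or unsupported by this JVM.
    static Charset resolve(String name, Charset fallback) {
        if (name == null) {
            return fallback;
        }
        try {
            return Charset.isSupported(name) ? Charset.forName(name) : fallback;
        } catch (IllegalCharsetNameException e) {
            return fallback;
        }
    }

    // Hypothetical helper: decodes the raw bytes with the detected charset name.
    static String decode(byte[] data, String detectedName, Charset fallback) {
        return new String(data, resolve(detectedName, fallback));
    }

    public static void main(String[] args) {
        // "caf\u00e9" encoded as UTF-8, standing in for the file content
        byte[] data = "caf\u00e9".getBytes(StandardCharsets.UTF_8);
        String text = decode(data, "UTF-8", StandardCharsets.ISO_8859_1);
        System.out.println(text.length());
    }
}
```

In the answer's code, `charset` would be passed as `detectedName` and the default (`ISO-8859-1` there) as the fallback.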
    

    Remember to add all the try-catch blocks you need.
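
As one possible shape for that error handling, here is a small sketch of reading the file's bytes inside a try-catch. It uses only the JDK. The class `SafeFileRead` and the method `readOrEmpty` are hypothetical names, and returning an empty array on failure is an assumed policy chosen for illustration; rethrowing or logging differently may suit your application better.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

final class SafeFileRead {

    // Hypothetical helper: returns the file's bytes, or an empty array
    // if the file cannot be read (missing file, permissions, I/O error).
    static byte[] readOrEmpty(Path path) {
        try {
            return Files.readAllBytes(path);
        } catch (IOException e) {
            System.err.println("Could not read " + path + ": " + e.getMessage());
            return new byte[0];
        }
    }

    public static void main(String[] args) throws IOException {
        // Write a one-byte temporary file, read it back, then clean up
        Path tmp = Files.createTempFile("charset-demo", ".txt");
        Files.write(tmp, new byte[] {(byte) 0xE9});
        System.out.println(readOrEmpty(tmp).length);
        Files.delete(tmp);
    }
}
```

With this policy, a caller that gets an empty array back would skip detection and keep the default charset.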

    I hope this works for you.
