How to read a text file with mixed encodings in Scala or Java?

日久生厌 2020-12-07 16:00

I am trying to parse a CSV file, ideally using weka.core.converters.CSVLoader. However, the file I have is not a valid UTF-8 file. It is mostly a UTF-8 file but some of the f

7 answers
  •  予麋鹿 (original poster)
    2020-12-07 16:10

    This is how I managed to do it in Java:

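        import java.io.BufferedReader;
        import java.io.FileInputStream;
        import java.io.IOException;
        import java.io.InputStreamReader;
        import java.nio.charset.CharsetDecoder;
        import java.nio.charset.CodingErrorAction;
        import java.nio.charset.StandardCharsets;

        public class ReadInvalidUtf8 {
            public static void main(String[] args) {
                String result = null;
                // Decode as UTF-8 and silently drop any malformed byte sequences.
                CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                        .onMalformedInput(CodingErrorAction.IGNORE);
                // try-with-resources closes the reader even if reading fails.
                try (BufferedReader reader = new BufferedReader(
                        new InputStreamReader(new FileInputStream("invalid.txt"), decoder))) {
                    StringBuilder sb = new StringBuilder();
                    String line;
                    while ((line = reader.readLine()) != null) {
                        // readLine() strips the line terminator; append a
                        // separator here if the line structure matters.
                        sb.append(line);
                    }
                    result = sb.toString();
                } catch (IOException e) {
                    // Also covers FileNotFoundException, a subclass of IOException.
                    e.printStackTrace();
                }
                System.out.println(result);
            }
        }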
    

    The invalid file is created with bytes:

    0x68, 0x80, 0x65, 0x6C, 0x6C, 0xC3, 0xB6, 0xFE, 0x20, 0x77, 0xC3, 0xB6, 0x9C, 0x72, 0x6C, 0x64, 0x94
    

    This is "hellö wörld" in UTF-8 with four invalid bytes (0x80, 0xFE, 0x9C, 0x94) mixed in.
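To reproduce the test file, the byte sequence listed above can be written out directly. This is only a minimal sketch: the class name `InvalidUtf8Sample` and the helper `write` are made up for illustration, and the file name `invalid.txt` is taken from the reading code.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper class, not part of the original answer.
final class InvalidUtf8Sample {
    // UTF-8 text with four stray bytes (0x80, 0xFE, 0x9C, 0x94) mixed in.
    static final byte[] BYTES = {
        0x68, (byte) 0x80, 0x65, 0x6C, 0x6C, (byte) 0xC3, (byte) 0xB6, (byte) 0xFE,
        0x20, 0x77, (byte) 0xC3, (byte) 0xB6, (byte) 0x9C, 0x72, 0x6C, 0x64, (byte) 0x94
    };

    // Writes the raw bytes to the given path, replacing any existing file.
    static void write(Path target) {
        try {
            Files.write(target, BYTES);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        write(Paths.get("invalid.txt"));
    }
}
```

Writing raw bytes rather than a string avoids any encoder touching the invalid values on the way out.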

    With CodingErrorAction.REPLACE, each invalid byte is replaced by the standard Unicode replacement character (U+FFFD):

    //"h�ellö� wö�rld�"
    

    With CodingErrorAction.IGNORE, the invalid bytes are dropped:

    //"hellö wörld"
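The two actions can also be compared without a file, by decoding the same bytes in memory. A minimal sketch, assuming a hypothetical helper named `decode`; `CharsetDecoder.decode(ByteBuffer)` itself is the standard JDK call.

```java
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original answer.
final class MalformedInputDemo {
    // The same 17 bytes as the invalid file.
    static final byte[] BYTES = {
        0x68, (byte) 0x80, 0x65, 0x6C, 0x6C, (byte) 0xC3, (byte) 0xB6, (byte) 0xFE,
        0x20, 0x77, (byte) 0xC3, (byte) 0xB6, (byte) 0x9C, 0x72, 0x6C, 0x64, (byte) 0x94
    };

    // Decodes the bytes as UTF-8, handling malformed input with the given action.
    static String decode(byte[] bytes, CodingErrorAction action) {
        try {
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(action)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
        } catch (CharacterCodingException e) {
            // Expected only when the action is CodingErrorAction.REPORT.
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(decode(BYTES, CodingErrorAction.REPLACE));
        System.out.println(decode(BYTES, CodingErrorAction.IGNORE));
    }
}
```

Run as is, this should reproduce the two strings shown above, first with replacement characters and then without.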
    

    Without calling onMalformedInput, the decoder keeps its default action, CodingErrorAction.REPORT, and you get:

    java.nio.charset.MalformedInputException: Input length = 1
        at java.nio.charset.CoderResult.throwException(Unknown Source)
        at sun.nio.cs.StreamDecoder.implRead(Unknown Source)
        at sun.nio.cs.StreamDecoder.read(Unknown Source)
        at java.io.InputStreamReader.read(Unknown Source)
        at java.io.BufferedReader.fill(Unknown Source)
        at java.io.BufferedReader.readLine(Unknown Source)
        at java.io.BufferedReader.readLine(Unknown Source)
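Because the decoder reports malformed input unless told otherwise, the exception can double as a strict UTF-8 check before deciding whether to ignore or replace bad bytes. A sketch under that assumption; `Utf8Check` and `isValidUtf8` are hypothetical names.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of the original answer.
final class Utf8Check {
    // Returns true only if every byte sequence decodes as well-formed UTF-8.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            // MalformedInputException is a subclass of CharacterCodingException.
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] clean = "hell\u00F6 w\u00F6rld".getBytes(StandardCharsets.UTF_8);
        byte[] broken = {0x68, (byte) 0x80, 0x65};
        System.out.println(isValidUtf8(clean));
        System.out.println(isValidUtf8(broken));
    }
}
```

Setting REPORT explicitly is redundant with the default, but it makes the intent visible to the reader.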
    
