How to read a text file with mixed encodings in Scala or Java?

日久生厌 2020-12-07 16:00

I am trying to parse a CSV file, ideally using weka.core.converters.CSVLoader. However, the file I have is not a valid UTF-8 file. It is mostly a UTF-8 file but some of the f

7 answers
  •  挽巷 (original poster)
    2020-12-07 16:22

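    Read the file as ISO-8859-1. That decoding never fails: every byte maps to the character with the same code point, so you just get the byte values packed into a string. This is enough to parse CSV for most encodings, because ASCII-compatible encodings such as UTF-8 and the ISO-8859 family use the same byte values for commas, quotes, and line breaks. (If you have mixed 8-bit and 16-bit blocks, then you're in trouble; you can still read the lines in ISO-8859-1, but you may not be able to parse the line as a block.)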

    Once you have the individual fields as separate strings, you can try

    new String(oldstring.getBytes("ISO-8859-1"), "UTF-8")
    

    to generate the string with the proper encoding (use the appropriate encoding name per field, if you know it).
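    As a minimal sketch of the round trip described above: the class name `Latin1RoundTrip`, the helper `redecodeAsUtf8`, and the sample data are made up for illustration, and the plain `split(",")` only stands in for whatever CSV parser you actually use.

```java
import java.nio.charset.StandardCharsets;

final class Latin1RoundTrip {
    // Recover the original bytes from a string that was read as ISO-8859-1,
    // then interpret those bytes as UTF-8.
    static String redecodeAsUtf8(String latin1Text) {
        byte[] raw = latin1Text.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Hypothetical file content: one CSV line whose first field is UTF-8 encoded.
        byte[] fileBytes = "caf\u00e9,plain".getBytes(StandardCharsets.UTF_8);

        // Reading as ISO-8859-1 keeps one char per byte, so nothing is lost.
        String line = new String(fileBytes, StandardCharsets.ISO_8859_1);

        // Naive split, used here only as a placeholder for real CSV parsing.
        String[] fields = line.split(",");

        System.out.println(redecodeAsUtf8(fields[0]));
        System.out.println(fields[1]);
    }
}
```

    The delimiter survives the ISO-8859-1 pass untouched, and the non-ASCII field only becomes readable after its bytes are decoded a second time with the right charset.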

    Edit: you will have to use java.nio.charset.CharsetDecoder if you want to detect errors. Decoding to UTF-8 with the String constructor will just give you U+FFFD (the replacement character) in your string wherever the input is malformed.

    import java.nio.ByteBuffer
    import java.nio.charset.StandardCharsets

    // A new decoder reports errors by default: decode throws a MalformedInputException on invalid UTF-8
    val decoder = StandardCharsets.UTF_8.newDecoder
    decoder.decode(ByteBuffer.wrap(oldstring.getBytes(StandardCharsets.ISO_8859_1))).toString
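    One way to use that error detection per field is to try strict UTF-8 first and fall back to another charset when decoding fails. The sketch below is in Java and rests on assumptions: `FieldDecoder`, `decodeField`, and the choice of ISO-8859-1 as the fallback are illustrative only, since the real fallback depends on what the non-UTF-8 parts of your file actually contain.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class FieldDecoder {
    // latin1Field is a field taken from a file that was read as ISO-8859-1.
    // fallback is the charset to assume when the bytes are not valid UTF-8.
    static String decodeField(String latin1Field, Charset fallback) {
        byte[] raw = latin1Field.getBytes(StandardCharsets.ISO_8859_1);
        try {
            // REPORT is already the default for a new decoder; it is set
            // explicitly here to make the strict behaviour visible.
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(raw))
                    .toString();
        } catch (CharacterCodingException e) {
            // Not valid UTF-8: decode the same bytes with the assumed fallback.
            return new String(raw, fallback);
        }
    }

    public static void main(String[] args) {
        // Bytes C3 A9 (UTF-8 for U+00E9) as they look after an ISO-8859-1 read.
        String utf8Field = "caf\u00c3\u00a9";
        // A lone byte E9, which is valid ISO-8859-1 but malformed UTF-8.
        String latin1Field = "caf\u00e9";

        System.out.println(decodeField(utf8Field, StandardCharsets.ISO_8859_1));
        System.out.println(decodeField(latin1Field, StandardCharsets.ISO_8859_1));
    }
}
```

    A new decoder is created on every call because a CharsetDecoder is stateful and not safe to share between threads; for large files you could instead reuse one decoder per thread.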
    
