How to read a text file with mixed encodings in Scala or Java?

日久生厌 2020-12-07 16:00

I am trying to parse a CSV file, ideally using weka.core.converters.CSVLoader. However the file I have is not a valid UTF-8 file. It is mostly a UTF-8 file but some of the f

7 answers

挽巷 (original poster)

2020-12-07 16:22
Read the file as ISO-8859-1; this just gives you the byte values packed into a string, one character per byte. This is enough to parse CSV for most encodings. (If you have mixed 8-bit and 16-bit blocks, then you're in trouble; you can still read the lines as ISO-8859-1, but you may not be able to parse the line as a block.)

Once you have the individual fields as separate strings, you can try
```
new String(oldstring.getBytes("ISO-8859-1"), "UTF-8")
```
to generate the string with the proper encoding (use the appropriate encoding name per field, if you know it).
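As a minimal sketch of the idea above, the snippet below builds a hypothetical two-field line whose first field is UTF-8 and whose second field is ISO-8859-1, reads it as ISO-8859-1, splits it, and re-decodes only the first field. The object name `MixedLineDemo` and the sample bytes are assumptions for illustration, and the naive `split` ignores CSV quoting.

```scala
import java.nio.charset.StandardCharsets.{ISO_8859_1, UTF_8}

object MixedLineDemo {
  // Hypothetical sample: "é" as UTF-8 (0xC3 0xA9), a comma (0x2C), then "é" as ISO-8859-1 (0xE9)
  val rawLine: Array[Byte] = Array(0xC3, 0xA9, 0x2C, 0xE9).map(_.toByte)

  // ISO-8859-1 maps each byte to the code point of the same value, so no byte is lost
  val asLatin1: String = new String(rawLine, ISO_8859_1)

  // Splitting on ',' is safe for UTF-8 fields: bytes below 0x80 never appear
  // inside a multi-byte UTF-8 sequence. (Quoted fields are not handled here.)
  val fields: Array[String] = asLatin1.split(",", -1)

  // Recover the original bytes of the first field and decode them as UTF-8
  val first: String = new String(fields(0).getBytes(ISO_8859_1), UTF_8)

  // The second field really is ISO-8859-1, so it is already correct
  val second: String = fields(1)
}
```

Both `MixedLineDemo.first` and `MixedLineDemo.second` end up as the single character `é`, even though they were stored with different byte sequences.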

Edit: you will have to use java.nio.charset.CharsetDecoder if you want to detect errors. Decoding as UTF-8 with the String constructor will just give you U+FFFD (the replacement character) in your string when there's an error.
```
val decoder = java.nio.charset.Charset.forName("UTF-8").newDecoder

// By default, decode throws a MalformedInputException if the bytes are not valid UTF-8
decoder.decode( java.nio.ByteBuffer.wrap(oldstring.getBytes("ISO-8859-1")) ).toString
```
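One way to use the strict decoder per field is to try UTF-8 first and keep the ISO-8859-1 reading when that fails. The following is only a sketch under that assumption: `FieldDecoder`, `strictDecode`, and `decodeField` are hypothetical names, and the fallback choice of ISO-8859-1 is a guess that may not match your data. Note that a non-UTF-8 field whose bytes happen to form valid UTF-8 would still be decoded as UTF-8.

```scala
import java.nio.ByteBuffer
import java.nio.charset.{CharacterCodingException, Charset, CodingErrorAction}
import java.nio.charset.StandardCharsets.{ISO_8859_1, UTF_8}

object FieldDecoder {
  // Decode strictly; None signals that the bytes are not valid in `charset`
  def strictDecode(bytes: Array[Byte], charset: Charset): Option[String] = {
    val decoder = charset.newDecoder()
      .onMalformedInput(CodingErrorAction.REPORT)
      .onUnmappableCharacter(CodingErrorAction.REPORT)
    try Some(decoder.decode(ByteBuffer.wrap(bytes)).toString)
    catch { case _: CharacterCodingException => None }
  }

  // `latin1Field` is a field that was read as ISO-8859-1 (one char per original byte).
  // Assumption: anything that is not valid UTF-8 is treated as ISO-8859-1.
  def decodeField(latin1Field: String): String =
    strictDecode(latin1Field.getBytes(ISO_8859_1), UTF_8).getOrElse(latin1Field)
}
```

For example, `FieldDecoder.decodeField("\u00c3\u00a9")` returns `é` because the two bytes form a valid UTF-8 sequence, while `FieldDecoder.decodeField("\u00e9")` also returns `é` because the lone byte 0xE9 is rejected as UTF-8 and the ISO-8859-1 reading is kept.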