I am trying to parse a CSV file, ideally using weka.core.converters.CSVLoader. However, the file I have is not a valid UTF-8 file. It is mostly a UTF-8 file but some of the f
Use ISO-8859-1
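as the encoding when you read the file; this just gives you the raw byte values packed into a string, one char per byte, so nothing is lost or rejected. That is enough to parse CSV for most encodings, because the commas, quotes and line breaks are single ASCII bytes in any ASCII-compatible encoding. (If you have mixed 8-bit and 16-bit blocks, then you're in trouble; you can still read the lines in ISO-8859-1, but you may not be able to parse the line as a block.)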
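To make the "bytes packed into a string" idea concrete, here is a minimal sketch. The object name Latin1Lines and the method splitFields are made up for illustration, and the comma split is deliberately naive (it ignores quoted fields), so treat it as a demonstration of the lossless read rather than a real CSV parser.

```scala
import java.nio.charset.StandardCharsets.ISO_8859_1

// Hypothetical helper, not part of Weka or the JDK.
object Latin1Lines {
  // Decode the raw bytes as ISO-8859-1: each byte 0x00-0xFF becomes the char
  // with the same numeric value, so even invalid UTF-8 bytes survive intact.
  def splitFields(raw: Array[Byte]): Array[Array[String]] =
    new String(raw, ISO_8859_1)
      .split("\r?\n")          // one entry per line
      .map(_.split(",", -1))   // naive split: assumes no quoted commas
}
```

Each resulting field still carries its original bytes, which you can get back with field.getBytes("ISO-8859-1"). For real data with quoted fields you would hand the ISO-8859-1 text to a proper CSV parser instead of splitting on commas.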
Once you have the individual fields as separate strings, you can try
new String(oldstring.getBytes("ISO-8859-1"), "UTF-8")
to generate the string with the proper encoding (use the appropriate encoding name per field, if you know it).
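As a rough sketch of the per-field step, the one-liner can be wrapped in a small helper that takes the encoding you believe the field is really in. The names Recode and recode are hypothetical; only the String and Charset calls come from the JDK.

```scala
import java.nio.charset.{Charset, StandardCharsets}

// Hypothetical helper wrapping the one-liner above.
object Recode {
  // `field` was read as ISO-8859-1, so getBytes(ISO_8859_1) recovers the
  // original bytes exactly; they are then decoded with the real charset.
  def recode(field: String, actual: Charset): String =
    new String(field.getBytes(StandardCharsets.ISO_8859_1), actual)
}
```

For example, a field whose bytes were UTF-8 shows up as "Ã©" after the ISO-8859-1 read, and Recode.recode(field, StandardCharsets.UTF_8) turns it back into "é".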
Edit: you will have to use java.nio.charset.CharsetDecoder
if you want to detect errors. Decoding as UTF-8 with the String constructor will just put U+FFFD (the replacement character) in your string wherever the bytes are malformed.
val decoder = java.nio.charset.Charset.forName("UTF-8").newDecoder
// By default, decode throws a MalformedInputException if the bytes are not valid UTF-8
decoder.decode( java.nio.ByteBuffer.wrap(oldstring.getBytes("ISO-8859-1")) ).toString
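Putting the pieces together, one possible way to handle a "mostly UTF-8" file is to try the strict decoder on each field and fall back when it fails. This is only a sketch: FieldDecoder and decodeField are made-up names, and keeping the ISO-8859-1 reading as the fallback is an assumption about what the stray bytes are (they could just as well be Windows-1252 or something else).

```scala
import java.nio.ByteBuffer
import java.nio.charset.{CharacterCodingException, StandardCharsets}

// Hypothetical helper, shown only to illustrate the try-then-fall-back idea.
object FieldDecoder {
  def decodeField(field: String): String = {
    // Recover the original bytes from the ISO-8859-1 view of the field.
    val bytes = field.getBytes(StandardCharsets.ISO_8859_1)
    // Fresh decoder per call; a new decoder reports errors instead of replacing.
    val decoder = StandardCharsets.UTF_8.newDecoder()
    try decoder.decode(ByteBuffer.wrap(bytes)).toString
    catch {
      // MalformedInputException is a CharacterCodingException.
      // Assumption: a field that is not valid UTF-8 is kept as ISO-8859-1.
      case _: CharacterCodingException => field
    }
  }
}
```

Fields that are valid UTF-8 (including plain ASCII) come back properly decoded, while a field containing a lone byte such as 0xE9 is returned unchanged instead of being turned into U+FFFD.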