I am trying to parse a CSV file, ideally using weka.core.converters.CSVLoader. However the file I have is not a valid UTF-8 file. It is mostly a UTF-8 file but some of the f
The problem with ignoring invalid bytes is then deciding when they're valid again. Note that UTF-8 allows variable-length byte encodings for characters, so if a byte is invalid, you need to understand which byte to start reading from to get a valid stream of characters again.
In short, I don't think you'll find a library which can 'correct' as it reads. I think a much more productive approach is to try and clean that data up first.