How to read a text file with mixed encodings in Scala or Java?

日久生厌 2020-12-07 16:00

I am trying to parse a CSV file, ideally using weka.core.converters.CSVLoader. However, the file I have is not valid UTF-8. It is mostly UTF-8, but some of the f

7 Answers
  • 2020-12-07 16:24

    The problem with ignoring invalid bytes is deciding where valid input resumes. UTF-8 is a variable-length encoding (one to four bytes per character), so once you hit an invalid byte, you need to work out which byte to resume reading from to get a valid stream of characters again.
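
To make the resynchronization problem concrete, here is a minimal Java sketch that asks a strict UTF-8 decoder where it first rejects the input. The class `Utf8Probe` and the method `firstInvalidOffset` are hypothetical names for illustration, not part of Weka or the JDK. It uses the standard `java.nio.charset.CharsetDecoder` with `CodingErrorAction.REPORT`.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only.
final class Utf8Probe {

    /** Returns the offset of the first byte rejected as UTF-8, or -1 if the input is valid. */
    static int firstInvalidOffset(byte[] data) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer in = ByteBuffer.wrap(data);
        // UTF-8 never produces more chars than input bytes, so this buffer cannot overflow.
        CharBuffer out = CharBuffer.allocate(data.length + 1);
        CoderResult result = decoder.decode(in, out, true);
        // On an error result, the input position marks where the rejected bytes begin.
        return result.isError() ? in.position() : -1;
    }
}
```

Knowing the offset only says where decoding failed. Deciding how many bytes to skip, and what they were meant to be, is still a judgment call.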

    In short, I don't think you'll find a library that can 'correct' the input as it reads. I think a much more productive approach is to clean the data up first.
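
One way such a cleanup pass might look in Java is sketched below. It decodes as UTF-8 and re-decodes any rejected bytes with a fallback charset. The class `MixedEncodingCleaner`, the method `decodeWithFallback`, and the choice of fallback are all assumptions: the question does not say which encoding the stray bytes are in, so ISO-8859-1 appears here only as a guess.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only.
final class MixedEncodingCleaner {

    /** Decodes data as UTF-8, re-decoding any rejected byte runs with the fallback charset. */
    static String decodeWithFallback(byte[] data, Charset fallback) {
        CharsetDecoder utf8 = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer in = ByteBuffer.wrap(data);
        // UTF-8 never produces more chars than input bytes, so this buffer cannot overflow.
        CharBuffer chunk = CharBuffer.allocate(data.length + 1);
        StringBuilder cleaned = new StringBuilder(data.length);
        while (true) {
            CoderResult result = utf8.decode(in, chunk, true);
            // Keep whatever decoded cleanly before the decoder stopped.
            chunk.flip();
            cleaned.append(chunk);
            chunk.clear();
            if (!result.isError()) {
                break; // all input consumed
            }
            // Assumption: bytes rejected as UTF-8 belong to the fallback charset.
            byte[] rejected = new byte[result.length()];
            in.get(rejected);
            cleaned.append(new String(rejected, fallback));
            utf8.reset(); // start fresh on the bytes after the rejected run
        }
        return cleaned.toString();
    }
}
```

For a file, the bytes could be read with `Files.readAllBytes`, passed through `decodeWithFallback(bytes, StandardCharsets.ISO_8859_1)`, and written back out as UTF-8 before handing the cleaned file to the CSV loader. This is a heuristic: a run of non-UTF-8 bytes that happens to form a valid UTF-8 sequence will be decoded as UTF-8 and silently misread, so the output is worth spot-checking.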
