How to read a text file with mixed encodings in Scala or Java?

I am trying to parse a CSV file, ideally using weka.core.converters.CSVLoader. However, the file I have is not a valid UTF-8 file. It is mostly a UTF-8 file but some of the f

7 answers

一个人的身影

2020-12-07 16:24

The problem with ignoring invalid bytes is deciding where valid data resumes. UTF-8 encodes characters as variable-length byte sequences, so once you hit an invalid byte you need to work out which byte to resume reading from to get a valid stream of characters again.
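To make the resynchronisation problem concrete, here is a minimal sketch in Java. It relies on one fact about UTF-8: continuation bytes always have the bit pattern 10xxxxxx, so any other byte could begin a character. The class and method names (`Utf8Resync`, `nextBoundary`) are made up for illustration. Note that this only tells you where a character could start; a stray byte from another encoding can itself look like a lead or continuation byte, which is exactly why skipping invalid bytes is unreliable.

```java
final class Utf8Resync {
    // A UTF-8 continuation byte always has the bit pattern 10xxxxxx.
    public static boolean isContinuation(byte b) {
        return (b & 0xC0) == 0x80;
    }

    // Hypothetical helper: returns the index of the first byte at or after
    // `from` that is not a continuation byte, or data.length if there is none.
    public static int nextBoundary(byte[] data, int from) {
        int i = from;
        while (i < data.length && isContinuation(data[i])) {
            i++;
        }
        return i;
    }

    public static void main(String[] args) {
        // 0xC3 0xA9 is the UTF-8 encoding of U+00E9, followed by ASCII 'a'.
        byte[] data = {(byte) 0xC3, (byte) 0xA9, 'a'};
        // Starting in the middle of the two-byte sequence, skip to the next boundary.
        System.out.println(nextBoundary(data, 1));
    }
}
```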

In short, I don't think you'll find a library which can 'correct' as it reads. I think a much more productive approach is to try to clean the data up first.
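A clean-up pass along those lines might look like the sketch below. It rests on assumptions that are not in the question: that the file can be split on the newline byte, that each line is in a single encoding, and that any line which fails strict UTF-8 decoding is ISO-8859-1. The class name `MixedEncodingCleaner` and the fallback charset are placeholders; swap in whatever your data actually uses. The output is consistent UTF-8 that can then be handed to a CSV parser.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class MixedEncodingCleaner {
    // Assumption: lines that are not valid UTF-8 are ISO-8859-1.
    static final Charset FALLBACK = StandardCharsets.ISO_8859_1;

    // Decode one line strictly as UTF-8; on any malformed input, use the fallback.
    public static String decodeLine(byte[] line) {
        try {
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(line))
                    .toString();
        } catch (CharacterCodingException e) {
            return new String(line, FALLBACK);
        }
    }

    // Re-encode the whole input as UTF-8, one line at a time.
    // Splitting on 0x0A is safe because that byte means newline in both
    // UTF-8 and ISO-8859-1 and never appears inside a multi-byte sequence.
    public static byte[] clean(byte[] input) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int start = 0;
        for (int i = 0; i <= input.length; i++) {
            if (i == input.length || input[i] == '\n') {
                byte[] line = Arrays.copyOfRange(input, start, i);
                byte[] utf8 = decodeLine(line).getBytes(StandardCharsets.UTF_8);
                out.write(utf8, 0, utf8.length);
                if (i < input.length) {
                    out.write('\n');
                }
                start = i + 1;
            }
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // First line ends in a lone 0xE9 (ISO-8859-1), second line is valid UTF-8.
        byte[] mixed = {'c', 'a', 'f', (byte) 0xE9, '\n',
                        'n', 'a', (byte) 0xC3, (byte) 0xAF, 'v', 'e', '\n'};
        System.out.print(new String(clean(mixed), StandardCharsets.UTF_8));
    }
}
```

For a real file you would read the bytes with `java.nio.file.Files.readAllBytes`, run them through `clean`, and write the result out before loading it. A line that mixes both encodings will be decoded entirely with the fallback, so this is a heuristic, not a guarantee.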
