Guessing the encoding of text represented as byte[] in Java

感动是毒 2020-11-28 07:45

Given an array of bytes representing text in some unknown encoding (usually UTF-8 or ISO-8859-1, but not necessarily so), what is the best way to obtain a guess for the most

7 answers

暖寄归人 (original poster)

2020-11-28 08:07
Here's my favorite: https://github.com/codehaus/guessencoding

It works like this:
- If there's a UTF-8 or UTF-16 BOM, return that encoding.
- If none of the bytes have the high-order bit set, return ASCII (or you can force it to return a default 8-bit encoding instead).
- If there are bytes with the high bit set but they're arranged in the correct patterns for UTF-8, return UTF-8.
- Otherwise, return the platform default encoding (e.g., windows-1252 on an English-locale Windows system).
It may sound overly simplistic, but in my day-to-day work it's well over 90% accurate.
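
The four steps above can be sketched in plain Java roughly as follows. This is a minimal illustration of the heuristic, not the guessencoding library's actual code: the class name `EncodingGuess`, the method `guess`, and the `fallback` parameter are hypothetical names chosen for this example. It treats a leading FF FE as UTF-16LE and does not try to tell it apart from a UTF-32 BOM.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper illustrating the heuristic; not the library's API.
final class EncodingGuess {
    private EncodingGuess() {}

    /** Guesses the charset of data; fallback is returned for 8-bit text that is not valid UTF-8. */
    static Charset guess(byte[] data, Charset fallback) {
        // Step 1: a UTF-8 or UTF-16 byte order mark decides immediately.
        if (startsWith(data, 0xEF, 0xBB, 0xBF)) return StandardCharsets.UTF_8;
        if (startsWith(data, 0xFE, 0xFF)) return StandardCharsets.UTF_16BE;
        if (startsWith(data, 0xFF, 0xFE)) return StandardCharsets.UTF_16LE;

        // Step 2: if no byte has the high-order bit set, the text is plain ASCII.
        boolean highBitSeen = false;
        for (byte b : data) {
            if (b < 0) { // a negative signed byte means the high bit is set
                highBitSeen = true;
                break;
            }
        }
        if (!highBitSeen) return StandardCharsets.US_ASCII;

        // Step 3: a strict decoder rejects any byte sequence that is not well-formed UTF-8.
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return StandardCharsets.UTF_8;
        } catch (CharacterCodingException e) {
            // Step 4: otherwise use the caller-supplied default.
            return fallback;
        }
    }

    // True if data begins with the given unsigned byte values.
    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }
}
```

A call might look like `new String(bytes, EncodingGuess.guess(bytes, Charset.forName("windows-1252")))`. Passing the fallback explicitly is a deliberate choice in this sketch: since Java 18, `Charset.defaultCharset()` is UTF-8 unless configured otherwise, so it no longer reflects a legacy platform encoding such as windows-1252.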