Guessing the encoding of text represented as byte[] in Java

Given an array of bytes representing text in some unknown encoding (usually UTF-8 or ISO-8859-1, but not necessarily so), what is the best way to obtain a guess for the most

7 answers

别跟我提以往

2020-11-28 07:54

Without an encoding indicator, you can never know for sure. However, you can make some intelligent guesses. See my answer to this question:

How to determine if a String contains invalid encoded characters

Use the validUTF8() methods from that answer. If the check returns true, treat the bytes as UTF-8; otherwise, treat them as Latin-1 (ISO-8859-1).
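
The linked answer's code is not reproduced on this page, so here is a minimal sketch of the same idea using only the JDK. The class name `Utf8OrLatin1` and the methods `isValidUtf8`, `guess`, and `decode` are illustrative names, not the `validUTF8()` methods from that answer. The sketch relies on a strict `CharsetDecoder` that reports malformed input instead of silently replacing it.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Illustrative helper (hypothetical names): UTF-8 if the bytes are well formed, else Latin-1.
final class Utf8OrLatin1 {

    /** Returns true if the whole array is a well-formed UTF-8 byte sequence. */
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes)); // throws on any malformed sequence
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    /** Guesses UTF-8 when the bytes validate, ISO-8859-1 otherwise. */
    static Charset guess(byte[] bytes) {
        return isValidUtf8(bytes) ? StandardCharsets.UTF_8 : StandardCharsets.ISO_8859_1;
    }

    /** Decodes the bytes with the guessed charset. */
    static String decode(byte[] bytes) {
        return new String(bytes, guess(bytes));
    }
}
```

Because ISO-8859-1 assigns a character to every byte value, the fallback never fails to decode; it can only be the wrong guess. Pure ASCII input validates as UTF-8, which is harmless since the two encodings agree on those bytes.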

北恋

2020-11-28 07:57

Chi's answer seems most promising for real use. I just want to add that, according to Joel Spolsky, Internet Explorer used a frequency-based guessing algorithm in its day:

http://www.joelonsoftware.com/articles/Unicode.html

Roughly speaking, all the assumed-to-be-text is copied and parsed in every encoding imaginable. Whichever parse best fits a language's average word (and letter?) frequency profile wins. I cannot quickly tell whether jchardet uses the same kind of approach, so I thought I'd mention this just in case.
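
To make the idea concrete, here is a toy sketch of "decode with every candidate and keep the best-scoring parse". It is not Internet Explorer's algorithm and not what jchardet does; the class `FrequencyGuesser`, the scoring rule, and the character set `COMMON` are all assumptions chosen for illustration. A real detector would use per-language frequency tables rather than a single list of common English characters.

```java
import java.nio.charset.Charset;
import java.util.List;

// Toy frequency-based guesser (hypothetical names, English-only scoring).
final class FrequencyGuesser {

    // Space plus twelve letters commonly listed as the most frequent in English.
    private static final String COMMON = " etaoinshrdlu";

    /** Fraction of decoded characters that fall in COMMON; 0.0 for empty input. */
    static double score(byte[] bytes, Charset cs) {
        String text = new String(bytes, cs); // malformed input becomes U+FFFD
        if (text.isEmpty()) {
            return 0.0;
        }
        int hits = 0;
        for (int i = 0; i < text.length(); i++) {
            if (COMMON.indexOf(Character.toLowerCase(text.charAt(i))) >= 0) {
                hits++;
            }
        }
        return (double) hits / text.length();
    }

    /** Returns the candidate with the highest score; earlier candidates win ties. */
    static Charset guess(byte[] bytes, List<Charset> candidates) {
        Charset best = candidates.get(0);
        double bestScore = score(bytes, best);
        for (Charset cs : candidates.subList(1, candidates.size())) {
            double s = score(bytes, cs);
            if (s > bestScore) {
                best = cs;
                bestScore = s;
            }
        }
        return best;
    }
}
```

Encodings that agree on the bytes in question (for example UTF-8 and ISO-8859-1 on pure ASCII text) produce identical scores, so the order of the candidate list decides between them.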

不知归路

2020-11-28 07:59

There is also Apache Tika, a content analysis toolkit. It can guess the MIME type, and it can guess the encoding. Its guesses are usually correct.

暖寄归人

2020-11-28 08:07

Here's my favorite: https://github.com/codehaus/guessencoding

It works like this:

- If there's a UTF-8 or UTF-16 BOM, return that encoding.
- If none of the bytes have the high-order bit set, return ASCII (or you can force it to return a default 8-bit encoding instead).
- If there are bytes with the high bit set but they're arranged in the correct patterns for UTF-8, return UTF-8.
- Otherwise, return the platform default encoding (e.g., windows-1252 on an English-locale Windows system).

It may sound overly simplistic, but in my day-to-day work it's well over 90% accurate.
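
The four steps above can be sketched with the JDK alone. This is not the source of the guessencoding library; `SimpleCharsetGuesser` and its `guess` method are hypothetical names. The fallback charset is passed in explicitly, because `Charset.defaultCharset()` is UTF-8 on recent JDKs (18 and later) unless configured otherwise, which would defeat the last step.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Sketch of the BOM / ASCII / UTF-8 / fallback heuristic (hypothetical names).
final class SimpleCharsetGuesser {

    static Charset guess(byte[] b, Charset fallback) {
        // Step 1: byte order marks.
        if (b.length >= 3 && (b[0] & 0xFF) == 0xEF && (b[1] & 0xFF) == 0xBB && (b[2] & 0xFF) == 0xBF) {
            return StandardCharsets.UTF_8;
        }
        if (b.length >= 2) {
            int b0 = b[0] & 0xFF;
            int b1 = b[1] & 0xFF;
            if (b0 == 0xFE && b1 == 0xFF) {
                return StandardCharsets.UTF_16BE;
            }
            if (b0 == 0xFF && b1 == 0xFE) {
                // Note: FF FE 00 00 is also the UTF-32LE BOM; this sketch ignores UTF-32.
                return StandardCharsets.UTF_16LE;
            }
        }
        // Step 2: no byte has the high-order bit set, so the data is plain ASCII.
        boolean highBitSeen = false;
        for (byte x : b) {
            if (x < 0) { // a negative byte means bit 7 is set
                highBitSeen = true;
                break;
            }
        }
        if (!highBitSeen) {
            return StandardCharsets.US_ASCII;
        }
        // Step 3: high-bit bytes that form well-formed UTF-8 sequences.
        try {
            StandardCharsets.UTF_8.newDecoder().decode(ByteBuffer.wrap(b)); // strict by default
            return StandardCharsets.UTF_8;
        } catch (CharacterCodingException e) {
            // Step 4: give up and return the caller's default 8-bit encoding.
            return fallback;
        }
    }
}
```

A caller on a Western-locale system might pass `Charset.forName("windows-1252")` as the fallback, assuming that charset is available in the running JDK.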
一向

2020-11-28 08:07

Check out jchardet

不知归路

2020-11-28 08:15

There should be something already available for this.

A Google search turned up ICU4J

or

http://jchardet.sourceforge.net/
