Given an array of bytes representing text in some unknown encoding (usually UTF-8 or ISO-8859-1, but not necessarily), what is the best way to obtain a guess for the most likely encoding used?
Without an encoding indicator, you can never know for sure. However, you can make some intelligent guesses. See my answer to this question:
How to determine if a String contains invalid encoded characters
Use the validUTF8() method. If it returns true, treat the data as UTF-8; otherwise, treat it as Latin-1.
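A minimal sketch of such a check using only the JDK (the name validUTF8 follows the answer above; the helper the linked answer refers to may be implemented differently). A strict CharsetDecoder rejects any malformed byte sequence, which is exactly the property we want:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class EncodingGuess {
    // Returns true if the bytes form a valid UTF-8 sequence.
    // The decoder is configured to REPORT (throw) on malformed or
    // unmappable input instead of silently replacing it.
    static boolean validUTF8(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] utf8 = "héllo".getBytes(StandardCharsets.UTF_8);
        // In Latin-1, é is the single byte 0xE9, which is an invalid
        // UTF-8 lead byte when followed by a plain ASCII letter.
        byte[] latin1 = "héllo".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(validUTF8(utf8));   // true
        System.out.println(validUTF8(latin1)); // false
    }
}
```

This works because almost no real-world Latin-1 text happens to be valid UTF-8: any non-ASCII Latin-1 byte would need to be followed by specific continuation bytes to pass.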
Chi's answer seems most promising for real use. I just want to add that, according to Joel Spolsky, Internet Explorer used a frequency-based guessing algorithm in its day:
http://www.joelonsoftware.com/articles/Unicode.html
Roughly speaking, all the assumed-to-be-text is parsed in every encoding imaginable, and whichever parse best fits a language's average word (and letter?) frequency profile wins. I cannot quickly tell whether jchardet uses the same kind of approach, so I thought I'd mention this just in case.
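The idea above can be sketched in a toy form: decode the bytes with each candidate charset and score each result by how "text-like" its characters look. This is an illustration, not IE's actual algorithm; real detectors use per-language letter and byte-sequence frequency tables, while this sketch just counts plausible characters:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.List;

public class FrequencyGuess {
    // Decode with each candidate charset; return the one whose decoded
    // text scores highest. Charsets that cannot decode the bytes at all
    // are skipped outright.
    static Charset guess(byte[] bytes, List<Charset> candidates) {
        Charset best = null;
        double bestScore = -1;
        for (Charset cs : candidates) {
            CharsetDecoder dec = cs.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT);
            String text;
            try {
                text = dec.decode(ByteBuffer.wrap(bytes)).toString();
            } catch (CharacterCodingException e) {
                continue; // not even decodable in this charset
            }
            double score = score(text);
            if (score > bestScore) {
                bestScore = score;
                best = cs;
            }
        }
        return best;
    }

    // Crude stand-in for a frequency profile: the fraction of characters
    // that are letters, digits, whitespace, or common punctuation.
    static double score(String text) {
        if (text.isEmpty()) return 0;
        long plausible = text.chars()
                .filter(c -> Character.isLetterOrDigit(c)
                        || Character.isWhitespace(c)
                        || ".,;:!?'\"()-".indexOf(c) >= 0)
                .count();
        return (double) plausible / text.length();
    }

    public static void main(String[] args) {
        byte[] data = "naïve café résumé".getBytes(StandardCharsets.UTF_8);
        Charset g = guess(data,
                List.of(StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1));
        // The Latin-1 reading turns each accented letter into pairs like
        // "Ã©", introducing symbol characters, so UTF-8 scores higher.
        System.out.println(g); // UTF-8
    }
}
```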
There is also Apache Tika, a content analysis toolkit. It can guess the MIME type as well as the encoding, and its guesses are usually correct.
Here's my favorite: https://github.com/codehaus/guessencoding
It works like this, roughly: if the bytes start with a BOM, that encoding wins; otherwise, if the bytes decode cleanly as UTF-8, it returns UTF-8; failing that, it falls back to a default charset you supply (typically ISO-8859-1).
It may sound overly simplistic, but in my day-to-day work it's well over 90% accurate.
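That strategy can be sketched in plain JDK code as follows. This is my reading of the approach, not the linked library's actual source; the method name and fallback parameter are illustrative:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class GuessEncoding {
    // BOM first, then strict UTF-8 validation, then the caller's fallback.
    static Charset guess(byte[] bytes, Charset fallback) {
        // UTF-8 BOM: EF BB BF
        if (bytes.length >= 3 && (bytes[0] & 0xFF) == 0xEF
                && (bytes[1] & 0xFF) == 0xBB && (bytes[2] & 0xFF) == 0xBF) {
            return StandardCharsets.UTF_8;
        }
        // UTF-16 BOMs: FE FF (big-endian), FF FE (little-endian)
        if (bytes.length >= 2) {
            int b0 = bytes[0] & 0xFF, b1 = bytes[1] & 0xFF;
            if (b0 == 0xFE && b1 == 0xFF) return StandardCharsets.UTF_16BE;
            if (b0 == 0xFF && b1 == 0xFE) return StandardCharsets.UTF_16LE;
        }
        // No BOM: accept UTF-8 only if the bytes decode without error.
        CharsetDecoder dec = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            dec.decode(ByteBuffer.wrap(bytes));
            return StandardCharsets.UTF_8;
        } catch (CharacterCodingException e) {
            return fallback;
        }
    }

    public static void main(String[] args) {
        byte[] latin1 = "Größe".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(guess(latin1, StandardCharsets.ISO_8859_1)); // ISO-8859-1
        byte[] utf8 = "Größe".getBytes(StandardCharsets.UTF_8);
        System.out.println(guess(utf8, StandardCharsets.ISO_8859_1));   // UTF-8
    }
}
```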
Check out jchardet
There should be existing tools for this. A quick Google search turned up ICU4J, or jchardet:
http://jchardet.sourceforge.net/