Guessing the encoding of text represented as byte[] in Java

感动是毒 2020-11-28 07:45

Given an array of bytes representing text in some unknown encoding (usually UTF-8 or ISO-8859-1, but not necessarily so), what is the best way to obtain a guess for the most

7 Answers
  • 2020-11-28 07:54

    Without an encoding indicator, you can never know for sure. However, you can make some intelligent guesses. See my answer to this question:

    How to determine if a String contains invalid encoded characters

    Use the validUTF8() method from that answer. If it returns true, treat the bytes as UTF-8; otherwise treat them as Latin-1 (ISO-8859-1).
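The "valid UTF-8, else Latin-1" rule can be sketched with the JDK alone. This is not the validUTF8() method from the linked answer; it is a minimal stand-in that assumes a strict `CharsetDecoder` is an acceptable validity check. The names `Utf8OrLatin1`, `isValidUtf8`, and `decodeGuessing` are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not the code from the linked answer.
final class Utf8OrLatin1 {

    /** Returns true if the whole array is well-formed UTF-8. */
    static boolean isValidUtf8(byte[] bytes) {
        // REPORT makes the decoder throw instead of silently
        // substituting U+FFFD for malformed input.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    /** Decodes as UTF-8 when the bytes are valid UTF-8, otherwise as ISO-8859-1. */
    static String decodeGuessing(byte[] bytes) {
        return isValidUtf8(bytes)
                ? new String(bytes, StandardCharsets.UTF_8)
                : new String(bytes, StandardCharsets.ISO_8859_1);
    }
}
```

Note that every byte sequence decodes under ISO-8859-1, so the fallback never fails; it only means "not UTF-8", not "definitely Latin-1". Pure ASCII input passes the UTF-8 check, which is harmless because both encodings agree on ASCII.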

  • 2020-11-28 07:57

    Chi's answer seems most promising for real use. I just want to add that, according to Joel Spolsky, Internet Explorer used a frequency-based guessing algorithm in its day:

    http://www.joelonsoftware.com/articles/Unicode.html

    Roughly speaking, all the assumed-to-be-text is copied and parsed in every encoding imaginable. Whichever parse best fits a language's average word (and letter?) frequency profile wins. I cannot quickly tell whether jchardet uses the same kind of approach, so I thought I'd mention this just in case.

  • 2020-11-28 07:59

    There is also Apache Tika, a content analysis toolkit. It can guess the MIME type, and it can guess the encoding. The guess is usually correct.

  • 2020-11-28 08:07

    Here's my favorite: https://github.com/codehaus/guessencoding

    It works like this:

    • If there's a UTF-8 or UTF-16 BOM, return that encoding.
    • If none of the bytes have the high-order bit set, return ASCII (or you can force it to return a default 8-bit encoding instead).
    • If there are bytes with the high bit set but they're arranged in the correct patterns for UTF-8, return UTF-8.
    • Otherwise, return the platform default encoding (e.g., windows-1252 on an English-locale Windows system).

    It may sound overly simplistic, but in my day-to-day work it's well over 90% accurate.
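The four steps above can be sketched in plain Java. This is an illustration of the described rules, not the guessencoding library's actual code or API; `SimpleCharsetGuesser`, `guess`, and the `fallback` parameter are hypothetical names. As a simplification it only checks UTF-8 and UTF-16 BOMs, so a UTF-32LE BOM (which also starts with FF FE) would be reported as UTF-16LE.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of the BOM -> ASCII -> UTF-8 -> fallback rules.
final class SimpleCharsetGuesser {

    static Charset guess(byte[] b, Charset fallback) {
        // Step 1: a UTF-8 or UTF-16 byte order mark decides immediately.
        if (startsWith(b, 0xEF, 0xBB, 0xBF)) return StandardCharsets.UTF_8;
        if (startsWith(b, 0xFE, 0xFF)) return StandardCharsets.UTF_16BE;
        if (startsWith(b, 0xFF, 0xFE)) return StandardCharsets.UTF_16LE;

        // Step 2: no byte has the high-order bit set, so it is plain ASCII.
        boolean highBitSeen = false;
        for (byte x : b) {
            if (x < 0) { // bytes are signed; negative means bit 7 is set
                highBitSeen = true;
                break;
            }
        }
        if (!highBitSeen) return StandardCharsets.US_ASCII;

        // Step 3: high-bit bytes that form well-formed UTF-8 sequences.
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(b));
            return StandardCharsets.UTF_8;
        } catch (CharacterCodingException e) {
            // Step 4: give up and use the caller-supplied default.
            return fallback;
        }
    }

    private static boolean startsWith(byte[] b, int... prefix) {
        if (b.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((b[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }
}
```

A call might look like `new String(bytes, SimpleCharsetGuesser.guess(bytes, Charset.defaultCharset()))`. Two caveats: the decoded string will still begin with U+FEFF when a BOM was present, and on JDK 18 and later `Charset.defaultCharset()` is UTF-8 by default, so passing an explicit 8-bit charset as the fallback is probably closer to the behavior described above.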

  • 2020-11-28 08:07

    Check out jchardet

  • 2020-11-28 08:15

    There should be libraries for this already.

    A Google search turned up icu4j

    or

    http://jchardet.sourceforge.net/
