How Can I Best Guess the Encoding when the BOM (Byte Order Mark) is Missing?

野趣味 2020-11-30 00:15

My program has to read files that use various encodings. They may be ANSI, UTF-8 or UTF-16 (big or little endian).

When the BOM (Byte Order Mark) is there, I have n

4 Answers
  •  孤街浪徒
    2020-11-30 00:41

    ASCII? No modern OS uses plain ASCII any more. They all use at least 8-bit codes, meaning the file is either UTF-8, ISOLatinX, WinLatinX, MacRoman, Shift-JIS or whatever else is out there.

    The only test I know of is to check for invalid UTF-8 byte sequences. If you find any, then you know the file can't be UTF-8. The same is probably possible for UTF-16. But when the file is not in a Unicode encoding, it will be hard to tell which Windows code page it might be.
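As a rough illustration of the validity test described above, here is a minimal Python sketch. The helper names `is_valid_utf8` and `is_valid_utf16` are made up for this example; they rely only on Python's built-in strict decoders. Keep in mind that passing the check proves little: pure ASCII is also valid UTF-8, and most even-length byte strings decode as UTF-16, so the UTF-16 check in particular is a weak signal.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data is a well-formed UTF-8 byte sequence."""
    try:
        data.decode("utf-8")  # error handling is strict by default
        return True
    except UnicodeDecodeError:
        return False


def is_valid_utf16(data: bytes, codec: str = "utf-16-le") -> bool:
    """Return True if data decodes as UTF-16 with the given byte order.

    codec is expected to be "utf-16-le" or "utf-16-be" (no BOM handling).
    Decoding fails on an odd byte count or an unpaired surrogate.
    """
    try:
        data.decode(codec)
        return True
    except UnicodeDecodeError:
        return False
```

A failed check rules the encoding out; a successful one only means the encoding is still a candidate.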

    Most editors I know deal with this by letting the user choose a default from the list of all possible encodings.
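The editor-style fallback can be sketched along these lines in Python. The function `decode_with_fallback` and its `default` parameter are hypothetical names for this example, and `cp1252` is only an assumed default; a real program would let the user pick it.

```python
def decode_with_fallback(data: bytes, default: str = "cp1252"):
    """Decode BOM-less data: try strict UTF-8, else use the chosen default.

    Returns a (text, encoding_used) tuple.
    """
    try:
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Not UTF-8. There is no reliable way to tell 8-bit code pages
        # apart, so fall back to the user's default. errors="replace"
        # keeps bytes that are undefined in that code page from raising.
        return data.decode(default, errors="replace"), default
```

For example, `decode_with_fallback(raw, default="shift_jis")` would treat anything that is not valid UTF-8 as Shift-JIS.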

    There is code out there for checking the validity of UTF-encoded text.
