How to determine text encoding

Submitted by 荒凉一梦 on 2020-01-13 02:39:27

Question


I know that a UTF file can carry a BOM that identifies its encoding, but what about other encodings that give no such clue? How can those be guessed?

I am a new Java programmer. I have written code that guesses the UTF encoding from the BOM, but I have a problem with other encodings. How do I guess them?

Can anybody help me? Thanks in advance.


Answer 1:


This question is a duplicate of several previous ones. There are at least two libraries for Java that attempt to guess the encoding (although keep in mind that there is no way to guess right 100% of the time).

  • GuessEncoding
  • jchardet (Java port of the algorithm used by Mozilla Firefox)

Of course, if you know the encoding will only be one of three or four options, you might be able to write a more accurate guessing algorithm.
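
As a rough illustration of the "only three or four options" case, here is a minimal sketch that tries each candidate charset in order with a strict decoder and returns the first one that decodes the bytes without error. The class and method names (CandidateCharsetGuesser, firstThatDecodes) are made up for this example. It assumes the caller lists the strictest encodings first, because a single-byte charset such as ISO-8859-1 accepts every byte sequence and would always "match" if it came first.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.util.List;
import java.util.Optional;

// Hypothetical helper, not part of any library mentioned above.
class CandidateCharsetGuesser {

    // Returns the first candidate that decodes the data with no malformed
    // or unmappable input. Order matters: put strict encodings (e.g. UTF-8)
    // before permissive ones (e.g. ISO-8859-1).
    static Optional<Charset> firstThatDecodes(byte[] data, List<Charset> candidates) {
        for (Charset cs : candidates) {
            try {
                cs.newDecoder()
                  .onMalformedInput(CodingErrorAction.REPORT)
                  .onUnmappableCharacter(CodingErrorAction.REPORT)
                  .decode(ByteBuffer.wrap(data));
                return Optional.of(cs);
            } catch (CharacterCodingException e) {
                // The bytes are not valid in this charset; try the next one.
            }
        }
        return Optional.empty();
    }
}
```

A successful decode only means the bytes are valid in that charset, not that it is the charset the author used, so this remains a guess.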




Answer 2:


The short answer is: you cannot.

Even in UTF-8, the BOM is entirely optional, and it is often recommended not to use it, since many applications do not handle it properly and simply display it as if it were printable characters. The original purpose of the byte order mark was to indicate the endianness of UTF-16 files.

That said, most applications that handle Unicode implement some sort of guessing algorithm: they read the beginning of the file and look for certain signatures.
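
The simplest kind of signature check is the BOM itself. The sketch below compares the first bytes of a file against the standard Unicode BOM byte sequences and returns an encoding name, or an empty result when no BOM is present. The class name BomSniffer and its methods are illustrative, not from any library. Note one built-in ambiguity: the UTF-32LE BOM (FF FE 00 00) starts with the UTF-16LE BOM (FF FE), so this sketch assumes the longer match is the intended one.

```java
import java.util.Optional;

// Hypothetical helper for illustration only.
class BomSniffer {

    // head: the first few bytes of the file (4 bytes are enough).
    static Optional<String> detect(byte[] head) {
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) return Optional.of("UTF-8");
        // Check the 4-byte UTF-32 marks before the 2-byte UTF-16 marks,
        // because FF FE 00 00 also begins with the UTF-16LE mark FF FE.
        if (startsWith(head, 0x00, 0x00, 0xFE, 0xFF)) return Optional.of("UTF-32BE");
        if (startsWith(head, 0xFF, 0xFE, 0x00, 0x00)) return Optional.of("UTF-32LE");
        if (startsWith(head, 0xFE, 0xFF)) return Optional.of("UTF-16BE");
        if (startsWith(head, 0xFF, 0xFE)) return Optional.of("UTF-16LE");
        return Optional.empty(); // no BOM: the encoding is still unknown
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            // Mask to compare as unsigned, since Java bytes are signed.
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }
}
```

An empty result says nothing about the encoding; it only means the file has to be examined by other heuristics.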




Answer 3:


If you don't know the encoding and don't have any indicators (like a BOM), it's not always possible to accurately "guess" the encoding. Some pointers exist that can give you hints.

For example, an ISO-8859-1 file will (usually) not contain any 0x00 bytes, whereas a UTF-16 file containing mostly Latin text has loads of them.
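
The 0x00 hint can be turned into a crude check. The sketch below counts zero bytes at even and odd offsets in a sample: for text that is mostly ASCII or Latin-1 characters, UTF-16BE puts the zero byte first in each pair (even offsets) and UTF-16LE puts it second (odd offsets). The class name NulByteHeuristic is invented for this example, and the approach is an assumption-laden heuristic: it will not work for text dominated by non-Latin scripts, and binary files can fool it.

```java
import java.util.Optional;

// Hypothetical helper for illustration only.
class NulByteHeuristic {

    // Returns "UTF-16BE" or "UTF-16LE" when zero bytes cluster at even or
    // odd offsets, and an empty result when there are none (or no clear bias).
    static Optional<String> guessUtf16(byte[] sample) {
        int evenNuls = 0;
        int oddNuls = 0;
        for (int i = 0; i < sample.length; i++) {
            if (sample[i] == 0) {
                if (i % 2 == 0) evenNuls++; else oddNuls++;
            }
        }
        if (evenNuls > oddNuls) return Optional.of("UTF-16BE");
        if (oddNuls > evenNuls) return Optional.of("UTF-16LE");
        return Optional.empty(); // no zero bytes, or no bias either way
    }
}
```

An empty result is consistent with ISO-8859-1, UTF-8, and many other encodings, so it narrows the options rather than identifying one.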

The most common solution is to let the user select the encoding if you cannot detect it.



Source: https://stackoverflow.com/questions/3211683/how-to-determine-text-encoding
