Detecting 'text' file type (ANSI vs UTF-8)


北荒 2020-12-14 23:01

I wrote an application (a psychological testing exam) in Delphi 7 which creates a standard text file, i.e. the file is ANSI-encoded.

Someone has ported the program

5 answers

不思量自难忘° (original poster)

2020-12-14 23:47
There is no 100% reliable way to tell an ANSI encoding (e.g. Windows-1250) from UTF-8. Some ANSI files cannot be valid UTF-8, but every valid UTF-8 file could just as well be a different ANSI text. (ASCII-only data is identical in ANSI and UTF-8 by definition, but that case is of purely theoretical interest.)

For instance, the byte sequence C4 8D might be the character “č” in UTF-8, or it might be “ÄŤ” in Windows-1250. Both readings are possible and correct. However, 8D 9A can be “Ťš” in Windows-1250, but it is not valid UTF-8.
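The two byte sequences above can be checked directly. This is a small sketch in Python (not Delphi), assuming Python's built-in `cp1250` codec as a stand-in for Windows-1250; the variable names are only illustrative.

```python
# C4 8D is valid in both encodings, but decodes to different text.
ambiguous = bytes([0xC4, 0x8D])
as_utf8 = ambiguous.decode("utf-8")      # one character
as_cp1250 = ambiguous.decode("cp1250")   # two characters
print(as_utf8)    # č
print(as_cp1250)  # ÄŤ

# 8D 9A is fine in Windows-1250, but 8D is a stray UTF-8 continuation byte.
ansi_only = bytes([0x8D, 0x9A])
print(ansi_only.decode("cp1250"))  # Ťš

try:
    ansi_only.decode("utf-8")  # strict decoding rejects invalid sequences
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False
print(valid_utf8)  # False
```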

You have to resort to some kind of heuristic, e.g.
1. If the file contains a byte sequence which cannot be valid UTF-8, assume it is ANSI.
2. Otherwise, if the file begins with the UTF-8 BOM (EF BB BF), assume it is UTF-8. (It might not be, but a plain-text ANSI file beginning with those characters is very improbable.)
3. Otherwise, assume it is UTF-8. (Or try more heuristics, perhaps using knowledge of the language of the text, etc.)
See also the method used by Notepad.
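
The three steps above can be sketched roughly as follows. This is an illustration in Python rather than Delphi. The function names `guess_encoding` and `decode_text` are hypothetical, and the `cp1250` fallback is an assumption, so substitute whichever ANSI code page your files actually use.

```python
import codecs


def guess_encoding(data: bytes, ansi_codec: str = "cp1250") -> str:
    """Guess whether raw file bytes are UTF-8 or an ANSI code page.

    ansi_codec is an assumed fallback; the bytes alone cannot reveal
    which ANSI code page was used.
    """
    try:
        data.decode("utf-8")  # strict: raises on any invalid sequence
    except UnicodeDecodeError:
        return ansi_codec     # step 1: cannot be UTF-8, assume ANSI
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"    # step 2: valid UTF-8 with BOM (codec strips it)
    return "utf-8"            # step 3: valid UTF-8 without BOM


def decode_text(data: bytes, ansi_codec: str = "cp1250") -> str:
    """Decode bytes using the guessed encoding."""
    return data.decode(guess_encoding(data, ansi_codec))


print(guess_encoding(bytes([0xC4, 0x8D])))  # utf-8
print(guess_encoding(bytes([0x8D, 0x9A])))  # cp1250
print(decode_text(b"\xef\xbb\xbfabc"))      # abc
```

Note that the ambiguous C4 8D case is resolved in favour of UTF-8 here, which is exactly the guess the heuristic makes. It can still be wrong for a genuine ANSI file.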