I wrote an application (a psychological testing exam) in Delphi (7) which creates a standard text file - ie the file is of type ANSI.
Someone has ported the program
If the UTF file begins with the UTF-8 Byte-Order Mark (BOM), this is easy:
function UTF8FileBOM(const FileName: string): boolean;
var
txt: file;
bytes: array[0..2] of byte;
amt: integer;
begin
FileMode := fmOpenRead;
AssignFile(txt, FileName);
Reset(txt, 1);
try
BlockRead(txt, bytes, 3, amt);
result := (amt=3) and (bytes[0] = $EF) and (bytes[1] = $BB) and (bytes[2] = $BF);
finally
CloseFile(txt);
end;
end;
Otherwise, it is much more difficult.
When reading first try parsing the file as UTF-8. If it isn't valid UTF-8 interpret the file as the legacy encoding(ANSI). This will work on most files, since it's very unlikely that a legacy encoded file will be valid UTF-8.
What windows calls ANSI is a system locale dependent charset. And the text won't work correctly on a Russian, or Asian or... windows.
While the VCL doesn't support Unicode in Delphi 7, you still should internally work with unicode and only convert to ANSI for displaying it. I localized one of my programs to Korean and Russian, and that was the only way I got it working without large problems. You still could only display the Korean localization on a system set to Korean, but at least the text-files could be edited on any system.
There is no 100% sure way to recognize ANSI (e.g. Windows-1250) encoding from UTF-8 encoding. There are ANSI files which cannot be valid UTF-8, but every valid UTF-8 file might as well be a different ANSI file. (Not to mention ASCII-only data, which are both ANSI and UTF-8 by definition, but that is purely a theoretical aspect.)
For instance, the sequence C4 8D might be the “č” character in UTF-8, or it might be “ÄŤ” in windows-1250. Both are possible and correct. However, e.g. 8D 9A can be “Ťš” in windows-1250, but it is not a valid UTF-8 string.
You have to resort to some kind of heuristic, e.g.
See also the method used by Notepad.
//if is possible to decoded,then it is UTF8
function isFileUTF8(const Tex : AnsiString): boolean;
begin
result := (Tex <> '') and (UTF8Decode(Tex) <> '');
end;
If we summerize, then:
OTHER INFO PEOPLE MAY FOUND INTERESTING:
https://groups.google.com/forum/?lnk=st&q=delphi+WIN32+functions+to+detect+which+encoding++is+in+use&rnum=1&hl=pt-BR&pli=1#!topic/borland.public.delphi.internationalization.win32/_LgLolX25OA
function FileMayBeUTF8(FileName: WideString): Boolean;
var
Stream: TMemoryStream;
BytesRead: integer;
ArrayBuff: array[0..127] of byte;
PreviousByte: byte;
i: integer;
YesSequences, NoSequences: integer;
begin
if not WideFileExists(FileName) then
Exit;
YesSequences := 0;
NoSequences := 0;
Stream := TMemoryStream.Create;
try
Stream.LoadFromFile(FileName);
repeat
{read from the TMemoryStream}
BytesRead := Stream.Read(ArrayBuff, High(ArrayBuff) + 1);
{Do the work on the bytes in the buffer}
if BytesRead > 1 then
begin
for i := 1 to BytesRead-1 do
begin
PreviousByte := ArrayBuff[i-1];
if ((ArrayBuff[i] and $c0) = $80) then
begin
if ((PreviousByte and $c0) = $c0) then
begin
inc(YesSequences)
end
else
begin
if ((PreviousByte and $80) = $0) then
inc(NoSequences);
end;
end;
end;
end;
until (BytesRead < (High(ArrayBuff) + 1));
//Below, >= makes ASCII files = UTF-8, which is no problem.
//Simple > would catch only UTF-8;
Result := (YesSequences >= NoSequences);
finally
Stream.Free;
end;
end;
Now testing this function...
In my humble opinion only way how to START doing this check correctly is to check OS charset in first place because in the end there almost in all cases are made some references to OS. No way to scape it anyway...
Remarks: