Detecting 'text' file type (ANSI vs UTF-8)

北荒 2020-12-14 23:01

I wrote an application (a psychological testing exam) in Delphi 7 which creates a standard text file, i.e. the file is ANSI-encoded.

Someone has ported the program

5 answers
  •  抹茶落季
    2020-12-14 23:53

    To summarize:

    • For basic usage, the simplest solution is the (outdated) IsTextUnicode() function;
    • For advanced usage, call the function above, then check for a BOM (~1 KB), then check the locale info of the particular OS. Even then, accuracy is probably only about 98%.
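
    The BOM step from the list above could be sketched roughly as follows. This is only a minimal Delphi 7 console sketch, not code from the linked thread: TBomKind, DetectBom and BomNames are made-up names, and it only inspects the first three bytes of the file. A UTF-32 LE BOM (FF FE 00 00) would be reported as UTF-16 LE here, and a file without a BOM tells you nothing, so this complements the other checks rather than replacing them.

    ```pascal
    program BomCheck;
    {$APPTYPE CONSOLE}

    uses
      SysUtils, Classes;

    type
      // Hypothetical enumeration, introduced only for this sketch
      TBomKind = (bkNone, bkUTF8, bkUTF16LE, bkUTF16BE);

    const
      BomNames: array[TBomKind] of string =
        ('none', 'UTF-8', 'UTF-16 LE', 'UTF-16 BE');

    // Reads at most three bytes and compares them with the known BOM patterns.
    function DetectBom(const FileName: string): TBomKind;
    var
      Stream: TFileStream;
      Buf: array[0..2] of Byte;
      BytesRead: Integer;
    begin
      Result := bkNone;
      Stream := TFileStream.Create(FileName, fmOpenRead or fmShareDenyWrite);
      try
        BytesRead := Stream.Read(Buf, SizeOf(Buf));
      finally
        Stream.Free;
      end;
      // BytesRead is tested first, so unread buffer bytes are never inspected
      if (BytesRead >= 3) and (Buf[0] = $EF) and (Buf[1] = $BB) and (Buf[2] = $BF) then
        Result := bkUTF8
      else if (BytesRead >= 2) and (Buf[0] = $FF) and (Buf[1] = $FE) then
        Result := bkUTF16LE
      else if (BytesRead >= 2) and (Buf[0] = $FE) and (Buf[1] = $FF) then
        Result := bkUTF16BE;
    end;

    begin
      if ParamCount < 1 then
        WriteLn('usage: BomCheck <file>')
      else
        WriteLn(BomNames[DetectBom(ParamStr(1))]);
    end.
    ```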

    OTHER INFO PEOPLE MAY FIND INTERESTING:

    https://groups.google.com/forum/?lnk=st&q=delphi+WIN32+functions+to+detect+which+encoding++is+in+use&rnum=1&hl=pt-BR&pli=1#!topic/borland.public.delphi.internationalization.win32/_LgLolX25OA

    function FileMayBeUTF8(FileName: WideString): Boolean;
    var
     Stream: TMemoryStream;
     BytesRead: integer;
     ArrayBuff: array[0..127] of byte;
     PreviousByte: byte;
     i: integer;
     YesSequences, NoSequences: integer;
    
    begin
       Result := False; {make sure the result is defined if the file is missing}
       if not WideFileExists(FileName) then
         Exit;
       YesSequences := 0;
       NoSequences := 0;
       Stream := TMemoryStream.Create;
       try
         Stream.LoadFromFile(FileName);
         repeat
    
         {read from the TMemoryStream}
    
           BytesRead := Stream.Read(ArrayBuff, High(ArrayBuff) + 1);
               {Do the work on the bytes in the buffer}
           if BytesRead > 1 then
             begin
               for i := 1 to BytesRead-1 do
                 begin
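                   PreviousByte := ArrayBuff[i-1];
                   {10xxxxxx: the current byte is a UTF-8 continuation byte}
                   if ((ArrayBuff[i] and $c0) = $80) then
                     begin
                       {11xxxxxx before it: a lead byte, so this looks like UTF-8}
                       if ((PreviousByte and $c0) = $c0) then
                         begin
                           inc(YesSequences)
                         end
                       else
                         begin
                           {0xxxxxxx before it: plain ASCII, so the continuation byte is stray}
                           if ((PreviousByte and $80) = $0) then
                             inc(NoSequences);
                         end;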
                     end;
                 end;
             end;
         until (BytesRead < (High(ArrayBuff) + 1));
    // With >=, pure ASCII files (0 >= 0) are also reported as UTF-8, which is harmless.
    // A strict > would only accept files that actually contain multi-byte UTF-8 sequences.
         Result := (YesSequences >= NoSequences);
    
       finally
         Stream.Free;
       end;
    end;
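
    The function above only counts byte pairs, so it is a heuristic. For comparison, here is a rough sketch of a stricter check that walks whole sequences. This is a different technique from the one in the linked post, and IsStrictUTF8 is a made-up name. It assumes Delphi 7, where string literals are plain byte strings, and it does not reject overlong encodings or surrogate code points, so treat it as a starting point only.

    ```pascal
    program StrictUtf8Check;
    {$APPTYPE CONSOLE}

    // Hypothetical helper: True when every byte of S is part of a
    // structurally well-formed UTF-8 sequence (pure ASCII also passes).
    function IsStrictUTF8(const S: AnsiString): Boolean;
    var
      i, Len, Trail: Integer;
      B: Byte;
    begin
      Result := False;
      Len := Length(S);
      i := 1;
      while i <= Len do
      begin
        B := Ord(S[i]);
        // Work out how many continuation bytes must follow this byte
        if B < $80 then
          Trail := 0
        else if (B and $E0) = $C0 then
          Trail := 1
        else if (B and $F0) = $E0 then
          Trail := 2
        else if (B and $F8) = $F0 then
          Trail := 3
        else
          Exit; // stray continuation byte or invalid lead byte
        Inc(i);
        while Trail > 0 do
        begin
          if (i > Len) or ((Ord(S[i]) and $C0) <> $80) then
            Exit; // sequence is truncated or broken
          Inc(i);
          Dec(Trail);
        end;
      end;
      Result := True;
    end;

    begin
      WriteLn(IsStrictUTF8('plain ASCII')); // ASCII only: accepted
      WriteLn(IsStrictUTF8(#$C3#$A9));      // "e acute" encoded as UTF-8: accepted
      WriteLn(IsStrictUTF8(#$E9'cole'));    // "e acute" in Windows-1252: rejected
    end.
    ```

    Note that an ANSI file containing only ASCII characters passes both checks, so the result still has to be combined with the BOM and locale information.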
    

    I am still testing FileMayBeUTF8...

    In my humble opinion, the only way to start doing this check correctly is to look at the OS charset first, because in almost all cases the result ends up depending on the OS anyway. There is no way to escape it.
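
    As a small illustration of checking the OS charset, the Win32 GetACP function returns the system's active ANSI code page. The console program below is only a sketch (the program name is made up). It shows which code page a non-UTF-8 "ANSI" file written on the same machine would most likely use.

    ```pascal
    program ShowAnsiCodePage;
    {$APPTYPE CONSOLE}

    uses
      Windows;

    begin
      // GetACP returns the active ANSI code page identifier,
      // for example 1252 (Western European) or 1251 (Cyrillic).
      WriteLn('System ANSI code page: ', GetACP);
    end.
    ```

    A file created on another machine may of course use a different code page, which is one reason the overall detection stays a guess.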

    Remarks:

    • The WideFileExists() function is taken from TntClasses.pas (Koders.net source).
