How to identify doc, docx, pdf, xls and xlsx based on file header

暖寄归人 2020-12-28 18:28

How can I identify doc, docx, pdf, xls and xlsx files based on their file header in C#? I don't want to rely on file extensions or on MimeMapping.GetMimeMapping for this as either o

4 answers
  •  暖寄归人
    2020-12-28 18:41

    The answer from user2173353 is the most correct one, given that the OP specifically mentioned Office file formats. However, I didn't like the idea of adding an entire library (OpenMCDF) just to identify legacy Office formats, so I wrote my own routine for doing just this.

        // Required namespaces: System, System.Collections.Generic, System.IO, System.Linq, System.Text
        public static CfbFileFormat GetCfbFileFormat(Stream fileData)
        {
            if (!fileData.CanSeek)
                throw new ArgumentException("Data stream must be seekable.", nameof(fileData));
    
            try
            {
                // Note that values in a CFB file are always little-endian. Fortunately, BinaryReader.ReadUInt16/ReadUInt32 read little-endian values.
                // On .NET Framework versions before 4.5 this BinaryReader constructor (with leaveOpen) is not available. Use a simpler one, but then also remove the 'using' statement so the caller's stream is not closed.
                using (BinaryReader reader = new BinaryReader(fileData, Encoding.Unicode, true))
                {
                    // Check that data has the CFB file header
                    var header = reader.ReadBytes(8);
                    if (!header.SequenceEqual(new byte[] {0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1}))
                        return CfbFileFormat.Unknown;
    
                    // Get the sector shift (2-byte uint) at offset 30 (0x1E) in the header.
                    // It stores the sector size as a power-of-two exponent. The only valid values are 9 and 12, giving 512- or 4096-byte sectors.
                    fileData.Position = 30;
                    ushort sectorShift = reader.ReadUInt16();
                    if (sectorShift != 9 && sectorShift != 12)
                        return CfbFileFormat.Unknown;
                    int sectorSize = 1 << sectorShift;
    
                    // Get first directory sector index at offset 48 in the header
                    fileData.Position = 48;
                    var rootDirectoryIndex = reader.ReadUInt32();
    
                    // File header is one sector wide. After that we can address the sector directly using the sector index
                    var rootDirectoryAddress = sectorSize + (rootDirectoryIndex * sectorSize);
    
                    // Object type field is offset 80 bytes into the directory sector. It is a 128 bit GUID, encoded as "DWORD, WORD, WORD, BYTE[8]".
                    fileData.Position = rootDirectoryAddress + 80;
                    var bits127_96 = reader.ReadInt32();
                    var bits95_80 = reader.ReadInt16();
                    var bits79_64 = reader.ReadInt16();
                    var bits63_0 = reader.ReadBytes(8);
    
                    var guid = new Guid(bits127_96, bits95_80, bits79_64, bits63_0);
    
                    // Compare to known file format GUIDs
    
                    CfbFileFormat result;
                    return Formats.TryGetValue(guid, out result) ? result : CfbFileFormat.Unknown;
                }
            }
            catch (IOException)
            {
                return CfbFileFormat.Unknown;
            }
            catch (OverflowException)
            {
                return CfbFileFormat.Unknown;
            }
        }
    
        public enum CfbFileFormat
        {
            Doc,
            Xls,
            Msi,
            Ppt,
            Unknown
        }
    
        private static readonly Dictionary<Guid, CfbFileFormat> Formats = new Dictionary<Guid, CfbFileFormat>
        {
            {Guid.Parse("{00020810-0000-0000-c000-000000000046}"), CfbFileFormat.Xls},
            {Guid.Parse("{00020820-0000-0000-c000-000000000046}"), CfbFileFormat.Xls},
            {Guid.Parse("{00020906-0000-0000-c000-000000000046}"), CfbFileFormat.Doc},
            {Guid.Parse("{000c1084-0000-0000-c000-000000000046}"), CfbFileFormat.Msi},
            {Guid.Parse("{64818d10-4f9b-11cf-86ea-00aa00b929e8}"), CfbFileFormat.Ppt}
        };
    

    Additional format identifiers can be added as needed.

    I've tried this on .doc and .xls files, and it has worked fine. I haven't tested it on CFB files that use a 4096-byte sector size, as I don't even know where to find those.
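
    Lacking a real file with 4096-byte sectors, one way to exercise that path is a synthetic stream. The following is a minimal sketch, not part of the original routine: `CfbProbe`, `BuildSyntheticCfb` and `ReadRootClsid` are hypothetical names, and it assumes a little-endian machine because `BitConverter` and `Guid.ToByteArray` use the platform byte order. The stream it builds is not a valid compound file (there is no FAT and no directory entry name); it holds only the fields the header walk reads.

    ```csharp
    using System;
    using System.IO;
    using System.Linq;
    using System.Text;

    public static class CfbProbe
    {
        private static readonly byte[] Signature = { 0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1 };

        // Hypothetical helper: a header sector followed by one directory sector
        // whose first entry (the root entry) carries the given CLSID.
        public static MemoryStream BuildSyntheticCfb(ushort sectorShift, Guid rootClsid)
        {
            int sectorSize = 1 << sectorShift;
            var bytes = new byte[2 * sectorSize];
            Signature.CopyTo(bytes, 0);
            BitConverter.GetBytes(sectorShift).CopyTo(bytes, 30);    // sector shift at offset 30
            BitConverter.GetBytes(0u).CopyTo(bytes, 48);             // first directory sector = 0
            rootClsid.ToByteArray().CopyTo(bytes, sectorSize + 80);  // CLSID at offset 80 of the root entry
            return new MemoryStream(bytes);
        }

        // Returns the root entry CLSID, or null if the stream does not look like a CFB file.
        public static Guid? ReadRootClsid(Stream data)
        {
            using (var reader = new BinaryReader(data, Encoding.Unicode, true))
            {
                data.Position = 0;
                if (!reader.ReadBytes(8).SequenceEqual(Signature))
                    return null;

                data.Position = 30;
                int sectorShift = reader.ReadUInt16();
                if (sectorShift != 9 && sectorShift != 12)
                    return null;
                long sectorSize = 1L << sectorShift;

                data.Position = 48;
                uint firstDirectorySector = reader.ReadUInt32();

                // The header occupies one full sector, then sectors are addressed by index.
                data.Position = sectorSize + firstDirectorySector * sectorSize + 80;
                byte[] clsid = reader.ReadBytes(16);
                return clsid.Length == 16 ? new Guid(clsid) : (Guid?)null;
            }
        }
    }
    ```

    Passing `CfbProbe.BuildSyntheticCfb(12, someGuid)` to `GetCfbFileFormat` should then hit the 4096-byte branch. `ReadRootClsid` can also be pointed at an unrecognised CFB file to find the GUID that would need to be added to the `Formats` table.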

    The GetCfbFileFormat routine is based on information from the following documents:

    • http://fileformats.archiveteam.org/wiki/Microsoft_Compound_File
    • https://msdn.microsoft.com/en-us/library/dd942138.aspx
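
    The routine above covers only CFB containers (.doc, .xls). For the other formats named in the question, a rough sketch could check the leading bytes: PDF files normally start with `%PDF-`, and .docx/.xlsx files are ZIP packages that start with `PK\x03\x04`. The names `FileSniffer` and `SniffedFormat` are hypothetical, and the part names `word/document.xml` and `xl/workbook.xml` are the ones Word and Excel conventionally write rather than something the package format guarantees, so treat this as a heuristic.

    ```csharp
    using System;
    using System.IO;
    using System.IO.Compression;
    using System.Linq;

    public enum SniffedFormat { Pdf, Docx, Xlsx, Zip, Cfb, Unknown }

    public static class FileSniffer
    {
        private static readonly byte[] PdfMagic = { 0x25, 0x50, 0x44, 0x46, 0x2D };  // "%PDF-"
        private static readonly byte[] ZipMagic = { 0x50, 0x4B, 0x03, 0x04 };        // "PK\x03\x04"
        private static readonly byte[] CfbMagic = { 0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1 };

        public static SniffedFormat Sniff(Stream data)
        {
            if (!data.CanSeek)
                throw new ArgumentException("Data stream must be seekable.", nameof(data));

            // Read up to 8 bytes from the start; Stream.Read may return fewer than requested.
            data.Position = 0;
            var header = new byte[8];
            int read = 0;
            while (read < header.Length)
            {
                int n = data.Read(header, read, header.Length - read);
                if (n == 0) break;
                read += n;
            }

            if (StartsWith(header, read, PdfMagic)) return SniffedFormat.Pdf;
            if (StartsWith(header, read, CfbMagic)) return SniffedFormat.Cfb;  // hand over to the CFB routine
            if (!StartsWith(header, read, ZipMagic)) return SniffedFormat.Unknown;

            // A ZIP container: look for the conventional main part of each Office format.
            data.Position = 0;
            try
            {
                using (var zip = new ZipArchive(data, ZipArchiveMode.Read, true))
                {
                    if (zip.GetEntry("word/document.xml") != null) return SniffedFormat.Docx;
                    if (zip.GetEntry("xl/workbook.xml") != null) return SniffedFormat.Xlsx;
                }
            }
            catch (InvalidDataException)
            {
                // Starts like a ZIP but cannot be read as one; fall through.
            }
            return SniffedFormat.Zip;
        }

        private static bool StartsWith(byte[] buffer, int length, byte[] magic)
        {
            return length >= magic.Length && buffer.Take(magic.Length).SequenceEqual(magic);
        }
    }
    ```

    A stream reported as `SniffedFormat.Cfb` can then be passed to `GetCfbFileFormat` to tell .doc from .xls.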
