How to identify doc, docx, pdf, xls and xlsx based on file header

暖寄归人 2020-12-28 18:28

How can I identify doc, docx, pdf, xls and xlsx files based on the file header in C#? I don't want to rely on file extensions or on MimeMapping.GetMimeMapping for this, as either o

4 answers
  •  天命终不由人
    2020-12-28 18:50

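    Using file signatures alone is not really feasible: the new Office formats are all ZIP files and the old Office files are all OLE CF / OLE SS (compound file / structured storage) containers, so the leading bytes only tell you the container type, not the document type. You can, however, use C# code to open the container and figure out what is inside.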
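
    To illustrate how far the header alone gets you, here is a minimal sketch that classifies a file by its leading bytes. The type and method names (`ContainerKind`, `FileSignature`, `Classify`, `ClassifyFile`) are made up for this example, and it assumes the signature sits at offset 0 of the file. It can only report "PDF", "some ZIP file" or "some OLE compound file".

    ```csharp
    using System;
    using System.IO;

    public enum ContainerKind { Unknown, Pdf, Zip, OleCompoundFile }

    public static class FileSignature
    {
        // Well-known leading bytes ("magic numbers") of each container format.
        private static readonly byte[] PdfMagic = { 0x25, 0x50, 0x44, 0x46, 0x2D };  // "%PDF-"
        private static readonly byte[] ZipMagic = { 0x50, 0x4B, 0x03, 0x04 };        // "PK" 03 04
        private static readonly byte[] OleMagic = { 0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1 };

        // Classifies a buffer holding the first bytes of a file.
        public static ContainerKind Classify(byte[] header)
        {
            if (StartsWith(header, PdfMagic)) return ContainerKind.Pdf;
            if (StartsWith(header, ZipMagic)) return ContainerKind.Zip;              // docx, xlsx, pptx or any other ZIP
            if (StartsWith(header, OleMagic)) return ContainerKind.OleCompoundFile;  // doc, xls, ppt, msi, msg, ...
            return ContainerKind.Unknown;
        }

        // Reads up to 8 bytes from the start of the file and classifies them.
        public static ContainerKind ClassifyFile(string path)
        {
            var buffer = new byte[8];
            int read = 0;
            using (var stream = File.OpenRead(path))
            {
                while (read < buffer.Length)
                {
                    int n = stream.Read(buffer, read, buffer.Length - read);
                    if (n == 0) break;  // end of file reached early
                    read += n;
                }
            }
            Array.Resize(ref buffer, read);
            return Classify(buffer);
        }

        private static bool StartsWith(byte[] data, byte[] prefix)
        {
            if (data == null || data.Length < prefix.Length) return false;
            for (int i = 0; i < prefix.Length; i++)
                if (data[i] != prefix[i]) return false;
            return true;
        }
    }
    ```

    A result of `Zip` or `OleCompoundFile` still needs a look inside the container before you can tell a DOCX from an XLSX, or a DOC from an XLS.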

    For the newer Office formats (DOCX/PPTX/XLSX/...), you can open the ZIP file using System.IO.Packaging: https://msdn.microsoft.com/en-us/library/ms568187(v=vs.110).aspx Doing that, you can find the ContentType of the first document part and infer the file type from it.
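
    A rough sketch of that idea, assuming the file follows the usual (transitional) Open XML layout in which a package-level relationship of type `officeDocument` points at the main part. The class and method names (`OpenXmlSniffer`, `Detect`, `KindFromContentType`) are hypothetical, and the substring matching is a simplification: templates land in the same bucket as documents, and macro-enabled files use different content types and come back as "unknown".

    ```csharp
    using System;
    using System.IO;
    using System.IO.Packaging;
    using System.Linq;

    public static class OpenXmlSniffer
    {
        // Relationship type that links the package root to the main document part.
        private const string OfficeDocumentRelType =
            "http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument";

        // Maps the main part's content type to a coarse file kind.
        public static string KindFromContentType(string contentType)
        {
            if (contentType == null) return "unknown";
            if (contentType.Contains("wordprocessingml")) return "docx";
            if (contentType.Contains("spreadsheetml")) return "xlsx";
            if (contentType.Contains("presentationml")) return "pptx";
            return "unknown";
        }

        // Opens the package read-only and inspects the main document part.
        // Package.Open throws if the file is not a valid package, so callers
        // may want to wrap this in a try/catch.
        public static string Detect(string path)
        {
            using (Package package = Package.Open(path, FileMode.Open, FileAccess.Read))
            {
                PackageRelationship rel =
                    package.GetRelationshipsByType(OfficeDocumentRelType).FirstOrDefault();
                if (rel == null) return "unknown";

                // Relationship targets are relative to the package root.
                Uri partUri = PackUriHelper.ResolvePartUri(
                    new Uri("/", UriKind.Relative), rel.TargetUri);
                if (!package.PartExists(partUri)) return "unknown";

                return KindFromContentType(package.GetPart(partUri).ContentType);
            }
        }
    }
    ```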

    For older Office files (Office 2003) you can use this library to distinguish them based on their contents (note that MSI and MSG files also use this container format): http://sourceforge.net/projects/openmcdf/
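
    As a sketch of what that could look like: the binary Office formats keep their main data in a stream with a well-known name directly under the root storage ("WordDocument" for DOC, "Workbook" or "Book" for XLS, "PowerPoint Document" for PPT). The code below assumes the OpenMCDF 2.x API (`CompoundFile`, `RootStorage.VisitEntries`, `Close`); other versions of the library may expose different calls, so treat it as an illustration. `LegacyOfficeSniffer` and its methods are names invented for this example.

    ```csharp
    using System;
    using System.Collections.Generic;
    using OpenMcdf; // third-party library; API assumed to match the 2.x releases

    public static class LegacyOfficeSniffer
    {
        // Decides the file kind from the entry names found under the root storage.
        public static string KindFromEntryNames(ICollection<string> names)
        {
            if (names.Contains("WordDocument")) return "doc";
            if (names.Contains("Workbook") || names.Contains("Book")) return "xls";
            if (names.Contains("PowerPoint Document")) return "ppt";
            return "unknown"; // e.g. MSI, MSG or another compound file
        }

        // Lists the top-level entries of the compound file and classifies them.
        public static string Detect(string path)
        {
            var names = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
            var cf = new CompoundFile(path);
            try
            {
                // false = do not recurse into sub-storages
                cf.RootStorage.VisitEntries(item => names.Add(item.Name), false);
            }
            finally
            {
                cf.Close();
            }
            return KindFromEntryNames(names);
        }
    }
    ```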

    E.g., here are the contents of an XLS file:

    I hope this helps! :)

    It would have certainly helped me, if I had found this answer earlier. ;)
