Identifying the type of a file without extension from binary data

Submitted by 感情迁移 on 2019-12-05 19:16:04

You could read the first few bytes of the file and look for a "magic number". The Wikipedia page on magic numbers suggests that PDF files begin with ASCII %PDF and doc files begin with hex D0 CF 11 E0.

Identifying text files is going to be pretty tough in the general case, because a lot of standard magic numbers are actually ASCII text at the beginning of a binary file. For your case, if you can guarantee that you won't be getting anything but PDF, DOC, or TXT, you could probably get away with checking for the PDF and DOC magic numbers, and then assuming the file is text if it matches neither.
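A minimal sketch of that check in Python, assuming the input really is limited to PDF, DOC, or TXT. The names guess_type and guess_file_type are made up for illustration. Note that D0 CF 11 E0 is the start of the OLE2 compound file header, so other legacy Office files (.xls, .ppt) would also be reported as "doc" here.

```python
# Magic numbers taken from the answer above.
PDF_MAGIC = b"%PDF"
DOC_MAGIC = bytes.fromhex("D0CF11E0")  # OLE2 container used by legacy .doc


def guess_type(data: bytes) -> str:
    """Classify raw bytes as 'pdf', 'doc', or 'txt' (the fallback)."""
    if data.startswith(PDF_MAGIC):
        return "pdf"
    if data.startswith(DOC_MAGIC):
        return "doc"
    # Assumption: anything that is neither PDF nor DOC is plain text.
    return "txt"


def guess_file_type(path: str) -> str:
    """Read only the first few bytes of a file and classify them."""
    with open(path, "rb") as f:
        return guess_type(f.read(8))
```

Only the first eight bytes are read, so the check stays cheap even for large files.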

You haven't said what OS you're on. If it's a *nix-based one, there is a Python wrapper (using ctypes) around libmagic, the same underlying mechanism the file command uses to identify files without extensions by examining their contents. Alternatively, examine how libmagic's file definitions identify the two primary file types (doc, pdf), treat everything left over as text ;-), and extend your existing code.
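As a rough sketch of the libmagic route, assuming the wrapper in question is the python-magic package (which exposes magic.from_buffer) and that libmagic itself is installed. The helper name sniff_mime is hypothetical, and it returns None when the wrapper is not usable rather than failing.

```python
from typing import Optional


def sniff_mime(data: bytes) -> Optional[str]:
    """Return a MIME type guessed by libmagic, or None if it is unavailable."""
    try:
        import magic  # assumption: the python-magic package
    except (ImportError, OSError):
        # The wrapper or the underlying libmagic library is missing.
        return None
    from_buffer = getattr(magic, "from_buffer", None)
    if from_buffer is None:
        # A different module named "magic" is installed; give up gracefully.
        return None
    return from_buffer(data, mime=True)
```

Passing the first couple of kilobytes of the file is usually enough for libmagic to make a guess, though the exact MIME string it reports can vary between libmagic versions.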

PDF documents start with %PDF- followed by a version number, but some of them could be entirely compressed.
