Identifying the type of a file without extension from binary data

I have some files without extensions, and I would like to assign extensions to them. I have written a Python program that reads the data in each file. My question is: how can I identify a file's type from its contents, without the extension and without using third-party tools?

I only have to identify PDF, DOC, and text files. No other file types are possible.

My server runs CentOS.

You could read the first few bytes of the file and look for a "magic number". The Wikipedia page on magic numbers suggests that PDF files begin with ASCII %PDF and doc files begin with hex D0 CF 11 E0.

Identifying text files is going to be pretty tough in the general case, because a lot of standard magic numbers are actually ASCII text at the beginning of a binary file. In your case, if you can guarantee that you won't be getting anything but PDF, DOC, or TXT, you could probably get away with checking for the PDF and DOC magic numbers, and then assuming the file is text if it matches neither.
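The check described above can be sketched in a few lines of standard-library Python. This is a minimal sketch, not a definitive implementation: the function name `guess_extension` and the constant names are made up for illustration, and the `.txt` fallback is only safe under the assumption that nothing but PDF, DOC, or text files will ever show up.

```python
# Magic numbers as given on the Wikipedia page mentioned above.
PDF_MAGIC = b"%PDF"
DOC_MAGIC = bytes.fromhex("D0CF11E0")


def guess_extension(path):
    """Return '.pdf', '.doc', or '.txt' based on the first bytes of the file.

    Hypothetical helper: anything that is neither PDF nor DOC is assumed
    to be text, which only holds if no other file types are possible.
    """
    with open(path, "rb") as f:
        header = f.read(8)  # enough to cover both magic numbers
    if header.startswith(PDF_MAGIC):
        return ".pdf"
    if header.startswith(DOC_MAGIC):
        return ".doc"
    return ".txt"
```

Renaming could then be done with something like `os.rename(path, path + guess_extension(path))`.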

You haven't said what OS you're on. If it's a *nix-based one, there is a Python wrapper (built on ctypes) around libmagic, which uses the same underlying mechanism as the file command and can identify files without extensions by examining their contents. Alternatively, look at how libmagic's file definitions identify the two primary file types (doc, pdf), treat everything left over as text ;-) and extend your existing code accordingly.
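If installing the Python wrapper is not an option, one alternative is to call the file command itself through the standard library. The sketch below does that instead of using the wrapper; the function name `detect_mime_type` is hypothetical, and it assumes Python 3.7+ and that `file` is on the PATH (it returns `None` otherwise).

```python
import shutil
import subprocess


def detect_mime_type(path):
    """Ask the system `file` command for a MIME type.

    Hypothetical helper: returns None if `file` is not installed or fails.
    """
    file_cmd = shutil.which("file")
    if file_cmd is None:
        return None
    result = subprocess.run(
        # "--" stops option parsing in case the path starts with a dash.
        [file_cmd, "--brief", "--mime-type", "--", path],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return None
    return result.stdout.strip() or None
```

A PDF is normally reported as `application/pdf` and plain text as `text/plain`; the string reported for legacy DOC files can differ between versions of file, so it is worth checking what your server actually prints before mapping MIME types to extensions.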

PDF documents start with %PDF- followed by a version number (for example, %PDF-1.4), but some of them could be entirely compressed.
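A slightly more tolerant check for the PDF header could look like the sketch below. The function name `pdf_version` is made up for illustration, and searching the first 1024 bytes rather than only the very start of the file is an assumption, made to allow for a few stray bytes before the header.

```python
import re

# "%PDF-" followed by a version such as 1.4 or 1.7.
_PDF_HEADER = re.compile(rb"%PDF-(\d+\.\d+)")


def pdf_version(data):
    """Return the PDF version string found near the start of `data`, or None.

    Hypothetical helper; the 1024-byte search window is an assumption.
    """
    match = _PDF_HEADER.search(data[:1024])
    return match.group(1).decode("ascii") if match else None
```

It takes raw bytes, so it can be fed the result of `open(path, "rb").read(1024)`.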

Source: https://stackoverflow.com/questions/12190128/identifying-the-type-of-a-file-without-extension-from-binary-data

Tags

python

file

binaryfiles

binary-data