How to calculate the entropy of a file?

野趣味 2020-11-28 20:16

How to calculate the entropy of a file? (Or let's just say a bunch of bytes)
I have an idea, but I'm not sure that it's mathematically correct.

My id

11 Answers
  •  孤独总比滥情好
    2020-11-28 20:37

    Re: I need the whole thing to make assumptions on the file's contents: (plaintext, markup, compressed or some binary, ...)

    As others have pointed out (or been confused/distracted by), I think you're actually talking about metric entropy (entropy divided by length of message). See more at Entropy (information theory) - Wikipedia.
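
    As a rough illustration of the quantity involved, here is a minimal Python sketch. It assumes "entropy" means the Shannon entropy of the empirical byte frequencies, and that the per-byte figure (bits per byte, between 0 and 8) is what "metric entropy" refers to here. The name byte_entropy is made up for this example.

```python
import math
from collections import Counter


def byte_entropy(data: bytes) -> float:
    """Shannon entropy of the empirical byte distribution, in bits per byte."""
    if not data:
        return 0.0  # assumption: treat empty input as zero entropy
    n = len(data)
    counts = Counter(data)  # how often each byte value (0..255) occurs
    # sum of p * log2(1/p) over the byte values that actually occur
    return sum((c / n) * math.log2(n / c) for c in counts.values())


if __name__ == "__main__":
    sample = b"hello world"
    h = byte_entropy(sample)
    print(f"{h:.3f} bits per byte, roughly {h * len(sample):.1f} bits in total")
```

    Multiplying the per-byte value by the length gives an estimate of the total in bits. This measure only looks at byte frequencies and ignores ordering, so 256 distinct bytes in sorted order still score the full 8 bits per byte.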

    jitter's comment linking to Scanning data for entropy anomalies is very relevant to your underlying goal. That article eventually links to libdisorder (a C library for measuring byte entropy). That approach should give you much more information to work with, since it shows how the metric entropy varies across different parts of the file. See, for example, this graph of how the entropy of a block of 256 consecutive bytes (y axis) from a 4 MB jpg image changes with the block's offset (x axis). The entropy is lower at the beginning and end, and also part-way in, but it is about 7 bits per byte for most of the file.
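
    The per-block measurement described above can be sketched in Python as follows. This is not libdisorder's implementation, only an illustration under the assumption that each block's entropy is computed from its own byte frequencies. The names block_entropies, block_size and step are hypothetical.

```python
import math
from collections import Counter


def byte_entropy(data: bytes) -> float:
    """Shannon entropy of the empirical byte distribution, in bits per byte."""
    if not data:
        return 0.0
    n = len(data)
    return sum((c / n) * math.log2(n / c) for c in Counter(data).values())


def block_entropies(data: bytes, block_size: int = 256, step: int = 256):
    """Return (offset, entropy) pairs for each full block of block_size bytes.

    With step == block_size the blocks do not overlap; a smaller step
    gives a sliding window. A trailing partial block is skipped.
    """
    last_start = len(data) - block_size
    return [
        (offset, byte_entropy(data[offset:offset + block_size]))
        for offset in range(0, last_start + 1, step)
    ]


if __name__ == "__main__":
    # Synthetic input: one constant block followed by one block of all byte values.
    data = bytes(256) + bytes(range(256))
    for offset, h in block_entropies(data):
        print(offset, round(h, 3))
```

    A 256-byte block can only reach 8 bits per byte if every byte value occurs exactly once, so even well-compressed or random data will usually score below 8 at this block size.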

    Source: https://github.com/cyphunk/entropy_examples. [Note that this and other graphs are available via the novel http://nonwhiteheterosexualmalelicense.org license....]

    More interesting is the analysis and similar graphs at Analysing the byte entropy of a FAT formatted disk | GL.IB.LY

    Statistics like the max, min, mode, and standard deviation of the metric entropy for the whole file and/or the first and last blocks of it might be very helpful as a signature.
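
    One possible way to collect such statistics, as a hedged sketch: the function entropy_signature and its dictionary keys are invented for this example, the mode is taken over values rounded to one decimal place (an arbitrary choice, since exact floats rarely repeat), and the standard deviation is the population form.

```python
import math
import statistics
from collections import Counter


def byte_entropy(data: bytes) -> float:
    """Shannon entropy of the empirical byte distribution, in bits per byte."""
    if not data:
        return 0.0
    n = len(data)
    return sum((c / n) * math.log2(n / c) for c in Counter(data).values())


def entropy_signature(data: bytes, block_size: int = 256) -> dict:
    """Summary statistics of per-block entropy over non-overlapping full blocks."""
    values = [
        byte_entropy(data[i:i + block_size])
        for i in range(0, len(data) - block_size + 1, block_size)
    ]
    if not values:
        raise ValueError("data is shorter than one block")
    return {
        "min": min(values),
        "max": max(values),
        # mode of entropies rounded to 0.1 bit (rounding is an assumption)
        "mode": statistics.mode([round(v, 1) for v in values]),
        "stdev": statistics.pstdev(values),
        "first_block": values[0],
        "last_block": values[-1],
    }


if __name__ == "__main__":
    # Synthetic input: three constant blocks, then one block of all byte values.
    data = bytes(256) * 3 + bytes(range(256))
    print(entropy_signature(data))
```

    Whether these numbers actually separate plaintext, markup, compressed and binary files would need to be checked against real samples of each type.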

    This book also seems relevant: Detection and Recognition of File Masquerading for E-mail and Data Security - Springer
