How to calculate the entropy of a file?

野趣味 2020-11-28 20:16

How to calculate the entropy of a file? (Or let's just say a bunch of bytes)
I have an idea, but I'm not sure that it's mathematically correct.

My id

11 answers
  •  半阙折子戏
    2020-11-28 20:45

    Without any additional information, the entropy of a file is (by definition) equal to its size * 8 bits. The entropy of a text file is roughly size * 6.6 bits, given that:

    • each character is equally probable
    • there are 95 printable ASCII characters
    • log(95)/log(2) ≈ 6.6 (quick check below)
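
    A minimal sketch of that arithmetic in Python (95 is the count of printable ASCII characters, codes 32 to 126):

        import math

        # Bits per character if each of the 95 printable ASCII characters
        # is equally likely: log2(95) ~= 6.57, usually quoted as 6.6.
        print(math.log2(95))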

    The entropy of English text is estimated to be around 0.6 to 1.3 bits per character (as explained here).

    In general, you cannot talk about the entropy of a given file. Entropy is a property of a set of files.

    If you need an entropy estimate (or, to be exact, entropy per byte), the best way is to compress the file using gzip, bz2, rar, or any other strong compressor, and then divide the compressed size by the uncompressed size. That gives a good estimate of the entropy.
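
    For example, a minimal sketch in Python using the standard zlib module (the file path is just a placeholder):

        import zlib

        def entropy_bits_per_byte(path):
            # Estimate entropy per byte as 8 * (compressed size / uncompressed size).
            with open(path, "rb") as f:
                data = f.read()
            if not data:
                return 0.0
            compressed = zlib.compress(data, 9)  # strongest zlib setting
            # Note: container overhead can push the result slightly above
            # 8 bits/byte for already-incompressible data.
            return 8 * len(compressed) / len(data)

        print(entropy_bits_per_byte("some_file.bin"))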

    Calculating entropy byte by byte, as Nick Dandoulakis suggested, gives a very poor estimate because it assumes every byte is independent. In text files, for example, a letter is much more likely to be followed by a lowercase letter than by whitespace or punctuation, since words are typically longer than two characters. So the probability of the next character being in the a-z range is correlated with the value of the previous character. Don't use Nick's rough estimate on any real data; use the gzip compression ratio instead.
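
    For comparison, the byte-by-byte estimate discussed above looks roughly like this (a sketch of the general idea, not Nick's exact code):

        import math
        from collections import Counter

        def byte_histogram_entropy(data):
            # Shannon entropy of the byte frequencies, in bits per byte.
            # Treats bytes as independent, which is why it overestimates
            # the true entropy of structured data such as text.
            counts = Counter(data)
            total = len(data)
            return -sum((c / total) * math.log2(c / total) for c in counts.values())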
