Compute hash of only the core image data (excluding metadata) for an image

花落未央 2020-12-13 20:31

I'm writing a script to calculate the MD5 sum of an image excluding the EXIF tag.

In order to do this accurately, I need to know where the EXIF tag is located in th

4 answers
  • 2020-12-13 20:41

    One simple way to do it is to hash the core image data. For PNG, you could do this by hashing only the "critical chunks" (i.e. the ones whose names start with a capital letter). JPEG has a similar but simpler file structure.

    The visual hash in ImageMagick decompresses the image as it hashes it. In your case, you could hash the compressed image data right away, so (if implemented correctly) it should be just as quick as hashing the raw file.

    This is a small Python script illustrating the idea. It may or may not work for you, but it should at least give an indication of what I mean :)

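    import hashlib
    import os
    import struct
    
    def png(fh):
        """Hash only the IDAT (compressed pixel data) chunks of a PNG."""
        md5 = hashlib.md5()
        assert fh.read(8)[1:4] == b"PNG"
        while True:
            try:
                # Each chunk starts with a 4-byte big-endian unsigned length.
                length, = struct.unpack(">I", fh.read(4))
            except struct.error:
                break  # end of file
            if fh.read(4) == b"IDAT":
                md5.update(fh.read(length))
                fh.read(4)  # skip the CRC
            else:
                fh.seek(length + 4, os.SEEK_CUR)  # skip data and CRC
        print("Hash: %s" % md5.hexdigest())
    
    def jpeg(fh):
        """Hash everything after the start-of-scan marker of a JPEG."""
        md5 = hashlib.md5()
        assert fh.read(2) == b"\xff\xd8"
        while True:
            # Each segment starts with a 2-byte marker and a 2-byte length.
            marker, length = struct.unpack(">2H", fh.read(4))
            assert marker & 0xff00 == 0xff00
            if marker == 0xFFDA:  # start of scan
                md5.update(fh.read())
                break
            else:
                fh.seek(length - 2, os.SEEK_CUR)  # length includes itself
        print("Hash: %s" % md5.hexdigest())
    
    
    if __name__ == '__main__':
        # Open in binary mode so the bytes comparisons above work.
        with open("sample.png", "rb") as fh:
            png(fh)
        with open("sample.jpg", "rb") as fh:
            jpeg(fh)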
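
    The script above hashes only the IDAT chunks. As a minimal sketch of the "critical chunks" idea itself, the version below hashes every chunk whose name starts with a capital letter and ignores the rest. The names `png_critical_hash` and `make_chunk` are made up for this illustration, and the tiny 1x1 PNG is built by hand only so the example runs without a sample file:

```python
import hashlib
import struct
import zlib


def png_critical_hash(data):
    """Return the MD5 hex digest of the critical chunks of a PNG byte string."""
    assert data[:8] == b"\x89PNG\r\n\x1a\n", "not a PNG file"
    md5 = hashlib.md5()
    pos = 8
    while pos + 8 <= len(data):
        length, ctype = struct.unpack(">I4s", data[pos:pos + 8])
        body = data[pos + 8:pos + 8 + length]
        # Critical chunks have an uppercase first letter (IHDR, PLTE, IDAT, IEND).
        if ctype[:1].isupper():
            md5.update(ctype + body)
        pos += 12 + length  # length field + type + data + CRC
    return md5.hexdigest()


def make_chunk(ctype, body):
    """Build one PNG chunk (hypothetical helper, used only for the demo)."""
    crc = zlib.crc32(ctype + body) & 0xFFFFFFFF
    return struct.pack(">I", len(body)) + ctype + body + struct.pack(">I", crc)


# A 1x1 greyscale PNG, with and without a tEXt metadata chunk.
signature = b"\x89PNG\r\n\x1a\n"
ihdr = make_chunk(b"IHDR", struct.pack(">IIBBBBB", 1, 1, 8, 0, 0, 0, 0))
idat = make_chunk(b"IDAT", zlib.compress(b"\x00\x7f"))
text = make_chunk(b"tEXt", b"Comment\x00hello")
iend = make_chunk(b"IEND", b"")

plain = signature + ihdr + idat + iend
tagged = signature + ihdr + text + idat + iend

print(png_critical_hash(plain) == png_critical_hash(tagged))  # True
print(hashlib.md5(plain).hexdigest() == hashlib.md5(tagged).hexdigest())  # False
```

    The whole-file MD5 changes as soon as a text chunk is added, while the critical-chunk hash stays the same.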
    
  • 2020-12-13 20:42

    It is much easier to use the Python Imaging Library to extract the picture data (example in IPython):

    In [1]: from PIL import Image
    
    In [2]: import hashlib
    
    In [3]: im = Image.open('foo.jpg')
    
    In [4]: hashlib.md5(im.tobytes()).hexdigest()
    Out[4]: '171e2774b2549bbe0e18ed6dcafd04d5'
    

    This works on any type of image that PIL can handle. The tobytes method returns a byte string containing the pixel data.

    BTW, MD5 is now considered pretty weak as a cryptographic hash. Better to use SHA512:

    In [6]: hashlib.sha512(im.tobytes()).hexdigest()
    Out[6]: '6361f4a2722f221b277f81af508c9c1d0385d293a12958e2c56a57edf03da16f4e5b715582feef3db31200db67146a4b52ec3a8c445decfc2759975a98969c34'
    

    On my machine, calculating the MD5 checksum for a 2500x1600 JPEG takes around 0.07 seconds. Using SHA512, it takes 0.10 seconds. Complete example:

    #!/usr/bin/env python3
    
    from PIL import Image
    import hashlib
    import sys
    
    im = Image.open(sys.argv[1])
    print(hashlib.sha512(im.tobytes()).hexdigest(), end="")
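
    To get a rough feel for the MD5 versus SHA512 timings on your own machine, here is a small sketch using only the standard library. It assumes a zero-filled buffer the size of a 2500x1600 RGB image as a stand-in for real pixel data, so it measures the hashing step only, not the JPEG decoding, and the numbers will differ from the ones quoted above:

```python
import hashlib
import timeit

# Synthetic stand-in for the pixel data of a 2500x1600 RGB image
# (3 bytes per pixel); real data would come from im.tobytes().
pixels = bytes(2500 * 1600 * 3)

for name in ("md5", "sha512"):
    # Time a single hash over the whole buffer, best of three runs.
    seconds = min(timeit.repeat(lambda: hashlib.new(name, pixels).digest(),
                                number=1, repeat=3))
    print("%-6s %.3f s" % (name, seconds))
```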
    

    For movies, you can extract frames from them with e.g. ffmpeg, and then process them as shown above.
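
    One possible way to do the per-frame step, sketched under the assumption that the frames arrive as one raw RGB stream (for instance from something like `ffmpeg -i movie.mp4 -f rawvideo -pix_fmt rgb24 -`, with the width and height known in advance). The function name `frame_hashes` is made up for this illustration, and the demo feeds it two fake frames rather than real ffmpeg output:

```python
import hashlib
import io


def frame_hashes(stream, width, height, bytes_per_pixel=3):
    """Yield one SHA-512 hex digest per raw frame read from a binary stream."""
    frame_size = width * height * bytes_per_pixel
    while True:
        frame = stream.read(frame_size)
        if len(frame) < frame_size:
            break  # end of stream (a trailing partial frame is ignored)
        yield hashlib.sha512(frame).hexdigest()


# Demo with two fake 2x2 RGB frames instead of real ffmpeg output.
fake = io.BytesIO(bytes(12) + bytes([255]) * 12)
digests = list(frame_hashes(fake, 2, 2))
print(len(digests))  # 2
```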

  • 2020-12-13 20:43

    I would use a metadata stripper to preprocess the file before hashing:

    The ImageMagick package gives you ...

    mogrify -strip blah.jpg
    

    and if you do

    identify -list format 
    

    it apparently works with all the cited formats.

  • 2020-12-13 20:47

    You can use stream which is part of the ImageMagick suite:

    $ stream -map rgb -storage-type short image.tif - | sha256sum
    d39463df1060efd4b5a755b09231dcbc3060e9b10c5ba5760c7dbcd441ddcd64  -
    

    or

    $ sha256sum <(stream -map rgb -storage-type short image.tif -)
    d39463df1060efd4b5a755b09231dcbc3060e9b10c5ba5760c7dbcd441ddcd64  /dev/fd/63
    

    This example is for a TIFF file which is RGB with 16 bits per sample (i.e. 48 bits per pixel). So I use -map rgb and a short storage-type (you can use char here if the RGB values are 8 bits each).

    This method reports the same signature hash that the verbose ImageMagick identify command reports:

    $ identify -verbose image.tif | grep signature
    signature: d39463df1060efd4b5a755b09231dcbc3060e9b10c5ba5760c7dbcd441ddcd64
    

    (for ImageMagick v6.x; the hash reported by identify on version 7 is different to that obtained using stream, but the latter may be reproduced by any tool capable of extracting the raw bitmap data - such as dcraw for some image types.)
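
    If you would rather drive the stream pipeline from a script than from the shell, a rough Python equivalent might look like the sketch below. The helper name `hash_command_output` is made up for this illustration, and the demo runs a stand-in command so that it works without ImageMagick installed:

```python
import hashlib
import subprocess
import sys


def hash_command_output(cmd, algorithm="sha256", chunk_size=1 << 16):
    """Run cmd and return the hex digest of everything it writes to stdout."""
    digest = hashlib.new(algorithm)
    with subprocess.Popen(cmd, stdout=subprocess.PIPE) as proc:
        # Read in chunks so large images never sit in memory all at once.
        for chunk in iter(lambda: proc.stdout.read(chunk_size), b""):
            digest.update(chunk)
    if proc.returncode != 0:
        raise RuntimeError("command failed: %r" % (cmd,))
    return digest.hexdigest()


# Stand-in command so the sketch runs without ImageMagick installed.
demo = [sys.executable, "-c", "import sys; sys.stdout.write('abc')"]
print(hash_command_output(demo))

# With ImageMagick available, the command from this answer would be:
# hash_command_output(["stream", "-map", "rgb", "-storage-type", "short",
#                      "image.tif", "-"])
```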
