Is there a safe way to run a diff on two zip compressed files?

长发绾君心 2020-12-16 12:49

Seems this would not be a deterministic thing, or is there a way to do this reliably?

13 answers
  • 2020-12-16 13:26

    A lot of the solutions here are either only checking the CRC to see if differences exist, are complicated scripts, require uncompressing to disk, use external programs, or need specific compression formats other than the one you were asking about (zcat does NOT work with zip).

    Here's one that's simple, easy to read, and should work anywhere you have bash. It shows the differences between the file contents, which is what I needed when I came across this question:

    diff \
        <(zipinfo -1 "$zip1" '*' \
        | grep '[^/]$' \
        | sort \
        | while IFS= read -r file; do unzip -c "$zip1" "$file"; done \
        ) \
        <(zipinfo -1 "$zip2" '*' \
        | grep '[^/]$' \
        | sort \
        | while IFS= read -r file; do unzip -c "$zip2" "$file"; done \
        )
    

    This decompresses in memory, not to disk, and releases data from the pipe as it diffs (it won't decompress everything and then compare, so it shouldn't use much memory).
    Want to change the diff options to ignore whitespace or show side-by-side output? Change diff to diff -w or gvimdiff (the latter keeps all files in memory), et cetera.
    Only want to diff the .js files? Change '*' to '*.js'.
    Only want to see the filenames that are missing from one or the other? Remove the while line and it won't bother decompressing.

    Easy.

    It will even safely handle filenames with "illegal" characters like newlines and backslashes: it skips them and reports them on stderr.
    Doesn't get "safe"r than this.

    slm's answer is pretty good for returning files that are different (without showing differences) and doesn't even decompress at all which is nice. If for some reason you want that but a step above CRC, in this answer you could add | sha512sum before the ; done for example and get 'the worst of both worlds' :P
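The "a step above CRC" idea can also be sketched in Python with only the standard library. This is a minimal sketch, not part of the original answer: the function names `entry_digests` and `changed_entries` are made up for illustration, and it reports which entries differ without showing the differences.

```python
import hashlib
import zipfile

def entry_digests(zip_path):
    """Map each file entry name to the SHA-512 hex digest of its uncompressed content."""
    digests = {}
    with zipfile.ZipFile(zip_path) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # directory entries carry no content
            h = hashlib.sha512()
            with archive.open(info) as stream:
                # Read in chunks so a large entry is never fully held in memory.
                for chunk in iter(lambda: stream.read(65536), b""):
                    h.update(chunk)
            digests[info.filename] = h.hexdigest()
    return digests

def changed_entries(zip1, zip2):
    """Return sorted names whose digest differs or that exist in only one archive."""
    d1, d2 = entry_digests(zip1), entry_digests(zip2)
    return sorted(name for name in d1.keys() | d2.keys() if d1.get(name) != d2.get(name))
```

Calling `changed_entries("old.zip", "new.zip")` (hypothetical file names) would list the entries worth diffing in detail. Like the shell version, nothing is written to disk.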


    Similarly it's relatively easy to compare an archive and a real directory:

    diff \
        <(zipinfo -1 "$zip" '*' \
        | grep '[^/]$' \
        | sort \
        | while IFS= read -r file; do unzip -c "$zip" "$file"; done \
        ) \
        <(find "$directory" -type f -name '*' \
        | sort \
        | while IFS= read -r file
          do
              printf 'Archive:  %s\n  inflating: %s\n' "$directory" "${file#"$directory"/}"
              cat "$file"
              echo
          done \
        )
    

    Or, ignoring files only in the directory, basically a handy dry-run of unzip -o -d "$directory":

    diff \
        <(zipinfo -1 "$zip" '*' \
        | grep '[^/]$' \
        | sort \
        | while IFS= read -r file; do unzip -c "$zip" "$file"; done \
        ) \
        <(zipinfo -1 "$zip" '*' \
        | grep '[^/]$' \
        | sort \
        | while IFS= read -r file
          do
              printf 'Archive:  %s\n  inflating: %s\n' "$directory" "$file"
              cat "$directory/$file"
              echo
          done \
        )
    

    Windows? Sorry. Whilst the scripts are simple and would be a cinch to port to the [syntactically] fantastic PowerShell, it wouldn't work. The native cmdlet only extracts to disk, and MS still hasn't fixed the broken binary data piping in PS, so you can't "safely" use an external zip.exe in this manner either.

    Apparently others have done similar things using the .NET API directly, but it'd become less of an elegant port and more of a reimplementation in .NET :|


    A note about the "illegal filenames" mentioned before:
    If you want it to work with these it actually isn't too difficult; you'll just need to swap $file with $(echo "$file" | sed 's/\\/\\\\/g;s/\^J/\n/g;s/\^M/\r/g').

    Add other ctrl chars as you happen across them.

    The reason is that even though zipinfo displays a filename containing \n as ^J, unzip will not accept these safe names, only the original! And even though unzip -^ CAN extract to those illegal filenames, there's no way to get the original filenames out of zipinfo at all. So you need to rebuild the original, illegal filename from the safe, unusable one to reference it for the diff :(
    If you do this, note that there is no way to distinguish between ^J literally and \n displaying as ^J, and that zip doesn't support / or ^@ within filenames at all.


    As a bonus, you can write all these diffs straight to an archive and keep them in a folder hierarchy matching the original files, instead of trying to read everything at once in one big splat.

    (zipinfo -1 "$zip1"; zipinfo -1 "$zip2") \
        | grep '[^/]$' \
        | sort \
        | uniq \
        | while IFS= read -r file; do
            (diff <(unzip -p "$zip1" "$file") <(unzip -p "$zip2" "$file") | zip 'diff.zip' - \
            && zipinfo -s 'diff.zip' - | awk '{ print $4; }' | grep '[^0]' \
            && printf "@ -\n@=$file\n" | zipnote -w 'diff.zip' \
            || zip -d 'diff.zip' -
            ) >/dev/null
          done
    

    Not as pretty a script, but now you can open it up in your GUI archiver of choice or do unzip -p diff.zip some/dir/some.file to see the differences for that file specifically, or be greeted with "not found" if there are no differences, which is much prettier in practice.
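For comparison, the same "one diff per file, stored in an archive" idea might look like this in Python. It is a rough sketch under two assumptions that are not in the shell version: entries are treated as text (undecodable bytes are replaced), and the helper name `write_diff_archive` is invented for illustration.

```python
import difflib
import zipfile

def write_diff_archive(zip1, zip2, out_path="diff.zip"):
    """Store one unified diff per changed entry in out_path, under the entry's own name."""
    with zipfile.ZipFile(zip1) as za, zipfile.ZipFile(zip2) as zb, \
            zipfile.ZipFile(out_path, "w", zipfile.ZIP_DEFLATED) as out:
        names_a = {n for n in za.namelist() if not n.endswith("/")}
        names_b = {n for n in zb.namelist() if not n.endswith("/")}
        for name in sorted(names_a | names_b):
            # An entry missing from one side is treated as empty.
            old = za.read(name) if name in names_a else b""
            new = zb.read(name) if name in names_b else b""
            if old == new:
                continue  # unchanged entries get no diff file at all
            # Assumption: entries are text; undecodable bytes are replaced.
            delta = difflib.unified_diff(
                old.decode(errors="replace").splitlines(keepends=True),
                new.decode(errors="replace").splitlines(keepends=True),
                fromfile=name, tofile=name)
            out.writestr(name, "".join(delta))
```

Unlike the shell pipeline, this reads each compared entry fully into memory, so it suits archives of moderately sized text files.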

  • 2020-12-16 13:31

    In general, you cannot avoid decompressing and then comparing. Different compressors will result in different DEFLATEd byte streams, which when INFLATEd result in the same original text. You cannot simply compare the DEFLATEd data, one to another. That will FAIL in some cases.
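A small experiment illustrates the point. This sketch uses Python's zlib module (a zlib-wrapped DEFLATE stream rather than a ZIP entry, which is an assumption made to keep the example short) and compresses the same bytes at two levels.

```python
import zlib

def deflate_variants(data):
    """Compress the same bytes at the fastest and the strongest zlib level."""
    return zlib.compress(data, 1), zlib.compress(data, 9)

original = b"the quick brown fox jumps over the lazy dog\n" * 200
fast, best = deflate_variants(original)

# The compressed bytes are allowed to differ between levels (and usually do)...
print("compressed streams identical:", fast == best)
# ...but both must inflate back to exactly the same content.
print("inflated content identical:",
      zlib.decompress(fast) == zlib.decompress(best) == original)
```

So a byte-level mismatch between two compressed streams says nothing about whether the underlying files differ.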

    But in a ZIP scenario, there is a CRC32 calculated and stored for each entry. So if you only want to check whether files differ, you can compare the stored CRC32 of each entry, with the caveat that CRC32 is a 32-bit checksum, so two different files can (rarely) share the same value. Comparing the file name and the CRC may fit your needs.

    You would need a ZIP library that reads zip files and exposes those things as properties on the "ZipEntry" object. DotNetZip will do that for .NET apps.
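Python's standard zipfile module exposes the same metadata, so the name-plus-CRC comparison could be sketched as follows. The function names are illustrative, and the uncompressed size is compared as well as a cheap extra check, which goes slightly beyond what the answer describes.

```python
import zipfile

def crc_index(zip_path):
    """Map each entry name to (stored CRC-32, uncompressed size) from the central directory."""
    with zipfile.ZipFile(zip_path) as archive:
        return {info.filename: (info.CRC, info.file_size) for info in archive.infolist()}

def compare_by_crc(zip1, zip2):
    """Return (only_in_first, only_in_second, changed) as sorted name lists."""
    first, second = crc_index(zip1), crc_index(zip2)
    only_first = sorted(first.keys() - second.keys())
    only_second = sorted(second.keys() - first.keys())
    changed = sorted(n for n in first.keys() & second.keys() if first[n] != second[n])
    return only_first, only_second, changed
```

Nothing is decompressed here; only the archive's directory is read, so this stays fast even for large archives.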

  • 2020-12-16 13:33

    Reliable: unzip both, diff.

    I have no idea if that answer's good enough for your use, but it works.
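As a rough sketch of "unzip both, diff" without leaving files behind, the following extracts each archive into a temporary directory and compares the trees with the standard filecmp module. The names `diff_extracted` and `_walk` are made up for this example, and it only reports which paths differ, not how.

```python
import filecmp
import tempfile
import zipfile

def diff_extracted(zip1, zip2):
    """Extract both archives to temporary directories and compare the resulting trees."""
    with tempfile.TemporaryDirectory() as dir1, tempfile.TemporaryDirectory() as dir2:
        with zipfile.ZipFile(zip1) as za:
            za.extractall(dir1)
        with zipfile.ZipFile(zip2) as zb:
            zb.extractall(dir2)
        report = {"only_first": [], "only_second": [], "changed": []}
        _walk(filecmp.dircmp(dir1, dir2), "", report)
        return report

def _walk(node, prefix, report):
    """Collect differences recursively; paths are relative to the archive root."""
    report["only_first"] += [prefix + n for n in node.left_only]
    report["only_second"] += [prefix + n for n in node.right_only]
    # dircmp's own diff_files uses a shallow stat comparison, so re-check file contents.
    _, mismatch, errors = filecmp.cmpfiles(
        node.left, node.right, node.common_files, shallow=False)
    report["changed"] += [prefix + n for n in mismatch + errors]
    for name, sub in node.subdirs.items():
        _walk(sub, prefix + name + "/", report)
```

The temporary directories are removed when the function returns, so the cost is disk space during the comparison rather than leftover files.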

  • 2020-12-16 13:33

    zipcmp compares the zip archives zip1 and zip2 and checks if they contain the same files, comparing their names, uncompressed sizes, and CRCs. File order and compressed size differences are ignored.

    sudo apt-get install zipcmp

  • 2020-12-16 13:33

    I found relief with this simple Perl script: diffzips.pl

    It recursively diffs every zip file inside the original zip, which is especially useful for different Java package formats: jar, war, and ear.

    zipcmp uses a simpler approach and doesn't recurse into nested zip archives.
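The recursive idea can be sketched in Python too. This is not a port of diffzips.pl, whose source is not shown here. It is a minimal illustration that assumes nested archives can be recognized by file extension and are small enough to load into memory, and the name `diff_nested` is invented.

```python
import io
import zipfile

# Assumption: nested archives are identified purely by their extension.
NESTED_SUFFIXES = (".zip", ".jar", ".war", ".ear")

def diff_nested(zip1, zip2, prefix=""):
    """Return paths of entries that differ, descending into nested archives."""
    differences = []
    with zipfile.ZipFile(zip1) as za, zipfile.ZipFile(zip2) as zb:
        names_a, names_b = set(za.namelist()), set(zb.namelist())
        for name in sorted(names_a ^ names_b):
            differences.append(prefix + name)  # present in only one archive
        for name in sorted(names_a & names_b):
            data_a, data_b = za.read(name), zb.read(name)
            if data_a == data_b:
                continue
            if name.lower().endswith(NESTED_SUFFIXES):
                # Nested archives are compared entry by entry, entirely in memory.
                differences += diff_nested(
                    io.BytesIO(data_a), io.BytesIO(data_b), prefix + name + "!/")
            else:
                differences.append(prefix + name)
    return differences
```

A nested archive whose bytes differ but whose entries are all identical (for example, because it was repacked) contributes nothing to the result, which is the main benefit of recursing. An entry with an archive extension that is not actually a valid zip raises zipfile.BadZipFile in this sketch.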

  • 2020-12-16 13:38

    A Python solution for zip files:

    import difflib
    import zipfile

    def diff(filename1, filename2):
        """Print content differences between two zip archives; return 0 if equal, else 1."""
        differs = False

        # Pass the paths directly so ZipFile opens them in binary mode.
        with zipfile.ZipFile(filename1) as z1, zipfile.ZipFile(filename2) as z2:
            if len(z1.infolist()) != len(z2.infolist()):
                print("number of archive elements differ: {} in {} vs {} in {}".format(
                    len(z1.infolist()), z1.filename, len(z2.infolist()), z2.filename))
                return 1
            names2 = set(z2.namelist())
            for zipentry in z1.infolist():
                if zipentry.filename not in names2:
                    print("no file named {} found in {}".format(zipentry.filename,
                                                                z2.filename))
                    differs = True
                else:
                    # ndiff needs sequences of str, so decode each entry (assumes text content).
                    lines1 = z1.read(zipentry.filename).decode(errors="replace").splitlines(keepends=True)
                    lines2 = z2.read(zipentry.filename).decode(errors="replace").splitlines(keepends=True)
                    delta = ''.join(x[2:] for x in difflib.ndiff(lines1, lines2)
                                    if x.startswith('- ') or x.startswith('+ '))
                    if delta:
                        differs = True
                        print("content for {} differs:\n{}".format(
                            zipentry.filename, delta))
        if not differs:
            print("all files are the same")
            return 0
        return 1
    

    Use as

    diff(filename1, filename2)
    

    It compares files line-by-line in memory and shows changes.
