How to compare the contents of two tarballs

Submitted by 偶尔善良 on 2019-12-02 18:54:11

tarsum is almost what you need. Take its output, run it through sort so the ordering is identical for both archives, and then compare the two results with diff. That should get you a basic implementation going, and it would be easy enough to pull those steps into the main program by modifying the Python code to do the whole job.
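
The checksum, sort, and compare steps can also be sketched in a single Python script. This is not tarsum itself, only a minimal illustration using the standard tarfile and hashlib modules; the function names member_digests and compare_tarballs are made up for this example, and SHA-256 is an arbitrary choice of hash.

```python
import hashlib
import tarfile


def member_digests(tar_path):
    """Map each regular file's name in the archive to its SHA-256 hex digest."""
    digests = {}
    # Mode "r" lets tarfile detect gzip/bzip2/xz compression on its own.
    with tarfile.open(tar_path, "r") as tar:
        for member in tar:
            if not member.isfile():
                continue  # skip directories, symlinks, devices
            h = hashlib.sha256()
            f = tar.extractfile(member)
            # Read in chunks so large members are not loaded into memory at once.
            for chunk in iter(lambda: f.read(65536), b""):
                h.update(chunk)
            digests[member.name] = h.hexdigest()
    return digests


def compare_tarballs(path_a, path_b):
    """Return (only_in_a, only_in_b, changed) as sorted lists of member names."""
    a = member_digests(path_a)
    b = member_digests(path_b)
    only_in_a = sorted(a.keys() - b.keys())
    only_in_b = sorted(b.keys() - a.keys())
    changed = sorted(n for n in a.keys() & b.keys() if a[n] != b[n])
    return only_in_a, only_in_b, changed
```

Calling compare_tarballs("old.tar.gz", "new.tar.gz") (placeholder file names) returns three lists; all three are empty when the regular files in both archives have the same names and contents. Permissions, owners, and timestamps are ignored in this sketch.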

Are you controlling the creation of these tar files?
If so, the best trick would be to create an MD5 checksum and store it in a file within the archive itself. Then, when you want to compare two archives, you just extract the checksum files and compare them.
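
One way this could look, as a rough sketch: write a manifest of per-file MD5 sums into the archive when it is created, then compare only the manifests later. The member name MANIFEST.md5 and the functions create_with_manifest, read_manifest, and same_contents are assumptions made for this example, not a tar convention.

```python
import hashlib
import io
import os
import tarfile

MANIFEST_NAME = "MANIFEST.md5"  # hypothetical member name, not a tar convention


def create_with_manifest(src_dir, tar_path):
    """Pack src_dir into tar_path and append a manifest of per-file MD5 sums."""
    lines = []
    with tarfile.open(tar_path, "w:gz") as tar:
        for root, dirs, files in os.walk(src_dir):
            dirs.sort()  # fixed traversal order
            for name in sorted(files):
                full = os.path.join(root, name)
                rel = os.path.relpath(full, src_dir).replace(os.sep, "/")
                with open(full, "rb") as f:
                    digest = hashlib.md5(f.read()).hexdigest()
                lines.append(f"{digest}  {rel}")
                tar.add(full, arcname=rel)
        # Store the sorted manifest as one extra member of the archive.
        data = ("\n".join(sorted(lines)) + "\n").encode("utf-8")
        info = tarfile.TarInfo(MANIFEST_NAME)
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))


def read_manifest(tar_path):
    """Return the manifest bytes stored in the archive."""
    with tarfile.open(tar_path, "r") as tar:
        return tar.extractfile(MANIFEST_NAME).read()


def same_contents(tar_a, tar_b):
    """Compare two archives by their embedded manifests only."""
    return read_manifest(tar_a) == read_manifest(tar_b)
```

The comparison trusts the manifest, so it only works for archives that were created this way and not modified afterwards.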


If you can afford to extract just one tar file, you can use the --diff option of tar to look for differences between the extracted files and the contents of the other tar file.
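
For illustration, the same idea (one archive extracted to disk, the other read directly) can be sketched in Python. This is not a reimplementation of tar --diff: it checks only the contents of regular files, not modes or timestamps, and the function name diff_against_dir is invented for this example.

```python
import os
import tarfile


def diff_against_dir(tar_path, directory):
    """List (name, reason) for regular-file members missing or different on disk."""
    problems = []
    with tarfile.open(tar_path, "r") as tar:
        for member in tar:
            if not member.isfile():
                continue  # only regular files are compared in this sketch
            on_disk = os.path.join(directory, member.name)
            if not os.path.isfile(on_disk):
                problems.append((member.name, "missing"))
                continue
            with open(on_disk, "rb") as f:
                disk_bytes = f.read()
            if tar.extractfile(member).read() != disk_bytes:
                problems.append((member.name, "differs"))
    return problems
```

An empty list means every regular file in the archive exists in the directory with identical contents; files that exist only in the directory are not reported.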


One more crude trick, if you are fine with comparing just the filenames and their sizes.
Remember, this does not guarantee that the files themselves are the same!

Execute tar tvf to list the contents of each archive and store the outputs in two different files. Then slice out everything except the filename and size columns. Preferably sort the two files too. Then just diff the two lists.

Just remember that this last scheme does not actually compute any checksum.

Sample tar and output (all files are zero size in this example).

$ tar tvfj pack1.tar.bz2
drwxr-xr-x user/group 0 2009-06-23 10:29:51 dir1/
-rw-r--r-- user/group 0 2009-06-23 10:29:50 dir1/file1
-rw-r--r-- user/group 0 2009-06-23 10:29:51 dir1/file2
drwxr-xr-x user/group 0 2009-06-23 10:29:59 dir2/
-rw-r--r-- user/group 0 2009-06-23 10:29:57 dir2/file1
-rw-r--r-- user/group 0 2009-06-23 10:29:59 dir2/file3
drwxr-xr-x user/group 0 2009-06-23 10:29:45 dir3/

Command to generate sorted name/size list

$ tar tvfj pack1.tar.bz2 | awk '{printf "%10s %s\n",$3,$6}' | sort -k 2
0 dir1/
0 dir1/file1
0 dir1/file2
0 dir2/
0 dir2/file1
0 dir2/file3
0 dir3/

You can take two such sorted lists and diff them.
You can also use the date and time columns if that works for you.
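
The same name/size comparison can be sketched in Python with the standard tarfile module, which avoids depending on the column layout of tar tvf output. The function names name_size_list and listing_diff are made up for this example.

```python
import tarfile


def name_size_list(tar_path):
    """Return a sorted list of (name, size) pairs for every archive member."""
    with tarfile.open(tar_path, "r") as tar:
        return sorted((m.name, m.size) for m in tar.getmembers())


def listing_diff(path_a, path_b):
    """Return the (name, size) entries found only in a and only in b."""
    a = set(name_size_list(path_a))
    b = set(name_size_list(path_b))
    return sorted(a - b), sorted(b - a)
```

As with the shell version, two files with the same name and size but different contents are reported as equal.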

Also try pkgdiff to visualize differences between packages (it detects added/removed/renamed files and changed content, and exits with code zero if nothing changed):

pkgdiff PKG-0.tgz PKG-1.tgz

I realise that this is a late reply, but I came across the thread whilst attempting to achieve the same thing. The solution that I've implemented outputs the tar to stdout, and pipes it to whichever hash you choose:

tar -xOzf archive.tar.gz | sort | sha1sum

Note that the order of the options is important: f must come last, directly before the archive name, and O tells tar to write the extracted data to stdout.

Here is my variant, which also checks the Unix permissions:

It works only if the filenames are shorter than 200 characters.

diff <(tar -tvf 1.tar | awk '{printf "%10s %200s %10s\n",$3,$6,$1}'|sort -k2) <(tar -tvf 2.tar|awk '{printf "%10s %200s %10s\n",$3,$6,$1}'|sort -k2)
Evan

Is tardiff what you're looking for? It's "a simple perl script" that "compares the contents of two tarballs and reports on any differences found between them."

If you don't need to extract the archives or see the actual differences, try diff's -q option:

diff -q 1.tar 2.tar

The output will be "Files 1.tar and 2.tar differ", or nothing if there are no differences.
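
A comparable yes/no check in Python, as a small sketch, is filecmp.cmp with shallow=False from the standard library. Like diff -q, it compares the archive files byte for byte, so two archives holding identical files can still be reported as different, for example when members are stored in a different order or carry different timestamps. The wrapper name archives_identical is made up for this example.

```python
import filecmp


def archives_identical(path_a, path_b):
    """True if the two files are byte-for-byte identical."""
    # shallow=False forces a content comparison instead of trusting os.stat() data.
    return filecmp.cmp(path_a, path_b, shallow=False)
```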

There is a tool called archdiff. It is basically a Perl script that can look into the archives.

Takes two archives, or an archive and a directory and shows a summary of the
differences between them.
Jason Swift

I had a similar problem and solved it with Python; here is the code. PS: although this code compares the contents of two zip archives, the approach is similar for tarballs. I hope it helps.

import zipfile
import os
import hashlib
import shutil

def decompressZip(zipName, dirName):
    # Extract every member into dirName and return the list of member names.
    with zipfile.ZipFile(zipName, "r") as zipFile:
        fileNames = zipFile.namelist()
        for file in fileNames:
            zipFile.extract(file, dirName)
    return fileNames

def md5sum(filename):
    # Return the upper-case hex MD5 digest of the file's contents.
    with open(filename, "rb") as f:
        md5obj = hashlib.md5()
        md5obj.update(f.read())
    return md5obj.hexdigest().upper()

if __name__ == "__main__":
    oldFileList = decompressZip("./old.zip", "./oldDir")
    newFileList = decompressZip("./new.zip", "./newDir")

    oldDict = dict()
    newDict = dict()

    # Map each extracted file (directories skipped) to its MD5 digest.
    for oldFile in oldFileList:
        tmpOldFile = "./oldDir/" + oldFile
        if not os.path.isdir(tmpOldFile):
            oldFileMD5 = md5sum(tmpOldFile)
            oldDict[oldFile] = oldFileMD5

    for newFile in newFileList:
        tmpNewFile = "./newDir/" + newFile
        if not os.path.isdir(tmpNewFile):
            newFileMD5 = md5sum(tmpNewFile)
            newDict[newFile] = newFileMD5

    additionList = list()
    modifyList = list()

    # A file is new if it is absent from the old archive, modified if its digest changed.
    for key in newDict:
        if key not in oldDict:
            additionList.append(key)
        else:
            newMD5 = newDict[key]
            oldMD5 = oldDict[key]
            if newMD5 != oldMD5:
                modifyList.append(key)

    print("new file list:%s" % additionList)
    print("modified file list:%s" % modifyList)

    # Clean up the temporary extraction directories.
    shutil.rmtree("./oldDir")
    shutil.rmtree("./newDir")

There is also diffoscope, which is more generic and can compare things recursively (including various formats).

pip install diffoscope

One can use a simple script:

#!/usr/bin/env bash
set -eu

tar1=$1
tar2=$2
shift 2
# Any remaining arguments are passed through to diff.
diff_opts=("$@")

# Extract each archive into its own temporary directory,
# and remove the directories again when the script exits.
tmp1=$(mktemp -d)
trap 'rm -rf "$tmp1"' EXIT
tar xf "$tar1" -C "$tmp1"

tmp2=$(mktemp -d)
trap 'rm -rf "$tmp1" "$tmp2"' EXIT
tar xf "$tar2" -C "$tmp2"

diff -ur "${diff_opts[@]:+${diff_opts[@]}}" "$tmp1" "$tmp2"

Usage:

diff-tars.sh TAR1 TAR2 [DIFF_OPTS]