git: finding duplicate commits (by patch-id)

天命终不由人 2020-12-16 02:05

I'd like a recipe for finding duplicated changes. The patch-id is likely to be the same, but the commit attributes may not be.

This seems to be an intended use of patch-

7 answers
  • 2020-12-16 02:12

    For anyone wanting to do this in Windows PowerShell, the equivalent of the command in unagi's answer is:

    git rev-list --no-merges --all  | %{&git.exe show $_} | 
      git patch-id | ConvertFrom-String -PropertyNames PatchId, Commit | 
      Group-Object PatchId | Where-Object count -gt 1 | 
      %{$_.group.Commit + " "}
    

    This gives output like:

    1605e0e1e13d7b3f456c20432d8edec664ca7117
    1e8efa8f2f01962a2c08fd25caf687d330383428
    
    b45b6db084b27ae420ac8e9cf6511110ebb46513
    4a2e1e3ba5a9a1d5db1d00343813e1404f6124e2
    

    The duplicate commit hashes are grouped together.

    CAUTION: On my repo this was a slow command, so be sure to restrict the call to rev-list appropriately!

  • 2020-12-16 02:14

    I have a draft that works on a toy repo, but as it keeps the patch->commit map in memory it might have problems on large repos:

    # print commit pairs with the same patch-id
    for c in $(git rev-list HEAD); do \
        git show $c | git patch-id; done \
    | perl -anle '($p,$c)=@F;print "$c $s{$p}" if $s{$p};$s{$p}=$c'
    

    The output should be pairs of commits with the same patch-id (3 duplicates A B C come out as "A B" then "B C").
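    If the chained pairs are awkward to read, one option is to print every commit that shares a patch-id on a single line instead. This is only a sketch: group_dupes is a made-up helper name, and it assumes its input is the two-column "patch-id commit" output of git patch-id.

```shell
# Hypothetical helper: read "patch-id commit" lines on stdin and print one
# line per duplicated patch-id, listing all commits that share it.
group_dupes() {
    awk '
        {
            if (++count[$1] == 1) order[++k] = $1   # remember first-seen order
            commits[$1] = (count[$1] == 1) ? $2 : (commits[$1] " " $2)
        }
        END {
            for (i = 1; i <= k; i++)
                if (count[order[i]] > 1) print commits[order[i]]
        }
    '
}

# Possible use with the loop above:
#   for c in $(git rev-list HEAD); do git show $c | git patch-id; done | group_dupes
```

    With three duplicates A, B and C this prints "A B C" once rather than two overlapping pairs. Like the perl version, it keeps the whole map in memory.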

    Change the git rev-list command to restrict the commits checked:

    git log --format=%H HEAD somefile
    

    Append "| xargs git show" to view the commits in detail, or "| xargs git show -s --oneline" for a summary:

    0569473 add 6-8
    5e56314 add 6-8 again
    bece3c3 comment
    e037ed6 add comment again
    

    It turns out patch-id didn't work in my original case as there were additional changes in that later commit. "git log -S" was more useful.
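    For reference, a minimal sketch of that kind of "git log -S" search. The wrapper name commits_touching and the output format are my own choices for illustration; only git log -S itself comes from the answer above.

```shell
# Hypothetical wrapper: list commits (newest first) that changed the number
# of occurrences of a literal string, i.e. commits that added or removed it.
# Usage: commits_touching 'string' [path...]
commits_touching() {
    needle=$1
    shift
    git log -S"$needle" --format='%h %s' -- "$@"
}
```

    Running commits_touching 'some text' somefile inside a repository lists the commits where that text appeared or disappeared in somefile, even when the surrounding patch differs.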

  • 2020-12-16 02:17

    The nifty command suggested by bsb requires a couple of small tweaks:

    (1) Instead of git show, which runs git diff-tree --cc, the command should use

        git diff-tree -p
    

    Otherwise git patch-id generates spurious null SHA1 hashes.

    (2) When the pipe to xargs is used, xargs should have the -L 1 argument. Otherwise a triplicated commit will not be paired with an equivalent commit.
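    To make the effect of -L 1 visible without a repository, this sketch swaps git show for echo (an assumption made purely for illustration) and feeds it the two pair lines from the A B C example:

```shell
# Pair lines as printed for three duplicates A, B, C.
pairs='A B
B C'

# Default xargs: a single invocation receives all four hashes.
batched=$(printf '%s\n' "$pairs" | xargs echo show:)

# xargs -L 1: one invocation per input line, so each pair stays together.
per_line=$(printf '%s\n' "$pairs" | xargs -L 1 echo show:)

printf '%s\n' "$batched" "$per_line"
```

    This prints "show: A B B C" for the default case, then "show: A B" and "show: B C" for the -L 1 case, which is the per-pair grouping the alias relies on.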

    Here's an alias to go in ~/.gitconfig:

    dup = "!f() { for c in $(git rev-list HEAD); do git diff-tree -p $c | git patch-id; done | perl -anle '($p,$c)=@F;print \"$c $s{$p}\" if $s{$p};$s{$p}=$c' | xargs -L 1 git show -s --oneline; }; f" # "git dup" lists duplicate commits
    
  • 2020-12-16 02:18

    To search for duplicate commits of commit $hash, excluding merge commits:

    git rev-list --no-merges --all | xargs -r git show | git patch-id \
        | grep ^$(git show $hash|git patch-id|cut -c1-40) | cut -c42-80 \
        | xargs -r git show -s --oneline
    

    For searching the duplicate of a merge commit $mergehash, replace $(git show $hash|git patch-id|cut -c1-40) above by one of the two patch IDs (1st column) given by git diff-tree -m -p $mergehash | git patch-id. They correspond to the diffs of the merge commit with each of its two parents.

    To find duplicates of all commits, excluding merge commits:

    git rev-list --no-merges --all | xargs -r git show | git patch-id \
        | sort | uniq -w40 -D | cut -c42-80 \
        | xargs -r git log --no-walk --pretty=format:"%h %ad %an (%cn) %s" --date-order --date=iso
    

    The search for duplicate commits can be extended or limited by changing the arguments to git rev-list, which accepts numerous options. For example, to limit the search to a specific branch, specify its name instead of the option --all; to search only the last 100 commits, pass the arguments HEAD ^HEAD~100.

    Note that these commands are fast because they batch-process commits instead of using a shell loop.

    To include merge commits, remove the option --no-merges, and replace xargs -r git show by xargs -r -L1 git diff-tree -m -p. This is much slower because git diff-tree is executed once per commit.

    Explanation:

    • The first line generates a map of the patch IDs with the commit hashes (2-column data, of 40 characters each).

    • The second line only keeps commit hashes (2nd column) corresponding to the duplicate patch IDs (1st column).

    • The last line prints custom information about the duplicate commits.
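    The -w and -D options of uniq are not specified by POSIX, so they may be unavailable outside GNU coreutils. A possible stand-in for the second line, using only sort and awk, is sketched below; dup_commits is a hypothetical name, and the input is assumed to be the "patch-id commit" lines produced by the first line of the pipeline.

```shell
# Hypothetical replacement for "sort | uniq -w40 -D | cut -c42-80":
# print the commit hash (2nd column) of every line whose patch ID
# (1st column) occurs more than once, ordered by patch ID.
dup_commits() {
    sort | awk '
        { id[NR] = $1; commit[NR] = $2; count[$1]++ }
        END {
            for (i = 1; i <= NR; i++)
                if (count[id[i]] > 1) print commit[i]
        }
    '
}

# Possible use:
#   git rev-list --no-merges --all | xargs -r git show | git patch-id | dup_commits
```

    Unlike uniq, this buffers all lines until the end of input, so it trades some memory for portability.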

  • 2020-12-16 02:21

    Make sure to use a recent version of Git

    The patch IDs that the answer by the OP (bsb) relies on do not always match for equivalent changes.

    That is because, before Git 2.29 (Q4 2020), the patch-id computation did not ignore the "incomplete last line" marker the way it ignores whitespace.

    See commit 82a6201 (19 Aug 2020) by René Scharfe (rscharfe).
    (Merged by Junio C Hamano -- gitster -- in commit 5122614, 24 Aug 2020)

    patch-id: ignore newline at end of file in diff_flush_patch_id()

    Reported-by: Tilman Vogel
    Initial-test-by: Tilman Vogel
    Signed-off-by: René Scharfe

    Whitespace is ignored when calculating patch IDs.
    This is done by removing all whitespace from diff lines before hashing them, including a newline at the end of a file.
    If that newline is missing, however, diff reports that fact in a separate line containing "\ No newline at end of file\n", and this marker is hashed like a context line.

    This goes against our goal of making patch IDs independent of whitespace.

    Use the same heuristic that 2485eab55cc (git-patch-id: do not trip over "no newline" markers, 2011-02-17) added to git patch-id instead and skip diff lines that start with a backslash and a space and are longer than twelve characters.
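    As an illustration only (this is not Git's actual code), the line-level rule described in the commit message can be mimicked with a small awk filter. normalize_diff_lines is a hypothetical name, and the set of characters treated as whitespace here is an assumption.

```shell
# Illustrative sketch, not Git's implementation: drop "\ No newline at end
# of file" style markers and strip whitespace from the remaining diff lines,
# roughly as described for the patch-id computation.
normalize_diff_lines() {
    awk '
        # Skip lines starting with backslash + space that are longer than 12 chars.
        substr($0, 1, 2) == "\\ " && length($0) > 12 { next }
        # Remove spaces, tabs and carriage returns (assumed whitespace set).
        { gsub(/[ \t\r]/, ""); print }
    '
}
```

    With this filter, a diff whose last line lacks a trailing newline and one that has it reduce to the same text, which is the property the fix restores for patch IDs.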

  • 2020-12-16 02:32

    To look for duplicates of a specific commit, this may work for you.

    First, determine the patch id of the target commit:

    $ THE_COMMIT_REF_OR_SHA_YOURE_SEEKING_DUPES_OF='7a3e67c'
    $ git show $THE_COMMIT_REF_OR_SHA_YOURE_SEEKING_DUPES_OF | git patch-id
    f6ea51cd6acd30cd627ce1a56e2733c1777d5b52 7a3e67ce38dbef471889d9f706b9161da7dc5cf3
    

    The first SHA is the patch-id. Next, list the patch ids for every commit and keep only those that match:

    $ for c in $(git rev-list --all); do git show $c | git patch-id; done | grep 'f6ea51cd6acd30cd627ce1a56e2733c1777d5b52'
    f6ea51cd6acd30cd627ce1a56e2733c1777d5b52 5028e2b5500bd5f4637531e337e17b73f5d0c0b1
    f6ea51cd6acd30cd627ce1a56e2733c1777d5b52 7a3e67ce38dbef471889d9f706b9161da7dc5cf3
    f6ea51cd6acd30cd627ce1a56e2733c1777d5b52 929c66b5783a0127a7689020d70d398f095b9e00
    

    All together, with a few extra flags, and in the form of a utility script:

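    #!/bin/sh
    # List every commit whose patch-id matches that of the target commit.
    # Defaults to HEAD when no commit is given.
    TARGET_COMMIT_SHA="${1:-HEAD}"
    
    # git patch-id prints "<patch-id> <commit>"; keep only the first field.
    TARGET_COMMIT_PATCHID=$(
    git show --patch-with-raw "$TARGET_COMMIT_SHA" |
        git patch-id |
        cut -d' ' -f1
    )
    # Compute the patch-id of every commit and keep the commits that match.
    MATCHING_COMMIT_SHAS=$(
    for c in $(git rev-list --all); do
        git show --patch-with-raw "$c" |
            git patch-id
    done |
        grep -F "$TARGET_COMMIT_PATCHID" |
        cut -d' ' -f2
    )
    
    echo "$MATCHING_COMMIT_SHAS"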
    

    Usage (with the script saved as an executable named git-list-dupe-commits somewhere on your PATH):

    $ git list-dupe-commits 7a3e67c
    5028e2b5500bd5f4637531e337e17b73f5d0c0b1
    7a3e67ce38dbef471889d9f706b9161da7dc5cf3
    929c66b5783a0127a7689020d70d398f095b9e00
    

    It isn't terribly speedy, but it should get the job done for most repos (I just measured 36 seconds for a repo with 826 commits and a 158MB .git dir on a 2.4GHz Core 2 Duo).
