git finding duplicate commits (by patch-id)

Posted by 无人久伴 on 2019-11-29 04:09:08

Because duplicate changes are unlikely to be on the same branch (unless there are reverts between them), you can use git cherry:

git cherry [-v] [<upstream> [<head> [<limit>]]]

Where upstream would be the branch to check for duplicates of changes in head.
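As a rough illustration of what git cherry reports, here is a sketch that builds a throwaway repository and cherry-picks one commit across branches. The branch names (main, topic), file names, and commit messages are made up for the demo; only git cherry -v itself comes from the answer above.

```shell
#!/bin/sh
# Demo only: builds a scratch repository in a temp dir (all names are hypothetical).
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

repo=$(mktemp -d)
cd "$repo"
git init -q .
git symbolic-ref HEAD refs/heads/main   # fix the branch name regardless of init.defaultBranch

echo base > base.txt; git add base.txt; git commit -q -m "base"

git checkout -q -b topic
echo one > one.txt; git add one.txt; git commit -q -m "add one"
echo two > two.txt; git add two.txt; git commit -q -m "add two"

git checkout -q main
echo other > other.txt; git add other.txt; git commit -q -m "unrelated work"
git cherry-pick topic~1 >/dev/null      # duplicate "add one" onto main

# '-' marks commits on topic that already have an equivalent on main,
# '+' marks commits that do not.
cherry_out=$(git cherry -v main topic)
printf '%s\n' "$cherry_out"
```

In this setup the "add one" commit should be listed with a leading "-" (its change already exists on main) and "add two" with a leading "+".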

To look for duplicates of a specific commit, the following may work for you.

First, determine the patch id of the target commit:

$ THE_COMMIT_REF_OR_SHA_YOURE_SEEKING_DUPES_OF='7a3e67c'
$ git show $THE_COMMIT_REF_OR_SHA_YOURE_SEEKING_DUPES_OF | git patch-id
f6ea51cd6acd30cd627ce1a56e2733c1777d5b52 7a3e67ce38dbef471889d9f706b9161da7dc5cf3

The first SHA is the patch-id; the second is the commit it was computed from. Next, list the patch ids for every commit and keep only those that match:

$ for c in $(git rev-list --all); do git show $c | git patch-id; done | grep 'f6ea51cd6acd30cd627ce1a56e2733c1777d5b52'
f6ea51cd6acd30cd627ce1a56e2733c1777d5b52 5028e2b5500bd5f4637531e337e17b73f5d0c0b1
f6ea51cd6acd30cd627ce1a56e2733c1777d5b52 7a3e67ce38dbef471889d9f706b9161da7dc5cf3
f6ea51cd6acd30cd627ce1a56e2733c1777d5b52 929c66b5783a0127a7689020d70d398f095b9e00

All together, with a few extra flags, and in the form of a utility script:

test ! -z "$1" && TARGET_COMMIT_SHA="$1" || TARGET_COMMIT_SHA="HEAD"

TARGET_COMMIT_PATCHID=$(
git show --patch-with-raw "$TARGET_COMMIT_SHA" |
    git patch-id |
    cut -d' ' -f1
)
MATCHING_COMMIT_SHAS=$(
for c in $(git rev-list --all); do
    git show --patch-with-raw "$c" |
        git patch-id
done |
    grep -F "$TARGET_COMMIT_PATCHID" |
    cut -d' ' -f2
)

echo "$MATCHING_COMMIT_SHAS"

Usage:

$ git list-dupe-commits 7a3e67c
5028e2b5500bd5f4637531e337e17b73f5d0c0b1
7a3e67ce38dbef471889d9f706b9161da7dc5cf3
929c66b5783a0127a7689020d70d398f095b9e00

It isn't terribly speedy, but for most repos should get the job done (just measured 36 seconds for a repo with 826 commits and a 158MB .git dir on a 2.4GHz Core 2 Duo).

I have a draft that works on a toy repo, but as it keeps the patch->commit map in memory it might have problems on large repos:

# print commit pairs with the same patch-id
for c in $(git rev-list HEAD); do \
    git show $c | git patch-id; done \
| perl -anle '($p,$c)=@F;print "$c $s{$p}" if $s{$p};$s{$p}=$c'

The output should be pairs of commits with the same patch-id (3 duplicates A B C come out as "A B" then "B C").
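To make the pairing logic easier to follow, here is a small awk sketch of the same idea, run on synthetic input instead of real git output. P1, P2 and A to D are placeholders standing in for patch-ids and commit hashes; this is an illustrative equivalent, not the original author's code.

```shell
#!/bin/sh
# Synthetic "patch-id commit" lines; P1/P2 and A..D are placeholders, not real hashes.
# For each line, if the patch-id was seen before, print the current commit
# followed by the previous commit with that patch-id, then remember the current one.
pairs=$(printf '%s\n' 'P1 A' 'P2 D' 'P1 B' 'P1 C' |
    awk '$1 in seen { print $2, seen[$1] } { seen[$1] = $2 }')
printf '%s\n' "$pairs"
# prints:
# B A
# C B
```

Each output line puts the later-seen commit first, so three duplicates form a chain of two pairs, and D, whose patch-id is unique, never appears.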

Change the git rev-list command to restrict the commits checked:

git log --format=%H HEAD somefile

Append "| xargs git show" to view the commits in detail, or "| xargs git show -s --oneline" for a summary:

0569473 add 6-8
5e56314 add 6-8 again
bece3c3 comment
e037ed6 add comment again

It turns out patch-id didn't work in my original case as there were additional changes in that later commit. "git log -S" was more useful.
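A sketch of that situation on a scratch repository: two commits introduce the same line, but one of them also adds something else, so their patch-ids differ while git log -S still finds both. The string call_frobnicate, the file name, and the branch names are invented for the demo.

```shell
#!/bin/sh
# Demo only: scratch repository, all names are hypothetical.
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

repo=$(mktemp -d)
cd "$repo"
git init -q .
git symbolic-ref HEAD refs/heads/main

echo base > base.txt; git add base.txt; git commit -q -m "base"

git checkout -q -b topic
echo 'call_frobnicate()' > f.txt
git add f.txt; git commit -q -m "add call"

git checkout -q main
printf 'call_frobnicate()\n# extra comment\n' > f.txt
git add f.txt; git commit -q -m "add call with comment"

# Patch-ids that occur more than once: expected to be empty here,
# because the two diffs are not identical.
dupes=$(git rev-list --no-merges --all | xargs git show | git patch-id |
    cut -d' ' -f1 | sort | uniq -d)

# Pickaxe search: commits that change the number of occurrences of the string.
hits=$(git log --all --format=%s -S'call_frobnicate')
printf 'duplicate patch-ids: [%s]\n' "$dupes"
printf '%s\n' "$hits"
```

Here the patch-id comparison should report nothing, while the -S search should list both "add call" and "add call with comment".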

The nifty command suggested by bsb requires a couple of small tweaks:

(1) Instead of git show, which runs git diff-tree --cc, the command should use

    git diff-tree -p

Otherwise git patch-id generates spurious null SHA1 hashes.

(2) When the pipe to xargs is used, xargs should have the -L 1 argument. Otherwise a triplicated commit will not be paired with an equivalent commit.

Here's an alias to go in ~/.gitconfig:

dup = "!f() { for c in $(git rev-list HEAD); do git diff-tree -p $c | git patch-id; done | perl -anle '($p,$c)=@F;print \"$c $s{$p}\" if $s{$p};$s{$p}=$c' | xargs -L 1 git show -s --oneline; }; f" # "git dup" lists duplicate commits

To search for duplicate commits of commit $hash, excluding merge commits:

git rev-list --no-merges --all | xargs -r git show | git patch-id \
    | grep ^$(git show $hash|git patch-id|cut -c1-40) | cut -c42-81 \
    | xargs -r git show -s --oneline

To search for duplicates of a merge commit $mergehash, replace $(git show $hash|git patch-id|cut -c1-40) above with one of the two patch IDs (1st column) given by git diff-tree -m -p $mergehash | git patch-id. They correspond to the diffs of the merge commit against each of its two parents.

To find duplicates of all commits, excluding merge commits:

git rev-list --no-merges --all | xargs -r git show | git patch-id \
    | sort | uniq -w40 -D | cut -c42-81 \
    | xargs -r git log --no-walk --pretty=format:"%h %ad %an (%cn) %s" --date-order --date=iso

The search for duplicate commits can be widened or narrowed by changing the arguments to git rev-list, which accepts numerous options. For example, to limit the search to a specific branch, pass its name instead of the option --all; to search only the last 100 commits, pass the arguments HEAD ^HEAD~100.
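One way to make the rev-list arguments easy to swap is to wrap the pipeline in a shell function. The sketch below is a hypothetical helper (dup_commits is not a git command); it assumes an xargs that supports -r, as the commands above do, and uses awk for the grouping instead of uniq -w40 -D.

```shell
#!/bin/sh
# Hypothetical helper: print the hashes of commits whose patch-id occurs more
# than once among the commits selected by the given `git rev-list` arguments.
dup_commits() {
    git rev-list --no-merges "$@" |
        xargs -r git show |
        git patch-id |
        awk '{ count[$1]++; id[NR] = $1; sha[NR] = $2 }
             END { for (i = 1; i <= NR; i++) if (count[id[i]] > 1) print sha[i] }'
}

# Example calls (branch name "main" is only an example):
#   dup_commits --all             # search every ref
#   dup_commits main              # search a single branch
#   dup_commits HEAD ^HEAD~100    # search the last 100 commits
```

Unlike the sort | uniq version, this keeps the commits in rev-list order rather than grouping them by patch-id, and it holds the whole patch-id list in memory.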

Note that these commands are fast since they use no shell loop, and batch-process commits.

To include merge commits, remove the option --no-merges and replace xargs -r git show with xargs -r -L1 git diff-tree -m -p. This is much slower because git diff-tree is executed once per commit.

Explanation:

  • The first line generates a map of the patch IDs with the commit hashes (2-column data, of 40 characters each).

  • The second line only keeps commit hashes (2nd column) corresponding to the duplicate patch IDs (1st column).

  • The last line prints custom information about the duplicate commits.

For anyone wanting to do this in Windows PowerShell, the equivalent of the command in unagi's answer is:

git rev-list --no-merges --all  | %{&git.exe show $_} | 
  git patch-id | ConvertFrom-String -PropertyNames PatchId, Commit | 
  Group-Object PatchId | Where-Object count -gt 1 | 
  %{$_.group.Commit + " "}

This gives output like:

1605e0e1e13d7b3f456c20432d8edec664ca7117
1e8efa8f2f01962a2c08fd25caf687d330383428

b45b6db084b27ae420ac8e9cf6511110ebb46513
4a2e1e3ba5a9a1d5db1d00343813e1404f6124e2

The duplicate commit hashes are grouped together.

CAUTION: On my repo this was a slow command, so be sure to restrict the call to rev-list appropriately!
