I'd like a recipe for finding duplicated changes. patch-id is likely to be the same but the commit attributes may not be.
This seems to be an intended use of patch-
For anyone wanting to do this in Windows PowerShell, the equivalent of the command in unagi's answer is:
git rev-list --no-merges --all | %{&git.exe show $_} |
git patch-id | ConvertFrom-String -PropertyNames PatchId, Commit |
Group-Object PatchId | Where-Object count -gt 1 |
%{$_.group.Commit + " "}
Gives an output like:
1605e0e1e13d7b3f456c20432d8edec664ca7117
1e8efa8f2f01962a2c08fd25caf687d330383428
b45b6db084b27ae420ac8e9cf6511110ebb46513
4a2e1e3ba5a9a1d5db1d00343813e1404f6124e2
With the duplicate commit hashes grouped together.
CAUTION: On my repo this was a slow command so be sure to filter the call to rev-list appropriately!
I have a draft that works on a toy repo, but as it keeps the patch->commit map in memory it might have problems on large repos:
# print commit pairs with the same patch-id
for c in $(git rev-list HEAD); do \
git show $c | git patch-id; done \
| perl -anle '($p,$c)=@F;print "$c $s{$p}" if $s{$p};$s{$p}=$c'
The output should be pairs of commits with the same patch-id (3 duplicates A B C come out as "A B" then "B C").
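If pairs are awkward to read, or the in-memory map is a concern, one option is to sort the patch-id output first and then stream through it, so only the current group is held at any time. This is a minimal sketch, assuming the input is the two-column "patch-id commit" text produced by the loop above; the function name group_dupes is made up for illustration and is not part of git:

```shell
# group_dupes: hypothetical helper, not part of git.
# Reads "patch-id commit" lines on stdin and prints one line per
# patch-id that occurs more than once, listing all of its commits.
group_dupes() {
  sort | awk '
    NF < 2 { next }                    # ignore blank or malformed lines
    $1 != prev {                       # a new patch-id starts here
      if (n > 1) print group           # flush the previous group if it had duplicates
      prev = $1; group = $2; n = 1
      next
    }
    { group = group " " $2; n++ }      # same patch-id: extend the current group
    END { if (n > 1) print group }     # flush the final group
  '
}
```

It would be used by replacing the perl step, for example: for c in $(git rev-list HEAD); do git show $c | git patch-id; done | group_dupes. With three duplicates A B C this should give a single line containing all three hashes rather than two pairs.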
Change the git rev-list command to restrict the commits checked:
git log --format=%H HEAD somefile
Append "| xargs git show" to view the commits in detail, or "| xargs git show -s --oneline" for a summary:
0569473 add 6-8
5e56314 add 6-8 again
bece3c3 comment
e037ed6 add comment again
It turns out patch-id didn't work in my original case as there were additional changes in that later commit. "git log -S" was more useful.
The nifty command suggested by bsb requires a couple of small tweaks:
(1) Instead of git show, which runs git diff-tree --cc, the command should use git diff-tree -p. Otherwise git patch-id generates spurious null SHA1 hashes.
(2) When the pipe to xargs is used, xargs should have the -L 1 argument. Otherwise a triplicated commit will not be paired with an equivalent commit.
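The effect of -L 1 in point (2) can be seen without a repository. The sketch below substitutes echo for git show -s --oneline (an assumption made purely so the grouping is visible), and the sample pairs "B A" and "C B" stand in for what the perl step might emit for a commit that exists three times:

```shell
# Two pairs, as might be emitted for a commit that exists three times (A, B, C).
pairs='B A
C B'

# Without -L 1, xargs packs every hash into a single command line,
# so the pairing is lost and B is passed twice to the same command.
printf '%s\n' "$pairs" | xargs echo
# prints: B A C B

# With -L 1, the command runs once per input line, one pair at a time.
printf '%s\n' "$pairs" | xargs -L 1 echo
# prints:
# B A
# C B
```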
Here's an alias to go in ~/.gitconfig:
dup = "!f() { for c in $(git rev-list HEAD); do git diff-tree -p $c | git patch-id; done | perl -anle '($p,$c)=@F;print \"$c $s{$p}\" if $s{$p};$s{$p}=$c' | xargs -L 1 git show -s --oneline; }; f" # "git dup" lists duplicate commits
To search for duplicate commits of commit $hash, excluding merge commits:
git rev-list --no-merges --all | xargs -r git show | git patch-id \
| grep ^$(git show $hash|git patch-id|cut -c1-40) | cut -c42-81 \
| xargs -r git show -s --oneline
To search for duplicates of a merge commit $mergehash, replace $(git show $hash|git patch-id|cut -c1-40) above by one of the two patch IDs (1st column) given by git diff-tree -m -p $mergehash | git patch-id. They correspond to the diffs of the merge commit with each of its two parents.
To find duplicates of all commits, excluding merge commits:
git rev-list --no-merges --all | xargs -r git show | git patch-id \
| sort | uniq -w40 -D | cut -c42-81 \
| xargs -r git log --no-walk --pretty=format:"%h %ad %an (%cn) %s" --date-order --date=iso
The search for duplicate commits can be extended or limited by changing the arguments to git rev-list, which accepts numerous options. For example, to limit the search to a specific branch, specify its name instead of the option --all; or, to search only the last 100 commits, pass the arguments HEAD ^HEAD~100.
Note that these commands are fast since they use no shell loop, and batch-process commits.
To include merge commits, remove the option --no-merges and replace xargs -r git show by xargs -r -L1 git diff-tree -m -p. This is much slower because git diff-tree is executed once per commit.
Explanation:
The first line generates a map of the patch IDs with the commit hashes (2-column data, of 40 characters each).
The second line only keeps commit hashes (2nd column) corresponding to the duplicate patch IDs (1st column).
The last line prints custom information about the duplicate commits.
Make sure to use a recent version of Git.
The git log --format=%H mentioned in the OP bsb's answer is not always unique.
That is because, before Git 2.29 (Q4 2020), the patch-id computation did not ignore the "incomplete last line" marker the way it ignores whitespace.
See commit 82a6201 (19 Aug 2020) by René Scharfe (rscharfe).
(Merged by Junio C Hamano -- gitster -- in commit 5122614, 24 Aug 2020)
patch-id: ignore newline at end of file in diff_flush_patch_id()
Reported-by: Tilman Vogel
Initial-test-by: Tilman Vogel
Signed-off-by: René Scharfe
Whitespace is ignored when calculating patch IDs.
This is done by removing all whitespace from diff lines before hashing them, including a newline at the end of a file.
If that newline is missing, however, diff reports that fact in a separate line containing "\ No newline at end of file\n", and this marker is hashed like a context line. This goes against our goal of making patch IDs independent of whitespace.
Use the same heuristic that 2485eab55cc (git-patch-id: do not trip over "no newline" markers, 2011-02-17) added to git patch-id instead and skip diff lines that start with a backslash and a space and are longer than twelve characters.
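The heuristic described in that commit message can be approximated in a few lines of shell, which may help to see which diff lines are affected. This only illustrates the rule as quoted (a line starting with a backslash and a space, longer than twelve characters) and is not Git's actual C implementation. The function name strip_eof_markers is hypothetical, and whether Git counts the trailing newline towards the length is an assumption not checked here:

```shell
# strip_eof_markers: hypothetical helper illustrating the quoted heuristic.
# Drops diff lines that start with "\ " and are longer than 12 characters,
# such as "\ No newline at end of file"; every other line passes through.
strip_eof_markers() {
  awk '!(substr($0, 1, 2) == "\\ " && length($0) > 12)'
}

# Example: the marker line is removed, the added line is kept.
printf '%s\n' '+last line' '\ No newline at end of file' | strip_eof_markers
# prints: +last line
```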
To look for duplicates of a specific commit, this may work for you.
First, determine the patch id of the target commit:
$ THE_COMMIT_REF_OR_SHA_YOURE_SEEKING_DUPES_OF='7a3e67c'
$ git show $THE_COMMIT_REF_OR_SHA_YOURE_SEEKING_DUPES_OF | git patch-id
f6ea51cd6acd30cd627ce1a56e2733c1777d5b52 7a3e67ce38dbef471889d9f706b9161da7dc5cf3
The first SHA is the patch-id. Next, list the patch ids for every commit and keep only those that match:
$ for c in $(git rev-list --all); do git show $c | git patch-id; done | grep 'f6ea51cd6acd30cd627ce1a56e2733c1777d5b52'
f6ea51cd6acd30cd627ce1a56e2733c1777d5b52 5028e2b5500bd5f4637531e337e17b73f5d0c0b1
f6ea51cd6acd30cd627ce1a56e2733c1777d5b52 7a3e67ce38dbef471889d9f706b9161da7dc5cf3
f6ea51cd6acd30cd627ce1a56e2733c1777d5b52 929c66b5783a0127a7689020d70d398f095b9e00
All together, with a few extra flags, and in the form of a utility script:
#!/bin/sh
# Commit to look up; defaults to HEAD when no argument is given.
TARGET_COMMIT_SHA="${1:-HEAD}"
TARGET_COMMIT_PATCHID=$(
git show --patch-with-raw "$TARGET_COMMIT_SHA" |
git patch-id |
cut -d' ' -f1
)
MATCHING_COMMIT_SHAS=$(
for c in $(git rev-list --all); do
git show --patch-with-raw "$c" |
git patch-id
done |
grep -F "$TARGET_COMMIT_PATCHID" |
cut -d' ' -f2
)
echo "$MATCHING_COMMIT_SHAS"
Usage (assuming the script is saved as an executable named git-list-dupe-commits somewhere on your PATH):
$ git list-dupe-commits 7a3e67c
5028e2b5500bd5f4637531e337e17b73f5d0c0b1
7a3e67ce38dbef471889d9f706b9161da7dc5cf3
929c66b5783a0127a7689020d70d398f095b9e00
It isn't terribly speedy, but for most repos should get the job done (just measured 36 seconds for a repo with 826 commits and a 158MB .git dir on a 2.4GHz Core 2 Duo).