Removing big files from Git history

问题

I've read multiple answers advising on using either filter-branch or BFG to accomplish this task, but I feel I need further advice because my situation is a bit peculiar.

I have to manage two repositories, one is basically a clone of the other, and ideally, I'd want to pull the changes from the origin into the clone on a daily basis. However, the origin repo contains very big files in its history, which are above Github's size limits. So I have to remove these files, but at the same time, I don't want to harm the existing commit history beyond the changes to those specific files. From what I understand, BFG performs a complete rewrite of the history, which will fool Github into thinking that all existing files were deleted and recreated as new files, whereas filter-branch doesn't do that, but it's also extremely slow by comparison, and my repository is very big reaching about 100000 commits...

So I'm trying to figure out what's the best way to go about this. Should I use BFG at certain points, and simply accept that I'm gonna see ridiculous pull requests as a result of its modifications, or maybe I should use filter-branch in some manner? To clarify, there are only 3 files which are the cause of this grievance.

回答1:

Commit history in Git is nothing but commits.

No commit can ever be changed. So for anything to remove a big file from some existing commit, that thing—whether it's BFG, or git filter-branch, or git filter-repo, or whatever—is going to have to extract a "bad" commit, make some changes (e.g., remove the big file), and make a new and improved substitute commit.

The terrible part of this is that each subsequent commit encodes, in an unchangeable way, the raw hash ID of the bad commit. The immediate children of the bad commit encode it as their parent hash. So you—or the tool—must copy those commits to new-and-improved ones. What's improved about them is that they lack the big file and refer back to the replacement they just made for the initial bad commit.

Of course, their children encode their hash IDs as parent hash IDs, so now the tool must copy those commits. This repeats all the way up to the last commit in each branch, as identified by the branch name:

...--o--o--x--o--o--o   [old, bad version of branch]
         \
          ●--●--●--●   <-- branch

where x is the bad commit: x had to be copied to the first new-and-improved ● but then all subsequent commits had to be copied too.

The copies, being different commits, have different hash IDs. Every clone must now abandon the "bad" commits—the x one and all its descendants—in favor of the new-and-improved ones.

All these repository-editing tools should strive to make minimal changes. The BFG is probably the fastest and most convenient to use, but git filter-branch can be told to copy only all bad-and-descendant commits and to use --index-filter, which is its fastest (still slow!) filter. To do this, use:

git filter-branch --index-filter <command> -- <hash>..branch1 <hash>..branch2 ...

where the <command> is an appropriate "git rm --cached --ignore-unmatch" command (be sure to quote the whole thing) and the <hash> and branch names specify which commits to copy. Remember that A..B syntax means don't look at commit A or earlier, while looking at commits B and earlier so if commit x is, say, deadbeefbadf00d..., you'll want to use the hash of its parent as the limiter:

git filter-branch --index-filter "..." -- deadbeefbadf00d^..master

for instance (fill in the ... part with the right removal command).

(Note: I have not actually used The BFG, but if it re-copies commits unnecessarily, that's really bad, and I bet it does not.)

来源：https://stackoverflow.com/questions/59028333/removing-big-files-from-git-history

标签

git

github

git-filter-branch

git-rewrite-history

bfg-repo-cleaner