Cloning a git-svn repository leads to “disappearing” branches

Question

Foreword

We have a big SVN repository (200k+ commits and hundreds of branches and tags). A big, ominous, unmaintainable, frustrating mess. To work more efficiently, about a year ago I did a git svn clone on my development machine, so I develop locally in Git and then push to SVN.

We're now thinking about splitting up the repository and moving the main development branches to Git, or at least moving our development branch to Git.

Since I have my local Git repository, I wanted to run some tests by cloning a part of it and pushing it to our company's GitLab, but without much success, probably because I lack knowledge of some Git mechanisms.

Let's start

In order to do some quick tests without pushing the entire 30GB repository, I wanted to make a shallow clone of my local Git repo with the following command, and then push that clone:

git clone --depth=1 --no-single-branch file:///path/to/repo

I wanted to clone the HEAD revision of every branch, but the clone included only the master branch and our development branch, nothing else (I'm not sure about the tags, I didn't check). After a while I realised that the clone included only our dev branch because it was the only one that I ever checked out (even though the git svn repository is a full clone of the SVN repository).

I then tried to do a

git clone file:///path/to/repo

and I again got only the master and my development branch, nothing else.

In these two attempts I noticed that the clone was much smaller (200-700MB) than the original git repository (30GB). In the second try I was expecting a repository of the same size as the original.

So I realised that git is cloning only the checked out branches, not the remote ones (remotes/svn/*). Why, since the git svn repo is a full copy of the svn repo? Why is it not cloning all the branches? They are there (otherwise the git svn repo wouldn't be so big), they just aren't checked out. And... how can we talk of "remote" branches? Aren't they part of the git svn repo, and shouldn't they be considered local?

So how could I tell git to consider all those branches when cloning the git svn repo? I wouldn't like to do a massive checkout of all the branches in the git svn repo; it sounds to me like a clumsy and messy solution.

Update

Thanks for your reply. I'm sorry for not replying to you sooner, but you left me quite a lot of documentation to read, plus I had to do some other research on my own!

So, if my understanding is correct, my git-svn repository contains all the commits of the original svn repository and it's aware that the svn repository contains branches and tags, but locally it doesn't have the association between the commit's SHA1 and the label which is the branch name, and I have to add those associations manually.

Your snippet is a very useful starting point, thanks!

I also discovered the magic argument --mirror for the clone command, which also imported the remotes, so I didn't have to touch the git-svn repo; I later created the branches directly in the cloned git repo.

Answer 1:

TL;DR: you'll need to create actual branch names for each branch you want to have as a branch. Remote-tracking names just don't count when cloning (well, usually). This can be very cheap! Read on for the long explanation.

Here's a cheap way to create local branches from each refs/remotes/svn/* name:

git for-each-ref --format='%(refname)' refs/remotes/svn |
    while read -r name; do
        short=${name#refs/remotes/svn/}   # remove the icky part from the name
        [ "$short" = HEAD ] && continue   # skip a HEAD entry, if there is one
        git branch "$short" "$name"       # new branch name, same commit
    done

This (note: untested, might have some minor bugs) will print an error message for those names that have corresponding local branch names; presumably you can ignore that.
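To see the effect without touching the real repository, here is a small sandbox sketch. Everything in it is made up for the demo: the temporary directory, the name foo, and the use of git update-ref to fake the kind of refs/remotes/svn/foo name that git-svn leaves behind. It clones the fake repository once before and once after giving that name a real branch name.

```shell
work=$(mktemp -d)
git init -q "$work/src"
git -C "$work/src" symbolic-ref HEAD refs/heads/master
git -C "$work/src" -c user.name=Demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "trunk"
# Fake a git-svn style remote-tracking name that has no branch name.
git -C "$work/src" update-ref refs/remotes/svn/foo \
    "$(git -C "$work/src" rev-parse HEAD)"

git clone -q "file://$work/src" "$work/before"
before=$(git -C "$work/before" for-each-ref --format='%(refname)' refs/remotes/origin)

# Create the branch name (what the loop does for every name), then clone again.
git -C "$work/src" branch foo refs/remotes/svn/foo
git clone -q "file://$work/src" "$work/after"
after=$(git -C "$work/after" for-each-ref --format='%(refname)' refs/remotes/origin)

printf 'before:\n%s\nafter:\n%s\n' "$before" "$after"
```

Only the second listing should contain refs/remotes/origin/foo.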

... So I realised that git is cloning only the checked out branches, not the remote ones ...

There isn't really any such thing as a "remote branch". Well, unless you define "remote branch" in such a way that there is. Which ultimately leaves us with the problem of defining "branch" in the first place: see What exactly do we mean by "branch"? When being careful about this—as opposed to everyday conversation—I like to be sure to use the two-word phrase branch name to refer to names like master, which are actually already shortened: see below.

What Git deals with are commits, as found by names, and by other commits. See Think Like (a) Git for a proper definition of reachability and a lot of the associated stuff,¹ but the general idea is that names—full names like refs/heads/master or refs/remotes/svn/foo—each hold the hash ID of one commit. That one commit remembers which commit(s) come right before it. Those commits—the parent commits—remember their predecessor commits, the grandparents remember their predecessors, and so on.
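As a rough illustration of "a name holds one hash ID, and each commit remembers its parent", the following sketch builds a throwaway repository with three empty commits. The directory and commit messages are invented for the demo.

```shell
work=$(mktemp -d)
git init -q "$work/repo"
git -C "$work/repo" symbolic-ref HEAD refs/heads/master
for msg in first second third; do
    git -C "$work/repo" -c user.name=Demo -c user.email=demo@example.com \
        commit -q --allow-empty -m "$msg"
done

# The full name refs/heads/master holds the hash ID of one commit: the tip.
tip=$(git -C "$work/repo" rev-parse refs/heads/master)
# That commit records its parent; following parents finds the rest.
parent=$(git -C "$work/repo" rev-parse "$tip^")
count=$(git -C "$work/repo" rev-list --count refs/heads/master)
echo "master -> $tip"
echo "parent -> $parent"
echo "commits reachable from master: $count"
```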

What git clone does is:

1. create a new empty directory (or use one you tell it to use);
2. create a new empty repository in that directory, with git init;
3. add a remote, which consists of a simple name like origin and a URL (and some configuration; this can be slipped to step 4, or considered part of step 3);
4. do any additional necessary configuration;
5. run git fetch; and last
6. run a git checkout on a name that either you supply, or the other Git supplies, or (worst fallback case) try to git checkout master.
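The steps in this list can be approximated by hand. This is only a sketch in a throwaway sandbox (the paths and the single empty commit are made up), and the real git clone does a bit more bookkeeping, but it shows where each step fits:

```shell
work=$(mktemp -d)
# A tiny stand-in for "the other Git".
git init -q "$work/src"
git -C "$work/src" symbolic-ref HEAD refs/heads/master
git -C "$work/src" -c user.name=Demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "first"

mkdir "$work/clone"                                         # new empty directory
git init -q "$work/clone"                                   # new empty repository
git -C "$work/clone" remote add origin "file://$work/src"   # remote name, URL, fetch config
git -C "$work/clone" fetch -q origin                        # get commits, fill in origin/*
git -C "$work/clone" checkout -q master                     # create and check out master
git -C "$work/clone" branch -vv
```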

Step 5 is the most important one for you here, because git fetch is where all the main action is.

Why is it not cloning all the branches?

When git fetch runs, it gets a listing from the other Git, in which the other Git tells it about all of its names. The other Git will say, e.g., I have refs/heads/master, that's commit a123456...; I have refs/remotes/svn/foo, that's commit b789abc... and so on.
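You can look at this listing yourself with git ls-remote. The sketch below uses a made-up sandbox repository, with git update-ref faking a git-svn style refs/remotes/svn/foo name:

```shell
work=$(mktemp -d)
git init -q "$work/src"
git -C "$work/src" symbolic-ref HEAD refs/heads/master
git -C "$work/src" -c user.name=Demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "first"
git -C "$work/src" update-ref refs/remotes/svn/foo \
    "$(git -C "$work/src" rev-parse HEAD)"

# Every name the other Git offers, each with its hash ID,
# including names that a default clone will not copy.
listing=$(git ls-remote "file://$work/src")
echo "$listing"
```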

Your Git then throws out any name that does not start with refs/heads/ or refs/tags/. The resulting list of names are their Git's branch names and tag names. All other names fall into other categories. In particular, any name starting with refs/remotes/ is a remote-tracking name,² so it gets thrown out.

Your Git then asks their Git for the commits (by hash ID) and any other objects needed to make the commits complete and useful. Your Git also asks for objects identified via tag names, as long as you're taking the tags—though exactly which tags get taken when gets very complex depending on git fetch options.

Once your Git has the commit objects, and other internal objects if/as needed, your Git then copies their branch names—their refs/heads/master and the like—to your remote-tracking names. Their refs/heads/master becomes your refs/remotes/origin/master. Their refs/heads/develop (if one exists) becomes your refs/remotes/origin/develop.
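The renaming rule is stored in the clone's configuration as a fetch refspec. A minimal sketch, again in an invented sandbox with branches named master and develop:

```shell
work=$(mktemp -d)
git init -q "$work/src"
git -C "$work/src" symbolic-ref HEAD refs/heads/master
git -C "$work/src" -c user.name=Demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "first"
git -C "$work/src" branch develop

git clone -q "file://$work/src" "$work/clone"
# Left of the colon: their names. Right of the colon: what your Git calls them.
refspec=$(git -C "$work/clone" config --get remote.origin.fetch)
echo "$refspec"
git -C "$work/clone" for-each-ref --format='%(refname)' refs/remotes/origin
```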

All of this happens during the git fetch step (step 5). Options like --single-branch or --no-single-branch affect which of their branch names are matched, but not the transformation from branch name to remote-tracking name. The --mirror option does affect the transformation, eliminating it entirely, but has a sometimes-unwanted side effect of implying --bare as well.
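Here is a sketch of both behaviors in a made-up sandbox (the svn/foo name is faked with git update-ref). The first half shows --mirror copying every name unchanged into a bare clone. The second half is an alternative I am assuming you might prefer if you want a non-bare clone: an ordinary clone followed by a fetch with an explicit refspec for the refs/remotes/svn/ names.

```shell
work=$(mktemp -d)
git init -q "$work/src"
git -C "$work/src" symbolic-ref HEAD refs/heads/master
git -C "$work/src" -c user.name=Demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "first"
git -C "$work/src" update-ref refs/remotes/svn/foo \
    "$(git -C "$work/src" rev-parse HEAD)"

# --mirror: all names copied as-is, and the clone is bare.
git clone -q --mirror "file://$work/src" "$work/mirror.git"
git -C "$work/mirror.git" for-each-ref --format='%(refname)'
git -C "$work/mirror.git" rev-parse --is-bare-repository   # prints true

# Alternative: normal clone, then ask for the svn names explicitly.
git clone -q "file://$work/src" "$work/plain"
git -C "$work/plain" fetch -q origin '+refs/remotes/svn/*:refs/remotes/svn/*'
git -C "$work/plain" for-each-ref --format='%(refname)' refs/remotes/svn
```

Either way the names arrive as remote-tracking names, so you would still create branch names from them if you want real branches.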

The last step, the git checkout in step 6, has one very big side effect. The new clone you just made has no branch names.³ So git checkout master or whatever other name is clearly doomed to fail, right? But it doesn't fail. Instead, Git makes use of a clever (?) trick: When you ask to check out a branch name that does not exist, Git looks at the remote-tracking names to see if there's one that would match up. If so, Git will create the (local) branch name using the commit hash ID stored in the corresponding remote-tracking name.

So this creates whichever branch you asked for. In this case, since you didn't specify one, the other Git tells your Git which branch name it recommends. (That's usually just master anyway.) Step 6 is what creates that branch name.
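The checkout trick is easy to watch in a sandbox. In this sketch (paths and the develop branch are invented for the demo), the fresh clone has only one branch name until git checkout develop creates a second one from origin/develop:

```shell
work=$(mktemp -d)
git init -q "$work/src"
git -C "$work/src" symbolic-ref HEAD refs/heads/master
git -C "$work/src" -c user.name=Demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "first"
git -C "$work/src" branch develop

git clone -q "file://$work/src" "$work/clone"
before=$(git -C "$work/clone" for-each-ref --format='%(refname)' refs/heads)

# No local develop exists yet; Git creates it from refs/remotes/origin/develop.
git -C "$work/clone" checkout -q develop
after=$(git -C "$work/clone" for-each-ref --format='%(refname)' refs/heads)

printf 'before:\n%s\nafter:\n%s\n' "$before" "$after"
```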

If you have tags in the origin repository, you will have some number of them—between zero and all—in the new clone too. You can explicitly ask for tags later, or not, with a later git fetch. You can explicitly ask not to have tags in your new clone at clone time. Tags that you do have at this point are simply copied from those in the other repository. The idea here is that—unlike branch names, which are totally private to each repository—the tag names will be shared across all repositories, spread by repository-joining, almost like some sort of virus.⁴
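A small sketch of the tag options, assuming a Git new enough to have git clone --no-tags (the sandbox paths and the tag name v1.0 are made up):

```shell
work=$(mktemp -d)
git init -q "$work/src"
git -C "$work/src" symbolic-ref HEAD refs/heads/master
git -C "$work/src" -c user.name=Demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "first"
git -C "$work/src" tag v1.0

git clone -q "file://$work/src" "$work/with-tags"
git clone -q --no-tags "file://$work/src" "$work/no-tags"
with=$(git -C "$work/with-tags" tag -l)
without=$(git -C "$work/no-tags" tag -l)

# Ask for the tags explicitly later on.
git -C "$work/no-tags" fetch -q --tags origin
later=$(git -C "$work/no-tags" tag -l)

echo "default clone: [$with]  --no-tags: [$without]  after fetch --tags: [$later]"
```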

Since your source repository has mostly just remote-tracking names, rather than branches, your clone, shallow or not, omits those names and the commits that are reachable only from them.

¹This differs quite a bit from SVN, in which there's a single central server that can simply number each revision sequentially. Git literally can't rely on sequential numbering, because there may be separate clones that are sequentially-but-parallel-ly (apologies for the non-word here 😀) acquiring different commits. That is, suppose clones A and B are identical and each have 500 commits. Then Alice, who's working in clone A, creates commit #501. Meanwhile Bob, working in clone B, creates commit #501. The two commits are different—maybe on different branches—and they're both #501. Sequential numbers cannot work here.

²Git calls this a remote-tracking branch name. I used to use this phrase, but I now think the word branch in here is more misleading than useful. You can call it what you want: just remember that it's not a branch name as those actually start with refs/heads/.

Note: Git usually strips off the refs/heads/, refs/tags/, and refs/remotes/ parts here when printing the names, on the assumption that the output will still be clear enough. Sometimes Git only strips off refs/ though: try git branch -r, then try git branch -a. (Why are these different? It is a mystery.)
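To compare the two listings side by side, here is a sketch in a throwaway clone (all paths are made up). The same remote-tracking name shows up with a different amount of prefix stripped:

```shell
work=$(mktemp -d)
git init -q "$work/src"
git -C "$work/src" symbolic-ref HEAD refs/heads/master
git -C "$work/src" -c user.name=Demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "first"
git clone -q "file://$work/src" "$work/clone"

remote_only=$(git -C "$work/clone" branch -r)   # refs/remotes/ stripped
all=$(git -C "$work/clone" branch -a)           # only refs/ stripped for remote-tracking names
full=$(git -C "$work/clone" for-each-ref --format='%(refname)')   # nothing stripped

printf -- '-r:\n%s\n-a:\n%s\nfull names:\n%s\n' "$remote_only" "$all" "$full"
```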

³If you used --mirror, your new clone has all the branch names, but then git clone skips step 6. Your new clone is bare so there is no work-tree, and git checkout cannot be used.

⁴This is also how commits spread. Suppose you have commits W, X, and Y in a row, that they don't have. You connect to their Git as a push operation, and you give them all these three commits and ask them to set one of their names to remember commit Y, which remembers X, which remembers W, which remembers a commit they already have.

Or: they have these commits and you don't. You connect to their Git as a fetch operation, they give you all three, and your Git sets your origin/whatever to remember commit Y now.

Basically, you get two Git repositories to mate. One sends, the other receives. The receiver gets all the new stuff that the receiver asks for that the sender sends, even if the receiver in the end didn't really want it after all: at this point, the receiver can reject the request to update some name to remember the last commit in a chain of commits. The receiver thereby keeps their old name and its old hash ID, or has no name (and no hash ID).

A commit or other Git object that cannot be found from any name is eventually garbage-collected and tossed out. For bare repositories this tends to be faster, and since Git 2.11, the server "receive commits and other Git objects" process sticks them in a quarantine area first, before deciding that they're good and accepting them, or deciding that they're bad and rejecting them. The accepted ones then migrate from quarantine to the real repository database, with the rejected ones being tossed quickly. Pre-2.11 the received objects went in right away, temporarily bloating servers that, e.g., reject large files (think of GitHub's 100MB file size limits).

Shallow clones modify (some of) these rules: with a shallow clone, the receiving Git has a special file, .git/shallow, full of hash IDs. These are the commits at the edge of the truncated history: the receiver has them, but pretends that they have no parents. During a fetch the receiver tells the sender about these boundary commits, so that the sender does not assume that the receiver also has everything behind them.
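For a quick look at that special file, here is a sketch with a made-up three-commit repository cloned at depth 1. It assumes a Git recent enough to support git rev-parse --is-shallow-repository:

```shell
work=$(mktemp -d)
git init -q "$work/src"
git -C "$work/src" symbolic-ref HEAD refs/heads/master
for msg in first second third; do
    git -C "$work/src" -c user.name=Demo -c user.email=demo@example.com \
        commit -q --allow-empty -m "$msg"
done

git clone -q --depth=1 "file://$work/src" "$work/shallow"
git -C "$work/shallow" rev-parse --is-shallow-repository   # prints true
cat "$work/shallow/.git/shallow"                           # the recorded hash IDs
visible=$(git -C "$work/shallow" rev-list --count HEAD)
echo "commits visible in the shallow clone: $visible"
```

The source repository has three commits, but the shallow clone can only walk one of them.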

Source: https://stackoverflow.com/questions/58635202/cloning-a-git-svn-repository-leads-to-disappearing-branches

Tags

git

git-branch

git-svn

git-clone