Question
I am just wondering what happens when a fork is done on github.
For example, when I fork a project does it make a copy on github server of all of that code, or just create a link to it?
Another question: since git hashes every file, if you add the same file again it does not need to store the contents a second time because the hash is already in the system, correct?
Is GitHub like this? If I happen to upload the exact same piece of code as another user, does GitHub essentially just create a link to that file when it stores it, since it would have the same hash, or does it save all of its contents again separately?
Any enlightenment would be great, thanks!
Answer 1:
GitHub follows exactly the same semantics as git, with a web-based interface wrapped around it.
Storage: "Git stores each revision of a file as a unique blob object"
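So each distinct file content is stored once, keyed by the SHA-1 hash of that content. Within one repository, two files with identical contents map to the same blob, and a changed file gets a new hash and therefore a new blob.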
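To see the content addressing in action, here is a minimal sketch for a scratch directory. The file names a.txt and b.txt and the temporary directory are made up for illustration. It only shows what stock git does inside a single local repository and says nothing about GitHub's servers.

```shell
set -e
demo=$(mktemp -d)            # hypothetical scratch directory
cd "$demo"
git init -q .

echo "same content" > a.txt
cp a.txt b.txt               # second file, byte-for-byte identical

# -w writes the blob into .git/objects and prints its hash
hash_a=$(git hash-object -w a.txt)
hash_b=$(git hash-object -w b.txt)
echo "a.txt -> $hash_a"
echo "b.txt -> $hash_b"

# count-objects reports loose objects; the first field is the count
blob_count=$(git count-objects | cut -d' ' -f1)
echo "loose objects stored: $blob_count"
```

Both files hash to the same value, so the second write finds the object already present and nothing new is stored.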
As for github, a fork is essentially a clone. This means that a new fork is a new area of storage on their servers, with a reference to its ORIGIN. It in no way would set up links between the two, because git by nature can track remotes. Each fork knows the upstream.
When you say "if I happen to upload the exact same piece of code as another user", the term "upload" is a bit vague in the git sense. If you are working in the same repository and git lets you commit the file at all, then its contents differed from the previous revision and git recorded a new one. If you mean working on a clone or fork of another repo, the situation is the same, and no links would be made on the filesystem to the other repo.
I can't claim to have any intimate knowledge of what optimizations github might be making under the hood, on their internal system. They could possibly be doing intermediate custom operations to save on disk space. But anything they would be doing would be transparent to you and would not matter much, since effectively it should always operate under expected git semantics.
A developer at github wrote a blog post about how they internally do their own git workflow. While it doesn't relate to your question about how they manage the actual workflow of the service, I think this quote from the conclusion is pretty informative:
Git itself is fairly complex to understand, making the workflow that you use with it more complex than necessary is simply adding more mental overhead to everybody’s day. I would always advocate using the simplest possible system that will work for your team and doing so until it doesn’t work anymore and then adding complexity only as absolutely needed.
What I take away from that is that they acknowledge how complex git is by itself, so most likely they take the lightest touch possible in wrapping it to provide the service, and let git do natively what it does best.
Answer 2:
According to https://enterprise.github.com/releases/2.2.0/notes, GitHub Enterprise (and I assume GitHub) somehow shares objects between forks to reduce disk space usage:
This release changes the way GitHub Enterprise stores repositories, which reduces disk usage by sharing Git objects between forks and improves caching performance when reading repository data.
There are also more details about how they do it at https://githubengineering.com/counting-objects.
Answer 3:
I don't know exactly how GitHub does it, but here is a possible way. It requires some knowledge of the way git stores its data.
The short answer is that the repos can share the objects database but each have their own references.
We can even simulate it locally for a proof-of-concept.
In the directory of a bare repo (or in the .git/ subdir if it is not bare) there are three things that are the minimum for a repo to work:
- the objects/ subdirectory, which stores all the objects (commits, trees, blobs ...). They are stored either individually as files with names equal to the hash of the object or in .pack files.
- the refs/ subdirectory, which stores simple files like refs/heads/master whose contents is the hash of the object it references.
- the HEAD file, which says what is the current commit. Its value is either a raw hash (which corresponds to a detached head, i.e. we are not on any named branch) or a textual link to a ref where the actual hash can be found (for example ref: refs/heads/master, which would mean we are on branch master).
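The three pieces listed above can be checked on a freshly created bare repo. This is only a sketch: the repo name layout_demo.git and the scratch directory are hypothetical, and the branch named in HEAD depends on the local init.defaultBranch setting.

```shell
set -e
demo=$(mktemp -d)                      # hypothetical scratch directory
cd "$demo"
git init -q --bare layout_demo.git     # hypothetical repo name

# A bare repo keeps HEAD, objects/ and refs/ at its top level
ls layout_demo.git

# In a new repo HEAD is a textual link to a branch ref, not a raw hash
head_content=$(cat layout_demo.git/HEAD)
echo "HEAD contains: $head_content"
```

git creates a few other entries as well (config, hooks and so on), but a directory is recognised as a repo with just these three.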
Let's suppose someone creates an original (not forked) repo orig at GitHub.
To simulate, locally we do
$ git init --bare github_orig
We imagine that the above happens on the GitHub servers. Now there is an empty GitHub repository. Then we imagine that from our own PC we clone the GitHub repo:
$ git clone github_orig local_orig
Of course in real life instead of github_orig we will use https://github... . Now we have cloned the GitHub repo in local_orig.
$ cd local_orig/
$ echo zzz > file
$ git add file
$ git commit -m initial
$ git push
$ cd ..
After this, github_orig's objects directory will contain our pushed commit object, one blob object for file and one tree object. The refs/heads/master file will contain the commit hash.
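One way to check that claim is to list the object types in the bare repo and compare the branch file with the pushed commit. The sketch below repeats the steps so far in a scratch directory. The demo identity is an assumption, and the branch may be called something other than master depending on the local git configuration, so the script reads its name from HEAD.

```shell
set -e
demo=$(mktemp -d) && cd "$demo"
# Identity for the throwaway commit (hypothetical values)
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

git init -q --bare github_orig
git clone -q github_orig local_orig    # warns that the repo is empty
(cd local_orig && echo zzz > file && git add file &&
  git commit -q -m initial && git push -q origin HEAD)

# Type of every object stored in the bare repo, one word per object
object_types=$(git -C github_orig cat-file --batch-all-objects \
  --batch-check='%(objecttype)' | sort | paste -sd ' ' -)
echo "object types: $object_types"

# The branch file named by HEAD holds the hash of the pushed commit
branch_ref=$(git -C github_orig symbolic-ref HEAD)
ref_hash=$(cat "github_orig/$branch_ref")
commit_hash=$(git -C local_orig rev-parse HEAD)
echo "$branch_ref -> $ref_hash"
```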
Now let's imagine what could be happening when someone hits the Fork button.
We will create a git repo but by hand:
$ mkdir github_fork
$ cd github_fork/
$ cp ../github_orig/HEAD .
$ cp -r ../github_orig/refs .
$ ln -s ../github_orig/objects
$ cd ..
Notice that we copy HEAD and refs but we make a symbolic link for objects. As we can see, making a fork is very cheap. Even if we have tens of branches, each of them is simply a file in the refs/heads directory which contains a simple hexadecimal hash (40 characters). For objects we only link to the original objects directory - we do not copy anything!
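Now we simulate the user who made the fork cloning the forked repo locally. The --no-local option makes git use its regular transport, as a clone from GitHub would; without it, recent git versions may refuse to copy an objects directory that is a symbolic link:
$ git clone --no-local github_fork local_fork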
$ cd local_fork
$ ls -AF
.git/ file
We can see that we have successfully cloned, although the repo that we clone from does not have its own objects but links to those of the original repo.
Now the forking user may make branches and commits and then push them to github_fork. Objects will be pushed into the objects directory, which is the same one used by github_orig! But refs and HEAD will be modified and will no longer match the ones in github_orig.
So the bottom line is that all repos that belong to the same forking tree share a common objects pool while each repo contains its own references. Anyone who pushes commits to his own forked repo modifies his own references but puts the objects in a shared pool.
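The whole simulation can be run end to end to confirm that behaviour. The sketch below is a proof of concept under the same assumptions as the walkthrough, not a description of what GitHub actually runs. The branch name fork-branch, the file fork_only and the demo identity are hypothetical, and --no-local is used so that the clone goes through git's regular transport.

```shell
set -e
demo=$(mktemp -d) && cd "$demo"
# Identity for the throwaway commits (hypothetical values)
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# Original repo with one pushed commit
git init -q --bare github_orig
git clone -q github_orig local_orig
(cd local_orig && echo zzz > file && git add file &&
  git commit -q -m initial && git push -q origin HEAD)

# Hand-made fork: its own HEAD and refs, a symlink for objects
mkdir github_fork
cp github_orig/HEAD github_fork/
cp -r github_orig/refs github_fork/
ln -s ../github_orig/objects github_fork/objects

# The forking user pushes a branch that exists only in the fork
git clone -q --no-local github_fork local_fork
(cd local_fork && echo yyy > fork_only && git add fork_only &&
  git commit -q -m "fork commit" &&
  git push -q origin HEAD:refs/heads/fork-branch)
fork_commit=$(git -C local_fork rev-parse HEAD)

# The new commit can be read through github_orig's object store...
git -C github_orig cat-file -t "$fork_commit"
# ...but only the fork has a ref that points at it
git -C github_fork for-each-ref --format='fork: %(refname)'
git -C github_orig for-each-ref --format='orig: %(refname)'
```

The fork lists two branches and the original lists one, while both read the same set of objects.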
Of course, to be really usable some more things must be taken care of. Most importantly, the git garbage collector must not be invoked unless the repo where it is invoked has knowledge of all references, not just its own. Otherwise it could discard objects in the shared pool which are not reachable from its own references but could be reachable from other repos' refs.
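The hazard is easy to reproduce with the same local simulation. The sketch below runs git prune, the step of garbage collection that deletes unreachable loose objects, from the original repo, which knows only its own refs. As before, fork-branch, fork_only and the demo identity are hypothetical names, and this illustrates the symlink setup only, not GitHub's real storage.

```shell
set -e
demo=$(mktemp -d) && cd "$demo"
# Identity for the throwaway commits (hypothetical values)
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# Original repo, hand-made fork and a fork-only branch, as in the walkthrough
git init -q --bare github_orig
git clone -q github_orig local_orig
(cd local_orig && echo zzz > file && git add file &&
  git commit -q -m initial && git push -q origin HEAD)
mkdir github_fork
cp github_orig/HEAD github_fork/
cp -r github_orig/refs github_fork/
ln -s ../github_orig/objects github_fork/objects
git clone -q --no-local github_fork local_fork
(cd local_fork && echo yyy > fork_only && git add fork_only &&
  git commit -q -m "fork commit" &&
  git push -q origin HEAD:refs/heads/fork-branch)
fork_commit=$(git -C local_fork rev-parse HEAD)

# Prune from the original repo: the fork's commit is unreachable from its refs
git -C github_orig prune --expire=now

# The fork's ref still names the commit, but is the object still in the pool?
if git -C github_fork cat-file -e "$fork_commit" 2>/dev/null; then
  fork_state=intact
else
  fork_state=broken
fi
echo "fork after prune in orig: $fork_state"
```

A real shared-pool setup would therefore have to compute reachability over the refs of every repo in the fork network before deleting anything.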
Source: https://stackoverflow.com/questions/11974686/explanation-of-github-fork-and-how-they-store-files