Is it possible to store only a checksum of a large file in git?

I'm a bioinformatician currently extracting normal-sized sequences from genomic files. Some genomic files are large enough that I don't want to put them into the main git repository, whereas I'm putting the extracted sequences into git.

Is it possible to tell git: "Here's a large file - don't store the whole file, just take its checksum, and let me know if that file is missing or modified"?

If that's not possible, I guess I'll have to either git-ignore the large files, or, as suggested in this question, store them in a submodule.

I wrote a script that does this sort of thing. You put file patterns in the .gitattributes file for large media that you don't want going in your git repo and it can store them on S3 instead. It's just a starting point, but I think it's usable if you're interested.

http://github.com/schacon/git-media

Maybe that will help you, or at least show you how it could be done and you can customize it for your specific needs.
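To illustrate the general idea (this is not git-media's actual code), here is a minimal Python sketch of the two halves of a git clean/smudge filter. Git's filter mechanism itself is real: a `.gitattributes` pattern names a filter, and the `filter.<name>.clean` and `filter.<name>.smudge` settings name the commands to run. Everything else here is an assumption made for illustration: the function names `clean` and `smudge`, the use of SHA-256, and a local directory standing in for S3. It also reads the whole file into memory, which a real tool for very large files would avoid.

```python
import hashlib
from pathlib import Path

HEX_DIGITS = set("0123456789abcdef")


def clean(data, store):
    """Save `data` in `store` under its SHA-256 and return the stub to commit.

    `store` is a hypothetical local directory standing in for remote storage.
    """
    digest = hashlib.sha256(data).hexdigest()
    store = Path(store)
    store.mkdir(parents=True, exist_ok=True)
    (store / digest).write_bytes(data)
    # Only this one-line stub would end up in the repository.
    return (digest + "\n").encode("ascii")


def smudge(stub, store):
    """Return the real content for `stub`, or `stub` unchanged if unavailable."""
    text = stub.decode("ascii", errors="replace").strip()
    # Anything that does not look like a SHA-256 hex digest is passed through.
    if len(text) != 64 or not set(text) <= HEX_DIGITS:
        return stub
    blob = Path(store) / text
    return blob.read_bytes() if blob.is_file() else stub
```

A wrapper script would read standard input, call one of these functions, and write the result to standard output, since that is how git invokes filter commands. When the content is not in the store, `smudge` leaves the stub in the working tree, so a missing large file shows up as a one-line checksum file.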

Jakub Narębski

In the upcoming release of git there will be a 'refs/replace/' mechanism, which I think could be adapted for this purpose (assuming that the number of such large-media files and the number of their versions isn't very large).

In the slim fork of your project you would have (as Seth wrote) 'stub' files in place of your large media files, whose contents would be the SHA-1 of the large file's blob (from "git hash-object -t blob <filename>").
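For reference, the blob SHA-1 can also be computed without invoking git. The sketch below relies on the fact that git hashes the header `blob <size>\0` followed by the raw file bytes; the function name `git_blob_sha1` and the chunk size are my own choices. It should match `git hash-object` only when no line-ending conversion or other filter applies to the file.

```python
import hashlib
import os


def git_blob_sha1(path, chunk_size=1 << 20):
    """Return the SHA-1 that git would assign to the file's raw bytes as a blob."""
    size = os.path.getsize(path)
    digest = hashlib.sha1()
    # Git prefixes the content with "blob <size in bytes>\0" before hashing.
    digest.update(b"blob %d\x00" % size)
    with open(path, "rb") as handle:
        # Stream in chunks so that large files are never fully in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Writing the returned hex string into the stub file would give the slim fork a fixed-size placeholder for each large file.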

Then in the full fork of your project you would use the "refs/replace/" mechanism to replace those 'stub' files with the true contents (using git replace). Some hooks would be required to keep the SHA-1 in the 'stub' files in sync with the actual large-media files.

Then if you want a full clone, you also fetch from the "refs/replace/" namespace; if you want a slim clone, you don't fetch "refs/replace/".

Note: I haven't actually tested such a setup; also, this isn't yet available in git unless you run 'master'.

How about storing the hashes in a text file, then giving the text file to git? Then you could write a hook that compared hashes, so every time you checked in or checked out, you could be notified of what was missing / different.

Not exactly what you want, and you would still have to maintain the text file manually.
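A rough sketch of what the comparison step of such a hook might look like in Python. The manifest format (one `<sha256>  <relative path>` pair per line, as written by `sha256sum`), the function names, and the choice of SHA-256 are all assumptions for illustration, not part of any existing tool.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files are not loaded into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_manifest(manifest_path):
    """Return (missing, modified): lists of manifest entries that fail the check.

    Paths in the manifest are resolved relative to the manifest's directory.
    """
    manifest_path = Path(manifest_path)
    missing, modified = [], []
    for line in manifest_path.read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        expected, name = line.split(None, 1)
        target = manifest_path.parent / name
        if not target.is_file():
            missing.append(name)
        elif sha256_of(target) != expected:
            modified.append(name)
    return missing, modified


def main(argv):
    """Print problems and return a non-zero status if any file fails."""
    missing, modified = check_manifest(argv[1])
    for name in missing:
        print("missing:  " + name)
    for name in modified:
        print("modified: " + name)
    return 1 if missing or modified else 0
```

A pre-commit or post-checkout hook could run a small script that ends with `sys.exit(main(sys.argv))`, passing the manifest path as its argument. The manifest is the only thing committed to git, and regenerating it after an intentional change to a large file remains a manual step.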

Source: https://stackoverflow.com/questions/1501491/is-it-possible-to-store-only-a-checksum-of-a-large-file-in-git

Tags

git

large-files