Question
The scenario
Imagine I am forced to work with some of my files always stored inside .zip files. Some of the files inside the zip are small text files and change often, while others are larger but luckily rather static (e.g. images).
If I want to place these zip files inside a git repository, each zip is treated as a blob, so whenever I commit the repository grows by the size of the zip file... even if only one small text file inside changed!
Why this is realistic
MS Word 2007/2010 .docx and Excel .xlsx files are ZIP files...
What I want
Is there, by any chance, a way to tell git to not treat zips as files, but rather as directories and treat their contents as files?
The advantages
- much smaller repo size, i.e. quicker transfer/backup
- Displaying changes to zips with Git would automagically work
But it couldn't work, you say?
I realize that without extra metadata this would lead to some amount of ambiguity: on a git checkout, git would have to decide whether to create foo.zip/bar.txt as a file in a regular directory or inside a zip file. However, this could be solved through config options, I would think.
Two ideas how it could be done (if it doesn't exist yet)
- using a library such as minizip or IO::Compress::Zip inside git
- somehow adding a filesystem layer such that git actually sees zip files as directories to start with
Answer 1:
This doesn't exist, but it could easily exist in the current framework. Just as git behaves differently when displaying binary or ASCII files in a diff, it could be told to offer special treatment to certain file types through the configuration interface.
If you don't want to change the code base (although this is kind of a cool idea you've got), you could also script it for yourself by using pre-commit and post-checkout hooks to unzip and store the files, then return them to their .zip state on checkout. You would have to restrict these actions to only the files (blobs / index entries) that are specified by git add.
Either way is a bit of work -- it's just a question of whether the other git commands are aware of what's going on and play nicely.
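A minimal sketch of the two operations such hooks would need, written in Python rather than as the hooks themselves. The function names explode_zip and rebuild_zip, and the idea of keeping the unpacked tree in a side directory, are assumptions for illustration and not an existing git feature; the actual wiring into pre-commit and post-checkout is left out.

```python
import os
import zipfile


def explode_zip(zip_path, out_dir):
    """Unpack zip_path into out_dir so git can track each member as a file.

    Returns the sorted member names (hypothetical helper, not part of git).
    """
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(out_dir)
        return sorted(zf.namelist())


def rebuild_zip(src_dir, zip_path):
    """Re-create zip_path from the tracked files under src_dir."""
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for root, dirs, files in os.walk(src_dir):
            dirs.sort()  # walk subdirectories in a stable order
            for name in sorted(files):
                full = os.path.join(root, name)
                # Store paths relative to src_dir, using forward slashes.
                arcname = os.path.relpath(full, src_dir).replace(os.sep, "/")
                zf.write(full, arcname)
```

A pre-commit hook could call explode_zip for each staged archive and stage the resulting directory, and a post-checkout hook could call rebuild_zip to restore the archive. The rebuilt zip will generally not be byte-identical to the original, which matters for applications that are picky about member order or compression method.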
Answer 2:
Not sure if anyone is still interested in this question. I am facing the same problem, and here is my solution, which uses a git file filter.
Edit: First, I may not have stated it clearly, but this IS an answer to the OP's question! Read the entire answer before you comment. Moreover, thanks to @Toon Krijthe for the advice to clarify the solution in place.
My solution is to use a filter to "flatten" the zip file into a monolithic, expanded (possibly huge) text file. During git add/commit the zip file is automatically expanded to this text format for normal text diffing, and during checkout it is automatically zipped up again.
The text file is composed of records, each of which represents a file in the zip. So you can think of this text file as a text-based image of the original zip. If a file in the zip is indeed text, it is copied into the text file; otherwise, it is base64 encoded before being copied into the text file. This keeps the text file always a text file.
Although this filter does not make each file in the zip a blob, text files are mapped line to line, which is the unit of the diff, while changes to binary files are represented by updates to their corresponding base64. I think this is equivalent to what the OP imagines.
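To make the record idea concrete, here is a small sketch of such a flattening and its inverse. The header layout (kind|line count|name) and the names encode_zip and decode_zip are made up for this illustration and are not Zippey's actual on-disk format. The sketch treats any member that decodes as UTF-8 as text, which is a simplification.

```python
import base64
import io
import zipfile


def encode_zip(zip_bytes):
    """Flatten a zip archive into one text document, one record per member."""
    out = []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        for name in zf.namelist():
            data = zf.read(name)
            try:
                body, kind = data.decode("utf-8"), "text"
            except UnicodeDecodeError:
                # Not valid UTF-8: keep the record textual via base64.
                body, kind = base64.b64encode(data).decode("ascii"), "base64"
            lines = body.split("\n")
            # Hypothetical header: kind, number of body lines, member name.
            out.append(f"{kind}|{len(lines)}|{name}")
            out.extend(lines)
    return "\n".join(out) + "\n"


def decode_zip(text):
    """Rebuild a zip archive from the text produced by encode_zip."""
    lines = text.split("\n")
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        i = 0
        while i < len(lines) and lines[i]:
            kind, count, name = lines[i].split("|", 2)
            body = "\n".join(lines[i + 1 : i + 1 + int(count)])
            if kind == "text":
                data = body.encode("utf-8")
            else:
                data = base64.b64decode(body)
            zf.writestr(name, data)
            i += 1 + int(count)
    return buf.getvalue()
```

Used as a git filter, the clean side would read the archive from standard input and write the output of encode_zip, and the smudge side would do the reverse with decode_zip. Because text members appear line by line in the flattened form, a one-line change inside the zip shows up as a one-line change in the stored blob.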
For details and prototype code you can read the following link:
Zippey Git file filter
Also, credit to the place that inspired me about this solution: Description of how file filter works
Answer 3:
Use bup (presented in detail in GitMinutes #24)
It is the only git-like system designed to deal with large (even very, very large) files, which means every version of a zip file will only increase the repo by its delta (instead of a full additional copy)
The result is an actual git repo, that a regular Git command can read.
I detail how bup differs from Git in "git with large files".
Any other workaround (like git-annex) isn't entirely satisfactory, as detailed in "git-annex with large files".
Answer 4:
http://tante.cc/2010/06/23/managing-zip-based-file-formats-in-git/
(Note: per comment from Ruben, this is only about getting a proper diff though, not about committing unzipped files.)
Open your ~/.gitconfig file (create it if it does not exist already) and add the following stanza:
[diff "zip"]
    textconv = unzip -c -a
This uses “unzip -c -a FILENAME” to convert your zipfile into ASCII text (unzip -c unzips to STDOUT). The next step is to create/modify the file REPOSITORY/.gitattributes and add the following
*.pptx diff=zip
which tells git to use the zip-diffing description from the config for files matching the given mask (in this case everything ending with .pptx). Now git diff automatically unzips the files and diffs the ASCII output, which is a little better than just “binary files differ”. On the other hand, due to the convoluted mess that the corresponding XML of pptx files is, it doesn’t help a lot there, but for ZIP files containing text (for example source code archives) this is actually quite handy.
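Where unzip is not available, a small script can play the same role, since git calls the textconv program with the file path as its only argument and diffs whatever it prints. The following is only a sketch under that assumption; the script name zip_textconv.py, the function zip_to_text and the "==>" header layout are invented for this example.

```python
import sys
import zipfile


def zip_to_text(path):
    """Render an archive as text: a header per member, then its content."""
    parts = []
    with zipfile.ZipFile(path) as zf:
        for info in zf.infolist():
            parts.append(f"==> {info.filename} ({info.file_size} bytes)")
            data = zf.read(info)
            try:
                parts.append(data.decode("utf-8"))
            except UnicodeDecodeError:
                # Binary members would only add noise to a text diff.
                parts.append("[binary content omitted]")
    return "\n".join(parts) + "\n"


if __name__ == "__main__":
    if len(sys.argv) == 2:
        sys.stdout.write(zip_to_text(sys.argv[1]))
    else:
        sys.stderr.write("usage: zip_textconv.py ARCHIVE\n")
```

It would be hooked up by setting textconv to something like "python3 /path/to/zip_textconv.py" in the same [diff "zip"] stanza. As with unzip -c -a, this only affects how diffs are displayed; the repository still stores the zip as a single binary blob.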
Answer 5:
I think you're going to need to mount a zip file to the filesystem. I haven't used it, but consider FUSE:
http://code.google.com/p/fuse-zip/
There is also ZFS for Windows and Linux:
http://users.telenet.be/tfautre/softdev/zfs/
Answer 6:
Pre-zipped files often cause problems because the applications that own them expect the zip compression method and file order to be the ones they chose. I believe that OpenOffice .odf files have that problem.
That said, if you are simply using any-old-zip as a method for keeping stuff together, then you should be able to create a few simple aliases which will unzip and re-zip when required. The very latest Msysgit (aka Git for Windows) now has both zip and unzip on the shell code side, so you can use them in aliases.
The project I'm currently working on uses zips as the main local version control / archive, so I'm also trying to get a workable set of aliases for sucking these hundreds of zips into git (and getting them out again ;-) so that the co-workers are happy.
Answer 7:
Rezip, similar to Zippey by sippey, allows handling ZIP files in a nicer way with git.
How it works
When adding/committing a ZIP based file, Rezip unpacks it and repacks it without compression, before adding it to the index/commit. In an uncompressed ZIP file, the archived files appear as-is in its content (together with some binary meta-info before each file). If those archived files are plain-text files, this method will play nicely with git.
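The repacking step can be illustrated with a short Python sketch. This is not Rezip's actual code (Rezip is a Java class); the function name repack_stored is made up here, and only the member name, timestamp and external attributes are carried over.

```python
import io
import zipfile


def repack_stored(zip_bytes):
    """Rewrite an archive with every member stored uncompressed."""
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as src, \
            zipfile.ZipFile(out, "w", zipfile.ZIP_STORED) as dst:
        for info in src.infolist():
            # Keep name and timestamp, change only the compression method.
            stored = zipfile.ZipInfo(info.filename, date_time=info.date_time)
            stored.compress_type = zipfile.ZIP_STORED
            stored.external_attr = info.external_attr
            dst.writestr(stored, src.read(info))
    return out.getvalue()
```

After repacking, the bytes of a plain-text member appear verbatim inside the archive, so git's own delta compression can find the unchanged parts between two versions of the file.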
Benefits
The main benefit of Rezip over Zippey, is that the actual file stored in the repository is still a ZIP file. Thus, in many cases, it will still work as-is with the respective application (for example Open Office), even if it is obtained without going through a re-packing-with-compression filter.
How to use
Install the filter(s) on your system:
mkdir -p ~/bin
cd ~/bin
# Download the filter executable
wget https://github.com/costerwi/rezip/raw/master/Rezip.class
# Install the add/commit filter
git config --global --replace-all filter.rezip.clean "java -cp ~/bin Rezip --store"
# (optionally) Install the checkout filter
git config --global --add filter.rezip.smudge "java -cp ~/bin Rezip"
Use the filter in your repository by adding lines like these to your <repo-root>/.gitattributes file:
[attr]textual diff merge text
[attr]rezip filter=rezip textual
# MS Office
*.docx rezip
*.xlsx rezip
*.pptx rezip
# OpenOffice
*.odt rezip
*.ods rezip
*.odp rezip
# Misc
*.mcdx rezip
*.slx rezip
The textual part is so that these files are actually shown as text files in diffs.
Source: https://stackoverflow.com/questions/8001663/can-git-treat-zip-files-as-directories-and-files-inside-the-zip-as-blobs