Are there any good workarounds to the GitHub 100MB file size limit for text files?

一笑奈何 submitted on 2019-11-29 05:42:24

Question


I have a 190 MB plain text file that I want to track on github.

The text file is a pronunciation lexicon for our text-to-speech engine. We regularly add and modify lines in it, and the diffs are fairly small, so it's a perfect fit for git in that sense.

However, GitHub has a strict 100 MB file size limit in place. I have tried the GitHub Large File Storage service, but that uploads a new version of the entire 190 MB file every time it changes - so that would quickly grow to many gigabytes if I go down that path.

I would like to keep the file as one file instead of splitting it, because that's how our workflow currently works, and it would require some coding to allow multiple text files as input/output in our tools (and we don't have many development resources).

One idea I've had is that maybe it's possible to set up some pre- and post-commit hooks to split and concatenate the big file automatically? Would that be possible?

Other ideas?

Edit: I am aware of the 100 MB file size limitation described in the similar questions here on StackOverflow, but I don't consider my question a duplicate because I'm asking for the specific case where the diffs are small and frequent (I'm not trying to upload a big ZIP file or anything). However, my understanding is that git-lfs is only appropriate for files that rarely change, and that normal git would be the perfect fit for the kind of file I'm describing; except that GitHub has a file size restriction.

Update: I spent yesterday experimenting with a small cross-platform program that splits and joins files into smaller files using git hooks. It kind of works, but not satisfactorily. The big text file has to be excluded via .gitignore, which leaves git unaware of whether it has changed. The split files are not initially detected by git status or git commit, which leads to the same issue as described in this SO question, which is quite annoying: Pre-commit script creates mysqldump file, but "nothing to commit (working directory clean)"?

Setting up a cron job (Linux) and a scheduled task (Windows) to regenerate the split files regularly might fix that, but it's not easy to set up automatically, might cause performance issues on the user's computer, and just isn't a very elegant solution. Hacky workarounds like dynamically modifying .gitignore might also be needed, and you would never get a diff of the actual text file, only of the split files (although that might be acceptable, as they would be very similar).

So, having slept on it, I now think the git hook approach is not a good option after all, as it has too many quirks. As @PyRulez suggested, I'll have to look at services other than GitHub (unfortunately, since I love GitHub). A hosted solution would be preferable, to avoid having to manage our own server. I'd also like it to be publicly available...

Update 2: I've looked at some alternatives to GitHub and currently I'm leaning towards using GitLab. I've contacted GitHub support about the possibility of raising the 100MB limit, but if they won't do that I'll just switch to GitLab for this particular project.


Answer 1:


Clean and Smudge

You can use clean and smudge filters to compress your file. Normally this isn't necessary, since git compresses content internally anyway, but since GitHub applies its limit to the file's size, it may help. The main commands would be:

git config filter.compress.clean "gzip"
git config filter.compress.smudge "gzip -d"

GitHub will see this as a compressed file, but on each computer, it will appear to be a text file.

See https://git-scm.com/book/en/v2/Customizing-Git-Git-Attributes for more details.
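For the filters to take effect, the file also has to be mapped to them in .gitattributes; the git config lines alone only define the filter. A minimal setup sketch (the file name lexicon.txt is an assumption):

```shell
# Define the compress filter for this repository
git config filter.compress.clean "gzip"
git config filter.compress.smudge "gzip -d"

# Map the big file to the filter (lexicon.txt is a hypothetical name)
echo "lexicon.txt filter=compress" >> .gitattributes

git add .gitattributes lexicon.txt
git commit -m "Store lexicon gzip-compressed via clean/smudge"
```

One caveat: the blob git stores (and diffs) is then the gzipped binary, so readable diffs would additionally need a diff driver with a textconv; the working-tree copy stays plain text either way.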

Alternatively, you could have the clean filter post the content to an online pastebin such as http://pastebin.com/, and the smudge filter fetch it back. Many other combinations are possible with clean and smudge.




Answer 2:


A very good solution is to use:

https://git-lfs.github.com/

It's an open-source extension designed to work with large files.
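For completeness, the basic setup is just a couple of commands; a sketch, assuming the big file is named lexicon.txt (that name is made up):

```shell
git lfs install              # one-time per machine; installs the LFS hooks
git lfs track "lexicon.txt"  # writes a tracking rule to .gitattributes
git add .gitattributes lexicon.txt
git commit -m "Track the lexicon with Git LFS"
```

As the question notes, though, LFS stores a complete copy of the file per version, so storage grows with every change rather than with the size of the diffs.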




Answer 3:


You can write a script or program in any language to split or join the files.

Here is an example, written in Java, that splits a file (I used Java because I'm more comfortable with it than any other language, but any other would work, and some would be better suited than Java, too).

import java.io.*;

public class FileSplitter
{
    public static void main(String[] args) throws Exception
    {
        RandomAccessFile raf = new RandomAccessFile("test.csv", "r");
        long numSplits = 10; // from user input; extract it from args
        long sourceSize = raf.length();
        long bytesPerSplit = sourceSize / numSplits;
        long remainingBytes = sourceSize % numSplits;

        int maxReadBufferSize = 8 * 1024; // 8KB
        for (int destIx = 1; destIx <= numSplits; destIx++) {
            BufferedOutputStream bw = new BufferedOutputStream(new FileOutputStream("split." + destIx));
            if (bytesPerSplit > maxReadBufferSize) {
                long numReads = bytesPerSplit / maxReadBufferSize;
                long numRemainingRead = bytesPerSplit % maxReadBufferSize;
                for (int i = 0; i < numReads; i++) {
                    readWrite(raf, bw, maxReadBufferSize);
                }
                if (numRemainingRead > 0) {
                    readWrite(raf, bw, numRemainingRead);
                }
            } else {
                readWrite(raf, bw, bytesPerSplit);
            }
            bw.close();
        }
        if (remainingBytes > 0) {
            // Leftover bytes go into one extra part file
            BufferedOutputStream bw = new BufferedOutputStream(new FileOutputStream("split." + (numSplits + 1)));
            readWrite(raf, bw, remainingBytes);
            bw.close();
        }
        raf.close();
    }

    static void readWrite(RandomAccessFile raf, BufferedOutputStream bw, long numBytes) throws IOException {
        byte[] buf = new byte[(int) numBytes];
        int val = raf.read(buf);
        if (val != -1) {
            bw.write(buf, 0, val); // write only the bytes actually read
        }
    }
}
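The reverse operation, joining the parts back into one file, can be sketched like this; it assumes the split.1, split.2, ... naming used by the splitter above, and the demo contents in main are made up:

```java
import java.io.*;
import java.nio.file.*;

public class FileJoiner
{
    // Concatenate prefix1, prefix2, ... into a single target file,
    // stopping at the first part number that does not exist.
    static void join(Path dir, String prefix, Path target) throws IOException {
        try (OutputStream out = new BufferedOutputStream(Files.newOutputStream(target))) {
            for (int i = 1; ; i++) {
                Path part = dir.resolve(prefix + i);
                if (!Files.exists(part)) break;
                Files.copy(part, out); // append the whole part to the target
            }
        }
    }

    public static void main(String[] args) throws Exception
    {
        // Demo: create two tiny parts in a temporary directory, then join them
        Path dir = Files.createTempDirectory("join-demo");
        Files.write(dir.resolve("split.1"), "hello ".getBytes());
        Files.write(dir.resolve("split.2"), "world".getBytes());

        Path joined = dir.resolve("joined.txt");
        join(dir, "split.", joined);
        System.out.println(new String(Files.readAllBytes(joined))); // prints "hello world"
    }
}
```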

This will cost almost nothing (in time or money).

Edit: You can build the Java program into an executable and add it to your repository, or, even easier, write a Python (or any other language) script to do this and keep it as plain text in your repository.



Source: https://stackoverflow.com/questions/34723759/are-there-any-good-workarounds-to-the-github-100mb-file-size-limit-for-text-file
