Git: split a repository directory while preserving move/rename history

余生分开走 2020-12-15 08:18

Let's say you have the repository:

myCode/megaProject/moduleA
myCode/megaProject/moduleB

Over time (months), you re-organise the project.

5 answers
  • 2020-12-15 08:55

    I'm aware of no simple way to do this, but it can be done.

    The problem with filter-branch is that it works by "applying custom filters on each revision".

    If you can write a filter that doesn't delete your files, they will be tracked across directories. Of course, writing such a filter is likely to be hard for any non-trivial repository.

    To start, let's assume it is a trivial repository: you have never renamed a file, and you have never had files with the same name in two modules. All you need to do is get a list of the files in your module with find megaProject/moduleA -type f -printf "%f\n" > preserve, and then run your filter using those filenames and your directory:

    preserve.sh

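    #!/bin/bash
    # Build one "! -name FILE" test for each name in the preserve list.
    # The "d1" exclusion is kept from the original answer.
    args=(! -name d1)
    while IFS= read -r f; do
      args+=(! -name "$f")
    done < /path/to/myCode/preserve
    # Delete every file whose name is not on the list.
    find . -type f "${args[@]}" -exec rm -f -- {} +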
    

    git filter-branch --prune-empty --tree-filter '/path/to/myCode/preserve.sh' HEAD

    Of course, it's renames that make this difficult. One of the nice things about git filter-branch is that it gives you the $GIT_COMMIT environment variable. You can then get fancy and use things like:

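    # --follow only works on a single file, so walk the files one by one
    find megaProject/moduleA -type f | while IFS= read -r f
    do
     # Join each commit hash and the file name that follows it as "hash:name:"
     git log --pretty=format:'%H' --name-only --follow -- "$f" | awk '{ if ($0 != "") { printf "%s:", $0; next } print } END { if (NR) print "" }'
    done > preserve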
    

    to build a filename history, with commits, that could be used in place of the simple preserve file in the trivial example, but the onus is going to be on you to keep track of what files should be present at each commit. This actually shouldn't be too hard to code out, but I haven't seen anybody who's done it yet.

  • 2020-12-15 09:02

    This is a version based on @rksawyer's scripts, but it uses git-filter-repo instead. I found it was much easier to use and much much faster than git-filter-branch.

    # Run this script from the folder that contains the project folder.
    # It uses git-filter-repo (https://github.com/newren/git-filter-repo).
    # List the files and folders you want to keep in <your_repo_folder_name>_KEEP.txt. The file must end with a newline, otherwise the last file/folder will be skipped.
    # The result is written to a folder called <your_repo_folder_name>_REWRITE_CLONE. Your original repo won't be changed.
    # Tags are not preserved; see the line below to preserve tags.
    # Each subsequent run backs up the previous result to <your_repo_folder_name>_REWRITE_CLONE_BKP.
    
    # Define here the name of the folder containing the repo: 
    GIT_REPO="git-test-orig"
    
    clone="$GIT_REPO"_REWRITE_CLONE
    temp=/tmp/git_rewrite_temp
    rm -Rf "$clone"_BKP
    mv "$clone" "$clone"_BKP
    rm -Rf "$temp"
    mkdir "$temp"
    git clone "$GIT_REPO" "$clone"
    cd "$clone"
    git remote remove origin
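    # "open" is the macOS command; these two lines only open both folders in Finder and can be removed.
    open .
    open "$temp"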
    
    # Comment line below to preserve tags
    git tag | xargs git tag -d
    
    echo 'Start logging file history...'
    printf '# git log results:\n\n' > "$temp"/log.txt
    
    while IFS= read -r p
    do
        # find already includes hidden files, so no dotglob is needed
        find "$p" -type f > "$temp"/temp
        while IFS= read -r f
        do
            echo "## " "$f" >> "$temp"/log.txt
            # Log every commit that touched the file, following earlier renames.
            # Then remove blank lines and keep every other line, leaving only the filenames.
            git log --pretty=format:'%H' --name-only --follow -- "$f" | awk 'NF > 0' | awk 'NR%2==0' | tee -a "$temp"/log.txt
    
            printf '\n\n' >> "$temp"/log.txt
        done < "$temp"/temp
    done < ../"$GIT_REPO"_KEEP.txt > "$temp"/PRESERVE
    
    mv "$temp"/PRESERVE "$temp"/PRESERVE_full
    awk '!a[$0]++' "$temp"/PRESERVE_full > "$temp"/PRESERVE
    
    sort -o "$temp"/PRESERVE "$temp"/PRESERVE
    
    echo 'Starting filter-repo --------------------------'
    git filter-repo --paths-from-file "$temp"/PRESERVE --force --replace-refs delete-no-add
    echo 'Finished filter-repo --------------------------'
    

    The script logs the output of git log to /tmp/git_rewrite_temp/log.txt. If you don't need log.txt and want the script to run faster, remove the lines that write to it.
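
The script's comments warn that the keep list must end with a newline, or its last entry is skipped. A small guard such as the following sketch could be run on the list beforehand. The helper name ensure_trailing_newline is made up for illustration.

```shell
# Hypothetical helper: append a newline to a file if its last line lacks one.
ensure_trailing_newline() {
  # tail -c1 prints the last byte. Command substitution strips a trailing
  # newline, so a non-empty result means the file does not end with one.
  if [ -s "$1" ] && [ -n "$(tail -c1 "$1")" ]; then
    printf '\n' >> "$1"
  fi
}
```

For the script above, that would be ensure_trailing_newline "$GIT_REPO"_KEEP.txt before the loop starts. Running it on a file that already ends with a newline changes nothing.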

  • 2020-12-15 09:10

    We painted ourselves into a much worse corner, with dozens of projects across dozens of branches, with each project dependent on 1-4 others, and 56k commits total. filter-branch was taking up to 24 hours just to split a single directory off.

    I ended up writing a tool in .NET using libgit2sharp and raw file system access to split an arbitrary number of directories per project, and only preserve relevant commits/branches/tags for each project in the new repos. Instead of modifying the source repo, it writes out N other repos with only the configured paths/refs.

    You're welcome to see if this suits your needs, modify it, etc. https://github.com/CurseStaff/GitSplit

  • 2020-12-15 09:11

    Running git filter-branch --subdirectory-filter in your cloned repository will remove all commits that don't affect content in that subdirectory, which includes those affecting the files before they were moved.

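    Instead, you need a filter that deletes all files you're not interested in, combined with the --prune-empty flag to drop any commits that only affect other content. The example below uses --tree-filter; --index-filter is a faster alternative that rewrites the index without checking out each commit.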

    There's a blog post from Kevin Deldycke with a good example of this:

    git filter-branch --prune-empty --tree-filter 'find ./ -maxdepth 1 -not -path "./e107*" -and -not -path "./wordpress-e107*" -and -not -path "./.git" -and -not -path "./" -print -exec rm -rf "{}" \;' -- --all
    

    This command effectively checks out each commit in turn, deletes all uninteresting files from the working directory and, if anything has changed from the last commit then it checks it in (rewriting the history as it goes). You would need to tweak that command to delete all files except those in, say, /moduleA, /megaProject/moduleA and the specific files you want to keep from /megaProject.
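
As a rough sketch of that tweak, the function below deletes every file in the working tree that is not under one of the given paths. The name prune_except and the file build.xml in the usage note are hypothetical, not part of the blog post's command.

```shell
# Hypothetical helper: delete every file that is not one of, or inside one of,
# the paths given as arguments. Paths are relative to the current directory.
prune_except() {
  find . -type f ! -path './.git/*' | while IFS= read -r file; do
    path=${file#./}
    keep=no
    for prefix in "$@"; do
      case $path in
        # Exact match, or anything below the kept directory
        "$prefix"|"$prefix"/*) keep=yes ;;
      esac
    done
    if [ "$keep" = no ]; then
      rm -f -- "$file"
    fi
  done
}
```

A tree filter could then call something like prune_except moduleA megaProject/moduleA megaProject/build.xml, so that both the old and new locations of the module survive in every commit.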

  • 2020-12-15 09:16

    Following on from the answer above: first, iterate through all of the files in the directory that is being kept, using git log --follow to get the old paths/names from prior moves/renames. Then use filter-branch to iterate through every revision, removing any files that are not on the list created in step 1.

    #!/bin/bash
    DIRNAME=dirD
    
    # Walk every file under DIRNAME, including hidden files and nested directories
    # (--follow only works on a single file, not on a directory)
    find "$DIRNAME" -type f | while IFS= read -r f
    do
    # Log every commit that touched the file, following earlier renames.
    # Then remove blank lines and keep every other line, leaving only the filenames.
     git log --pretty=format:'%H' --name-only --follow -- "$f" | awk 'NF > 0' | awk 'NR%2==0'
    done > /tmp/PRESERVE
    
    # Sort the list (comm needs sorted input later) and drop duplicates
    sort -u -o /tmp/PRESERVE /tmp/PRESERVE
    cat /tmp/PRESERVE
    

    Then create a script (preserve.sh) that filter-branch will call for each revision.

    #!/bin/bash
    DIRNAME=dirD
    
    # List every file outside .git and DIRNAME, one per line, without the leading ./
    # (double quotes so that $DIRNAME is expanded)
    find . -type f -not -path './.git/*' -not -path "./$DIRNAME/*" | cut -c3- | sort > /tmp/ALL2
    
    #echo 'before:'
    #cat /tmp/ALL2
    
    # Delete everything that's not in the PRESERVE list
    comm -23 /tmp/ALL2 /tmp/PRESERVE > /tmp/DELETE_THESE
    echo 'delete these:'
    cat /tmp/DELETE_THESE
    #exit 0
    
    while IFS= read -r f; do
      rm -f -- "$f"
    done < /tmp/DELETE_THESE
    

    Now run filter-branch. If all files are removed in a revision, --prune-empty drops that commit and its message.

     git filter-branch --prune-empty --tree-filter '/FULL_PATH/preserve.sh' master
    