Measuring “closeness” in large source trees

As part of an earlier question about finding the best match between two source trees, one with an active git repo and the other with no git history, I wrote a Perl script to find the closest git commit.

I'm rewriting the script so that you no longer have to guess which branch to use: it will search every branch for the closest match and report the best commit along with its branch. Unfortunately, I'm finding that the measurement I'm using may not be the best judge of "closeness."

Currently, I use diff -burN -x.git my_git_subtree my_src_subtree | wc -l to determine how close the code trees are. This works more or less, but I run into cases where entire folders are added or missing, probably because they exist (or do not exist) in a different branch.

Is there a better way to determine how close the sources are? I'm envisioning something that compares the directory structures and possibly also how many lines differ. It could just be a matter of passing different parameters to diff, or maybe there is another tool that does something like this.
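A minimal sketch of the kind of measurement envisioned here, separating structural differences (files present in only one tree) from line-level differences in files the trees share. The function names and the choice to skip .git are assumptions made for illustration; this is not the script from the earlier question.

```python
import difflib
import os


def list_files(root, skip_dirs=(".git",)):
    """Return the set of file paths under root, relative to root."""
    found = set()
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune skipped directories in place so os.walk does not descend into them.
        dirnames[:] = [d for d in dirnames if d not in skip_dirs]
        for name in filenames:
            found.add(os.path.relpath(os.path.join(dirpath, name), root))
    return found


def read_lines(path):
    # errors="replace" keeps binary or oddly encoded files from raising.
    with open(path, encoding="utf-8", errors="replace") as handle:
        return handle.readlines()


def compare_trees(tree_a, tree_b):
    """Return (only_in_a, only_in_b, changed_lines) for two directory trees."""
    files_a, files_b = list_files(tree_a), list_files(tree_b)
    changed = 0
    for rel in files_a & files_b:
        a_lines = read_lines(os.path.join(tree_a, rel))
        b_lines = read_lines(os.path.join(tree_b, rel))
        matcher = difflib.SequenceMatcher(None, a_lines, b_lines, autojunk=False)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag != "equal":
                # Count lines removed from A plus lines added in B.
                changed += (i2 - i1) + (j2 - j1)
    return sorted(files_a - files_b), sorted(files_b - files_a), changed
```

Keeping the three numbers separate lets a whole missing folder be weighted differently from a handful of edited lines, instead of folding everything into one line count.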

To improve on your measurement, why not try 'git diff --shortstat'? The output looks like this:

 1 file changed, 1 insertion(+), 2 deletions(-)

You can experiment with how to weight file changes, insertions, and deletions, depending on the results.
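One way to turn that summary line into a single number is to parse the three counts and combine them with adjustable weights. The sketch below assumes the output format shown above; the function names and the default weights are hypothetical starting points, not recommended values.

```python
import re


def parse_shortstat(text):
    """Return (files, insertions, deletions) from a --shortstat summary line."""
    def grab(pattern):
        match = re.search(pattern, text)
        # git omits a field when its count is zero, so a missing field means 0.
        return int(match.group(1)) if match else 0

    return (
        grab(r"(\d+) files? changed"),
        grab(r"(\d+) insertions?\(\+\)"),
        grab(r"(\d+) deletions?\(-\)"),
    )


def closeness_score(text, file_weight=10.0, insertion_weight=1.0, deletion_weight=1.0):
    """Lower is closer. The weights are illustrative assumptions to tune."""
    files, insertions, deletions = parse_shortstat(text)
    return (file_weight * files
            + insertion_weight * insertions
            + deletion_weight * deletions)
```

Raising file_weight penalizes commits that touch many files, which may help when whole folders differ between branches.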

Looking at your Perl, I think you probably can't make assumptions about the ordering of "closeness" among commits -- you may need to brute-force check every commit, or at least make that an option.

I'd also suggest that instead of looking for the single closest commit, you keep a sorted list of (commit, "closeness") pairs, display the top few, and review them by hand. There is no silver bullet for determining whether two sets of code are close simply by looking at the number of changes. That said, the number of changes can definitely help you narrow down the list you should review.
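Keeping the best few candidates rather than a single winner can be sketched with the standard library's heapq. The commit IDs and scores in the usage lines are made up for illustration, and the sketch assumes that a lower score means a closer match.

```python
import heapq


def top_candidates(scores, limit=5):
    """Return the `limit` closest (commit, score) pairs, closest first.

    `scores` is any iterable of (commit, score) pairs; lower score = closer.
    """
    return heapq.nsmallest(limit, scores, key=lambda pair: pair[1])


# Hypothetical data: commit IDs and scores are placeholders.
example = [("a1b2c3", 40), ("d4e5f6", 12), ("0a9b8c", 97)]
for commit, score in top_candidates(example, limit=2):
    print(commit, score)
```

Because nsmallest accepts any iterable, the pairs can be fed in as each commit is scored without holding the full history in a sorted list.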

UPDATE: I should also mention that another advantage of using git diff is that you don't have to run a hard reset for each commit. Simply symlink the repository's .git/ directory into your unknown tree (the one without a git history) and use git reset [--mixed] <commit>: it will update the current head pointer and the index but leave your source files unchanged. (Obviously, back up the unknown source tree before using this method.)
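The reset-then-diff loop described in the update might look roughly like the sketch below. It assumes the unknown tree already contains the .git symlink, and it restores the original head afterwards because git reset moves the current branch pointer. The function names and the injectable run parameter (there only so the loop can be exercised without a real repository) are assumptions of this sketch. Note also that git diff does not report untracked files, so files that exist only in the unknown tree are not counted.

```python
import subprocess


def shortstat_for_commit(tree, commit, run=subprocess.run):
    """Point HEAD and the index at `commit`, then summarize the working-tree diff."""
    # --mixed moves HEAD and resets the index but leaves files on disk untouched.
    run(["git", "reset", "--mixed", "--quiet", commit], cwd=tree, check=True)
    result = run(["git", "diff", "--shortstat"], cwd=tree, check=True,
                 capture_output=True, text=True)
    return result.stdout.strip()


def scan_commits(tree, commits, run=subprocess.run):
    """Return [(commit, shortstat_line), ...] and restore the original HEAD."""
    original = run(["git", "rev-parse", "HEAD"], cwd=tree, check=True,
                   capture_output=True, text=True).stdout.strip()
    results = []
    try:
        for commit in commits:
            results.append((commit, shortstat_for_commit(tree, commit, run)))
    finally:
        # Put the branch pointer back where it started, even after an error.
        run(["git", "reset", "--mixed", "--quiet", original], cwd=tree, check=True)
    return results
```

For a brute-force pass over every branch, the list of commits could come from git rev-list --all.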

Source: https://stackoverflow.com/questions/14718696/measuring-closeness-in-large-source-trees

Tags

git

diff

directory-structure