Is there a diff-like algorithm that handles moving blocks of lines?

Submitted by 烈酒焚心 on 2019-11-28 15:47:45
Zoë Peterson

Since you asked for an algorithm and not an application, take a look at "The String-to-String Correction Problem with Block Moves" by Walter Tichy. There are others, but that's the original, so you can look for papers that cite it to find more.

The paper cites Paul Heckel's paper "A technique for isolating differences between files" (mentioned in this answer to this question) and mentions this about its algorithm:

Heckel[3] pointed out similar problems with LCS techniques and proposed a linear-time algorithm to detect block moves. The algorithm performs adequately if there are few duplicate symbols in the strings. However, the algorithm gives poor results otherwise. For example, given the two strings aabb and bbaa, Heckel's algorithm fails to discover any common substring.
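To make the aabb/bbaa failure concrete, here is a minimal Python sketch of the core idea in Heckel's paper: symbols that occur exactly once in both inputs act as anchors, and matches are then extended to neighbouring symbols in a forward and a backward pass. The function name heckel_match and the returned mapping are illustrative assumptions, not taken from the paper, and the sketch leaves out the bookkeeping a real implementation would need.

```python
from collections import Counter


def heckel_match(old, new):
    """Map indices in `new` to matched indices in `old` (hypothetical helper)."""
    old_count, new_count = Counter(old), Counter(new)
    # Position of each symbol in `old`; only consulted for unique symbols.
    old_pos = {sym: i for i, sym in enumerate(old)}

    # Anchors: symbols that occur exactly once in each sequence.
    new_to_old = {}
    for j, sym in enumerate(new):
        if old_count[sym] == 1 and new_count[sym] == 1:
            new_to_old[j] = old_pos[sym]
    matched_old = set(new_to_old.values())

    def extend(step):
        # Grow matches to adjacent, still unmatched, identical symbols.
        if step == 1:
            order = range(len(new))
        else:
            order = range(len(new) - 1, -1, -1)
        for j in order:
            if j not in new_to_old:
                continue
            nj, oi = j + step, new_to_old[j] + step
            if (0 <= nj < len(new) and 0 <= oi < len(old)
                    and nj not in new_to_old and oi not in matched_old
                    and new[nj] == old[oi]):
                new_to_old[nj] = oi
                matched_old.add(oi)

    extend(+1)  # forward pass
    extend(-1)  # backward pass
    return new_to_old


if __name__ == "__main__":
    # Every symbol is duplicated, so there are no anchors to start from.
    print(heckel_match(list("aabb"), list("bbaa")))  # {}
    # All symbols unique: both swapped blocks are matched.
    print(heckel_match(list("abcd"), list("cdab")))  # {0: 2, 1: 3, 2: 0, 3: 1}
```

With aabb and bbaa no symbol is unique, so the anchor step finds nothing and the extension passes have nothing to grow from, which is the behaviour the quoted passage describes.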

The following method is able to detect block moves:

Paul Heckel: A technique for isolating differences between files
Communications of the ACM 21(4):264 (1978)
http://doi.acm.org/10.1145/359460.359467 (access restricted)
Mirror: http://documents.scribd.com/docs/10ro9oowpo1h81pgh1as.pdf (open access)

wikEd diff is a free JavaScript diff library that implements this algorithm and improves on it. It also includes the code to compile a text output with insertions, deletions, moved blocks, and original block positions inserted into the new text version. Please see the project page or the extensively commented code for details. For testing, you can also use the online demo.

Here's a sketch of something that may work. Ignore diff insertions/deletions for the moment for the sake of clarity.

This seems to consist of figuring out the best blocking, similar to text compression. We want to find the common substrings of the two files. One option is to build a generalized suffix tree and iteratively take the maximal common substring, remove it, and repeat until no common substring of at least some size $s$ remains. This can be done with a suffix tree in O(N^2) time (https://en.wikipedia.org/wiki/Longest_common_substring_problem#Suffix_tree). Greedily taking the maximal substring appears to be optimal (as a function of characters compressed), since taking a character sequence from another substring means adding the same number of characters elsewhere.

Each substring would then be replaced by a symbol for that block and displayed once as a sort of 'dictionary'.

$ diff a.txt b.txt 
1,3d0
< $
6a4,6
> $

 $ = 1,2,3 
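The greedy blocking step can be sketched as follows. This is only an illustration under stated assumptions: it works on lists of lines, and it swaps the suffix tree for a plain dynamic-programming longest-common-substring search, which is simpler to read but slower. The names longest_common_block, greedy_blocks and min_size are made up for this sketch; min_size plays the role of the size threshold $s$.

```python
def longest_common_block(a, b, used_a, used_b):
    """Return (length, start_a, start_b) of the longest run of equal, unused lines."""
    best = (0, 0, 0)
    prev = [0] * (len(b) + 1)
    for i in range(len(a)):
        cur = [0] * (len(b) + 1)
        if not used_a[i]:
            for j in range(len(b)):
                if not used_b[j] and a[i] == b[j]:
                    # Extend the run that ended at (i - 1, j - 1).
                    cur[j + 1] = prev[j] + 1
                    if cur[j + 1] > best[0]:
                        n = cur[j + 1]
                        best = (n, i - n + 1, j - n + 1)
        prev = cur
    return best


def greedy_blocks(a, b, min_size=1):
    """Repeatedly take the longest common block until none of min_size remains."""
    used_a = [False] * len(a)
    used_b = [False] * len(b)
    blocks = []
    while True:
        length, i, j = longest_common_block(a, b, used_a, used_b)
        if length < max(min_size, 1):
            break
        blocks.append((i, j, length))
        # Mark the block as consumed so later rounds cannot reuse its lines.
        for k in range(length):
            used_a[i + k] = True
            used_b[j + k] = True
    return blocks


if __name__ == "__main__":
    a = ["1", "2", "3", "x", "y"]
    b = ["x", "y", "1", "2", "3"]
    # Each tuple is (start in a, start in b, length).
    print(greedy_blocks(a, b))  # [(0, 2, 3), (3, 0, 2)]
```

Each returned block could then be given its own symbol, like the $ in the output above, and listed once in the dictionary.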

Now we have to reintroduce diff-like behavior. The simple (possibly non-optimal) answer is to run the diff algorithm first, omit all the text that wouldn't be output in the original diff, and run the above algorithm on what remains.
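A rough Python sketch of that two-stage idea, using the standard difflib module as a stand-in for the ordinary diff step. The function name moved_blocks and the min_size parameter are assumptions made for illustration. Lines the first diff reports as unchanged are masked with unique placeholder objects so that only deleted and inserted lines can be paired up as moves.

```python
import difflib


def moved_blocks(a, b, min_size=1):
    """Return (start_a, start_b, length) for blocks deleted from a and re-added in b."""
    # Stage 1: ordinary diff; remember which lines it would print.
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    deleted, inserted = set(), set()
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("delete", "replace"):
            deleted.update(range(i1, i2))
        if tag in ("insert", "replace"):
            inserted.update(range(j1, j2))

    # Stage 2: mask everything else with unique objects that never compare equal.
    rest_a = [line if i in deleted else object() for i, line in enumerate(a)]
    rest_b = [line if j in inserted else object() for j, line in enumerate(b)]

    moves = []
    while True:
        m = difflib.SequenceMatcher(None, rest_a, rest_b, autojunk=False)
        match = m.find_longest_match(0, len(rest_a), 0, len(rest_b))
        if match.size < max(min_size, 1):
            break
        moves.append((match.a, match.b, match.size))
        # Consume the block so it is not reported twice.
        for k in range(match.size):
            rest_a[match.a + k] = object()
            rest_b[match.b + k] = object()
    return moves


if __name__ == "__main__":
    a = ["1", "2", "3", "x", "y", "z", "w"]
    b = ["x", "y", "z", "w", "1", "2", "3"]
    print(moved_blocks(a, b))  # [(0, 4, 3)]
```

Rebuilding the matcher on every round keeps the sketch short but is wasteful; a real tool would want an incremental structure such as the suffix tree mentioned earlier.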

Our Smart Differencer tools do exactly this when computing differences between source texts of two programs in the same programming language. Differences are reported in terms of program structures (identifiers, expressions, statements, blocks) precise to line/column number, and in terms of plausible editing operations (delete, insert, move, copy [above and beyond OP's request for mere "copy"], rename-identifier-in-block).

The SmartDifferencers require a structured artifact (e.g., a programming language), so they can't do this for arbitrary text. (We could define structure to be "just lines of text" but didn't think that would be particularly valuable compared to standard diff.)

Git 2.16 (Q1 2018) introduced another possibility, by ignoring some specified moved lines.

"git diff" learned a variant of the "--patience" algorithm, to which the user can specify which 'unique' line to be used as anchoring points.

See commit 2477ab2 (27 Nov 2017) by Jonathan Tan (jhowtan).
(Merged by Junio C Hamano -- gitster -- in commit d7c6c23, 19 Dec 2017)

diff: support anchoring line(s)

Teach diff a new algorithm, one that attempts to prevent user-specified lines from appearing as a deletion or addition in the end result.
The end user can use this by specifying "--anchored=<text>" one or more times when using Git commands like "diff" and "show".

The documentation for git diff now reads:

--anchored=<text>:

Generate a diff using the "anchored diff" algorithm.

This option may be specified more than once.

If a line exists in both the source and destination, exists only once, and starts with this text, this algorithm attempts to prevent it from appearing as a deletion or addition in the output.
It uses the "patience diff" algorithm internally.

See the tests for some examples:

pre post
 a   c
 b   a
 c   b

Normally, c is moved to produce the smallest diff.
But:

 git diff --no-index --anchored=c pre post

With c anchored, a is reported as moved instead.

Kenny Evitt

SemanticMerge, the "semantic scm" tool mentioned in this comment to one of the other answers, includes a "semantic diff" that handles moving a block of lines (for supported programming languages). I haven't found any details about the algorithm, but it's possible the diff algorithm itself isn't particularly interesting, as it relies on the output of a separate parsing of the programming language source code files themselves. Here's SemanticMerge's documentation on implementing an (external) language parser, which may shed some light on how its diffs work:

I tested it just now and its diff is fantastic. It's significantly better than the one I produced using the demo of the algorithm mentioned in this answer (and that diff was itself much better than what was produced by Git's default diff algorithm) and I suspect still better than one likely to be produced by the algorithm mentioned in this answer.

In my own coding, when I move a whole block of code to another position in the source because it makes more sense logically or for readability, this is what I do:

  • clean up all the existing diffs and commit them
    • so that the file just requires the move that we are looking for
  • remove the entire block of code from the source
    • save the file
    • and stage that change
  • add the code into the new position
    • save the file
    • and stage that change
  • commit the two staged patches as one commit with a reasonable message