Java library for free-text diff [closed]

纵饮孤独 提交于 2019-11-27 01:08:16

问题


I need to match up two almost-the-same long freetext strings; i.e., to find index-to-index correspondences wherever possible.

Because this is freetext, the comparison should not be line-based as in code diffing.

Any suggestions for Java libraries?

A simple example (In real life , of course, there would not be extra whitespace to line things up, and there may be more complex challenges like entire clauses moved around.)

The quick brown  fox jumped over the  lazy     dog.
||||||||||      |||||||||||||||||||||         |||||
The quick yellow fox jumped over the well-bred dog.

回答1:


This one might be good Diff Match Patch.




回答2:


Depending on your exact requirements, the StringUtils class of the Apache Commons Lang component might be helpful, e.g.:

  • StringUtils#difference: Compares two Strings, and returns the portion where they differ
  • StringUtils#getLevenshteinDistance: Find the Levenshtein distance between two Strings



回答3:


Here's a (lightly-tested) version of code that does what you asked. You can easily traverse the result in parallel with the inputs to locate insertions and deletions.

public class StringDiff {

    private static int   length(String s) { return s == null ? 0 : s.length(); }
    private static char[] chars(String s) { return s == null ? new char[0] : s.toCharArray(); }

    private final String left;
    private final String right;

    private final char[] lccs;
    private final String lcs;

    public StringDiff(String left, String right) {
        this.left = left;
        this.right = right;
        lccs = init();
        lcs = new String(lccs);
    }

    public String getLcs()  { return lcs; }
    public char[] getLccs() { return lccs.clone(); }

    private char[] init() {
        int lLength = length(left);
        int rLength = length(right);
        char[] lChars = chars(left);
        char[] rChars = chars(right);
        int [][] t = new int [lLength + 1][rLength + 1];
        for (int i = lLength - 1; i >= 0; --i) {
            for (int j = rLength - 1; j >= 0; --j) {
                if (lChars[i] == rChars[j]) {
                    t[i][j] = t[i + 1][j + 1] + 1;
                } else {
                    t[i][j] = Math.max(t[i + 1][j], t[i][j + 1]);
                }
            }
        }
        char[] result = new char[t[0][0]];
        int l = 0, r = 0, p = 0;
        while (l < lLength && r < rLength) {
            if (lChars[l] == rChars[r]) {
                result[p++] = lChars[l++];
                r++;
            } else {
                if (t[l + 1][r] > t[l][r + 1]) {
                    ++l;
                } else {
                    ++r;
                }
            }
        }
        return result;
    }

}

According to it, the actual longest subsequence of your original inputs:

The quick brown  fox jumped over the  lazy     dog.
The quick yellow fox jumped over the well-bred dog.

is:

The quick ow fox jumped over the l dog.

(because "brown" and "yellow" have "ow" in common, etc.)

It's relatively straightforward to modify the above to split on whitespace (instead of into char arrays) and substitute String#equals for == to get a version that finds the longest common subsequence of words instead of characters. For your example above that change would produce the obvious result:

found 7 words
    'The'
    'quick'
    'fox'
    'jumped'
    'over'
    'the'
    'dog.'

(Your question implied character comparisons, as you matched the spaces between words.)




回答4:


If you're example is really what you want to do - ie subsequences only match if they start at the same index (which is different from how diffs normally operate) - this is all you need to do:

import java.util.*;

class StringDiff {
    public static List<int[]> from(String s1, String s2) {
        int start = -1;
        int pos = 0;
        LinkedList<int[]> list = new LinkedList<int[]>();

        for(; pos < s1.length() && pos < s2.length(); ++pos) {
            if(s1.charAt(pos) == s2.charAt(pos)) {
                if(start < 0) start = pos;
            }
            else {
                if(start >= 0) list.add(new int[] { start, pos });
                start = -1;
            }
        }

        if(start >= 0) list.add(new int[] { start, pos });

        return list;
    }

    public static void main(String[] args) {
        for(int[] idx : from(args[0], args[1]))
            System.out.println(args[0].substring(idx[0], idx[1]));
    }
}

An actual diff implementation will be far more sophisticated.



来源:https://stackoverflow.com/questions/479654/java-library-for-free-text-diff

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!