Modifying Levenshtein Distance algorithm to not calculate all distances

渐次进展 2020-12-31 19:40

I'm working on a fuzzy search implementation and as part of the implementation, we're using Apache's StringUtils.getLevenshteinDistance. At the moment, we're going for a

6 Answers
  •  渐次进展
    2020-12-31 19:53

    The issue with implementing the window is dealing with the value to the left of the first entry and above the last entry in each row.

    One way is to start the values you initially fill in at 1 instead of 0, then just ignore any 0s that you encounter. You'll have to subtract 1 from your final answer.
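
    A minimal sketch of that first approach, under a few assumptions: the class and helper names are hypothetical, a full matrix is used instead of two rolling arrays to keep it readable, and threshold is taken to be non-negative. Every cell stores distance + 1, so Java's default 0 marks a cell outside the window.

```java
public class OffsetBandedLevenshtein {
    // Hypothetical helper: smaller of two offset values, treating 0 as "no value".
    static int minIgnoringZero(int a, int b) {
        if (a == 0) return b;
        if (b == 0) return a;
        return Math.min(a, b);
    }

    // Returns the edit distance if it is <= threshold, otherwise -1.
    static int distance(String s, String t, int threshold) {
        int slen = s.length();
        int tlen = t.length();
        // cells hold distance + 1; untouched cells stay 0 and are ignored
        int[][] m = new int[slen + 1][tlen + 1];
        for (int j = 0; j <= Math.min(tlen, threshold); j++)
            m[0][j] = j + 1;

        for (int i = 1; i <= slen; i++) {
            if (i <= threshold)
                m[i][0] = i + 1;
            int lo = Math.max(1, i - threshold);
            int hi = Math.min(tlen, i + threshold);
            for (int j = lo; j <= hi; j++) {
                int diag = m[i - 1][j - 1];
                if (diag != 0 && s.charAt(i - 1) != t.charAt(j - 1))
                    diag += 1; // substitution
                int left = m[i][j - 1] == 0 ? 0 : m[i][j - 1] + 1; // insertion
                int up = m[i - 1][j] == 0 ? 0 : m[i - 1][j] + 1;   // deletion
                m[i][j] = minIgnoringZero(diag, minIgnoringZero(left, up));
            }
        }
        // subtract the offset; a cell that was never reached gives 0 - 1 = -1
        int result = m[slen][tlen] - 1;
        return result > threshold ? -1 : result;
    }
}
```

    The final subtraction is the "subtract 1 from your final answer" step; a value above the threshold is reported as -1 because only results inside the threshold are exact.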

    Another way is to fill the entries left of the first and above the last with a very large value (Integer.MAX_VALUE below) so the minimum check will never pick them. That's the way I chose when I had to implement it the other day:

    // requires: import java.util.Arrays;
    public static int levenshtein(String s, String t, int threshold) {
        int slen = s.length();
        int tlen = t.length();
    
        // swap so the smaller string is t; this reduces the memory usage
        // of our buffers
        if(tlen > slen) {
            String stmp = s;
            s = t;
            t = stmp;
            int itmp = slen;
            slen = tlen;
            tlen = itmp;
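        }

        // the length difference is a lower bound on the distance; bailing out here
        // also keeps the d[min-1] write below from running past the end of the array
        if(slen - tlen > threshold)
            return -1;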
    
        // p is the previous and d is the current distance array; dtmp is used in swaps
        int[] p = new int[tlen + 1];
        int[] d = new int[tlen + 1];
        int[] dtmp;
    
        // the values necessary for our threshold are written; the ones after
        // must be filled with large integers since the trailing member of the threshold
        // window in the bottom array will run min across them
        int n = 0;
        for(; n < Math.min(p.length, threshold + 1); ++n)
            p[n] = n;
        Arrays.fill(p, n, p.length, Integer.MAX_VALUE);
        Arrays.fill(d, Integer.MAX_VALUE);
    
        // this is the core of the Levenshtein edit distance algorithm
        // instead of actually building the matrix, two arrays are swapped back and forth
        // the threshold limits the number of entries that need to be computed if we're
        // looking for a match within a set distance
        for(int row = 1; row <= slen; ++row) {
            char schar = s.charAt(row-1);
            d[0] = row;
    
            // set up our threshold window
            int min = Math.max(1, row - threshold);
            int max = Math.min(d.length, row + threshold + 1);
    
            // since we're reusing arrays, we need to be sure to wipe the value left of the
            // starting index; we don't have to worry about the value above the ending index
            // as the arrays were initially filled with large integers and we progress to the right
            if(min > 1)
                d[min-1] = Integer.MAX_VALUE;
    
            for(int col = min; col < max; ++col) {
                if(schar == t.charAt(col-1))
                    d[col] = p[col-1];
                else 
                    // min of: diagonal, left, up
                    d[col] = Math.min(p[col-1], Math.min(d[col-1], p[col])) + 1;
            }
            // swap our arrays
            dtmp = p;
            p = d;
            d = dtmp;
        }
    
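        // a windowed result is exact only when it is within the threshold;
        // anything larger is just an upper bound, so report it as "no match"
        if(p[tlen] > threshold)
            return -1;
        return p[tlen];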
    }
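
    To see which cells the window covers, the min/max computation from the top of the row loop can be pulled out on its own. This is only an illustrative sketch: the BandWindow class and window method are hypothetical names, and the lengths in main are arbitrary example values.

```java
public class BandWindow {
    // Returns {min, max}: columns min..max-1 are computed for the given row,
    // mirroring the bounds used in the row loop above (d.length == tlen + 1).
    static int[] window(int row, int tlen, int threshold) {
        int min = Math.max(1, row - threshold);
        int max = Math.min(tlen + 1, row + threshold + 1);
        return new int[] { min, max };
    }

    public static void main(String[] args) {
        int slen = 6, tlen = 5, threshold = 1; // example values only
        for (int row = 1; row <= slen; row++) {
            int[] w = window(row, tlen, threshold);
            System.out.println("row " + row + ": columns [" + w[0] + ", " + w[1] + ")");
        }
    }
}
```

    With a threshold of 1 each row touches at most three columns, so the work per row depends on the threshold rather than on the length of the shorter string.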
    
