Boyer-Moore good-suffix heuristics

问题

I understand how the bad character heuristics work. When you find the mismatched letter x, just shift the pattern so the rightmost x in the pattern would be aligned with the x in the string. And it's easy to implement in code.

I think I also understand how the good-suffix heuristics work. When we find a good suffix s, find the same suffix in different location in the pattern and shift it so the s in the pattern would be aligned with the s in the string. But I don't understand how to do that in code. How do we find if the same suffix exists in another place in pattern? And how do we know its position? The code:

void bmPreprocess1()
{
    int i=m, j=m+1;
    f[i]=j;
    while (i>0)
    {
        while (j<=m && p[i-1]!=p[j-1])
        {
            if (s[j]==0) s[j]=j-i;
            j=f[j];
        }
        i--; j--;
        f[i]=j;
    }
}

from http://www.iti.fh-flensburg.de/lang/algorithmen/pattern/bmen.htm doesn't make sense to me... Could someone write as simple as possible pseudo code for this task? Or explain somehow?

回答1:

First, I will use p[i] denote a character in the pattern, m the pattern lenght, $ the last character in the pattern, i.e., $ = p[m-1].

There are two scenarios for good suffix heuristics case 1.

Situation 1

Consider the following example,

    leading TEXT cXXXbXXXcXXXcXXX rest of the TEXT
         cXXXbXXXcXXXcXXX
                     ^ 
                     | mismatch here

So the sub string XXX in the pattern cXXXbXXXcXXXcXXX is the good suffix. The mismatch occurs at character c. So after the mismatch, should we shift 4 to the right or 8?

If we shift 4 as in the following, then the same mismath will occur again (b mismathes c),

    leading TEXT cXXXbXXXcXXXcXXX rest of the TEXT
             cXXXbXXXcXXXcXXX
                     ^ 
                     | mismatch occurs here again

So we can actually shift 8 characters to the right in this situation.

Situation 2

Let us look at another example

    leading TEXT cXXXcXXXbXXXcXXX rest of the TEXT
            cXXXXcXXXbXXXcXXX
                         ^ 
                         | mismatch happens here

Can we shift 4 or 8 or more here? Obviously we if we shift 8 or more, we will miss the opportunity to match the pattern with the text. So we can only shift 4 characters to the right in this situation.

So what is the difference between these two situations?

The difference is that in the first case, the mismatched character c plus the good suffix XXX, which is cXXX, is the same as the next (count from the right) match for XXX plus the character before that. While in the second situation, cXXX is not the same as the next match (count from the right) plus the character before that match.

So for any given GOOD SUFFIX (let us call it XXX) we need to figure out the shift in these two situations, (1) the character (let us call it c) before the GOOD SUFFIX plus the GOOD SUFFIX, in the pattern is also match the next (count from the right) match of the good suffix + the character before it , (2) the character plus the GOOD SUFFIX does not match

For situation (1), for example, if you have a pattern, 0XXXcXXXcXXXcXXXcXXXcXXX, if after the first test of c fails, you can shift 20 characters to the right, and align 0XXX with the text that been tested.

This is how the number 20 is calculated,

              1         2
    012345678901234567890123
    0XXXcXXXcXXXcXXXcXXXcXXX
    ^                   ^

The position the mismatch occurs is 20, the shifted sub string 0XXX will take position from 20 to 23. And 0XXX starts with position 0, so 20 - 0 = 20.

For situation (2), like in this example, 0XXXaXXXbXXXcXXX, if after the first test of c fails, you can shift only 4 characters to the right, and align bXXX with the text that been tested.

This is how number 4 is calculated,

    0123456789012345
    0XXXaXXXbXXXcXXX

The position where the mismatch occurs is 12, the next substring to take that place is bXXX which starts with position 8, 12 - 8 = 4. If we set j = 12, and i = 8, then the shift is j - i, which is s[j] = j - i; in the code.

Border

To consider all the good suffix, we first need to understand what is a so called border. A border is a substring which is both a proper prefix and a proper suffix of a string. For example, for a string XXXcXXX, X is a border, XX is a border, XXX too. But XXXc is not. We need to identify the starting point of the widest border of the suffix of the pattern and this info is saved in array f[i].

How to determine f[i] ?

Assume we know f[i] = j and for all other f[k]s with i < k < m, which means the widest border for the suffix starting from position i started at position j. We want to find f[i-1] based on f[i].

For example, for a pattern aabbccaacc, at postion i=4, the suffix is ccaacc, and the widest border for that is cc (p[8]p[9]), so j = f[i=4] = 8. And now we want to know f[i-1] = f[3] based on the info we have for f[4], f[5], ... For f[3], the suffix now is bccaacc. At position, j-1=7, it is a != p[4-1] which is b. So bcc is not a border.

We know any border with width >= 1 of bccaacc has to begin with b plus the border of the suffix starting from positin j = 8, which is cc in this example. cc has the widest border c at position j = f[8] which is 9. So we continue our search with comparing p[4-1] against p[j-1]. And they are not equal again. Now the suffix is p[9] = c and it has only zero length border at position 10. so now j = f[9] and it is 10. So we continue our search with comparing p[4-1] against p[j-1], they are not equal and that is the end of the string. Then f[3] has only zero length border which make it equal to 10.

To describe the process in a more general sense

Therefore, f[i] = j means something like this,

    Position:    012345     i-1  i     j - 1   j       m
    pattern:     abcdef ... @    x ... ?       x ...   $

If character @ which at position i - 1 is the same as character ? at position j - 1, we know that f[i - 1] = j - 1; , or --i; --j; f[i] = j;. The border is suffix @x ... $ starting from position j-1.

But if character @ which at position i - 1 is different from character ? at position j - 1, we have to continue our search to the right. We know two facts: (1) we know now the border width has to be smaller than the one started from position j, i.e, smaller than x...$. Second the border has to be begin with @... and ends with character $ or it could be empty.

Based on these two facts, we continue our search within sub string x ... $ (from position j to m) for a border begin with x. Then the next border should be at j which is equal to f[j];, i.e. j = f[j];. Then we compare character @ with the character before x, which is at j-1. If they are equal, we found the border, if not, continue the process until j > m. This process is shown by the following code,

    while (j<=m && p[i-1]!=p[j-1])
    {
        j=f[j];
    }

    i--; j--;
    f[i]=j;

Now look at condition p[i -1] !=p[j-1], this is what we talked about in situation (2),p[i]matchesp[j], butp[i -1] != p[j-1], so we shift from i to j, that that is s[j] = j - i;.

Now the only thing left not explained is the test if (s[j] == 0) which will occur when a shorter suffix has the same border. For example, you have a pattern

    012345678
    addbddcdd

When you calculate f[i - 1] and i = 4, you will set s[7]. But when you calculate f[i-1] for i = 1, you will set s[7] again if you don't have the test if (s[j] == 0). This means if you have mismatch at position 6, you shift 3 to the right (align bdd to the positions cdd occupied) not 6 (not shift until add to the positions cdd occupied).

The comments for the code

    void bmPreprocess1()
    {
        // initial condition, set f[m] = m+1;
        int i=m, j=m+1;

        f[i]=j;

        // calculate f[i], s[j] from i = m to 0.
        while (i>0)
        {
            // at this line, we know f[i], f[i+1], ... f[m].
            while (j<=m && p[i-1]!=p[j-1]) // calculate the value of f[i-1] and save it to j-1
            {
                if (s[j]==0) s[j]=j-i; // if s[j] is not occupied, set it.
                j=f[j]; // get the start position of the border of suffix p[j] ... p[m-1]
            }
            // assign j-1 to f[i-1]
            i--; j--;
            f[i]=j;
        }
    }

来源：https://stackoverflow.com/questions/19345263/boyer-moore-good-suffix-heuristics

标签

algorithm

boyer-moore