Rabin Karp string matching algorithm

断了今生、忘了曾经 提交于 2019-11-29 03:39:43

About the magic numbers Paul's answer is pretty clear.

As far as the code is concerned, Rabin Karp's principal idea is to perform an hash comparison between a sliding portion of the string and the pattern.

The hash cannot be computed each time on the whole substrings, otherwise the computation complexity would be quadratic O(n^2) instead of linear O(n).

Therefore, a rolling hash function is applied, such as at each iteration only one character is needed to update the hash value of the substring.

So, let's comment your code:

for (int i = 0; i < B.Length; i++)
{
    siga = (siga * D + (ulong)A[i]) % Q;
    sigb = (sigb * D + (ulong)B[i]) % Q;
}
if (siga == sigb)
{
    Console.WriteLine(string.Format(">>{0}<<{1}", A.Substring(0, B.Length), A.Substring(B.Length)));
    return;
}

^ This piece computes the hash of pattern B (sigb), and the hashcode of the initial substring of A of the same length of B. Actually it's not completely correct because hash can collide¹ and so, it is necessary to modify the if statement : if (siga == sigb && A.Substring(0, B.Length) == B).

ulong pow = 1;
for (int k = 1; k <= B.Length - 1; k++)
    pow = (pow * D) % Q;

^ Here's computed pow that is necessary to perform the rolling hash.

for (int j = 1; j <= A.Length - B.Length; j++)
{
    siga = (siga + Q - pow * (ulong)A[j - 1] % Q) % Q;
    siga = (siga * D + (ulong)A[j + B.Length - 1]) % Q;
    if (siga == sigb)
    {
        if (A.Substring(j, B.Length) == B)
        {
            Console.WriteLine(string.Format("{0}>>{1}<<{2}", A.Substring(0, j),
                                                                A.Substring(j, B.Length),
                                                                A.Substring(j + B.Length)));
            return;
        }
    }
}

^ Finally, the remaining string (i.e. from the second character to end), is scanned updating the hash value of the A substring and compared with the hash of B (computed at the beginning).

If the two hashes are equal, the substring and the pattern are compared¹ and if they're actually equal a message is returned.


¹ Hash values can collide; hence, if two strings have different hash values they're definitely different, but if the two hashes are equal they can be equal or not.

The algorithm uses hashing for fast string comparison. Q and D are magic numbers that the coder probably arrived at with a little bit of trial and error and give a good distribution of hash values for this particular algorithm.

You can see these types of magic numbers used for hashing many places. The example below is the decompiled definition of the GetHashCode function of a .NET 2.0 string type:

[ReliabilityContract(Consistency.WillNotCorruptState, Cer.MayFail)]
public override unsafe int GetHashCode()
{
    char* chrPointer = null;
    int num1;
    int num2;
    fixed (string str = (string)this)
    {
        num1 = 352654597;
        num2 = num1;
        int* numPointer = chrPointer;
        for (int i = this.Length; i > 0; i = i - 4)
        {
            num1 = (num1 << 5) + num1 + (num1 >> 27) ^ numPointer;
            if (i <= 2)
            {
                break;
            }
            num2 = (num2 << 5) + num2 + (num2 >> 27) ^ numPointer + (void*)4;
            numPointer = numPointer + (void*)8;
        }
    }
    return num1 + num2 * 1566083941;
}

Here is another example from a R# generated GetHashcode override function for a sample type:

    public override int GetHashCode()
    {
        unchecked
        {
            int result = (SomeStrId != null ? SomeStrId.GetHashCode() : 0);
            result = (result*397) ^ (Desc != null ? Desc.GetHashCode() : 0);
            result = (result*397) ^ (AnotherId != null ? AnotherId.GetHashCode() : 0);
            return result;
        }
    }
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!