[Algo] Rolling hash; Rabin-Karp string search

This is yet another string search algorithm. (Previously my readers have read about Knuth-Morris-Pratt.)

My naive DIY algorithm

The following algorithm searches for the 'hardware' substring:

#include <string.h>
#include <stdint.h>
#include <stdio.h>

int search_for_hardware(const char *s)
{
size_t len=strlen(s);
uint64_t current=0;
for (int i=0; i<len; i++)
{
current=current << 8;      // drop highest byte
// lowest byte cleared already after shifting
current=current | s[i];    // add lowest byte. this is single read from memory
//that will also work: if (memcmp (&current, "erawdrah", 8)==0) // "hardware" string reversed
if (current==0x6861726477617265) // MSB: 'h' ... LSB: 'e'
return i-(8-1); // strlen("hardware")-1
};
};

void main()
{
printf ("%d\n", search_for_hardware("hardware haha"));
printf ("%d\n", search_for_hardware(" hardware haha"));
printf ("%d\n", search_for_hardware("haha hardware"));
printf ("%d\n", search_for_hardware("haha hardware yes"));
printf ("%d\n", search_for_hardware("haha software"));
printf ("%d\n", search_for_hardware("software haha"));
};


It's almost perfect: it just does the job, in linear fashion, like KMP. Only one byte is loaded for each character of string. Couple of more operations executed for each character: bit shift, OR and comparison of two 64-bit values.

After some tweaking you can make it searching for any string of 1..8 length. It has obvious limitation: a substring is limited by 8 characters. But you can extend it to 1..16 characters by using SSE or to 1..32 characters by using AVX. Well, even to 1..64 characters if using AVX-512. But still, you can't treat this algorithm as a general-purpose one.

Rolling hash

How would you fingerprint a piece of data? Using hash function, OK. Maybe not even cryptographically strong one. Maybe CRC32/CRC64 would suffice, why not.

But you need such a function, so that you could modify it's result (or hash). You'll want to add a new character/byte to the hash and drop the highest character/byte from it.

Obviously, this wouldn't be possible with cryptographically secure hash functions like SHA-xxx, etc. But we don't need such complex functions, after all. We just want to fingerprint a substring and find that hash in an input string.

Let's pretend, our CPU registers are of infinite capacity, they can store bignums. Then you can just shift a value 8 bits left at each iteration and drop highest 8 bits.

In practice, we will store bignums in a 32-bit variable, but modulo something, and then we will modify it.

We choose a prime number for modulo: q=2038077073.

Now I'm encoding the "hello" string in Wolfram Mathematica:

In[]:= a=ToCharacterCode["hello"]
Out[]= {104,101,108,108,111}

In[]:= b=Reverse[256^Range[0,Length[a]-1]]
Out[]= {4294967296,16777216,65536,256,1}

In[]:= Total[a*b]
Out[]= 448378203247


This number is just all ASCII codes joined together:

In[]:= BaseForm[Total[a*b],16]
Out[]:= 0x68656c6c6f


That wouldn't fit in a 32-bit varible, so get a remainder from division, or modulo:

In[6]:= Mod[Total[a*b],q]
Out[6]= 1247187


This is the hash of the "hello" string.

Let's say, our algorithm is rolling on a string like "helloworld". Now we'll modify the hash by dropping the 'h' character at the front and adding 'w' character at the end:

In[]:= RM=256^(5-1)
Out[]= 4294967296

In[]:= Mod[(1247187-RM*ToCharacterCode["h"][[1]])*256+ToCharacterCode["w"][[1]],q]
Out[]= 1500326098


This is indeed a hash of the "ellow" string, if calculated from scratch:

In[]:= a=ToCharacterCode["ellow"]
Out[]= {101,108,108,111,119}

In[]:= b=Reverse[256^Range[0,Length[a]-1]]
Out[]= {4294967296,16777216,65536,256,1}

In[]:= Mod[Total[a*b],q]
Out[]= 1500326098


So how it works? The hardest part is dropping the 'h' character:

In[]:= RM=256^(5-1)
Out[]= 4294967296

... (1247187-RM*ToCharacterCode["h"][[1]]) ...


RM is a base for 5th character. If using bignum, we would subtract RM*ascii_code_of_character from the hash. But the same is true for a hash in 'modulo' form.

The next part is shifting a hash 8 bits left:

... *256 ...


And the next part is adding the 'w' character at the end:

... +ToCharacterCode["w"][[1]] ...


It doesn't need to be multiplied by anything: it's already at a lowest base.

The final part is modulo again, or calculating remainder from division:

new hash = Mod[modified hash, q]


That value would always fit in 32-bit variable.

The prime number is just big enough, but smaller than $2^{32}$. It can be random. Why prime? To make hashes more uniform. Prime numbers are used in hash functions for that reason, see: https://cs.stackexchange.com/questions/28019/why-is-the-base-used-to-compute-hashes-in-rabin-karp-always-primes .

Working version from R.Sedgewick's book

The following C code is my commented remake of R.Sedgewick's Java code from his excellent book.

The code is no harder that what I did using Mathematica, but it uses several tricks to fit everything in 64/32-bit variables. The hash() function computes hash just like I did, but uses Horner's method to optimize exponentiation operation out.

Also, you may ask, why $q$ is added to hash before dropping character?

...
txtHash = (txtHash + q - RM*txt[i-m] % q) % q;
...


... to prevent negative result after subtracting.

Also, due to hash collisions, the result must be rechecked using memcmp(), but this function will be called rarely enough, so no problems.

There are two memory reads per one character of the input string: we load a character to be added to hash and the one to be dropped from it. This is may be slightly worse than KMP algorithm. However, Rabin-Karp algo can scale perfectly. No matter, how long substring is, the fingerprint/hash is always 32-bit, while KMP algorithm may construct a large table for a large substring.

#include <string.h>
#include <stdint.h>
#include <stdio.h>

uint64_t hash (const char *s, size_t m, uint64_t q)
{
// Horner’s method, applied to modular hashing
uint64_t h=0;
for (int j = 0; j < m; j++)
h = (256 * h + s[j]) % q;
return h;
};

int search(const char *pat, const char *txt)
{
size_t m=strlen(pat);
size_t n=strlen(txt);
uint64_t q=0x797a9691; // a large prime, small enough to avoid long overflow
uint64_t RM;           // 256^(m-1) % q
uint64_t patHash;

// precompute 256^(m-1) % q for use in removing leading digit
RM = 1;
for (int i = 1; i <= m-1; i++)
RM = (256 * RM) % q;

patHash=hash (pat, m, q);

if (n < m)
return n;
uint64_t txtHash=hash (txt, m, q);
// check for match at offset 0
// additional memcmp() required to prevent hash collision
if (patHash == txtHash && memcmp(txt, pat, m)==0)
return 0;

// check for hash match; if hash match, check for exact match
for (int i = m; i < n; i++)
{
// Remove leading digit, add trailing digit, check for match.
txtHash = (txtHash + q - RM*txt[i-m] % q) % q;
txtHash = (txtHash*256 + txt[i]) % q;

// match
int offset = i - m + 1;
// additional memcmp() required to prevent hash collision
if ((patHash == txtHash) && memcmp(txt+offset, pat, m)==0)
return offset;
}

// no match
return n;
};

void main()
{
const char* pat="hardware";
printf ("%d\n", search(pat,"hardware haha"));
printf ("%d\n", search(pat," hardware haha"));
printf ("%d\n", search(pat,"haha hardware"));
printf ("%d\n", search(pat,"haha hardware yes"));
printf ("%d\n", search(pat,"haha software"));
printf ("%d\n", search(pat,"software haha"));

// another test:
uint64_t q=0x797a9691; // a large prime, small enough to avoid long overflow
// precompute 256^(m-1) % q for use in removing leading digit
uint64_t RM = 1;
for (int i = 1; i <= 5-1; i++)
RM = (256 * RM) % q;

uint64_t h=hash ("hello", 5, q);
printf ("h=%lx\n", h);
// remove the char at the front:
h = (h + q - RM*'h' % q) % q;
printf ("h=%lx\n", h);
// add the char at the end:
h = (h*256 + 'w') % q;
printf ("h=%lx\n", h);

hash ("ellow", 5, q);
printf ("h=%lx\n", h);
};



That code is fully workable and can be copypasted to production code. Unlike R.Sedgewick's version, mine has $q$ prime number fixed. Also, the radix is always 256, because we work on bytes.

Yes, I know about these lousy Disqus ads. Please use adblocker. I would consider to subscribe to 'pro' version of Disqus if the signal/noise ratio in comments would be good enough.