Knuth-Morris-Pratt string-searching algorithm (part III): DFA-less version

(The following text has been copypasted to the SAT/SMT by example book.)

(See part II.)

DFAs generated by the KMP algorithms are sparse and have regularities we can observe easily. One popular optimization is not using DFA, but rather a small "partial match" table:

// Knuth–Morris–Pratt algorithm
// copypasted from http://cprogramming.com/snippets/source-code/knuthmorrispratt-kmp-string-search-algorithm
char *kmp_search(char *haystack, size_t haystack_size, char *needle, size_t needle_size)
{
	int *T;
	int i, j;
	char *result = NULL;
 
	if (needle_size==0)
		return haystack;
 
	/* Construct the lookup table */
	T = (int*) malloc((needle_size+1) * sizeof(int));
	T[0] = -1;
	for (i=0; i<needle_size; i++)
	{
		T[i+1] = T[i] + 1;
		while (T[i+1] > 0 && needle[i] != needle[T[i+1]-1])
			T[i+1] = T[T[i+1]-1] + 1;
	}

        printf ("restarts table:\n");
	for (i=0; i<needle_size+1; i++)
		printf ("T[%d]=%d\n", i, T[i]);

	/* Perform the search */
	for (i=j=0; i<haystack_size; )
	{
                if (j>=0)
                        print_compare (haystack, i, needle, j);
		if (j < 0 || haystack[i] == needle[j])
		{
			++i, ++j;
			if (j == needle_size)
			{
				result = haystack+i-j;
				break;
			}
		}
		else
                {
			j = T[j];
                        if (j==-1)
                                printf ("Restarting needle at the beginning\n");
                        else
                                print_state_needle(needle, j);
                };
	}
 
	free(T);
	return result;
}

Now let's search for 'eel' in the 'eex eel' string:

restarts table:
T[0]=-1
T[1]=0
T[2]=1
T[3]=0
going to compare: eex eel
                  eel
                  ^
going to compare: eex eel
                  eel
                   ^
going to compare: eex eel
                  eel
                    ^
Restarting needle at: eel
                       ^
going to compare: eex eel
                   eel
                    ^
Restarting needle at: eel
                      ^
going to compare: eex eel
                    eel
                    ^
Restarting needle at the beginning
going to compare: eex eel
                     eel
                     ^
Restarting needle at the beginning
going to compare: eex eel
                      eel
                      ^
going to compare: eex eel
                      eel
                       ^
going to compare: eex eel
                      eel
                        ^

The T[] table is small, which is good for cache memory. But please note: a character is to be loaded twice in case of restart. But this is fine, it's cached anyway at that moment.

Now the interesting thing: if the 'partial match' table is empty (all zeroes), the search function can be implemented naively, like I tried in the part I.

These are the cases of words like 'tesa', 'lee', 'banana'. But a non-empty table must be constructed for words like 'eel', 'oops', 'test', 'elee' -- these are words where a prefix (1-character minimum) is repeated in the word. But this is not true for 'banana' -- there is no repeating prefix.

All the files I used, including all the papers I've found interesting.

Further reading.


List of my other blog posts.

Yes, I know about these lousy Disqus ads. Please use adblocker. I would consider to subscribe to 'pro' version of Disqus if the signal/noise ratio in comments would be good enough.