Knuth-Morris-Pratt string-searching algorithm (part I) + my homebrew algorithms formally verified using CBMC

(The following text has been copypasted to the SAT/SMT by example book.)

Searching for the "ok" substring

Imagine you want to find an "ok" substring within a string. You would do:

unsigned search_ok (char *s, unsigned len)
{
        if (len<2)
                return len;       // not found
        for (unsigned i=0; i<len-1; i++)
        {
                if (s[i]=='o' && s[i+1]=='k')
                        return i; // found
        };
        return len;               // not found
};

You see, there are two memory accesses per one character of the input string (at average). Hence, the total number of all memory accesses can be len*2 at worst.

Is it possible to reduce that number? Yes. In the following example, we have only one single memory access per character:

unsigned search_ok (char *s, unsigned len)
{
        if (len<2)
                return len;           // not found
        bool seen_o=false;
        for (unsigned i=0; i<len; i++)
        {
                char ch=s[i];         // this is single read operation
                if (ch=='o')
                        seen_o=true;
                else if (seen_o && ch=='k')
                        return i-1;   // found
                else
                        seen_o=false; // reset
        };
        return len; // not found
};

Searching for the "eel" substring, verification using CBMC

Can we extend that method to any 3-character strings? Let's write a function that searches for the 'eel' string. (For the non-English speakers: this is indeed a word. Some kind of fish.)

We will use the 'seen' variable, reflecting, how many characters of the sought string we've already seen:

unsigned search_eel (char *s, unsigned len)
{
        if (len<3)
                return len;         // not found
        unsigned seen=0;
        for (unsigned i=0; i<len; i++)
        {
                char ch=s[i];       // this is single read operation
                if (seen==0 && ch=='e')
                        seen=1;
                else if (seen==1 && ch=='e')
                        seen=2;
                else if (seen==2 && ch=='l')
                        return i-2; // found
                else
                        seen=0;     // reset
        }
        return len;                 // not found
};

But this implementation has a bug. To find it, i'll use CBMC, the excellent tool, that can verify if a C function is equivalent to another function. (To read more about verification: SAT/SMT by Example.)

This is like Test-driven development (TDD).

I'll add a 'reference' (though, slow) function, but always correct:

unsigned search_eel_brute (char *s, unsigned len)
{
        if (len<3)
                return len; // not found
        for (unsigned i=0; i<len-2; i++)
        {
                if (s[i]=='e' && s[i+1]=='e' && s[i+2]=='l')
                        return i;
        };
        return len; // not found
};

...

void check()
{
        unsigned len=LEN;
        char s[len];
        __CPROVER_assert (search_eel_brute(s, len)==search_eel(s, len), "assert");
};
% cbmc --trace --function check -DLEN=4 kmp_eel2.c
...

** Results:
[check.assertion.1] assert: FAILURE

Trace for check.assertion.1:

...

State 21 file kmp_eel2.c line 41 function check thread 0
----------------------------------------------------
  s={ 'e', 'e', 'e', 'l' } ({ 01100101, 01100101, 01100101, 01101100 })
...

It failed for the string "eeel". After some thinking, we can find a problem. If a third character isn't 'l', but 'e', we are in the middle of a long string of 'e' characters. So if seen==2 and the input character isn't 'l', but 'e', we shouldn't advance the 'seen' variable:

...
                if (seen==0 && ch=='e')
                        seen=1;
                else if (seen==1 && ch=='e')
                        seen=2;
                else if (seen==2 && ch=='l')
                        return i-2;           // found
                else if (seen==2 && ch=='e')  // fix
                        seen=2;               // fix
                else
                        seen=0;               // reset
...

Now the verification is successfull up to strings of length 15:

#!/bin/bash
for i in $( seq 0 15 ); do
    #echo $i
    cbmc --trace --function check -DLEN=$i kmp_eel2.c
done

Searching for the "cocos" substring, verification using CBMC

The "cocos" substring is much more problematic. But fixing it with the help of CBMC is a great programming exercise.

unsigned search_cocos_brute (char *s, unsigned len)
{
        if (len<5)
                return len;       // not found
        for (unsigned i=0; i<len-4; i++)
        {
                if (s[i]=='c' &&
                                s[i+1]=='o' &&
                                s[i+2]=='c' &&
                                s[i+3]=='o' &&
                                s[i+4]=='s')
                        return i; // found
        };
        return len;               // not found
};

unsigned search_cocos_naive (char *s, unsigned len)
{
        if (len<5)
                return len;          // not found
        unsigned seen=0;
        for (unsigned i=0; i<len; i++)
        {
                char ch=s[i];        // this is single read operation
                if (seen==0 && ch=='c')
                        seen=1;
                else if (seen==1 && ch=='o')
                        seen=2;
                else if (seen==2 && ch=='c')
                        seen=3;
                else if (seen==3 && ch=='o')
                        seen=4;
                else if (seen==4 && ch=='s')
                        return i-4;  // found
                else
                        seen=0;      // reset
        }
        return len;                  // not found
};

void check()
{
        unsigned len=LEN;
        char s[len];
        __CPROVER_assert (search_cocos_brute(s, len)==search_cocos_naive(s, len), "assert");
};

This is the first bug:

% cbmc --trace --function check -DLEN=6 kmp_cocos.c
...
** Results:
[check.assertion.1] assert: FAILURE

Trace for check.assertion.1:

...

State 21 file kmp_cocos.c line 47 function check thread 0
----------------------------------------------------
  s={ 'c', 'c', 'o', 'c', 'o', 's' } ({ 01100011, 01100011, 01101111, 01100011, 01101111, 01110011 })

...

A test failed for the "ccocos" string.

Here is the fix: if the first 'c' is repeating, we shouldn't advance the 'seen' variable:

                if (seen==0 && ch=='c')
                        seen=1;
                else if (seen==1 && ch=='o')
                        seen=2;
                else if (seen==1 && ch!='o')
                {
                                     // we can be here if the input is 'ccocos'
                        if (ch=='c')
                                seen=1;
                        else
                                seen=0;
                }
                else if (seen==2 && ch=='c')
                        seen=3;
                else if (seen==3 && ch=='o')
                        seen=4;
                else if (seen==4 && ch=='s')
                        return i-4;  // found
                else
                        seen=0;      // reset

CBMC can verify this function for all 6-character strings, OK. But it can find problematic 7-character string:

% cbmc --trace --function check -DLEN=7 kmp_cocos.c
...
** Results:
[check.assertion.1] assert: FAILURE

Trace for check.assertion.1:

...
State 21 file kmp_cocos.c line 79 function check thread 0
----------------------------------------------------
  s={ 'c', 'o', 'c', 'o', 'c', 'o', 's' } ({ 01100011, 01101111, 01100011, 01101111, 01100011, 01101111, 01110011 })
...

That is, "cococos".

And this is my fix:

...
                else if (seen==4 && ch=='s')
                        return i-4;  // found
                else if (seen==4 && ch!='s')
                {
                        // the input string is 'cocoX' where X is not 's'
                        // (current state of ch='X')
                        // but 'X' could be 'c' if the input string is 'cococos'
                        if (ch=='c')
                        {
                                // if the string is 'cococos',
                                // we can say that we have already seen the 'coc' part of it:
                                seen=3;
                        }
                        else
                        {
                                // 'X' is not 'c', so reset
                                seen=0;
                        };
                }
                else
                        seen=0;      // reset
...

Now all 7-characters strings can be tested without a fail. But not 8-character ones:

% cbmc --trace --function check -DLEN=8 kmp_cocos.c
...

** Results:
[check.assertion.1] assert: FAILURE

Trace for check.assertion.1:

...

State 21 file kmp_cocos.c line 96 function check thread 0
----------------------------------------------------
  s={ 'c', 'o', 'c', 'c', 'o', 'c', 'o', 's' } ({ 01100011, 01101111, 01100011, 01100011, 01101111, 01100011, 01101111, 01110011 })

...

It fails with "coccocos". We have to add another check for the repeating second 'c' character.

...
                else if (seen==3 && ch=='o')
                        seen=4;
                else if (seen==3 && ch!='o')
                {
                        // if input='coccocos'
                        if (ch=='c')
                                seen=1;
                        else
                                seen=0; // reset
                }
                else if (seen==4 && ch=='s')
...

Now CBMC can check it all up to 15-character strings. The whole fixed function is:

unsigned search_cocos_fixed (char *s, unsigned len)
{
        if (len<5)
                return len;          // not found
        unsigned seen=0;
        for (unsigned i=0; i<len; i++)
        {
                char ch=s[i];        // this is single read operation
                if (seen==0 && ch=='c')
                        seen=1;
                else if (seen==1 && ch=='o')
                        seen=2;
                else if (seen==1 && ch!='o')
                {
                                     // we can be here if the input is 'ccocos'
                        if (ch=='c')
                                seen=1;
                        else
                                seen=0;
                }
                else if (seen==2 && ch=='c')
                        seen=3;
                else if (seen==3 && ch=='o')
                        seen=4;
                else if (seen==3 && ch!='o')
                {
                        // if input='coccocos'
                        if (ch=='c')
                                seen=1;
                        else
                                seen=0; // reset
                }
                else if (seen==4 && ch=='s')
                        return i-4;  // found
                else if (seen==4 && ch!='s')
                {
                        // the input string is 'cocoX' where X is not 's'
                        // (current state of ch='X')
                        // but 'X' could be 'c' if the input string is 'cococos'
                        if (ch=='c')
                        {
                                // if the string is 'cococos',
                                // we can say that we have already seen the 'coc' part of it:
                                seen=3;
                        }
                        else
                        {
                                // 'X' is not 'c', so reset
                                seen=0;
                        };
                }
                else
                        seen=0;      // reset
        }
        return len;                  // not found
};

It is capable of searching for the 'cocos' substring reading each character of the input string only once, and it is formally verified by CBMC.

Code like that is very hard to test (can you execute these functions with all 15-characters input strings?), but thanks to CBMC, we can be sure it's correct, or at least, equivalent to the simple 'bruteforce' version. I couldn't devise a correct version without it. In fact first versions were written in Python. I rewritten it to pure C so that I can verify them using CBMC.

All the files I used

Part II.


UPD: discussion at HN.


List of my other blog posts.

Yes, I know about these lousy Disqus ads. Please use adblocker. I would consider to subscribe to 'pro' version of Disqus if the signal/noise ratio in comments would be good enough.