Boyer-Moore string search algorithm explanation and formal verification using CBMC

Previously I did this with Knuth-Morris-Pratt algorithm: part I, part II, part III.

So again, let's try to reinvent string-search algorithm.

Version 1

If a substring would be compared in reverse order, things may be different. For example, we search for a 'CAT' substring in 'HORSE TURTLE CAT' string.

With naive algorithm, we will first compare 'H' and 'C', we see they are unequal characters and we will advance substring one character ahead:

HORSE TURTLE CAT
CAT
  ^

Right, but what if we will start at the end of 'CAT' substring proceeding back? We first compare 'R' with 'T' (last character of 'CAT') and we see that they are not equal. We could advance substring one character ahead, but... Can't we see that 'R' character in the middle of 'HORSE' part is absent in the 'CAT' substring? Can't this tell us something useful? Yes -- since 'R' is not in 'CAT', we can advance 'CAT' substring more, because we know that this will fail for sure:

HORSE TURTLE CAT
 CAT
   ^

This too:

HORSE TURTLE CAT
  CAT
    ^

So if we see a character not in 'CAT', we can advance 3 characters ahead:

HORSE TURTLE CAT
   CAT
     ^

But this will fail too, also, the space (' ') character is absent in 'CAT' too. OK, we advance 3 characters ahead:

HORSE TURTLE CAT
      CAT
        ^

No, 'R' is not in 'CAT', advance again 3 characters ahead:

HORSE TURTLE CAT
         CAT
           ^

Failing again... Adding 3 characters:

HORSE TURTLE CAT
            CAT
              ^

This is important. 'A' != 'T', but 'A' is in the 'CAT' substring, so this time we advance only 1 character ahead, as in naive search algorithm:

HORSE TURTLE CAT
             CAT
               ^

Here we will compare 3 characters and return success.

All this is way faster than naive string search algorith, and is a strong competitor with Knuth-Morris-Pratt algorithm.

This is the source code of a function that is hardcoded to search for 'cat' substring, nothing else:

unsigned search_cat_BM_v1 (char *s, unsigned len)
{
        if (len<3)
                return len; // not found
        unsigned skip;
        for (unsigned i=0; i<len-2; )
        {
                skip=1; // default skip
                char char3=s[i+2]; // read operation
                if (char3=='t')
                {
                        char char2=s[i+1]; // read operation
                        if (char2=='a')
                        {
                                char char1=s[i]; // read operation
                                if (char1=='c')
                                        return i; // found
                                else
                                {
                                        // 1st character isn't 'c'
                                        if ((char1!='c') && (char1!='a') && (char1!='t'))
                                                skip=3;
                                }
                        }
                        else
                        {
                                // 2nd character isn't 'a'
                                if ((char2!='c') && (char2!='a') && (char2!='t'))
                                        skip=3;
                        }
                }
                else
                {
                        // 3rd character isn't 't'
                        if ((char3!='c') && (char3!='a') && (char3!='t'))
                                skip=3;
                }
                i=i+skip;
        }
        return len; // not found
};

As compared with a naitve brute-force string algorithm, it shows a significant advantage:

** cacat
search_cat_brute read_ops=7
search_cat_BM_v1 read_ops=5
** cac cat
search_cat_brute read_ops=10
search_cat_BM_v1 read_ops=5
** cac cac cat
search_cat_brute read_ops=17
search_cat_BM_v1 read_ops=7
** horse turtle cat
search_cat_brute read_ops=16
search_cat_BM_v1 read_ops=8
** horse cat turtle
search_cat_brute read_ops=9
search_cat_BM_v1 read_ops=5

Version 2

Now the next observation. If we search for 'CAT' and 'CACAT', we see that 'C' != 'T', but 'C' is in the 'CAT' substring, so we would advance by one character.

CACAT
CAT
  ^

But... we can also align substring so that the first 'C' character in 'CAT' would match the next 'C' in 'CACAT', by advancing for 2 characters ahead:

CACAT
  CAT
    ^

We add a rule: "if a mismatched character in string exist in substring, align substring so that they would match" or, in other words, "if 'C' in string isn't equal to the last character in 'CAT' substring, advance by 2 characters":

                if (char3=='t')
                {
                        ...
                }
                else
                {
                        // 3rd character isn't 't'
                        if ((char3!='c') && (char3!='a') && (char3!='t'))
                                skip=3;
                        if (char3=='c') // added
                                skip=2;         // added
                }
                i=i+skip;

It may perform even better on some strings:

** cacat                        ** cacat
search_cat_brute read_ops=7     search_cat_brute read_ops=7
search_cat_BM_v1 read_ops=5     search_cat_BM_v2 read_ops=4

As of aligning 'A' character in the middle of 'CAT' substring -- we wouldn't bother, because advancing substring by one character it's the same as is done by default.

Again, our algorithm is hardcoded only for 'CAT'.

And this is the universal algorithm from the R.Sedgewick's book, the first function construct a table of 'skips', the second does the actual search (Java code):

    public BoyerMoore(String pat) {
        this.R = 256;
        this.pat = pat;

        // position of rightmost occurrence of c in the pattern
        right = new int[R];
        for (int c = 0; c < R; c++)
            right[c] = -1;
        for (int j = 0; j < pat.length(); j++)
            right[pat.charAt(j)] = j;
    }
...
    public int search(String txt) {
        int m = pat.length();
        int n = txt.length();
        int skip;
        for (int i = 0; i <= n - m; i += skip) {
            skip = 0;
            for (int j = m-1; j >= 0; j--) {
                if (pat.charAt(j) != txt.charAt(i+j)) {
                    skip = Math.max(1, j - right[txt.charAt(i+j)]);
                    break;
                }
            }
            if (skip == 0) return i;    // found
        }
        return n;                       // not found

( src )

And what if we search for 'COCOS' substring? What if characters are repeating within search pattern? Well, we will count only rightmost characters, ignoring two first 'CO'.

Formal verification using CBMC

I did that already with my homebrew Knith-Morris-Pratt algorithm.

But this time, CMBC have a hard time. It can't finish unwinding:

...
Unwinding loop search_cat_BM_v1.0 iteration 42 file BM_cat_v1.c line 36 function search_cat_BM_v1 thread 0
Unwinding loop search_cat_BM_v1.0 iteration 43 file BM_cat_v1.c line 36 function search_cat_BM_v1 thread 0
Unwinding loop search_cat_BM_v1.0 iteration 44 file BM_cat_v1.c line 36 function search_cat_BM_v1 thread 0
Unwinding loop search_cat_BM_v1.0 iteration 45 file BM_cat_v1.c line 36 function search_cat_BM_v1 thread 0
Unwinding loop search_cat_BM_v1.0 iteration 46 file BM_cat_v1.c line 36 function search_cat_BM_v1 thread 0
Unwinding loop search_cat_BM_v1.0 iteration 47 file BM_cat_v1.c line 36 function search_cat_BM_v1 thread 0
...

I think (not sure) CBMC have a problem with understanding loops like these:

        for (unsigned i=0; i<len-2; )
        {
                skip=1;
                ...
                {
                        if ((char3!='c') && (char3!='a') && (char3!='t'))
                                skip=3;
                }
                i=i+skip;
        }

In other words, where a loop's iterator is modified inside a loop body.

I had to unroll the loop:

unsigned search_cat_brute (char *s, unsigned len)
{
        if (len<3)
                return len; // not found
        for (unsigned i=0; i<len-2; i++)
        {
                if (s[i]=='c' && s[i+1]=='a' && s[i+2]=='t')
                        return i;
        };
        return len; // not found
};

unsigned loop_body (char *s, unsigned *i_ptr)
{
        unsigned i=*i_ptr; // get i
        unsigned skip=1; // default skip
        char char3=s[i+2]; // read operation
        if (char3=='t')
        {
                char char2=s[i+1]; // read operation
                if (char2=='a')
                {
                        char char1=s[i]; // read operation
                        if (char1=='c')
                        {
                                return 1; // found
                        }
                        else
                        {
                                // 1st character isn't 'c'
                                if ((char1!='c') && (char1!='a') && (char1!='t'))
                                        skip=3;
                        }
                }
                else
                {
                        // 2nd character isn't 'a'
                        if ((char2!='c') && (char2!='a') && (char2!='t'))
                                skip=3;
                }
        }
        else
        {
                // 3rd character isn't 't'
                if ((char3!='c') && (char3!='a') && (char3!='t'))
                        skip=3;
                if (char3=='c')
                        skip=2;
        }
        i=i+skip;
        *i_ptr=i; // store back
        return 0;
};

unsigned search_cat_unrolled_v2 (char *s, unsigned len)
{
        if (len<3)
                return len; // not found

        unsigned i=0;

        // for len=[3..15]
        if (loop_body(s, &i)) return i; if (i>=(len-2)) return len;
        if (loop_body(s, &i)) return i; if (i>=(len-2)) return len;
        if (loop_body(s, &i)) return i; if (i>=(len-2)) return len;
        if (loop_body(s, &i)) return i; if (i>=(len-2)) return len;
        if (loop_body(s, &i)) return i; if (i>=(len-2)) return len;
        if (loop_body(s, &i)) return i; if (i>=(len-2)) return len;
        if (loop_body(s, &i)) return i; if (i>=(len-2)) return len;
        if (loop_body(s, &i)) return i; if (i>=(len-2)) return len;
        if (loop_body(s, &i)) return i; if (i>=(len-2)) return len;
        if (loop_body(s, &i)) return i; if (i>=(len-2)) return len;
        if (loop_body(s, &i)) return i; if (i>=(len-2)) return len;
        if (loop_body(s, &i)) return i; if (i>=(len-2)) return len;
        if (loop_body(s, &i)) return i; if (i>=(len-2)) return len;

        return len; // not found
};

#ifdef CBMC
void check()
{
        unsigned len=LEN;
        char s[len];
        __CPROVER_assert (search_cat_brute(s, len)==search_cat_unrolled_v2(s, len), "assert");
};
#endif

So crude, but worked!

% time cbmc --trace --function check -DLEN=14 -DCBMC BM_cat_1_v2.c

CBMC version 5.10 (cbmc-5.10) 64-bit x86_64 linux
Parsing BM_cat_1_v2.c
Converting
Type-checking BM_cat_1_v2
Generating GOTO Program
Adding CPROVER library (x86_64)
Removal of function pointers and virtual functions
Generic Property Instrumentation
Running with 8 object bits, 56 offset bits (default)
Starting Bounded Model Checking
Unwinding loop search_cat_brute.0 iteration 1 file BM_cat_1_v2.c line 11 function search_cat_brute thread 0
Unwinding loop search_cat_brute.0 iteration 2 file BM_cat_1_v2.c line 11 function search_cat_brute thread 0
Unwinding loop search_cat_brute.0 iteration 3 file BM_cat_1_v2.c line 11 function search_cat_brute thread 0
Unwinding loop search_cat_brute.0 iteration 4 file BM_cat_1_v2.c line 11 function search_cat_brute thread 0
Unwinding loop search_cat_brute.0 iteration 5 file BM_cat_1_v2.c line 11 function search_cat_brute thread 0
Unwinding loop search_cat_brute.0 iteration 6 file BM_cat_1_v2.c line 11 function search_cat_brute thread 0
Unwinding loop search_cat_brute.0 iteration 7 file BM_cat_1_v2.c line 11 function search_cat_brute thread 0
Unwinding loop search_cat_brute.0 iteration 8 file BM_cat_1_v2.c line 11 function search_cat_brute thread 0
Unwinding loop search_cat_brute.0 iteration 9 file BM_cat_1_v2.c line 11 function search_cat_brute thread 0
Unwinding loop search_cat_brute.0 iteration 10 file BM_cat_1_v2.c line 11 function search_cat_brute thread 0
Unwinding loop search_cat_brute.0 iteration 11 file BM_cat_1_v2.c line 11 function search_cat_brute thread 0
Unwinding loop search_cat_brute.0 iteration 12 file BM_cat_1_v2.c line 11 function search_cat_brute thread 0
size of program expression: 2118 steps
simple slicing removed 2 assignments
Generated 1 VCC(s), 1 remaining after simplification
Passing problem to propositional reduction
converting SSA
Running propositional reduction
Post-processing
Solving with MiniSAT 2.2.1 with simplifier
29262 variables, 75920 clauses
SAT checker: instance is UNSATISFIABLE
Runtime decision procedure: 0.36729s

** Results:
[check.assertion.1] assert: SUCCESS

** 0 of 1 failed (1 iteration)
VERIFICATION SUCCESSFUL

Now important note.

If you modify these lines so that this conditional expression would compare with other characters...

                        {
                                // 1st character isn't 'c'
                                if ((char1!='c') && (char1!='a') && (char1!='t'))
                                        skip=3;
                        }

CBMC will still verify this algorithm and prove its correctness, because a bug here wouldn't make the algorithm incorrect. It will degrade its performance to a level of naive string search algorithm, as if 'skip=3' here would be replaced with 'skip=1'. It will still work correctly, but slower.

But this algorithm is not finished.

I reimplemented it from R.Sedgewick book.

And he states:

Program BoyerMoore.java implements the bad-character rule part of the Boyer-Moore algorithm. It does not implement the strong good suffix rule.

It coincides with 'delta1' rule from the original Boyer-Moore paper. But there is also a 'delta2' rule, which is just another story for another blog post.

Files

Here


UPD: https://news.ycombinator.com/item?id=26900640

UPD2: Follow-up


Please drop me email about bug(s) and/or suggestion(s): blog@yurichev.com. List of other blog posts. BTW, I'm teaching. Follow me in social networks: Twitter, Telegram, GitHub, Discord, Facebook.