## Boyer-Moore string search algorithm explanation and formal verification using CBMC

(The following text has been copypasted to the SAT/SMT by example book.)

Previously I did this with Knuth-Morris-Pratt algorithm: part I, part II, part III.

So again, let's try to reinvent string-search algorithm.

### Version 1

If a substring would be compared in reverse order, things may be different. For example, we search for a 'CAT' substring in 'HORSE TURTLE CAT' string.

With naive algorithm, we will first compare 'H' and 'C', we see they are unequal characters and we will advance substring one character ahead:

HORSE TURTLE CAT
CAT
^


Right, but what if we will start at the end of 'CAT' substring proceeding back? We first compare 'R' with 'T' (last character of 'CAT') and we see that they are not equal. We could advance substring one character ahead, but... Can't we see that 'R' character in the middle of 'HORSE' part is absent in the 'CAT' substring? Can't this tell us something useful? Yes -- since 'R' is not in 'CAT', we can advance 'CAT' substring more, because we know that this will fail for sure:

HORSE TURTLE CAT
CAT
^


This too:

HORSE TURTLE CAT
CAT
^


So if we see a character not in 'CAT', we can advance 3 characters ahead:

HORSE TURTLE CAT
CAT
^


But this will fail too, also, the space (' ') character is absent in 'CAT' too. OK, we advance 3 characters ahead:

HORSE TURTLE CAT
CAT
^


HORSE TURTLE CAT
CAT
^


HORSE TURTLE CAT
CAT
^


This is important. 'A' != 'T', but 'A' is in the 'CAT' substring, so this time we advance only 1 character ahead, as in naive search algorithm:

HORSE TURTLE CAT
CAT
^


Here we will compare 3 characters and return success.

All this is way faster than naive string search algorithm, and is a strong competitor with Knuth-Morris-Pratt algorithm.

This is the source code of a function that is hardcoded to search for 'cat' substring, nothing else:

unsigned search_cat_BM_v1 (char *s, unsigned len)
{
if (len<3)
unsigned skip;
for (unsigned i=0; i<len-2; )
{
skip=1; // default skip
if (char3=='t')
{
if (char2=='a')
{
if (char1=='c')
return i; // found
else
{
// 1st character isn't 'c'
if ((char1!='c') && (char1!='a') && (char1!='t'))
skip=3;
}
}
else
{
// 2nd character isn't 'a'
if ((char2!='c') && (char2!='a') && (char2!='t'))
skip=3;
}
}
else
{
// 3rd character isn't 't'
if ((char3!='c') && (char3!='a') && (char3!='t'))
skip=3;
}
i=i+skip;
}
};


As compared with a naive brute-force string algorithm, it shows a significant advantage:

** cacat
** cac cat
** cac cac cat
** horse turtle cat
** horse cat turtle


### Version 2

Now the next observation. If we search for 'CAT' and 'CACAT', we see that 'C' != 'T', but 'C' is in the 'CAT' substring, so we would advance by one character.

CACAT
CAT
^


But... we can also align substring so that the first 'C' character in 'CAT' would match the next 'C' in 'CACAT', by advancing for 2 characters ahead:

CACAT
CAT
^


We add a rule: "if a mismatched character in string exist in substring, align substring so that they would match" or, in other words, "if 'C' in string isn't equal to the last character in 'CAT' substring, advance by 2 characters":

                if (char3=='t')
{
...
}
else
{
// 3rd character isn't 't'
if ((char3!='c') && (char3!='a') && (char3!='t'))
skip=3;
}
i=i+skip;


It may perform even better on some strings:

** cacat                        ** cacat


As of aligning 'A' character in the middle of 'CAT' substring -- we wouldn't bother, because advancing substring by one character it's the same as is done by default.

Again, our algorithm is hardcoded only for 'CAT'.

And this is the universal algorithm from the R.Sedgewick's book, the first function construct a table of 'skips', the second does the actual search (Java code):

    public BoyerMoore(String pat) {
this.R = 256;
this.pat = pat;

// position of rightmost occurrence of c in the pattern
right = new int[R];
for (int c = 0; c < R; c++)
right[c] = -1;
for (int j = 0; j < pat.length(); j++)
right[pat.charAt(j)] = j;
}
...
public int search(String txt) {
int m = pat.length();
int n = txt.length();
int skip;
for (int i = 0; i <= n - m; i += skip) {
skip = 0;
for (int j = m-1; j >= 0; j--) {
if (pat.charAt(j) != txt.charAt(i+j)) {
skip = Math.max(1, j - right[txt.charAt(i+j)]);
break;
}
}
if (skip == 0) return i;    // found
}


( src )

And what if we search for 'COCOS' substring? What if characters are repeating within search pattern? Well, we will count only rightmost characters, ignoring two first 'CO'.

### Formal verification using CBMC

I did that already with my homebrew Knuth-Morris-Pratt algorithm.

But this time, CMBC have a hard time. It can't finish unwinding:

...
Unwinding loop search_cat_BM_v1.0 iteration 42 file BM_cat_v1.c line 36 function search_cat_BM_v1 thread 0
Unwinding loop search_cat_BM_v1.0 iteration 43 file BM_cat_v1.c line 36 function search_cat_BM_v1 thread 0
Unwinding loop search_cat_BM_v1.0 iteration 44 file BM_cat_v1.c line 36 function search_cat_BM_v1 thread 0
Unwinding loop search_cat_BM_v1.0 iteration 45 file BM_cat_v1.c line 36 function search_cat_BM_v1 thread 0
Unwinding loop search_cat_BM_v1.0 iteration 46 file BM_cat_v1.c line 36 function search_cat_BM_v1 thread 0
Unwinding loop search_cat_BM_v1.0 iteration 47 file BM_cat_v1.c line 36 function search_cat_BM_v1 thread 0
...


I think (not sure) CBMC have a problem with understanding loops like these:

        for (unsigned i=0; i<len-2; )
{
skip=1;
...
{
if ((char3!='c') && (char3!='a') && (char3!='t'))
skip=3;
}
i=i+skip;
}


In other words, where a loop's iterator is modified inside a loop body.

I had to unroll the loop:

unsigned search_cat_brute (char *s, unsigned len)
{
if (len<3)
for (unsigned i=0; i<len-2; i++)
{
if (s[i]=='c' && s[i+1]=='a' && s[i+2]=='t')
return i;
};
};

unsigned loop_body (char *s, unsigned *i_ptr)
{
unsigned i=*i_ptr; // get i
unsigned skip=1; // default skip
if (char3=='t')
{
if (char2=='a')
{
if (char1=='c')
{
return 1; // found
}
else
{
// 1st character isn't 'c'
if ((char1!='c') && (char1!='a') && (char1!='t'))
skip=3;
}
}
else
{
// 2nd character isn't 'a'
if ((char2!='c') && (char2!='a') && (char2!='t'))
skip=3;
}
}
else
{
// 3rd character isn't 't'
if ((char3!='c') && (char3!='a') && (char3!='t'))
skip=3;
if (char3=='c')
skip=2;
}
i=i+skip;
*i_ptr=i; // store back
return 0;
};

unsigned search_cat_unrolled_v2 (char *s, unsigned len)
{
if (len<3)

unsigned i=0;

// for len=[3..15]
if (loop_body(s, &i)) return i; if (i>=(len-2)) return len;
if (loop_body(s, &i)) return i; if (i>=(len-2)) return len;
if (loop_body(s, &i)) return i; if (i>=(len-2)) return len;
if (loop_body(s, &i)) return i; if (i>=(len-2)) return len;
if (loop_body(s, &i)) return i; if (i>=(len-2)) return len;
if (loop_body(s, &i)) return i; if (i>=(len-2)) return len;
if (loop_body(s, &i)) return i; if (i>=(len-2)) return len;
if (loop_body(s, &i)) return i; if (i>=(len-2)) return len;
if (loop_body(s, &i)) return i; if (i>=(len-2)) return len;
if (loop_body(s, &i)) return i; if (i>=(len-2)) return len;
if (loop_body(s, &i)) return i; if (i>=(len-2)) return len;
if (loop_body(s, &i)) return i; if (i>=(len-2)) return len;
if (loop_body(s, &i)) return i; if (i>=(len-2)) return len;

};

#ifdef CBMC
void check()
{
unsigned len=LEN;
char s[len];
__CPROVER_assert (search_cat_brute(s, len)==search_cat_unrolled_v2(s, len), "assert");
};
#endif


So crude, but worked!

% time cbmc --trace --function check -DLEN=14 -DCBMC BM_cat_1_v2.c

CBMC version 5.10 (cbmc-5.10) 64-bit x86_64 linux
Parsing BM_cat_1_v2.c
Converting
Type-checking BM_cat_1_v2
Generating GOTO Program
Removal of function pointers and virtual functions
Generic Property Instrumentation
Running with 8 object bits, 56 offset bits (default)
Starting Bounded Model Checking
Unwinding loop search_cat_brute.0 iteration 1 file BM_cat_1_v2.c line 11 function search_cat_brute thread 0
Unwinding loop search_cat_brute.0 iteration 2 file BM_cat_1_v2.c line 11 function search_cat_brute thread 0
Unwinding loop search_cat_brute.0 iteration 3 file BM_cat_1_v2.c line 11 function search_cat_brute thread 0
Unwinding loop search_cat_brute.0 iteration 4 file BM_cat_1_v2.c line 11 function search_cat_brute thread 0
Unwinding loop search_cat_brute.0 iteration 5 file BM_cat_1_v2.c line 11 function search_cat_brute thread 0
Unwinding loop search_cat_brute.0 iteration 6 file BM_cat_1_v2.c line 11 function search_cat_brute thread 0
Unwinding loop search_cat_brute.0 iteration 7 file BM_cat_1_v2.c line 11 function search_cat_brute thread 0
Unwinding loop search_cat_brute.0 iteration 8 file BM_cat_1_v2.c line 11 function search_cat_brute thread 0
Unwinding loop search_cat_brute.0 iteration 9 file BM_cat_1_v2.c line 11 function search_cat_brute thread 0
Unwinding loop search_cat_brute.0 iteration 10 file BM_cat_1_v2.c line 11 function search_cat_brute thread 0
Unwinding loop search_cat_brute.0 iteration 11 file BM_cat_1_v2.c line 11 function search_cat_brute thread 0
Unwinding loop search_cat_brute.0 iteration 12 file BM_cat_1_v2.c line 11 function search_cat_brute thread 0
size of program expression: 2118 steps
simple slicing removed 2 assignments
Generated 1 VCC(s), 1 remaining after simplification
Passing problem to propositional reduction
converting SSA
Running propositional reduction
Post-processing
Solving with MiniSAT 2.2.1 with simplifier
29262 variables, 75920 clauses
SAT checker: instance is UNSATISFIABLE
Runtime decision procedure: 0.36729s

** Results:
[check.assertion.1] assert: SUCCESS

** 0 of 1 failed (1 iteration)
VERIFICATION SUCCESSFUL


Now important note.

If you modify these lines so that this conditional expression would compare with other characters...

                        {
// 1st character isn't 'c'
if ((char1!='c') && (char1!='a') && (char1!='t'))
skip=3;
}


CBMC will still verify this algorithm and prove its correctness, because a bug here wouldn't make the algorithm incorrect. It will degrade its performance to a level of naive string search algorithm, as if 'skip=3' here would be replaced with 'skip=1'. It will still work correctly, but slower.

### But this algorithm is not finished.

I reimplemented it from R.Sedgewick book.

And he states:

Program BoyerMoore.java implements the bad-character rule part of the Boyer-Moore algorithm. It does not implement the strong good suffix rule.

It coincides with 'delta1' rule from the original Boyer-Moore paper. But there is also a 'delta2' rule, which is just another story for another blog post.

### Files

Here

UPD2: Follow-up

Yes, I know about these lousy Disqus ads. Please use adblocker. I would consider to subscribe to 'pro' version of Disqus if the signal/noise ratio in comments would be good enough.