(The following text has been copypasted to the SAT/SMT by example book.)
Imagine you want to find an "ok" substring within a string. You would do:
unsigned search_ok (char *s, unsigned len) { if (len<2) return len; // not found for (unsigned i=0; i<len-1; i++) { if (s[i]=='o' && s[i+1]=='k') return i; // found }; return len; // not found };
You see, there are two memory accesses per one character of the input string (at average). Hence, the total number of all memory accesses can be len*2 at worst.
Is it possible to reduce that number? Yes. In the following example, we have only one single memory access per character:
unsigned search_ok (char *s, unsigned len) { if (len<2) return len; // not found bool seen_o=false; for (unsigned i=0; i<len; i++) { char ch=s[i]; // this is single read operation if (ch=='o') seen_o=true; else if (seen_o && ch=='k') return i-1; // found else seen_o=false; // reset }; return len; // not found };
Can we extend that method to any 3-character strings? Let's write a function that searches for the 'eel' string. (For the non-English speakers: this is indeed a word. Some kind of fish.)
We will use the 'seen' variable, reflecting, how many characters of the sought string we've already seen:
unsigned search_eel (char *s, unsigned len) { if (len<3) return len; // not found unsigned seen=0; for (unsigned i=0; i<len; i++) { char ch=s[i]; // this is single read operation if (seen==0 && ch=='e') seen=1; else if (seen==1 && ch=='e') seen=2; else if (seen==2 && ch=='l') return i-2; // found else seen=0; // reset } return len; // not found };
But this implementation has a bug. To find it, i'll use CBMC, the excellent tool, that can verify if a C function is equivalent to another function. (To read more about verification: SAT/SMT by Example.)
This is like Test-driven development (TDD).
I'll add a 'reference' (though, slow) function, but always correct:
unsigned search_eel_brute (char *s, unsigned len) { if (len<3) return len; // not found for (unsigned i=0; i<len-2; i++) { if (s[i]=='e' && s[i+1]=='e' && s[i+2]=='l') return i; }; return len; // not found }; ... void check() { unsigned len=LEN; char s[len]; __CPROVER_assert (search_eel_brute(s, len)==search_eel(s, len), "assert"); };
% cbmc --trace --function check -DLEN=4 kmp_eel2.c ... ** Results: [check.assertion.1] assert: FAILURE Trace for check.assertion.1: ... State 21 file kmp_eel2.c line 41 function check thread 0 ---------------------------------------------------- s={ 'e', 'e', 'e', 'l' } ({ 01100101, 01100101, 01100101, 01101100 }) ...
It failed for the string "eeel". After some thinking, we can find a problem. If a third character isn't 'l', but 'e', we are in the middle of a long string of 'e' characters. So if seen==2 and the input character isn't 'l', but 'e', we shouldn't advance the 'seen' variable:
... if (seen==0 && ch=='e') seen=1; else if (seen==1 && ch=='e') seen=2; else if (seen==2 && ch=='l') return i-2; // found else if (seen==2 && ch=='e') // fix seen=2; // fix else seen=0; // reset ...
Now the verification is successfull up to strings of length 15:
#!/bin/bash for i in $( seq 0 15 ); do #echo $i cbmc --trace --function check -DLEN=$i kmp_eel2.c done
The "cocos" substring is much more problematic. But fixing it with the help of CBMC is a great programming exercise.
unsigned search_cocos_brute (char *s, unsigned len) { if (len<5) return len; // not found for (unsigned i=0; i<len-4; i++) { if (s[i]=='c' && s[i+1]=='o' && s[i+2]=='c' && s[i+3]=='o' && s[i+4]=='s') return i; // found }; return len; // not found }; unsigned search_cocos_naive (char *s, unsigned len) { if (len<5) return len; // not found unsigned seen=0; for (unsigned i=0; i<len; i++) { char ch=s[i]; // this is single read operation if (seen==0 && ch=='c') seen=1; else if (seen==1 && ch=='o') seen=2; else if (seen==2 && ch=='c') seen=3; else if (seen==3 && ch=='o') seen=4; else if (seen==4 && ch=='s') return i-4; // found else seen=0; // reset } return len; // not found }; void check() { unsigned len=LEN; char s[len]; __CPROVER_assert (search_cocos_brute(s, len)==search_cocos_naive(s, len), "assert"); };
This is the first bug:
% cbmc --trace --function check -DLEN=6 kmp_cocos.c ... ** Results: [check.assertion.1] assert: FAILURE Trace for check.assertion.1: ... State 21 file kmp_cocos.c line 47 function check thread 0 ---------------------------------------------------- s={ 'c', 'c', 'o', 'c', 'o', 's' } ({ 01100011, 01100011, 01101111, 01100011, 01101111, 01110011 }) ...
A test failed for the "ccocos" string.
Here is the fix: if the first 'c' is repeating, we shouldn't advance the 'seen' variable:
if (seen==0 && ch=='c') seen=1; else if (seen==1 && ch=='o') seen=2; else if (seen==1 && ch!='o') { // we can be here if the input is 'ccocos' if (ch=='c') seen=1; else seen=0; } else if (seen==2 && ch=='c') seen=3; else if (seen==3 && ch=='o') seen=4; else if (seen==4 && ch=='s') return i-4; // found else seen=0; // reset
CBMC can verify this function for all 6-character strings, OK. But it can find problematic 7-character string:
% cbmc --trace --function check -DLEN=7 kmp_cocos.c ... ** Results: [check.assertion.1] assert: FAILURE Trace for check.assertion.1: ... State 21 file kmp_cocos.c line 79 function check thread 0 ---------------------------------------------------- s={ 'c', 'o', 'c', 'o', 'c', 'o', 's' } ({ 01100011, 01101111, 01100011, 01101111, 01100011, 01101111, 01110011 }) ...
That is, "cococos".
And this is my fix:
... else if (seen==4 && ch=='s') return i-4; // found else if (seen==4 && ch!='s') { // the input string is 'cocoX' where X is not 's' // (current state of ch='X') // but 'X' could be 'c' if the input string is 'cococos' if (ch=='c') { // if the string is 'cococos', // we can say that we have already seen the 'coc' part of it: seen=3; } else { // 'X' is not 'c', so reset seen=0; }; } else seen=0; // reset ...
Now all 7-characters strings can be tested without a fail. But not 8-character ones:
% cbmc --trace --function check -DLEN=8 kmp_cocos.c ... ** Results: [check.assertion.1] assert: FAILURE Trace for check.assertion.1: ... State 21 file kmp_cocos.c line 96 function check thread 0 ---------------------------------------------------- s={ 'c', 'o', 'c', 'c', 'o', 'c', 'o', 's' } ({ 01100011, 01101111, 01100011, 01100011, 01101111, 01100011, 01101111, 01110011 }) ...
It fails with "coccocos". We have to add another check for the repeating second 'c' character.
... else if (seen==3 && ch=='o') seen=4; else if (seen==3 && ch!='o') { // if input='coccocos' if (ch=='c') seen=1; else seen=0; // reset } else if (seen==4 && ch=='s') ...
Now CBMC can check it all up to 15-character strings. The whole fixed function is:
unsigned search_cocos_fixed (char *s, unsigned len) { if (len<5) return len; // not found unsigned seen=0; for (unsigned i=0; i<len; i++) { char ch=s[i]; // this is single read operation if (seen==0 && ch=='c') seen=1; else if (seen==1 && ch=='o') seen=2; else if (seen==1 && ch!='o') { // we can be here if the input is 'ccocos' if (ch=='c') seen=1; else seen=0; } else if (seen==2 && ch=='c') seen=3; else if (seen==3 && ch=='o') seen=4; else if (seen==3 && ch!='o') { // if input='coccocos' if (ch=='c') seen=1; else seen=0; // reset } else if (seen==4 && ch=='s') return i-4; // found else if (seen==4 && ch!='s') { // the input string is 'cocoX' where X is not 's' // (current state of ch='X') // but 'X' could be 'c' if the input string is 'cococos' if (ch=='c') { // if the string is 'cococos', // we can say that we have already seen the 'coc' part of it: seen=3; } else { // 'X' is not 'c', so reset seen=0; }; } else seen=0; // reset } return len; // not found };
It is capable of searching for the 'cocos' substring reading each character of the input string only once, and it is formally verified by CBMC.
Code like that is very hard to test (can you execute these functions with all 15-characters input strings?), but thanks to CBMC, we can be sure it's correct, or at least, equivalent to the simple 'bruteforce' version. I couldn't devise a correct version without it. In fact first versions were written in Python. I rewritten it to pure C so that I can verify them using CBMC.
UPD: discussion at HN.
Yes, I know about these lousy Disqus ads. Please use adblocker. I would consider to subscribe to 'pro' version of Disqus if the signal/noise ratio in comments would be good enough.