Testing word counting code using KLEE and CBMC

The exercise from the famous "The C Programming Language" by Brian W. Kernighan, Dennis M. Ritchie:

#include <stdio.h>

#define IN  1 /* inside a word */
#define OUT 0 /* outside a word */

/* count lines, words, and characters in input */
int main()
{
    int c, nl, nw, nc, state;
    state = OUT;
    nl = nw = nc = 0;
    while ((c = getchar()) != EOF) {
        ++nc;
        if (c == '\n')
            ++nl;
        if (c == ' ' || c == '\n' || c == '\t')
            state = OUT;
        else if (state == OUT) {
            state = IN;
            ++nw;
        }
    }
    printf("%d %d %d\n", nl, nw, nc);
}

...

Exercise 1-11. How would you test the word count program? What kinds of input are most
likely to uncover bugs if there are any?

I've rewritten it a bit to accept data from buf rather than console:

#include <stdio.h>
#include <assert.h>

#define IN  1 /* inside a word */
#define OUT 0 /* outside a word */

#define BUFLEN 16

#define MY_GETCHAR (buf_idx==BUFLEN ? EOF : (int)buf[buf_idx++])

/* count lines, words, and characters in input */
int main()
{
    char buf[BUFLEN];
    int buf_idx=0;

    int c, nl, nw, nc, state;
    state = OUT;
    nl = nw = nc = 0;
    while ((c = MY_GETCHAR) != EOF) {
        ++nc;
        if (c == '\n')
            ++nl;
        if ((c == ' ') || (c == '\n') || (c == '\t'))
            state = OUT;
        else if (state == OUT) {
            state = IN;
            ++nw;
        }
    }

    if (nl==2 && nw==3 && nc==BUFLEN)
        assert (0);
}

Now I run:

$ cbmc --trace --function main 1.c

...

State 11 file 1.c function main line 14 thread 0
----------------------------------------------------
  buf={ '\n', '\n', '*', '\t', 'J', '\t', 0, 0, 0, 0, 0, 0, 0, 0, 0, ' ' } ({ 00001010, 00001010, 00101010, 00001001, 01001010, 00001001, 00000000, 00000000, 00000000, 00000000, 00000000, 00000000, 00000000, 00000000, 00000000, 00100000 })

...

As it turns out, a sequence of zero bytes is also considered a word. (Zero byte is considered as a character.) There are: two newlines, word, whitespace, word, whitespace, word, whitespace.

However, Linux's wc also consider zero byte as character:

% printf "\0\0foo\0\0" | wc
      0       1       7

Now we ask for one word in input:

...
    if (nl==0 && nw==1 && nc==BUFLEN)
        assert (0);
...

...
State 11 file 2.c function main line 14 thread 0
----------------------------------------------------
  buf={ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ' ' } ({ 00000000, 00000000, 00000000, 00000000, 00000000, 00000000, 00000000, 00000000, 00000000, 00000000, 00000000, 00000000, 00000000, 00000000, 00000000, 00100000 })
...

Interestingly, a word should ended with whitespace character.

However, the original Unix util can thread an isolated word as a word:

% echo -ne foobar | wc
      0       1       6

Now we ask for 8 words in 16-byte buffer:

...
    if (nl==0 && nw==8 && nc==BUFLEN)
        assert (0);
...

...
State 11 file 3.c function main line 14 thread 0
----------------------------------------------------
  buf={ 0, ' ', 0, 0, ' ', 0, '\t', 0, '\t', 0, '\t', 0, '\t', 0, '\t', 0 } ({ 00000000, 00100000, 00000000, 00000000, 00100000, 00000000, 00001001, 00000000, 00001001, 00000000, 00001001, 00000000, 00001001, 00000000, 00001001, 00000000 })
...

However, 9 words in 16-byte buffer is impossible. CBMC will finish with "VERIFICATION SUCCESSFUL", meaning, it doesn't find a way to trigger assert() function.

Also, the code modified for KLEE:

#include <stdio.h>
#include <assert.h>

#include "klee.h"

#define IN  1 /* inside a word */
#define OUT 0 /* outside a word */

#define BUFLEN 8

#define MY_GETCHAR (buf_idx==BUFLEN ? EOF : (int)buf[buf_idx++])

/* count lines, words, and characters in input */
int main()
{
    char buf[BUFLEN];
    klee_make_symbolic(&buf, sizeof buf, "buf");
    int buf_idx=0;

    int c, nl, nw, nc, state;
    state = OUT;
    nl = nw = nc = 0;
    while ((c = MY_GETCHAR) != EOF) {
        ++nc;
        if (c == '\n')
            ++nl;
        if ((c == ' ') || (c == '\n') || (c == '\t'))
            state = OUT;
        else if (state == OUT) {
            state = IN;
            ++nw;
        }
    }

    if (nl==2 && nw==3 && nc==BUFLEN)
        klee_assert(0);
}

And the result:

$ ktest-tool klee-last/test008031.ktest
ktest file : 'klee-last/test008031.ktest'
args       : ['1.bc']
num objects: 1
object 0: name: 'buf'
object 0: size: 8
object 0: data: b'\n\n\x00  \x00 \x00'
object 0: hex : 0x0a0a002020002000
...

One more thing to add: when I tried to make buf[] as a global array and getchar() function as function, not as macro, CBMC can't work with global arrays somehow. But KLEE can.

Also what about bruteforce? We can try newline, whitespace and zero byte characters for each character in 16-byte buffer, no problem. So the bruteforce is feasible here. But it will require more additional code.

More examples involving KLEE and CBMC.

(the post first published at 20251225.)

List of my other blog posts. Subscribe to my news feed,

If you noticed a typo/bug/error or have any suggestions, do not hesitate to drop me a note: my emails. Or use my zulip for feedback. Thanks in advance!

Also, among my services is writing examples-rich manuals and references. If you like my work and want something similar for your (commercial) product: contact me.

If you enjoy my work, you can support it on patreon.

Some time ago (before 24-Mar-2025) there was Disqus JS script for comments. I dropped it --- it was so motley, distracting, animated, with too much ads. I never liked it. Also, comments din't appeared correctly (Disqus was buggy). Also, my blog is too chamberlike --- not many people write comments here. So I decided to switch to the model I once had at least in 2020 --- send me your comments by email (don't forget to include URL to this blog post) and I will copy&paste it here manually.

Let's party like it's ~1993-1996, in this ultimate, radical and uncompromisingly primitive pre-web1.0-style blog and website. This website is best viewed under lynx/links/elinks/w3m.