[Python][MaxSAT] Making smallest possible test suite

I wrote about it earlier, how to a make smallest test suite using Z3. Click here, Ctrl-F "Making smallest possible test suite using Z3".

But that example was made up. And now I can do it better.

Once upon a time, I wrote my own x86/x64 disassembler. And a code like that is fragile and prone to bugs -- any unnoticed typo can ruin a hour or two. How to test it? Just to be sure, I put as much to tests as possible. I enumerated all possible 2 bytes opcodes, etc. At some point, there were 12531 tests, like:

...
disas_test1(Fuzzy_True, (const byte*)"\xF3\x0F\x58\xF1", 3, "ADDSS XMM6, XMM1");
disas_test1(Fuzzy_True, (const byte*)"\xF3\x0F\x59\xC0", 3, "MULSS XMM0, XMM0");
disas_test1(Fuzzy_True, (const byte*)"\xF3\x0F\x5C\xD7", 3, "SUBSS XMM2, XMM7");
disas_test1(Fuzzy_True, (const byte*)"\xF3\x0F\x5E\xF8", 3, "DIVSS XMM7, XMM0");
disas_test1(Fuzzy_True, (const byte*)"\xF3\x42\x0F\x11\x74\x1D\x98", 3, "MOVSS [RBP+R11-68h], XMM6");
disas_test1(Fuzzy_True, (const byte*)"\xF3\x48\x0F\x2A\xC0", 3, "CVTSI2SS XMM0, RAX");
disas_test1(Fuzzy_True, (const byte*)"\xF3\x48\xAB", 3, "REP STOSQ");
disas_test1(Fuzzy_True, (const byte*)"\xF7\xED", 0x300, "IMUL EBP");
disas_test2_1op(Fuzzy_True, (const byte*)"\xFF\xD0", 0x300, "CALL RAX", 64);
disas_test2_2op(Fuzzy_True, (const byte*)"\x00\x4C\x8B\x94", 0x300, "ADD [RBX+RCX*4-6ch], CL", 8, 8);
disas_test2_2op(Fuzzy_True, (const byte*)"\x0D\x01\xFF\xFF\xFF", 0x48DB13, "OR EAX, 0FFFFFF01h", 32, 32);
disas_test2_2op(Fuzzy_True, (const byte*)"\x0F\xB7\x0D\x4F\x02\x5B\x06", 0x436E96, "MOVZX ECX, [69e70ech]", 32, 16);
disas_test2_2op(Fuzzy_True, (const byte*)"\x0F\xB7\x0D\xDE\x01\x5B\x06", 0x436F07, "MOVZX ECX, [69e70ech]", 32, 16);
...

( https://raw.githubusercontent.com/DennisYurichev/x86_disasm/80f2c7b7137ddce8c9a1168051865ae595a09d56/tests.h )

Basically, these are pairs: opcode + disassembled output.

I modified my test code so that a test number can be set in arguments, and then preprended this to each test:

...
if (test_all || line==__LINE__) disas_test1(Fuzzy_True, (const byte*)"\xF3\x0F\x58\xF1", 3, "ADDSS XMM6, XMM1");
if (test_all || line==__LINE__) disas_test1(Fuzzy_True, (const byte*)"\xF3\x0F\x59\xC0", 3, "MULSS XMM0, XMM0");
if (test_all || line==__LINE__) disas_test1(Fuzzy_True, (const byte*)"\xF3\x0F\x5C\xD7", 3, "SUBSS XMM2, XMM7");
if (test_all || line==__LINE__) disas_test1(Fuzzy_True, (const byte*)"\xF3\x0F\x5E\xF8", 3, "DIVSS XMM7, XMM0");
if (test_all || line==__LINE__) disas_test1(Fuzzy_True, (const byte*)"\xF3\x42\x0F\x11\x74\x1D\x98", 3, "MOVSS [RBP+R11-68h], XMM6");
if (test_all || line==__LINE__) disas_test1(Fuzzy_True, (const byte*)"\xF3\x48\x0F\x2A\xC0", 3, "CVTSI2SS XMM0, RAX");
if (test_all || line==__LINE__) disas_test1(Fuzzy_True, (const byte*)"\xF3\x48\xAB", 3, "REP STOSQ");
if (test_all || line==__LINE__) disas_test1(Fuzzy_True, (const byte*)"\xF7\xED", 0x300, "IMUL EBP");
if (test_all || line==__LINE__) disas_test2_1op(Fuzzy_True, (const byte*)"\xFF\xD0", 0x300, "CALL RAX", 64);
if (test_all || line==__LINE__) disas_test2_2op(Fuzzy_True, (const byte*)"\x00\x4C\x8B\x94", 0x300, "ADD [RBX+RCX*4-6ch], CL", 8, 8);
...

Also any disassembler, including mine, has a "mega-table" or "mother-table" or "master-table" that contains all opcodes, flags, instruction names, etc: https://github.com/DennisYurichev/x86_disasm/blob/acb6751145dac38258c53412fb51fb709b3f3c6f/x86_tbl_entries.h.

And I added a debugging statement: if a table entry is loaded, its number is being printed. And you'll see why.

I'm compiling the code so that it will generate GCOV-statistics:

gcc -fprofile-arcs -ftest-coverage -g -DX86_DISASM_PRINT_INS_TBL_ENTRY x86_disasm_tests.c x86_disas.c x86_register.c -I../octothorpe ../octothorpe/octothorpe.a

And I run it 12531 times. A number in 1..12531 range is passed in arguments:

#!/bin/bash
for i in $(seq 1 12532);
do

        echo $i
        rm *gcda*
        tbl_entry=$(./a.out $i | tail -1)
        echo $tbl_entry
        gcov x86_disas
        mv x86_disas.c.gcov gcovs/$i.$tbl_entry
done

12531 files have been created. Filename constists of number1.number2, where number1 is a test number (or line number in tests.h) and number2 is a table entry loaded during testing. The contents is a typical GCOV's output:

...
        1:  651:                if (IS_SET (p->new_flags, F_REG32_IS_LOWEST_PART_OF_1ST_BYTE))
    #####:  652:                        mask=0xF8;
        1:  653:                if (IS_SET (p->new_flags, F_REG64_IS_LOWEST_PART_OF_1ST_BYTE))
    #####:  654:                        mask=0xF8;
        -:  655:
        1:  656:                if ((opc & mask) != ins_tbl[p->tbl_p].opc)
        -:  657:                {
    #####:  658:                        p->tbl_p++;
    #####:  659:                        continue;
        -:  660:                };
        -:  661:
        -:  662:                // load second opcode if needed
        1:  663:                if (IS_SET (p->new_flags, F_OPC2))
        -:  664:                {
        -:  665:                        uint8_t opc2;
    #####:  666:                        if (Da_stage1_get_next_byte(p, &opc2)==false)
        -:  667:                        {
    #####:  668:                                if (dbg_print)
    #####:  669:                                        printf ("%s() line %d\n", __func__, __LINE__);
    #####:  670:                                return false;
        -:  671:                        };
...

This is how many times each line was executed during run. "#####" means never. "1" means one.

Now the goal: to execute all lines at least once, with the help of as few tests as possible. This is how the problem can be stated in plain English language:

"for the line X, the test Y OR the test Z or the test M must be ran".

We generate such (hard) clauses for each lines.

Also, we say to MaxSAT solver to find such a solution, where as few tests would be True, as possible. And this is my Python program to do so:

#!/usr/bin/env python

import re, sys, os

# read gcov result and return a list of lines executed:
def parse_gcov_file (fname):
    lines=[]
    f=open(fname,"r")
 
    while True:
        l=f.readline().rstrip()
        m = re.search('^ *([0-9]+):  ([0-9]+):.*$', l)
        if m!=None:
            lines.append(int(m.group(2)))
        if len(l)==0:
            break

    f.close()
    return lines

max_test_n=0

# k=line, v=list of tests
lines={}

# enumerate all gcov-files:
for (dirpath, dirnames, filenames) in os.walk ("gcovs"):
    for fname in filenames:
        #print "c fname", fname
        fullfname="gcovs/"+fname
        test_n=int(fname.split(".")[0])
        max_test_n=max(max_test_n, test_n)

        lines_executed=parse_gcov_file (fullfname)

        for line in lines_executed:
            lines[line]=lines.get(line, [])+[test_n]

def list_to_CNF(l):
    return "10000 "+" ".join(map(str, l))+" 0"

print "p wcnf "+str(max_test_n)+" "+str(len(lines)+max_test_n-1)+" 10000"

# hard clauses. each MUST be satisfied
# "test_1 OR test_2 OR ..." for each line
print "c lines:"
for line in lines:
    print "c line=", str(line)
    print list_to_CNF(lines[line])

# soft clauses. as many should be satisfied (be False) as possible
for test_n in range(max_test_n):
    print "1 -"+str(test_n+1)+" 0"


I run it:

python minimize_tests1.py > 1.wcnf

The resulting WCNF file is like:

p wcnf 12532 12832 10000
c lines:
c line= 768
10000 3432 6250 9716 9034 2020 4138 1308 5546 3433 4137 2728 7661 12449 9033 10400 4842 9717 4841 5547 8351 11083 2729 6955 11766 12450 2021 1309 6251 11767 6956 10399 11084 8350 7662 0
c line= 375
10000 1126 0
c line= 521
10000 12411 1796 8139 11130 2620 7728 11380 6631 1774 7685 5381 10996 626 6252 8189 4546 6191 11398 3865 ...
...
1 -1 0
1 -2 0
1 -3 0
1 -4 0
...
1 -12528 0
1 -12529 0
1 -12530 0
1 -12531 0
1 -12532 0

"c" is comment. I'm putting line (of code) number here. So for line 768, any tests among 3432 6250 9716 9034 2020 4138 1308 5546 3433 4137 2728 7661 must be present, at least one. For line 375, a (single) test 1126 must be present.

(By the way, it's quite interesting to see what line of code is triggered by a *single* test. Surely, they are must be present in minimized test suite.)

All test numbers are then enumerated as soft constraints.

Now I'm running Open-WBO MaxSAT solver on the given WCNF file:

open-wbo 1.wcnf > result
c
c Open-WBO:      a Modular MaxSAT Solver -- based on Glucose4.1 (core version)
...

s OPTIMUM FOUND
v -1 -2 -3 -4 -5 -6 -7 -8 -9 -10 -11 -12 -13 -14 -15 -16 -17 -18 -19 -20 -21 ... -1125 1126 -1127 ...
...

The output consists of all variables. But since a test number is mapped to a line number, this number also mapped to a variable's number. Efficiently, Open-WBO reports, which tests are to be picked (like 1126th).

I wrote an utility for that:

import sys

with open("result") as f:
    content = f.readlines()
content = [x.strip() for x in content]

with open("tests.h") as f:
    tests = f.readlines()
tests = [x.strip() for x in tests]

for test in content[-1][2:].split(" "):
    if test.startswith("-"):
        continue
    print tests[int(test)-1]


And we found that to cover (almost) all lines of code in my disassembler, only 11 tests are enough!

if (test_all || line==__LINE__) disas_test1(Fuzzy_False, (const byte*)"\x64\xA3\x12\x34\x56\x78", 0x100, "FS: MOV [78563412h], EAX");
if (test_all || line==__LINE__) disas_test1(Fuzzy_False, (const byte*)"\xD9\xEE", 0x123456, "FLDZ");
if (test_all || line==__LINE__) disas_test1(Fuzzy_False, (const byte*)"\xF0\x66\x0F\xB1\x0B", 0x1234, "LOCK CMPXCHG [EBX], CX");
if (test_all || line==__LINE__) disas_test1(Fuzzy_True, (const byte*)"\x44\xAB", 0x4f5b, "STOSD");
if (test_all || line==__LINE__) disas_test1(Fuzzy_True, (const byte*)"\x44\xBC\x90\x90\x90\x90", 0x507c, "MOV ESP, 90909090h");
if (test_all || line==__LINE__) disas_test1(Fuzzy_True, (const byte*)"\x45\xDB\x90\x90\x90\x90\x90", 0x638b, "FIST [R8-6f6f6f70h]");
if (test_all || line==__LINE__) disas_test1(Fuzzy_True, (const byte*)"\x4B\xC8\x90\x90\x90", 0xc848, "ENTER 0FFFFFFFFFFFF9090h, 90h");
if (test_all || line==__LINE__) disas_test1(Fuzzy_True, (const byte*)"\x4F\xBF\x90\x90\x90\x90\x90\x90\x90\x90", 0x10baf, "MOV R15, 9090909090909090h");
if (test_all || line==__LINE__) disas_test1(Fuzzy_True, (const byte*)"\x66\x0F\x38\x3D\xD0", 3, "PMAXSD XMM2, XMM0");
if (test_all || line==__LINE__) disas_test1(Fuzzy_True, (const byte*)"\xF2\x0F\x5A\xCE", 0, "CVTSD2SS XMM1, XMM6");
if (test_all || line==__LINE__) disas_test1(Fuzzy_True, (const byte*)"\xF3\x42\x0F\x11\x74\x1D\x98", 3, "MOVSS [RBP+R11-68h], XMM6");

(It's important to know that my "bloated" tests was not perfect, some lines of code like error reporting are like "dead code" now, but it's OK.)

At this point, my readers perhaps could stop reading and reuse my ideas for their own code and tests.

But... As I mentioned, a disassembler has a "mega-table". And we want to touch each its entry during tests at least once, like each line of code. And this is a second version of my program:

#!/usr/bin/env python

import re, sys, os

# read gcov result and return a list of lines executed:
def parse_gcov_file (fname):
    lines=[]
    f=open(fname,"r")
 
    while True:
        l=f.readline().rstrip()
        m = re.search('^ *([0-9]+):  ([0-9]+):.*$', l)
        if m!=None:
            lines.append(int(m.group(2)))
        if len(l)==0:
            break

    f.close()
    return lines

max_test_n=0

# k=line, v=list of tests
lines={}

# k=tbl_entry, v=list of tests
tbl_entries={}

# enumerate all gcov-files:
for (dirpath, dirnames, filenames) in os.walk ("gcovs"):
    for fname in filenames:
        #print "c fname", fname
        fullfname="gcovs/"+fname
        test_n=int(fname.split(".")[0])
        max_test_n=max(max_test_n, test_n)
        tbl_entry=int(fname.split(".")[1])

        tbl_entries[tbl_entry]=tbl_entries.get(tbl_entry, [])+[test_n]

        lines_executed=parse_gcov_file (fullfname)

        for line in lines_executed:
            lines[line]=lines.get(line, [])+[test_n]

def list_to_CNF(l):
    return "10000 "+" ".join(map(str, l))+" 0"

print "p wcnf "+str(max_test_n)+" "+str(len(tbl_entries)+len(lines)+max_test_n-1)+" 10000"

# hard clauses. each MUST be satisfied
# "test_1 OR test_2 OR ..." for table entry
print "c tbl entries:"
for tbl_entry in tbl_entries:
    print "c tbl_entry="+str(tbl_entry)
    print list_to_CNF(tbl_entries[tbl_entry])

print "c lines:"
for line in lines:
    print "c line=", str(line)
    print list_to_CNF(lines[line])

# soft clauses. as many should be satisfied (be False) as possible
for test_n in range(max_test_n):
    print "1 -"+str(test_n+1)+" 0"


Now only 301 tests are enough to cover (almost) all lines in my disassembler and to touch (almost) all entries in the "mega-table". Much better than 12531: https://github.com/DennisYurichev/x86_disasm/blob/acb6751145dac38258c53412fb51fb709b3f3c6f/tests.h.

Also, Open-WBO seems to be a better tool for the job, it works faster than Z3. Or maybe Z3 can be tuned?


→ [list of blog posts]

Please drop me email about any bug(s) and suggestion(s): dennis(@)yurichev.com.