Executable file watermarking/steganography using Lehmer code and factorial number system

TL;DR: how to hide information not in objects, but in *order* of objects.

Almost any binary executable file has text strings like (these are from CRT):

.rdata:0040D398 aR6002FloatingP:
.rdata:0040D398                 text "UTF-16LE", 'R6002',0Dh,0Ah
.rdata:0040D398                 text "UTF-16LE", '- floating point support not loaded',0Dh,0Ah,0
.rdata:0040D3F2                 align 8
.rdata:0040D3F8 aR6008NotEnough:
.rdata:0040D3F8                 text "UTF-16LE", 'R6008',0Dh,0Ah
.rdata:0040D3F8                 text "UTF-16LE", '- not enough space for arguments',0Dh,0Ah,0
.rdata:0040D44C                 align 10h
.rdata:0040D450 aR6009NotEnough:
.rdata:0040D450                 text "UTF-16LE", 'R6009',0Dh,0Ah
.rdata:0040D450                 text "UTF-16LE", '- not enough space for environment',0Dh,0Ah,0
.rdata:0040D4A8 aR6010AbortHasB:
.rdata:0040D4A8                 text "UTF-16LE", 'R6010',0Dh,0Ah
.rdata:0040D4A8                 text "UTF-16LE", '- abort() has been called',0Dh,0Ah,0
.rdata:0040D4EE                 align 10h
.rdata:0040D4F0 aR6016NotEnough:
.rdata:0040D4F0                 text "UTF-16LE", 'R6016',0Dh,0Ah
.rdata:0040D4F0                 text "UTF-16LE", '- not enough space for thread data',0Dh,0Ah,0
.rdata:0040D548 aR6017Unexpecte:
.rdata:0040D548                 text "UTF-16LE", 'R6017',0Dh,0Ah
.rdata:0040D548                 text "UTF-16LE", '- unexpected multithread lock error',0Dh,0Ah,0
.rdata:0040D5A2                 align 8
.rdata:0040D5A8 aR6018Unexpecte:
.rdata:0040D5A8                 text "UTF-16LE", 'R6018',0Dh,0Ah
.rdata:0040D5A8                 text "UTF-16LE", '- unexpected heap error',0Dh,0Ah,0
.rdata:0040D5EA                 align 10h


Can we hide some information there? Not in string themselves, but in *order* of strings? Given the fact, that compiler doesn't guarantee at all, in which order the strings will be stored in object/executable file.

Let's say, we've got 26 text strings, and we can swap them how we want, because their order isn't important at all. All possible permutations of 26 objects is 26! = 403291461126605635584000000

How much information can be stored here? log2(26!)=~88, i.e., 88 bits or 11 bytes!

11 bytes of your data can be converted to a (big) number and back, OK.

What is next? Naive way is: enumerate all permutations of 26 objects, number each, find permutation of the number we've got and permute 26 text strings, store to file and that's it. But we can't iterate over 403291461126605635584000000 permutations.

This is where factorial number system and Lehmer code can be used. In short, for all my non-mathematically inclined readers, this is a way to find a permutation of specific number without use of any significant resources. And back: gived specific permutation, we can find its number.

This piece of code I've copypasted from https://gist.github.com/lukmdo/7049748 and reworked slightly:

# python 3.x

import math

def iter_perm(base, *rargs):
"""
:type base: list
:param rargs: range args [start,] stop[, step]
:rtype: generator
"""
if not rargs:
rargs = [math.factorial(len(base))]
for i in range(*rargs):
yield perm_from_int(base, i)

def int_from_code(code):
"""
:type code: list
:rtype: int
"""
num = 0
for i, v in enumerate(reversed(code), 1):
num *= i
num += v

return num

def code_from_int(size, num):
"""
:type size: int
:type num: int
:rtype: list
"""
code = []
for i in range(size):
num, j = divmod(num, size - i)
code.append(j)

return code

def perm_from_code(base, code):
"""
:type base: list
:type code: list
:rtype: list
"""

perm = base.copy()
for i in range(len(base) - 1):
j = code[i]
if i != i+j:
print ("swapping %d, %d" % (i, i+j))
perm[i], perm[i+j] = perm[i+j], perm[i]

return perm

def perm_from_int(base, num):
"""
:type base: list
:type num: int
:rtype: list
"""
code = code_from_int(len(base), num)
print ("Lehmer code=", code)
return perm_from_code(base, code)

def code_from_perm(base, perm):
"""
:type base: list
:type perm: list
:rtype: list
"""

p = base.copy()
n = len(base)
pos_map = {v: i for i, v in enumerate(base)}

w = []
for i in range(n):
d = pos_map[perm[i]] - i
w.append(d)

if not d:
continue
t = pos_map[perm[i]]
pos_map[p[i]], pos_map[p[t]] = pos_map[p[t]], pos_map[p[i]]
p[i], p[t] = p[t], p[i]

return w

def int_from_perm(base, perm):
"""
:type base: list
:type perm: list
:rtype: int
"""
code = code_from_perm(base, perm)
return int_from_code(code)

def bin_string_to_number(s):
rt=0
for i, c in enumerate(s):
rt=rt<<8
rt=rt+ord(c)
return rt

def number_to_bin_string(n):
rt=""
while True:
r=n & 0xff
rt=rt+chr(r)
n=n>>8
if n==0:
break
return rt[::-1]

s="HelloWorld"
print ("s=", s)
num=bin_string_to_number (s)
print ("num=", num)
perm=perm_from_int(list(range(26)), num)
print ("permutation/order=", perm)

num2=int_from_perm(list(range(26)), [14, 17, 9, 19, 11, 16, 23, 0, 2, 13, 20, 18, 21, 24, 10, 1, 22, 4, 7, 6, 15, 12, 5, 3, 8, 25])
print ("recovered num=", num2)
s2=number_to_bin_string(num2)
print ("recovered s=", s2)



I'm encoding a "HelloWorld" binary string (in fact, any 11 bytes can be used) into a number. Number is then converted into Lehmer code. Then the perm_from_code() function permute initial *order* according to the input Lehmer code:

s= HelloWorld
num= 341881320659934023674980
Lehmer code= [14, 16, 7, 16, 7, 11, 17, 7, 1, 4, 10, 7, 9, 11, 6, 2, 6, 1, 2, 4, 0, 0, 0, 0, 0, 0]
swapping 0, 14
swapping 1, 17
swapping 2, 9
swapping 3, 19
swapping 4, 11
swapping 5, 16
swapping 6, 23
swapping 7, 14
swapping 8, 9
swapping 9, 13
swapping 10, 20
swapping 11, 18
swapping 12, 21
swapping 13, 24
swapping 14, 20
swapping 15, 17
swapping 16, 22
swapping 17, 18
swapping 18, 20
swapping 19, 23
permutation/order= [14, 17, 9, 19, 11, 16, 23, 0, 2, 13, 20, 18, 21, 24, 10, 1, 22, 4, 7, 6, 15, 12, 5, 3, 8, 25]


This is it: [14, 17, 9, 19, 11, 16, 23, 0, 2, 13, 20, 18, 21, 24, 10, 1, 22, 4, 7, 6, 15, 12, 5, 3, 8, 25]. First put 14th string, then 17s string, then 9th one, etc.

Now you've got a binary file from someone and want to read watermark from it. Get an order of strings from it and convert it back to binary string:

recovered num= 341881320659934023674980
recovered s= HelloWorld


If you have more text strings (not unusual for most executable files), you can encode more.

100 strings: log2(100!) = ~524 bits = ~65 bytes.

1000 strings: log2(1000!) = ~8529 bits = ~1066 bytes! You can store some text here!

How would you force a C/C++ compiler to make specific order of text strings? This is crude, but workable:

char blob[]="hello1\0hello2\0";
char *msg1=blob;
char *msg2=blob+8;

printf ("%s\n", msg1);
printf ("%s\n", msg2);


They can be even aligned on 16-byte border.

... or they can be placed into .s/.asm assembly file and compiled into .o/.obj and then linked to your program.

... or you can swap text strings in already compiled executable and correct their addresses in corresponding instructions. If an executable file is not packed/obfuscated, this is possible.

Aside of order of text strings, you can try to hack a linker and reorder object files in the final executable. Of course, no one cares about its order. And go figure out, what is hidden there.

Surely, hidden data can be encrypted, checksum or MAC can be added, etc.

Other ideas to consider: basic blocks reordering within a function, register allocator hacking, etc.

Links I find helpful in understanding factorial number system and Lehmer code, aside of Wikipedia:

Please drop me email about any bug(s) and suggestion(s): dennis(@)yurichev.com.