13-Jun-2015: Modular arithmetic + division by multiplication + reversible LCG (PRNG) + cracking LCG with Z3.

Many practicing reverse engineers are well aware that a division operation is sometimes replaced by multiplication.

Here is an example:

#include <stdint.h>

uint32_t divide_by_9 (uint32_t a)
{
        return a/9;
};

Optimizing GCC 4.8.2 does this:

divide_by_9:
        mov     edx, 954437177
        mov     eax, edx
        mul     DWORD PTR [esp+4]
        shr     edx
        mov     eax, edx
        ret

This code can be rewritten in C/C++ as follows:

#include <stdint.h>

uint32_t divide_by_9_v2 (uint32_t a)
{
        return ((uint64_t)a * (uint64_t)954437177) >> 33; // 954437177 = 0x38e38e39
};

And it works: you can compile it and check it. Let's see how.
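One way to check it, as a minimal sketch: compare both functions over a range of inputs. The helper name count_mismatches() is my own and is not part of the original code. A sweep over all $2^{32}$ inputs is possible too, it just takes a while.

```c
#include <stdint.h>

/* The two functions from the text. */
uint32_t divide_by_9(uint32_t a)
{
        return a / 9;
}

uint32_t divide_by_9_v2(uint32_t a)
{
        return (uint32_t)(((uint64_t)a * 954437177u) >> 33);
}

/* Hypothetical helper: counts the inputs in [first, last] (first <= last)
   for which the two functions disagree. */
uint64_t count_mismatches(uint32_t first, uint32_t last)
{
        uint64_t bad = 0;
        uint32_t a = first;

        for (;;) {
                if (divide_by_9(a) != divide_by_9_v2(a))
                        bad++;
                if (a == last)  /* checked before a++ so that last = 0xFFFFFFFF also terminates */
                        break;
                a++;
        }
        return bad;
}
```

Calling count_mismatches(0, 0xFFFFFFFF) covers every possible 32-bit input.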

Quick introduction into modular arithmetic

Modular arithmetic is an environment where all values are limited by some number (the modulo base). Many textbooks use a clock as an example. Let's imagine an old mechanical analog clock. The hour hand there points to one of the numbers in the range 0..11 (zero is usually shown as 12). What hour will it be if you add 4 hours to 10 hours (no matter, AM or PM)? 10+4 is 14, or 2 modulo 12. Naively, you can just sum up the numbers and subtract the modulo base (12) as long as it's possible.

A modern digital watch shows time in 24-hour format, so the hour there is a variable in modulo base 24. But minutes and seconds are in modulo 60 (let's forget about leap seconds for now).

Another example is the US imperial system of measurement: human height is measured in feet and inches. There are 12 inches in a foot, so when you sum up some lengths, you increase the feet variable each time you've got 12 inches or more.

Another example I would recall is password cracking utilities. Often, a character set is defined in such utilities. When you set all Latin characters plus digits, you've got 26+10=36 characters in total. If you are brute-forcing a 6-character password, you've got 6 variables, each limited by 36. When you increase the last variable, it happens by modular arithmetic rules: if it reaches 36, set the last variable to 0 and increase the penultimate one. If that one also reaches 36, do the same. If the very first variable reaches 36, then stop. Modular arithmetic may be very helpful when you write a multi-threaded (or distributed) password cracking utility and you need to slice the whole password space into even parts.
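Slicing the password space gets easy if every candidate has a plain index, which is then converted to characters by repeated division and remainder. Here is a minimal sketch. The character order, the macro names and index_to_password() are all my assumptions, and real cracking utilities may do it differently.

```c
#include <stdint.h>

#define CHARSET     "abcdefghijklmnopqrstuvwxyz0123456789"
#define CHARSET_LEN 36
#define PW_LEN      6

/* Hypothetical helper: treats index as a 6-digit number in base 36
   and writes the corresponding password into out. */
void index_to_password(uint64_t index, char out[PW_LEN + 1])
{
        for (int i = PW_LEN - 1; i >= 0; i--) {
                out[i] = CHARSET[index % CHARSET_LEN]; /* current "wheel" */
                index /= CHARSET_LEN;                  /* carry to the next one */
        }
        out[PW_LEN] = '\0';
}
```

There are $36^{6}$ = 2176782336 indices in total. With N workers, worker number w could simply take the indices w, w+N, w+2N and so on.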

Now let's recall the old mechanical counters which were widespread in the pre-digital era:

The picture was stolen from http://www.featurepics.com/ - sorry for it!

This counter has 6 wheels, so it can count from 0 to $10^{6}-1$ or 999999. When you have 999999 and you increase the counter, it resets to 000000. This situation is usually understood by engineers and computer programmers as overflow. And if you have 000000 and you decrease it, the counter will show you 999999. This situation is often called "wrap around". See also: http://en.wikipedia.org/wiki/Integer_overflow.
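The same behaviour can be modelled in a couple of lines. This is only a sketch, and the names are made up for illustration:

```c
#include <stdint.h>

#define WHEELS_MODULO 1000000u /* 10^6: six decimal wheels */

/* Increase the counter: 999999 wraps around to 000000. */
uint32_t counter_increase(uint32_t c)
{
        return (c + 1) % WHEELS_MODULO;
}

/* Decrease the counter: 000000 wraps around to 999999.
   Adding the modulo base first avoids going below zero. */
uint32_t counter_decrease(uint32_t c)
{
        return (c + WHEELS_MODULO - 1) % WHEELS_MODULO;
}
```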

Modular arithmetic on CPUs

The reason I talk about the mechanical counter is that CPU registers act in the very same way, because this is, perhaps, the simplest and most efficient way to compute using integer numbers.

This implies that almost all operations on integer values on your CPU happen modulo $2^{32}$ or $2^{64}$, depending on your CPU. For example, you can sum up 0x87654321 and 0xDEADBABA, which results in 0x16612FDDB. This value is too big for a 32-bit register, so only 0x6612FDDB is stored, and the leading 1 is dropped. If you multiply these two numbers, the actual result is 0x75C5B266EDA5BFFA, which is also too big, so only the low 32-bit part is stored into the destination register: 0xEDA5BFFA. This is what happens when you multiply numbers in plain C/C++. Some readers may argue: when a sum is too big for a register, CF (the carry flag) is set, and it can be used afterwards. And there is the x86 MUL instruction, which in fact produces a 64-bit result in a 32-bit environment (in the EDX:EAX register pair). That's true, but if you observe just the 32-bit registers, this is exactly an environment of modulo with base $2^{32}$.

Now that leads to a surprising consequence: almost every result of an arithmetic operation stored in a general purpose register of a 32-bit CPU is in fact the remainder of a division operation: the result is always divided by $2^{32}$ and the remainder is left in the register. For example, 0x16612FDDB is too large for storage, and it's divided by $2^{32}$ (or 0x100000000). The result of the division (quotient) is 1 (which is dropped) and the remainder is 0x6612FDDB (which is stored as a result). 0x75C5B266EDA5BFFA divided by $2^{32}$ (0x100000000) produces 0x75C5B266 as the result of the division (quotient) and 0xEDA5BFFA as the remainder; the latter is stored.
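The numbers above are easy to reproduce in C. This is a small sketch with helper names of my own choosing, and it assumes the usual platforms where int is 32 bits wide. The quotient of the division by $2^{32}$ becomes visible only if the multiplication is done in 64 bits:

```c
#include <stdint.h>

/* Sum modulo 2^32: the carry out of bit 31 is silently dropped. */
uint32_t add32(uint32_t a, uint32_t b)
{
        return a + b;
}

/* Low half of the product: the remainder of division by 2^32. */
uint32_t mul32_low(uint32_t a, uint32_t b)
{
        return a * b;
}

/* High half of the product: the quotient of division by 2^32
   (what x86 MUL leaves in EDX). */
uint32_t mul32_high(uint32_t a, uint32_t b)
{
        return (uint32_t)(((uint64_t)a * b) >> 32);
}
```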

And if your code is 32-bit code in a 64-bit environment, the CPU registers are bigger, so the whole result can be stored there, but the high half is hidden behind the scenes -- because no 32-bit code can access it.

By the way, this is the reason why remainder calculation is often called "division by modulo". C/C++ has the percent sign (%) for this operation, but some other PLs like Pascal and Haskell have a "mod" operator.

Usually, almost all sane computer programmers work with variables as if they never wrap around and the value always stays within some limits which are defined beforehand. However, this implicit division operation or "wrapping around" can be exploited usefully.

Remainder of division by modulo $2^{n}$

... can be easily computed with the AND operation. If you need a random number in the range 0..15, here you go: rand()&0xF. That helps sometimes.

For example, you need some kind of wrapping counter variable which should always be in the 0..15 range. What do you do? Programmers often write this:

int counter=0;
...
counter++;
if (counter==16)
    counter=0;

But here is a version without conditional branching:

int counter=0;
...
counter++;
counter=counter&0xF;
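To see that both versions behave identically, here they are wrapped into functions (a sketch, the function names are hypothetical). The AND trick works only because 16 is a power of two: 0xF is 16-1, a mask of four low bits.

```c
/* Version with conditional branching. */
int next_with_branch(int counter)
{
        counter++;
        if (counter == 16)
                counter = 0;
        return counter;
}

/* Branchless version: AND with 0xF is the remainder of division by 16. */
int next_with_mask(int counter)
{
        return (counter + 1) & 0xF;
}
```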

Here is an example I found in the git source code:

char *sha1_to_hex(const unsigned char *sha1)
{
	static int bufno;
	static char hexbuffer[4][GIT_SHA1_HEXSZ + 1];
	static const char hex[] = "0123456789abcdef";
	char *buffer = hexbuffer[3 & ++bufno], *buf = buffer;
	int i;

	for (i = 0; i < GIT_SHA1_RAWSZ; i++) {
		unsigned int val = *sha1++;
		*buf++ = hex[val >> 4];
		*buf++ = hex[val & 0xf];
	}
	*buf = '\0';

	return buffer;
}
( https://github.com/git/git/blob/aa1c6fdf478c023180e5ca5f1658b00a72592dc6/hex.c )

This function returns a pointer to a string containing the hexadecimal representation of a SHA1 digest (like "4e1243bd22c66e76c2ba9eddc1f91394e57f9f83"). But this is plain C: with a single static buffer, you could calculate SHA1 for some block, get a pointer to the string, then calculate SHA1 for another block, get a pointer to the string, and both pointers would point to the same string buffer containing the result of the second calculation. As a solution, it's possible to allocate/deallocate the string buffer each time, but a more hackish way is to have several buffers (4 here) and fill the next one each time. The bufno variable here is a buffer counter. It is incremented on each call, and the buffer index is always kept within the 0..3 range by the AND operation (3 & ++bufno).
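The same rotating-buffers idea in isolation, as a minimal sketch. u32_to_hex() and NBUF are hypothetical names and are not from git; I also use an unsigned counter here so that its eventual overflow is well defined.

```c
#include <stdio.h>
#include <stdint.h>

#define NBUF 4 /* must be a power of two for the AND trick to work */

/* Hypothetical helper: returns a pointer to one of NBUF static buffers,
   so up to NBUF results can be in use at the same time. */
const char *u32_to_hex(uint32_t value)
{
        static unsigned bufno;        /* static storage: starts at zero */
        static char buffers[NBUF][9]; /* 8 hex digits + terminating zero */
        char *buffer = buffers[(NBUF - 1) & ++bufno];

        snprintf(buffer, sizeof buffers[0], "%08x", (unsigned)value);
        return buffer;
}
```

So printf("%s %s\n", u32_to_hex(a), u32_to_hex(b)) prints two different strings, but the result of the first call is overwritten by the fifth one.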

The author of this piece of code (seemingly Linus Torvalds himself) went even further and did not initialize the bufno counter variable explicitly. Since it is declared static, C initializes it to zero, so the very first call fills buffer 1 rather than buffer 0. Indeed: it doesn't matter which buffer we start with! Maybe the initializer was left out intentionally -- I don't know.

Getting random numbers

When you write some kind of videogame, you need random numbers, and the standard C/C++ rand() function gives you them in the 0..0x7FFF range (MSVC) or in the 0..0x7FFFFFFF range (GCC). And when you need a random number in the 0..9 range, the common way to do it is:

X_coord_of_something_spawned_somewhere=rand() % 10;
Y_coord_of_something_spawned_somewhere=rand() % 10;

No matter which compiler you use, you can think of it as if 10 is subtracted from the rand() result as long as the value is still 10 or bigger. Hence, the result is the remainder of division of the rand() result by 10.

One nasty consequence is that neither 0x8000 nor 0x80000000 can be divided by 10 evenly, so you'll get some numbers slightly more often than others.

I tried to calculate this in Mathematica. Here is what you get if you write rand() % 3 and rand() produces numbers in the range 0..0x7FFF (like MSVC):

In[]:= Counts[Map[Mod[#, 3] &, Range[0, 16^^8000 - 1]]]
Out[]= <|0 -> 10923, 1 -> 10923, 2 -> 10922|>

So the number 2 occurs slightly less often than the others.

Here is a result for rand() % 10:

In[]:= Counts[Map[Mod[#, 10] &, Range[0, 16^^8000 - 1]]]
Out[]= <|0 -> 3277, 1 -> 3277, 2 -> 3277, 3 -> 3277, 4 -> 3277, 
 5 -> 3277, 6 -> 3277, 7 -> 3277, 8 -> 3276, 9 -> 3276|>

The numbers 8 and 9 occur slightly less often.

Here is a result for rand() % 100:

In[]:= Counts[Map[Mod[#, 100] &, Range[0, 16^^8000 - 1]]]
Out[]= <|0 -> 328, 1 -> 328, 2 -> 328, 3 -> 328, 4 -> 328, 5 -> 328,
  6 -> 328, 7 -> 328, 8 -> 328, 9 -> 328, 10 -> 328, 11 -> 328, 
 12 -> 328, 13 -> 328, 14 -> 328, 15 -> 328, 16 -> 328, 17 -> 328, 
 18 -> 328, 19 -> 328, 20 -> 328, 21 -> 328, 22 -> 328, 23 -> 328, 
 24 -> 328, 25 -> 328, 26 -> 328, 27 -> 328, 28 -> 328, 29 -> 328, 
 30 -> 328, 31 -> 328, 32 -> 328, 33 -> 328, 34 -> 328, 35 -> 328, 
 36 -> 328, 37 -> 328, 38 -> 328, 39 -> 328, 40 -> 328, 41 -> 328, 
 42 -> 328, 43 -> 328, 44 -> 328, 45 -> 328, 46 -> 328, 47 -> 328, 
 48 -> 328, 49 -> 328, 50 -> 328, 51 -> 328, 52 -> 328, 53 -> 328, 
 54 -> 328, 55 -> 328, 56 -> 328, 57 -> 328, 58 -> 328, 59 -> 328, 
 60 -> 328, 61 -> 328, 62 -> 328, 63 -> 328, 64 -> 328, 65 -> 328, 
 66 -> 328, 67 -> 328, 68 -> 327, 69 -> 327, 70 -> 327, 71 -> 327, 
 72 -> 327, 73 -> 327, 74 -> 327, 75 -> 327, 76 -> 327, 77 -> 327, 
 78 -> 327, 79 -> 327, 80 -> 327, 81 -> 327, 82 -> 327, 83 -> 327, 
 84 -> 327, 85 -> 327, 86 -> 327, 87 -> 327, 88 -> 327, 89 -> 327, 
 90 -> 327, 91 -> 327, 92 -> 327, 93 -> 327, 94 -> 327, 95 -> 327, 
 96 -> 327, 97 -> 327, 98 -> 327, 99 -> 327|>

... now a larger share of the numbers occurs slightly less often: these are 68..99.

This is sometimes called modulo bias. It's perhaps acceptable for videogames, but may be critical for scientific simulations, including the Monte Carlo method.
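If you have no Mathematica at hand, the same counts can be obtained with a few lines of C. This is a sketch, and count_residues() is a made-up name:

```c
#include <stdint.h>

/* Fills counts[r] with the number of values in 0..range-1
   that leave remainder r when divided by m.
   counts must have room for m elements. */
void count_residues(uint32_t range, uint32_t m, uint32_t *counts)
{
        for (uint32_t r = 0; r < m; r++)
                counts[r] = 0;
        for (uint32_t v = 0; v < range; v++)
                counts[v % m]++;
}
```

Calling count_residues(0x8000, 10, counts) corresponds to the rand() % 10 case for MSVC.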

Constructing a PRNG with uniform distribution may be tricky; there are a couple of methods: 1, 2.
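One widely used way around modulo bias is rejection sampling: throw away the results of rand() which fall into the uneven tail and try again. I can't say whether this is one of the methods referenced above, so take it as a sketch. uniform_below() is a hypothetical name, and the function assumes that rand() itself is uniform over 0..RAND_MAX.

```c
#include <stdlib.h>

/* Hypothetical helper: returns a number in 0..n-1,
   for 0 < n <= RAND_MAX+1, without modulo bias. */
unsigned uniform_below(unsigned n)
{
        const unsigned range = (unsigned)RAND_MAX + 1u; /* how many distinct values rand() has */
        const unsigned limit = range - range % n;       /* largest multiple of n not above range */
        unsigned r;

        do {
                r = (unsigned)rand();
        } while (r >= limit); /* reject the tail which doesn't divide evenly */

        return r % n;
}
```

With RAND_MAX=0x7FFF and n=10, only the values 32760..32767 are rejected, so the loop almost never repeats.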

Multiplicative inverse

Finding multiplicative inverse

From school-level mathematics we may recall there is an easy way to replace division by multiplication. For example, if you need to divide some number by 3, multiply it by $\frac{1}{3}$ (or 0.33333...). So if you've got a lot of numbers you need to divide by 3, and if multiplication on your FPU works faster than division, you can precompute $\frac{1}{3}$ and then multiply all numbers by it. $\frac{1}{3}$ is called the multiplicative inverse or reciprocal. Russian textbooks also use the terser terms "inverse number" or "inverse value".

But that works for real numbers only. What about integer ones?

Finding modular multiplicative inverse

First, let's state our task: we need to divide a number (unknown at compile time) by 9.

Our environment has at least these properties:

We can't divide by 9 using bit shifts, but we can divide by $2^{32}$ or by $2^{n}$ in general. What if we multiply the input number to make it much bigger, so as to compensate for the difference between 9 and $2^{32}$? Yes!

Our initial task is:

result = input / 9

What we can do:

result = input * coefficient / 2^32

coefficient is the solution of this equation:

9x = 1+k(2^32).

We can solve it in Wolfram Mathematica:

In[]:= FindInstance[9 x == 1 + k (2^32), {x, k}, Integers]
Out[]= {{x -> 954437177, k -> 2}}

x (which is the modular multiplicative inverse) will be the coefficient; k is another special value, used at the very end.

Let's check it in Mathematica:

In[]:= BaseForm[954437177*90, 16]
Out[]//BaseForm= 140000000a

(BaseForm is the instruction to print result in hexadecimal form).

The multiplication has been done, but the division by $2^{32}$ (or $2^{n}$) has not happened yet. After division by $2^{32}$, 0x14 will be the result and 0xA the remainder. 0x14 is 20, which is twice as large as the result we expect ($\frac{90}{9}$=10). That's because k=2, so the final result should also be divided by 2.

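That is exactly what the code produced by GCC does: MUL leaves the high 32 bits of the product in EDX, which is the division by $2^{32}$, and SHR then divides EDX by 2.

When the whole product is held in one 64-bit value, as in divide_by_9_v2(), the last two steps are coalesced into one shift by 33 bits.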
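The separate steps can also be written down in C, which makes the intermediate value 20 for input 90 visible. This is a sketch; the function names and the INV9 macro are mine.

```c
#include <stdint.h>

#define INV9 954437177u /* x from 9x = 1 + 2*(2^32) */

/* Step 1: multiply and divide by 2^32 (take the high half of the product). */
uint32_t after_division_by_2_32(uint32_t input)
{
        return (uint32_t)(((uint64_t)input * INV9) >> 32);
}

/* Step 2: divide by k=2, i.e., shift by one more bit. */
uint32_t divide_by_9_two_steps(uint32_t input)
{
        return after_division_by_2_32(input) >> 1;
}

/* Both steps coalesced into a single shift by 33 bits. */
uint32_t divide_by_9_one_shift(uint32_t input)
{
        return (uint32_t)(((uint64_t)input * INV9) >> 33);
}
```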

Let's also check the relation between the modular multiplicative inverse coefficient we've got and $2^{32}$ (the modulo base):

In[]:= 954437177 / 2^32 // N
Out[]= 0.222222

0.222... is twice as large as $\frac{1}{9}$ (again because k=2). So this number acts like a real $\frac{1}{9}$ number, but on an integer ALU!

A little more theory

But why is the Wikipedia article about it somewhat harder to grasp? And why do we need the additional k coefficient? The reason is that the equation we should solve to get the coefficients is in fact a diophantine equation, that is, an equation which allows only integers as its variables. Hence you see "Integers" in the FindInstance Mathematica command: no real numbers are allowed. Mathematica wouldn't be able to find x for k=1 (the additional bit shift would not be needed then), but was able to find it for k=2. The diophantine equation is so important here because we work on an integer ALU, after all.

So the coefficient used is in fact the modular multiplicative inverse. And when you see such a piece of code in some software, Mathematica can find the divisor easily: just find the modular multiplicative inverse of the modular multiplicative inverse! It works because $x=\frac{1}{(\frac{1}{x})}$.

In[]:= PowerMod[954437177, -1, 2^32]
Out[]= 9

The PowerMod command is so called because it computes $x^{-1}$ modulo the given base ($2^{32}$), which is the same thing. Other implementations of this algorithm are here: http://rosettacode.org/wiki/Modular_inverse.
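For the special case of modulo $2^{32}$, the inverse of an odd number can also be found without Mathematica, using Newton's iteration. To be clear, this is a different technique from the extended Euclidean algorithm shown at Rosetta Code, and I don't know what PowerMod uses internally. A sketch, with modinv32() being a hypothetical name:

```c
#include <stdint.h>

/* Inverse of an odd number a modulo 2^32 using Newton's iteration.
   For even a there is no inverse and the result is meaningless. */
uint32_t modinv32(uint32_t a)
{
        uint32_t x = a; /* a*a == 1 (mod 8) for any odd a: 3 low bits are already correct */

        /* Each step doubles the number of correct low bits: 3, 6, 12, 24, 48. */
        for (int i = 0; i < 4; i++)
                x *= 2u - a * x;

        return x;
}
```

It works in both directions: modinv32(9) gives 954437177, and modinv32(954437177) gives 9 back.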

So, multiplicative inverse is denoted as $x^{-1}$ and modular multiplicative inverse as $x^{-1} \pmod b$ where b is modulo base.

Remainder?

It can be easily observed that no bit shifting is needed, just multiply the number by the modular inverse:

In[]:= Mod[954437177*18, 2^32]
Out[]= 2

The number we've got is in fact the remainder of division by $2^{32}$. It is the same as the result we are looking for, because the diophantine equation we solved has 1 in "1+k...": this 1 is multiplied by the result and it is left as the remainder.

This is somewhat useless, because this calculation goes crazy when we need to divide some number (like 19) by 9 ($\frac{19}{9}=2.111...$), which should leave a remainder (19 % 9 = 1):

In[]:= Mod[954437177*19, 2^32]
Out[]= 954437179

Perhaps this can be used to detect situations when a remainder is also present?
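It can, at least for odd divisors. Multiplication by the inverse modulo $2^{32}$ only shuffles the 32-bit values: multiples of 9 land on their true quotients, which are all at most 0xFFFFFFFF/9, and everything else lands above that bound. This is a known trick rather than something taken from the GCC output above. A sketch under that assumption, with a made-up function name:

```c
#include <stdint.h>
#include <stdbool.h>

/* True if n is divisible by 9: one multiplication and one comparison,
   no division at run time (UINT32_MAX / 9 is a compile-time constant). */
bool is_multiple_of_9(uint32_t n)
{
        return n * 954437177u <= UINT32_MAX / 9;
}
```

For 18 the product is 2 (small, so 18 is a multiple), and for 19 it is 954437179 (bigger than 477218588, so 19 is not).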

Always coprimes?

As stated in many textbooks, to find a modular multiplicative inverse, the modulo base ($2^{32}$) and the initial value (e.g., 9) should be coprime to each other. 9 is coprime to $2^{32}$, and so is 7, but 10 is not. But if you try to compile $\frac{x}{10}$ code, GCC can do it as well:

push   %ebp
mov    %esp,%ebp
mov    0x8(%ebp),%eax
mov    $0xcccccccd,%edx
mul    %edx
mov    %edx,%eax
shr    $0x3,%eax
pop    %ebp
ret

The reason it works is that division by 5 actually happens here (and 5 is coprime to $2^{32}$), and then the final result is divided by 2 (so there is 3 instead of 2 in the SHR instruction).
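Rewritten into C, this may look as follows. It is a sketch with my own function names. 0xCCCCCCCD is the inverse of 5, since 5*0xCCCCCCCD = 1 + 4*($2^{32}$). Here k=4, hence two extra bits of shift for the division by 5, and one more bit for the division by 2.

```c
#include <stdint.h>

/* Division by 5: the high half of the product (>>32), then >>2 because k=4. */
uint32_t divide_by_5(uint32_t a)
{
        return (uint32_t)(((uint64_t)a * 0xCCCCCCCDu) >> 34);
}

/* Division by 10: the same as above, plus one more bit for the division by 2. */
uint32_t divide_by_10(uint32_t a)
{
        return (uint32_t)(((uint64_t)a * 0xCCCCCCCDu) >> 35);
}
```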

Reversible linear congruential generator

An LCG is very simple: just multiply the seed by some value, add another one, and here is a new random number. Here is how it is implemented in MSVC (this is not the original source code, I reconstructed it):

uint32_t state;

uint32_t rand()
{
	state=state*214013+2531011;
	return (state>>16)&0x7FFF;
};

The last bit shift is an attempt to compensate for LCG weakness, and we may ignore it for now. Is it possible to make an inverse function to rand(), which can take the state back? First, let's think about what would make this possible. Well, if the state internal variable were some kind of BigInt or BigNum container which can store arbitrarily big numbers, then, although the state grows rapidly, it would be possible to reverse the process. But state isn't BigInt/BigNum, it's a 32-bit variable. Still, the summing operation is easily reversible on it (just subtract 2531011 at each step). And as we know now, multiplication is also reversible (214013 is odd, hence coprime to $2^{32}$): just multiply the state by the modular multiplicative inverse of 214013!

#include <stdio.h>
#include <stdint.h>

uint32_t state;

void next_state()
{
	state=state*214013+2531011;
}

void prev_state()
{
	state=state-2531011; // reverse the summing operation
	state=state*3115528533u; // reverse the multiplication: 3115528533 is the modular inverse of 214013 modulo 2^32
}

int main()
{
	state=12345;

	// state is printed as a signed value, hence the negative numbers in the output
	printf ("state=%d\n", (int)state);
	next_state();
	printf ("state=%d\n", (int)state);
	next_state();
	printf ("state=%d\n", (int)state);
	next_state();
	printf ("state=%d\n", (int)state);

	prev_state();
	printf ("state=%d\n", (int)state);
	prev_state();
	printf ("state=%d\n", (int)state);
	prev_state();
	printf ("state=%d\n", (int)state);

	return 0;
}

Wow, that works!

state=12345
state=-1650445800
state=1255958651
state=-456978094
state=1255958651
state=-1650445800
state=12345

It's very hard to find a real-world application for a reversible LCG, but it is a spectacular demonstration of the modular multiplicative inverse, so I added it.

Cracking LCG with Z3 SMT solver

... the text which was here has been moved to https://yurichev.com/writings/SAT_SMT_draft-EN.pdf.

RSA

Modular arithmetic is also used at the core of the RSA algorithm. I've written an article about it: //yurichev.com/blog/RSA/.



Please drop me email about any bug(s) and suggestion(s): dennis(@)yurichev.com.