(Beginner level) packing 12-bit values into an array using bit operations (x64, ARM/ARM64, MIPS)

The note below was later copied into the Reverse Engineering for Beginners book

Introduction

The File Allocation Table (FAT) was a widely popular filesystem. Hard to believe, but it's still used on flash drives, probably for reasons of simplicity and compatibility. The FAT table itself is an array of elements, each of which points to the next cluster number of a file (FAT supports files scattered across the whole disk). That implies that the maximum value of each element is the maximum number of clusters on the disk. In the MS-DOS era, most hard disks had a FAT16 filesystem, because the cluster number could be packed into a 16-bit value. Hard disks later became cheaper, and FAT32 emerged, where a 32-bit value was allocated for the cluster number. But there was also a time when floppy diskettes were not that cheap and had little space, so FAT12 was used on them, in order to pack all filesystem structures as tightly as possible: https://en.wikipedia.org/wiki/File_Allocation_Table#FAT12.

So the FAT table in the FAT12 filesystem is an array where each pair of adjacent 12-bit elements is stored in 3 bytes (a triplet). Here is how six 12-bit values (AAA, BBB, CCC, DDD, EEE and FFF) are packed into 9 bytes:

 +0 +1 +2 +3 +4 +5 +6 +7 +8
|AA|AB|BB|CC|CD|DD|EE|EF|FF|...

Pushing values into such an array and pulling them back out makes a good example of bit-twiddling operations (in both C/C++ and low-level machine code), so that's why I'll use FAT12 as an example here.

Data structure

We can quickly observe that each byte triplet stores two 12-bit values: the first one is located at the left side, the second at the right:

 +0 +1 +2
|11|12|22|...

We will pack the nibbles (4-bit chunks) in the following way (1 is the highest nibble, 3 the lowest):

(Even)

 +0 +1 +2
|12|3.|..|...

(Odd)

 +0 +1 +2
|..|.1|23|...

The algorithm

So the algorithm can be as follows: if the element's index is even, put it at the left side; if the index is odd, place it at the right side. The middle byte: if the element's index is even, place part of the value in its high 4 bits; if the index is odd, place part of it in the low 4 bits. But first, find the right triplet, which is easy: $\frac{index}{2}$. Finding the offset of the right byte in an array of bytes is also easy: $\frac{index}{2} \cdot 3$, or $index \cdot \frac{3}{2}$, or just $index \cdot 1.5$.

Pulling values from the array: if the index is even, get the leftmost and middle bytes and combine their parts. If the index is odd, get the middle and rightmost bytes and combine them. Do not forget to mask off the unneeded bits of the middle byte.

Pushing values is almost the same, but be careful not to overwrite the other element's bits in the middle byte; change only yours.

The C/C++ code

#include <stdio.h>
#include <stdint.h>
#include <assert.h>

#define ARRAY_SIZE (0x1000/2*3)
uint8_t array[ARRAY_SIZE]; // big enough array of triplets

unsigned int get_from_array (unsigned int idx)
{
	// find right triplet in array:
	int triplet=(idx>>1);
	int array_idx=triplet*3;
	//assert (array_idx<ARRAY_SIZE);
	
	if (idx&1)
	{
		// this is odd element

		// compose value using middle and rightmost bytes:
		return ((array[array_idx+1]&0xF) << 8)|array[array_idx+2];
	}
	else
	{
		// this is even element

		// compose value using leftmost and middle bytes:
		return array[array_idx]<<4 | ((array[array_idx+1]>>4)&0xF);
	};
};

void put_to_array (unsigned int idx, unsigned int val)
{
	//assert (val<=0xFFF);

	// find right triplet in array:
	int triplet=(idx>>1);
	int array_idx=triplet*3;
	//assert (array_idx<ARRAY_SIZE);

	if (idx&1)
	{
		// this is odd element
		// put value into middle and rightmost bytes:
	
		// decompose value to be stored:
		uint8_t val_lowest_byte=val&0xFF; // isolate lowest 8 bits
		uint8_t val_highest_nibble=val>>8; // no need to apply &0xF, we already know the val<=0xFFF

		// clear low 4 bits in the middle byte:
		array[array_idx+1]=array[array_idx+1]&0xF0;
		
		array[array_idx+1]=array[array_idx+1]|val_highest_nibble;
		array[array_idx+2]=val_lowest_byte;
	}
	else
	{
		// this is even element
		// put value into leftmost and middle bytes:

		// decompose value to be stored:
		uint8_t val_highest_byte=val>>4;
		uint8_t val_lowest_nibble=val&0xF;

		array[array_idx]=val_highest_byte;

		// clear high 4 bits in the middle byte:
		array[array_idx+1]=array[array_idx+1]&0xF;
		array[array_idx+1]=array[array_idx+1]|val_lowest_nibble<<4;
	};
};

int main()
{
	int i;
	
	// test
	for (i=0; i<0x1000; i++)
	{
		put_to_array(i, i);
	};
	
	for (i=0; i<0x1000; i++)
	{
		assert(get_from_array(i)==i);
	};
	//put_to_array(0x1000, 1); // will fail due to assert()
	
	// print triplets:
	for (i=0; i<0x1000/2; i++)
		printf ("0x%02X%02X%02X\n",array[i*3],array[i*3+1],array[i*3+2]);
};

During the test, all 12-bit elements are filled with values in the 0..0xFFF range. And here is a dump of all triplets, 3 bytes per line:

0x000001
0x002003
0x004005
0x006007
0x008009
0x00A00B
0x00C00D
0x00E00F
0x010011
0x012013
0x014015

...

0xFECFED
0xFEEFEF
0xFF0FF1
0xFF2FF3
0xFF4FF5
0xFF6FF7
0xFF8FF9
0xFFAFFB
0xFFCFFD
0xFFEFFF

Here is also a GDB byte-level dump of 300 bytes (or 100 triplets), starting at offset 512/2*3, i.e., the address where the 512th element (0x200) begins. I added square brackets in my text editor to show the triplets explicitly. Take notice of the middle bytes, where one element ends and the next begins. In other words, each middle byte holds the lowest 4 bits of an even element and the highest 4 bits of an odd element.

(gdb) x/300xb array+512/2*3
0x601380 <array+768>: [ 0x20 0x02 0x01 ][ 0x20 0x22 0x03 ][ 0x20 0x42
0x601388 <array+776>: 0x05 ][ 0x20 0x62 0x07 ][ 0x20 0x82 0x09 ][ 0x20
0x601390 <array+784>: 0xa2 0x0b ][ 0x20 0xc2 0x0d ][ 0x20 0xe2 0x0f ]
0x601398 <array+792>: [ 0x21 0x02 0x11 ][ 0x21 0x22 0x13 ][ 0x21 0x42
0x6013a0 <array+800>: 0x15 ][ 0x21 0x62 0x17 ][ 0x21 0x82 0x19 ][ 0x21
0x6013a8 <array+808>: 0xa2 0x1b ][ 0x21 0xc2 0x1d ][ 0x21 0xe2 0x1f ]
0x6013b0 <array+816>: [ 0x22 0x02 0x21 ][ 0x22 0x22 0x23 ][ 0x22 0x42
0x6013b8 <array+824>: 0x25 ][ 0x22 0x62 0x27 ][ 0x22 0x82 0x29 ][ 0x22
0x6013c0 <array+832>: 0xa2 0x2b ][ 0x22 0xc2 0x2d ][ 0x22 0xe2 0x2f ]
0x6013c8 <array+840>: [ 0x23 0x02 0x31 ][ 0x23 0x22 0x33 ][ 0x23 0x42
0x6013d0 <array+848>: 0x35 ][ 0x23 0x62 0x37 ][ 0x23 0x82 0x39 ][ 0x23
0x6013d8 <array+856>: 0xa2 0x3b ][ 0x23 0xc2 0x3d ][ 0x23 0xe2 0x3f ]
0x6013e0 <array+864>: [ 0x24 0x02 0x41 ][ 0x24 0x22 0x43 ][ 0x24 0x42
0x6013e8 <array+872>: 0x45 ][ 0x24 0x62 0x47 ][ 0x24 0x82 0x49 ][ 0x24
0x6013f0 <array+880>: 0xa2 0x4b ][ 0x24 0xc2 0x4d ][ 0x24 0xe2 0x4f ]
0x6013f8 <array+888>: [ 0x25 0x02 0x51 ][ 0x25 0x22 0x53 ][ 0x25 0x42
0x601400 <array+896>: 0x55 ][ 0x25 0x62 0x57 ][ 0x25 0x82 0x59 ][ 0x25
0x601408 <array+904>: 0xa2 0x5b ][ 0x25 0xc2 0x5d ][ 0x25 0xe2 0x5f ]
0x601410 <array+912>: [ 0x26 0x02 0x61 ][ 0x26 0x22 0x63 ][ 0x26 0x42
0x601418 <array+920>: 0x65 ][ 0x26 0x62 0x67 ][ 0x26 0x82 0x69 ][ 0x26
0x601420 <array+928>: 0xa2 0x6b ][ 0x26 0xc2 0x6d ][ 0x26 0xe2 0x6f ]
0x601428 <array+936>: [ 0x27 0x02 0x71 ][ 0x27 0x22 0x73 ][ 0x27 0x42
0x601430 <array+944>: 0x75 ][ 0x27 0x62 0x77 ][ 0x27 0x82 0x79 ][ 0x27
0x601438 <array+952>: 0xa2 0x7b ][ 0x27 0xc2 0x7d ][ 0x27 0xe2 0x7f ]
0x601440 <array+960>: [ 0x28 0x02 0x81 ][ 0x28 0x22 0x83 ][ 0x28 0x42
0x601448 <array+968>: 0x85 ][ 0x28 0x62 0x87 ][ 0x28 0x82 0x89 ][ 0x28
0x601450 <array+976>: 0xa2 0x8b ][ 0x28 0xc2 0x8d ][ 0x28 0xe2 0x8f ]
0x601458 <array+984>: [ 0x29 0x02 0x91 ][ 0x29 0x22 0x93 ][ 0x29 0x42
0x601460 <array+992>: 0x95 ][ 0x29 0x62 0x97 ][ 0x29 0x82 0x99 ][ 0x29
0x601468 <array+1000>: 0xa2 0x9b ][ 0x29 0xc2 0x9d ][ 0x29 0xe2 0x9f ]
0x601470 <array+1008>:[ 0x2a 0x02 0xa1 ][ 0x2a 0x22 0xa3 ][ 0x2a 0x42
0x601478 <array+1016>: 0xa5 ][ 0x2a 0x62 0xa7 ][ 0x2a 0x82 0xa9 ][ 0x2a
0x601480 <array+1024>: 0xa2 0xab ][ 0x2a 0xc2 0xad ][ 0x2a 0xe2 0xaf ]
0x601488 <array+1032>:[ 0x2b 0x02 0xb1 ][ 0x2b 0x22 0xb3 ][ 0x2b 0x42
0x601490 <array+1040>: 0xb5 ][ 0x2b 0x62 0xb7 ][ 0x2b 0x82 0xb9 ][ 0x2b
0x601498 <array+1048>: 0xa2 0xbb ][ 0x2b 0xc2 0xbd ][ 0x2b 0xe2 0xbf ]
0x6014a0 <array+1056>:[ 0x2c 0x02 0xc1 ][ 0x2c 0x22 0xc3 ][ 0x2c 0x42
0x6014a8 <array+1064>: 0xc5 ][ 0x2c 0x62 0xc7 ]

How it works

Let array be a global buffer, to make access to it simpler.

Getter

Let's first look at the function that gets values from the array, because it's simpler.

The method of finding the triplet's number is just dividing the input index by 2, and this can be done by shifting it right by 1 bit. This is a very common way of dividing/multiplying by numbers of the form $2^n$.

Here is how it works. Let's say you need to divide 123 by 10: just drop the last digit (3, which is the remainder of the division) and 12 is left. Division by 2 works the same way in binary: just drop the least significant bit, and dropping can be done by shifting right.

Now the function must decide if the input index is even (so the 12-bit value is placed at the left) or odd (at the right). The simplest way to do so is to isolate the lowest bit (x&1). If it's zero, our index is even; otherwise it's odd.

This fact can be illustrated easily, take a look at the lowest bit:

decimal binary even/odd

0       0000   even
1       0001   odd
2       0010   even
3       0011   odd
4       0100   even
5       0101   odd
6       0110   even
7       0111   odd
8       1000   even
9       1001   odd
10      1010   even
11      1011   odd
12      1100   even
13      1101   odd
14      1110   even
15      1111   odd
...

Zero is also an even number: its lowest bit is zero, and it sits between two odd numbers (-1 and 1).

For math geeks: the even/odd flag is simply the remainder of division by 2. Dividing a number by 2 is merely dropping its last bit, and that bit is the remainder. Here we do not even need to shift; just isolate the lowest bit.

If the element's index is odd, we take the middle and right bytes ("array[array_idx+1]" and "array[array_idx+2]"). The lowest 4 bits of the middle byte are isolated; the right byte is taken as a whole. Now we need to combine these parts into a 12-bit value. To do so, shift the 4 bits taken from the middle byte left by 8 bits, so they land just above the highest (8th) bit position of the right byte. Then, using the OR operation, we merge the parts.

If the element's index is even, the high 8 bits of the 12-bit value are located in the left byte, while the lowest 4 bits are located in the high 4 bits of the middle byte. We isolate the highest 4 bits of the middle byte by shifting it 4 bits right and then applying an AND operation, just to be sure nothing else is left there. We also take the left byte and shift its value 4 bits left, because it carries bits 11 to 4 (inclusive, counting from 0). Using the OR operation, we combine these two parts.

Setter

The setter calculates the triplet's address in the same way, and it operates on the left/right bytes in the same way. But it's not correct to just write to the middle byte, because that would destroy the bits belonging to the other element. So the common way is: load the byte, clear the bits you are about to write, merge your bits in, and leave the rest intact. Using the AND operation (& in C/C++), we drop everything except the other element's part. Using the OR operation (| in C/C++), we then update the middle byte.

Optimizing GCC 4.8.2 for x86-64

Let's see what optimizing GCC 4.8.2 for Linux x64 does; I added comments. Readers are sometimes confused because the instruction order is not "logical". That's OK: optimizing compilers take the CPU's out-of-order execution mechanisms into consideration, and sometimes reordered instructions perform better.

Getter

get_from_array:
; EDI=idx
; make a copy:
        mov     eax, edi
; calculate idx>>1:
        shr     eax
; determine if the element even or odd by isolation of the lowest bit:
        and     edi, 1
; calculate (idx>>1)*3.
; multiplication is slow in general, and here it can be replaced by a single shift-and-add;
; LEA is capable of doing both:
        lea     edx, [rax+rax*2]
; EDX now is (idx>>1)*3
; point EAX to the middle byte:
        lea     eax, [rdx+1]
; sign-extend EAX value to RAX:
        cdqe
; get middle byte into EAX:
        movzx   eax, BYTE PTR array[rax]
; finally check the value of the lowest bit in index.
; jump if index is odd (NE is the same as NZ (Not Zero)):
        jne     .L9

; this is even element, go on
; sign-extend EDX to RDX:
	movsx   rdx, edx
; shift middle byte 4 bits right:
        shr     al, 4
; AL now has 4 highest bits of middle byte
; load left byte into EDX:
        movzx   edx, BYTE PTR array[rdx]
; zero-extend AL (where the high 4 bits of the middle byte are now):
        movzx   eax, al
; EAX has the 4 high bits of the middle byte
; EDX now has left byte
; shift left byte 4 bits left:
        sal     edx, 4
; the 4 lowest bits in EDX are cleared to zero by the shift
; finally merge values:
        or      eax, edx
        ret
.L9:
; this is odd element, go on
; calculate address of right byte:
        add     edx, 2
; isolate lowest 4 bits in middle byte:
	and     eax, 15 ; 15=0xF
; sign-extend EDX (where address of right byte is) to RDX:
	movsx   rdx, edx
; shift the 4 bits we got from the middle byte left by 8 bits:
        sal     eax, 8
; load right byte:
        movzx   edx, BYTE PTR array[rdx]
; merge values:
        or      eax, edx
        ret

Setter

put_to_array:
; EDI=idx
; ESI=val
; copy idx to EAX:
        mov     eax, edi
; calculate idx>>1 and put it to EAX:
        shr     eax
; isolate lowest bit in idx:
        and     edi, 1
; calculate (idx>>1)*3 and put it into EAX:
        lea     eax, [rax+rax*2]
; jump if index is odd (NE is the same as NZ (Not Zero)):
        jne     .L5

; this is even element, go on
; sign-extend address of triplet in EAX to RDX:
        movsx   rdx, eax
; copy val value to ECX:
        mov     ecx, esi
; calculate address of middle byte:
        add     eax, 1
; sign-extend address in EAX to RAX:
        cdqe
; prepare left byte in ECX by shifting it:
        shr     ecx, 4
; prepare 4 bits for middle byte:
        sal     esi, 4
; put left byte:
        mov     BYTE PTR array[rdx], cl
; load middle byte (its address still in RAX):
        movzx   edx, BYTE PTR array[rax]
; drop high 4 bits:
        and     edx, 15 ; 15=0xF
; merge our data and low 4 bits which were there before:
        or      esi, edx
; put middle byte back:
        mov     BYTE PTR array[rax], sil
        ret
.L5:
; this is odd element, go on
; calculate address of middle byte and put it to ECX:
        lea     ecx, [rax+1]
; copy val value from ESI to EDI:
        mov     edi, esi
; calculate address of right byte:
        add     eax, 2
; get high 4 bits of input value by shifting it 8 bits right:
        shr     edi, 8
; sign-extend address in EAX into RAX:
        cdqe
; sign-extend address of middle byte in ECX to RCX:
        movsx   rcx, ecx
; load middle byte into EDX:
        movzx   edx, BYTE PTR array[rcx]
; drop low 4 bits in middle byte:
	and     edx, -16 ; -16=0xFFFFFFF0, i.e., clear the lowest 4 bits
; merge data from input val with what was in middle byte before:
        or      edx, edi
; store middle byte:
        mov     BYTE PTR array[rcx], dl
; store right byte. val is still in ESI, and SIL is the lowest 8-bit part of the ESI/RSI register:
        mov     BYTE PTR array[rax], sil
        ret

Other comments

All addresses in Linux x64 are 64-bit, so during pointer arithmetic all values should also be 64-bit. The code calculating offsets inside the array operates on 32-bit values (the input idx argument has type int, which is 32 bits wide), so these values must be converted to 64-bit addresses before the actual memory load/store. That's why there are a lot of sign-extending instructions (like CDQE, MOVSX) used for the conversion. Why extend the sign? By the C/C++ standards, pointer arithmetic can operate on negative values (it's possible to access an array using a negative index, like array[-123]). Since the GCC compiler cannot be sure that all indices are always positive, it adds sign-extending instructions.

Optimizing Keil 5.05 (Thumb mode)

Getter

The following code has the final OR operation placed in the function epilogue: it executes at the end of both branches, so some space can be saved.

get_from_array PROC
; R0 = idx
        PUSH     {r4,r5,lr}
        LSRS     r1,r0,#1
; R1 = R0>>1 = idx>>1
; R1 is now number of triplet
        LSLS     r2,r1,#1
; R2 = R1<<1 = (R0>>1)<<1 = R0&(~1) = idx&(~1)
; the operation (x>>1)<<1 looks senseless, but it's intended to clear the lowest bit in x (or idx)
        LSLS     r5,r0,#31
; R5 = R0<<31 = idx<<31
; thus, R5 will contain 0x80000000 if idx is odd, or zero if it is even
        ADDS     r4,r1,r2
; R4 = R1+R2 = idx>>1 + idx&(~1) = offset of the triplet's beginning (i.e., of the left byte)
; the expression looks tricky, but it's equivalent to multiplying by 1.5
        LSRS     r0,r0,#1
; R0 = R0>>1 = idx>>1
; load pointer to array:
        LDR      r3,|array|
; R3 = offset of array table
        LSLS     r1,r0,#1
; R1 = R0<<1 = (idx>>1)<<1 = idx&(~1)
        ADDS     r0,r0,r1
; R0 = idx>>1 + idx&(~1) = idx*1.5 = offset of the triplet's beginning
        ADDS     r1,r3,r0
; R1 = R3+R0 = offset of array + idx*1.5
; in other words, R1 now contains absolute address of triplet
; load middle byte (at address R1+1):
        LDRB     r2,[r1,#1]
; R2 = middle byte
; finally check if the idx even or odd:
        CMP      r5,#0
; jump if even:
        BEQ      |L0.92|
; idx is odd, go on:
        LSLS     r0,r2,#28
; R0 = R2<<28 = middle_byte<<28
; load right byte at R1+2:
        LDRB     r1,[r1,#2]
; R1 = right byte
        LSRS     r0,r0,#20
; R0 = R0>>20 = (R2<<28)>>20
; this is the same as (R2&0xF)<<8; the Keil compiler generates this shift pair to drop all bits except these 4
        B        |L0.98|
|L0.92|
; idx is even, go on:
; load left byte. R3 = address of array, R4 = offset of the triplet
        LDRB     r0,[r3,r4]
; R0 = left byte
        LSLS     r0,r0,#4
; R0 = left_byte<<4
; shift middle_byte in R2 4 bits right:
        LSRS     r1,r2,#4
; R1=middle_byte>>4
|L0.98|
; function epilogue
; current R0 value is shifted left byte or part of middle byte
; R1 is shifted part of middle byte or right byte
; now merge values and leave merged result in R0:
        ORRS     r0,r0,r1
; R0 = R0|R1
        POP      {r4,r5,pc}
        ENDP

There is at least one redundancy: idx*1.5 is calculated twice. As an exercise, the reader may try to rework the assembly function to make it shorter. Do not forget about testing!

Another thing to mention: it's hard to encode big constants in 16-bit Thumb instructions, so the Keil compiler often generates tricky code using shift instructions to achieve the same effect. For example, it's hard to encode AND Rdest, Rsrc, 1 or TST Rsrc, 1 in Thumb mode, so Keil generates code which shifts the input idx left by 31 bits and then checks whether the resulting value is zero or not.

Setter

The first half of the setter is very similar to the getter: the address of the triplet is calculated first, then a jump dispatches to the right handler's code.

put_to_array PROC
        PUSH     {r4,r5,lr}
; R0 = idx
; R1 = val
        LSRS     r2,r0,#1
; R2 = R0>>1 = idx>>1
        LSLS     r3,r2,#1
; R3 = R2<<1 = (idx>>1)<<1 = idx&(~1)
        LSLS     r4,r0,#31
; R4 = R0<<31 = idx<<31
        ADDS     r3,r2,r3
; R3 = R2+R3 = idx>>1 + idx&(~1) = idx*1.5
        LSRS     r0,r0,#1
; R0 = R0>>1 = idx>>1
        LDR      r2,|array|
; R2 = address of array
        LSLS     r5,r0,#1
; R5 = R0<<1 = (idx>>1)<<1 = idx&(~1)
        ADDS     r0,r0,r5
; R0 = R0+R5 = idx>>1 + idx&(~1) = idx*1.5
        ADDS     r0,r2,r0
; R0 = R2+R0 = array + idx*1.5, in other words, this is address of triplet
; finally test shifted lowest bit in idx:
        CMP      r4,#0
; jump if idx is even:
        BEQ      |L0.40|
; idx is odd, go on:
; load middle byte at R0+1:
        LDRB     r3,[r0,#1]
; R3 = middle_byte
        LSRS     r2,r1,#8
; R2 = R1>>8 = val>>8
        LSRS     r3,r3,#4
; R3 = R3>>4 = middle_byte>>4
        LSLS     r3,r3,#4
; R3 = R3<<4 = (middle_byte>>4)<<4
; these two shift operations are used to drop the low 4 bits of middle_byte
; merge high 4 bits in middle byte (in R3) with val>>8 (in R2):
        ORRS     r3,r3,r2
; R3 = updated middle byte
; store it at R0+1:
        STRB     r3,[r0,#1]
; store low 8 bits of val (val&0xFF) at R0+2:
        STRB     r1,[r0,#2]
        POP      {r4,r5,pc}
|L0.40|
; idx is even, go on:
        LSRS     r4,r1,#4
; R4 = R1>>4 = val>>4
; store val>>4 at R2+R3 (address of left byte or beginning of triplet):
        STRB     r4,[r2,r3]
; load middle byte at R0+1:
        LDRB     r3,[r0,#1]
; R3 = middle byte
        LSLS     r2,r1,#4
; R2 = R1<<4 = val<<4
        LSLS     r1,r3,#28
; R1 = R3<<28 = middle_byte<<28
        LSRS     r1,r1,#28
; R1 = R1>>28 = (middle_byte<<28)>>28
; these two shift operations are used to drop all bits in the register except the lowest 4
; merge lowest 4 bits (in R1) and val<<4 (in R2):
        ORRS     r1,r1,r2
; store it at R0+1:
        STRB     r1,[r0,#1]
        POP      {r4,r5,pc}
        ENDP

Optimizing Keil 5.05 (ARM mode)

Getter

The getter function for ARM mode has no conditional branches at all! This is thanks to the condition suffixes (like -EQ, -NE) which can be added to many instructions in ARM mode: an instruction is then executed only if the corresponding flag(s) are set.

Many arithmetic instructions in ARM mode can also carry a shift suffix like LSL #1 (meaning the last operand is shifted left by 1 bit).

get_from_array PROC
; R0 = idx
        LSR      r1,r0,#1
; R1 = R0>>1 = idx>>1
; check lowest bit in idx and set flags:
        TST      r0,#1
        ADD      r2,r1,r1,LSL #1
; R2 = R1+R1<<1 = R1+R1*2 = R1*3
; thanks to the shift suffix, a single instruction in ARM mode can multiply by 3
        LDR      r1,|array|
; R1 = address of array
        LSR      r0,r0,#1
; R0 = R0>>1 = idx>>1
        ADD      r0,r0,r0,LSL #1
; R0 = R0+R0<<1 = R0+R0*2 = R0*3 = (idx>>1)*3 = idx*1.5
        ADD      r0,r0,r1
; R0 = R0+R1 = array + idx*1.5, this is absolute address of triplet
; load middle byte at R0+1:
        LDRB     r3,[r0,#1]
; R3 = middle byte

; the following 3 instructions executed if index is odd, otherwise all of them are skipped:
; load right byte at R0+2:
        LDRBNE   r0,[r0,#2]
; R0 = right byte
        ANDNE    r1,r3,#0xf
; R1 = R3&0xF = middle_byte&0xF
        ORRNE    r0,r0,r1,LSL #8
; R0 = R0|(R1<<8) = right_byte | (middle_byte&0xF)<<8
; this is the result returned

; the following 3 instructions executed if index is even, otherwise all of them are skipped:
; load at R1+R2 = array + (idx>>1)*3 = array + idx*1.5
        LDRBEQ   r0,[r1,r2]
; R0 = left byte
        LSLEQ    r0,r0,#4
; R0 = R0<<4 = left_byte << 4
        ORREQ    r0,r0,r3,LSR #4
; R0 = R0 | R3>>4 = left_byte << 4 | middle_byte >> 4
; this is the result returned
        BX       lr
        ENDP

Setter

put_to_array PROC
; R0 = idx
; R1 = val
        LSR      r2,r0,#1
; R2 = R0>>1 = idx>>1
; check the lowest bit of idx and set flags:
        TST      r0,#1
        LDR      r12,|array|
; R12 = address of array
        LSR      r0,r0,#1
; R0 = R0>>1 = idx>>1
        ADD      r0,r0,r0,LSL #1
; R0 = R0+R0<<1 = R0+R0*2 = R0*3 = (idx>>1)*3 = idx/2*3 = idx*1.5
        ADD      r3,r2,r2,LSL #1
; R3 = R2+R2<<1 = R2+R2*2 = R2*3 = (idx>>1)*3 = idx/2*3 = idx*1.5
        ADD      r0,r0,r12
; R0 = R0+R12 = idx*1.5 + array
; jump if idx is even:
        BEQ      |L0.56|
; idx is odd, go on:
; load middle byte at R0+1:
        LDRB     r3,[r0,#1]
; R3 = middle byte
        AND      r3,r3,#0xf0
; R3 = R3&0xF0 = middle_byte&0xF0
        ORR      r2,r3,r1,LSR #8
; R2 = R3 | R1>>8 = middle_byte&0xF0 | val>>8
; store middle_byte&0xF0 | val>>8 at R0+1 (at the place of middle byte):
	STRB     r2,[r0,#1]
; store low 8 bits of val (or val&0xFF) at R0+2 (at the place of right byte):
        STRB     r1,[r0,#2]
        BX       lr
|L0.56|
; idx is even, go on:
        LSR      r2,r1,#4
; R2 = R1>>4 = val>>4
; store val>>4 at R12+R3 or array + idx*1.5 (place of left byte):
        STRB     r2,[r12,r3]
; load byte at R0+1 (middle byte):
        LDRB     r2,[r0,#1]
; R2 = middle_byte
; drop high 4 bits of middle byte:
        AND      r2,r2,#0xf
; R2 = R2&0xF = middle_byte&0xF
; update middle byte:
        ORR      r1,r2,r1,LSL #4
; R1 = R2 | R1<<4 = middle_byte&0xF | val<<4
; store updated middle byte at R0+1:
        STRB     r1,[r0,#1]
        BX       lr
ENDP

The value of idx*1.5 is calculated twice; this redundancy produced by the Keil compiler can be eliminated. As an exercise, you can rework the assembly function to make it shorter. Do not forget about testing!

(32-bit ARM) Comparison of code density in Thumb and ARM modes

Thumb mode in ARM CPUs was introduced to make instructions shorter (16 bits instead of the 32-bit instructions of ARM mode). But as we can see, it's hard to say whether it was worth it here: the code in ARM mode is shorter (though its individual instructions are wider).

Optimizing GCC 4.9.3 for ARM64

Getter

<get_from_array>:
; W0 = idx
   0:           lsr     w2, w0, #1
; W2 = W0>>1 = idx>>1
   4:           lsl     w1, w2, #2
; W1 = W2<<2 = (W0>>1)<<2 = (idx&(~1))<<1
   8:           sub     w1, w1, w2
; W1 = W1-W2 = (idx&(~1))<<1 - idx>>1 = idx*1.5
; now test lowest bit of idx and jump if it is present.
; (ARM64 has single instruction for these operations: TBNZ (Test and Branch Not Zero)).
   c:           tbnz    w0, #0, 30 <get_from_array+0x30>

; idx is even, go on:
  10:           adrp    x2, page of array
  14:           add     w3, w1, #0x1
; W3 = W1+1 = idx*1.5 + 1, i.e., offset of middle byte
  18:           add     x2, x2, offset of array within page
; load left byte at X2+W1 = array + idx*1.5 with sign-extension ("sxtw")
  1c:           ldrb    w0, [x2,w1,sxtw]
; load middle byte at X2+W3 = array + idx*1.5 + 1 with sign-extension ("sxtw")
  20:           ldrb    w1, [x2,w3,sxtw]
; W0 = left byte
; W1 = middle byte
  24:           lsl     w0, w0, #4
; W0 = W0<<4 = left_byte << 4
; merge parts:
  28:           orr     w0, w0, w1, lsr #4
; W0 = W0 | W1>>4 = left_byte << 4 | middle_byte >> 4
; value in W0 is returned
  2c:           ret

; idx is odd, go on:
  30:           adrp    x2, page of array
  34:           add     w0, w1, #0x1
; W0 = W1+1 = idx*1.5+1, i.e., offset of middle byte
  38:           add     x2, x2, offset of array within page
  3c:           add     w1, w1, #0x2
; W1 = W1+2 = idx*1.5+2, i.e., offset of right byte
; load middle byte at X2+W0 = array+idx*1.5+1 with sign-extension ("sxtw")
  40:           ldrb    w0, [x2,w0,sxtw]
; load right byte at X2+W1 = array+idx*1.5+2 with sign-extension ("sxtw")
  44:           ldrb    w1, [x2,w1,sxtw]
; W0 = middle byte
; W1 = right byte
  48:           ubfiz   w0, w0, #8, #4
; W0 = (middle_byte&0xF)<<8 (the low 4 bits of W0 are placed at bit position 8)
; now merge parts:
  4c:           orr     w0, w0, w1
; W0 = W0 | W1 = middle_byte<<8 | right_byte
; value in W0 is returned
  50:           ret

ARM64 has a cool new instruction, UBFIZ (Unsigned Bitfield Insert in Zero), which copies a specified number of low bits from the source register into the destination at a given bit position, with zeros to the left and right. It's an alias of another instruction, UBFM (Unsigned Bitfield Move). UBFM is also the instruction used internally in ARM64 for LSL/LSR (bit shifts).

Setter

<put_to_array>:
; W0 = idx
; W1 = val
  54:           lsr     w3, w0, #1
; W3 = W0>>1 = idx>>1
  58:           lsl     w2, w3, #2
; W2 = W3<<2 = (W0>>1)<<2 = (idx&(~1))<<1
  5c:           sub     w2, w2, w3
; W2 = W2-W3 = (idx&(~1))<<1 - idx>>1 = idx*1.5
; jump if lowest bit in idx is 1:
  60:           tbnz    w0, #0, 94 <put_to_array+0x40>
  
; idx is even, go on:
  64:           adrp    x0, page of array
  68:           add     w3, w2, #0x1
; W3 = W2+1 = idx*1.5+1, i.e., offset of middle byte
  6c:           add     x0, x0, offset of array within page
; X0 = address of array
  70:           lsr     w4, w1, #4
; W4 = W1>>4 = val>>4
  74:           sxtw    x3, w3
; X3 = sign-extended 32-bit W3 (idx*1.5+1) to 64-bit
; sign-extension is needed here because the value will be used as offset within array,
; and negative offsets are possible in standard C/C++
  78:           ubfiz   w1, w1, #4, #4
; W1 = (W1&0xF)<<4 = (val&0xF)<<4
; store W4 (val>>4) at X0+W2 = array + idx*1.5, i.e., address of left byte:
  7c:           strb    w4, [x0,w2,sxtw]
; load middle byte at X0+X3 = array+idx*1.5+1
  80:           ldrb    w2, [x0,x3]
; W2 = middle byte
  84:           and     w2, w2, #0xf
; W2 = W2&0xF = middle_byte&0xF (high 4 bits in middle byte are dropped)
; merge parts of new version of middle byte:
  88:           orr     w1, w2, w1
; W1 = W2|W1 = middle_byte&0xF | val<<4
; store W1 (new middle byte) at X0+X3 = array+idx*1.5+1
  8c:           strb    w1, [x0,x3]
  90:           ret

; idx is odd, go on:
  94:           add     w4, w2, #0x1
; W4 = W2+1 = idx*1.5+1, i.e., offset of middle byte
  98:           adrp    x0, page of array
  9c:           add     x0, x0, offset of array within page
; X0 = address of array
  a0:           add     w2, w2, #0x2
; W2 = W2+2 = idx*1.5+2, i.e., offset of right byte
  a4:           sxtw    x4, w4
; X4 = sign-extended 64-bit version of 32-bit W4
; load at X0+X4 = array+idx*1.5+1:
  a8:           ldrb    w3, [x0,x4]
; W3 = middle byte
  ac:           and     w3, w3, #0xfffffff0
; W3 = W3&0xFFFFFFF0 = middle_byte&0xFFFFFFF0, i.e., clear lowest 4 bits
  b0:           orr     w3, w3, w1, lsr #8
; W3 = W3|W1>>8 = middle_byte&0xFFFFFFF0 | val>>8
; store new version of middle byte at X0+X4=array+idx*1.5+1:
  b4:           strb    w3, [x0,x4]
; now store lowest 8 bits of val (in W1) at X0+W2=array+idx*1.5+2, i.e., place of right byte
; SXTW suffix means W2 will be sign-extended to 64-bit value before summing with X0
  b8:           strb    w1, [x0,w2,sxtw]
  bc:           ret

Optimizing GCC 4.4.5 for MIPS

Keep in mind that each instruction placed after a jump/branch instruction is executed before the branch takes effect. This is called the "branch delay slot" in RISC CPU lingo. To make things simpler, just mentally swap the instructions in each pair that starts with a branch or jump instruction.

MIPS has no flags register (apparently to simplify data dependencies), so branch instructions (like BNE) do both the comparison and the branching.

There is also GP (Global Pointer) setup code in the function prologue, which can be ignored for now.

Getter

get_from_array:
; $4 = idx
        srl     $2,$4,1
; $2 = $4>>1 = idx>>1
        lui     $28,%hi(__gnu_local_gp)
        sll     $3,$2,1
; $3 = $2<<1 = (idx>>1)<<1 = idx&(~1)
        andi    $4,$4,0x1
; $4 = $4&1 = idx&1
        addiu   $28,$28,%lo(__gnu_local_gp)

; jump if $4 (idx&1) is not zero (if idx is odd):
        bne     $4,$0,$L6
; $2 = $3+$2 = idx>>1 + idx&(~1) = idx*1.5
	addu    $2,$3,$2 ; branch delay slot - this instruction executed before BNE

; idx is even, go on:
        lw      $3,%got(array)($28)
; $3 = array
        nop
        addu    $2,$3,$2
; $2 = $3+$2 = array + idx*1.5
; load byte at $2+0 = array + idx*1.5 (left byte):
        lbu     $3,0($2)
; $3 = left byte
; load byte at $2+1 = array + idx*1.5+1 (middle byte):
        lbu     $2,1($2)
; $2 = middle byte
        sll     $3,$3,4
; $3 = $3<<4 = left_byte<<4
        srl     $2,$2,4
; $2 = $2>>4 = middle_byte>>4
        j       $31
        or      $2,$2,$3 ; branch delay slot - this instruction executed before J
; $2 = $2|$3 = middle_byte>>4 | left_byte<<4
; $2=returned result

$L6:
; idx is odd, go on:
        lw      $3,%got(array)($28)
; $3 = array
        nop
        addu    $2,$3,$2
; $2 = $3+$2 = array + idx*1.5
; load byte at $2+1 = array + idx*1.5 + 1 (middle byte)
        lbu     $4,1($2)
; $4 = middle byte
; load byte at $2+2 = array + idx*1.5 + 2 (right byte)
        lbu     $3,2($2)
; $3 = right byte
        andi    $2,$4,0xf
; $2 = $4&0xF = middle_byte&0xF
        sll     $2,$2,8
; $2 = $2<<8 = middle_byte&0xF << 8
        j       $31        
	or      $2,$2,$3 ; branch delay slot - this instruction executed before J
; $2 = $2|$3 = middle_byte&0xF << 8 | right byte
; $2=returned result
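
To double-check the logic, the getter can be sketched in C like this. This is my reconstruction, not the original source: the `ARRAY` name and its size are assumptions, chosen only for illustration.

```c
#include <stdint.h>

// 0x1000 12-bit elements packed into byte triplets (an assumed size)
uint8_t ARRAY[(0x1000/2)*3];

uint16_t get_from_array(unsigned idx)
{
    uint8_t *tmp = ARRAY + (idx >> 1)*3;   // idx*1.5: base of the triplet
    if ((idx & 1) == 0)
        // even: the left byte holds the high 8 bits,
        // the high nibble of the middle byte holds the low 4 bits
        return (tmp[0] << 4) | (tmp[1] >> 4);
    else
        // odd: the low nibble of the middle byte holds the high 4 bits,
        // the right byte holds the low 8 bits
        return ((tmp[1] & 0xF) << 8) | tmp[2];
}
```

Compare the two `return` expressions with the even/odd paths of the MIPS listing: the shifts and masks map one-to-one onto SLL, SRL, ANDI and OR.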

Setter

put_to_array:
; $4=idx
; $5=val
        srl     $2,$4,1
; $2 = $4>>1 = idx>>1
        lui     $28,%hi(__gnu_local_gp)
        sll     $3,$2,1
; $3 = $2<<1 = (idx>>1)<<1 = idx&(~1)
        andi    $4,$4,0x1
; $4 = $4&1 = idx&1
        addiu   $28,$28,%lo(__gnu_local_gp)
; jump if $4=idx&1 is not zero (i.e., if idx is odd):
        bne     $4,$0,$L11
        addu    $2,$3,$2 ; branch delay slot, this instruction is executed before BNE
; $2 = $3+$2 = idx&(~1) + idx>>1 = idx*1.5

; idx is even, go on
        lw      $3,%got(array)($28)
; $3 = array
        addiu   $4,$2,1
; $4 = $2+1 = idx*1.5+1, i.e., offset of middle byte in array
        srl     $6,$5,4
; $6 = $5>>4 = val>>4
        addu    $2,$3,$2
; $2 = $3+$2 = array + idx*1.5 (offset of left byte)
; store $6 (val>>4) as left byte:
        sb      $6,0($2)
        addu    $2,$3,$4
; $2 = $3+$4 = array + idx*1.5 + 1 (absolute address of middle byte)
; load middle byte at $2+0 = array + idx*1.5 + 1
        lbu     $3,0($2)
; $3 = middle byte
        andi    $5,$5,0xf
; $5 = $5&0xF = val&0xF
        andi    $3,$3,0xf
; $3 = $3&0xF = middle_byte&0xF
        sll     $5,$5,4
; $5 = $5<<4 = (val&0xF)<<4
        or      $5,$3,$5
; $5 = $3|$5 = middle_byte&0xF | (val&0xF)<<4 (new version of middle byte)
        j       $31
; store $5 (new middle byte) at $2 (array + idx*1.5 + 1)
        sb      $5,0($2) ; branch delay slot, this instruction is executed before J

$L11:
; idx is odd, go on
        lw      $4,%got(array)($28)
; $4 = array
        addiu   $3,$2,1
; $3 = $2+1 = idx*1.5+1 (offset of middle byte)
        addu    $3,$4,$3
; $3 = $4+$3 = array + idx*1.5+1
; load middle byte at $3 (array + idx*1.5+1)
        lbu     $6,0($3)
; $6 = middle byte
        srl     $7,$5,8
; $7 = $5>>8 = val>>8
        andi    $6,$6,0xf0
; $6 = $6&0xF0 = middle_byte&0xF0
        or      $6,$6,$7
; $6 = $6|$7 = middle_byte&0xF0 | val>>8
        addu    $2,$4,$2
; store updated middle byte at $3 (array + idx*1.5+1)
        sb      $6,0($3)
        j       $31
; store low 8 bits of val at $2+2=idx*1.5+2 (place of right byte)
        sb      $5,2($2) ; branch delay slot, this instruction is executed before J
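
Likewise, the setter corresponds to C logic along these lines (again my sketch, with the `ARRAY` name and size assumed). Note how each path preserves the nibble of the middle byte that belongs to the neighboring element:

```c
#include <stdint.h>

uint8_t ARRAY[(0x1000/2)*3];   // assumed size, for illustration

void put_to_array(unsigned idx, uint16_t val)
{
    uint8_t *tmp = ARRAY + (idx >> 1)*3;   // idx*1.5: base of the triplet
    if ((idx & 1) == 0)
    {
        // even: high 8 bits of val go to the left byte...
        tmp[0] = val >> 4;
        // ...low 4 bits go to the high nibble of the middle byte,
        // keeping its low nibble intact (it belongs to the odd neighbor)
        tmp[1] = (tmp[1] & 0xF) | ((val & 0xF) << 4);
    }
    else
    {
        // odd: high 4 bits of val go to the low nibble of the middle byte,
        // keeping its high nibble intact (it belongs to the even neighbor)
        tmp[1] = (tmp[1] & 0xF0) | (val >> 8);
        // low 8 bits of val go to the right byte
        tmp[2] = val & 0xFF;
    }
}
```

The read-modify-write of the middle byte is exactly the LBU/ANDI/OR/SB sequence in both branches of the listing above.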

Difference from the real FAT12

The real FAT12 table is slightly different: https://en.wikipedia.org/wiki/Design_of_the_FAT_file_system#Cluster_map:

For even elements:

 +0 +1 +2
|23|.1|..|..

For odd elements:

 +0 +1 +2
|..|3.|12|..

Here are FAT12-related functions in Linux Kernel: fat12_ent_get(), fat12_ent_put().

Nevertheless, I did it this way because the values are better visible and recognizable in a byte-level GDB dump, for the sake of demonstration.
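
A nice property of the real layout is that two adjacent bytes can be fetched as one little-endian 16-bit value and then masked or shifted, which is roughly what the kernel functions do. Here is my own sketch of such a getter (not the kernel's code):

```c
#include <stdint.h>

// read a 12-bit entry from a table packed in the real FAT12 layout
uint16_t fat12_get(const uint8_t *fat, unsigned idx)
{
    unsigned offset = idx + (idx >> 1);   // idx*1.5
    // fetch two adjacent bytes as a little-endian 16-bit value
    uint16_t w = fat[offset] | (fat[offset+1] << 8);
    // even entries occupy the low 12 bits of w, odd entries the high 12 bits
    return (idx & 1) ? (w >> 4) : (w & 0xFFF);
}
```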

Exercise

Perhaps there is a way to store the data so that the getter/setter functions would be faster. If we place the values this way:

(Even elements)

 +0 +1 +2
|23|1.|..|..

(Odd elements)

 +0 +1 +2
|..|.1|23|..

This storage scheme would allow eliminating at least one shift operation. As an exercise, you may rework my C/C++ code in this way and see what the compiler produces.
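
Not a full solution, but a sketch of the even-element read under the proposed layout shows where the saving comes from (the function name is mine):

```c
#include <stdint.h>

// even-element getter under the proposed layout:
// byte 0 holds the low 8 bits, the high nibble of byte 1 holds bits 11..8
uint16_t get_even(const uint8_t *tmp)
{
    // one shift instead of two: the low byte needs no shifting at all
    return tmp[0] | ((tmp[1] & 0xF0) << 4);
}
```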

Summary

Bit shifts ("<<" and ">>" in C/C++, SHL/SHR/SAL/SAR in x86, LSL/LSR in ARM, SLL/SRL in MIPS) are used to place bit(s) at a specific position.

The AND operation ("&" in C/C++, AND in x86/ARM) is used to drop unneeded bits, e.g., while isolating a bit field.

The OR operation ("|" in C/C++, OR in x86/ARM) is used to merge or combine several values into one. One input value must have zeroes at the places where the other value has its information-carrying bits.

ARM64 has new instructions UBFM and its aliases UBFIZ/UBFX to move specific bits from one register to another.

Conclusion

FAT12 is hardly used anywhere nowadays, but if you have space constraints and need to store values limited to 12 bits, you may consider a tightly packed array in the manner of the FAT12 table.

Further reading

More examples like this (for x86/x64, ARM/ARM64, and MIPS) can be found in my book: //beginners.re/.



The page last updated on 09-October-2016