## (Beginner level) packing 12-bit values into an array using bit operations (x64, ARM/ARM64, MIPS)

The note below has been copy-pasted into the Reverse Engineering for Beginners book

### Introduction

File Allocation Table (FAT) was a widely popular filesystem. Hard to believe, but it's still used on flash drives, perhaps for reasons of simplicity and compatibility. The FAT table itself is an array of elements, each of which points to the next cluster number of a file (FAT supports files scattered across the whole disk). That implies that the maximum value of each element is the maximum number of clusters on the disk. In the MS-DOS era, most hard disks had FAT16 filesystems, because the cluster number could be packed into a 16-bit value. Hard disks then became cheaper, and FAT32 emerged, where a 32-bit value is allocated for the cluster number. But there was also a time when floppy diskettes were not that cheap and had little space, so FAT12 was used on them, in order to pack all filesystem structures as tightly as possible: https://en.wikipedia.org/wiki/File_Allocation_Table#FAT12.

So the FAT table in the FAT12 filesystem is an array where each pair of subsequent 12-bit elements is stored in 3 bytes (a triplet). Here is how six 12-bit values (AAA, BBB, CCC, DDD, EEE and FFF) are packed into 9 bytes:

 +0 +1 +2 +3 +4 +5 +6 +7 +8
|AA|AB|BB|CC|CD|DD|EE|EF|FF|...


Pushing values into such an array and pulling them back is a good example of bit-twiddling operations (in both C/C++ and low-level machine code), so that's why I'll use FAT12 as an example here.

### Data structure

We can quickly observe that each byte triplet stores two 12-bit values: the first one is located at the left side, the second one at the right:

 +0 +1 +2
|11|12|22|...


We will pack nibbles (4-bit chunks) in the following way (1 - highest nibble, 3 - lowest):

(Even)

 +0 +1 +2
|12|3.|..|...


(Odd)

 +0 +1 +2
|..|.1|23|...


### The algorithm

So the algorithm can be as follows: if the element's index is even, put it at the left side; if the index is odd, place it at the right side. The middle byte: if the element's index is even, place part of the value in its high 4 bits; if odd, in its low 4 bits. But first, find the right triplet, which is easy: $\frac{index}{2}$. Finding the offset of the right byte in the array of bytes is also easy: $\frac{index}{2} \cdot 3$ or $index \cdot \frac{3}{2}$ or just $index \cdot 1.5$.

Pulling values from the array: if the index is even, get the leftmost and middle bytes and combine their parts. If the index is odd, get the middle and rightmost bytes and combine them. Do not forget to mask off the unneeded bits of the middle byte.

Pushing values is almost the same, but be careful not to overwrite the other element's bits in the middle byte: modify only your own.

### The C/C++ code

#include <stdio.h>
#include <stdint.h>
#include <assert.h>

#define ARRAY_SIZE (0x1000/2*3)
uint8_t array[ARRAY_SIZE]; // big enough array of triplets

unsigned int get_from_array (unsigned int idx)
{
	// find the right triplet in the array:
	int triplet=(idx>>1);
	int array_idx=triplet*3;
	//assert (array_idx<ARRAY_SIZE);

	if (idx&1)
	{
		// this is an odd element

		// compose the value using the middle and rightmost bytes:
		return ((array[array_idx+1]&0xF) << 8)|array[array_idx+2];
	}
	else
	{
		// this is an even element

		// compose the value using the leftmost and middle bytes:
		return array[array_idx]<<4 | ((array[array_idx+1]>>4)&0xF);
	};
};

void put_to_array (unsigned int idx, unsigned int val)
{
	//assert (val<=0xFFF);

	// find the right triplet in the array:
	int triplet=(idx>>1);
	int array_idx=triplet*3;
	//assert (array_idx<ARRAY_SIZE);

	if (idx&1)
	{
		// this is an odd element
		// put the value into the middle and rightmost bytes:

		// decompose the value to be stored:
		uint8_t val_lowest_byte=val&0xFF; // isolate lowest 8 bits
		uint8_t val_highest_nibble=val>>8; // no need to apply &0xF, we already know that val<=0xFFF

		// clear low 4 bits in the middle byte:
		array[array_idx+1]=array[array_idx+1]&0xF0;

		array[array_idx+1]=array[array_idx+1]|val_highest_nibble;
		array[array_idx+2]=val_lowest_byte;
	}
	else
	{
		// this is an even element
		// put the value into the leftmost and middle bytes:

		// decompose the value to be stored:
		uint8_t val_highest_byte=val>>4;
		uint8_t val_lowest_nibble=val&0xF;

		array[array_idx]=val_highest_byte;

		// clear high 4 bits in the middle byte:
		array[array_idx+1]=array[array_idx+1]&0xF;
		array[array_idx+1]=array[array_idx+1]|val_lowest_nibble<<4;
	};
};

int main()
{
	int i;

	// test
	for (i=0; i<0x1000; i++)
	{
		put_to_array(i, i);
	};

	for (i=0; i<0x1000; i++)
	{
		assert(get_from_array(i)==i);
	};
	//put_to_array(0x1000, 1); // will fail due to assert()

	// print triplets:
	for (i=0; i<0x1000/2; i++)
		printf ("0x%02X%02X%02X\n", array[i*3], array[i*3+1], array[i*3+2]);
};


During the test, all 12-bit elements are filled with values in the 0..0xFFF range. Here is a dump of all triplets, 3 bytes per line:

0x000001
0x002003
0x004005
0x006007
0x008009
0x00A00B
0x00C00D
0x00E00F
0x010011
0x012013
0x014015

...

0xFECFED
0xFEEFEF
0xFF0FF1
0xFF2FF3
0xFF4FF5
0xFF6FF7
0xFF8FF9
0xFFAFFB
0xFFCFFD
0xFFEFFF


Here is also a GDB byte-level dump of 300 bytes (or 100 triplets) starting at 512/2*3, i.e., the address where the 512th element (0x200) begins. I added the square brackets in my text editor to show the triplets explicitly. Take notice of the middle bytes, where one element ends and the next one starts. In other words, each middle byte holds the lowest 4 bits of an even element and the highest 4 bits of an odd element.

(gdb) x/300xb array+512/2*3
0x601380 <array+768>: [ 0x20    0x02    0x01 ][ 0x20    0x22    0x03 ][ 0x20    0x42
0x601388 <array+776>:   0x05 ][ 0x20    0x62    0x07 ][ 0x20    0x82    0x09 ][ 0x20
0x601390 <array+784>:   0xa2    0x0b ][ 0x20    0xc2    0x0d ][ 0x20    0xe2    0x0f ]
0x601398 <array+792>: [ 0x21    0x02    0x11 ][ 0x21    0x22    0x13 ][ 0x21    0x42
0x6013a0 <array+800>:   0x15 ][ 0x21    0x62    0x17 ][ 0x21    0x82    0x19 ][ 0x21
0x6013a8 <array+808>:   0xa2    0x1b ][ 0x21    0xc2    0x1d ][ 0x21    0xe2    0x1f ]
0x6013b0 <array+816>: [ 0x22    0x02    0x21 ][ 0x22    0x22    0x23 ][ 0x22    0x42
0x6013b8 <array+824>:   0x25 ][ 0x22    0x62    0x27 ][ 0x22    0x82    0x29 ][ 0x22
0x6013c0 <array+832>:   0xa2    0x2b ][ 0x22    0xc2    0x2d ][ 0x22    0xe2    0x2f ]
0x6013c8 <array+840>: [ 0x23    0x02    0x31 ][ 0x23    0x22    0x33 ][ 0x23    0x42
0x6013d0 <array+848>:   0x35 ][ 0x23    0x62    0x37 ][ 0x23    0x82    0x39 ][ 0x23
0x6013d8 <array+856>:   0xa2    0x3b ][ 0x23    0xc2    0x3d ][ 0x23    0xe2    0x3f ]
0x6013e0 <array+864>: [ 0x24    0x02    0x41 ][ 0x24    0x22    0x43 ][ 0x24    0x42
0x6013e8 <array+872>:   0x45 ][ 0x24    0x62    0x47 ][ 0x24    0x82    0x49 ][ 0x24
0x6013f0 <array+880>:   0xa2    0x4b ][ 0x24    0xc2    0x4d ][ 0x24    0xe2    0x4f ]
0x6013f8 <array+888>: [ 0x25    0x02    0x51 ][ 0x25    0x22    0x53 ][ 0x25    0x42
0x601400 <array+896>:   0x55 ][ 0x25    0x62    0x57 ][ 0x25    0x82    0x59 ][ 0x25
0x601408 <array+904>:   0xa2    0x5b ][ 0x25    0xc2    0x5d ][ 0x25    0xe2    0x5f ]
0x601410 <array+912>: [ 0x26    0x02    0x61 ][ 0x26    0x22    0x63 ][ 0x26    0x42
0x601418 <array+920>:   0x65 ][ 0x26    0x62    0x67 ][ 0x26    0x82    0x69 ][ 0x26
0x601420 <array+928>:   0xa2    0x6b ][ 0x26    0xc2    0x6d ][ 0x26    0xe2    0x6f ]
0x601428 <array+936>: [ 0x27    0x02    0x71 ][ 0x27    0x22    0x73 ][ 0x27    0x42
0x601430 <array+944>:   0x75 ][ 0x27    0x62    0x77 ][ 0x27    0x82    0x79 ][ 0x27
0x601438 <array+952>:   0xa2    0x7b ][ 0x27    0xc2    0x7d ][ 0x27    0xe2    0x7f ]
0x601440 <array+960>: [ 0x28    0x02    0x81 ][ 0x28    0x22    0x83 ][ 0x28    0x42
0x601448 <array+968>:   0x85 ][ 0x28    0x62    0x87 ][ 0x28    0x82    0x89 ][ 0x28
0x601450 <array+976>:   0xa2    0x8b ][ 0x28    0xc2    0x8d ][ 0x28    0xe2    0x8f ]
0x601458 <array+984>: [ 0x29    0x02    0x91 ][ 0x29    0x22    0x93 ][ 0x29    0x42
0x601460 <array+992>:   0x95 ][ 0x29    0x62    0x97 ][ 0x29    0x82    0x99 ][ 0x29
0x601468 <array+1000>:  0xa2    0x9b ][ 0x29    0xc2    0x9d ][ 0x29    0xe2    0x9f ]
0x601470 <array+1008>: [ 0x2a   0x02    0xa1 ][ 0x2a    0x22    0xa3 ][ 0x2a    0x42
0x601478 <array+1016>:  0xa5 ][ 0x2a    0x62    0xa7 ][ 0x2a    0x82    0xa9 ][ 0x2a
0x601480 <array+1024>:  0xa2    0xab ][ 0x2a    0xc2    0xad ][ 0x2a    0xe2    0xaf ]
0x601488 <array+1032>: [ 0x2b   0x02    0xb1 ][ 0x2b    0x22    0xb3 ][ 0x2b    0x42
0x601490 <array+1040>:  0xb5 ][ 0x2b    0x62    0xb7 ][ 0x2b    0x82    0xb9 ][ 0x2b
0x601498 <array+1048>:  0xa2    0xbb ][ 0x2b    0xc2    0xbd ][ 0x2b    0xe2    0xbf ]
0x6014a0 <array+1056>: [ 0x2c   0x02    0xc1 ][ 0x2c    0x22    0xc3 ][ 0x2c    0x42
0x6014a8 <array+1064>:  0xc5 ][ 0x2c    0x62    0xc7 ]


### How it works

Let array be a global buffer: this makes access to it simpler.

#### Getter

Let's start with the function that gets values from the array, because it's simpler.

Finding the triplet's number is just dividing the input index by 2, and we can do that just by shifting right by 1 bit. This is a very common way of dividing/multiplying by numbers of the form $2^n$.

I can demonstrate how it works. Let's say you need to divide 123 by 10. Just drop the last digit (3, which is the remainder of the division) and 12 is left. Likewise, division by 2 is just dropping the least significant bit, and dropping can be done by shifting right.

Now the function must decide if the input index is even (so the 12-bit value is placed at the left) or odd (at the right). The simplest way to do so is to isolate the lowest bit (x&1). If it's zero, our index is even, otherwise it's odd.

This fact can be illustrated easily, take a look at the lowest bit:

decimal binary even/odd

0       0000   even
1       0001   odd
2       0010   even
3       0011   odd
4       0100   even
5       0101   odd
6       0110   even
7       0111   odd
8       1000   even
9       1001   odd
10      1010   even
11      1011   odd
12      1100   even
13      1101   odd
14      1110   even
15      1111   odd
...


Zero is also an even number; in the two's complement system it's located between two odd numbers (-1 and 1).

For math geeks, I could also say that evenness/oddness is the remainder of division by 2. Division of a number by 2 is merely dropping the last bit, which is that remainder. Well, we do not even need to shift here, just isolate the lowest bit.

If the element's index is odd, we take the middle and right bytes ("array[array_idx+1]" and "array[array_idx+2]"). The lowest 4 bits of the middle byte are isolated; the right byte is taken as a whole. Now we need to combine these parts into a 12-bit value. To do so, shift the 4 bits from the middle byte left by 8 bits, so they land right above the 8 bits of the right byte. Then, using an OR operation, we just merge the parts.

If the element's index is even, the high 8 bits of the 12-bit value are located in the left byte, while the lowest 4 bits are located in the high 4 bits of the middle byte. We isolate the highest 4 bits of the middle byte by shifting it 4 bits right and then applying AND, just to be sure that nothing else is left there. We also take the left byte and shift its value 4 bits left, because it holds bits 11 to 4 (inclusive, starting at 0). Using OR, we combine these two parts.

#### Setter

The setter calculates the triplet's address in the same way. It also operates on the left/right bytes in the same way. But it's not correct to just write to the middle byte, because a plain write would destroy the information related to the other element. So the common way is: load the byte, clear the bits you are going to write to, merge in your part, and leave the other part intact. Using an AND operation (& in C/C++), we drop everything except the other element's part. Using an OR operation (| in C/C++), we then update the middle byte.

### Optimizing GCC 4.8.2 for x86-64

Let's see what optimizing GCC 4.8.2 for Linux x64 does. I added comments. Sometimes readers are confused because the instruction order is not logical. It's OK: optimizing compilers take the CPU's out-of-order execution mechanisms into consideration, and sometimes swapped instructions perform better.

#### Getter

get_from_array:
; EDI=idx
; make a copy:
mov     eax, edi
; calculate idx>>1:
shr     eax
; determine if the element even or odd by isolation of the lowest bit:
and     edi, 1
; calculate (idx>>1)*3.
; multiplication is slow in general, and can be replaced by one shift plus one addition;
; LEA is capable of doing both:
lea     edx, [rax+rax*2]
; EDX now is (idx>>1)*3
; point EAX to the middle byte:
lea     eax, [rdx+1]
; sign-extend EAX value to RAX:
cdqe
; get middle byte into EAX:
movzx   eax, BYTE PTR array[rax]
; finally check the value of the lowest bit in index.
; jump if index is odd (NE is the same as NZ (Not Zero)):
jne     .L9

; this is even element, go on
; sign-extend EDX to RDX:
movsx   rdx, edx
; shift middle byte 4 bits right:
shr     al, 4
; AL now has 4 highest bits of middle byte
; load left byte into EDX:
movzx   edx, BYTE PTR array[rdx]
; zero-extend AL (where high 4 bits of middle byte are now) to EAX:
movzx   eax, al
; EAX has 4 high bits of middle byte
; EDX now has left byte
; shift left byte 4 bits left:
sal     edx, 4
; 4 lowest bits in EDX after shifting are zero
; finally merge values:
or      eax, edx
ret
.L9:
; this is odd element, go on
; calculate address of right byte:
add     edx, 2
; isolate lowest 4 bits in middle byte:
and     eax, 15 ; 15=0xF
; sign-extend EDX (where address of right byte is) to RDX:
movsx   rdx, edx
; shift 4 bits we got from middle bytes 8 bits left:
sal     eax, 8
movzx   edx, BYTE PTR array[rdx]
; merge values:
or      eax, edx
ret


#### Setter

put_to_array:
; EDI=idx
; ESI=val
; copy idx to EAX:
mov     eax, edi
; calculate idx>>1 and put it to EAX:
shr     eax
; isolate lowest bit in idx:
and     edi, 1
; calculate (idx>>1)*3 and put it to EAX:
lea     eax, [rax+rax*2]
; jump if index is odd (NE is the same as NZ (Not Zero)):
jne     .L5

; this is even element, go on
; sign-extend address of triplet in EAX to RDX:
movsx   rdx, eax
; copy val value to ECX:
mov     ecx, esi
; calculate address of middle byte:
add     eax, 1
; sign-extend address in EAX to RAX:
cdqe
; prepare left byte in ECX by shifting it:
shr     ecx, 4
; prepare 4 bits for middle byte:
sal     esi, 4
; put left byte:
mov     BYTE PTR array[rdx], cl
movzx   edx, BYTE PTR array[rax]
; drop high 4 bits:
and     edx, 15 ; 15=0xF
; merge our data and low 4 bits which were there before:
or      esi, edx
; put middle byte back:
mov     BYTE PTR array[rax], sil
ret
.L5:
; this is odd element, go on
; calculate address of middle byte and put it to ECX:
lea     ecx, [rax+1]
; copy val value from ESI to EDI:
mov     edi, esi
; calculate address of right byte:
add     eax, 2
; get high 4 bits of input value by shifting it 8 bits right:
shr     edi, 8
; sign-extend address in EAX into RAX:
cdqe
; sign-extend address of middle byte in ECX to RCX:
movsx   rcx, ecx
; load middle byte into EDX:
movzx   edx, BYTE PTR array[rcx]
; drop low 4 bits in middle byte:
and     edx, -16 ; -16=0xF0
; merge data from input val with what was in middle byte before:
or      edx, edi
; store middle byte:
mov     BYTE PTR array[rcx], dl
; store right byte. val is still in ESI and SIL is a part of ESI register which has lowest 8 bits:
mov     BYTE PTR array[rax], sil
ret


All addresses in Linux x64 are 64-bit, so during pointer arithmetic all values should also be 64-bit. The code calculating offsets inside the array operates on 32-bit values (the input idx argument has type int, which is 32 bits wide), so these values must be converted into 64-bit addresses before the actual memory load/store. That's why there are a lot of sign-extending instructions (like CDQE, MOVSX) used for conversion. Why extend the sign? By the C/C++ standards, pointer arithmetic can operate on negative values (it's possible to access an array using a negative index like array[-123]). Since the GCC compiler cannot be sure that all indices are always positive, it adds sign-extending instructions.

### Optimizing Keil 5.05 (Thumb mode)

#### Getter

The following code has the final OR operation in the function epilogue. Indeed, it executes at the end of both branches, so it's possible to save some space.

get_from_array PROC
; R0 = idx
PUSH     {r4,r5,lr}
LSRS     r1,r0,#1
; R1 = R0>>1 = idx>>1
; R1 is now number of triplet
LSLS     r2,r1,#1
; R2 = R1<<1 = (R0>>1)<<1 = R0&(~1) = idx&(~1)
; the operation (x>>1)<<1 looks senseless, but it's intended to clear the lowest bit in x (or idx)
LSLS     r5,r0,#31
; R5 = R0<<31 = idx<<31
; thus, R5 will contain 0x80000000 in case of odd idx, or zero if even
ADDS     r4,r1,r2
; R4 = R1+R2 = idx>>1 + idx&(~1) = offset of triplet begin (or offset of left byte)
; the expression looks tricky, but it's equal to multiplication by 1.5
LSRS     r0,r0,#1
; R0 = R0>>1 = idx>>1
LDR      r3,|array|
; R3 = offset of array table
LSLS     r1,r0,#1
; R1 = R0<<1 = (idx>>1)<<1 = idx&(~1)
ADDS     r0,r0,r1
; R0 = idx>>1 + idx&(~1) = idx*1.5 = offset of triplet begin
ADDS     r1,r3,r0
; R1 = R3+R0 = offset of array + idx*1.5
; in other words, R1 now contains absolute address of triplet
LDRB     r2,[r1,#1]
; R2 = middle byte
; finally check if the idx even or odd:
CMP      r5,#0
; jump if even:
BEQ      |L0.92|
; idx is odd, go on:
LSLS     r0,r2,#28
; R0 = R2<<28 = middle_byte<<28
; load right byte at R1+2:
LDRB     r1,[r1,#2]
; R1 = right byte
LSRS     r0,r0,#20
; R0 = R0>>20 = (R2<<28)>>20 = (middle_byte&0xF)<<8
; the two shifts together compute (R2&0xF)<<8; the Keil compiler generates them to drop all bits except the lowest 4
B        |L0.98|
|L0.92|
; idx is even, go on:
LDRB     r0,[r3,r4]
; R0 = left byte
LSLS     r0,r0,#4
; R0 = left_byte<<4
; shift middle_byte in R2 4 bits right:
LSRS     r1,r2,#4
; R1=middle_byte>>4
|L0.98|
; function epilogue
; current R0 value is shifted left byte or part of middle byte
; R1 is shifted part of middle byte or right byte
; now merge values and leave merged result in R0:
ORRS     r0,r0,r1
; R0 = R0|R1
POP      {r4,r5,pc}
ENDP


There is at least some redundancy: idx*1.5 is calculated twice. As an exercise, the reader may try to rework the assembly function to make it shorter. Do not forget about testing!

Another thing to mention: it's hard to load big constants using 16-bit Thumb instructions, so the Keil compiler often generates tricky code using shift instructions to achieve the same effect. For example, it's tricky to encode AND Rdest, Rsrc, 1 or TST Rsrc, 1 in Thumb mode, so Keil generates code which shifts the input idx left by 31 bits and then checks if the resulting value is zero or not.

#### Setter

The first half of the setter's code is very similar to the getter's: the address of the triplet is calculated first, then a jump occurs in order to dispatch to the right branch.

put_to_array PROC
PUSH     {r4,r5,lr}
; R0 = idx
; R1 = val
LSRS     r2,r0,#1
; R2 = R0>>1 = idx>>1
LSLS     r3,r2,#1
; R3 = R2<<1 = (idx>>1)<<1 = idx&(~1)
LSLS     r4,r0,#31
; R4 = R0<<31 = idx<<31
ADDS     r3,r2,r3
; R3 = R2+R3 = idx>>1 + idx&(~1) = idx*1.5
LSRS     r0,r0,#1
; R0 = R0>>1 = idx>>1
LDR      r2,|array|
; R2 = address of array
LSLS     r5,r0,#1
; R5 = R0<<1 = (idx>>1)<<1 = idx&(~1)
ADDS     r0,r0,r5
; R0 = R0+R5 = idx>>1 + idx&(~1) = idx*1.5
ADDS     r0,r2,r0
; R0 = R2+R0 = array + idx*1.5, in other words, this is address of triplet
; finally test shifted lowest bit in idx:
CMP      r4,#0
; jump if idx is even:
BEQ      |L0.40|
; idx is odd, go on:
; load middle byte at R0+1:
LDRB     r3,[r0,#1]
; R3 = middle_byte
LSRS     r2,r1,#8
; R2 = R1>>8 = val>>8
LSRS     r3,r3,#4
; R3 = R3>>4 = middle_byte>>4
LSLS     r3,r3,#4
; R3 = R3<<4 = (middle_byte>>4)<<4
; these two shift operations are used to drop the low 4 bits of middle_byte
; merge high 4 bits in middle byte (in R3) with val>>8 (in R2):
ORRS     r3,r3,r2
; R3 = updated middle byte
; store it at R0+1:
STRB     r3,[r0,#1]
; store low 8 bits of val (val&0xFF) at R0+2:
STRB     r1,[r0,#2]
POP      {r4,r5,pc}
|L0.40|
; idx is even, go on:
LSRS     r4,r1,#4
; R4 = R1>>4 = val>>4
; store val>>4 at R2+R3 (address of left byte or beginning of triplet):
STRB     r4,[r2,r3]
; load middle byte at R0+1:
LDRB     r3,[r0,#1]
; R3 = middle byte
LSLS     r2,r1,#4
; R2 = R1<<4 = val<<4
LSLS     r1,r3,#28
; R1 = R3<<28 = middle_byte<<28
LSRS     r1,r1,#28
; R1 = R1>>28 = (middle_byte<<28)>>28
; these two shift operations are used to drop all bits in the register except the lowest 4
; merge lowest 4 bits (in R1) and val<<4 (in R2):
ORRS     r1,r1,r2
; store it at R0+1:
STRB     r1,[r0,#1]
POP      {r4,r5,pc}
ENDP


### Optimizing Keil 5.05 (ARM mode)

#### Getter

The getter function for ARM mode has no conditional branches at all! This is thanks to the condition suffixes (like -EQ, -NE) which can be supplied to many instructions in ARM mode, so an instruction is executed only if the corresponding flag(s) are set.

Many arithmetic instructions in ARM mode can have a shift suffix like LSL #1 (meaning the last operand is shifted left by 1 bit).

get_from_array PROC
; R0 = idx
LSR      r1,r0,#1
; R1 = R0>>1 = idx>>1
; check lowest bit in idx and set flags:
TST      r0,#1
ADD      r2,r1,r1,LSL #1
; R2 = R1+R1<<1 = R1+R1*2 = R1*3
; thanks to the shift suffix, a single instruction in ARM mode can multiply by 3
LDR      r1,|array|
; R1 = address of array
LSR      r0,r0,#1
; R0 = R0>>1 = idx>>1
ADD      r0,r0,r0,LSL #1
; R0 = R0+R0<<1 = R0+R0*2 = R0*3 = (idx>>1)*3 = idx*1.5
ADD      r0,r0,r1
; R0 = R0+R1 = array + idx*1.5, this is absolute address of triplet
; load middle byte at R0+1:
LDRB     r3,[r0,#1]
; R3 = middle byte

; the following 3 instructions are executed if the index is odd, otherwise all of them are skipped:
; load right byte at R0+2:
LDRBNE   r0,[r0,#2]
; R0 = right byte
ANDNE    r1,r3,#0xf
; R1 = R3&0xF = middle_byte&0xF
ORRNE    r0,r0,r1,LSL #8
; R0 = R0|(R1<<8) = right_byte | (middle_byte&0xF)<<8
; this is the result returned

; the following 3 instructions are executed if the index is even, otherwise all of them are skipped:
; load at R1+R2 = array + (idx>>1)*3 = array + idx*1.5
LDRBEQ   r0,[r1,r2]
; R0 = left byte
LSLEQ    r0,r0,#4
; R0 = R0<<4 = left_byte << 4
ORREQ    r0,r0,r3,LSR #4
; R0 = R0 | R3>>4 = left_byte << 4 | middle_byte >> 4
; this is the result returned
BX       lr
ENDP


#### Setter

put_to_array PROC
; R0 = idx
; R1 = val
LSR      r2,r0,#1
; R2 = R0>>1 = idx>>1
; check the lowest bit of idx and set flags:
TST      r0,#1
LDR      r12,|array|
; R12 = address of array
LSR      r0,r0,#1
; R0 = R0>>1 = idx>>1
ADD      r0,r0,r0,LSL #1
; R0 = R0+R0<<1 = R0+R0*2 = R0*3 = (idx>>1)*3 = idx/2*3 = idx*1.5
ADD      r3,r2,r2,LSL #1
; R3 = R2+R2<<1 = R2+R2*2 = R2*3 = (idx>>1)*3 = idx/2*3 = idx*1.5
ADD      r0,r0,r12
; R0 = R0+R12 = idx*1.5 + array
; jump if idx is even:
BEQ      |L0.56|
; idx is odd, go on:
; load middle byte at R0+1:
LDRB     r3,[r0,#1]
; R3 = middle byte
AND      r3,r3,#0xf0
; R3 = R3&0xF0 = middle_byte&0xF0
ORR      r2,r3,r1,LSR #8
; R2 = R3 | R1>>8 = middle_byte&0xF0 | val>>8
; store middle_byte&0xF0 | val>>8 at R0+1 (at the place of middle byte):
STRB     r2,[r0,#1]
; store low 8 bits of val (or val&0xFF) at R0+2 (at the place of right byte):
STRB     r1,[r0,#2]
BX       lr
|L0.56|
; idx is even, go on:
LSR      r2,r1,#4
; R2 = R1>>4 = val>>4
; store val>>4 at R12+R3 or array + idx*1.5 (place of left byte):
STRB     r2,[r12,r3]
; load byte at R0+1 (middle byte):
LDRB     r2,[r0,#1]
; R2 = middle_byte
; drop high 4 bits of middle byte:
AND      r2,r2,#0xf
; R2 = R2&0xF = middle_byte&0xF
; update middle byte:
ORR      r1,r2,r1,LSL #4
; R1 = R2 | R1<<4 = middle_byte&0xF | val<<4
; store updated middle byte at R0+1:
STRB     r1,[r0,#1]
BX       lr
ENDP


The value of idx*1.5 is calculated twice; this redundancy produced by the Keil compiler can be eliminated. You can rework the assembly function to make it shorter as well. Do not forget about tests!

### (32-bit ARM) Comparison of code density in Thumb and ARM modes

Thumb mode in ARM CPUs was introduced to make instructions shorter (16-bit) than the 32-bit instructions of ARM mode. But as we can see, it's hard to say if it was worth it here: the ARM-mode code has fewer instructions (however, each instruction is longer).

### Optimizing GCC 4.9.3 for ARM64

#### Getter

<get_from_array>:
; W0 = idx
0:           lsr     w2, w0, #1
; W2 = W0>>1 = idx>>1
4:           lsl     w1, w2, #2
; W1 = W2<<2 = (W0>>1)<<2 = (idx&(~1))<<1
8:           sub     w1, w1, w2
; W1 = W1-W2 = (idx&(~1))<<1 - idx>>1 = idx*1.5
; now test lowest bit of idx and jump if it is present.
; (ARM64 has a single instruction for these operations: TBNZ (Test bit and Branch if Nonzero)).
c:           tbnz    w0, #0, 30 <get_from_array+0x30>

; idx is even, go on:
10:           adrp    x2, page of array
14:           add     w3, w1, #1
; W3 = W1+1 = idx*1.5 + 1, i.e., offset of middle byte
18:           add     x2, x2, offset of array within page
; load left byte at X2+W1 = array + idx*1.5 with sign-extension ("sxtw")
1c:           ldrb    w0, [x2,w1,sxtw]
; load middle byte at X2+W3 = array + idx*1.5 + 1 with sign-extension ("sxtw")
20:           ldrb    w1, [x2,w3,sxtw]
; W0 = left byte
; W1 = middle byte
24:           lsl     w0, w0, #4
; W0 = W0<<4 = left_byte << 4
; merge parts:
28:           orr     w0, w0, w1, lsr #4
; W0 = W0 | W1>>4 = left_byte << 4 | middle_byte >> 4
; value in W0 is returned
2c:           ret

; idx is odd, go on:
30:           adrp    x2, page of array
34:           add     w0, w1, #1
; W0 = W1+1 = idx*1.5+1, i.e., offset of middle byte
38:           add     w1, w1, #2
; W1 = W1+2 = idx*1.5+2, i.e., offset of right byte
3c:           add     x2, x2, offset of array within page
; load middle byte at X2+W0 = array+idx*1.5+1 with sign-extension ("sxtw")
40:           ldrb    w0, [x2,w0,sxtw]
; load right byte at X2+W1 = array+idx*1.5+2 with sign-extension ("sxtw")
44:           ldrb    w1, [x2,w1,sxtw]
; W0 = middle byte
; W1 = right byte
48:           ubfiz   w0, w0, #8, #4
; W0 = (middle_byte&0xF)<<8
; now merge parts:
4c:           orr     w0, w0, w1
; W0 = W0 | W1 = middle_byte<<8 | right_byte
; value in W0 is returned
50:           ret


ARM64 has a new cool instruction, UBFIZ (Unsigned Bitfield Insert in Zeros), which can be used to place a specified number of bits from one register into another. It's an alias of another instruction, UBFM (Unsigned Bitfield Move). UBFM is the instruction ARM64 uses internally instead of LSL/LSR (bit shifts).

#### Setter

<put_to_array>:
; W0 = idx
; W1 = val
54:           lsr     w3, w0, #1
; W3 = W0>>1 = idx>>1
58:           lsl     w2, w3, #2
; W2 = W3<<2 = (W0>>1)<<2 = (idx&(~1))<<1
5c:           sub     w2, w2, w3
; W2 = W2-W3 = (idx&(~1))<<1 - idx>>1 = idx*1.5
; jump if lowest bit in idx is 1:
60:           tbnz    w0, #0, 94 <put_to_array+0x40>

; idx is even, go on:
64:           adrp    x0, page of array
68:           add     w3, w2, #1
; W3 = W2+1 = idx*1.5+1, i.e., offset of middle byte
6c:           add     x0, x0, offset of array within page
; X0 = address of array
70:           lsr     w4, w1, #4
; W4 = W1>>4 = val>>4
74:           sxtw    x3, w3
; X3 = sign-extended 32-bit W3 (idx*1.5+1) to 64-bit
; sign-extension is needed here because the value will be used as offset within array,
; and negative offsets are possible in standard C/C++
78:           ubfiz   w1, w1, #4, #4
; W1 = (val&0xF)<<4
; store W4 (val>>4) at X0+W2 = array + idx*1.5, i.e., address of left byte:
7c:           strb    w4, [x0,w2,sxtw]
; load middle byte at X0+X3 = array+idx*1.5+1
80:           ldrb    w2, [x0,x3]
; W2 = middle byte
84:           and     w2, w2, #0xf
; W2 = W2&0xF = middle_byte&0xF (high 4 bits in middle byte are dropped)
; merge parts of new version of middle byte:
88:           orr     w1, w2, w1
; W1 = W2|W1 = middle_byte&0xF | val<<4
; store W1 (new middle byte) at X0+X3 = array+idx*1.5+1
8c:           strb    w1, [x0,x3]
90:           ret

; idx is odd, go on:
94:           add     w4, w2, #1
; W4 = W2+1 = idx*1.5+1, i.e., offset of middle byte
98:           adrp    x0, page of array
9c:           add     x0, x0, offset of array within page
; X0 = address of array
a0:           add     w2, w2, #2
; W2 = W2+2 = idx*1.5+2, i.e., offset of right byte
a4:           sxtw    x4, w4
; X4 = sign-extended 64-bit version of 32-bit W4
; load at X0+X4 = array+idx*1.5+1:
a8:           ldrb    w3, [x0,x4]
; W3 = middle byte
ac:           and     w3, w3, #0xfffffff0
; W3 = W3&0xFFFFFFF0 = middle_byte&0xFFFFFFF0, i.e., clear lowest 4 bits
b0:           orr     w3, w3, w1, lsr #8
; W3 = W3|W1>>8 = middle_byte&0xFFFFFFF0 | val>>8
; store new version of middle byte at X0+X4=array+idx*1.5+1:
b4:           strb    w3, [x0,x4]
; now store lowest 8 bits of val (in W1) at X0+W2=array+idx*1.5+2, i.e., place of right byte
; SXTW suffix means W2 will be sign-extended to 64-bit value before summing with X0
b8:           strb    w1, [x0,w2,sxtw]
bc:           ret


### Optimizing GCC 4.4.5 for MIPS

Keep in mind that in MIPS, the instruction placed after a jump/branch instruction is executed before the jump/branch takes effect. This is called the "branch delay slot" in RISC CPU lingo. To make things simpler, just mentally swap the instructions in each pair that starts with a branch or jump instruction.

MIPS has no flags (apparently, to simplify data dependencies), so branch instructions (like BNE) do both the comparison and the branching.

There is also GP (Global Pointer) setup code in the function prologue, which can be ignored for now.

#### Getter

get_from_array:
; $4 = idx
srl     $2,$4,1
; $2 = $4>>1 = idx>>1
lui     $28,%hi(__gnu_local_gp)
sll     $3,$2,1
; $3 = $2<<1 = (idx>>1)<<1 = idx&(~1)
andi    $4,$4,0x1
; $4 = $4&1 = idx&1
addiu   $28,$28,%lo(__gnu_local_gp)

; jump if $4 (idx&1) is not zero (if idx is odd):
bne     $4,$0,$L6
; $2 = $3+$2 = idx>>1 + idx&(~1) = idx*1.5
addu    $2,$3,$2 ; branch delay slot - this instruction is executed before BNE

; idx is even, go on:
lw      $3,%got(array)($28)
; $3 = array
nop
addu    $2,$3,$2
; $2 = $3+$2 = array + idx*1.5
; load byte at $2+0 = array + idx*1.5 (left byte):
lbu     $3,0($2)
; $3 = left byte
; load byte at $2+1 = array + idx*1.5 + 1 (middle byte):
lbu     $2,1($2)
; $2 = middle byte
sll     $3,$3,4
; $3 = $3<<4 = left_byte<<4
srl     $2,$2,4
; $2 = $2>>4 = middle_byte>>4
j       $31
or      $2,$2,$3 ; branch delay slot - this instruction is executed before J
; $2 = $2|$3 = middle_byte>>4 | left_byte<<4
; $2 = returned result
$L6:
; idx is odd, go on:
lw      $3,%got(array)($28)
; $3 = array
nop
addu    $2,$3,$2
; $2 = $3+$2 = array + idx*1.5
; load byte at $2+1 = array + idx*1.5 + 1 (middle byte):
lbu     $4,1($2)
; $4 = middle byte
; load byte at $2+2 = array + idx*1.5 + 2 (right byte):
lbu     $3,2($2)
; $3 = right byte
andi    $2,$4,0xf
; $2 = $4&0xF = middle_byte&0xF
sll     $2,$2,8
; $2 = $2<<8 = middle_byte&0xF << 8
j       $31
or      $2,$2,$3 ; branch delay slot - this instruction is executed before J
; $2 = $2|$3 = middle_byte&0xF << 8 | right_byte
; $2 = returned result


#### Setter

put_to_array:
; $4 = idx
; $5 = val
srl     $2,$4,1
; $2 = $4>>1 = idx>>1
lui     $28,%hi(__gnu_local_gp)
sll     $3,$2,1
; $3 = $2<<1 = (idx>>1)<<1 = idx&(~1)
andi    $4,$4,0x1
; $4 = $4&1 = idx&1
addiu   $28,$28,%lo(__gnu_local_gp)
; jump if $4=idx&1 is not zero (i.e., if idx is odd):
bne     $4,$0,$L11
addu    $2,$3,$2 ; branch delay slot, this instruction is executed before BNE
; $2 = $3+$2 = idx&(~1) + idx>>1 = idx*1.5

; idx is even, go on:
lw      $3,%got(array)($28)
; $3 = array
addiu   $4,$2,1
; $4 = $2+1 = idx*1.5+1, i.e., offset of middle byte in array
srl     $6,$5,4
; $6 = $5>>4 = val>>4
addu    $2,$3,$2
; $2 = $3+$2 = array + idx*1.5 (address of left byte)
; store $6 (val>>4) as left byte:
sb      $6,0($2)
addu    $2,$3,$4
; $2 = $3+$4 = array + idx*1.5 + 1 (absolute address of middle byte)
; load middle byte at $2+0 = array + idx*1.5 + 1:
lbu     $3,0($2)
; $3 = middle byte
andi    $5,$5,0xf
; $5 = $5&0xF = val&0xF
andi    $3,$3,0xf
; $3 = $3&0xF = middle_byte&0xF
sll     $5,$5,4
; $5 = $5<<4 = (val&0xF)<<4
or      $5,$3,$5
; $5 = $3|$5 = middle_byte&0xF | (val&0xF)<<4 (new version of middle byte)
j       $31
; store $5 (new middle byte) at $2 (array + idx*1.5 + 1):
sb      $5,0($2) ; branch delay slot, this instruction is executed before J
$L11:
; idx is odd, go on:
lw      $4,%got(array)($28)
; $4 = array
addiu   $3,$2,1
; $3 = $2+1 = idx*1.5+1 (offset of middle byte)
addu    $3,$4,$3
; $3 = $4+$3 = array + idx*1.5+1
; load middle byte at $3 (array + idx*1.5+1):
lbu     $6,0($3)
; $6 = middle byte
srl     $7,$5,8
; $7 = $5>>8 = val>>8
andi    $6,$6,0xf0
; $6 = $6&0xF0 = middle_byte&0xF0
or      $6,$6,$7
; $6 = $6|$7 = middle_byte&0xF0 | val>>8
addu    $2,$4,$2
; $2 = $4+$2 = array + idx*1.5
; store updated middle byte at $3 (array + idx*1.5+1):
sb      $6,0($3)
j       $31
; store low 8 bits of val at $2+2 = array + idx*1.5+2 (place of right byte):
sb      $5,2($2) ; branch delay slot, this instruction is executed before J


### Difference from the real FAT12

The real FAT12 table is slightly different: https://en.wikipedia.org/wiki/Design_of_the_FAT_file_system#Cluster_map:

For even elements:

 +0 +1 +2
|23|.1|..|..


For odd elements:

 +0 +1 +2
|..|3.|12|..


Here are FAT12-related functions in Linux Kernel: fat12_ent_get(), fat12_ent_put().

Nevertheless, I did it my way because the values are better visible and recognizable in a byte-level GDB dump, for the sake of demonstration.

### Exercise

Perhaps there is a way to store the data so that the getter/setter functions would be faster. If we place the values this way:

(Even elements)

 +0 +1 +2
|23|1.|..|..


(Odd elements)

 +0 +1 +2
|..|.1|23|..


This storage scheme allows eliminating at least one shift operation. As an exercise, you may rework my C/C++ code in that way and see what the compiler produces.

### Summary

Bit shifts ("<<" and ">>" in C/C++, SHL/SHR/SAL/SAR in x86, LSL/LSR in ARM, SLL/SRL in MIPS) are used to move bit(s) to a specific place.

The AND operation ("&" in C/C++, AND in x86/ARM) is used to drop unneeded bits, e.g., when isolating a bit field.

The OR operation ("|" in C/C++, OR in x86/ARM) is used to merge or combine several values into one. One input value must have zeros at the places where the other value has its information-carrying bits.

ARM64 has the new instructions UBFM and UBFIZ for moving specific bits from one register to another.

### Conclusion

FAT12 is hardly used anywhere nowadays, but if you have space constraints and need to store values limited to 12 bits, you may consider a tightly-packed array in the manner of the FAT12 table.