The note below has been copypasted to the Reverse Engineering for Beginners book

File Allocation Table (FAT) was a widely popular filesystem. Hard to believe, but it's still used on flash drives, perhaps, for the reason of simplicity and compatibility. The FAT table itself is array of elements, each of which points to the next cluster number of a file (FAT supports files scattered across the whole disk). That implies that maximum of each element is maximum number of clusters on the disk. In MS-DOS era, most hard disks has FAT16 filesystem, because cluster number could be packed into 16-bit value. Hard disks then become cheaper, and FAT32 emerged, where 32-bit value was allocated for cluster number. But there were also a times, when floppy diskettes were not that cheap and has no much space, so FAT12 were used on them, for the reason of packing all filesystem structures as tight as possible: https://en.wikipedia.org/wiki/File_Allocation_Table#FAT12.

So the FAT table in FAT12 filesystem is an array where each two subsequent 12-bit elements are stored into 3 bytes (triplet). Here is how 6 12-bit values (AAA, BBB, CCC, DDD, EEE and FFF) are packed into 9 bytes:

+0 +1 +2 +3 +4 +5 +6 +7 +8 |AA|AB|BB|CC|CD|DD|EE|EF|FF|...

Pushing values into array and pulling them back can be good example of bit twiddling operations (in both C/C++ and low-level machine code), so that's why I'll use FAT12 as an example here.

We can quickly observe that each byte triplet will store 2 12-bit values: the first one is located at the left side, second one is at right:

+0 +1 +2 |11|12|22|...

We will pack nibbles (4 bit chunks) in the following way (1 - highest nibble, 3 - lowest):

(Even)

+0 +1 +2 |12|3.|..|...

(Odd)

+0 +1 +2 |..|.1|23|...

So the algorithm can be as follows: if the element's index is even, put it at left side, if the index is odd, place it at right side. The middle byte: if the element's index is even, place part of it in high 4 bits, if it's odd, place its part in low 4 bits. But first, find the right triplet, this is easy: $\frac{index}{2}$. Finding the address of right byte in array of bytes is also easy: $\frac{index}{2} \cdot 3$ or $index \cdot \frac{3}{2}$ or just $index \cdot 1.5$.

Pulling values from array: if index is even, get leftmost and middle bytes and combine its parts. If index is odd, get middle and rightmost bytes and combine them. Do not forget to isolate unneeded bits in middle byte.

Pushing values is almost the same, but be careful not to overwrite some *other's* bits in the middle byte, correcting only *yours*.

#include <stdio.h> #include <stdint.h> #include <assert.h> #define ARRAY_SIZE (0x1000/2*3) uint8_t array[ARRAY_SIZE]; // big enough array of triplets unsigned int get_from_array (unsigned int idx) { // find right triplet in array: int triplet=(idx>>1); int array_idx=triplet*3; //assert (array_idx<ARRAY_SIZE); if (idx&1) { // this is odd element // compose value using middle and rightmost bytes: return ((array[array_idx+1]&0xF) << 8)|array[array_idx+2]; } else { // this is even element // compose value using rightmost and middle bytes: return array[array_idx]<<4 | ((array[array_idx+1]>>4)&0xF); }; }; void put_to_array (unsigned int idx, unsigned int val) { //assert (val<=0xFFF); // find right triplet in array: int triplet=(idx>>1); int array_idx=triplet*3; //assert (array_idx<ARRAY_SIZE); if (idx&1) { // this is odd element // put value into middle and rightmost bytes: // decompose value to be stored: uint8_t val_lowest_byte=val&0xFF; // isolate lowest 8 bits uint8_t val_highest_nibble=val>>8; // no need to apply &0xF, we already know the val<=0xFFF // clear low 4 bits in the middle byte: array[array_idx+1]=array[array_idx+1]&0xF0; array[array_idx+1]=array[array_idx+1]|val_highest_nibble; array[array_idx+2]=val_lowest_byte; } else { // this is even element // put value into leftmost and middle bytes: // decompose value to be stored: uint8_t val_highest_byte=val>>4; uint8_t val_lowest_nibble=val&0xF; array[array_idx]=val_highest_byte; // clear high 4 bits in the middle byte: array[array_idx+1]=array[array_idx+1]&0xF; array[array_idx+1]=array[array_idx+1]|val_lowest_nibble<<4; }; }; int main() { int i; // test for (i=0; i<0x1000; i++) { put_to_array(i, i); }; for (i=0; i<0x1000; i++) { assert(get_from_array(i)==i); }; //put_to_array(0x1000, 1); // will fail due to assert() // print triplets: for (int i=0;i<0x1000/2;i++) printf ("0x%02X%02X%02X\n",array[i*3],array[i*3+1],array[i*3+2]); };

During test, all 12-bit elements are filled with values in 0..0xFFF range. And here is a dump of all triplets, each line has 3 bytes:

0x000001 0x002003 0x004005 0x006007 0x008009 0x00A00B 0x00C00D 0x00E00F 0x010011 0x012013 0x014015 ... 0xFECFED 0xFEEFEF 0xFF0FF1 0xFF2FF3 0xFF4FF5 0xFF6FF7 0xFF8FF9 0xFFAFFB 0xFFCFFD 0xFFEFFF

Here is also GDB byte-level output of 300 bytes (or 100 triplets) started at 512/2*3, i.e., it's address where 512th element (0x200) is beginning. I added square brackets in my text editor to show triplets explicitely. Take a notice at the middle bytes, where the last element is ended and the next is started. In other words, each middle byte has lowest 4 bits of even element and highest 4 bits of odd element.

(gdb) x/300xb array+512/2*3

0x601380 <array+768>: [ 0x20 0x02 0x01 ][ 0x20 0x22 0x03 ][ 0x20 0x42

0x601388 <array+776>: 0x05 ][ 0x20 0x62 0x07 ][ 0x20 0x82 0x09 ][ 0x20

0x601390 <array+784>: 0xa2 0x0b ][ 0x20 0xc2 0x0d ][ 0x20 0xe2 0x0f ]

0x601398 <array+792>: [ 0x21 0x02 0x11 ][ 0x21 0x22 0x13 ][ 0x21 0x42

0x6013a0 <array+800>: 0x15 ][ 0x21 0x62 0x17 ][ 0x21 0x82 0x19 ][ 0x21

0x6013a8 <array+808>: 0xa2 0x1b ][ 0x21 0xc2 0x1d ][ 0x21 0xe2 0x1f ]

0x6013b0 <array+816>: [ 0x22 0x02 0x21 ][ 0x22 0x22 0x23 ][ 0x22 0x42

0x6013b8 <array+824>: 0x25 ][ 0x22 0x62 0x27 ][ 0x22 0x82 0x29 ][ 0x22

0x6013c0 <array+832>: 0xa2 0x2b ][ 0x22 0xc2 0x2d ][ 0x22 0xe2 0x2f ]

0x6013c8 <array+840>: [ 0x23 0x02 0x31 ][ 0x23 0x22 0x33 ][ 0x23 0x42

0x6013d0 <array+848>: 0x35 ][ 0x23 0x62 0x37 ][ 0x23 0x82 0x39 ][ 0x23

0x6013d8 <array+856>: 0xa2 0x3b ][ 0x23 0xc2 0x3d ][ 0x23 0xe2 0x3f ]

0x6013e0 <array+864>: [ 0x24 0x02 0x41 ][ 0x24 0x22 0x43 ][ 0x24 0x42

0x6013e8 <array+872>: 0x45 ][ 0x24 0x62 0x47 ][ 0x24 0x82 0x49 ][ 0x24

0x6013f0 <array+880>: 0xa2 0x4b ][ 0x24 0xc2 0x4d ][ 0x24 0xe2 0x4f ]

0x6013f8 <array+888>: [ 0x25 0x02 0x51 ][ 0x25 0x22 0x53 ][ 0x25 0x42

0x601400 <array+896>: 0x55 ][ 0x25 0x62 0x57 ][ 0x25 0x82 0x59 ][ 0x25

0x601408 <array+904>: 0xa2 0x5b ][ 0x25 0xc2 0x5d ][ 0x25 0xe2 0x5f ]

0x601410 <array+912>: [ 0x26 0x02 0x61 ][ 0x26 0x22 0x63 ][ 0x26 0x42

0x601418 <array+920>: 0x65 ][ 0x26 0x62 0x67 ][ 0x26 0x82 0x69 ][ 0x26

0x601420 <array+928>: 0xa2 0x6b ][ 0x26 0xc2 0x6d ][ 0x26 0xe2 0x6f ]

0x601428 <array+936>: [ 0x27 0x02 0x71 ][ 0x27 0x22 0x73 ][ 0x27 0x42

0x601430 <array+944>: 0x75 ][ 0x27 0x62 0x77 ][ 0x27 0x82 0x79 ][ 0x27

0x601438 <array+952>: 0xa2 0x7b ][ 0x27 0xc2 0x7d ][ 0x27 0xe2 0x7f ]

0x601440 <array+960>: [ 0x28 0x02 0x81 ][ 0x28 0x22 0x83 ][ 0x28 0x42

0x601448 <array+968>: 0x85 ][ 0x28 0x62 0x87 ][ 0x28 0x82 0x89 ][ 0x28

0x601450 <array+976>: 0xa2 0x8b ][ 0x28 0xc2 0x8d ][ 0x28 0xe2 0x8f ]

0x601458 <array+984>: [ 0x29 0x02 0x91 ][ 0x29 0x22 0x93 ][ 0x29 0x42

0x601460 <array+992>: 0x95 ][ 0x29 0x62 0x97 ][ 0x29 0x82 0x99 ][ 0x29

0x601468 <array+1000>: 0xa2 0x9b ][ 0x29 0xc2 0x9d ][ 0x29 0xe2 0x9f ]

0x601470 <array+1008>:[ 0x2a 0x02 0xa1 ][ 0x2a 0x22 0xa3 ][ 0x2a 0x42

0x601478 <array+1016>: 0xa5 ][ 0x2a 0x62 0xa7 ][ 0x2a 0x82 0xa9 ][ 0x2a

0x601480 <array+1024>: 0xa2 0xab ][ 0x2a 0xc2 0xad ][ 0x2a 0xe2 0xaf ]

0x601488 <array+1032>:[ 0x2b 0x02 0xb1 ][ 0x2b 0x22 0xb3 ][ 0x2b 0x42

0x601490 <array+1040>: 0xb5 ][ 0x2b 0x62 0xb7 ][ 0x2b 0x82 0xb9 ][ 0x2b

0x601498 <array+1048>: 0xa2 0xbb ][ 0x2b 0xc2 0xbd ][ 0x2b 0xe2 0xbf ]

0x6014a0 <array+1056>:[ 0x2c 0x02 0xc1 ][ 0x2c 0x22 0xc3 ][ 0x2c 0x42

0x6014a8 <array+1064>: 0xc5 ][ 0x2c 0x62 0xc7 ]

Let array be a global buffer to make simpler access to it.

Let's first start at the function getting values from the array, because it's simpler.

The method of finding triplet's number is just division input index by 2, but we can do it just by shifting right by 1 bit. This is a very common way of dividing/multiplication by numbers in form of $2^n$.

I can demonstrate how it works. Let's say, you need to divide 123 by 10. Just drop last digit (3, which is remainder of division) and 12 is left. Division by 2 is just dropping least significant bit. Dropping can be done by shifting right.

Now the functions must decide if the input index even (so 12-bit value is placed at the left) or odd (at the right). Simplest way to do so is to isolate lowest bit (x&1). If it's zero, our value is even, otherwise it's odd.

This fact can be illustrated easily, take a look at the lowest bit:

decimal binary even/odd 0 0000 even 1 0001 odd 2 0010 even 3 0011 odd 4 0100 even 5 0101 odd 6 0110 even 7 0111 odd 8 1000 even 9 1001 odd 10 1010 even 11 1011 odd 12 1100 even 13 1101 odd 14 1110 even 15 1111 odd ...

Zero is also even number, it's so in two’s complement system, where it's located between two odd numbers (-1 and 1).

For math geeks, I could also say that even or odd sign is also remainder of division by 2. Division of a number by 2 is merely dropping the last bit, which is remainder of division. Well, we do not need to do shifts here, just isolate lowest bit.

If the element is odd, we take middle and right bytes ("array[array_idx+1]" and "array[array_idx+2]"). Lowest 4 bits of the middle byte is isolated. Right byte is taken as a whole. Now we need to combine these parts into 12-bit value. To do so, shift 4 bits from the middle byte by 8 bits left, so these 4 bits will be allocated right behind highest 8th bit of byte. Then, using OR operation, we just add these parts.

If the element is even, high 8 bits of 12-bit value is located in left byte, while lowest 4 bits are located in the high 4 bits of middle byte. We isolate highest 4 bits in the middle byte by shifting it 4 bits right and then applying AND operation, just to be sure that nothing is left there. We also take left byte and shift its value 4 bits left, because it has bits from 11th to 4th (inclusive, starting at 0), Using OR operation, we combine these two parts.

Setter calculates triplet's address in the same way.
It also operates on left/right bytes in the same way.
But it's not correct just to write to the middle byte, because write operation will destroy the information related to the other element.
So the common way is to load byte, drop bits where you'll write, write it there, but leave other part intact.
Using AND operation (& in C/C++), we drop everything except *our* part.
Using OR operation (| in C/C++), we then update the middle byte.

Let's see what optimizing GCC 4.8.2 for Linux x64 will do. I added comments. Sometimes readers are confused because instructions order is not logical. It's OK, because optimizing compilers take CPU out-of-order-execution mechanisms into considerations, and sometimes, swapped instructions performing better.

get_from_array: ; EDI=idx ; make a copy: mov eax, edi ; calculate idx>>1: shr eax ; determine if the element even or odd by isolation of the lowest bit: and edi, 1 ; calculate (idx>>1)*3. ; multiplication is slow in geenral, and can be replaced by one shifting and addition operation ; LEA is capable to do both: lea edx, [rax+rax*2] ; EDX now is (idx>>1)*3 ; point EAX to the middle byte: lea eax, [rdx+1] ; sign-extend EAX value to RDX: cdqe ; get middle byte into EAX: movzx eax, BYTE PTR array[rax] ; finally check the value of the lowest bit in index. ; jump if index is odd (NE is the same as NZ (Not Zero)): jne .L9 ; this is even element, go on ; sign-extend EDX to RDX: movsx rdx, edx ; shift middle byte 4 bits right: shr al, 4 ; AL now has 4 highest bits of middle byte ; load left byte into EDX: movzx edx, BYTE PTR array[rdx] ; sign-extend AL (where high 4 bits of middle byte are now) movzx eax, al ; EAX has 4 high bits bits of middle byte ; EDX now has left byte ; shift left byte 4 bits left: sal edx, 4 ; 4 lowest bits in EDX after shifting is cleared to zero ; finally merge values: or eax, edx ret .L9: ; this is odd element, go on ; calculate address of right byte: add edx, 2 ; isolate lowest 4 bits in middle byte: and eax, 15 ; 15=0xF ; sign-extend EDX (where address of right byte is) to RDX: movsx rdx, edx ; shift 4 bits we got from middle bytes 8 bits left: sal eax, 8 ; load right byte: movzx edx, BYTE PTR array[rdx] ; merge values: or eax, edx ret

put_to_array: ; EDI=idx ; ESI=val ; copy idx to EAX: mov eax, edi ; calculate idx>>1 and put it to EAX: shr eax ; isolate lowest bit in idx: and edi, 1 ; calculate (idx>>2)*3 and put it to EAX: lea eax, [rax+rax*2] ; jump if index is odd (NE is the same as NZ (Not Zero)): jne .L5 ; this is even element, go on ; sign-extend address of triplet in EAX to RDX: movsx rdx, eax ; copy val value to ECX: mov ecx, esi ; calculate address of middle byte: add eax, 1 ; sign-extend address in EAX to RDX: cdqe ; prepare left byte in ECX by shifting it: shr ecx, 4 ; prepare 4 bits for middle byte: sal esi, 4 ; put left byte: mov BYTE PTR array[rdx], cl ; load middle byte (its address still in RAX): movzx edx, BYTE PTR array[rax] ; drop high 4 bits: and edx, 15 ; 15=0xF ; merge our data and low 4 bits which were there before: or esi, edx ; put middle byte back: mov BYTE PTR array[rax], sil ret .L5: ; this is odd element, go on ; calculate address of middle byte and put it to ECX: lea ecx, [rax+1] ; copy val value from ESI to EDI: mov edi, esi ; calculate address of right byte: add eax, 2 ; get high 4 bits of input value by shifting it 8 bits right: shr edi, 8 ; sign-extend address in EAX into RAX: cdqe ; sign-extend address of middle byte in ECX to RCX: movsx rcx, ecx ; load middle byte into EDX: movzx edx, BYTE PTR array[rcx] ; drop low 4 bits in middle byte: and edx, -16 ; -16=0xF0 ; merge data from input val with what was in middle byte before: or edx, edi ; store middle byte: mov BYTE PTR array[rcx], dl ; store right byte. val is still in ESI and SIL is a part of ESI register which has lowest 8 bits: mov BYTE PTR array[rax], sil ret

All addresses in Linux x64 are 64-bit ones, so during pointer arithmetics, all values should also be 64-bit.
The code calculating offsets inside of array operates on 32-bit values (input **idx** argument has type of **int**, which has width of 32 bits),
and so these values should be converted to 64-bit addresses before actual memory load/store.
So there are a lot of sign-extending instructions (like **CDQE**, **MOVSX**) used for conversion.
Why to extend sign? By C/C++ standards, pointer arithmetics can operate on negative values
(it's possible to access array using negative index like **array[-123]**).
Since GCC compiler cannot be sure if all indices are always positive, it adds sign-extending instructions.

The following code has final OR operation in the function epilogue. Indeed, it executes at the end of both branches, so it's possible to save some space.

get_from_array PROC ; R0 = idx PUSH {r4,r5,lr} LSRS r1,r0,#1 ; R1 = R0>>1 = idx>>1 ; R1 is now number of triplet LSLS r2,r1,#1 ; R2 = R1<<1 = (R0>>1)<<1 = R0&(~1) = idx&(~1) ; the operation (x>>1)<<1 looks senseless, but it's intended to clear the lowest bit in x (or idx) LSLS r5,r0,#31 ; R5 = R0<<31 = idx<<31 ; thus, R5 will contain 0x80000000 in case of even idx or zero if odd ADDS r4,r1,r2 ; R4 = R1+R2 = idx>>1 + idx&(~1) = offset of triplet begin (or offset of left byte) ; the expression looks tricky, but it's equal to multiplication by 1.5 LSRS r0,r0,#1 ; R0 = R0>>1 = idx>>1 ; load pointer to array: LDR r3,|array| ; R3 = offset of array table LSLS r1,r0,#1 ; R1 = R0<<1 = (idx>>1)<<1 = idx&(~1) ADDS r0,r0,r1 ; R0 = idx>>1 + idx&(~1) = idx*1.5 = offset of triple begin ADDS r1,r3,r0 ; R1 = R3+R0 = offset of array + idx*1.5 ; in other words, R1 now contains absolute address of triplet ; load middle byte (at address R1+1): LDRB r2,[r1,#1] ; R2 = middle byte ; finally check if the idx even or odd: CMP r5,#0 ; jump if even: BEQ |L0.92| ; idx is odd, go on: LSLS r0,r2,#28 ; R0 = R2<<28 = middle_byte<<28 ; load right byte at R1+2: LDRB r1,[r1,#2] ; R1 = right byte LSRS r0,r0,#20 ; R0 = R0>>20 = (R2<<28)>>20 ; this is the same as R2<<8, but Keil compiler generates more complex code in order to drop all bits behind these 4 B |L0.98| |L0.92| ; idx is even, go on: ; load left byte. R3=array now and R4=address of it LDRB r0,[r3,r4] ; R0 = left byte LSLS r0,r0,#4 ; R0 = left_byte<<4 ; shift middle_byte in R2 4 bits right: LSRS r1,r2,#4 ; R1=middle_byte>>4 |L0.98| ; function epilogue ; current R0 value is shifted left byte or part of middle byte ; R1 is shifted part of middle byte or right byte ; now merge values and leave merged result in R0: ORRS r0,r0,r1 ; R0 = R0|R1 POP {r4,r5,pc} ENDP

There are at least of redundancy: *idx*1.5* is calculated twice.
As an exercise, reader may try to rework assembly function to make it shorter. Do not forget about testing!

Another thing to mention is that it's hard to generate big constants in 16-bit Thumb instructions, so Keil compiler often generates
tricky code using shifting instructions to achieve the same effect.
For example, it's tricky to generate **AND Rdest, Rsrc, 1** or **TST Rsrc, 1** code in Thumb mode,
so Keil generates the code which shifts input **idx** by 31 bits left and then check, if the resulting value zero or not.

The first half of setter code is very similar to getter, address of triplet is calculated first, then the jump is occurred in order to dispatch to the right handler's code.

put_to_array PROC PUSH {r4,r5,lr} ; R0 = idx ; R1 = val LSRS r2,r0,#1 ; R2 = R0>>1 = idx>>1 LSLS r3,r2,#1 ; R3 = R2<<1 = (idx>>1)<<1 = idx&(~1) LSLS r4,r0,#31 ; R4 = R0<<31 = idx<<31 ADDS r3,r2,r3 ; R3 = R2+R3 = idx>>1 + idx&(~1) = idx*1.5 LSRS r0,r0,#1 ; R0 = R0>>1 = idx>>1 LDR r2,|array| ; R2 = address of array LSLS r5,r0,#1 ; R5 = R0<<1 = (idx>>1)<<1 = idx&(~1) ADDS r0,r0,r5 ; R0 = R0+R5 = idx>>1 + idx&(~1) = idx*1.5 ADDS r0,r2,r0 ; R0 = R2+R0 = array + idx*1.5, in other words, this is address of triplet ; finally test shifted lowest bit in idx: CMP r4,#0 ; jump if idx is even: BEQ |L0.40| ; idx is odd, go on: ; load middle byte at R0+1: LDRB r3,[r0,#1] ; R3 = middle_byte LSRS r2,r1,#8 ; R2 = R1>>8 = val>>8 LSRS r3,r3,#4 ; R3 = R3>>4 = middle_byte>>4 LSLS r3,r3,#4 ; R3 = R3<<4 = (middle_byte>>4)<<4 ; this two shift operations are used to drop low 4 bits in middle_byte ; merge high 4 bits in middle byte (in R3) with val>>8 (in R2): ORRS r3,r3,r2 ; R3 = updated middle byte ; store it at R0+1: STRB r3,[r0,#1] ; store low 8 bits of val (val&0xFF) at R0+2: STRB r1,[r0,#2] POP {r4,r5,pc} |L0.40| ; idx is even, go on: LSRS r4,r1,#4 ; R4 = R1>>4 = val>>4 ; store val>>4 at R2+R3 (address of left byte or beginning of triplet): STRB r4,[r2,r3] ; load middle byte at R0+1: LDRB r3,[r0,#1] ; R3 = middle byte LSLS r2,r1,#4 ; R2 = R1<<4 = val<<4 LSLS r1,r3,#28 ; R1 = R3<<28 = middle_byte<<28 LSRS r1,r1,#28 ; R1 = R1>>28 = (middle_byte<<28)>>28 ; these two shifting operation are used to drop all bits in register except lowest 4 ; merge lowest 4 bits (in R1) and val<<4 (in R2): ORRS r1,r1,r2 ; store it at R0+1: STRB r1,[r0,#1] POP {r4,r5,pc} ENDP

Getter function for ARM mode has no conditional branches at all!
Thanks to the suffixes (like **-EQ**, **-NE**), which can be supplied to many instructions in ARM mode,
so the instruction will be only executed if the corresponding flag(s) are set.

Many arithmetical instructions in ARM mode can have shifting suffix like **LSL #1** (it means, the last operand is shifted left by 1 bit).

get_from_array PROC ; R0 = idx LSR r1,r0,#1 ; R1 = R0>>1 = idx>>1 ; check lowest bit in idx and set flags: TST r0,#1 ADD r2,r1,r1,LSL #1 ; R2 = R1+R1<<1 = R1+R1*2 = R1*3 ; thatnks to shifting suffix, a single instruction in ARM mode can multiplicate by 3 LDR r1,|array| ; R1 = address of array LSR r0,r0,#1 ; R0 = R0>>1 = idx>>1 ADD r0,r0,r0,LSL #1 ; R0 = R0+R0<<1 = R0+R0*2 = R0*3 = (idx>>1)*3 = idx*1.5 ADD r0,r0,r1 ; R0 = R0+R1 = array + idx*1.5, this is absolute address of triplet ; load middle byte at R0+1: LDRB r3,[r0,#1] ; R3 = middle byte ; the following 3 instructions executed if index is odd, otherwise all of them are skipped: ; load right byte at R0+2: LDRBNE r0,[r0,#2] ; R0 = right byte ANDNE r1,r3,#0xf ; R1 = R3&0xF = middle_byte&0xF ORRNE r0,r0,r1,LSL #8 ; R0 = R0|(R1<<8) = right_byte | (middle_byte&0xF)<<8 ; this is the result returned ; the following 3 instructions executed if index is even, otherwise all of them are skipped: ; load at R1+R2 = array + (idx>>1)*3 = array + idx*1.5 LDRBEQ r0,[r1,r2] ; R0 = left byte LSLEQ r0,r0,#4 ; R0 = R0<<4 = left_byte << 4 ORREQ r0,r0,r3,LSR #4 ; R0 = R0 | R3>>4 = left_byte << 4 | middle_byte >> 4 ; this is the result returned BX lr ENDP

put_to_array PROC ; R0 = idx ; R1 = val LSR r2,r0,#1 ; R2 = R0>>1 = idx>>1 ; check the lowest bit of idx and set flags: TST r0,#1 LDR r12,|array| ; R12 = address of array LSR r0,r0,#1 ; R0 = R0>>1 = idx>>1 ADD r0,r0,r0,LSL #1 ; R0 = R0+R0<<1 = R0+R0*2 = R0*3 = (idx>>1)*3 = idx/2*3 = idx*1.5 ADD r3,r2,r2,LSL #1 ; R3 = R2+R2<<1 = R2+R2*2 = R2*3 = (idx>>1)*3 = idx/2*3 = idx*1.5 ADD r0,r0,r12 ; R0 = R0+R12 = idx*1.5 + array ; jump if idx is even: BEQ |L0.56| ; idx is odd, go on: ; load middle byte at R0+1: LDRB r3,[r0,#1] ; R3 = middle byte AND r3,r3,#0xf0 ; R3 = R3&0xF0 = middle_byte&0xF0 ORR r2,r3,r1,LSR #8 ; R2 = R3 | R1>>8 = middle_byte&0xF0 | val>>8 ; store middle_byte&0xF0 | val>>8 at R0+1 (at the place of middle byte): STRB r2,[r0,#1] ; store low 8 bits of val (or val&0xFF) at R0+2 (at the place of right byte): STRB r1,[r0,#2] BX lr |L0.56| ; idx is even, go on: LSR r2,r1,#4 ; R2 = R1>>4 = val>>4 ; store val>>4 at R12+R3 or array + idx*1.5 (place of left byte): STRB r2,[r12,r3] ; load byte at R0+1 (middle byte): LDRB r2,[r0,#1] ; R2 = middle_byte ; drop high 4 bits of middle byte: AND r2,r2,#0xf ; R2 = R2&0xF = middle_byte&0xF ; update middle byte: ORR r1,r2,r1,LSL #4 ; R1 = R2 | R1<<4 = middle_byte&0xF | val<<4 ; store updated middle byte at R0+1: STRB r1,[r0,#1] BX lr ENDP

Value of *idx*1.5* is calculated twice, this is redundancy Keil compiler produced can be eliminated.
You can rework assembly function as well to make it shorter. Do not forget about tests!

Thumb mode in ARM CPUs was introduced to make instructions shorter (16-bits) instead of 32-bit instructions in ARM mode. But as we can see, it's hard to say, if it was worth it: code in ARM mode is always shorter (however, instructions are longer).

<get_from_array>: ; W0 = idx 0: lsr w2, w0, #1 ; W2 = W0>>1 = idx>>1 4: lsl w1, w2, #2 ; W1 = W2<<2 = (W0>>1)<<2 = (idx&(~1))<<1 8: sub w1, w1, w2 ; W1 = W1-W2 = (idx&(~1))<<1 - idx>>1 = idx*1.5 ; now test lowest bit of idx and jump if it is present. ; (ARM64 has single instruction for these operations: TBNZ (Test and Branch Not Zero)). c: tbnz w0, #0, 30 <get_from_array+0x30> ; idx is even, go on: 10: adrp x2, page of array 14: add w3, w1, #0x1 ; W3 = W1+1 = idx*1.5 + 1, i.e., offset of middle byte 18: add x2, x2, offset of array within page ; load left byte at X2+W1 = array + idx*1.5 with sign-extension ("sxtw") 1c: ldrb w0, [x2,w1,sxtw] ; load middle byte at X2+W3 = array + idx*1.5 + 1 with sign-extension ("sxtw") 20: ldrb w1, [x2,w3,sxtw] ; W0 = left byte ; W1 = middle byte 24: lsl w0, w0, #4 ; W0 = W0<<4 = left_byte << 4 ; merge parts: 28: orr w0, w0, w1, lsr #4 ; W0 = W0 | W1>>4 = left_byte << 4 | middle_byte >> 4 ; value in W0 is returned 2c: ret ; idx is odd, go on: 30: adrp x2, page of array 34: add w0, w1, #0x1 ; W0 = W1+1 = idx*1.5+1, i.e., offset of middle byte 38: add x2, x2, address of array within page 3c: add w1, w1, #0x2 ; W1 = W1+2 = idx*1.5+2, i.e., offset of right byte ; load middle byte at X2+W0 = array+idx*1.5+1 with sign-extension ("sxtw") 40: ldrb w0, [x2,w0,sxtw] ; load right byte at X2+W1 = array+idx*1.5+2 with sign-extension ("sxtw") 44: ldrb w1, [x2,w1,sxtw] ; W0 = middle byte ; W1 = right byte 48: ubfiz w0, w0, #8, #4 ; W0 = middle_byte<<8 ; now merge parts: 4c: orr w0, w0, w1 ; W0 = W0 | W1 = middle_byte<<8 | right_byte ; value in W0 is returned 50: ret

ARM64 has new cool instruction **UBFIZ** (*Unsigned bitfield insert in zero, with zeros to left and right*),
which can be used to place specified number of bits from one register to another.
It's alias of another instruction, **UBFM** (*Unsigned bitfield move, with zeros to left and right*).
**UBFM** is the instruction used internally in ARM64 instead of **LSL/LSR** (bit shifts).

<put_to_array>: W0 = idx W1 = val 54: lsr w3, w0, #1 ; W3 = W0>>1 = idx>>1 58: lsl w2, w3, #2 ; W2 = W3<<2 = (W0>>1)<<2 = (idx&(~1))<<1 5c: sub w2, w2, w3 ; W2 = W2-W3 = (idx&(~1))<<1 - idx>>1 = idx*1.5 ; jump if lowest bit in idx is 1: 60: tbnz w0, #0, 94 <put_to_array+0x40> ; idx is even, go on: 64: adrp x0, page of array 68: add w3, w2, #0x1 ; W3 = W2+1 = idx*1.5+1, i.e., offset of middle byte 6c: add x0, x0, offset of array within page ; X0 = address of array 70: lsr w4, w1, #4 ; W4 = W1>>4 = val>>4 74: sxtw x3, w3 ; X3 = sign-extended 32-bit W3 (idx*1.5+1) to 64-bit ; sign-extension is needed here because the value will be used as offset within array, ; and negative offsets are possible in standard C/C++ 78: ubfiz w1, w1, #4, #4 ; W1 = W1<<4 = val<<4 ; store W4 (val>>4) at X0+W2 = array + idx*1.5, i.e., address of left byte: 7c: strb w4, [x0,w2,sxtw] ; load middle byte at X0+X3 = array+idx*1.5+1 80: ldrb w2, [x0,x3] ; W2 = middle byte 84: and w2, w2, #0xf ; W2 = W2&0xF = middle_byte&0xF (high 4 bits in middle byte are dropped) ; merge parts of new version of middle byte: 88: orr w1, w2, w1 ; W1 = W2|W1 = middle_byte&0xF | val<<4 ; store W2 (new middle byte) at X0+X3 = array+idx*1.5+1 8c: strb w1, [x0,x3] 90: ret ; idx is odd, go on: 94: add w4, w2, #0x1 ; W4 = W2+1 = idx*1.5+1, i.e., offset of middle byte 98: adrp x0, page of array 9c: add x0, x0, offset of array within page ; X0 = address of array a0: add w2, w2, #0x2 ; W2 = W2+2 = idx*1.5+2, i.e., offset of right byte a4: sxtw x4, w4 ; X4 = sign-extended 64-bit version of 32-bit W4 ; load at X0+X4 = array+idx*1.5+1: a8: ldrb w3, [x0,x4] ; W3 = middle byte ac: and w3, w3, #0xfffffff0 ; W3 = W3&0xFFFFFFF0 = middle_byte&0xFFFFFFF0, i.e., clear lowest 4 bits b0: orr w3, w3, w1, lsr #8 ; W3 = W3|W1>>8 = middle_byte&0xFFFFFFF0 | val>>8 ; store new version of middle byte at X0+X4=array+idx*1.5+1: b4: strb w3, [x0,x4] ; now store lowest 8 bits of val (in W1) at X0+W2=array+idx*1.5+2, i.e., place of right byte ; SXTW suffix means W2 will be sign-extended to 64-bit value before summing with X0 b8: strb w1, [x0,w2,sxtw] bc: ret

Needless to keep in mind that each instruction after jump/branch instruction is executed first. It's called "branch delay slot" in RISC CPUs lingo. To make things simpler, just swap instructions (mentally) in each instruction pair which is started with branch or jump instruction.

MIPS has no flags (apparently, to simplify data dependencies), so branch instructions (like **BNE**) does both comparison and branching.

There is also GP (Global Pointer) set up code in the function prologue, which can be ignored so far.

get_from_array: ; $4 = idx srl $2,$4,1 ; $2 = $4>>1 = idx>>1 lui $28,%hi(__gnu_local_gp) sll $3,$2,1 ; $3 = $2<<1 = (idx>>1)<<1 = idx&(~1) andi $4,$4,0x1 ; $4 = $4&1 = idx&1 addiu $28,$28,%lo(__gnu_local_gp) ; jump if $4 (idx&1) is not zero (if idx is odd): bne $4,$0,$L6 ; $2 = $3+$2 = idx>>1 + idx&(~1) = idx*1.5 addu $2,$3,$2 ; branch delay slot - this instruction executed before BNE ; idx is even, go on: lw $3,%got(array)($28) ; $3 = array nop addu $2,$3,$2 ; $2 = $3+$2 = array + idx*1.5 ; load byte at $2+0 = array + idx*1.5 (left byte): lbu $3,0($2) ; $3 = left byte ; load byte at $2+1 = array + idx*1.5+1 (middle byte): lbu $2,1($2) ; $2 = middle byte sll $3,$3,4 ; $3 = $3<<4 = left_byte<<4 srl $2,$2,4 ; $2 = $2>>4 = middle_byte>>4 j $31 or $2,$2,$3 ; branch delay slot - this instruction executed before J ; $2 = $2|$3 = middle_byte>>4 | left_byte<<4 ; $2=returned result $L6: ; idx is odd, go on: lw $3,%got(array)($28) ; $3 = array nop addu $2,$3,$2 ; $2 = $3+$2 = array + idx*1.5 ; load byte at $2+1 = array + idx*1.5 + 1 (middle byte) lbu $4,1($2) ; $4 = middle byte ; load byte at $2+1 = array + idx*1.5 + 2 (right byte) lbu $3,2($2) ; $3 = right byte andi $2,$4,0xf ; $2 = $4&0xF = middle_byte&0xF sll $2,$2,8 ; $2 = $2<<8 = middle_byte&0xF << 8 j $31 or $2,$2,$3 ; branch delay slot - this instruction executed before J ; $2 = $2|$3 = middle_byte&0xF << 8 | right byte ; $2=returned result

put_to_array: ; $4=idx ; $5=val srl $2,$4,1 ; $2 = $4>>1 = idx>>1 lui $28,%hi(__gnu_local_gp) sll $3,$2,1 ; $3 = $2<<1 = (idx>>1)<<1 = idx&(~1) andi $4,$4,0x1 ; $4 = $4&1 = idx&1 addiu $28,$28,%lo(__gnu_local_gp) ; jump if $4=idx&1 is not zero (i.e., if idx is odd): bne $4,$0,$L11 addu $2,$3,$2 ; branch delay slot, this instruction is executed before BNE ; $2 = $3+$2 = idx&(~1) + idx>>1 = idx*1.5 ; idx is even, go on lw $3,%got(array)($28) ; $3 = array addiu $4,$2,1 ; $4 = $2+1 = idx*1.5+1, i.e., offset of middle byte in array srl $6,$5,4 ; $6 = $5>>4 = val>>4 addu $2,$3,$2 ; $2 = $3+$2 = array + idx*1.5 (offset of left byte) ; store $6 (val>>4) as left byte: sb $6,0($2) addu $2,$3,$4 ; $2 = $3+$4 = array + idx*1.5 + 1 (absolute address of middle byte) ; load middle byte at $2+0 = array + idx*1.5 + 1 lbu $3,0($2) ; $3 = middle byte andi $5,$5,0xf ; $5 = $5&0xF = val&0xF andi $3,$3,0xf ; $3 = $3&0xF = middle_byte&0xF sll $5,$5,4 ; $5 = $5<<4 = (val&0xF)<<4 or $5,$3,$5 ; $5 = $3|$5 = middle_byte&0xF | (val&0xF)<<4 (new version of middle byte) j $31 ; store $5 (new middle byte) at $2 (array + idx*1.5 + 1) sb $5,0($2) ; branch delay slot, this instruction is executed before J $L11: ; idx is odd, go on lw $4,%got(array)($28) ; $4 = array addiu $3,$2,1 ; $3 = $2+1 = idx*1.5+1 (offset of middle byte) addu $3,$4,$3 ; $3 = $4+$3 = array + idx*1.5+1 ; load middle byte at $3 (array + idx*1.5+1) lbu $6,0($3) ; $6 = middle byte srl $7,$5,8 ; $7 = $5>>8 = val>>8 andi $6,$6,0xf0 ; $6 = $6&0xF0 = middle_byte&0xF0 or $6,$6,$7 ; $6 = $6|$7 = middle_byte&0xF0 | val>>8 addu $2,$4,$2 ; store updated middle byte at $3 (array + idx*1.5+1) sb $6,0($3) j $31 ; store low 8 bits of val at $2+2=idx*1.5+2 (place of right byte) sb $5,2($2) ; branch delay slot, this instruction is executed before J

The real FAT12 table is slightly different: https://en.wikipedia.org/wiki/Design_of_the_FAT_file_system#Cluster_map:

For even elements:

+0 +1 +2 |23|.1|..|..

For odd elements:

+0 +1 +2 |..|3.|12|..

Here are FAT12-related functions in Linux Kernel: fat12_ent_get(), fat12_ent_put().

Nevertheless, I did as I did because values are better visible and recognizable in byte-level GDB dump, for the sake of demonstration.

Perhaps, there could be a way to store data in a such way, so getter/setter functions would be faster. If we would place values in this way:

(Even elements)

+0 +1 +2 |23|1.|..|..

(Odd elements)

+0 +1 +2 |..|.1|23|..

This schema of storing data will allow to eliminate at least one shift operation. As an exercise, you may rework my C/C++ code in that way and see what compiler will produce.

Bit shifts ("<<" and ">>" in C/C++, SHL/SHR/SAL/SAR in x86, LSL/LSR in ARM, SLL/SRL in MIPS) are used to place bit(s) to specific place.

AND operation ("&" in C/C++, AND in x86/ARM) is used to drop unneeded bits, also during isolation.

OR operation ("|" in C/C++, OR in x86/ARM) is used to merge or combine several values into one. One input value must have zero space at the place where another value has its information-caring bits.

ARM64 has new instructions UBFM, UFBIZ to move specific bits from one register to another.

FAT12 is hardly used somewhere nowadays, but if you have space constraints and you need to store values limited to 12 bits, you may consider tightly-packed array in the manner it's done in FAT12 table.

More examples like this (for x86/x64/ARM/ARM64 and MIPS) can be found in my book: //beginners.re/.

→ [list of blog posts, my twitter/facebook]