You probably already heard that if you want to squeeze every last cycle of performance from your programs you need to be able to tweak them in assembly, right?

Well, unless you know what you are doing jumping into assembly won’t help, and in some cases it might even hurt. As I started playing around with the ARM architecture I am experiencing this.

For instance, most ARM processors don’t include a division instruction. If you need it you need to implement it yourself. But wait a minute, division is straight forward ain’t it? Well, not quite. If you implement a naive method it will be VERY slow. For instance, below there’s a program that divide every number from 1 up to 524,288 (2^19, since ARM constants need to have specific values) by 10, accumulating the results in a result variable.

```
.extern printf
.global main
main:
push {ip, lr}
mov r4, #1
mov r5, #0
loop:
bl division
add r5, r5, r0
add r4, r4, #1
cmp r4, #524288
blt loop
ldr r0, =printf1
mov r1, r5
bl printf
mov r0, #0
pop {ip, pc}
division:
mov r0, #0
mov r1, r4
start:
cmp r1, #10
blt finish
subs r1, r1, #10
add r0, r0, #1
b start
finish:
bx lr
printf1: .asciz "result -> %dn"
```

As you can see I am dividing by repeated subtraction. How bad is that in terms of performance? The program above took 55 seconds to run in a ARM7 processor.

After that I wrote the same program in C, compiled for the ARM plataform and run it in the same ARM7 processor. The execution time was 17 milliseconds.

```
#include <stdio.h>
int main(){
int i;
int result = 0;
int x;
for (i=1;i<524288;i++){
x = i / 10;
result += x;
}
printf("result = %dn",result);
return 0;
}
```

The first thing I tried to optimize was the division method. Instead of doing repeat subtraction I decided to use bit shifts. Instead of diving by 10, for instance, we could multiply the number by 205, and then divide by 2048 (which can be done with shifting 11 bits to the right).

I implemented it, and the execution time fell from 55 seconds to 11 milliseconds, which is even faster than the version created by the GCC compiler. The problem is that the result was not correct, since with this method you lose precision.

For instance, 9999 / 10 should be 999. With this method it will be 1000 though (due to the approximation). Another problem that might happen is overflow when you multiply by 205 (though in our case this was not happening).

So how to fix the precision problem?

The easiest way is to use larger numbers at the multiplication and shift phases. You need to know what range of numbers you are working with, though, to avoid the overflow problem.

One trick you can use is to store the result of the multiplication as a 64 bit integer, and then shift the result by 32 bits to the right. In other words, you just need to consider the most significant 32 bits of the 64-bit number as your 32-bit result. Shifting by 32 bits to the right is the same as diving by 2^32 (4,294,967,296), so in order to obtain a division by 10 we need to multiply the number by (2^32)/10 first (429,496,730).

Notice that if you try to move that constant in ARM Assembly the assembler will complain, as it’s not possible to obtain that number with 8 bits and shifts/rotations. To work around you just need to move the lower 16 bits of the number first and then the upper 16 bits.

The result is the code below, which runs in 12 milliseconds, so still faster than GCC’s version.

```
.extern printf
.global main
main:
push {ip, lr}
mov r4, #1
mov r5, #0
loop:
bl division
add r5, r5, r0
add r4, r4, #1
cmp r4, #524288
blt loop
ldr r0, =printf1
mov r1, r5
bl printf
mov r0, #0
pop {ip, pc}
division:
mov r1, r4
movw r2, #:lower16:429496730
movt r2, #:upper16:429496730
smull r3, r0, r1, r2
bx lr
printf1: .asciz "result -> %dn"
```

Keep in mind that the GCC version is slightly more precise (it uses 2^34 in the division), but given our input it makes no difference.