It's Assembly!

Assembly is a strange thing these days. In the earlier days of programming, it was mandatory if a program was going to be fast enough, fit in memory, or work at all. These days, it's one of those things that some people use but won't admit to their friends. To me, this doesn't make sense.

Some people say that with the advent of RISC processing, compiler schedulers have gotten to the point where assembly is no longer needed. Most of these people have never seen the assembly output of a PowerPC compiler. For a matrix concatenation routine, I was able to make the code run about a third faster by rewriting it in assembly. In C, the code looked like this:

//snip...

matrix3[0][0] = matrix1[0][0] * matrix2[0][0];
matrix3[0][0] += matrix1[0][1] * matrix2[1][0];
matrix3[0][0] += matrix1[0][2] * matrix2[2][0];
matrix3[0][0] += matrix1[0][3] * matrix2[3][0];

//snip...

This is the assembly code that the compiler produced at full optimization for this particular block.

lfd fp1,0(r3)
lfd fp0,0(r4)
fmul fp0,fp1,fp0
stfd fp0,0(r5)
lfd fp2,8(r3)
lfd fp1,32(r4)
lfd fp0,0(r5)
fmadd fp0,fp2,fp1,fp0
stfd fp0,0(r5)
lfd fp2,16(r3)
lfd fp1,64(r4)
lfd fp0,0(r5)
fmadd fp0,fp2,fp1,fp0
stfd fp0,0(r5)
lfd fp2,24(r3)
lfd fp1,96(r4)
lfd fp0,0(r5)
fmadd fp0,fp2,fp1,fp0
stfd fp0,0(r5)

Notice that the destination matrix element (fp0) is stored after every step and then loaded right back again before the next one, even though it never needs to leave the register. Here is the code I wrote.

lfd fp2, 0(r3)
lfd fp1, 0(r4)
fmul fp0, fp2, fp1
lfd fp2, 8(r3)
lfd fp1, 32(r4)
fmadd fp0, fp2, fp1, fp0
lfd fp2, 16(r3)
lfd fp1, 64(r4)
fmadd fp0, fp2, fp1, fp0
lfd fp2, 24(r3)
lfd fp1, 96(r4)
fmadd fp0, fp2, fp1, fp0
stfd fp0, 0(r5)

The hand-written version is 13 instructions instead of 19, about a third shorter, because the three redundant stores and the three redundant loads are gone. Instruction count doesn't translate directly into time, since FP multiplies take about 5 cycles and loads/stores typically take three. But in profiling, it ran about 1/3 faster, which would be a major improvement with a large number of objects.
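For comparison, here is a rough sketch of how the same idea might be expressed in C: accumulate each element in a local variable and write it to the destination once. One plausible reason for the extra stores is that the compiler cannot rule out the destination overlapping the source matrices, and a local variable sidesteps that. This is only a sketch under assumptions, not the original routine. The names dot_row_col and matrix_concat are made up for illustration, and the 4x4 layout of doubles is inferred from the 8-byte and 32-byte offsets in the listings above. Whether a given compiler actually keeps the sum in a register is not guaranteed, so check the assembly output.

```c
/* Hypothetical helper (not from the original routine): computes one element
   of the product, keeping the running sum in a local variable so that
   nothing has to be written to memory until the final result is known. */
static double dot_row_col(double a[4][4], double b[4][4], int row, int col)
{
    double sum = a[row][0] * b[0][col];
    sum += a[row][1] * b[1][col];
    sum += a[row][2] * b[2][col];
    sum += a[row][3] * b[3][col];
    return sum;
}

/* Hypothetical wrapper: concatenates two 4x4 matrices into out.
   Assumes out is a different array from a and b. */
static void matrix_concat(double a[4][4], double b[4][4], double out[4][4])
{
    int row, col;

    for (row = 0; row < 4; row++) {
        for (col = 0; col < 4; col++) {
            /* One store per destination element, as in the hand-written
               assembly. */
            out[row][col] = dot_row_col(a, b, row, col);
        }
    }
}
```

Calling matrix_concat(matrix1, matrix2, matrix3) would fill in all sixteen elements. The loops are only there to keep the sketch short; the point is the local sum, which gives the compiler the chance to do what the hand-written assembly does.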

Some time in the future, I'll update this page, but this is it for now. Thanks for reading!