Lessons in Optimization on Modern X86-64 Silicon

John R. Graham

I was just doing a brush-up on some not-too-recently-used skills, so I set out to write a small function in 64-bit X86 assembly language. To start simply, I chose "count the number of 1 bits in a long integer". The calling code is written in C++, and in fact I first prototyped the bit-counting code in C++ as well.

Here's my trivial calling program (it also runs enough iterations of the function to allow for benchmarking):

NeddySeagoon · Posted: Sun Mar 02, 2025 11:10 pm Post subject:

John R. Graham,

I'll have a wee nibble ...

It looks like the loop is executed 64 times regardless of the number you start with.
It's a long time since I did any x86 assy. :)

Once you have shifted all the nonzero bits out into the carry bit, you can stop.
The register holding the remains of your number will be zero, so the
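
A minimal C++ sketch of this early-exit idea, for illustration only: it is not the original assembly, and the function name count_bits_early_exit is hypothetical. The loop stops as soon as the remaining value is zero instead of always running 64 times.

```cpp
#include <cstdint>

// Hypothetical helper: counts 1 bits, stopping once no 1 bits remain.
unsigned count_bits_early_exit(std::uint64_t value) {
    unsigned count = 0;
    while (value != 0) {  // nothing left to count once the remainder is zero
        // Lowest bit: the one a shr would move into the carry flag.
        count += static_cast<unsigned>(value & 1u);
        value >>= 1;      // logical shift right on an unsigned value, like shr
    }
    return count;
}
```

For a value whose highest set bit is bit k, the loop runs k + 1 times, so small numbers finish early while a value with bit 63 set still takes all 64 iterations.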

sublogic · Posted: Mon Mar 03, 2025 2:11 am Post subject:

(Independently of Neddy) I tried this:

John R. Graham · Posted: Mon Mar 03, 2025 2:24 am Post subject:

John R. Graham · Posted: Mon Mar 03, 2025 2:45 am Post subject:

sublogic · Posted: Mon Mar 03, 2025 2:57 am Post subject:

John R. Graham · Posted: Mon Mar 03, 2025 3:04 am Post subject:

Indeed.

My original hand-crafted code already did.

- John
_________________
I can confirm that I have received between 0 and 499 National Security Letters.

Hu · Administrator Joined: 06 Mar 2007 Posts: 23146

If I were to guess, I would speculate that:

Perhaps sar / shr are relatively slow. In the g++ version, the result of sar isn't needed until the top of the next iteration of the loop, so a pipelined CPU can finish the and/add/sub before it needs the result of that sar to stabilize. In the handmade version, you need the result of shr to feed the very next instruction, so your adc will stall waiting for shr to produce the carried bit.
Perhaps adc, like loop, is just a slow instruction on modern CPUs, so the g++ version that avoids adc and the carry flag fares better. It might be interesting to replace that adc with a setc %dl ; addl %edx, %eax - although setc itself might be slow. (Remember to zero edx before the loop if you try this, since setc only writes the lowest byte.)
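
To make the first guess concrete, here is a rough C++ rendering of the loop shape it describes. This is an assumption based only on the instruction mix mentioned (sar, and, add, sub), not the actual g++ output, and the name count_bits_fixed_loop is hypothetical. The point is that the masked low bit is consumed in the current iteration, while the shifted value is only needed at the top of the next one.

```cpp
#include <cstdint>

// Hypothetical rendering of a fixed-trip-count loop built from and/add/sar/sub.
unsigned count_bits_fixed_loop(std::int64_t value) {
    unsigned count = 0;
    for (int remaining = 64; remaining != 0; --remaining) {  // sub + branch
        // and + add: reads the value as it stands in this iteration.
        count += static_cast<unsigned>(value & 1);
        // sar: the shifted result is first read in the next iteration.
        // (Right-shifting a negative signed value is arithmetic on GCC.)
        value >>= 1;
    }
    return count;
}
```

Because the loop always runs exactly 64 times, the sign bits that an arithmetic shift drags in are never counted: iteration i only ever looks at original bit i.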

If you have access to a CPU that has little or no pipelining and speculative execution, it could be interesting to benchmark both programs there. That would let us rule in or rule out that g++'s sar benefits from executing in the background while the following instructions are decoded and resolved.

NeddySeagoon · Posted: Mon Mar 03, 2025 10:45 am Post subject:

John R. Graham,

You could do it in byte-sized chunks (eight of them for a 64-bit long) with a 256-entry lookup table.
The implementation detail is left as an exercise for the reader :)
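
One possible take on that exercise, as a hedged C++ sketch rather than anything posted in the thread: the helper names make_bit_count_table and count_bits_table are hypothetical, and a 64-bit input is assumed, which means eight table lookups.

```cpp
#include <array>
#include <cstdint>

// Builds a 256-entry table: entry i holds the number of 1 bits in byte i.
std::array<std::uint8_t, 256> make_bit_count_table() {
    std::array<std::uint8_t, 256> table{};  // entry 0 starts at zero
    for (unsigned i = 1; i < 256; ++i) {
        // bits(i) = lowest bit of i + bits(i / 2); entry i / 2 is already filled.
        table[i] = static_cast<std::uint8_t>((i & 1u) + table[i >> 1]);
    }
    return table;
}

// Hypothetical helper: sums the table entries for each of the eight bytes.
unsigned count_bits_table(std::uint64_t value) {
    static const std::array<std::uint8_t, 256> table = make_bit_count_table();
    unsigned count = 0;
    for (int byte = 0; byte < 8; ++byte) {
        count += table[value & 0xFFu];  // look up the lowest byte
        value >>= 8;                    // move the next byte into place
    }
    return count;
}
```

The table costs 256 bytes and is built once on first use; whether the lookups beat a bit-at-a-time loop on a given CPU would need to be measured.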
_________________
Regards,

NeddySeagoon

Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail.