John R. Graham Administrator


Joined: 08 Mar 2005 Posts: 10745 Location: Somewhere over Atlanta, Georgia
|
Posted: Sun Mar 02, 2025 10:49 pm Post subject: Lessons in Optimization on Modern X86-64 Silicon |
|
|
I was just doing a brush-up on some not-too-recently-used skills, so I set out to write a small function in 64-bit X86 assembly language. To start simply, I chose "count the number of 1 bits in a long integer". The calling code is written in C++ and, in fact, I first prototyped the bit counting code in C++ as well.
Here's my trivial calling program (it also runs enough iterations of the function to allow for benchmarking): Code: | #include <iostream>
#include <fmt/core.h>
#include "count-bits.h"
void print_bit_count(long number) {
fmt::print("The number {0:10d} (0x{0:016x}) has {1:2d} one bits.\n", number, count_one_bits(number));
}
int main(void) {
fmt::print("Hello, world!\n");
print_bit_count(1000l);
print_bit_count(5000l);
print_bit_count(1000000l);
print_bit_count(1000000000l);
// Benchmark by calling the count_one_bits() function a hundred million times:
auto result = 0;
for (auto i = 0; i < 100000000; i++) {
result += count_one_bits(1000000000l);
}
return 0;
} | and here's the C++ bit counting function I created: Code: | int count_one_bits(long number) {
int count = 0;
for (int i = 0; i < sizeof(long) * 8; i++) {
if ((number & 1) != 0)
count++;
number >>= 1;
}
return count;
} | Initially, I looked at the assembler produced from the C++ version of the function. Here's a snippet, produced with g++ -O2: Code: | 6 .text
7 .p2align 4
8 .globl _Z14count_one_bitsl
9 .type _Z14count_one_bitsl, @function
10 _Z14count_one_bitsl:
11 .LFB0:
12 .cfi_startproc
13 0000 F30F1EFA endbr64
14 # count-bits.cpp:2: int count_one_bits(long number) {
15 0004 B8400000 movl $64, %eax #, ivtmp_4
15 00
16 # count-bits.cpp:3: int count = 0;
17 0009 31D2 xorl %edx, %edx # <retval>
18 000b 0F1F4400 .p2align 4
18 00
19 .p2align 4
20 .p2align 3
21 .L2:
22 # count-bits.cpp:6: if ((number & 1) != 0)
23 0010 4889F9 movq %rdi, %rcx # number, _1
24 # count-bits.cpp:8: number >>= 1;
25 0013 48D1FF sarq %rdi # number
26 # count-bits.cpp:6: if ((number & 1) != 0)
27 0016 83E101 andl $1, %ecx #, _1
28 # count-bits.cpp:6: if ((number & 1) != 0)
29 0019 01CA addl %ecx, %edx # _1, <retval>
30 # count-bits.cpp:5: for (int i = 0; i < sizeof(long) * 8; i++) {
31 001b 83E801 subl $1, %eax #, ivtmp_4
32 001e 75F0 jne .L2 #,
33 # count-bits.cpp:12: }
34 0020 89D0 movl %edx, %eax # <retval>,
35 0022 C3 ret
36 .cfi_endproc | Actually, that didn't look too bad to me. I see evidence of some cleverness from the optimizer. (I won't even bother to show the assembly language produced by -O0; it's not worth our time.) But I thought I could do better. Here's what I came up with: Code: | 6 .text
7 .p2align 4
8 .globl _Z14count_one_bitsl
9 .type _Z14count_one_bitsl, @function
10 _Z14count_one_bitsl:
11 .LFB0:
12 .cfi_startproc
13 0000 F30F1EFA endbr64
14
15 0004 B9400000 movl $64, %ecx
15 00
16 0009 31C0 xorl %eax, %eax # <retval>
17 000b 0F1F4400 .p2align 4
17 00
18 .L2:
19 0010 48D1EF shrq %rdi # number
20 0013 83D000 adcl $0, %eax # _1, <retval>
21 0016 FFC9 decl %ecx
22 0018 75F6 jnz .L2
23 001a C3 ret
24 .cfi_endproc | I managed to remove a redundant copy and a logical AND from the loop by taking advantage of the carry bit (which is where the SHR instruction places the bit that's shifted off), plus a redundant copy of the sum (but that's outside the loop, so it only makes the code smaller, not particularly faster). I was surprised (but I guess not completely astonished) to learn that my shorter, "better" code was about 13% slower than the g++ -O2 generated code. So the GCC optimizer knows some things I don't. I tried several variations on the code without ever being able to match the performance of the g++-generated code. I don't think I broke any of the obvious rules (foremost among them, don't stall the pipeline unnecessarily). I didn't try unrolling the loop, because g++ didn't, and I thought that would be cheating anyway.
Couple of things I discovered (or, at least, was reminded of) along the way:
- That ".p2align 4" pseudo-op (and the fancy multi-byte NOP it concocts) is important. Aligning the loop on a 16-byte boundary increased the speed of the loop by about 30%.
- At one point I tried replacing the decl/jnz loop counter management with a single "loop" opcode. That was a bad idea, decreasing performance by over 300%. I remember reading somewhere that modern silicon schedulers don't try to optimize the loop instruction (because it's hardly ever used) and boy does that seem to be true!
This isn't really a support request. I just thought some of you might be interested. (Although if you have ideas, I'd love to hear them.)
- John _________________ I can confirm that I have received between 0 and 499 National Security Letters. |
NeddySeagoon Administrator


Joined: 05 Jul 2003 Posts: 54990 Location: 56N 3W
|
Posted: Sun Mar 02, 2025 11:10 pm Post subject: |
|
|
John R. Graham,
I'll have a wee nibble ...
It looks like the loop is executed 64 times regardless of the number you start with.
It's been a long time since I did any x86 assembly. :)
Once you have shifted all the non-zero bits out into the carry bit, you can stop.
The register holding the remains of your number will be zero, so the loop counter is redundant. No need to count bits being shifted. _________________ Regards,
NeddySeagoon
Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail. |
sublogic Guru


Joined: 21 Mar 2022 Posts: 318 Location: Pennsylvania, USA
|
Posted: Mon Mar 03, 2025 2:11 am Post subject: |
|
|
(Independently of Neddy) I tried this: Code: | int count_one_bits(unsigned long number)
{
int count;
for(count= 0; number; number >>= 1) {
count+= number&1;
}
return count;
} |
The assembler looks like this: Code: | _Z14count_one_bitsm:
.LFB2086:
.cfi_startproc
endbr64
xorl %eax, %eax
testq %rdi, %rdi
je .L4
.p2align 4
.p2align 4
.p2align 3
.L3:
movl %edi, %edx
andl $1, %edx
addl %edx, %eax
shrq %rdi
jne .L3
ret
.p2align 4,,10
.p2align 3
.L4:
ret
.cfi_endproc |
It looks like %eax holds "count" and %rdi holds "number". The "for" loop is at label L3.
(Hey, how do you get g++ to copy the source lines in the assembly listing?)
I didn't benchmark it, but it looks more efficient... (I did run your tests and the function is correct.) |
John R. Graham Administrator


Joined: 08 Mar 2005 Posts: 10745 Location: Somewhere over Atlanta, Georgia
|
Posted: Mon Mar 03, 2025 2:24 am Post subject: |
|
|
NeddySeagoon wrote: | John R. Graham,
I'll have a wee nibble ...
It looks like the loop is executed 64 times regardless of the number you start with.
It's been a long time since I did any x86 assembly.
Once you have shifted all the non-zero bits out into the carry bit, you can stop.
The register holding the remains of your number will be zero, so the loop counter is redundant. No need to count bits being shifted. | Interesting. I think that requires an additional test, because I don't believe that SHR sets all the flags. It might make the worst-case performance worse. Let me check. Thanks!
- John _________________ I can confirm that I have received between 0 and 499 National Security Letters. |
John R. Graham Administrator


Joined: 08 Mar 2005 Posts: 10745 Location: Somewhere over Atlanta, Georgia
|
Posted: Mon Mar 03, 2025 2:45 am Post subject: |
|
|
sublogic wrote: | The assembler looks like this: ...
I didn't benchmark it, but it looks more efficient... (I did run your tests and the function is correct.) | Thanks! I'll check. I'm going to run worst case, though, testing against ~0l.
sublogic wrote: | (Hey, how do you get g++ to copy the source lines in the assembly listing?) | That'd be the -fverbose-asm command line option.
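For example, something like this should reproduce both kinds of output shown above; this is just a sketch, and the listing file name is an assumption on my part: Code: | # assembly with interleaved source-line comments (count-bits.s):
g++ -O2 -S -fverbose-asm count-bits.cpp
# numbered listing with offsets and opcode bytes, via the assembler's -a options:
g++ -g -O2 -c -Wa,-adhln=count-bits.lst count-bits.cpp |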
- John _________________ I can confirm that I have received between 0 and 499 National Security Letters. |
sublogic Guru


Joined: 21 Mar 2022 Posts: 318 Location: Pennsylvania, USA
|
Posted: Mon Mar 03, 2025 2:57 am Post subject: |
|
|
John R. Graham wrote: | Thanks! I'll check. I'm going to run worst case, though, testing against ~0l. | Make sure it's using logical shifts, not arithmetic shifts; otherwise it could get very slow.
(I made mine take an unsigned long instead of just long, for that reason.) |
John R. Graham Administrator


Joined: 08 Mar 2005 Posts: 10745 Location: Somewhere over Atlanta, Georgia
|
Posted: Mon Mar 03, 2025 3:04 am Post subject: |
|
|
Indeed. My original hand-crafted code already did.
- John _________________ I can confirm that I have received between 0 and 499 National Security Letters. |
Hu Administrator

Joined: 06 Mar 2007 Posts: 23173
|
Posted: Mon Mar 03, 2025 3:17 am Post subject: |
|
|
If I were to guess, I would speculate that:
- Perhaps sar / shr are relatively slow. In the g++ version, the result of sar isn't needed until the top of the next iteration of the loop, so a pipelined CPU can finish the and/add/sub before it needs the result of that sar to stabilize. In the handmade version, you need the result of shr to feed the very next instruction, so your adc will stall waiting for shr to produce the carried bit.
- Perhaps adc, like loop, is just a slow instruction on modern CPUs, so the g++ version that avoids adc and the carry flag fares better. It might be interesting to replace that adc with a setc %dl ; addl %edx, %eax - although setc itself might be slow. (Remember to zero edx before the loop if you try this, since setc only writes the lowest byte.)
If you have access to a CPU that has little or no pipelining and speculative execution, it could be interesting to benchmark both programs there. That would let us rule in or rule out that g++'s sar benefits from executing in the background while the following instructions are decoded and resolved. |
NeddySeagoon Administrator


Joined: 05 Jul 2003 Posts: 54990 Location: 56N 3W
|
Posted: Mon Mar 03, 2025 10:45 am Post subject: |
|
|
John R. Graham,
You could do it in byte-size chunks with a 256-entry look-up table.
The implementation detail is left as an exercise for the reader. :) _________________ Regards,
NeddySeagoon
Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail. |
John R. Graham Administrator


Joined: 08 Mar 2005 Posts: 10745 Location: Somewhere over Atlanta, Georgia
|
Posted: Fri Mar 07, 2025 2:52 pm Post subject: |
|
|
I ended up determining that the culprit was the ADCx instruction, not the SHRx, so one of Hu's guesses was spot on. It's just waay slower than ADDx. I also learned that SHRx does set the flags so it can gainfully be used at the end of the loop. I ended up beating the -O2 compiler code with my final hand-optimized bit-at-a-time code, but only by a small amount: 1.6%.
These results were with Neddy's recommended "don't continue if there are no more one bits" algorithm, both in C++ and assembler. The final C++ code is: Code: | int count_one_bits(unsigned long number) {
int count = 0;
do {
count += number & 1;
number >>= 1;
} while (number);
return count;
} | Interestingly, "do { } while()" was significantly faster than "while() { }". Come to think of it, that actually makes sense. The latter almost has to have two jumps per loop iteration, whereas the former needs only one.
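For comparison, here is a sketch of what the while () { } form presumably looked like (the name count_one_bits_while is just for illustration): compiled naively it tests at the top and jumps back at the bottom, two jumps per iteration, although an optimizer may rotate the loop to avoid that. Code: | int count_one_bits_while(unsigned long number) {   // hypothetical name, for comparison only
    int count = 0;
    while (number) {         // test at the top of each iteration
        count += number & 1;
        number >>= 1;        // naive codegen then jumps back up to the test
    }
    return count;
} |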
My final hand-crafted assembly code is: Code: | 10 _Z14count_one_bitsm:
11 .LFB0:
12 .cfi_startproc
13 0000 F30F1EFA endbr64
14 0004 31C0 xorl %eax, %eax # <retval>
15 0006 BA010000 movl $1, %edx
15 00
16
17 000b 0F1F4400 .p2align 4
17 00
18 .L2:
19 0010 89F9 movl %edi, %ecx
20 0012 21D1 andl %edx, %ecx
21 0014 01C8 addl %ecx, %eax
22 0016 48D1EF shrq %rdi
23 0019 75F5 jnz .L2
24 001b C3 ret
25 .cfi_endproc
| As my whole purpose is to practice, I'm going to work on the byte-at-a-time algorithm next.
- John _________________ I can confirm that I have received between 0 and 499 National Security Letters. |
NeddySeagoon Administrator


Joined: 05 Jul 2003 Posts: 54990 Location: 56N 3W
|
Posted: Fri Mar 07, 2025 7:52 pm Post subject: |
|
|
John R. Graham,
You use a 64 bit register for the counter. An 8 bit register would be plenty.
Does that help speed wise?
In theory, no; it's a 64-bit CPU. In practice, it's a microcoded RISC machine, so nothing surprises me. _________________ Regards,
NeddySeagoon
Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail. |
John R. Graham Administrator


Joined: 08 Mar 2005 Posts: 10745 Location: Somewhere over Atlanta, Georgia
|
Posted: Fri Mar 07, 2025 8:10 pm Post subject: |
|
|
No, it's only a 32-bit register: eax. 64-bit would be rax. My intuition is that it won't make a difference, but I'll check.
- John _________________ I can confirm that I have received between 0 and 499 National Security Letters. |
Hu Administrator

Joined: 06 Mar 2007 Posts: 23173
|
Posted: Fri Mar 07, 2025 8:37 pm Post subject: |
|
|
If you're still testing and tweaking, I wonder whether you would be better off using a literal andl $1,%ecx instead of andl %edx,%ecx where %edx always contains a 1. The CPU might take a different path when the mask is a directly embedded constant, versus being a register that it must predict is 1.
As regards do { } while: that form removes the initial test+jump that a while would require. This helps code size, and may help in other ways. |
John R. Graham Administrator


Joined: 08 Mar 2005 Posts: 10745 Location: Somewhere over Atlanta, Georgia
|
Posted: Fri Mar 07, 2025 10:44 pm Post subject: |
|
|
Hu wrote: | If you're still testing and tweaking, I wonder whether you would be better off using a literal andl $1,%ecx ... | As it turns out, register to register is fastest, which is a piece of traditional lore that's survived.
Hu wrote: | As regards do { } while: that form removes the initial test+jump that a while would require. This helps code size, and may help in other ways. | Yes, that was kind of what I was saying as well, above. The compiler at -O2 actually came up with the same code that I wrote for the do { } while() minus the preloading of a constant in a register.
- John _________________ I can confirm that I have received between 0 and 499 National Security Letters. |
logrusx Advocate


Joined: 22 Feb 2018 Posts: 2819
|
Posted: Sat Mar 08, 2025 7:48 am Post subject: |
|
|
NeddySeagoon wrote: | John R. Graham,
You use a 64 bit register for the counter. An 8 bit register would be plenty.
Does that help speed wise?
In theory, no; it's a 64-bit CPU. In practice, it's a microcoded RISC machine, so nothing surprises me. |
Even RISC wouldn't benefit. Everything unaligned would require padding and thus extra work.
Best Regards,
Georgi |
John R. Graham Administrator


Joined: 08 Mar 2005 Posts: 10745 Location: Somewhere over Atlanta, Georgia
|
Posted: Sun Mar 09, 2025 8:01 pm Post subject: |
|
|
Completed the byte-at-a-time implementation (another of Neddy's suggestions). C++ code is (bit count table omitted for brevity): Code: | #include "bit-count-table.h"
int count_one_bits(unsigned long number) {
int count = 0;
do {
count += bitCountTable[number & 0xFF];
number >>= 8;
} while (number);
return count;
} | which renders, at -O2, code that's 384% faster than the bit-at-a-time code. Not really surprising, but, oh my gosh, I've discovered the "-masm=intel" option to gcc/g++! Intel assembly language syntax is sooo much nicer than AT&T syntax (in my humble opinion). The generated code is...pretty perfect: Code: | 11 _Z14count_one_bitsm:
12 .LFB0:
13 .cfi_startproc
14 0000 F30F1EFA endbr64
15 # count-bits-byte-at-a-time.cpp:5: int count = 0;
16 0004 31C0 xor eax, eax # <retval>
17 0006 488D0D00 lea rcx, _ZL13bitCountTable[rip] # tmp107,
17 000000
18 000d 0F1F00 .p2align 4
19 .p2align 4
20 .p2align 3
21 .L2:
22 # count-bits-byte-at-a-time.cpp:8: count += bitCountTable[number & 0xFF];
23 0010 400FB6D7 movzx edx, dil # _1, number
24 # count-bits-byte-at-a-time.cpp:8: count += bitCountTable[number & 0xFF];
25 0014 0FB61411 movzx edx, BYTE PTR [rcx+rdx] # _3, bitCountTable[_1]
26 # count-bits-byte-at-a-time.cpp:8: count += bitCountTable[number & 0xFF];
27 0018 01D0 add eax, edx # <retval>, _3
28 # count-bits-byte-at-a-time.cpp:10: } while (number);
29 001a 48C1EF08 shr rdi, 8 # number,
30 001e 75F0 jne .L2 #,
31 # count-bits-byte-at-a-time.cpp:13: }
32 0020 C3 ret
33 .cfi_endproc | I'm not sure I'm going to be able to do any better than that.
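Since the table itself was omitted above, here is a minimal sketch (an assumption on my part, not the actual bit-count-table.h) of one way a 256-entry byte table named bitCountTable could be generated at compile time with C++17: Code: | #include <array>

// Build a table where entry i holds the number of 1 bits in the byte value i.
constexpr std::array<unsigned char, 256> makeBitCountTable() {
    std::array<unsigned char, 256> table{};
    for (int i = 0; i < 256; ++i) {
        int n = i, count = 0;
        while (n) {          // clear the lowest set bit until none remain
            n &= n - 1;
            ++count;
        }
        table[i] = static_cast<unsigned char>(count);
    }
    return table;
}

// constexpr at namespace scope has internal linkage, consistent with the
// _ZL13bitCountTable symbol seen in the generated code above.
constexpr auto bitCountTable = makeBitCountTable(); |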
- John _________________ I can confirm that I have received between 0 and 499 National Security Letters. |
szatox Advocate

Joined: 27 Aug 2013 Posts: 3545
|
Posted: Sun Mar 09, 2025 9:46 pm Post subject: |
|
|
One trick I saw in xxhash was making the code suitable for reordering by the CPU.
In this case it would probably mean unrolling that loop though, so you can create a few blocks of instructions independent of each other.
E.g. instead of: shl 1, and 1, add counter, loop
make it: shl 1, shl 2, shl 3, shl 4, to 4 different output registers, then bitwise-AND each of them with 1 into the next set of 4 registers, then add each resulting bit to its respective counter, and then loop back using the value from shl 4 as input.
I know it's not a perfect example, but should be good enough to explain the idea; always have a few other things to do between calculating a value and using it.
And a part of why it's not perfect is because at this point you could probably MMX the hell out of it  _________________ Make Computing Fun Again |
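A rough C++ rendering of that idea (loosely interpreted; the name count_one_bits_x4 is hypothetical): four accumulators are fed from bits 0 to 3 of the current value, so the adds form separate short chains and each iteration shares only the single shift by 4. Code: | int count_one_bits_x4(unsigned long n) {
    int c0 = 0, c1 = 0, c2 = 0, c3 = 0;   // separate accumulators, separate dependency chains
    do {
        c0 += n & 1;
        c1 += (n >> 1) & 1;
        c2 += (n >> 2) & 1;
        c3 += (n >> 3) & 1;
        n >>= 4;                          // consume four bits per iteration
    } while (n);
    return c0 + c1 + c2 + c3;             // combine the partial counts at the end
} |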
eccerr0r Watchman

Joined: 01 Jul 2004 Posts: 9926 Location: almost Mile High in the USA
|
Posted: Sun Mar 09, 2025 11:44 pm Post subject: |
|
|
What -march are you all using?
I'm surprised gcc and clang aren't optimizing it to one instruction and no loops, as long as you're not using base x86-64.
---
After a bit of research, it looks like if you rewrite your loop to:
Code: | while(number != 0) {
++count;
number &= (number-1);
} |
and target a modern x86_64 microarchitecture, then gcc will figure things out and give you the fastest popcount solution. _________________ Intel Core i7 2700K/Radeon R7 250/24GB DDR3/256GB SSD
What am I supposed watching?
Last edited by eccerr0r on Mon Mar 10, 2025 2:53 am; edited 1 time in total |
John R. Graham Administrator


Joined: 08 Mar 2005 Posts: 10745 Location: Somewhere over Atlanta, Georgia
|
Posted: Mon Mar 10, 2025 2:26 am Post subject: |
|
|
Well, -march=native, which I'm using, on my Intel(R) Xeon(R) W-2135 CPU evaluates to: Code: | -march=skylake-avx512 -mmmx -mpopcnt -msse -msse2 -msse3 -mssse3 -msse4.1 -msse4.2 -mavx -mavx2 -mno-sse4a -mno-fma4 -mno-xop -mfma -mavx512f -mbmi -mbmi2 -maes -mpclmul -mavx512vl -mavx512bw -mavx512dq -mavx512cd -mno-avx512vbmi -mno-avx512ifma -mno-avx512vpopcntdq -mno-avx512vbmi2 -mno-gfni -mno-vpclmulqdq -mno-avx512vnni -mno-avx512bitalg -mno-avx512bf16 -mno-avx512vp2intersect -mno-3dnow -madx -mabm -mno-cldemote -mclflushopt -mclwb -mno-clzero -mcx16 -mno-enqcmd -mf16c -mfsgsbase -mfxsr -mhle -msahf -mno-lwp -mlzcnt -mmovbe -mno-movdir64b -mno-movdiri -mno-mwaitx -mno-pconfig -mno-pku -mprfchw -mno-ptwrite -mno-rdpid -mrdrnd -mrdseed -mrtm -mno-serialize -mno-sgx -mno-sha -mno-shstk -mno-tbm -mno-tsxldtrk -mno-vaes -mno-waitpkg -mno-wbnoinvd -mxsave -mxsavec -mxsaveopt -mxsaves -mno-amx-tile -mno-amx-int8 -mno-amx-bf16 -mno-uintr -mno-hreset -mno-kl -mno-widekl -mno-avxvnni -mno-avx512fp16 -mno-avxifma -mno-avxvnniint8 -mno-avxneconvert -mno-cmpccxadd -mno-amx-fp16 -mno-prefetchi -mno-raoint -mno-amx-complex -mno-avxvnniint16 -mno-sm3 -mno-sha512 -mno-sm4 -mno-apxf -mno-usermsr --param l1-cache-size=32 --param l1-cache-line-size=64 --param l2-cache-size=8448 -mtune=skylake-avx512 -fcf-protection -foffload-options=-fcf-protection=none -dumpbase null | What single instruction would that be? One of the SIMD ones? I really need to learn about those.
- John _________________ I can confirm that I have received between 0 and 499 National Security Letters.
Last edited by John R. Graham on Mon Mar 10, 2025 2:55 am; edited 2 times in total |
eccerr0r Watchman

Joined: 01 Jul 2004 Posts: 9926 Location: almost Mile High in the USA
|
Posted: Mon Mar 10, 2025 2:54 am Post subject: |
|
|
I made a post edit above...
It's not MMX/SIMD, it's just an x86/x86_64 extension, not sure where I found it, but it's a neat instruction indeed.
I think newer ARM also has some analog, mainly because a lot of CPUs now are trying to optimize for encryption and ECC.
Oh, I guess by "modern" I mean something newer than a Core 2 CPU... so even old Nehalems can do it. So I think the excluded x86_64s are Pentium 4s, original K8s, and Core 2 Quads/Duos; there might be more that don't have it. _________________ Intel Core i7 2700K/Radeon R7 250/24GB DDR3/256GB SSD
What am I supposed watching? |
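If you want to check at run time whether the CPU you're actually on reports the instruction, GCC and Clang offer builtins for that; a small sketch (the printed messages are just illustrative): Code: | #include <cstdio>

int main() {
    __builtin_cpu_init();                     // initialize GCC/Clang CPU feature detection
    if (__builtin_cpu_supports("popcnt"))     // query the CPUID-derived feature bits
        std::puts("This CPU reports POPCNT support.");
    else
        std::puts("No POPCNT here.");
    return 0;
} |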
no101 n00b

Joined: 10 Oct 2022 Posts: 15 Location: Piney Woods
eccerr0r Watchman

Joined: 01 Jul 2004 Posts: 9926 Location: almost Mile High in the USA
|
Posted: Mon Mar 10, 2025 5:57 am Post subject: |
|
|
I assumed otherwise because I'd think a good optimizing compiler should have optimized to popcount ... and it appears a specific way of writing the C implementation gets popcount automatically (meaning you don't need to use a __builtin or std::), whereas none of the assembly dumps above did so... hence it was quite interesting that it didn't.
I don't consider an instruction to be an MMX or SIMD instruction if it doesn't use the MMX/SIMD registers; this popcount works on the base architectural registers. I suspect there is a variant that will work on SIMD/MMX registers... or definitely AVX-512 registers... _________________ Intel Core i7 2700K/Radeon R7 250/24GB DDR3/256GB SSD
What am I supposed watching? |
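For reference, the direct spellings alluded to there look like this (std::popcount needs C++20 and <bit>; __builtin_popcountl is the GCC/Clang builtin). With a -march that includes popcnt, both should compile down to the single instruction: Code: | #include <bit>   // std::popcount, C++20

int count_one_bits_std(unsigned long number) {
    return std::popcount(number);          // portable C++20 spelling
}

int count_one_bits_builtin(unsigned long number) {
    return __builtin_popcountl(number);    // GCC/Clang builtin, works pre-C++20
} |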
NeddySeagoon Administrator


Joined: 05 Jul 2003 Posts: 54990 Location: 56N 3W
|
Posted: Mon Mar 10, 2025 10:24 am Post subject: |
|
|
John R. Graham,
Quote: | What single instruction would that be? |
Axioms
All programs can be shortened by one byte.
All programs contain one more bug.
Conclusion
In the limit, all programs can be shortened to a single instruction, which is wrong. :) _________________ Regards,
NeddySeagoon
Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail. |
John R. Graham Administrator


Joined: 08 Mar 2005 Posts: 10745 Location: Somewhere over Atlanta, Georgia
|
Posted: Mon Mar 10, 2025 11:51 am Post subject: |
|
|
eccerr0r wrote: | I assumed otherwise because I'd think a good optimizing compiler should have optimized popcount ... and it appears a specific method of writing the C implementation would popcount automatically (meaning, you don't need to use a __builtin or std::) where none of the assembly dumps did so... hence was quite interesting that it didn't. | Ultimately it did with the loop source code you presented: Code: | 11 _Z14count_one_bitsm:
12 .LFB0:
13 .cfi_startproc
14 0000 F30F1EFA endbr64
15 0004 31C0 xor eax, eax # <retval>
16 0006 F3480FB8 popcnt rax, rdi # <retval>, tmp102
16 C7
17 # count-bits.cpp:25: }
18 000b C3 ret
19 .cfi_endproc
| Interesting that g++ didn't optimize out line 15. But, yes, it seems to be a rather narrow peephole optimization because several similar, reasonable implementations were not recognized. I really like this line of code in your example, by the way: Code: | number &= (number-1); | which clears the rightmost one bit without even knowing where it is. Cool!
- John _________________ I can confirm that I have received between 0 and 499 National Security Letters. |
Hu Administrator

Joined: 06 Mar 2007 Posts: 23173
|
Posted: Mon Mar 10, 2025 2:16 pm Post subject: |
|
|
You may like the related trick that value & (value - 1) is false when value is a power of 2 (or zero), and true otherwise. |
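A one-liner to make that concrete (the name is_power_of_two is hypothetical; zero is excluded explicitly since it also makes the AND come out false): Code: | bool is_power_of_two(unsigned long v) {
    // v & (v - 1) clears the lowest set bit; a power of two has nothing left after that.
    return v != 0 && (v & (v - 1)) == 0;
} |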