View previous topic :: View next topic |
Author |
Message |
John R. Graham Administrator


Joined: 08 Mar 2005 Posts: 10739 Location: Somewhere over Atlanta, Georgia
|
Posted: Sun Mar 02, 2025 10:49 pm Post subject: Lessons in Optimization on Modern X86-64 Silicon |
|
|
I was just doing a brush-up on some not-too-recently-used skills, so I set out to write a small function in 64-bit X86 assembly language. To start simply, I chose "count the number of 1 bits in a long integer"). The calling code is written in C++ and in fact I first prototyped the bit counting code in C++ as well.
Here's my trivial callng program (it also running enough iterations of the function to allow for benchmarking): Code: | #include <iostream>
#include <fmt/core.h>
#include "count-bits.h"
void print_bit_count(long number) {
fmt::print("The number {0:10d} (0x{0:016x}) has {1:2d} one bits.\n", number, count_one_bits(number));
}
int main(void) {
fmt::print("Hello, world!\n");
print_bit_count(1000l);
print_bit_count(5000l);
print_bit_count(1000000l);
print_bit_count(1000000000l);
// Benchmark by calling the count_one_bits() function a billion times:
auto result = 0;
for (auto i = 0; i < 100000000; i++) {
result += count_one_bits(1000000000l);
}
return 0;
} | and here's the C++ bit counting function I created: Code: | int count_one_bits(long number) {
int count = 0;
for (int i = 0; i < sizeof(long) * 8; i++) {
if ((number & 1) != 0)
count++;
number >>= 1;
}
return count;
} | Initially, I looked at the assembler produced from the C++ version of the function. Here's a snippet, produced with g++ -O2: Code: | 6 .text
7 .p2align 4
8 .globl _Z14count_one_bitsl
9 .type _Z14count_one_bitsl, @function
10 _Z14count_one_bitsl:
11 .LFB0:
12 .cfi_startproc
13 0000 F30F1EFA endbr64
14 # count-bits.cpp:2: int count_one_bits(long number) {
15 0004 B8400000 movl $64, %eax #, ivtmp_4
15 00
16 # count-bits.cpp:3: int count = 0;
17 0009 31D2 xorl %edx, %edx # <retval>
18 000b 0F1F4400 .p2align 4
18 00
19 .p2align 4
20 .p2align 3
21 .L2:
22 # count-bits.cpp:6: if ((number & 1) != 0)
23 0010 4889F9 movq %rdi, %rcx # number, _1
24 # count-bits.cpp:8: number >>= 1;
25 0013 48D1FF sarq %rdi # number
26 # count-bits.cpp:6: if ((number & 1) != 0)
27 0016 83E101 andl $1, %ecx #, _1
28 # count-bits.cpp:6: if ((number & 1) != 0)
29 0019 01CA addl %ecx, %edx # _1, <retval>
30 # count-bits.cpp:5: for (int i = 0; i < sizeof(long) * 8; i++) {
31 001b 83E801 subl $1, %eax #, ivtmp_4
32 001e 75F0 jne .L2 #,
33 # count-bits.cpp:12: }
34 0020 89D0 movl %edx, %eax # <retval>,
35 0022 C3 ret
36 .cfi_endproc | Actually, that didn't look too bad to me. I see evidence of some cleverness from the optimizer. (I won't even bother to show the assembly language produced by -O0; it's not worth our time.) But I thought I could do better. Here's what I came up with: Code: | 6 .text
7 .p2align 4
8 .globl _Z14count_one_bitsl
9 .type _Z14count_one_bitsl, @function
10 _Z14count_one_bitsl:
11 .LFB0:
12 .cfi_startproc
13 0000 F30F1EFA endbr64
14
15 0004 B9400000 movl $64, %ecx
15 00
16 0009 31C0 xorl %eax, %eax # <retval>
17 000b 0F1F4400 .p2align 4
17 00
18 .L2:
19 0010 48D1EF shrq %rdi # number
20 0013 83D000 adcl $0, %eax # _1, <retval>
21 0016 FFC9 decl %ecx
22 0018 75F6 jnz .L2
23 001a C3 ret
24 .cfi_endproc | I managed to remove a redundant copy and a logical AND from the loop by talking advantage of the carry bit (where the SHR instructions place the bit that's shifted off) and a redundant copy of the sum (but that's outside the loop so just makes smaller—but not particularly faster—code. I was surprised (but I guess not completely astonished) to learn that my shorter, "better" code was about 13% slower that the g++ -O2 generated code. So the GCC optimizer knows some things I don't. I tried several variations on the code without being able to ever achieve the performance of the g++-generated code. I don't think I broke any of the obvious rules (foremost among them, don't stall the pipeline unnecessarily). I didn't try unrolling the loop, because g++ didn't and I thought that would be cheating, anyway.
Couple of things I discovered (or, at least was reminded of) along the way:- That ".p2align 4" psuedo-op (and the fancy multi-byte NOP it concocts) is important. Aligning the loop on a 16-byte boundary increased the speed of the loop by about 30%.
- At one point I tried replacing the decl/jnz loop counter management with a single "loop" opcode. That was a bad idea, decreasing performance by over 300%. I remember reading somewhere that modern silicon schedulers don't try to optimize the loop instruction (because it's hardly ever used) and boy does that seem to be true!
This isn't really a support request. I just thought some of you might be interested. (Although if you have ideas, I'd love to hear them.)
- John _________________ I can confirm that I have received between 0 and 499 National Security Letters. |
|
Back to top |
|
 |
NeddySeagoon Administrator


Joined: 05 Jul 2003 Posts: 54956 Location: 56N 3W
|
Posted: Sun Mar 02, 2025 11:10 pm Post subject: |
|
|
John R. Graham,
I'll have a wee nibble ...
It looks like the loop is executed 64 times regardless of the number you start with.
It a long time since I did any x86 assy. :)
Once you have shifted all the non zero bit out into the carry bit, you can stop.
The register holding the remains of your number will be zero, so the is redundant. No need to count bits being shifted. _________________ Regards,
NeddySeagoon
Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail. |
|
Back to top |
|
 |
sublogic Guru


Joined: 21 Mar 2022 Posts: 318 Location: Pennsylvania, USA
|
Posted: Mon Mar 03, 2025 2:11 am Post subject: |
|
|
(Independently of Neddy) I tried this: Code: | int count_one_bits(unsigned long number)
{
int count;
for(count= 0; number; number >>= 1) {
count+= number&1;
}
return count;
} |
The assembler looks like this: Code: | _Z14count_one_bitsm:
.LFB2086:
.cfi_startproc
endbr64
xorl %eax, %eax
testq %rdi, %rdi
je .L4
.p2align 4
.p2align 4
.p2align 3
.L3:
movl %edi, %edx
andl $1, %edx
addl %edx, %eax
shrq %rdi
jne .L3
ret
.p2align 4,,10
.p2align 3
.L4:
ret
.cfi_endproc |
It looks like %eax holds "count" and %rdi holds "number". The "for" loop is at label L3.
(Hey, how do you get g++ to copy the source lines in the assembly listing?)
I didn't benchmark it, but it looks more efficient... (I did run your tests and the function is correct.) |
|
Back to top |
|
 |
John R. Graham Administrator


Joined: 08 Mar 2005 Posts: 10739 Location: Somewhere over Atlanta, Georgia
|
Posted: Mon Mar 03, 2025 2:24 am Post subject: |
|
|
NeddySeagoon wrote: | John R. Graham,
I'll have a wee nibble ...
It looks like the loop is executed 64 times regardless of the number you start with.
It a long time since I did any x86 assy.
Once you have shifted all the non zero bit out into the carry bit, you can stop.
The register holding the remains of your number will be zero, so the is redundant. No need to count bits being shifted. | Interesting. I think that requires an additional test because I don't believe that SRL sets all the flags. It might make the worst case performance worse. Let me check. Thanks!
- John _________________ I can confirm that I have received between 0 and 499 National Security Letters. |
|
Back to top |
|
 |
John R. Graham Administrator


Joined: 08 Mar 2005 Posts: 10739 Location: Somewhere over Atlanta, Georgia
|
Posted: Mon Mar 03, 2025 2:45 am Post subject: |
|
|
sublogic wrote: | The assembler looks like this: ...
I didn't benchmark it, but it looks more efficient... (I did run your tests and the function is correct.) | Thanks! I'll check. I'm going to run worst case, though, testing against ~0l.
sublogic wrote: | (Hey, how do you get g++ to copy the source lines in the assembly listing?) | That'd be the --verbose-asm command line option.
- John _________________ I can confirm that I have received between 0 and 499 National Security Letters. |
|
Back to top |
|
 |
sublogic Guru


Joined: 21 Mar 2022 Posts: 318 Location: Pennsylvania, USA
|
Posted: Mon Mar 03, 2025 2:57 am Post subject: |
|
|
John R. Graham wrote: | Thanks! I'll check. I'm going to run worst case, though, testing against ~0l. | Make sure it's using logical shifts, not arithmetic shifts, otherwise it could get very slow
(I made mine take an unsigned long instead of just long, for that reason.) |
|
Back to top |
|
 |
John R. Graham Administrator


Joined: 08 Mar 2005 Posts: 10739 Location: Somewhere over Atlanta, Georgia
|
Posted: Mon Mar 03, 2025 3:04 am Post subject: |
|
|
Indeed. My original hand-crafted code already did.
- John _________________ I can confirm that I have received between 0 and 499 National Security Letters. |
|
Back to top |
|
 |
Hu Administrator

Joined: 06 Mar 2007 Posts: 23147
|
Posted: Mon Mar 03, 2025 3:17 am Post subject: |
|
|
If I were to guess, I would speculate that:- Perhaps sar / shr are relatively slow. In the g++ version, the result of sar isn't needed until the top of the next iteration of the loop, so a pipelined CPU can finish the and/add/sub before it needs the result of that sar to stabilize. In the handmade version, you need the result of shr to feed the very next instruction, so your adc will stall waiting for shr to produce the carried bit.
- Perhaps adc, like loop, is just a slow instruction on modern CPUs, so the g++ version that avoids adc and the carry flag fares better. It might be interesting to replace that adc with a setc %dl ; addl %rdx, %eax - although setc itself might be slow. (Remember to zero edx before the loop if you try this, since setc only writes the lowest byte.)
If you have access to a CPU that has little or no pipelining and speculative execution, it could be interesting to benchmark both programs there. That would let us rule in or rule out that g++'s sar benefits from executing in the background while the following instructions are decoded and resolved. |
|
Back to top |
|
 |
NeddySeagoon Administrator


Joined: 05 Jul 2003 Posts: 54956 Location: 56N 3W
|
Posted: Mon Mar 03, 2025 10:45 am Post subject: |
|
|
John R. Graham,
You could do it in 4 byte size chunks with a 256 entry look up table.
The implementation detail is left as an exercise for the reader :) _________________ Regards,
NeddySeagoon
Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail. |
|
Back to top |
|
 |
|
|
You cannot post new topics in this forum You cannot reply to topics in this forum You cannot edit your posts in this forum You cannot delete your posts in this forum You cannot vote in polls in this forum
|
|