John R. Graham Administrator


Joined: 08 Mar 2005 Posts: 10745 Location: Somewhere over Atlanta, Georgia
|
Posted: Sun Mar 02, 2025 10:49 pm Post subject: Lessons in Optimization on Modern X86-64 Silicon |
|
|
I was just doing a brush-up on some not-too-recently-used skills, so I set out to write a small function in 64-bit X86 assembly language. To start simply, I chose "count the number of 1 bits in a long integer". The calling code is written in C++ and, in fact, I first prototyped the bit counting code in C++ as well.
Here's my trivial calling program (it also runs enough iterations of the function to allow for benchmarking): Code: | #include <iostream>
#include <fmt/core.h>
#include "count-bits.h"
void print_bit_count(long number) {
fmt::print("The number {0:10d} (0x{0:016x}) has {1:2d} one bits.\n", number, count_one_bits(number));
}
int main(void) {
fmt::print("Hello, world!\n");
print_bit_count(1000l);
print_bit_count(5000l);
print_bit_count(1000000l);
print_bit_count(1000000000l);
// Benchmark by calling the count_one_bits() function a hundred million times:
auto result = 0;
for (auto i = 0; i < 100000000; i++) {
result += count_one_bits(1000000000l);
}
return 0;
} | and here's the C++ bit counting function I created: Code: | int count_one_bits(long number) {
int count = 0;
for (int i = 0; i < sizeof(long) * 8; i++) {
if ((number & 1) != 0)
count++;
number >>= 1;
}
return count;
} | Initially, I looked at the assembler produced from the C++ version of the function. Here's a snippet, produced with g++ -O2: Code: | 6 .text
7 .p2align 4
8 .globl _Z14count_one_bitsl
9 .type _Z14count_one_bitsl, @function
10 _Z14count_one_bitsl:
11 .LFB0:
12 .cfi_startproc
13 0000 F30F1EFA endbr64
14 # count-bits.cpp:2: int count_one_bits(long number) {
15 0004 B8400000 movl $64, %eax #, ivtmp_4
15 00
16 # count-bits.cpp:3: int count = 0;
17 0009 31D2 xorl %edx, %edx # <retval>
18 000b 0F1F4400 .p2align 4
18 00
19 .p2align 4
20 .p2align 3
21 .L2:
22 # count-bits.cpp:6: if ((number & 1) != 0)
23 0010 4889F9 movq %rdi, %rcx # number, _1
24 # count-bits.cpp:8: number >>= 1;
25 0013 48D1FF sarq %rdi # number
26 # count-bits.cpp:6: if ((number & 1) != 0)
27 0016 83E101 andl $1, %ecx #, _1
28 # count-bits.cpp:6: if ((number & 1) != 0)
29 0019 01CA addl %ecx, %edx # _1, <retval>
30 # count-bits.cpp:5: for (int i = 0; i < sizeof(long) * 8; i++) {
31 001b 83E801 subl $1, %eax #, ivtmp_4
32 001e 75F0 jne .L2 #,
33 # count-bits.cpp:12: }
34 0020 89D0 movl %edx, %eax # <retval>,
35 0022 C3 ret
36 .cfi_endproc | Actually, that didn't look too bad to me. I see evidence of some cleverness from the optimizer. (I won't even bother to show the assembly language produced by -O0; it's not worth our time.) But I thought I could do better. Here's what I came up with: Code: | 6 .text
7 .p2align 4
8 .globl _Z14count_one_bitsl
9 .type _Z14count_one_bitsl, @function
10 _Z14count_one_bitsl:
11 .LFB0:
12 .cfi_startproc
13 0000 F30F1EFA endbr64
14
15 0004 B9400000 movl $64, %ecx
15 00
16 0009 31C0 xorl %eax, %eax # <retval>
17 000b 0F1F4400 .p2align 4
17 00
18 .L2:
19 0010 48D1EF shrq %rdi # number
20 0013 83D000 adcl $0, %eax # _1, <retval>
21 0016 FFC9 decl %ecx
22 0018 75F6 jnz .L2
23 001a C3 ret
24 .cfi_endproc | I managed to remove a redundant copy and a logical AND from the loop by taking advantage of the carry bit (which is where the SHR instruction places the bit that's shifted off), plus a redundant copy of the sum (but that's outside the loop, so it only makes the code smaller, not particularly faster). I was surprised (but I guess not completely astonished) to learn that my shorter, "better" code was about 13% slower than the g++ -O2 generated code. So the GCC optimizer knows some things I don't. I tried several variations on the code without ever being able to match the performance of the g++-generated code. I don't think I broke any of the obvious rules (foremost among them, don't stall the pipeline unnecessarily). I didn't try unrolling the loop, because g++ didn't, and I thought that would be cheating anyway.
Couple of things I discovered (or, at least, was reminded of) along the way:
- That ".p2align 4" pseudo-op (and the fancy multi-byte NOP it concocts) is important. Aligning the loop on a 16-byte boundary increased the speed of the loop by about 30%.
- At one point I tried replacing the decl/jnz loop counter management with a single "loop" opcode. That was a bad idea, decreasing performance by over 300%. I remember reading somewhere that modern silicon schedulers don't try to optimize the loop instruction (because it's hardly ever used) and boy does that seem to be true!
This isn't really a support request. I just thought some of you might be interested. (Although if you have ideas, I'd love to hear them.)
- John _________________ I can confirm that I have received between 0 and 499 National Security Letters. |
NeddySeagoon Administrator


Joined: 05 Jul 2003 Posts: 54990 Location: 56N 3W
|
Posted: Sun Mar 02, 2025 11:10 pm Post subject: |
|
|
John R. Graham,
I'll have a wee nibble ...
It looks like the loop is executed 64 times regardless of the number you start with.
It's been a long time since I did any x86 assembly. :)
Once you have shifted all the non-zero bits out into the carry bit, you can stop.
The register holding the remains of your number will be zero, so the loop counter is redundant. No need to count bits being shifted. _________________ Regards,
NeddySeagoon
Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail. |
sublogic Guru


Joined: 21 Mar 2022 Posts: 318 Location: Pennsylvania, USA
|
Posted: Mon Mar 03, 2025 2:11 am Post subject: |
|
|
(Independently of Neddy) I tried this: Code: | int count_one_bits(unsigned long number)
{
int count;
for(count= 0; number; number >>= 1) {
count+= number&1;
}
return count;
} |
The assembler looks like this: Code: | _Z14count_one_bitsm:
.LFB2086:
.cfi_startproc
endbr64
xorl %eax, %eax
testq %rdi, %rdi
je .L4
.p2align 4
.p2align 4
.p2align 3
.L3:
movl %edi, %edx
andl $1, %edx
addl %edx, %eax
shrq %rdi
jne .L3
ret
.p2align 4,,10
.p2align 3
.L4:
ret
.cfi_endproc |
It looks like %eax holds "count" and %rdi holds "number". The "for" loop is at label L3.
(Hey, how do you get g++ to copy the source lines in the assembly listing?)
I didn't benchmark it, but it looks more efficient... (I did run your tests and the function is correct.) |
John R. Graham Administrator


Joined: 08 Mar 2005 Posts: 10745 Location: Somewhere over Atlanta, Georgia
|
Posted: Mon Mar 03, 2025 2:24 am Post subject: |
|
|
NeddySeagoon wrote: | John R. Graham,
I'll have a wee nibble ...
It looks like the loop is executed 64 times regardless of the number you start with.
It's been a long time since I did any x86 assembly.
Once you have shifted all the non-zero bits out into the carry bit, you can stop.
The register holding the remains of your number will be zero, so the loop counter is redundant. No need to count bits being shifted. | Interesting. I think that requires an additional test, because I don't believe that SHR sets all the flags. It might make the worst-case performance worse. Let me check. Thanks!
- John _________________ I can confirm that I have received between 0 and 499 National Security Letters. |
John R. Graham Administrator


Joined: 08 Mar 2005 Posts: 10745 Location: Somewhere over Atlanta, Georgia
|
Posted: Mon Mar 03, 2025 2:45 am Post subject: |
|
|
sublogic wrote: | The assembler looks like this: ...
I didn't benchmark it, but it looks more efficient... (I did run your tests and the function is correct.) | Thanks! I'll check. I'm going to run worst case, though, testing against ~0l.
sublogic wrote: | (Hey, how do you get g++ to copy the source lines in the assembly listing?) | That'd be the -fverbose-asm command line option.
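For example, something like this should reproduce both kinds of output shown above; this is just a sketch, and the listing file name is an assumption on my part: Code: | # assembly with interleaved source-line comments (count-bits.s):
g++ -O2 -S -fverbose-asm count-bits.cpp
# numbered listing with offsets and opcode bytes, via the assembler's -a options:
g++ -g -O2 -c -Wa,-adhln=count-bits.lst count-bits.cpp |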
- John _________________ I can confirm that I have received between 0 and 499 National Security Letters. |
sublogic Guru


Joined: 21 Mar 2022 Posts: 318 Location: Pennsylvania, USA
|
Posted: Mon Mar 03, 2025 2:57 am Post subject: |
|
|
John R. Graham wrote: | Thanks! I'll check. I'm going to run worst case, though, testing against ~0l. | Make sure it's using logical shifts, not arithmetic shifts; otherwise it could get very slow.
(I made mine take an unsigned long instead of just long, for that reason.) |
John R. Graham Administrator


Joined: 08 Mar 2005 Posts: 10745 Location: Somewhere over Atlanta, Georgia
|
Posted: Mon Mar 03, 2025 3:04 am Post subject: |
|
|
Indeed. My original hand-crafted code already did.
- John _________________ I can confirm that I have received between 0 and 499 National Security Letters. |
Hu Administrator

Joined: 06 Mar 2007 Posts: 23173
|
Posted: Mon Mar 03, 2025 3:17 am Post subject: |
|
|
If I were to guess, I would speculate that:
- Perhaps sar / shr are relatively slow. In the g++ version, the result of sar isn't needed until the top of the next iteration of the loop, so a pipelined CPU can finish the and/add/sub before it needs the result of that sar to stabilize. In the handmade version, you need the result of shr to feed the very next instruction, so your adc will stall waiting for shr to produce the carried bit.
- Perhaps adc, like loop, is just a slow instruction on modern CPUs, so the g++ version that avoids adc and the carry flag fares better. It might be interesting to replace that adc with a setc %dl ; addl %edx, %eax - although setc itself might be slow. (Remember to zero edx before the loop if you try this, since setc only writes the lowest byte.)
If you have access to a CPU that has little or no pipelining and speculative execution, it could be interesting to benchmark both programs there. That would let us rule in or rule out that g++'s sar benefits from executing in the background while the following instructions are decoded and resolved. |
NeddySeagoon Administrator


Joined: 05 Jul 2003 Posts: 54990 Location: 56N 3W
|
Posted: Mon Mar 03, 2025 10:45 am Post subject: |
|
|
John R. Graham,
You could do it in byte-size chunks with a 256-entry look-up table.
The implementation detail is left as an exercise for the reader. :) _________________ Regards,
NeddySeagoon
Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail. |
John R. Graham Administrator


Joined: 08 Mar 2005 Posts: 10745 Location: Somewhere over Atlanta, Georgia
|
Posted: Fri Mar 07, 2025 2:52 pm Post subject: |
|
|
I ended up determining that the culprit was the ADCx instruction, not the SHRx, so one of Hu's guesses was spot on. It's just waay slower than ADDx. I also learned that SHRx does set the flags so it can gainfully be used at the end of the loop. I ended up beating the -O2 compiler code with my final hand-optimized bit-at-a-time code, but only by a small amount: 1.6%.
These results were with Neddy's recommended "don't continue if there are no more one bits" algorithm, both in C++ and assembler. The final C++ code is: Code: | int count_one_bits(unsigned long number) {
int count = 0;
do {
count += number & 1;
number >>= 1;
} while (number);
return count;
} | Interestingly, "do { } while()" was significantly faster than "while() { }". Come to think of it, that actually makes sense. The latter almost has to have two jumps per loop iteration, whereas the former needs only one.
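For comparison, here is a sketch of what the while () { } form presumably looked like (the name count_one_bits_while is just for illustration): compiled naively it tests at the top and jumps back at the bottom, two jumps per iteration, although an optimizer may rotate the loop to avoid that. Code: | int count_one_bits_while(unsigned long number) {   // hypothetical name, for comparison only
    int count = 0;
    while (number) {         // test at the top of each iteration
        count += number & 1;
        number >>= 1;        // naive codegen then jumps back up to the test
    }
    return count;
} |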
My final hand-crafted assembly code is: Code: | 10 _Z14count_one_bitsm:
11 .LFB0:
12 .cfi_startproc
13 0000 F30F1EFA endbr64
14 0004 31C0 xorl %eax, %eax # <retval>
15 0006 BA010000 movl $1, %edx
15 00
16
17 000b 0F1F4400 .p2align 4
17 00
18 .L2:
19 0010 89F9 movl %edi, %ecx
20 0012 21D1 andl %edx, %ecx
21 0014 01C8 addl %ecx, %eax
22 0016 48D1EF shrq %rdi
23 0019 75F5 jnz .L2
24 001b C3 ret
25 .cfi_endproc
| As my whole purpose is to practice, I'm going to work on the byte-at-a-time algorithm next.
- John _________________ I can confirm that I have received between 0 and 499 National Security Letters. |
NeddySeagoon Administrator


Joined: 05 Jul 2003 Posts: 54990 Location: 56N 3W
|
Posted: Fri Mar 07, 2025 7:52 pm Post subject: |
|
|
John R. Graham,
You use a 64 bit register for the counter. An 8 bit register would be plenty.
Does that help speed wise?
In theory, no; it's a 64-bit CPU. In practice, it's a microcoded RISC machine, so nothing surprises me. _________________ Regards,
NeddySeagoon
Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail. |
John R. Graham Administrator


Joined: 08 Mar 2005 Posts: 10745 Location: Somewhere over Atlanta, Georgia
|
Posted: Fri Mar 07, 2025 8:10 pm Post subject: |
|
|
No, it's only a 32-bit register: eax. 64-bit would be rax. My intuition is that it won't make a difference, but I'll check.
- John _________________ I can confirm that I have received between 0 and 499 National Security Letters. |
Hu Administrator

Joined: 06 Mar 2007 Posts: 23173
|
Posted: Fri Mar 07, 2025 8:37 pm Post subject: |
|
|
If you're still testing and tweaking, I wonder whether you would be better off using a literal andl $1,%ecx instead of andl %edx,%ecx where %edx always contains a 1. The CPU might take a different path when the mask is a directly embedded constant, versus being a register that it must predict is 1.
As regards do { } while: that form removes the initial test+jump that a while would require. This helps code size, and may help in other ways. |
John R. Graham Administrator


Joined: 08 Mar 2005 Posts: 10745 Location: Somewhere over Atlanta, Georgia
|
Posted: Fri Mar 07, 2025 10:44 pm Post subject: |
|
|
Hu wrote: | If you're still testing and tweaking, I wonder whether you would be better off using a literal andl $1,%ecx ... | As it turns out, register to register is fastest, which is a piece of traditional lore that's survived.
Hu wrote: | As regards do { } while: that form removes the initial test+jump that a while would require. This helps code size, and may help in other ways. | Yes, that was kind of what I was saying as well, above. The compiler at -O2 actually came up with the same code that I wrote for the do { } while() minus the preloading of a constant in a register.
- John _________________ I can confirm that I have received between 0 and 499 National Security Letters. |
logrusx Advocate


Joined: 22 Feb 2018 Posts: 2819
|
Posted: Sat Mar 08, 2025 7:48 am Post subject: |
|
|
NeddySeagoon wrote: | John R. Graham,
You use a 64 bit register for the counter. An 8 bit register would be plenty.
Does that help speed wise?
In theory, no; it's a 64-bit CPU. In practice, it's a microcoded RISC machine, so nothing surprises me. |
Even RISC wouldn't benefit. Everything unaligned would require padding and thus extra work.
Best Regards,
Georgi |
John R. Graham Administrator


Joined: 08 Mar 2005 Posts: 10745 Location: Somewhere over Atlanta, Georgia
|
Posted: Sun Mar 09, 2025 8:01 pm Post subject: |
|
|
Completed the byte-at-a-time implementation (another of Neddy's suggestions). C++ code is (bit count table omitted for brevity): Code: | #include "bit-count-table.h"
int count_one_bits(unsigned long number) {
int count = 0;
do {
count += bitCountTable[number & 0xFF];
number >>= 8;
} while (number);
return count;
} | which renders, at -O2, code that's 384% faster than the bit-at-a-time code. Not really surprising, but, oh my gosh, I've discovered the "-masm=intel" option to gcc/g++! Intel assembly language syntax is sooo much nicer than AT&T syntax (in my humble opinion). The generated code is...pretty perfect: Code: | 11 _Z14count_one_bitsm:
12 .LFB0:
13 .cfi_startproc
14 0000 F30F1EFA endbr64
15 # count-bits-byte-at-a-time.cpp:5: int count = 0;
16 0004 31C0 xor eax, eax # <retval>
17 0006 488D0D00 lea rcx, _ZL13bitCountTable[rip] # tmp107,
17 000000
18 000d 0F1F00 .p2align 4
19 .p2align 4
20 .p2align 3
21 .L2:
22 # count-bits-byte-at-a-time.cpp:8: count += bitCountTable[number & 0xFF];
23 0010 400FB6D7 movzx edx, dil # _1, number
24 # count-bits-byte-at-a-time.cpp:8: count += bitCountTable[number & 0xFF];
25 0014 0FB61411 movzx edx, BYTE PTR [rcx+rdx] # _3, bitCountTable[_1]
26 # count-bits-byte-at-a-time.cpp:8: count += bitCountTable[number & 0xFF];
27 0018 01D0 add eax, edx # <retval>, _3
28 # count-bits-byte-at-a-time.cpp:10: } while (number);
29 001a 48C1EF08 shr rdi, 8 # number,
30 001e 75F0 jne .L2 #,
31 # count-bits-byte-at-a-time.cpp:13: }
32 0020 C3 ret
33 .cfi_endproc | I'm not sure I'm going to be able to do any better than that.
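Since the table itself was omitted above, here is a minimal sketch (an assumption on my part, not the actual bit-count-table.h) of one way a 256-entry byte table named bitCountTable could be generated at compile time with C++17: Code: | #include <array>

// Build a table where entry i holds the number of 1 bits in the byte value i.
constexpr std::array<unsigned char, 256> makeBitCountTable() {
    std::array<unsigned char, 256> table{};
    for (int i = 0; i < 256; ++i) {
        int n = i, count = 0;
        while (n) {          // clear the lowest set bit until none remain
            n &= n - 1;
            ++count;
        }
        table[i] = static_cast<unsigned char>(count);
    }
    return table;
}

// constexpr at namespace scope has internal linkage, consistent with the
// _ZL13bitCountTable symbol seen in the generated code above.
constexpr auto bitCountTable = makeBitCountTable(); |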
- John _________________ I can confirm that I have received between 0 and 499 National Security Letters. |
szatox Advocate

Joined: 27 Aug 2013 Posts: 3545
|
Posted: Sun Mar 09, 2025 9:46 pm Post subject: |
|
|
One trick I saw in xxhash was making the code suitable for reordering by the CPU.
In this case it would probably mean unrolling that loop though, so you can create a few blocks of instructions independent of each other.
E.g. instead of: shl 1, and 1, add counter, loop
make it: shl 1, shl 2, shl 3, shl 4, to 4 different output registers, then bitwise-AND each of them with 1 into the next set of 4 registers, then add each resulting bit to its respective counter, and then loop back using the value from shl 4 as input.
I know it's not a perfect example, but should be good enough to explain the idea; always have a few other things to do between calculating a value and using it.
And a part of why it's not perfect is because at this point you could probably MMX the hell out of it  _________________ Make Computing Fun Again |
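A rough C++ rendering of that idea (loosely interpreted; the name count_one_bits_x4 is hypothetical): four accumulators are fed from bits 0 to 3 of the current value, so the adds form separate short chains and each iteration shares only the single shift by 4. Code: | int count_one_bits_x4(unsigned long n) {
    int c0 = 0, c1 = 0, c2 = 0, c3 = 0;   // separate accumulators, separate dependency chains
    do {
        c0 += n & 1;
        c1 += (n >> 1) & 1;
        c2 += (n >> 2) & 1;
        c3 += (n >> 3) & 1;
        n >>= 4;                          // consume four bits per iteration
    } while (n);
    return c0 + c1 + c2 + c3;             // combine the partial counts at the end
} |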
eccerr0r Watchman

Joined: 01 Jul 2004 Posts: 9926 Location: almost Mile High in the USA
|
Posted: Sun Mar 09, 2025 11:44 pm Post subject: |
|
|
What -march are you all using?
I'm surprised gcc and clang aren't optimizing it to one instruction and no loops, as long as you're not using base x86-64.
---
After a bit of research, it looks like if you rewrite your loop to:
Code: | while(number != 0) {
++count;
number &= (number-1);
} |
and target a modern x86_64 microarchitecture, then gcc will figure things out and give you the fastest popcount solution. _________________ Intel Core i7 2700K/Radeon R7 250/24GB DDR3/256GB SSD
What am I supposed watching?
Last edited by eccerr0r on Mon Mar 10, 2025 2:53 am; edited 1 time in total |
John R. Graham Administrator


Joined: 08 Mar 2005 Posts: 10745 Location: Somewhere over Atlanta, Georgia
|
Posted: Mon Mar 10, 2025 2:26 am Post subject: |
|
|
Well, -march=native, which I'm using, on my Intel(R) Xeon(R) W-2135 CPU evaluates to: Code: | -march=skylake-avx512 -mmmx -mpopcnt -msse -msse2 -msse3 -mssse3 -msse4.1 -msse4.2 -mavx -mavx2 -mno-sse4a -mno-fma4 -mno-xop -mfma -mavx512f -mbmi -mbmi2 -maes -mpclmul -mavx512vl -mavx512bw -mavx512dq -mavx512cd -mno-avx512vbmi -mno-avx512ifma -mno-avx512vpopcntdq -mno-avx512vbmi2 -mno-gfni -mno-vpclmulqdq -mno-avx512vnni -mno-avx512bitalg -mno-avx512bf16 -mno-avx512vp2intersect -mno-3dnow -madx -mabm -mno-cldemote -mclflushopt -mclwb -mno-clzero -mcx16 -mno-enqcmd -mf16c -mfsgsbase -mfxsr -mhle -msahf -mno-lwp -mlzcnt -mmovbe -mno-movdir64b -mno-movdiri -mno-mwaitx -mno-pconfig -mno-pku -mprfchw -mno-ptwrite -mno-rdpid -mrdrnd -mrdseed -mrtm -mno-serialize -mno-sgx -mno-sha -mno-shstk -mno-tbm -mno-tsxldtrk -mno-vaes -mno-waitpkg -mno-wbnoinvd -mxsave -mxsavec -mxsaveopt -mxsaves -mno-amx-tile -mno-amx-int8 -mno-amx-bf16 -mno-uintr -mno-hreset -mno-kl -mno-widekl -mno-avxvnni -mno-avx512fp16 -mno-avxifma -mno-avxvnniint8 -mno-avxneconvert -mno-cmpccxadd -mno-amx-fp16 -mno-prefetchi -mno-raoint -mno-amx-complex -mno-avxvnniint16 -mno-sm3 -mno-sha512 -mno-sm4 -mno-apxf -mno-usermsr --param l1-cache-size=32 --param l1-cache-line-size=64 --param l2-cache-size=8448 -mtune=skylake-avx512 -fcf-protection -foffload-options=-fcf-protection=none -dumpbase null | What single instruction would that be? One of the SIMD ones? I really need to learn about those.
- John _________________ I can confirm that I have received between 0 and 499 National Security Letters.
Last edited by John R. Graham on Mon Mar 10, 2025 2:55 am; edited 2 times in total |
eccerr0r Watchman

Joined: 01 Jul 2004 Posts: 9926 Location: almost Mile High in the USA
|
Posted: Mon Mar 10, 2025 2:54 am Post subject: |
|
|
I made a post edit above...
It's not MMX/SIMD, it's just an x86/x86_64 extension, not sure where I found it, but it's a neat instruction indeed.
I think newer ARM also has some analog, mainly because a lot of CPUs now are trying to optimize for encryption and ECC.
Oh, I guess by "modern" I mean something newer than a Core 2 CPU... so even old Nehalems can do it. So I think the excluded x86_64s are Pentium 4s, original K8s, and Core 2 Quads/Duos; there might be more that don't have it. _________________ Intel Core i7 2700K/Radeon R7 250/24GB DDR3/256GB SSD
What am I supposed watching? |
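If you want to check at run time whether the CPU you're actually on reports the instruction, GCC and Clang offer builtins for that; a small sketch (the printed messages are just illustrative): Code: | #include <cstdio>

int main() {
    __builtin_cpu_init();                     // initialize GCC/Clang CPU feature detection
    if (__builtin_cpu_supports("popcnt"))     // query the CPUID-derived feature bits
        std::puts("This CPU reports POPCNT support.");
    else
        std::puts("No POPCNT here.");
    return 0;
} |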
no101 n00b

Joined: 10 Oct 2022 Posts: 15 Location: Piney Woods
eccerr0r Watchman

Joined: 01 Jul 2004 Posts: 9926 Location: almost Mile High in the USA
|
Posted: Mon Mar 10, 2025 5:57 am Post subject: |
|
|
I assumed otherwise because I'd think a good optimizing compiler should have optimized to popcount ... and it appears a specific way of writing the C implementation gets popcount automatically (meaning you don't need to use a __builtin or std::), whereas none of the assembly dumps above did so... hence it was quite interesting that it didn't.
I don't consider an instruction to be an MMX or SIMD instruction if it doesn't use the MMX/SIMD registers; this popcount works on the base architectural registers. I suspect there is a variant that will work on SIMD/MMX registers... or definitely AVX-512 registers... _________________ Intel Core i7 2700K/Radeon R7 250/24GB DDR3/256GB SSD
What am I supposed watching? |
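For reference, the direct spellings alluded to there look like this (std::popcount needs C++20 and <bit>; __builtin_popcountl is the GCC/Clang builtin). With a -march that includes popcnt, both should compile down to the single instruction: Code: | #include <bit>   // std::popcount, C++20

int count_one_bits_std(unsigned long number) {
    return std::popcount(number);          // portable C++20 spelling
}

int count_one_bits_builtin(unsigned long number) {
    return __builtin_popcountl(number);    // GCC/Clang builtin, works pre-C++20
} |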
NeddySeagoon Administrator


Joined: 05 Jul 2003 Posts: 54990 Location: 56N 3W
|
Posted: Mon Mar 10, 2025 10:24 am Post subject: |
|
|
John R. Graham,
Quote: | What single instruction would that be? |
Axioms
All programs can be shortened by one byte.
All programs contain one more bug.
Conclusion
In the limit, all programs can be shortened to a single instruction, which is wrong. :) _________________ Regards,
NeddySeagoon
Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail. |
John R. Graham Administrator


Joined: 08 Mar 2005 Posts: 10745 Location: Somewhere over Atlanta, Georgia
|
Posted: Mon Mar 10, 2025 11:51 am Post subject: |
|
|
eccerr0r wrote: | I assumed otherwise because I'd think a good optimizing compiler should have optimized popcount ... and it appears a specific method of writing the C implementation would popcount automatically (meaning, you don't need to use a __builtin or std::) where none of the assembly dumps did so... hence was quite interesting that it didn't. | Ultimately it did with the loop source code you presented: Code: | 11 _Z14count_one_bitsm:
12 .LFB0:
13 .cfi_startproc
14 0000 F30F1EFA endbr64
15 0004 31C0 xor eax, eax # <retval>
16 0006 F3480FB8 popcnt rax, rdi # <retval>, tmp102
16 C7
17 # count-bits.cpp:25: }
18 000b C3 ret
19 .cfi_endproc
| Interesting that g++ didn't optimize out line 15. But, yes, it seems to be a rather narrow peephole optimization because several similar, reasonable implementations were not recognized. I really like this line of code in your example, by the way: Code: | number &= (number-1); | which clears the rightmost one bit without even knowing where it is. Cool!
- John _________________ I can confirm that I have received between 0 and 499 National Security Letters. |
Hu Administrator

Joined: 06 Mar 2007 Posts: 23173
|
Posted: Mon Mar 10, 2025 2:16 pm Post subject: |
|
|
You may like the related trick that value & (value - 1) is false when value is a power of 2 (or zero), and true otherwise. |
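A one-liner to make that concrete (the name is_power_of_two is hypothetical; zero is excluded explicitly since it also makes the AND come out false): Code: | bool is_power_of_two(unsigned long v) {
    // v & (v - 1) clears the lowest set bit; a power of two has nothing left after that.
    return v != 0 && (v & (v - 1)) == 0;
} |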