Gentoo Forums
Lessons in Optimization on Modern X86-64 Silicon
John R. Graham
Administrator


Joined: 08 Mar 2005
Posts: 10744
Location: Somewhere over Atlanta, Georgia

PostPosted: Sun Mar 02, 2025 10:49 pm    Post subject: Lessons in Optimization on Modern X86-64 Silicon Reply with quote

I was just doing a brush-up on some not-too-recently-used skills, so I set out to write a small function in 64-bit X86 assembly language. To start simply, I chose "count the number of 1 bits in a long integer". The calling code is written in C++ and in fact I first prototyped the bit counting code in C++ as well.

Here's my trivial calling program (it also runs enough iterations of the function to allow for benchmarking):
Code:
#include <iostream>
#include <fmt/core.h>
#include "count-bits.h"

void print_bit_count(long number) {
   fmt::print("The number {0:10d} (0x{0:016x}) has {1:2d} one bits.\n", number, count_one_bits(number));
}

int main(void) {
   fmt::print("Hello, world!\n");

   print_bit_count(1000l);
   print_bit_count(5000l);
   print_bit_count(1000000l);
   print_bit_count(1000000000l);

   // Benchmark by calling the count_one_bits() function a hundred million times:
   auto result = 0;
   for (auto i = 0; i < 100000000; i++) {
      result += count_one_bits(1000000000l);
   }
   
   return 0;
}
and here's the C++ bit counting function I created:
Code:
int count_one_bits(long number) {
   int count = 0;
   
   for (int i = 0; i < sizeof(long) * 8; i++) {
      if ((number & 1) != 0)
         count++;
      number >>= 1;
   }

   return count;
}
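(Aside on measurement: the post doesn't show how the benchmark loop was timed, so here's a minimal std::chrono harness along the lines one might use; this is just a sketch, not the harness actually used, reusing the 100,000,000-iteration loop from the calling program above.)
Code:
#include <chrono>
#include <cstdio>
#include "count-bits.h"

int main() {
   auto start = std::chrono::steady_clock::now();

   long result = 0;
   for (long i = 0; i < 100000000; i++) {
      result += count_one_bits(1000000000l);
   }

   auto stop = std::chrono::steady_clock::now();
   auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start);

   // Print the accumulated result so the loop's work can't be discarded
   // (count_one_bits() lives in a separate translation unit anyway).
   std::printf("result = %ld, %.2f ns/call\n", result, ns.count() / 1e8);
   return 0;
}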
Initially, I looked at the assembler produced from the C++ version of the function. Here's a snippet, produced with g++ -O2:
Code:
   6                    .text
   7                    .p2align 4
   8                    .globl   _Z14count_one_bitsl
   9                    .type   _Z14count_one_bitsl, @function
  10                 _Z14count_one_bitsl:
  11                 .LFB0:
  12                    .cfi_startproc
  13 0000 F30F1EFA       endbr64   
  14                 # count-bits.cpp:2: int count_one_bits(long number) {
  15 0004 B8400000       movl   $64, %eax   #, ivtmp_4
  15      00
  16                 # count-bits.cpp:3:    int count = 0;
  17 0009 31D2           xorl   %edx, %edx   # <retval>
  18 000b 0F1F4400       .p2align 4
  18      00
  19                    .p2align 4
  20                    .p2align 3
  21                 .L2:
  22                 # count-bits.cpp:6:       if ((number & 1) != 0)
  23 0010 4889F9         movq   %rdi, %rcx   # number, _1
  24                 # count-bits.cpp:8:       number >>= 1;
  25 0013 48D1FF         sarq   %rdi   # number
  26                 # count-bits.cpp:6:       if ((number & 1) != 0)
  27 0016 83E101         andl   $1, %ecx   #, _1
  28                 # count-bits.cpp:6:       if ((number & 1) != 0)
  29 0019 01CA           addl   %ecx, %edx   # _1, <retval>
  30                 # count-bits.cpp:5:    for (int i = 0; i < sizeof(long) * 8; i++) {
  31 001b 83E801         subl   $1, %eax   #, ivtmp_4
  32 001e 75F0           jne   .L2   #,
  33                 # count-bits.cpp:12: }
  34 0020 89D0           movl   %edx, %eax   # <retval>,
  35 0022 C3             ret   
  36                    .cfi_endproc
Actually, that didn't look too bad to me. I see evidence of some cleverness from the optimizer. (I won't even bother to show the assembly language produced by -O0; it's not worth our time.) But I thought I could do better. Here's what I came up with:
Code:
   6                    .text
   7                    .p2align 4
   8                    .globl   _Z14count_one_bitsl
   9                    .type   _Z14count_one_bitsl, @function
  10                 _Z14count_one_bitsl:
  11                 .LFB0:
  12                    .cfi_startproc
  13 0000 F30F1EFA       endbr64   
  14                 
  15 0004 B9400000       movl    $64, %ecx
  15      00
  16 0009 31C0           xorl   %eax, %eax   # <retval>
  17 000b 0F1F4400       .p2align 4
  17      00
  18                 .L2:
  19 0010 48D1EF         shrq   %rdi   # number
  20 0013 83D000         adcl   $0, %eax   # _1, <retval>
  21 0016 FFC9           decl   %ecx
  22 0018 75F6           jnz    .L2
  23 001a C3             ret   
  24                    .cfi_endproc
I managed to remove a redundant copy and a logical AND from the loop by taking advantage of the carry bit (where the SHR instruction places the bit that's shifted off), and also a redundant copy of the sum (but that's outside the loop, so it just makes the code smaller, not particularly faster). I was surprised (but I guess not completely astonished) to learn that my shorter, "better" code was about 13% slower than the g++ -O2 generated code. So the GCC optimizer knows some things I don't. :P I tried several variations on the code without ever being able to achieve the performance of the g++-generated code. I don't think I broke any of the obvious rules (foremost among them, don't stall the pipeline unnecessarily). I didn't try unrolling the loop, because g++ didn't and I thought that would be cheating, anyway. ;)

Couple of things I discovered (or, at least was reminded of) along the way:
  • That ".p2align 4" pseudo-op (and the fancy multi-byte NOP it concocts) is important. Aligning the loop on a 16-byte boundary increased the speed of the loop by about 30%.
  • At one point I tried replacing the decl/jnz loop counter management with a single "loop" opcode. That was a bad idea, making the loop more than three times slower. I remember reading somewhere that modern silicon schedulers don't try to optimize the loop instruction (because it's hardly ever used), and boy does that seem to be true!
This isn't really a support request. I just thought some of you might be interested. (Although if you have ideas, I'd love to hear them.)

- John
_________________
I can confirm that I have received between 0 and 499 National Security Letters.
NeddySeagoon
Administrator


Joined: 05 Jul 2003
Posts: 54990
Location: 56N 3W

PostPosted: Sun Mar 02, 2025 11:10 pm    Post subject: Reply with quote

John R. Graham,

I'll have a wee nibble ...

It looks like the loop is executed 64 times regardless of the number you start with.
It's been a long time since I did any x86 assy. :)

Once you have shifted all the non-zero bits out into the carry bit, you can stop.
The register holding the remains of your number will be zero, so the
Code:
decl   %ecx
is redundant. No need to count bits being shifted.
_________________
Regards,

NeddySeagoon

Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail.
sublogic
Guru


Joined: 21 Mar 2022
Posts: 318
Location: Pennsylvania, USA

PostPosted: Mon Mar 03, 2025 2:11 am    Post subject: Reply with quote

(Independently of Neddy) I tried this:
Code:
int count_one_bits(unsigned long number)
{
   int count;
   for(count= 0; number; number >>= 1) {
      count+= number&1;
   }
   return count;
}

The assembler looks like this:
Code:
_Z14count_one_bitsm:
.LFB2086:
        .cfi_startproc
        endbr64
        xorl    %eax, %eax
        testq   %rdi, %rdi
        je      .L4
        .p2align 4
        .p2align 4
        .p2align 3
.L3:
        movl    %edi, %edx
        andl    $1, %edx
        addl    %edx, %eax
        shrq    %rdi
        jne     .L3
        ret
        .p2align 4,,10
        .p2align 3
.L4:
        ret
        .cfi_endproc

It looks like %eax holds "count" and %rdi holds "number". The "for" loop is at label L3.
(Hey, how do you get g++ to copy the source lines in the assembly listing?)

I didn't benchmark it, but it looks more efficient... (I did run your tests and the function is correct.)
John R. Graham
Administrator


Joined: 08 Mar 2005
Posts: 10744
Location: Somewhere over Atlanta, Georgia

PostPosted: Mon Mar 03, 2025 2:24 am    Post subject: Reply with quote

NeddySeagoon wrote:
John R. Graham,

I'll have a wee nibble ...

It looks like the loop is executed 64 times regardless of the number you start with.
It's been a long time since I did any x86 assy. :)

Once you have shifted all the non-zero bits out into the carry bit, you can stop.
The register holding the remains of your number will be zero, so the
Code:
decl   %ecx
is redundant. No need to count bits being shifted.
Interesting. I think that requires an additional test because I don't believe that SHR sets all the flags. It might make the worst-case performance worse. Let me check. Thanks!

- John
_________________
I can confirm that I have received between 0 and 499 National Security Letters.
John R. Graham
Administrator


Joined: 08 Mar 2005
Posts: 10744
Location: Somewhere over Atlanta, Georgia

PostPosted: Mon Mar 03, 2025 2:45 am    Post subject: Reply with quote

sublogic wrote:
The assembler looks like this: ...
I didn't benchmark it, but it looks more efficient... (I did run your tests and the function is correct.)
Thanks! I'll check. I'm going to run worst case, though, testing against ~0l.

sublogic wrote:
(Hey, how do you get g++ to copy the source lines in the assembly listing?)
That'd be the -fverbose-asm command line option.

- John
_________________
I can confirm that I have received between 0 and 499 National Security Letters.
sublogic
Guru


Joined: 21 Mar 2022
Posts: 318
Location: Pennsylvania, USA

PostPosted: Mon Mar 03, 2025 2:57 am    Post subject: Reply with quote

John R. Graham wrote:
Thanks! I'll check. I'm going to run worst case, though, testing against ~0l.
Make sure it's using logical shifts, not arithmetic shifts, otherwise it could get very slow :wink:
(I made mine take an unsigned long instead of just long, for that reason.)
John R. Graham
Administrator


Joined: 08 Mar 2005
Posts: 10744
Location: Somewhere over Atlanta, Georgia

PostPosted: Mon Mar 03, 2025 3:04 am    Post subject: Reply with quote

Indeed. ;) My original hand-crafted code already did.

- John
_________________
I can confirm that I have received between 0 and 499 National Security Letters.
Hu
Administrator


Joined: 06 Mar 2007
Posts: 23172

PostPosted: Mon Mar 03, 2025 3:17 am    Post subject: Reply with quote

If I were to guess, I would speculate that:
  • Perhaps sar / shr are relatively slow. In the g++ version, the result of sar isn't needed until the top of the next iteration of the loop, so a pipelined CPU can finish the and/add/sub before it needs the result of that sar to stabilize. In the handmade version, you need the result of shr to feed the very next instruction, so your adc will stall waiting for shr to produce the carried bit.
  • Perhaps adc, like loop, is just a slow instruction on modern CPUs, so the g++ version that avoids adc and the carry flag fares better. It might be interesting to replace that adc with a setc %dl ; addl %edx, %eax - although setc itself might be slow. (Remember to zero edx before the loop if you try this, since setc only writes the lowest byte.)

If you have access to a CPU that has little or no pipelining and speculative execution, it could be interesting to benchmark both programs there. That would let us rule in or rule out that g++'s sar benefits from executing in the background while the following instructions are decoded and resolved.
NeddySeagoon
Administrator


Joined: 05 Jul 2003
Posts: 54990
Location: 56N 3W

PostPosted: Mon Mar 03, 2025 10:45 am    Post subject: Reply with quote

John R. Graham,

You could do it in byte-size chunks with a 256-entry look-up table.
The implementation details are left as an exercise for the reader. :)
_________________
Regards,

NeddySeagoon

Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail.
John R. Graham
Administrator


Joined: 08 Mar 2005
Posts: 10744
Location: Somewhere over Atlanta, Georgia

PostPosted: Fri Mar 07, 2025 2:52 pm    Post subject: Reply with quote

I ended up determining that the culprit was the ADCx instruction, not the SHRx, so one of Hu's guesses was spot on. It's just way slower than ADDx. I also learned that SHRx does set the flags, so it can gainfully be used at the end of the loop. I ended up beating the -O2 compiler code with my final hand-optimized bit-at-a-time code, but only by a small amount: 1.6%.

These results were with Neddy's recommended "don't continue if there are no more one bits" algorithm, both in C++ and assembler. The final C++ code is:
Code:
int count_one_bits(unsigned long number) {
   int count = 0;
   
   do {
      count += number & 1;
      number >>= 1;
   } while (number);

   return count;
}
Interestingly, "do { } while()" was significantly faster than "while() { }". Come to think of it, that actually makes sense. The latter almost has to have two jumps per loop whereas the former can have only one.
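To make the jump-count argument concrete, here are the two loop shapes side by side (the function names are just for illustration, and the comments describe the branch structure a compiler typically emits at -O2, consistent with sublogic's earlier listing; this is not generated output):
Code:
// do { } while (): one conditional branch per iteration, taken while
// there are still bits left in number.
int count_bits_do_while(unsigned long number) {
   int count = 0;
   do {
      count += number & 1;
      number >>= 1;
   } while (number);
   return count;
}

// while () { }: naively a test-and-exit branch at the top plus an
// unconditional jump back at the bottom; compilers usually rotate it
// into the one-branch do/while form behind a single up-front zero test.
int count_bits_while(unsigned long number) {
   int count = 0;
   while (number) {
      count += number & 1;
      number >>= 1;
   }
   return count;
}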

My final hand-crafted assembly code is:
Code:
  10                 _Z14count_one_bitsm:
  11                 .LFB0:
  12                    .cfi_startproc
  13 0000 F30F1EFA       endbr64   
  14 0004 31C0           xorl   %eax, %eax   # <retval>
  15 0006 BA010000       movl    $1, %edx
  15      00
  16                 
  17 000b 0F1F4400       .p2align 4
  17      00
  18                 .L2:
  19 0010 89F9           movl    %edi, %ecx
  20 0012 21D1           andl    %edx, %ecx
  21 0014 01C8           addl    %ecx, %eax
  22 0016 48D1EF         shrq     %rdi
  23 0019 75F5           jnz     .L2
  24 001b C3             ret   
  25                    .cfi_endproc
 
As my whole purpose is to practice, I'm going to work on the byte-at-a-time algorithm next.

- John
_________________
I can confirm that I have received between 0 and 499 National Security Letters.
NeddySeagoon
Administrator


Joined: 05 Jul 2003
Posts: 54990
Location: 56N 3W

PostPosted: Fri Mar 07, 2025 7:52 pm    Post subject: Reply with quote

John R. Graham,

You use a 64-bit register for the counter. An 8-bit register would be plenty.
Does that help speed-wise?
In theory, no, it's a 64-bit CPU. In practice, it's a microcoded RISC machine, so nothing surprises me.
_________________
Regards,

NeddySeagoon

Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail.
John R. Graham
Administrator


Joined: 08 Mar 2005
Posts: 10744
Location: Somewhere over Atlanta, Georgia

PostPosted: Fri Mar 07, 2025 8:10 pm    Post subject: Reply with quote

No, it's only a 32-bit register: eax. 64-bit would be rax. My intuition is that it won't make a difference, but I'll check.

- John
_________________
I can confirm that I have received between 0 and 499 National Security Letters.
Hu
Administrator


Joined: 06 Mar 2007
Posts: 23172

PostPosted: Fri Mar 07, 2025 8:37 pm    Post subject: Reply with quote

If you're still testing and tweaking, I wonder whether you would be better off using a literal andl $1,%ecx instead of andl %edx,%ecx where %edx always contains a 1. The CPU might take a different path when the mask is a directly embedded constant, versus being a register that it must predict is 1.

As regards do { } while: that form removes the initial test+jump that a while would require. This helps code size, and may help in other ways.
John R. Graham
Administrator


Joined: 08 Mar 2005
Posts: 10744
Location: Somewhere over Atlanta, Georgia

PostPosted: Fri Mar 07, 2025 10:44 pm    Post subject: Reply with quote

Hu wrote:
If you're still testing and tweaking, I wonder whether you would be better off using a literal andl $1,%ecx ...
As it turns out, register-to-register is fastest, which is a piece of traditional lore that's survived.

Hu wrote:
As regards do { } while: that form removes the initial test+jump that a while would require. This helps code size, and may help in other ways.
Yes, that was kind of what I was saying as well, above. The compiler at -O2 actually came up with the same code that I wrote for the do { } while() minus the preloading of a constant in a register.

- John
_________________
I can confirm that I have received between 0 and 499 National Security Letters.
logrusx
Advocate


Joined: 22 Feb 2018
Posts: 2818

PostPosted: Sat Mar 08, 2025 7:48 am    Post subject: Reply with quote

NeddySeagoon wrote:
John R. Graham,

You use a 64-bit register for the counter. An 8-bit register would be plenty.
Does that help speed-wise?
In theory, no, it's a 64-bit CPU. In practice, it's a microcoded RISC machine, so nothing surprises me.


Even RISC wouldn't benefit. Everything unaligned would require padding, and thus extra work.

Best Regards,
Georgi
John R. Graham
Administrator


Joined: 08 Mar 2005
Posts: 10744
Location: Somewhere over Atlanta, Georgia

PostPosted: Sun Mar 09, 2025 8:01 pm    Post subject: Reply with quote

Completed the byte-at-a-time implementation (another of Neddy's suggestions). C++ code is (bit count table omitted for brevity):
Code:
#include "bit-count-table.h"

int count_one_bits(unsigned long number) {
    int count = 0;

    do {
        count += bitCountTable[number & 0xFF];
        number >>= 8;
    } while (number);

    return count;
}
which renders, at -O2, code that's 384% faster than the bit-at-a-time code. Not really surprising, but, oh my gosh, I've discovered the "-masm=intel" option to gcc/g++! Intel assembly language syntax is sooo much nicer than AT&T syntax (in my humble opinion). The generated code is...pretty perfect:
Code:
  11                    _Z14count_one_bitsm:
  12                    .LFB0:
  13                            .cfi_startproc
  14 0000 F30F1EFA              endbr64
  15                    # count-bits-byte-at-a-time.cpp:5:      int count = 0;
  16 0004 31C0                  xor     eax, eax        # <retval>
  17 0006 488D0D00              lea     rcx, _ZL13bitCountTable[rip]    # tmp107,
  17      000000
  18 000d 0F1F00                .p2align 4
  19                            .p2align 4
  20                            .p2align 3
  21                    .L2:
  22                    # count-bits-byte-at-a-time.cpp:8:              count += bitCountTable[number & 0xFF];
  23 0010 400FB6D7              movzx   edx, dil        # _1, number
  24                    # count-bits-byte-at-a-time.cpp:8:              count += bitCountTable[number & 0xFF];
  25 0014 0FB61411              movzx   edx, BYTE PTR [rcx+rdx] # _3, bitCountTable[_1]
  26                    # count-bits-byte-at-a-time.cpp:8:              count += bitCountTable[number & 0xFF];
  27 0018 01D0                  add     eax, edx        # <retval>, _3
  28                    # count-bits-byte-at-a-time.cpp:10:     } while (number);
  29 001a 48C1EF08              shr     rdi, 8  # number,
  30 001e 75F0                  jne     .L2     #,
  31                    # count-bits-byte-at-a-time.cpp:13: }
  32 0020 C3                    ret     
  33                            .cfi_endproc
I'm not sure I'm going to be able to do any better than that.
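For completeness, since bit-count-table.h was omitted above: one way the table could be generated at compile time is sketched below. The name bitCountTable matches the code above, but this particular constexpr construction (requiring C++17 or later) is just an assumption, not necessarily what's in the real header.
Code:
// bit-count-table.h (sketch)
#pragma once
#include <array>

constexpr std::array<unsigned char, 256> make_bit_count_table() {
   std::array<unsigned char, 256> table{};
   for (int i = 0; i < 256; i++) {
      int count = 0;
      for (int v = i; v != 0; v >>= 1) {
         count += v & 1;
      }
      table[i] = static_cast<unsigned char>(count);
   }
   return table;
}

constexpr auto bitCountTable = make_bit_count_table();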

- John
_________________
I can confirm that I have received between 0 and 499 National Security Letters.
szatox
Advocate


Joined: 27 Aug 2013
Posts: 3545

PostPosted: Sun Mar 09, 2025 9:46 pm    Post subject: Reply with quote

One trick I saw in xxhash was making the code suitable for reordering by the CPU.

In this case it would probably mean unrolling that loop though, so you can create a few blocks of instructions independent of each other.
E.g. instead of: shr 1, and 1, add counter, loop,
make it: shr 1, shr 2, shr 3, shr 4 into 4 different output registers, then bitwise AND 1 each of them into the next set of 4 registers, then add each resulting bit to its respective counter, and then loop back using the value from shr 4 as input.

I know it's not a perfect example, but it should be good enough to explain the idea: always have a few other things to do between calculating a value and using it.
And part of why it's not perfect is that at this point you could probably MMX the hell out of it :lol:
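One way to express that idea at the C++ level, purely as a sketch: split the 64-bit value into independent byte lookups with separate accumulators, so consecutive adds don't all chain through one register. This assumes the bitCountTable header from the byte-at-a-time post above; whether it actually beats the simple loop on a given CPU is something you'd have to measure.
Code:
#include "bit-count-table.h"

// Four independent partial sums over the eight bytes of the word.
int count_one_bits_ilp(unsigned long number) {
   int c0 = bitCountTable[(number >>  0) & 0xFF] + bitCountTable[(number >> 32) & 0xFF];
   int c1 = bitCountTable[(number >>  8) & 0xFF] + bitCountTable[(number >> 40) & 0xFF];
   int c2 = bitCountTable[(number >> 16) & 0xFF] + bitCountTable[(number >> 48) & 0xFF];
   int c3 = bitCountTable[(number >> 24) & 0xFF] + bitCountTable[(number >> 56) & 0xFF];
   return (c0 + c1) + (c2 + c3);
}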
_________________
Make Computing Fun Again
eccerr0r
Watchman


Joined: 01 Jul 2004
Posts: 9926
Location: almost Mile High in the USA

PostPosted: Sun Mar 09, 2025 11:44 pm    Post subject: Reply with quote

What -march are you all using?

I'm surprised gcc and clang aren't optimizing it to one instruction and no loops, as long as you're not using base x86-64.

---

After a bit of research, it looks like if you rewrite your loop to:
Code:
   while(number != 0) {
      ++count;
      number &= (number-1);
   }

and target a modern x86_64 microarchitecture, then gcc will figure things out and give you the fastest popcount solution.
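Here's that loop as a self-contained function, for anyone who wants to inspect the generated code (the wrapper name is arbitrary); compiling with g++ -O2 -march=native -S and looking for a popcnt in the output is one way to check the claim above.
Code:
// Kernighan-style count: clears the lowest set bit each pass, so it
// loops once per set bit rather than once per bit position.
int count_one_bits_kernighan(unsigned long number) {
   int count = 0;
   while (number != 0) {
      ++count;
      number &= (number - 1);
   }
   return count;
}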
_________________
Intel Core i7 2700K/Radeon R7 250/24GB DDR3/256GB SSD
What am I supposed watching?


Last edited by eccerr0r on Mon Mar 10, 2025 2:53 am; edited 1 time in total
John R. Graham
Administrator


Joined: 08 Mar 2005
Posts: 10744
Location: Somewhere over Atlanta, Georgia

PostPosted: Mon Mar 10, 2025 2:26 am    Post subject: Reply with quote

Well, -march=native, which I'm using, on my Intel(R) Xeon(R) W-2135 CPU evaluates to:
Code:
-march=skylake-avx512 -mmmx -mpopcnt -msse -msse2 -msse3 -mssse3 -msse4.1 -msse4.2 -mavx -mavx2 -mno-sse4a -mno-fma4 -mno-xop -mfma -mavx512f -mbmi -mbmi2 -maes -mpclmul -mavx512vl -mavx512bw -mavx512dq -mavx512cd -mno-avx512vbmi -mno-avx512ifma -mno-avx512vpopcntdq -mno-avx512vbmi2 -mno-gfni -mno-vpclmulqdq -mno-avx512vnni -mno-avx512bitalg -mno-avx512bf16 -mno-avx512vp2intersect -mno-3dnow -madx -mabm -mno-cldemote -mclflushopt -mclwb -mno-clzero -mcx16 -mno-enqcmd -mf16c -mfsgsbase -mfxsr -mhle -msahf -mno-lwp -mlzcnt -mmovbe -mno-movdir64b -mno-movdiri -mno-mwaitx -mno-pconfig -mno-pku -mprfchw -mno-ptwrite -mno-rdpid -mrdrnd -mrdseed -mrtm -mno-serialize -mno-sgx -mno-sha -mno-shstk -mno-tbm -mno-tsxldtrk -mno-vaes -mno-waitpkg -mno-wbnoinvd -mxsave -mxsavec -mxsaveopt -mxsaves -mno-amx-tile -mno-amx-int8 -mno-amx-bf16 -mno-uintr -mno-hreset -mno-kl -mno-widekl -mno-avxvnni -mno-avx512fp16 -mno-avxifma -mno-avxvnniint8 -mno-avxneconvert -mno-cmpccxadd -mno-amx-fp16 -mno-prefetchi -mno-raoint -mno-amx-complex -mno-avxvnniint16 -mno-sm3 -mno-sha512 -mno-sm4 -mno-apxf -mno-usermsr --param l1-cache-size=32 --param l1-cache-line-size=64 --param l2-cache-size=8448 -mtune=skylake-avx512 -fcf-protection -foffload-options=-fcf-protection=none -dumpbase null
What single instruction would that be? One of the SIMD ones? I really need to learn about those.

- John
_________________
I can confirm that I have received between 0 and 499 National Security Letters.


Last edited by John R. Graham on Mon Mar 10, 2025 2:55 am; edited 2 times in total
eccerr0r
Watchman


Joined: 01 Jul 2004
Posts: 9926
Location: almost Mile High in the USA

PostPosted: Mon Mar 10, 2025 2:54 am    Post subject: Reply with quote

I made a post edit above...

It's not MMX/SIMD, it's just an x86/x86_64 extension. I'm not sure where I found it, but it's a neat instruction indeed.

I think newer ARM also has an analog, mainly because a lot of CPUs now try to optimize for encryption and ECC.

Oh, by "modern" I mean something newer than a Core 2 CPU... so even old Nehalems can do it. So I think the excluded x86-64s are Pentium 4s, original K8s, and Core 2 Quad/Duo; there might be more that don't have it.
_________________
Intel Core i7 2700K/Radeon R7 250/24GB DDR3/256GB SSD
What am I supposed watching?
no101
n00b


Joined: 10 Oct 2022
Posts: 15
Location: Piney Woods

PostPosted: Mon Mar 10, 2025 5:19 am    Post subject: Reply with quote

Given the nature of your post, I assumed that you were aware of popcount but wanted to achieve the same result on your own. Here are a few links.

C++20 has std::popcount
https://en.cppreference.com/w/cpp/numeric/popcount

GCC and Clang both have __builtin_popcount
https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html

Wikipedia says that AMD considers POPCNT part of ABM, while Intel considers it part of SSE4.2.
It's listed under ABM at
https://en.wikipedia.org/wiki/X86_Bit_manipulation_instruction_set
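For reference, the library/builtin route from those links looks like the sketch below (the wrapper names are just for illustration); with a -march (or -mpopcnt) that includes POPCNT, both calls should compile down to the single instruction.
Code:
#include <bit>      // std::popcount, C++20
#include <cstdio>

int count_one_bits_std(unsigned long number) {
   return std::popcount(number);        // C++20 library function
}

int count_one_bits_builtin(unsigned long number) {
   return __builtin_popcountl(number);  // GCC/Clang builtin
}

int main() {
   std::printf("%d %d\n", count_one_bits_std(1000000000ul),
                          count_one_bits_builtin(1000000000ul));
   return 0;
}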
eccerr0r
Watchman


Joined: 01 Jul 2004
Posts: 9926
Location: almost Mile High in the USA

PostPosted: Mon Mar 10, 2025 5:57 am    Post subject: Reply with quote

I assumed otherwise because I'd think a good optimizing compiler should have recognized popcount... and it appears that a specific way of writing the C implementation gets turned into a popcount automatically (meaning you don't need to use a __builtin or std::), yet none of the assembly dumps here did so... hence it was quite interesting.

I don't consider an instruction to be an MMX or SIMD instruction if it doesn't use the MMX/SIMD registers; this popcount works on the base architectural registers. I suspect there is a variant that works on SIMD/MMX registers... or definitely AVX-512 registers...
_________________
Intel Core i7 2700K/Radeon R7 250/24GB DDR3/256GB SSD
What am I supposed watching?
NeddySeagoon
Administrator


Joined: 05 Jul 2003
Posts: 54990
Location: 56N 3W

PostPosted: Mon Mar 10, 2025 10:24 am    Post subject: Reply with quote

John R. Graham,

Quote:
What single instruction would that be?


Axioms:
All programs can be shortened by one byte.
All programs contain one more bug.

Conclusion:
In the limit, all programs can be shortened to a single instruction, which is wrong. :)
_________________
Regards,

NeddySeagoon

Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail.