Gentoo Forums
Gentoo Forums
Gentoo Forums
Quick Search: in
Lessons in Optimization on Modern X86-64 Silicon
View unanswered posts
View posts from last 24 hours

 
Reply to topic    Gentoo Forums Forum Index Portage & Programming
View previous topic :: View next topic  
Author Message
John R. Graham
Administrator
Administrator


Joined: 08 Mar 2005
Posts: 10739
Location: Somewhere over Atlanta, Georgia

PostPosted: Sun Mar 02, 2025 10:49 pm    Post subject: Lessons in Optimization on Modern X86-64 Silicon Reply with quote

I was just doing a brush-up on some not-too-recently-used skills, so I set out to write a small function in 64-bit X86 assembly language. To start simply, I chose "count the number of 1 bits in a long integer"). The calling code is written in C++ and in fact I first prototyped the bit counting code in C++ as well.

Here's my trivial callng program (it also running enough iterations of the function to allow for benchmarking):
Code:
#include <iostream>
#include <fmt/core.h>
#include "count-bits.h"

void print_bit_count(long number) {
   fmt::print("The number {0:10d} (0x{0:016x}) has {1:2d} one bits.\n", number, count_one_bits(number));
}

int main(void) {
   fmt::print("Hello, world!\n");

   print_bit_count(1000l);
   print_bit_count(5000l);
   print_bit_count(1000000l);
   print_bit_count(1000000000l);

   // Benchmark by calling the count_one_bits() function a billion times:
   auto result = 0;
   for (auto i = 0; i < 100000000; i++) {
      result += count_one_bits(1000000000l);
   }
   
   return 0;
}
and here's the C++ bit counting function I created:
Code:
int count_one_bits(long number) {
   int count = 0;
   
   for (int i = 0; i < sizeof(long) * 8; i++) {
      if ((number & 1) != 0)
         count++;
      number >>= 1;
   }

   return count;
}
Initially, I looked at the assembler produced from the C++ version of the function. Here's a snippet, produced with g++ -O2:
Code:
   6                    .text
   7                    .p2align 4
   8                    .globl   _Z14count_one_bitsl
   9                    .type   _Z14count_one_bitsl, @function
  10                 _Z14count_one_bitsl:
  11                 .LFB0:
  12                    .cfi_startproc
  13 0000 F30F1EFA       endbr64   
  14                 # count-bits.cpp:2: int count_one_bits(long number) {
  15 0004 B8400000       movl   $64, %eax   #, ivtmp_4
  15      00
  16                 # count-bits.cpp:3:    int count = 0;
  17 0009 31D2           xorl   %edx, %edx   # <retval>
  18 000b 0F1F4400       .p2align 4
  18      00
  19                    .p2align 4
  20                    .p2align 3
  21                 .L2:
  22                 # count-bits.cpp:6:       if ((number & 1) != 0)
  23 0010 4889F9         movq   %rdi, %rcx   # number, _1
  24                 # count-bits.cpp:8:       number >>= 1;
  25 0013 48D1FF         sarq   %rdi   # number
  26                 # count-bits.cpp:6:       if ((number & 1) != 0)
  27 0016 83E101         andl   $1, %ecx   #, _1
  28                 # count-bits.cpp:6:       if ((number & 1) != 0)
  29 0019 01CA           addl   %ecx, %edx   # _1, <retval>
  30                 # count-bits.cpp:5:    for (int i = 0; i < sizeof(long) * 8; i++) {
  31 001b 83E801         subl   $1, %eax   #, ivtmp_4
  32 001e 75F0           jne   .L2   #,
  33                 # count-bits.cpp:12: }
  34 0020 89D0           movl   %edx, %eax   # <retval>,
  35 0022 C3             ret   
  36                    .cfi_endproc
Actually, that didn't look too bad to me. I see evidence of some cleverness from the optimizer. (I won't even bother to show the assembly language produced by -O0; it's not worth our time.) But I thought I could do better. Here's what I came up with:
Code:
   6                    .text
   7                    .p2align 4
   8                    .globl   _Z14count_one_bitsl
   9                    .type   _Z14count_one_bitsl, @function
  10                 _Z14count_one_bitsl:
  11                 .LFB0:
  12                    .cfi_startproc
  13 0000 F30F1EFA       endbr64   
  14                 
  15 0004 B9400000       movl    $64, %ecx
  15      00
  16 0009 31C0           xorl   %eax, %eax   # <retval>
  17 000b 0F1F4400       .p2align 4
  17      00
  18                 .L2:
  19 0010 48D1EF         shrq   %rdi   # number
  20 0013 83D000         adcl   $0, %eax   # _1, <retval>
  21 0016 FFC9           decl   %ecx
  22 0018 75F6           jnz    .L2
  23 001a C3             ret   
  24                    .cfi_endproc
I managed to remove a redundant copy and a logical AND from the loop by talking advantage of the carry bit (where the SHR instructions place the bit that's shifted off) and a redundant copy of the sum (but that's outside the loop so just makes smaller—but not particularly faster—code. I was surprised (but I guess not completely astonished) to learn that my shorter, "better" code was about 13% slower that the g++ -O2 generated code. So the GCC optimizer knows some things I don't. :P I tried several variations on the code without being able to ever achieve the performance of the g++-generated code. I don't think I broke any of the obvious rules (foremost among them, don't stall the pipeline unnecessarily). I didn't try unrolling the loop, because g++ didn't and I thought that would be cheating, anyway. ;)

Couple of things I discovered (or, at least was reminded of) along the way:
  • That ".p2align 4" psuedo-op (and the fancy multi-byte NOP it concocts) is important. Aligning the loop on a 16-byte boundary increased the speed of the loop by about 30%.
  • At one point I tried replacing the decl/jnz loop counter management with a single "loop" opcode. That was a bad idea, decreasing performance by over 300%. I remember reading somewhere that modern silicon schedulers don't try to optimize the loop instruction (because it's hardly ever used) and boy does that seem to be true!
This isn't really a support request. I just thought some of you might be interested. (Although if you have ideas, I'd love to hear them.)

- John
_________________
I can confirm that I have received between 0 and 499 National Security Letters.
Back to top
View user's profile Send private message
NeddySeagoon
Administrator
Administrator


Joined: 05 Jul 2003
Posts: 54956
Location: 56N 3W

PostPosted: Sun Mar 02, 2025 11:10 pm    Post subject: Reply with quote

John R. Graham,

I'll have a wee nibble ...

It looks like the loop is executed 64 times regardless of the number you start with.
It a long time since I did any x86 assy. :)

Once you have shifted all the non zero bit out into the carry bit, you can stop.
The register holding the remains of your number will be zero, so the
Code:
decl   %ecx
is redundant. No need to count bits being shifted.
_________________
Regards,

NeddySeagoon

Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail.
Back to top
View user's profile Send private message
sublogic
Guru
Guru


Joined: 21 Mar 2022
Posts: 318
Location: Pennsylvania, USA

PostPosted: Mon Mar 03, 2025 2:11 am    Post subject: Reply with quote

(Independently of Neddy) I tried this:
Code:
int count_one_bits(unsigned long number)
{
   int count;
   for(count= 0; number; number >>= 1) {
      count+= number&1;
   }
   return count;
}

The assembler looks like this:
Code:
_Z14count_one_bitsm:
.LFB2086:
        .cfi_startproc
        endbr64
        xorl    %eax, %eax
        testq   %rdi, %rdi
        je      .L4
        .p2align 4
        .p2align 4
        .p2align 3
.L3:
        movl    %edi, %edx
        andl    $1, %edx
        addl    %edx, %eax
        shrq    %rdi
        jne     .L3
        ret
        .p2align 4,,10
        .p2align 3
.L4:
        ret
        .cfi_endproc

It looks like %eax holds "count" and %rdi holds "number". The "for" loop is at label L3.
(Hey, how do you get g++ to copy the source lines in the assembly listing?)

I didn't benchmark it, but it looks more efficient... (I did run your tests and the function is correct.)
Back to top
View user's profile Send private message
John R. Graham
Administrator
Administrator


Joined: 08 Mar 2005
Posts: 10739
Location: Somewhere over Atlanta, Georgia

PostPosted: Mon Mar 03, 2025 2:24 am    Post subject: Reply with quote

NeddySeagoon wrote:
John R. Graham,

I'll have a wee nibble ...

It looks like the loop is executed 64 times regardless of the number you start with.
It a long time since I did any x86 assy. :)

Once you have shifted all the non zero bit out into the carry bit, you can stop.
The register holding the remains of your number will be zero, so the
Code:
decl   %ecx
is redundant. No need to count bits being shifted.
Interesting. I think that requires an additional test because I don't believe that SRL sets all the flags. It might make the worst case performance worse. Let me check. Thanks!

- John
_________________
I can confirm that I have received between 0 and 499 National Security Letters.
Back to top
View user's profile Send private message
John R. Graham
Administrator
Administrator


Joined: 08 Mar 2005
Posts: 10739
Location: Somewhere over Atlanta, Georgia

PostPosted: Mon Mar 03, 2025 2:45 am    Post subject: Reply with quote

sublogic wrote:
The assembler looks like this: ...
I didn't benchmark it, but it looks more efficient... (I did run your tests and the function is correct.)
Thanks! I'll check. I'm going to run worst case, though, testing against ~0l.

sublogic wrote:
(Hey, how do you get g++ to copy the source lines in the assembly listing?)
That'd be the --verbose-asm command line option.

- John
_________________
I can confirm that I have received between 0 and 499 National Security Letters.
Back to top
View user's profile Send private message
sublogic
Guru
Guru


Joined: 21 Mar 2022
Posts: 318
Location: Pennsylvania, USA

PostPosted: Mon Mar 03, 2025 2:57 am    Post subject: Reply with quote

John R. Graham wrote:
Thanks! I'll check. I'm going to run worst case, though, testing against ~0l.
Make sure it's using logical shifts, not arithmetic shifts, otherwise it could get very slow :wink:
(I made mine take an unsigned long instead of just long, for that reason.)
Back to top
View user's profile Send private message
John R. Graham
Administrator
Administrator


Joined: 08 Mar 2005
Posts: 10739
Location: Somewhere over Atlanta, Georgia

PostPosted: Mon Mar 03, 2025 3:04 am    Post subject: Reply with quote

Indeed. ;) My original hand-crafted code already did.

- John
_________________
I can confirm that I have received between 0 and 499 National Security Letters.
Back to top
View user's profile Send private message
Hu
Administrator
Administrator


Joined: 06 Mar 2007
Posts: 23146

PostPosted: Mon Mar 03, 2025 3:17 am    Post subject: Reply with quote

If I were to guess, I would speculate that:
  • Perhaps sar / shr are relatively slow. In the g++ version, the result of sar isn't needed until the top of the next iteration of the loop, so a pipelined CPU can finish the and/add/sub before it needs the result of that sar to stabilize. In the handmade version, you need the result of shr to feed the very next instruction, so your adc will stall waiting for shr to produce the carried bit.
  • Perhaps adc, like loop, is just a slow instruction on modern CPUs, so the g++ version that avoids adc and the carry flag fares better. It might be interesting to replace that adc with a setc %dl ; addl %rdx, %eax - although setc itself might be slow. (Remember to zero edx before the loop if you try this, since setc only writes the lowest byte.)

If you have access to a CPU that has little or no pipelining and speculative execution, it could be interesting to benchmark both programs there. That would let us rule in or rule out that g++'s sar benefits from executing in the background while the following instructions are decoded and resolved.
Back to top
View user's profile Send private message
NeddySeagoon
Administrator
Administrator


Joined: 05 Jul 2003
Posts: 54956
Location: 56N 3W

PostPosted: Mon Mar 03, 2025 10:45 am    Post subject: Reply with quote

John R. Graham,

You could do it in 4 byte size chunks with a 256 entry look up table.
The implementation detail is left as an exercise for the reader :)
_________________
Regards,

NeddySeagoon

Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail.
Back to top
View user's profile Send private message
Display posts from previous:   
Reply to topic    Gentoo Forums Forum Index Portage & Programming All times are GMT
Page 1 of 1

 
Jump to:  
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum