Gentoo Forums
[split] complex CPU instructions increase binary size?
Gentoo Forums Forum Index » Portage & Programming
mv
Watchman


Joined: 20 Apr 2005
Posts: 6780

PostPosted: Tue Aug 08, 2023 5:38 am    Post subject: [split] complex CPU instructions increase binary size?

Split from Historically increasing compilation times to avoid more derailing. -- Zucca

eccerr0r wrote:
Yes I'd expect some runtime checks added to help with bloat

Runtime checks are all optional; of course, I would not recommend disabling them. What is probably rather expensive is the code which protects against Spectre & co. If you do not care about security, you can disable that, but I'd advise against it as well.
Quote:
but I would hope coupled with code optimization and using newer CPU instructions, code size could also go down.

No, quite the opposite. What the compiler would previously emit as a simple loop now needs many instructions to run in parallel, e.g. first shifting data to the SSE* (and friends) part of the CPU: unless you optimize for size, the code gets longer but faster (since it is more parallel). And it is not only the compiler optimization but also how the code is written (that is, whether it makes use of SSE* techniques).
Quote:
Also another thing is C++ ... where the compiler tries to guess what you're doing leading to massive bloat because you "might" need the code.

I do not agree with that: what costs time in C++ is the compilation, because the data structures needed for it are quite complex, especially if the compiled code has complex data structures (which is the rule). Concerning code length, there is a minimal overhead if you use true inheritance (with virtual functions), namely a function table for each such class, but this is not because you "might" need it, but because you do need it. Another thing is that, in contrast to C, it is "cheap" (concerning the coding effort of the developer) to use non-trivial data structures such as hashes or red-black trees (or even just dynamically growing vectors) throughout: in C every such usage is cumbersome, so C developers tend to use them only in cases which would otherwise be insanely expensive. The C++ compiler usually inlines most of this fancy data-structure code for each usage. The result is usually much more effective but much longer code compared to the code produced from a "typical" C source (unless the author was extremely ambitious and used "good" data structures throughout).
eccerr0r
Watchman


Joined: 01 Jul 2004
Posts: 9974
Location: almost Mile High in the USA

PostPosted: Tue Aug 08, 2023 6:34 am

So you're saying new instructions like movbe, popcnt, aesenc etc. on newer microarchitectures will increase code size? I think not. These tend to decrease code size; otherwise, don't bother using them and just write loops that perform slower. So not all features add to code size.

Even if there is a case where it really does need to link everything in, C++ code still tends to generate huge binaries. Granted, if you're comparing an optimized, takes-forever-to-write C program against STL'ized whatnot that is easier to write/maintain in C++, the former will be smaller. And there's still the aspect that gcc tends to add a lot of excess code from libraries, object files, etc.; I've been fighting it since forever on my microcontroller projects, and it just won't strip out. Perhaps the hope is that the linker/LTO does the right thing and prunes excess code from the resultant binary. It's been a pain upgrading gcc; I still keep an old gcc around because I really don't want all this new bounds checking, loop unrolling, and stack gapping, and I continue to write in pure C to fit in 4KB of flash EEPROM. Merely compiling the same code for the same target from version to version yields larger and larger binaries for no good reason, which especially hurts once it no longer fits.
_________________
Intel Core i7 2700K/Radeon R7 250/24GB DDR3/256GB SSD
What am I supposed watching?
Zucca
Moderator


Joined: 14 Jun 2007
Posts: 4001
Location: Rasi, Finland

PostPosted: Tue Aug 08, 2023 8:29 am

CaptainBlood wrote:
Expect binary size to grow as well, most of the time.

Thks 4 ur attention, interest & support.
I wonder if -Os will do better nowadays than before...
_________________
..: Zucca :..

My gentoo installs:
init=/sbin/openrc-init
-systemd -logind -elogind seatd

Quote:
I am NaN! I am a man!
eccerr0r
Watchman


Joined: 01 Jul 2004
Posts: 9974
Location: almost Mile High in the USA

PostPosted: Tue Aug 08, 2023 12:43 pm

For my microcontroller work, -Os has been bloating too. In fact, with old gcc3 I was able to use -O2 and get a reasonable size, but nowadays I'm forced to use -Os or else it won't fit, all with the same source code.
mv
Watchman


Joined: 20 Apr 2005
Posts: 6780

PostPosted: Tue Aug 08, 2023 6:25 pm

eccerr0r wrote:
So you're saying new instructions like movbe, popcnt, aesenc etc. on newer microarchitectures will increase code size? I think not. These tend to decrease code size

They tend tend to increase code size but are more performant, because the loop executes in parallel.
Quote:
else don't bother using them and actually write loops that perform slower.

Why should one? Performance is usually the aim one optimizes for, and it is what these features are made for.
Quote:
C++ code still tends to generate huge binaries

I explained why: it uses everywhere data structures which need, say, logarithmic time, and these need more code than a straightforward data structure which uses, say, linear time. If the data stored in that structure is small, this is only a small performance gain for much larger code (if it is inlined, which is usually the case), so you would usually not write this in C. However, even in that case it is usually worth it, performance-wise.
Quote:
If you're comparing to writing an optimized, takes forever to write C code versus stl'ized whatnot and easier to write/maintain C++ code, the former will be smaller.

Yes, and it usually still performs worse, because in C you would never inline some optimized red-black-tree algorithm at every second line.
Quote:
it's been a pain upgrading gcc

This was when there were major changes in the STL code. My impression is that such changes have meanwhile become rare: the STL has become quite stable.
Quote:
and continue to write in pure C to fit in 4KB of flash EEPROM

Writing with such size limitations is a different story where it is usually necessary to sacrifice performance for space. But these limitations do not apply to modern systems where performance is much more important. Even the smallest USB stick is far beyond 4KB of memory.
eccerr0r
Watchman


Joined: 01 Jul 2004
Posts: 9974
Location: almost Mile High in the USA

PostPosted: Tue Aug 08, 2023 8:40 pm

mv wrote:
eccerr0r wrote:
So you're saying new instructions like movbe, popcnt, aesenc etc. on newer microarchitectures will increase code size? I think not. These tend to decrease code size

They tend tend to increase code size but are more performant, because the loop executes in parallel.

I call BS, assuming you made a very confusing and conflicting typo above. In any case, try writing code without these instructions and with them: they will be shorter and run in fewer cycles, as the CPU itself does the shifting/counting. And this is not to mention that you need to compare like with like: if you want to operate on multiple memory/register locations, you have to replicate the loop for each expanded loop.

For example, without movbe you would require a register spill and then the swap, versus just doing it in one instruction. A net win, granted that only a small number of code segments need this instruction; it's mostly networking code. Got multiple IP addresses to byteswap? Even more win.

There's no question that more and more stuff needs to be run, but these CPU instructions were made to reduce the number of instructions, and thus the fetch (and possible stack-operation) cycles, needed to perform these functions.
mv
Watchman


Joined: 20 Apr 2005
Posts: 6780

PostPosted: Tue Aug 08, 2023 9:56 pm

eccerr0r wrote:
mv wrote:
eccerr0r wrote:
So you're saying new instructions like movbe, popcnt, aesenc etc. on newer microarchitectures will increase code size? I think not. These tend to decrease code size

They tend tend to increase code size but are more performant, because the loop executes in parallel.

I call BS assuming you made a very confusing and conflicting typo above. In any case try writing code without these instructions and with them.

So let us take a typical code example, say this SSE2 code against the C++ one-liner.

We do not know how the C++ one-liner is implemented, but I guess that it is safe to assume that this is the straightforward implementation by two iterated loops.
One can see that, concerning code size, the SSE2 implementation needs quite some administrative overhead: two additional initialization loops with the actual SSE2 instructions, in addition to the straightforward implementation after the 16th element. It is worth it, because the comparison of the first 16 elements is blazingly fast. But concerning code size, the straightforward implementation without the SSE2 optimization is the clear winner, of course.
stefan11111
l33t


Joined: 29 Jan 2023
Posts: 954
Location: Romania

PostPosted: Tue Aug 08, 2023 10:27 pm

eccerr0r wrote:
For my microcontroller work, -Os has been bloating too. In fact in old gcc3 I was able to use -O2 and get reasonable size, but nowadays I'm forced to use -Os else it won't fit -- all with the same source code.

I wrote this a few weeks ago.
With -Os, I get this output for every empty function:
Code:
    5080:       31 c0                   xor    %eax,%eax
    5082:       c3                      ret

With -O2, I get this:
Code:
    5080:       31 c0                   xor    %eax,%eax
    5082:       c3                      ret
    5083:       66 66 2e 0f 1f 84 00    data16 cs nopw 0x0(%rax,%rax,1)
    508a:       00 00 00 00
    508e:       66 90                   xchg   %ax,%ax

In terms of size, there is a clear winner.
Can someone tell me how the second one is any faster? (-O2 is supposed to be faster than -Os.)
_________________
My overlay: https://github.com/stefan11111/stefan_overlay
INSTALL_MASK="/etc/systemd /lib/systemd /usr/lib/systemd /usr/lib/modules-load.d *udev* /usr/lib/tmpfiles.d *tmpfiles* /var/lib/dbus /usr/bin/gdbus /lib/udev"
spica
Guru


Joined: 04 Jun 2021
Posts: 351

PostPosted: Tue Aug 08, 2023 11:32 pm

stefan11111 wrote:
Can someone tell me how is the second one any faster(-O2 is supposed to be faster than -Os)?

-falign-functions

Quote:
-O2 turns on ... -falign-functions ...
-Os: Optimize for size. -Os enables all -O2 optimizations except those that often increase code size: ... -falign-functions ...

Quote:
-falign-functions
Align the start of functions to the next power-of-two greater than or equal to n, skipping up to m-1 bytes. This ensures that at least the first m bytes of the function can be fetched by the CPU without crossing an n-byte alignment boundary.
eccerr0r
Watchman


Joined: 01 Jul 2004
Posts: 9974
Location: almost Mile High in the USA

PostPosted: Tue Aug 08, 2023 11:35 pm

mv wrote:
We do not know how the C++ one-liner is implemented, but I guess that it is safe to assume that this is the straightforward implementation by two iterated loops.

You're not talking about the specific non-SSE/MMX instructions above; try again.

stefan11111 wrote:
I wrote this a few weeks ago.
With -Os, I get this output for every empty function:
Code:
    5080:       31 c0                   xor    %eax,%eax
    5082:       c3                      ret

With -O2, I get this:
Code:
    5080:       31 c0                   xor    %eax,%eax
    5082:       c3                      ret
    5083:       66 66 2e 0f 1f 84 00    data16 cs nopw 0x0(%rax,%rax,1)
    508a:       00 00 00 00
    508e:       66 90                   xchg   %ax,%ax

In terms of size, there is a clear winner.
Can someone tell me how the second one is any faster? (-O2 is supposed to be faster than -Os.)

They actually look identical, but the latter is exploiting the cache. -O2 is stuffing padding to fill a cache line; I'm not even sure this stuff is related to this subroutine. Anyway, at the very least any subsequent code is going to be aligned on a cache line, which will fill the line with relevant code instead of filling it with irrelevant code and needing to fetch another cache line soon after.
_________________
Intel Core i7 2700K/Radeon R7 250/24GB DDR3/256GB SSD
What am I supposed watching?
mv
Watchman


Joined: 20 Apr 2005
Posts: 6780

PostPosted: Wed Aug 09, 2023 9:47 pm

eccerr0r wrote:
mv wrote:
We do not know how the C++ one-liner is implemented, but I guess that it is safe to assume that this is the straightforward implementation by two iterated loops.

You're not talking about the specific, NON-SSE/MMX instructions above, try again.

Try again without shifting goalposts after being proven wrong. The claim you called BS was that using the possibilities of modern processors (SSE* being one example) will, with good optimization, usually produce faster but longer code, the reason being the necessary administrative (initialization) overhead of using them efficiently.

Unfortunately, it seems that the compiler still does not do the example optimization I mentioned automatically, so we can hopefully expect that the tendency to produce even more efficient but longer code by automatic optimization will continue in future compiler versions.
eccerr0r
Watchman


Joined: 01 Jul 2004
Posts: 9974
Location: almost Mile High in the USA

PostPosted: Wed Aug 09, 2023 11:06 pm

mv wrote:
eccerr0r wrote:
mv wrote:
We do not know how the C++ one-liner is implemented, but I guess that it is safe to assume that this is the straightforward implementation by two iterated loops.

You're not talking about the specific, NON-SSE/MMX instructions above, try again.

Try again without shifting goalposts after being proven wrong. The claim you called BS was that using the possibilities of modern processors (SSE* as one example) will with good optimization usually produce faster but longer code, the reason being the necessary administrative (initialization) overhead for using it efficiently.

What shifting goalposts? The goalposts were set from the beginning. Go re-read the first reply about specific instructions ... that you answered totally wrong. Stop doubling down on your own mistakes: you were the person who moved the "goalposts", just to make yourself think you knew better!!!
mv
Watchman


Joined: 20 Apr 2005
Posts: 6780

PostPosted: Thu Aug 10, 2023 6:58 am

eccerr0r wrote:
Go re-read the first reply about specific instructions ...

You must mean this
mv wrote:
...
eccerr0r wrote:
but I would hope coupled with code optimization and using newer CPU instructions, code size could also go down

No, quite the opposite. What the compiler would previously do in a simple loop now needs many instructions to do in parallel, e.g. shifting first to the sse* (and friends)...

eccerr0r wrote:
mv wrote:
eccerr0r wrote:
So you're saying new instructions like movbe, popcnt, aesenc etc. on newer microarchitectures will increase code size? I think not. These tend to decrease code size

They tend tend to increase code size but are more performant, because the loop executes in parallel.

I call BS

where, after being proven wrong, you now want to shift the context from "new instructions" to "two specific instructions" (in which case your "etc" makes no sense at all). I am stopping this nonsense discussion now.
eccerr0r
Watchman


Joined: 01 Jul 2004
Posts: 9974
Location: almost Mile High in the USA

PostPosted: Thu Aug 10, 2023 11:56 am

Actually, you've been proven wrong: there are instructions that, when scheduled in, decrease code size instead of increasing it, and new versions of gcc are necessary to use these instructions. Yes, and I'm done; I proved my point far back. Just stop presenting selective quotes that show nothing as if they were righteous.
Zucca
Moderator


Joined: 14 Jun 2007
Posts: 4001
Location: Rasi, Finland

PostPosted: Thu Aug 10, 2023 7:22 pm

Posts above have been split from: Historically increasing compilation times.

While this is a highly interesting topic, let's try to be more gentle.
For example, prove your point(s) via code and a disassembly of the resulting binary.

From all the talk I'd gather that generally the compilation process would go faster if one chooses a generic x86_64 for -march?
GDH-gentoo
Veteran


Joined: 20 Jul 2019
Posts: 1837
Location: South America

PostPosted: Thu Aug 10, 2023 9:40 pm

I was also thinking that while reading.

eccerr0r wrote:
In fact in old gcc3 I was able to use -O2 and get reasonable size, but nowadays I'm forced to use -Os else it won't fit -- all with the same source code.

Wouldn't the comparison of the outputs of gcc -S for different GCC versions tell you why?

mv wrote:
We do not know how the C++ one-liner is implemented, [...]

Wouldn't g++ -S reveal that?
_________________
NeddySeagoon wrote:
I'm not a witch, I'm a retired electronics engineer :)
Ionen wrote:
As a packager I just don't want things to get messier with weird build systems and multiple toolchains requirements though :)
eccerr0r
Watchman


Joined: 01 Jul 2004
Posts: 9974
Location: almost Mile High in the USA

PostPosted: Thu Aug 10, 2023 11:13 pm

TBH I think this thread has lived its life and everything has already been discussed. Grr.

GDH-gentoo wrote:
Wouldn't the comparison of the outputs of gcc -S for different GCC versions tell you why?

Indeed it would (and did); alas, there was way too much to sift through by hand, and the performance of -Os was sufficient, so it wasn't worth pursuing. However, the version-to-version changes were annoying, especially since the AVR8 target has not changed: no cache, and all instructions except for a handful run in one cycle. Any optimization would at most save a cycle here and there, and if you screw up the optimizing, well, you just lost performance...

I think the only major change on AVR8 was just the addition of mult, and this definitely would decrease code size -- except the chip I was targeting did not support it.
eccerr0r
Watchman


Joined: 01 Jul 2004
Posts: 9974
Location: almost Mile High in the USA

PostPosted: Thu Aug 10, 2023 11:33 pm

Zucca wrote:
From all the talk I'd gather that generally the compilation process would go faster if one chooses a generic x86_64 for -march?

I doubt it, or at least it's marginal. Actually, it would be nice to do profiling to prove it, but TBH my gut feeling is that it's more dependent on the code being compiled.

The first step is all disk: reading in all the headers and doing macro expansion. This is fairly easy for modern CPUs, but disk speed is involved here.

Then, at least for C++ code, class expansion takes a while: every time it instantiates a new class, boom, another copy of the code. Then it needs to go through optimization of all that expansion. However, this hasn't changed, other than C++(YEAR) and all these "new" revisions that I haven't kept up with; they probably add yet more features that take more time to expand out.

Parsing needs to be done to see what you actually wrote, and it also needs to check for coding errors. Removing these checks probably won't help, as each is a conditional branch that still needs to be run in gcc's code stream, though it's worth trying if you're really interested.

After it parses your code, it needs to go through instruction scheduling. I don't think the microarchitecture matters much here (though if it's a big change, like trying to use SSE registers instead of the standard register file, it's a different paradigm and needs a different code stream, and is thus difficult to compare); it just plops down code as needed. Given more constraints, like cache size or the number of registers available, it may be affected slightly, but it should be straightforward and very localized; it's not as if it needs to check for interference several kilobytes away, as that will probably be in another cache line anyway. Oh, and yes, if it needs to branch to that code frequently, it will try to put it close, so that it will more likely not get evicted from the cache... And yes, if you have fewer registers available, it will need to spill and fill more frequently and use more memory, and thus take longer, as more instructions need to be emitted to do what needs to be done.

I think parsing is the lion's share of gcc's runtime; scheduling is equally important, but I don't think it's huge. Technically, though, compile time should not really be important: the resultant binary generated by gcc is ultimately the gold standard. Yes, it's annoying that it takes forever to compile (since we're running Gentoo, after all), but if the resultant binary, whether from your code or from compiling gcc itself, is small and fast, that's the end product that needs to be judged.
eccerr0r
Watchman


Joined: 01 Jul 2004
Posts: 9974
Location: almost Mile High in the USA

PostPosted: Thu Aug 10, 2023 11:47 pm

And I have to write this about the topic of this thread:
Complex CPU instructions do NOT increase binary size. They *DECREASE* binary size, hence the CISC argument to begin with. RISC binaries are larger.

However, optimizing with existing instructions so that code fits the cache and squeezes out instruction latencies does increase binary size.
Though the figures are no longer published by Intel or AMD, execution times differ between instructions. Choosing the faster instructions can significantly decrease runtime, even if you have to run several of them (things like unrolling are related, too). You may need to run multiple of the faster instructions to do the same work as the one slow instruction, and, well, this increases binary size to decrease runtime. This trick can only be done on CISC processors; on RISC you don't have this flexibility.

A bad example here: running x87 math code is *SLOW*, at least compared to SSE/MMX. Though nowadays the chip tries to route x87 code through the MMX/SSE silicon by stealthily converting it, choosing SSE code will give you a performance boost. However, I think SSE code is smaller than x87 code, so this is admittedly a bad example.
sam_
Developer


Joined: 14 Aug 2020
Posts: 2244

PostPosted: Sat Aug 12, 2023 5:17 am

eccerr0r wrote:
So you're saying new instructions like movbe, popcnt, aesenc etc. on newer microarchitectures will increase code size? I think not. These tend to decrease code size, else don't bother using them and actually write loops that perform slower. So not all features add to code size.

Even if there is a case that it really does need to link everything in, C++ code still tends to generate huge binaries, granted if you're comparing to writing an optimized, takes forever to write C code versus stl'ized whatnot and easier to write/maintain C++ code, the former will be smaller. And there's still the aspect that gcc tends to add a lot of excess code from libraries, object files, etc., been fighting it since forever on my microcontroller projects that just won't strip out. Perhaps the hope is that the linker/lto does the right thing and prune excess code from the resultant binary, it's been a pain upgrading gcc, I still keep old gcc around because I really don't want all this new bounds checking, loop unrolling, stack gapping, and continue to write in pure C to fit in 4KB of flash EEPROM. Merely compiling the same code with the same target from version to version gets larger and larger binaries for no good reason, especially if it no longer fits.


Stack smashing protection (SSP), stack clash protection (SCP), fortification (_FORTIFY_SOURCE), PIC & PIE, relro, and separate code sections can all be disabled if desired.

Unrolling is another matter, unrelated to security concerns, and compilers won't do it unconditionally. If you think it's unrolling when it shouldn't, do file a bug upstream.

I'd also like to note, for the benefit of anyone unaware, that both Clang and GCC nowadays have -Oz to even more aggressively optimise for size.
Back to top
View user's profile Send private message