Gentoo Forums
PGO LTO optimize: compile with chroot on fast CPU
GreenNeonWhale
n00b


Joined: 30 Mar 2016
Posts: 63

Posted: Thu Aug 29, 2024 4:52 pm    Post subject: PGO LTO optimize: compile with chroot on fast CPU

Hi,

I hope that I've posted this in the correct sub-forum.

My question involves the following scenario:
Overall: compile code on faster CPU for installation and use on slower CPU
Method:
- mount entire filesystem from Slow CPU on FAST CPU system.
- chroot
- execute compiles, emerge/portage etc., and compile the code
- -march=amdfam10 set in make.conf in chroot, and on native system.
- -march=native has never been set in make.conf anytime during this.
- entire chroot used as native environment on Slow CPU system.
Assumption: Fast CPU has all the features of Slow CPU. In my case, Fast CPU is a 2nd Gen Bulldozer, and Slow CPU is a Turion II Dual-Core (march=amdfam10). So far, no crashes or execution problems.

My understanding is that both PGO and LTO optimizations involve executing code during compilation, and analyzing the results.

My Question: Given the above scenario, if I turn on PGO and/or LTO optimizations, which run on the faster and somewhat different CPU, and those results are used by the compiler to optimize, will I get code that will end up with poor optimizations when executed on the slower CPU as it is a bit different? Basically, am I wasting my time with PGO and/or LTO turned on, when I'm compiling on a non-native CPU like this? Please note, that in this scenario, substantially longer compile times are acceptable to me if I gain faster code in the end.

I'd appreciate any thoughts and/or insight you folks can offer.

Thank You!
stefantalpalaru
Tux's lil' helper


Joined: 11 Jan 2009
Posts: 77
Location: Italy

Posted: Thu Aug 29, 2024 6:16 pm    Post subject: Re: PGO LTO optimize: compile with chroot on fast CPU

GreenNeonWhale wrote:
My understanding is that both PGO and LTO optimizations involve executing code during compilation, and analyzing the results.


No. Only PGO does that.

GreenNeonWhale wrote:
will I get code that will end up with poor optimizations when executed on the slower CPU as it is a bit different?


No.
Hu
Administrator


Joined: 06 Mar 2007
Posts: 22648

Posted: Thu Aug 29, 2024 6:20 pm

Link Time Optimization does not run the built code as part of its operation. It is merely an expedient way to give the compiler visibility into the whole program, so that it can perform optimizations across translation units. PGO does run instrumented code, but as far as I know, it is only looking at which paths are executed most often, not at how quickly the CPU can execute them. Therefore, results obtained on a fast system should not be skewed relative to what you would get running on a slow system (unless the test involves timing sensitive decisions, like "Run this loop as many times as possible in 2 seconds"). In my opinion, such tests should be rare.
molletts
Tux's lil' helper


Joined: 16 Feb 2013
Posts: 129

Posted: Thu Aug 29, 2024 6:22 pm

Hi,

As I understand it:
  • LTO doesn't execute any of the generated code - it effectively defers (some of) the optimisation phase of compilation until link time, when all of the code to be linked together is available for inspection, allowing better choices to be made by the optimiser than it would have been able to make by looking at small chunks of code in isolation.
  • PGO does execute the generated code during the profiling phase but, as far as I am aware, it doesn't rely on precise timings, either at the instruction level or at function level. Instead, it collects statistics about things like how many times functions get called and from where, how many times loops go around (and how often they get run), which branch of "if" statements gets taken most often (the "if this then do this" case or the "otherwise do this" case) and suchlike. This allows the optimiser to make more informed decisions about whether it's worth inlining a function or unrolling a loop and whether it might be beneficial to reorganise an "if" statement so that the most likely outcome results in the code running straight through instead of jumping to elsewhere then coming back.
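To illustrate the point about counts versus timings, here is a toy sketch in Python. It is not how GCC's instrumentation is implemented; the function name `classify`, the counter names and the artificial `delay` are all made up for the illustration. It only shows that a profile built from execution counts comes out the same whether the run is fast or slow.

```python
import time
from collections import Counter


def classify(values, counters, delay=0.0):
    """Toy 'instrumented' routine: bump a counter for every branch taken.

    `delay` is a hypothetical stand-in for running on a slower CPU.
    """
    for v in values:
        if delay:
            time.sleep(delay)  # slows the run down, changes nothing else
        if v % 3 == 0:
            counters["branch_taken"] += 1
        else:
            counters["branch_not_taken"] += 1
        counters["loop_iterations"] += 1
    return counters


workload = list(range(30))

# Same workload, two very different execution speeds.
fast_profile = classify(workload, Counter())
slow_profile = classify(workload, Counter(), delay=0.001)

# The recorded counts are identical, so an optimiser fed either profile
# would see the same "hot" and "cold" paths.
print(fast_profile == slow_profile)
print(dict(fast_profile))
```

The profiles would only differ if the workload itself made timing-dependent decisions, such as looping until a wall-clock deadline.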

I haven't read the code, though, so I may be wrong.

I've been doing what you describe for some years now on this basis and haven't encountered any major problems (except when I accidentally forgot that I had put -march=native in make.conf on one of the "client" systems while experimenting on it...). I think I've seen one or two packages whose build systems generate native code regardless of the -march=... setting, but I can't remember which ones offhand. I don't think they were big ones, so I was able to just rebuild them locally. They've either been fixed or I'm not using them any more (or they haven't been updated for a long time), because it's quite a while since I last had to do this.

My slower systems all have a small hand-built initramfs, bootable from the grub menu instead of the main system, which exports the main system drive via nbd (the Network Block Device). I then mount this on the build host (usually either my main PC, an FX-9590, or my test-lab server, a dual Opteron 6380, both -march=bdver2) and chroot into it to run updates. I use this method on systems including a Pentium M laptop, an Athlon Neo (-march=k8-sse3) mini-PC and my 1st-gen Core i5 (-march=nehalem) work laptop. The only system I can't update from these is an Atom-based "set top box"-type PC which I have to update from a VM in the Xeon-based cluster at work because it has the MOVBE instruction which isn't available on AMD Piledriver. Interestingly, the Core i5 laptop can't be updated from my Core i7 desktop at work because the i7 doesn't have AES-NI!
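A quick way to spot this kind of mismatch before chrooting is to compare the CPU feature flags of the two machines. The sketch below is only an illustration: the helper names `parse_flags` and `missing_on_host` are invented here, and the sample flag lines are shortened, made-up stand-ins for the real `flags` line of `/proc/cpuinfo` on each system.

```python
def parse_flags(cpuinfo_text):
    """Return the feature flags from the first 'flags' line of /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def missing_on_host(host_cpuinfo, target_cpuinfo):
    """Flags the target CPU reports that the build host does not."""
    return sorted(parse_flags(target_cpuinfo) - parse_flags(host_cpuinfo))


# Hypothetical, heavily shortened samples (real flag lines are much longer).
host_sample = "model name\t: build host\nflags\t\t: fpu sse sse2 ssse3 sse4_1 sse4_2"
target_sample = "model name\t: target\nflags\t\t: fpu sse sse2 ssse3 sse4_1 sse4_2 aes"

print(missing_on_host(host_sample, target_sample))
```

A non-empty result is only a hint, not proof of trouble: what matters is whether the compiler actually emits those instructions for the target's -march setting, and whether anything built that way gets executed on the host during the build.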

If anyone with deeper knowledge of the internal workings of the toolchain can correct any misunderstandings (or confirm that this is broadly correct), I'd certainly be grateful.

Thanks,
Stephen
GreenNeonWhale
n00b


Joined: 30 Mar 2016
Posts: 63

Posted: Thu Aug 29, 2024 7:35 pm    Post subject: Thank You! :) & Some Other Thoughts

Hu & molletts,

Thank you for your replies. They were quite helpful to me, and quite informative. :)

molletts,

I think you and I have come to some similar ideas in regards to computing. Thus, I thought I'd share in case you (or anyone else reading this) are interested.

I too have decided, at some point, to try out a 9590. I haven't gotten there yet, as it's still in its box, but I really hope to soon. I have a STRONG distaste for management engines. If I'm not mistaken, it is one of the fastest CPUs available that doesn't have one. Raptor Computing's POWER9 systems look fantastic, but they're rather expensive. I'm also planning to try my hand at running coreboot/libreboot on the Opteron board that they fully support. Hopefully, it will work out.

Out of curiosity, if you feel like sharing, do you air cool your 9590, or water cool it?

While I haven't built a custom initramfs like you have, what I have built and used is a variation on that idea. I've built a custom boot/rescue/backup OS on a separate device, either a USB flash drive with hardware write protect (Kanguru), one without, or an SD card, on which I've installed a very basic setup. Said basic setup is for the purposes of booting a full disk encryption OS on the primary drive, using tripwire to scan the primary OS (and the boot drive), performing backup operations while the primary OS isn't running, and other such rescue/maintenance tasks. When I'm running the primary OS, I usually disconnect the boot drive. That way the bootloader, the kernel, and the entire tripwire setup are completely inaccessible to the primary OS most of the time. Furthermore, the write protect capable flash drive gives me just a little more protection, as I can keep it read-only, enforced by hardware, except after I've verified the system and temporarily disconnected the network. Thus, hopefully, keeping everything somewhat more secure. So far, it's worked out rather well. It is more work to use and maintain, but I like the extra security of it all.

Again, my thanks to you both for your replies and help. I really appreciate it.
molletts
Tux's lil' helper


Joined: 16 Feb 2013
Posts: 129

Posted: Thu Aug 29, 2024 9:25 pm    Post subject: Re: Thank You! :) & Some Other Thoughts

GreenNeonWhale wrote:
Out of curiosity, if you feel like sharing, do you air cool your 9590, or water cool it?

I went with the best-performing air cooling I could get when I got the 9590 - a Prolimatech Genesis which also cools the RAM a bit. At the time, the best air coolers were neck-and-neck with the best closed-loop water coolers so there was no big benefit to using water. I had also experienced a few annoyances with a water-cooled Core i7-2600K system which I had at work at the time: the pump was loud at full speed (it would go "brrrrrrr!") but the very low "thermal mass" (heat capacity) of the waterblock meant that, if the pump speed was temperature-controlled, the CPU would get very hot very fast in the time it took the controller to respond and ramp up the pump to full speed (which was probably less than a second). Hitting it really hard, with an 8-thread build job, would result in the CPU temperature spiking to about 80°C before settling back to 45°C or so within a few seconds. The big, rapid temperature swings made me very uncomfortable - they are not good for hardware. The revving London taxi sound-effects were amusing at first but rapidly became annoying.

I'd probably go with a big Noctua if I had to buy a cooler for the 9590 today, although I'd also look into the current state of closed-loop water coolers as my case can accommodate a 280mm radiator. 220W was a lot back in 2014-ish or whenever it was I got it (crikey, that's a decade ago! 8O) but modern high-end desktop CPUs can dissipate a lot more than that so coolers have had to improve. The biggest problem would be finding something that will fit Socket AM3+! (Although it wouldn't surprise me if Noctua could supply mounting brackets for fairly recent coolers even if they don't come in the box. They're pretty cool guys in my experience - they sent me a set of brackets for an NH-U9 to allow me to rotate it by 90° from its normal "front-to-back" orientation, to suit a case with a top exhaust fan.)

I had a few problems getting the 9590 stable at 4.7GHz; going with auto-configured BIOS settings gave frequent hangs but after much tweaking I got it pretty solid (the 9590 is, after all, basically just a tightly-binned 8370 that's been factory-overclocked to within an inch of its life; my experience probably wasn't much different from what I'd have had to do if I'd just bought an 8370 and wanted to overclock it). A microcode update also helped a lot by disabling the LWP instructions which were found to cause core lockups - if I'd known about the issue beforehand, I'd have just added -mno-lwp to my CFLAGS but I discovered it when I got a SIGILL after the update! Having investigated why LWP had been removed, I stuck with it and rebuilt the affected packages as I came across them (I don't think there were many). Since then, it's been pretty stable. If I hit it hard with AVX (8-thread ffmpeg encode, for example), it can still hang, but it happily builds www-client/firefox[pgo,lto] in about 2 hours.

It does throttle from time to time, especially in summer. I'm not sure which temperature sensor governs this, though: I've seen it pulling the full 220W at 4.7GHz (on the fam15h_power sensor) while the motherboard's CPU temperature sensor reports over 60°C (its official Tj(max) is 57°C), yet at other times it throttles back to 180W and about 4GHz at 55°C.

I've rambled on for far too long! Hopefully some of this is useful, though.

Stephen
GreenNeonWhale
n00b


Joined: 30 Mar 2016
Posts: 63

Posted: Thu Aug 29, 2024 11:51 pm

Stephen

Thanks for the info, that was helpful. I saved it in my notes for when I get the chance to tackle my 9590 project. :)
Goverp
Advocate


Joined: 07 Mar 2007
Posts: 2179

Posted: Fri Aug 30, 2024 10:07 am    Post subject: Re: PGO LTO optimize: compile with chroot on fast CPU

stefantalpalaru wrote:
GreenNeonWhale wrote:
My understanding is that both PGO and LTO optimizations involve executing code during compilation, and analyzing the results.

No. Only PGO does that. ...

AFAIK the correct answer is "neither". The process is
  1. compile the code with instrumentation to log execution patterns in a profile
  2. run that code in your typical usage pattern to create a profile
  3. recompile the code without the instrumentation but optimized according to the execution patterns in the profile collected.
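For a single C file built with GCC, the three steps above might look roughly like the commands this Python sketch prints. The file names `app.c`, `app` and `typical-input.txt` are hypothetical, and the helper `pgo_commands` is just a way to lay the steps out; `-fprofile-generate` and `-fprofile-use` are the GCC options for the instrumented and the profile-guided build. How a given package or ebuild wires this up may well differ.

```python
import shlex


def pgo_commands(source="app.c", binary="app", workload_args=()):
    """Return the three command lines of a minimal GCC PGO cycle.

    Nothing is executed here; the commands are only assembled.
    """
    # 1. Build with instrumentation that records execution counts.
    instrumented_build = ["gcc", "-O2", "-fprofile-generate", source, "-o", binary]
    # 2. Run the instrumented binary on a representative workload;
    #    on exit it writes the profile data (.gcda files).
    training_run = ["./" + binary, *workload_args]
    # 3. Rebuild without instrumentation, optimised using the profile.
    optimized_build = ["gcc", "-O2", "-fprofile-use", source, "-o", binary]
    return [instrumented_build, training_run, optimized_build]


for step, cmd in enumerate(pgo_commands(workload_args=["typical-input.txt"]), start=1):
    print(f"{step}. {shlex.join(cmd)}")
```

Only step 2 runs the built code, and what it records is which paths the workload exercised, so the quality of the result depends on how representative that workload is rather than on the speed of the machine running it.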


I believe some packages ship with a developer-collected profile - for example firefox, so you get a version optimized according to the developer's guess as to normal usage. I've also seen advice to use a profile collected while running the python test suite, which I guess could be shared rather than recreated, but of course optimizes your python for running test suites rather than say running portage... IMHO neither approach makes much sense.

I guess some builds could run a developer-provided script to create a profile and then recompile with the resulting profile, all in one (long) "compilation", but that seems doubly pointless - the profile could just be included (like firefox), and anyway it's not for your use pattern.
_________________
Greybeard
GreenNeonWhale
n00b


Joined: 30 Mar 2016
Posts: 63

Posted: Fri Aug 30, 2024 4:41 pm

Goverp,

Thank you for your write up and further explanation. That will be quite helpful in making future compile/optimization decisions.
All times are GMT