optimization flags, myths and truths for the real world ;-)

xoxo_davide · n00b Joined: 24 Sep 2004 Posts: 37

OK, I finally found the time to post this one.
Everybody wants the best compiler flags (CFLAGS) to put in their make.conf for speed, so after days of tests on recent processors I have some interesting news.
Everything said here applies to Athlon 64, dual-core Athlon 64 and Opteron. No other processors were tested.
Compiler version used: gcc-4.1.2, glibc-2.6.1
Profile: amd64 no-multilib (that means a pure 64-bit system, but that doesn't affect the results).

So, for the impatient, here is the result of two nights and days of performance tests:

######################################################
Best cflags: -march=athlon64 -O3 -pipe -fomit-frame-pointer -finline-functions
######################################################

Shocked??? It looks so normal... ;-)

No -ffast-math? No -ftracer or other exotic flags?
And, just a note: -fomit-frame-pointer and -finline-functions are already enabled by default with -O3, so they are not actually useful there. I keep them because ebuilds sometimes replace -O3 with -O2, and the hope is that the two flags survive alongside -O2. And -pipe is not a real optimization flag; it only speeds up compile time. So the result boils down to -O3.

Please don't argue with me about these results; I won't answer or go into depth anyway. If you want to know a bit more, read on, but then don't argue with me anyway. :-P

--

For those who want to know how I came to that result.

The hardware:
The machines were two similar ASUS-motherboard systems, one with an Athlon 64 and one with a dual-core Athlon 64, plus a Supermicro server board (dual-CPU capable) with a single Opteron.

How I performed the tests:
- I started from acovea (simply put, an evolutionary compiler-flag tester), running all its tests twice on every machine and noting the results: acovea's best flags, acovea's optimistic flags, and acovea's pessimistic flags. Every test run (I said every) gave me different flags. But I had to start somewhere. (Note: each of acovea's tests is specific to a narrow range of operations; they are a kind of per-area CPU test.)
- I care about daily usage, and for daily usage I'll never sacrifice one range of CPU operations for another, so I did a cross-strip of the flags: I deleted from acovea's best and acovea's optimistic sets any flag that appeared in any of acovea's pessimistic sets. From what remained I made 4 groups of flags built from the common best flags and the common optimistic flags, plus I added some "commonly used" groups of best flags (ahem.. well.. at least "thought to be best" flags..). Total: 7 groups of CFLAGS to test.
- Now I needed to choose some programs to compile. Question: will I gain productivity from a 7% speed increase in processing OpenOffice documents? Not me. My typing speed is still slower than my processor.. ;-) And my Calc documents are not heavy enough to show a difference. The same goes for Quanta or the startup time of konqueror. Will I gain something if my gimp filters apply faster? Yes. Or if ark extracts files and a DivX compresses quicker? Yes. The real difference made by CFLAGS optimization in "daily usage" is seen only (!) in time-consuming, processor-heavy apps. I won't go into depth. Not my interest. That said, my representative programs are the following: tar, bzip2, povray, ffmpeg, and sometimes konqueror added to the bunch (yes, konqueror, running a complex AJAX application).
- All programs were compiled on each machine with each group of flags and tested the same way every time. Yeah, that was a whole weekend of testing.
- On the fastest machine I added some tests using exactly acovea's best flags.
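
The cross-strip step described in the list above can be sketched roughly like this. It is only an illustration of the idea, not the script actually used (the original work was done by hand): the function name cross_strip and the flag sets are made up for the example, although the flag names themselves are real gcc options.

```python
def cross_strip(runs, pessimistic_runs):
    """Drop every flag that any pessimistic run reported, then keep only
    the flags that all remaining runs still have in common."""
    banned = set().union(*pessimistic_runs)
    stripped = [set(run) - banned for run in runs]
    common = set.intersection(*stripped) if stripped else set()
    return sorted(common)


# Hypothetical acovea output, for illustration only (not the poster's data).
best_runs = [
    {"-ftracer", "-funroll-loops", "-fweb", "-ffast-math"},
    {"-ftracer", "-fweb", "-fpeel-loops", "-ffast-math"},
]
pessimistic_runs = [
    {"-ffast-math", "-funswitch-loops"},
    {"-fpeel-loops"},
]

common_best = cross_strip(best_runs, pessimistic_runs)
print(" ".join(common_best))
```

With these made-up sets, -ffast-math and -fpeel-loops are banned because a pessimistic run reported them, and -funroll-loops drops out because only one of the two runs had it. The same function could be fed the optimistic sets to build the other groups.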

Conclusions:
Best cflags: -march=athlon64 -O3 -pipe -fomit-frame-pointer -finline-functions
So let gcc do its job. If you need more raw processor speed, don't waste your time: buy a faster one.
Not much else to say.

Some considerations (so that nobody posts something that has already been said or verified elsewhere):
- The latest versions of gcc do a very good job at optimization. Some flags made a noticeable difference back in the 3.x series; now that's over.
- There are well over a hundred CFLAGS. Some of them are more pessimistic taken alone than when combined with others, and some are more harmful in some tasks than in others where, in contrast, they can improve performance.
- Modern programs perform a wide range of tasks, from floating-point calculation to memory block manipulation; you cannot find a combination of flags that is best for everything.
- The -march=athlon64 flag is the first one you have to consider. Compared to this flag, nothing else matters much.
- A well-programmed application can gain much more than the best combination of compiler flags you could find. Programmers: review your slow routines.
- A difference of $20 when buying a processor can give a 30% speed increase when rendering an image with blender. Tweaking flags, maybe 5%. You won't see the difference surfing with konqueror or downloading mail.
- Acovea's flags sometimes give an increase of 40% in performance on... acovea's tests. That's because they are very specific tests, each performing one very specific task at a time. On daily applications they are worse. Anyway, remember that acovea is a good and sometimes very useful program (see the next point).
- If you have a very specific program that performs a very specific, very time-consuming task (let's say you are a researcher and you wrote a C program to solve the field equations of GR on Riemannian manifolds using a subdivision approach) that can take days on your machine, then I suggest taking the heaviest functions and testing them in acovea to find the best-performing CFLAGS. You may really increase speed.

As a last topic, let's nuke some legends:
- You will not gain from 64-bit compiled programs on a modern 64-bit Athlon (or higher) system. False. You WILL gain from 64-bit compiled programs. In some cases even 20%.
- You will not gain from a dual-core processor over a single-core one. Well.. there is some truth in this statement if you use gcc. Applications still aren't optimized for dual processors/dual cores. Some benchmarks found on the internet say you can actually lose performance in some cases. I have no direct experience of this (the single core I used is slower than the dual core anyway, so I can't compare the results), but my guess is that the processor wastes more time dividing up threads and tasks than it would performing everything on a single core. Using icc seems to increase performance a lot. Yafray seems to me to gain much more than other programs compiled on the dual-core machine.
- -Os (or -O) is better because apps load faster. Are you serious? Modern programs are made of a bunch of libraries, which are loaded and run as needed, and they usually don't exceed a few megabytes. The kwrite launcher is a few kilobytes. And so are its libraries. They are loaded in sequence, and the startup of an application spends more time initializing and executing libraries than loading them. So that's just false in most cases (of course I'm not considering old PCs running short of RAM).
- -O1 -fomit-frame-pointer -finline-functions is comparable to -O2. False, and the difference is noticeable.
- -O2 is better than -O3. False, but the difference is often not noticeable.
- -fomit-frame-pointer AND -finline-functions are the first CFLAGS to consider. True. The difference between -O2 and -O3 is pretty much wiped out when you add the two flags to -O2, with a preference for the first of the two.
- -ffast-math or -funsafe-math-optimizations or -mfpmath=387, or any combination of the three, produce faster code. I wonder how many posts I have read from people claiming to be absolutely sure about this (it is very common to bundle -funsafe-math-optimizations with -mfpmath=387; there are guys out there who say they got an impressive 50% increase in some applications!)... WTF!!!! Did they really test it, or did they all read about somebody else who read about it? This is absolutely false. On any amd64 system, every task that stresses the floating-point unit runs slower (and not only those...). A lot slower.

--

I won't post the full results; I really have a dozen hand-written sheets on my desk, and I don't think I'll ever find the time to tidy them up and post them, so please don't ask me for them.
For the most curious, I include one set of tests for povray performed on the fastest machine (the other results are consistent). Same scene rendered for every test. The scene includes transparency and reflections, with radiosity calculated. Pure -O1 and -O2 are omitted; they are not of interest as they are worse anyway. I just want to highlight the interesting and most discussed differences.

Flags -> Rendering time in min:sec (lower is better)
Acovea's best common (stripped) -> 1:13
Acovea's best positive (stripped) -> 1:11
Any of acovea's best sets tried -> always more than 1:17 (I even got a 1:21 in one case)
-Os -> 1:16
-O1 -fomit-frame-pointer -finline-functions -> 1:14
-O2 -fomit-frame-pointer -finline-functions -> 1:10
-O3 -> 1:09 (!)
-O3 -mfpmath=387 -> 1:12
-O3 -ffast-math -> 1:16
-O3 -ffast-math -mfpmath -> 1:17
-O3 -funsafe-math-optimizations -> 1:18
-O3 -funsafe-math-optimizations -mfpmath=387 -> 1:18
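
To put the list above in perspective, here is a small sketch that converts the times to seconds and expresses each one as a slowdown relative to plain -O3. Only rows with an exact time and a fully spelled-out flag set are included; the helper names to_seconds and slowdown_vs are made up for this example. Keep in mind that the times have one-second resolution, so a gap of one second is close to the measurement granularity.

```python
def to_seconds(stamp):
    """Convert a 'min:sec' string such as '1:09' to whole seconds."""
    minutes, seconds = stamp.split(":")
    return int(minutes) * 60 + int(seconds)


# Rendering times copied from the list above.
results = {
    "-Os": "1:16",
    "-O1 -fomit-frame-pointer -finline-functions": "1:14",
    "-O2 -fomit-frame-pointer -finline-functions": "1:10",
    "-O3": "1:09",
    "-O3 -mfpmath=387": "1:12",
    "-O3 -ffast-math": "1:16",
    "-O3 -funsafe-math-optimizations": "1:18",
}


def slowdown_vs(baseline, table):
    """Percentage of extra rendering time compared to the baseline entry."""
    base = to_seconds(table[baseline])
    return {
        flags: 100.0 * (to_seconds(stamp) - base) / base
        for flags, stamp in table.items()
    }


slowdown = slowdown_vs("-O3", results)
for flags, pct in sorted(slowdown.items(), key=lambda item: item[1]):
    print(f"{pct:5.1f}%  {flags}")
```

On these numbers the worst listed combination, -O3 -funsafe-math-optimizations, takes about 13% longer than -O3 alone, while -O2 with the two extra flags is within one second of -O3.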
_________________
Think. Then think twice. Then, if you really need it, talk. But i'm sure you'll still say something stupid.

Keruskerfuerst · Posted: Fri Nov 23, 2007 8:02 pm Post subject:

Simply No!

DaggyStyle · Watchman Joined: 22 Mar 2006 Posts: 5910

According to http://gentoo-wiki.com/Safe_Cflags, -fomit-frame-pointer disables 64-bit support. Care to answer?
_________________
Only two things are infinite, the universe and human stupidity and I'm not sure about the former - Albert Einstein

loftwyr · Posted: Fri Nov 23, 2007 10:15 pm Post subject:

Could you point out where it says it disables 64-bit support? The only thing I read is that it's already enabled at -Os through -O3.
_________________
My emerge --info
Have you run revdep-rebuild lately? It's in gentoolkit and it's worth a shot if things don't work well.
Celebrating 5 years of Gentoo-ing.

s.hase · Apprentice Joined: 19 Nov 2004 Posts: 293

DaggyStyle · Watchman Joined: 22 Mar 2006 Posts: 5910

OK, my bad: none of the 64-bit setups have it, but the 32-bit setups for the same CPU do, so I automatically assumed it. Anyway, what's the effect on the system?

Paapaa · l33t Joined: 14 Aug 2005 Posts: 955 Location: Finland

aTan

squirrelfishfrog · n00b Joined: 28 Nov 2007 Posts: 15

I have some results and questions about the SSE flags.

I wrote a benchmark to test the use of SSE instructions on a Xeon processor (gcc 4.2.0), and the results are weird.
The code consists of loops with a lot of floating-point operations, so this is exactly what SSE2 is made for.

My flags were -O{n} -march=nocona -mtune=nocona
(with n=0,1,2,3; nocona worked and does support sse and sse2, and /proc/cpuinfo also lists sse and sse2 as supported)

The Gentoo optimization guide says that -msse and -msse2 are implied by the correct -march, but I included them explicitly.

I compiled the same code without those flags (no -march flag, same -O level), but from what I see in `gcc -v -Q example.c` the -msse(2) options are still enabled, among many others.

So the result was: no matter which -O level I set, performance is equal between the generic build and the build with the correct -march/-msse2.

Also, I read that -mfpmath=sse,387 activates an additional processing unit for SSE (you did not comment on that one), but it had no positive effect for me.
I tried -funroll-loops: no advantage for the sse2-compiled code.

I would have assumed that those features (-msse, -msse2, -mmmx, ...) are deactivated for a generic build, for compatibility with older processors, so I hope I did something wrong, because it seems they are always on....

Any comments?

By the way: do I misunderstand what gcc -v -Q lists under "options enabled:"? And by no effect I mean literally no effect, not even a millisecond on average (I did some statistical error estimation).

I decided to stick with CFLAGS="-O2 -march=appropriatecpu -mtune=appropriatecpu -pipe"

But still, for my own code, I would like to know what SSE actually does (performance-wise)... and whether it really is always on if gcc thinks it should turn it on for your system... and so on.

I would be thankful for any hints on that.
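
One way to check the "options enabled:" question is to save the output of `gcc -v -Q` for each build and diff the two lists. The sketch below assumes the output contains a line starting with "options enabled:" followed by indented continuation lines, as quoted in the post; the function name enabled_options and both sample texts are hypothetical and heavily shortened, not real compiler output. If the generic build already lists -msse and -msse2, one possible explanation (an assumption here, since the post does not say whether the Xeon runs a 64-bit toolchain) is that SSE and SSE2 are part of the baseline x86-64 instruction set, so a 64-bit gcc has them on regardless of -march.

```python
def enabled_options(verbose_output):
    """Collect the flags listed under 'options enabled:' in saved
    `gcc -v -Q` output (assumed format: indented continuation lines)."""
    flags = set()
    collecting = False
    for line in verbose_output.splitlines():
        if line.startswith("options enabled:"):
            collecting = True
            line = line[len("options enabled:"):]
        elif collecting and not line.startswith(" "):
            break  # first non-indented line ends the list
        if collecting:
            flags.update(tok for tok in line.split() if tok.startswith("-"))
    return flags


# Hypothetical, shortened samples for illustration only.
generic_out = (
    "options passed:  -v example.c -O2\n"
    "options enabled:  -falign-loops -fbranch-count-reg\n"
    " -m64 -m80387 -mmmx -msse -msse2\n"
)
nocona_out = (
    "options passed:  -v example.c -march=nocona -mtune=nocona -O2\n"
    "options enabled:  -falign-loops -fbranch-count-reg\n"
    " -m64 -m80387 -mmmx -msse -msse2 -msse3\n"
)

only_in_nocona = enabled_options(nocona_out) - enabled_options(generic_out)
print(sorted(only_in_nocona))
```

If the difference between the two real outputs is empty or tiny, identical run times for the two builds are not surprising.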

red-wolf76 · Posted: Wed Nov 28, 2007 2:08 pm Post subject:

If you're using a sufficiently advanced toolchain, you might consider -march=native. I wouldn't recommend using both -march and -mtune at the same time. Seems rather pointless.
_________________
0mFg, G3nt00 r0X0r$ T3h B1g!1111

Use sane CFLAGS! If for no other reason, do it for the lulz!

squirrelfishfrog · n00b Joined: 28 Nov 2007 Posts: 15

yes,
-march=X implies -mtune=X,
they differ only in the generic option.
But it doesn't hurt. (or does it?)

native (thanks for that):

i recompiled with -march=native
then with -mtune=generic

both versions had the same runtime.

At work I have access to Intel's C compiler (versions 9.1.045 and 10.0.023), so I tried icc's (9.1.045) optimization flags, and there was a 10% difference (the version without -march was slightly worse than the -march=pentium4 version). Since the faster one was as fast as the gcc-compiled ones, I would assume that my assumption was right and that sse(2) is always on. It is possible, though, that SSE had nothing to do with it and that some other optimization is responsible for the 10% difference.

I will try that with some other processors as soon as I can.

loftwyr · Posted: Wed Nov 28, 2007 3:28 pm Post subject:

Interestingly, I just found out that on my X2 CPU, SSE3 isn't enabled by -march=native. With gcc -v -Q -march=native, it passes -march=k8 -mtune=k8, and that does not include -msse3. My CPU does support SSE3, so it should be enabled.

So much for -march=native.

*EDIT*
Seems it's true, and it's fixed in 4.3
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33312

timeBandit · Posted: Wed Nov 28, 2007 4:11 pm Post subject:

red-wolf76 · Posted: Wed Nov 28, 2007 4:21 pm Post subject:

squirrelfishfrog · n00b Joined: 28 Nov 2007 Posts: 15

red-wolf76 · Posted: Wed Nov 28, 2007 5:20 pm Post subject:

Physics number crunching may not be the poster child, but there are actually applications that benefit quite a bit from -ffast-math. I remember hearing something about video apps. But certainly, that wouldn't warrant a global setting, more likely, the ebuilds that benefit from it should be written to include it of themselves.

Paapaa · l33t Joined: 14 Aug 2005 Posts: 955 Location: Finland

xoxo_davide · n00b Joined: 24 Sep 2004 Posts: 37

Keruskerfuerst · Posted: Fri Nov 30, 2007 4:53 pm Post subject:

Acovea is not working properly.

The best result is -O2 and some additional flags:

CFLAGS:

squirrelfishfrog · n00b Joined: 28 Nov 2007 Posts: 15

JeliJami · Veteran Joined: 17 Jan 2006 Posts: 1086 Location: Belgium

Keruskerfuerst · Posted: Wed Dec 05, 2007 8:48 pm Post subject:

MP_ · Posted: Thu Dec 06, 2007 10:32 am Post subject:

timeBandit · Posted: Thu Dec 06, 2007 3:43 pm Post subject:

Keruskerfuerst · Posted: Thu Dec 06, 2007 7:35 pm Post subject:

I have made some experiments with ASFLAGS, but they did not improve execution speed.
ASFLAGS="--64"
ASFLAGS="-mtune=aaa"
ASFLAGS="-march=aaa"