Author |
Message |
NTU Apprentice
Joined: 17 Jul 2015 Posts: 187
|
Posted: Tue Feb 26, 2019 8:15 pm Post subject: LLVM/Clang 8 coming with BDVER2 optimizations |
|
|
LLVM/Clang 8 is coming out soon and it has optimizations for BDVER2, so I'll be building it (along with compiler-rt, clang-runtime, etc.) with -O3 -ftree-vectorize -funsafe-math-optimizations (more compliant than -ffast-math, but it still allows vectorization of floating-point code, or something along those lines) and -march=bdver2. I have a bunch of older APUs lying around, as they were dirt cheap (around 40-60 bucks; A10s were around $120, which is still not bad). Is rsbench a good test for this kind of thing?
https://github.com/darktable-org/rawspeed/tree/develop/src/utilities/rsbench
I know that -O3 and those other flags can slow things down or introduce bugs, which is why I want to make sure I get some good tests going. |
|
|
|
Naib Watchman
Joined: 21 May 2004 Posts: 6051 Location: Removed by Neddy
|
Posted: Tue Feb 26, 2019 8:51 pm Post subject: |
|
|
"more compliant than -ffast-math"... you must not like your math results then _________________
Quote: | Removed by Chiitoo |
|
|
|
|
Akkara Bodhisattva
Joined: 28 Mar 2006 Posts: 6702 Location: &akkara
|
Posted: Tue Feb 26, 2019 9:47 pm Post subject: |
|
|
I do not recommend using -O3 unless you have personally inspected the resulting ".s" assembler output and concluded that the optimization level is doing what you want.
Usually what happens is that the "easy" stuff gets unrolled and expanded out to squeeze every last cycle out of it, while the more complex stuff isn't much different from what -O2 gives you. The problem is that the "easy" stuff tends to be initializations and run-once code, where speed doesn't matter. Worse, the now bigger code size pushes other things out of cache and can make things slower. This is especially true when combined with -ftree-vectorize. Unless you've gone in and peppered your code with __restrict__ and __attribute__((aligned(16))), the compiler doesn't know whether the pointers are aligned and what shortcuts it can take. Instead it generates two copies of the code, one vectorized assuming it is aligned, the other simply unrolled, and tests and jumps to the appropriate one.
Meanwhile your heavy loops don't benefit much unless you've been through several dozen iterations of hand-tweaking to reduce inter-statement data dependencies so that -O3 might have something interesting to chew on. If you've done that, and inspected the assembler output to check that it's "seeing" what you have in mind, and benchmarked to make sure it even matters, then -O3 could help. But even then, only on the files you've given this kind of attention.
Personally, I find that -Os gives better overall performance when I'm not tweaking by hand. And many packages where it matters will already have custom build flags for the critical sections.
Regarding -funsafe-math-optimizations, read up on all the -f*math* flags. Then pick the ones you actually care about. I often use -fno-math-errno in my own code to reduce data dependencies across mathlib calls, along with -fno-trapping-math -fno-signaling-nans in embedded systems where there's no place to trap to. I don't touch the general packages. _________________ Many think that Dilbert is a comic. Unfortunately it is a documentary. |
|
|
|
Zucca Moderator
Joined: 14 Jun 2007 Posts: 3366 Location: Rasi, Finland
|
Posted: Tue Feb 26, 2019 11:39 pm Post subject: |
|
|
Akkara wrote: | I do not recommend using -O3 unless you have personally inspected the resulting ".s" assembler output and concluded that the optimization level is doing what you want. |
This article, while not very extensive, suggests that -O3 is quite a safe bet, which was somewhat of a surprise to me. _________________ ..: Zucca :..
Gentoo IRC channels reside on Libera.Chat.
--
Quote: | I am NaN! I am a man! |
|
|
|
|
Akkara Bodhisattva
Joined: 28 Mar 2006 Posts: 6702 Location: &akkara
|
Posted: Wed Feb 27, 2019 6:22 am Post subject: |
|
|
Zucca wrote: | Akkara wrote: | I do not recommend using -O3 unless you have personally inspected the resulting ".s" assembler output and concluded that the optimization level is doing what you want. |
This article, while not very extensive, suggests that -O3 is quite a safe bet, which was somewhat of a surprise to me. |
Interesting, and good to know. Thanks. My comments are mostly based on experience with gcc-5 and 6, with some cursory checking of 7 (I should have remembered to mention that), and on battles fought trying to get some inner loops vectorized, generally failing to achieve what I was looking for, and finally resorting to calling the intrinsics directly (non-portably). They are benchmarking with gcc-9, so there's hope things are getting better!
Looking at the benchmarks, there are some where -O3 clearly helps, such as ray-tracing. But the average results on the last page show only a 4% improvement overall. I don't know how statistically significant that is. It does mean that -O3, at worst, probably won't hurt too much. The biggest positive effect (after going from no optimization to some) seems to come from -march=native. It would be interesting to see how the numbers for -O2 -march=native and for -Os -march=native compare. It would also be interesting to see how the code size is affected. Sometimes benchmarks that run well in isolation run more poorly when they run concurrently with other programs and have to contend for cache. |
|
|
|
Zucca Moderator
Joined: 14 Jun 2007 Posts: 3366 Location: Rasi, Finland
|
Posted: Wed Feb 27, 2019 11:43 am Post subject: |
|
|
I also tried to find another article where some programs compiled with -O2 -fsome-lto-flags-enabled were faster than with -O3 -fsome-flags: not significantly, but noticeably.
Then there's also -Ofast... I wonder if it's any better than -O3, even in the best scenarios.
EDIT: There's this 2016 test. |
|
|
|
|
NTU Apprentice
Joined: 17 Jul 2015 Posts: 187
|
Posted: Fri Mar 01, 2019 2:27 am Post subject: |
|
|
Using -march flags with -ftree-vectorize only really makes sense, though, together with -funsafe-math-optimizations, as certain loop constructs can only be vectorized if GCC is allowed to change the order of math ops. -march with -ftree-vectorize can actually increase code size without benefit, as the compiled code will not have those SIMD instructions properly aligned. That is where -ffast-math or -funsafe-math-optimizations come into play. -march on its own is the least efficient way to use those extra CPU instructions because of the lack of vectorization. Maybe I have this wrong, but I thought -march only makes sense with -ftree-vectorize and either -funsafe-math-optimizations or -ffast-math.
Last edited by NTU on Fri Mar 01, 2019 7:38 am; edited 1 time in total |
|
|
|
Akkara Bodhisattva
Joined: 28 Mar 2006 Posts: 6702 Location: &akkara
|
Posted: Fri Mar 01, 2019 2:50 am Post subject: |
|
|
That's true if the heavy part of your code is doing floating-point.
There are also integer vector opcodes. Those don't need special math flags to vectorize well, because integer addition and multiplication are fully commutative and associative. (Unlike floating-point, where ((a + b) + c) doesn't always give the same answer as (a + (b + c)).)
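To make the reordering point concrete, here is a minimal sketch in plain C. It is an illustration only, not what GCC or Clang actually emit: the function names and the test data are made up for this example, and the "lanes" version merely imitates in scalar code the shape of a 4-wide SIMD reduction. It assumes float arithmetic is evaluated in single precision (the usual case on x86-64 and AArch64) and that the file is built without -ffast-math.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Left-to-right sums: the evaluation order the source code spells out. */
float sum_f32_sequential(const float *v, size_t n) {
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += v[i];
    return s;
}

uint32_t sum_u32_sequential(const uint32_t *v, size_t n) {
    uint32_t s = 0;
    for (size_t i = 0; i < n; i++)
        s += v[i];                 /* unsigned overflow wraps around */
    return s;
}

/* Four independent partial sums combined at the end. This is roughly the
   shape of a 4-lane SIMD reduction, written in scalar C (hypothetical
   helper, not compiler output). */
float sum_f32_lanes(const float *v, size_t n) {
    float lane[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    size_t i = 0;
    for (; i + 4 <= n; i += 4)
        for (size_t k = 0; k < 4; k++)
            lane[k] += v[i + k];
    float s = (lane[0] + lane[1]) + (lane[2] + lane[3]);
    for (; i < n; i++)             /* leftover elements */
        s += v[i];
    return s;
}

uint32_t sum_u32_lanes(const uint32_t *v, size_t n) {
    uint32_t lane[4] = {0, 0, 0, 0};
    size_t i = 0;
    for (; i + 4 <= n; i += 4)
        for (size_t k = 0; k < 4; k++)
            lane[k] += v[i + k];
    uint32_t s = (lane[0] + lane[1]) + (lane[2] + lane[3]);
    for (; i < n; i++)
        s += v[i];
    return s;
}

/* Integer case: both orders agree, even though the sum wraps past 2^32. */
int demo_u32_orders_agree(void) {
    static const uint32_t v[9] = {4000000000u, 1u, 4000000000u, 1u,
                                  7u, 7u, 7u, 7u, 3u};
    return sum_u32_sequential(v, 9) == sum_u32_lanes(v, 9);
}

/* Float case: the small terms are absorbed at different points depending
   on the order, so the two sums differ for this (made-up) data. */
int demo_f32_orders_differ(void) {
    static const float v[8] = {1e8f, 1.0f, -1e8f, 1.0f,
                               0.0f, 0.0f, 0.0f, 0.0f};
    return sum_f32_sequential(v, 8) != sum_f32_lanes(v, 8);
}

/* Aborts if either demonstration does not behave as described. */
void check_reordering_demo(void) {
    assert(demo_u32_orders_agree());
    assert(demo_f32_orders_differ());
}
```

Because the lane-wise float sum can give a different answer, a compiler that sticks to IEEE semantics may not rewrite the sequential float loop into the lane form on its own. For the unsigned loop the two forms are interchangeable, so no math flags are needed.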
If you're going to use -ftree-vectorize, avoid code bloat by telling the compiler your buffers are aligned. (And make sure they actually are aligned!) Read up on __attribute__((aligned(N))), which works for both gcc and clang.
Also try to reduce data dependencies. Statements such as y[n] = a * y[n-1] + stuff are hard to vectorize because you can't calculate several y's in parallel without knowing the previous result. Remember that the compiler can't know that two pointers don't happen to be pointing to the same data. So even an innocuous-looking statement such as y[n] = x[n] + x[n-1] leads to code bloat: the compiler checks that x and y don't overlap and jumps to the vectorized version, otherwise it uses the one-iteration-at-a-time version. Use __restrict__ on pointer declarations to state that there are no data dependencies between them. |
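A minimal sketch of those hints, assuming GCC or Clang. The buffer names, the size N and the helper functions are hypothetical. Note also that __builtin_assume_aligned is a GCC/Clang builtin that this example adds on top of the __attribute__((aligned(N))) mentioned above. Whether the hinted version really compiles to smaller or faster code depends on compiler version and flags, so it is worth checking the ".s" output.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define N 1024  /* arbitrary size chosen for the example */

/* Buffers declared with an explicit 16-byte alignment. */
static float x_buf[N]    __attribute__((aligned(16)));
static float y_plain[N]  __attribute__((aligned(16)));
static float y_hinted[N] __attribute__((aligned(16)));

/* No hints: the compiler has to allow for x and y overlapping and for
   either pointer being unaligned. */
void pair_sum_plain(float *y, const float *x, size_t n) {
    for (size_t i = 1; i < n; i++)
        y[i] = x[i] + x[i - 1];
}

/* With hints: __restrict__ promises that x and y do not overlap, and
   __builtin_assume_aligned promises 16-byte alignment. Breaking either
   promise at a call site is undefined behavior. */
void pair_sum_hinted(float *__restrict__ y, const float *__restrict__ x,
                     size_t n) {
    y = __builtin_assume_aligned(y, 16);
    x = __builtin_assume_aligned(x, 16);
    for (size_t i = 1; i < n; i++)
        y[i] = x[i] + x[i - 1];
}

/* Runs both versions on the aligned buffers and returns 1 if the
   results match element for element. */
int pair_sum_versions_agree(void) {
    /* "Make sure they actually are aligned": check before promising. */
    assert((uintptr_t)x_buf % 16 == 0);
    assert((uintptr_t)y_plain % 16 == 0);
    assert((uintptr_t)y_hinted % 16 == 0);

    for (size_t i = 0; i < N; i++)
        x_buf[i] = (float)i * 0.5f;

    pair_sum_plain(y_plain, x_buf, N);
    pair_sum_hinted(y_hinted, x_buf, N);

    for (size_t i = 0; i < N; i++)
        if (y_plain[i] != y_hinted[i])
            return 0;
    return 1;
}
```

The hints do not change the results. They only remove the need for the runtime overlap and alignment checks, which is where the duplicated code paths come from.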
|
|
|
NTU Apprentice
Joined: 17 Jul 2015 Posts: 187
|
Posted: Fri Mar 01, 2019 7:28 am Post subject: |
|
|
Haha, I'm not going to be digging through the large LLVM code base. I'd rather just do some benchmarking, comparing LLVM+Clang compiled with -march=bdver2 -O3 -ftree-vectorize -funsafe-math-optimizations against plain -O2, and do some _basic_ sanity checking. It's not practical to go through the entire tree and analyze the compiled assembly of LLVM and Clang themselves. Any suggestions on an easy way to do some simple performance/compliance tests?
-ffast-math is pushing it a bit too far with IEEE 754 non-compliance: MPFR, MPC and GMP have quite a few tests that fail. With -funsafe-math-optimizations, there's only one error here and there in the testsuites per package. I forget exactly what the error rate was, but I know that -ffast-math performed much worse in `make check` per package. |
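One cheap kind of compliance check, sketched below under some assumptions, is a small C function compiled with the flags under test that probes a few IEEE 754 behaviors the -f*math* options are allowed to break. The function names are invented for this example, and the checks are nowhere near exhaustive. It is a smoke test, not a substitute for a package's own `make check`. The __FAST_MATH__ macro is predefined by GCC and Clang under -ffast-math; -funsafe-math-optimizations alone is not expected to define it.

```c
#include <assert.h>
#include <float.h>

/* Returns the number of failed checks; 0 means the build still shows the
   IEEE 754 behaviors probed here. Operands are volatile so the compiler
   has to do the arithmetic at run time instead of folding it away. */
int ieee_sanity_failures(void) {
    volatile double zero = 0.0, one = 1.0, big = 1e20;
    int failures = 0;

#ifdef __FAST_MATH__
    failures++;                       /* built with -ffast-math */
#endif

    /* NaN must compare unequal to itself (can break with
       -ffinite-math-only). */
    double nan_val = zero / zero;
    if (!(nan_val != nan_val))
        failures++;

    /* Division by zero must give infinities of the right sign, which
       also relies on signed zeros being honored. */
    double pos_inf = one / zero;
    double neg_inf = one / -zero;
    if (!(pos_inf > DBL_MAX))
        failures++;
    if (!(neg_inf < -DBL_MAX))
        failures++;

    /* The written evaluation order must be kept: the 1.0 survives on the
       left but is absorbed by the huge term on the right. */
    double left  = (big + -big) + one;
    double right = big + (-big + one);
    if (!(left == 1.0 && right == 0.0))
        failures++;

    return failures;
}

/* Aborts if any probe fails; meant to be called once at test start-up. */
void ieee_sanity_check(void) {
    assert(ieee_sanity_failures() == 0);
}
```

Building the same file once per flag set and comparing the returned counts gives a rough picture of which guarantees each set gives up. An optimizer is free to keep the written order even under relaxed flags, so a count of 0 does not prove the flags are safe.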
|
|
|
PrSo Tux's lil' helper
Joined: 01 Jun 2017 Posts: 136
|
Posted: Fri Mar 01, 2019 6:04 pm Post subject: |
|
|
NTU wrote: | Haha, I'm not going to be digging through the large LLVM code base. I'd rather just do some benchmarking, comparing LLVM+Clang compiled with -march=bdver2 -O3 -ftree-vectorize -funsafe-math-optimizations against plain -O2, and do some _basic_ sanity checking. |
IIRC -ftree-vectorize in GCC (-ftree-loop-vectorize and -ftree-slp-vectorize) is enabled by default at -O3. And why not -march=native?
Regards,
Przemek. |
|
|
|
NTU Apprentice
Joined: 17 Jul 2015 Posts: 187
|
Posted: Fri Mar 01, 2019 6:58 pm Post subject: |
|
|
PrSo wrote: | NTU wrote: | Haha, I'm not going to be digging through the large LLVM code base. I'd rather just do some benchmarking, comparing LLVM+Clang compiled with -march=bdver2 -O3 -ftree-vectorize -funsafe-math-optimizations against plain -O2, and do some _basic_ sanity checking. |
IIRC -ftree-vectorize in GCC (-ftree-loop-vectorize and -ftree-slp-vectorize) is enabled by default at -O3. And why not -march=native?
Regards,
Przemek. |
-O3 enables -ftree-loop-vectorize and -ftree-slp-vectorize, which perform loop and basic-block vectorization on trees, respectively. -ftree-vectorize itself is not listed under any -O level, but according to the GCC manual it simply turns on those two flags unless they are set explicitly, so adding it to -O3 is redundant.
https://www.gnu.org/software/gcc/projects/tree-ssa/vectorization.html -- "Vectorization is enabled by the flag -ftree-vectorize and by default at -O3." That is, by way of -ftree-loop-vectorize and -ftree-slp-vectorize.
https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html
-O3 enables:
Code: | -fgcse-after-reload
-finline-functions
-fipa-cp-clone
-floop-interchange
-floop-unroll-and-jam
-fpeel-loops
-fpredictive-commoning
-fsplit-paths
-ftree-loop-distribute-patterns
-ftree-loop-distribution
-ftree-loop-vectorize
-ftree-partial-pre
-ftree-slp-vectorize
-funswitch-loops
-fvect-cost-model
-fversion-loops-for-strides |
Anyway back to the question, so there is this: https://llvm.org/docs/TestSuiteGuide.html
In the "Displaying and Analyzing Results" section, what exactly do these numbers represent?
Code: | Metric: exec_time
Program                            baseline
INT2006/456.hmmer/456.hmmer         1222.90
INT2006/464.h264ref/464.h264ref      928.70
...
          baseline
count   506.000000
mean     20.563098
std     111.423325
min       0.003400
25%       0.011200
50%       0.339450
75%       4.067200
max    1222.896800 |
The baseline column shows 1222.90 for 456.hmmer: what does 1222.90 mean in this context? And count is 506.000000: what is it counting? |
|
|
|
Zucca Moderator
Joined: 14 Jun 2007 Posts: 3366 Location: Rasi, Finland
|
Posted: Fri Mar 01, 2019 7:19 pm Post subject: |
|
|
Oh btw... a little off-topic but... I've enabled -mvzeroupper on my system. It should increase performance on Bulldozers. However, I have some packages for which I have set a custom env to compile with clang (in hopes of faster compiling). But I also needed to change CFLAGS for those, because clang wouldn't accept -mvzeroupper.
I wonder if that flag is actually worth anything... |
|
|
|
|
NTU Apprentice
Joined: 17 Jul 2015 Posts: 187
|
Posted: Mon Mar 11, 2019 7:56 pm Post subject: |
|
|
Zucca wrote: | Oh btw... a little off-topic but... I've enabled -mvzeroupper on my system. It should increase performance on Bulldozers. However, I have some packages for which I have set a custom env to compile with clang (in hopes of faster compiling). But I also needed to change CFLAGS for those, because clang wouldn't accept -mvzeroupper.
I wonder if that flag is actually worth anything... |
Do some Blender, imagemagick/graphicsmagick benchmarks? |
|
|
|
|