LLVM/Clang 8 coming with BDVER2 optimizations

NTU · Apprentice Joined: 17 Jul 2015 Posts: 187

LLVM/Clang 8 is coming out soon and it has optimizations for BDVER2, so I'll be building it (along with compiler-rt, clang-runtime, etc.) with -O3 -ftree-vectorize -funsafe-math-optimizations (more compliant than -ffast-math but still allows for vectorization of floating point, or something along those lines) and -march=bdver2. I have a bunch of older APUs lying around as they were dirt cheap (like 40-60 bucks; A10s were like $120, which is still not bad). Is rsbench a good test for this kind of thing?

https://github.com/darktable-org/rawspeed/tree/develop/src/utilities/rsbench

I know that -O3 and those other flags can slow things down or introduce bugs, which is why I want to make sure I get some good tests going.

Naib · Posted: Tue Feb 26, 2019 8:51 pm Post subject:

"more compliant than -ffast-math" ... you must not like your math results, then

Akkara · Posted: Tue Feb 26, 2019 9:47 pm Post subject:

I do not recommend using -O3 unless you have personally inspected the resulting ".s" assembler output and concluded that the optimization level is doing what you want.

Usually what happens is that the "easy" stuff gets unrolled and expanded out to squeeze every last cycle out of it, while the more complex stuff isn't much different from what -O2 gives you. Problem is, the "easy" stuff tends to be initializations and run-once kind of code, where it doesn't matter how fast it is. Worse, the now-bigger code size pushes other things out of cache and can make things slower. This is especially true when combined with -ftree-vectorize. Unless you've gone in and peppered your code with __restrict__ and __attribute__((aligned(16))), the compiler doesn't know whether the pointers are aligned and what shortcuts it can take. Instead it generates two copies of the code, one vectorized assuming the data is aligned, the other simply unrolled, and tests and jumps to the appropriate one.

Meanwhile your heavy loops don't benefit much unless you've been through several dozen iterations of hand-tweaking to reduce inter-statement data dependencies so that -O3 might have something interesting to chew on. If you've done that, and inspected the assembler output to check that it's "seeing" what you have in mind, and benchmarked to make sure it even matters, then -O3 could help. But even then, only on the files where you've paid this kind of attention.

Personally, I find that -Os gives better overall performance when I'm not tweaking by hand. And many packages where it matters will already have custom build flags for the critical sections.

Regarding -funsafe-math-optimizations, read up on all the -f*math* flags. Then pick the ones you actually care about. I often use -fno-math-errno in my own code to reduce data dependencies across mathlib calls, along with -fno-trapping-math -fno-signaling-nans in embedded systems where there's no place to trap to. I don't touch the general packages.
_________________
Many think that Dilbert is a comic. Unfortunately it is a documentary.

Zucca · Posted: Tue Feb 26, 2019 11:39 pm Post subject:

Akkara · Posted: Wed Feb 27, 2019 6:22 am Post subject:

Zucca · Posted: Wed Feb 27, 2019 11:43 am Post subject:

I also tried to find another article where some programs compiled with -O2 -fsome-lto-flags-enabled were, not significantly but ... "quite"(?), faster than with -O3 -fsome-flags.

Then there's also the -Ofast... I wonder if it's any better than -O3, even in the best scenarios.

EDIT: There's this 2016 test.
_________________
..: Zucca :..
Gentoo IRC channels reside on Libera.Chat.
--

NTU · Apprentice Joined: 17 Jul 2015 Posts: 187

Using -march flags with -ftree-vectorize only really makes sense, though, with -funsafe-math-optimizations, as certain loop constructs can only be vectorized if GCC is allowed to change the order of math ops. -march with -ftree-vectorize can actually increase code size without benefit, as the compiled code will not have those SIMD instructions properly aligned. That is where -ffast-math or -funsafe-math-optimizations come into play. -march on its own is the least efficient way to use those extra CPU instructions because of the lack of vectorization. Maybe I have this wrong, but I thought -march only makes sense with -ftree-vectorize and -funsafe-math-optimizations or -ffast-math.

Akkara · Posted: Fri Mar 01, 2019 2:50 am Post subject:

That's true if the heavy part of your code is doing floating-point.

There's also integer vector opcodes. Those don't need special math flags to vectorize well, because integer math is fully commutative and associative. (Unlike floating-point, where ((a + b) + c) doesn't always give the same answer as (a + (b + c))).
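To make the reordering point concrete, here is a minimal C sketch. The function names and the sample values are made up for illustration, and it assumes the file is built without -ffast-math or -funsafe-math-optimizations (otherwise the compiler is allowed to reassociate the sums and the floating-point check may come out differently).

```c
/* Returns 1 if (a + b) + c equals a + (b + c) in double arithmetic.
   The volatile temporaries keep each sum evaluated in the order written. */
int fp_sum_is_associative(double a, double b, double c)
{
    volatile double left  = (a + b) + c;
    volatile double right = a + (b + c);
    return left == right;
}

/* Same check for unsigned integers, where overflow simply wraps around. */
int uint_sum_is_associative(unsigned a, unsigned b, unsigned c)
{
    unsigned left  = (a + b) + c;
    unsigned right = a + (b + c);
    return left == right;
}
```

With a = 1e20, b = -1e20, c = 1.0 the two floating-point groupings disagree: (a + b) + c is 1, while b + c rounds back to -1e20, so a + (b + c) is 0. The unsigned version agrees for any inputs, even when the intermediate sums wrap.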

If you're going to use -ftree-vectorize, avoid code bloat by telling the compiler your buffers are aligned. (And make sure they actually are aligned!) Read up on __attribute__((aligned(N))), which works for both gcc and clang.

Also try to reduce data dependencies. Statements such as y[n] = a * y[n-1] + stuff are hard to vectorize because you can't calculate several y's in parallel without knowing the previous result. Remember the compiler can't know that two pointers don't happen to be pointing to the same data. So even an innocuous-looking statement such as y[n] = x[n] + x[n-1] leads to code bloat: the compiler checks that x and y don't overlap and jumps to the vectorized version, otherwise uses the one-iteration-at-a-time version. Use __restrict__ on pointer declarations to state that there are no data dependencies between them.
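A rough sketch of what that looks like for the y[n] = x[n] + x[n-1] case, for gcc or clang. The names pair_sum, is_aligned16, in_buf and out_buf are invented for this example. It also uses __builtin_assume_aligned, a gcc/clang builtin not mentioned above, as one assumed way to carry the alignment promise into a function that only sees pointers. Whether the loop actually gets vectorized without the runtime checks still depends on compiler version and flags, so check the assembler output.

```c
#include <stddef.h>
#include <stdint.h>

#define LEN 8

/* Buffers declared 16-byte aligned, so the alignment promise below is true. */
static float in_buf[LEN]  __attribute__((aligned(16))) = {1, 2, 3, 4, 5, 6, 7, 8};
static float out_buf[LEN] __attribute__((aligned(16)));

/* Sanity check: is the pointer really on a 16-byte boundary? */
int is_aligned16(const void *p)
{
    return ((uintptr_t)p % 16) == 0;
}

/* y[0] = x[0]; y[n] = x[n] + x[n-1] for n >= 1.
   __restrict__ promises that x and y never overlap.
   __builtin_assume_aligned promises 16-byte alignment; passing
   overlapping or unaligned buffers breaks those promises. */
void pair_sum(float *__restrict__ y, const float *__restrict__ x, size_t len)
{
    float *ya = __builtin_assume_aligned(y, 16);
    const float *xa = __builtin_assume_aligned(x, 16);

    if (len == 0)
        return;
    ya[0] = xa[0];
    for (size_t n = 1; n < len; n++)
        ya[n] = xa[n] + xa[n - 1];
}
```

Calling pair_sum(out_buf, in_buf, LEN) leaves out_buf holding 1, 3, 5, 7, 9, 11, 13, 15. Checking is_aligned16() on both buffers first is a cheap way to confirm they actually are aligned before making the promise.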
_________________
Many think that Dilbert is a comic. Unfortunately it is a documentary.

NTU · Apprentice Joined: 17 Jul 2015 Posts: 187

Haha, I'm not going to be digging through the large LLVM code base. I'd rather just do some benchmarking, compiling LLVM+Clang with -march=bdver2 -O3 -ftree-vectorize -funsafe-math-optimizations compared to simply -O2, and do some _basic_ sanity checking. It's not practical to go through the entire tree and analyze the compiled assembly of LLVM and Clang themselves. Any suggestions on an easy way to do some simple performance/compliance tests?

-ffast-math is pushing it a bit too far with IEEE 754 non-compliance: MPFR, MPC and GMP have quite a few tests that fail. But if you use -funsafe-math-optimizations, there's only one error here and there in the testsuites per package. I forget exactly what the error rate was, but I know that -ffast-math performed much worse in `make check` per package.

PrSo · Tux's lil' helper Joined: 01 Jun 2017 Posts: 136

NTU · Apprentice Joined: 17 Jul 2015 Posts: 187

Zucca · Posted: Fri Mar 01, 2019 7:19 pm Post subject:

Oh btw... a little off-topic but... I've enabled -mvzeroupper on my system. It should increase performance on Bulldozers. However, I have some packages for which I have set a custom env for clang compiling (in hopes of faster compiling). But I also needed to change CFLAGS, because clang wouldn't accept -mvzeroupper.
I wonder if that flag is actually worth anything...
_________________
..: Zucca :..
Gentoo IRC channels reside on Libera.Chat.
--

NTU · Apprentice Joined: 17 Jul 2015 Posts: 187