for (int i = 0; i < 1000; i++) {
small_numbers[smlen] = numbers[i];
smlen += (numbers[i] < 500);
}
is much faster than the conventional version with a conditional branch:
for (int i = 0; i < 1000; i++) {
if (numbers[i] < 500) {
small_numbers[smlen] = numbers[i];
smlen += 1;
}
}
Been staring at this for a bit, but my brain is not working properly today: could someone please explain how these to loops compute the same value for small_numbers[smlen]?
- the first one (branchless) use the condition to SAVE the correct value (< 500): it temporarily writes any current value to the same index i, always overwriting the previous value, effectively saving it (by moving forward to i+1) only when the value is right (small number).
Downside of this simple function: the last value may be bigger than 500
- the second one use the condition to ADD the value, when it is 100% sure it is a correct small number
They don't. After running, for the values in small_numbers from 0 to smlen-1 they are equivalent.
But if the last value of numbers[] is not smaller than 500, small_numbers[smlen] will contain that value for the first version whereas the second version does not write to small_numbers[smlen].
At what sequence point? The branchless version writes to small_numbers[smlen], for any given value of smlen, potentially more than once; so there are observable points of time during the loop where the behavior is different. But after the loop, both contain the final write to small_numbers[i] for all 0 <= I < smlen; and the transient writes both don't change observed external behavior, and are apparently cheaper than fewer but conditional writes.
smlen would be 0 for both if there are no small numbers, so end result of both is an empty array.
For the first version small_numbers[0] will contain an arbitrary value at the end, and for the second version it happens to contain the last number read, but that address is outside of the 0-length array being returned.
Writing to array[n] and not incrementing n means that the value just written is outside the "useful" range (from 0 to n-1) and will not be considered (it will be overwritten the next iteration).
It doesn't convert bogosort into heapsort either, despite the second being much faster than the first. I'm guessing that it's not that easy going from one to the other because the only thing they have in common is the output (and only after you have checked the last value), so if the transformation is not hard-coded into the compiler, the odds of it randomly discovering the optimization is close to zero
Yeah, I would expect such transformations to be implemented as optimizations. Just like maybe (the admitedly simpler):
(+ ((lambda () 1)) ((lambda () 1))) -> (+ 1 1)
A syntactical transformation, where it is possible as an equivalent transformation.
I may be overlooking special cases, but I thought the compiler is smart enough to infer that the array elements are integers and that `<` will result in a boolean, which is just `0` and `1` and will understand that having only the `if` without `else` branch is equivalent in this case. Guess I was wrong and the compiler is not sophisticated in this specific way.
They aren't equal, as the faster version does an unconditional memory write instead of only writing to the array if the condition is satisfied. The compiler is strictly forbidden from turning a conditional write into an unconditional one.
The Linux kernel had problems years ago when gcc started to do exactly that in certain cases (because it screwed things up with task switches, interrupts, and SMP). It fairly quickly afterwards either stopped doing it entirely or got a switch that would stop it from doing it. Don't remember which.
The two code snippets do different things, apples and oranges... e.g. the array modification in the second example needs to move in front of the if for the two snippets to behave identically. I bet then the compiler output is the same with -O1 or higher.
PS: e.g. note how bla() (first code snippet) and blob() (fixed second code snippet) have identical output (both are turned into the same 'branchless' code via a conditional 'setl' instruction), but the blub() function (original second code snippet) differs because that function has different behaviour:
TL;DR: most 'branchless advice' that only tinkers with language features (like "x = a ? b : c" instead of an if) is useless because to the optimizer passes both are the same thing (a condition).
When there's a difference in the generated code then it's usually a bug and the before-after code are not actually equivalent (like in the code examples above).
I played a bit with your link. It depends on the compiler. x86_64 clang trunk indeed compiles the original first and your fixed second form to the exact same code. I tried a couple of msvc and gcc versions and they did not but they all made them both branchless.