Compiler Optimization Techniques
Learning Objectives
- You know techniques that compilers apply when optimizing code.
At the end of the previous chapter, we noticed that the llc compiler reduced the number of operations considerably, effectively precomputing the result at compile time. This is an example of a compiler optimization. Compiler optimizations are techniques used to improve the performance of a program by reducing its execution time, memory usage, or power consumption.
Here, we’ll look into a handful of compiler optimization techniques that are commonly used in modern compilers.
Constant folding and propagation
Constant folding is an optimization where the compiler evaluates expressions at compile time when all operands are known constants. By calculating values in advance, the compiler can replace expressions with the computed results. As an example, a compiler might simplify the expression `2 + 3 * 4` to `14`, eliminating the need for the arithmetic operations at runtime.
Constant propagation builds on constant folding by substituting constant values for variables whenever possible. When the compiler determines that a variable’s value is constant throughout its scope, it replaces all instances of that variable with its constant value. This process can lead to further opportunities for folding, creating a cascading effect.
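As a small sketch of this cascading effect (with hypothetical variable names), a compiler could reduce the following Rust function to printing a single constant:

```rust
fn main() {
    let x = 4; // x is a known constant
    let y = x * 2; // propagation turns this into 4 * 2, folding into 8
    let z = y + 1; // propagation and folding again: 9
    println!("{}", z); // can be reduced to println!("{}", 9)
}
```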
In effect, our observation of the llc compiler in the previous chapter was an example of constant folding and propagation. The compiler precomputed the result at compile time, leading to a simple and efficient function.
Dead code elimination
Dead code elimination (DCE) is the process of identifying and removing code that does not affect the observable behavior of the program. This includes variables that are assigned values but never used, branches that are never taken, and computations whose results are never read.
Compiling a Rust program like the following could involve both constant folding and propagation, as well as dead code elimination. When you compile and run the program, the Rust compiler also warns you about the dead code.
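```rust
fn main() {
    // One possible program along the lines described above.
    // Constant folding and propagation can reduce this to a
    // single constant at compile time.
    let answer = 2 + 3 * 4;
    println!("{}", answer);

    // Dead code: assigned but never read. The Rust compiler
    // warns about this ("unused variable: `unused`").
    let unused = answer * 2;
}
```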
Dead code elimination involves analyzing the control flow of the program to determine which parts are unreachable or redundant. The control flow is typically analyzed through a control flow graph, which represents the possible paths of execution in the program.
Common subexpression elimination
Common subexpression elimination (CSE) aims to reduce redundant computations by identifying expressions that are computed more than once and replacing subsequent computations with a single stored result. When the same expression is evaluated in different parts of a program and does not involve side effects, CSE allows the compiler to compute the expression once and reuse the result, saving both time and resources.
As an example, consider the expression `a + b * c` appearing in multiple locations within a function. If the compiler can determine that `a`, `b`, and `c` remain unchanged between evaluations, it can compute the expression once and reuse the result, replacing subsequent occurrences with a reference to the precomputed value.
A Rust program along the following lines demonstrates code where this would be possible. The `calculate` function computes the same expression twice, which could be optimized by common subexpression elimination.
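```rust
fn calculate(a: i32, b: i32, c: i32) -> i32 {
    // The expression a + b * c is evaluated twice.
    let first = a + b * c;
    let second = (a + b * c) + 10;
    first + second
}

fn main() {
    println!("{}", calculate(1, 2, 3)); // prints 24
}
```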
That is, the compiler would optimize the `calculate` function to compute `a + b * c` once and reuse the result in both places. A possible optimized version of the function could look as follows.
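```rust
fn calculate(a: i32, b: i32, c: i32) -> i32 {
    // The common subexpression is computed once and reused.
    let common = a + b * c;
    let first = common;
    let second = common + 10;
    first + second
}
```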
Loop unrolling
Loop unrolling is an optimization technique that expands the body of a loop to reduce the overhead associated with loop control instructions. As an example, consider the following loop in Dart:
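```dart
void main() {
  // A loop that prints the values 0, 1, and 2.
  for (var i = 0; i < 3; i++) {
    print(i);
  }
}
```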
The execution of the loop is as follows. The variable `i` is first initialized, the value of `i` is compared to `3`, and the value of `i` is printed if the condition is met. The variable `i` is then incremented, and the process repeats until the condition is no longer met.
For the above loop, as the number of iterations is known in advance, the compiler could unroll the loop entirely, removing the loop control and the associated branching overhead. The unrolled version of the loop might look as follows.
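```dart
void main() {
  // The loop has been replaced by its individual iterations.
  print(0);
  print(1);
  print(2);
}
```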
The above is known as full unrolling, as the loop is unrolled to the exact number of iterations. Partial unrolling is another form of loop unrolling where the loop is unrolled by a constant factor, replicating the body several times within each iteration. As an example of partial unrolling, consider the following program that prints the values from 0 to 9.
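```dart
void main() {
  for (var i = 0; i < 10; i++) {
    print(i);
  }
}
```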
The loop could be partially unrolled to reduce the number of iterations and the associated branching overhead. One possible version of the partially unrolled loop might look as follows.
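```dart
void main() {
  // Unrolled by a factor of two: five iterations instead of ten.
  for (var i = 0; i < 10; i += 2) {
    print(i);
    print(i + 1);
  }
}
```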
Another possible version could look as follows.
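```dart
void main() {
  // Unrolled by a factor of five: two iterations instead of ten.
  for (var i = 0; i < 10; i += 5) {
    print(i);
    print(i + 1);
    print(i + 2);
    print(i + 3);
    print(i + 4);
  }
}
```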
Loop unrolling reduces the overhead associated with loop control instructions, such as incrementing the loop variable and checking the loop condition. However, excessive unrolling can lead to code bloat, which may impact CPU cache performance and increase memory usage.
At the same time, depending on what is done within the loop, loop unrolling can lead to additional optimization opportunities. Once the loop’s body is expanded, the compiler can perform additional transformations such as instruction scheduling or even parallelization. This can lead to better utilization of the processor resources, particularly in architectures that benefit from instruction-level parallelism.
Loop-invariant code motion
Another technique related to loops is loop-invariant code motion, which identifies computations within a loop that yield the same result on every iteration. Such expressions can be safely moved outside the loop, reducing redundant calculations and improving performance.
The compiler begins by analyzing the loop body to detect expressions that do not depend on the loop index or any variables modified inside the loop. Once these invariants are identified, they are moved out of the loop, ensuring that the computation is performed only once rather than on every iteration.
As an example, consider the following loop in Dart:
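```dart
void main() {
  for (var i = 0; i < 5; i++) {
    var multiplier = 2 * 3; // recomputed on every iteration
    print(multiplier * i);
  }
}
```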
In the above loop, the multiplier `2 * 3` is loop-invariant because it does not depend on the loop index `i`. The compiler can safely move this expression outside the loop, resulting in the following optimized version:
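```dart
void main() {
  var multiplier = 2 * 3; // hoisted out of the loop
  for (var i = 0; i < 5; i++) {
    print(multiplier * i);
  }
}
```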
Strength reduction
Strength reduction is a technique that replaces computationally expensive operations with simpler, less costly ones. This typically involves substituting a multiplication with an addition, or a division with a bit-shift, particularly within loops where the operation is executed repeatedly.
The compiler identifies opportunities for strength reduction by examining the arithmetic expressions within loops and other critical sections of code. When a multiplication by a constant or an exponentiation is detected, the compiler can often replace it with a sequence of additions or bit shifts, which are generally faster on most hardware architectures. This transformation leverages the mathematical equivalences that allow expensive operations to be re-expressed in simpler terms.
As an example, consider the following loop in Dart:
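```dart
void main() {
  for (var i = 0; i < 10; i++) {
    print(i * 8); // multiplication by a power of two
  }
}
```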
In the above loop, the multiplication `i * 8` can be replaced with a left shift operation `i << 3`, which is equivalent but more efficient. The optimized version of the loop could look as follows:
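```dart
void main() {
  for (var i = 0; i < 10; i++) {
    print(i << 3); // same result as i * 8
  }
}
```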
Strength reduction depends on a strong understanding of the target hardware and the hardware-specific costs of the operations involved. As an example, on modern Intel x86 processors, the multiplication instruction `imul` takes three cycles, while the shift left instruction `shl` and the addition instruction `add` each take only one cycle. By replacing multiplications with additions or shifts, compilers can improve the performance of code running on these processors.
For reference on instruction cycles, see Agner Fog’s instruction tables, which provide detailed information about the latency and throughput of instructions on various microarchitectures.
There are other factors at play, though, such as instruction pipelining, which can make the actual performance of the operations differ from the theoretical cycle counts. This is why profiling on the target hardware is needed to see how the code actually performs.
Function inlining
Function inlining is an optimization technique where the compiler replaces a function call with the actual body of the function. This transformation eliminates the overhead associated with the call and return sequence, such as stack manipulation and branching, by effectively copying the function’s code directly into the caller’s code. This can lead to faster execution, particularly for small, frequently called functions.
As an example, let’s revisit the kind of Rust program we looked into at the start of this chapter: a small helper function called with constant arguments. A representative program could look as follows.
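```rust
// A representative stand-in for the chapter's earlier example:
// a small helper function called with constant arguments.
fn add(a: i32, b: i32) -> i32 {
    a + b
}

fn main() {
    println!("{}", add(2, 3));
}
```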
With function inlining, the code could be optimized to the following.
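```rust
fn main() {
    // The body of add has been copied into the call site.
    println!("{}", 2 + 3);
}
```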
This, in turn, with constant folding and propagation, would be further optimized to the following.
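```rust
fn main() {
    println!("{}", 5); // 2 + 3 folded to a constant at compile time
}
```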
As we see from the above, an extra advantage of function inlining is that the compiler can perform further optimizations on the inlined code, such as constant folding and propagation. This can lead to more efficient code generation and better performance.
Despite the benefits, inlining is not without tradeoffs. One of the main downsides is the potential increase in code size. Excessive inlining can lead to code bloat, where the resulting program becomes significantly larger. This can hurt CPU cache performance and increase memory usage.
Tail-call optimization
Finally, let’s revisit tail-call optimization, which we already briefly discussed earlier. Tail-call optimization (TCO) is a technique used to optimize function calls that occur in tail position, i.e., when a function call is the very last operation in a function.
Instead of creating a new stack frame for the tail call, the compiler reuses the current function’s stack frame for the called function. This transformation converts a recursive call into an iterative process, reducing call overhead and removing the risk of stack overflow. Tail-call optimization is particularly useful in functional programming languages and recursive algorithms where deep recursion is common.
To illustrate tail-call optimization, consider the following Dart program that calculates the factorial of a number.
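```dart
int factorial(int n, int acc) {
  if (n <= 1) {
    return acc;
  }
  // The recursive call is the last operation: it is in tail position.
  return factorial(n - 1, n * acc);
}

void main() {
  print(factorial(5, 1)); // prints 120
}
```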
As the function call `factorial(n - 1, n * acc)` is in tail position, the compiler can optimize the recursive call by reusing the current stack frame. A possible optimized version of the above program could look as follows.
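```dart
int factorial(int n, int acc) {
  // The tail call has been replaced with a loop that reuses
  // the same stack frame instead of creating new ones.
  while (n > 1) {
    acc = n * acc;
    n = n - 1;
  }
  return acc;
}

void main() {
  print(factorial(5, 1)); // prints 120
}
```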
Keep in mind that these optimizations are just a few examples of the many techniques that modern compilers use to improve the performance of programs. The effectiveness of these optimizations can vary depending on the programming language, the compiler, and the target hardware. As a programmer, understanding these optimizations can help you write more efficient code, although it is important to remember that premature optimization can lead to code complexity and reduced maintainability.