Peeking into Compilers

Intermediate Representations


Learning Objectives

  • You know what intermediate representations are and can identify common types of intermediate representations.
  • You understand the need for different types of IRs in the compilation process.

Intermediate representations

When a program is compiled, it is typically translated into a series of intermediate representations (IRs) before being converted into machine code. IRs serve as a bridge between the high-level source code written by developers and the low-level machine instructions that the hardware processes.

At its core, an intermediate representation is a simplified and standardized form of a program that captures the semantics of the source code without the language-specific syntactic sugar. A key benefit of using intermediate representations is that they decouple the different parts of the compiler from each other.

Compilers are often discussed in terms of a front end and a back end. The front end is responsible for parsing and analyzing the source code, while the back end focuses on generating efficient machine code. By converting the source code into a common IR, the compiler can keep the front end and back end as independent modules, allowing a separation of concerns. This also makes it possible to use the same IR for multiple programming languages, as shown in Figure 1 below.

Fig 1. — If the intermediate representation is language-independent, the same intermediate representation can be used for multiple programming languages.

Continuum of IRs

Intermediate representations can be arranged on a continuum from high-level IRs to low-level IRs, each serving distinct roles in the compilation process. High-level IRs retain much of the structure and semantic information present in the original source code and are often close to the abstract syntax and high-level constructs used by developers. Medium-level IRs are often used for optimizations that require a balance between high-level and low-level information. Low-level IRs are much closer to machine code and are designed to reflect the operational semantics of the target hardware.

In effect, IRs are chained, with high-level IRs being lowered to low-level IRs as the compilation process progresses, as shown in Figure 2 below.

Fig 2. — The compilation process involves lowering high-level intermediate representations to low-level intermediate representations.

The process of lowering from high-level to low-level IRs is critical in compilation. It involves systematically translating abstract constructs into detailed operations that can be executed efficiently by the hardware. Having multiple IRs makes it easier to add features to the compiler, as the high-level IR can be used to implement new optimizations and analyses, while the low-level IR can be used to generate efficient machine code.

As an example, while memory management is a high-level concern, register allocation is a low-level concern. By having multiple IRs, the compiler can focus on high-level optimizations early in the compilation process and then switch to low-level optimizations as the compilation progresses.

As a concrete example of a high-level IR, consider the abstract syntax tree (AST) that we have previously discussed. ASTs are a high-level representation of the structure of the source code, capturing the relationships between different elements such as expressions, statements, and declarations. At the same time, ASTs drop much of the syntactic detail present in the source code, focusing on the essential structure of the program — they are often the starting point for an IR chain.
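
To make this concrete, an AST for simple arithmetic expressions could be modeled in Rust roughly as follows. The Expr and BinOp types here are hypothetical and deliberately minimal, not taken from any real compiler.

#[derive(Debug)]
enum BinOp {
    Add,
    Mul,
}

#[derive(Debug)]
enum Expr {
    // An integer literal such as 42.
    Literal(i64),
    // A binary operation with an operator and two subexpressions.
    Binary(BinOp, Box<Expr>, Box<Expr>),
}

fn main() {
    // The expression 42 + 42 * 42: the tree shape encodes operator
    // precedence, so no parentheses or other syntactic detail is needed.
    let ast = Expr::Binary(
        BinOp::Add,
        Box::new(Expr::Literal(42)),
        Box::new(Expr::Binary(
            BinOp::Mul,
            Box::new(Expr::Literal(42)),
            Box::new(Expr::Literal(42)),
        )),
    );
    println!("{ast:?}");
}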


Types of IRs

There is a range of standard IRs that can be used in the compilation process. Some of the common types of IRs include three-address code (TAC), static single assignment form (SSA), control flow graphs (CFGs), continuation-passing style (CPS), bytecode or virtual machine code, and LLVM IR or other target-specific IRs.

  • Three-address code (TAC) is a simple form of IR that represents instructions with at most three operands. It is often used in the early stages of compilation to simplify the representation of complex expressions.

  • Static single assignment form (SSA) is an IR that ensures each variable is assigned exactly once, simplifying the analysis and optimization of the program. SSA form is often used in later stages of compilation to perform detailed optimizations.

  • Control flow graphs (CFGs) are graphical representations of the paths that a program can take during execution. CFGs are used to analyze the control flow of the program and to identify opportunities for optimization; a small sketch follows this list.

  • Continuation-passing style (CPS) is an IR that represents programs as a series of continuations, simplifying the handling of control flow and function calls.

  • Bytecode or virtual machine code is an IR that is designed to be executed by a virtual machine. Bytecode is often used in interpreted languages and just-in-time compilers.

  • LLVM IR and other target-specific IRs are designed to be optimized and then transformed into machine code for particular target architectures. LLVM IR, in particular, is a popular IR used in the LLVM compiler infrastructure.
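
To make control flow graphs more concrete, the sketch below shows a hypothetical TAC-like snippet for an if-else construct split into basic blocks. Each label starts a basic block, i.e. a node in the CFG, and each (conditional) goto is an edge between nodes; the notation itself is made up for illustration.

entry:
  t1 = a < b
  if t1 goto then_branch   // edge: entry -> then_branch
  goto else_branch         // edge: entry -> else_branch

then_branch:
  x = a
  goto end                 // edge: then_branch -> end

else_branch:
  x = b
  goto end                 // edge: else_branch -> end

end:
  return x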

Example: AST and TAC

Let’s look at three-address code in a bit more detail. Three-address code (TAC) is an IR that represents instructions with at most three operands. Consider the following Rust program.
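
fn calculate() -> i32 {
    let x = 42 + 42 * 42 / 42 - 42;
    return x;
}

fn main() {
    // omitted
}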

The program could be represented as an AST similar to the one shown in Figure 3 below. For simplicity, the functionality of main has been omitted from the AST.

Fig 3. — Abstract Syntax Tree of a Rust program.

As a reminder, in ASTs, the functionality of the program is expressed as a series of nodes that represent the different constructs in the program, such as functions, statements, expressions, and literals.

The AST can be further lowered to TAC. Typically, this would be done by traversing the AST and generating a series of three-address instructions that represent the program. Each line in the TAC representation corresponds to a single operation or instruction in the program, with at most three operands (typically two source operands and one destination operand) and an operator such as +, -, *, or /.
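
As a minimal sketch of such a traversal, the following Rust program lowers a small, hypothetical expression AST to TAC. The Expr type, the operator representation, and the t1, t2, ... naming scheme are illustrative assumptions rather than the internals of any particular compiler.

// A hypothetical AST for arithmetic expressions.
enum Expr {
    // An integer literal such as 42.
    Literal(i64),
    // An operator together with its left and right subexpressions.
    Binary(char, Box<Expr>, Box<Expr>),
}

// Lowers an expression to TAC by post-order traversal: the operands are
// lowered first, then one three-address instruction is printed for the
// operation itself. Returns the name (or constant) holding the result.
fn lower(expr: &Expr, counter: &mut u32) -> String {
    match expr {
        // Literals need no instruction; the constant is used directly.
        Expr::Literal(n) => n.to_string(),
        Expr::Binary(op, left, right) => {
            let l = lower(left, counter);
            let r = lower(right, counter);
            // Allocate a fresh temporary for the result.
            *counter += 1;
            let t = format!("t{counter}");
            println!("{t} = {l} {op} {r}");
            t
        }
    }
}

fn main() {
    // The AST of 42 + 42 * 42 / 42 - 42; the tree shape encodes
    // operator precedence.
    let ast = Expr::Binary(
        '-',
        Box::new(Expr::Binary(
            '+',
            Box::new(Expr::Literal(42)),
            Box::new(Expr::Binary(
                '/',
                Box::new(Expr::Binary(
                    '*',
                    Box::new(Expr::Literal(42)),
                    Box::new(Expr::Literal(42)),
                )),
                Box::new(Expr::Literal(42)),
            )),
        )),
        Box::new(Expr::Literal(42)),
    );
    let mut counter = 0;
    let result = lower(&ast, &mut counter);
    println!("x = {result}");
    println!("return x");
}

Running this sketch prints essentially the same instructions as the TAC listing shown below.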

For the above AST, the TAC representation might look like the following. Below, we declare a function calculate, which has a series of operations that are assigned to temporary variables t1, t2, t3, and t4. Finally, the result is assigned to the variable x and returned.

calculate:
   t1 = 42 * 42
   t2 = t1 / 42
   t3 = 42 + t2
   t4 = t3 - 42
   x = t4
   return x

main:
  // omitted

The above TAC representation is also valid static single assignment (SSA) form, where each variable is assigned exactly once. This holds because the above TAC representation never reassigns a variable.
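
To see the difference, consider a hypothetical TAC snippet that assigns to x twice; converting it to SSA form renames each assignment into a distinct version of the variable.

// TAC with reuse: x is assigned twice, so this is not SSA.
x = 1
x = x + 2

// The same snippet in SSA form: every assignment gets a fresh name.
x1 = 1
x2 = x1 + 2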

TAC to LLVM IR

Let’s next transform the above TAC to LLVM IR. LLVM IR is a popular IR used in the LLVM compiler infrastructure. The LLVM IR consists of a series of instructions that are designed to be easily transformed into machine code for a variety of target architectures.

These instructions include operations such as add, sub, mul, and sdiv, among many others. Registers and memory locations are represented using symbolic names such as %0, %1, and so on. LLVM also encodes type information in the IR.

As an example, the line t1 = 42 * 42 in TAC could be represented in LLVM IR as follows.

%t1 = mul i32 42, 42

The above declares a new variable %t1 of type i32 (32-bit integer) and assigns it the result of multiplying the constants 42 and 42. The mul instruction is used to perform the multiplication operation.

Similarly, the line t2 = t1 / 42 in TAC could be represented in LLVM IR as follows.

%t2 = sdiv i32 %t1, 42

Above, the sdiv instruction is used to perform the signed division operation, and %t1 holds the result of the previous multiplication.

By following this process, the TAC representation of the calculate function could be expressed in LLVM IR as shown below. The variable x has been omitted, and the value computed for the variable t4 is returned directly.

define i32 @calculate() {
entry:
  %t1 = mul i32 42, 42
  %t2 = sdiv i32 %t1, 42
  %t3 = add i32 42, %t2
  %t4 = sub i32 %t3, 42
  ret i32 %t4
}

Although the above works, one would normally also add attributes like nuw (no unsigned wrap) and nsw (no signed wrap) to indicate that the arithmetic operations do not overflow. This information helps with further code optimizations.
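
For example, the multiplication could be annotated as follows.

%t1 = mul nuw nsw i32 42, 42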

Note that it would also be possible to compile the AST directly to LLVM IR without going through the TAC representation. The TAC representation is used here to illustrate the process of lowering from a high-level representation to a lower-level one.


From LLVM IR to Assembly

Finally, the LLVM IR can be further transformed into assembly code that can be executed by the target hardware. The LLVM infrastructure provides tools like llc that can be used to generate assembly code from LLVM IR.
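
For example, assuming the IR above is saved in a file called calculate.ll (a filename chosen for illustration), assembly could be generated as follows.

# LLVM's llc tool compiles LLVM IR to assembly for the selected target.
llc calculate.ll -o calculate.s

# clang can also consume LLVM IR directly and emit assembly.
clang -S calculate.ll -o calculate.s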

Assembly is a low-level language that is tied to the specific instruction set architecture of the target hardware. It is essentially a human-readable representation of the machine code instructions that the hardware executes. Assembly is not an intermediate representation, but rather a direct representation of the machine instructions.

You can try out the LLVM IR representation on the godbolt.org website, which provides an interactive compiler explorer for various programming languages and compilers. Change the language to LLVM IR, paste the LLVM IR representation into the editor, and see what output the different compilers produce.

For example, compiling the above LLVM IR using clang would generate the following assembly code.

calculate:
        mov     eax, 42
        imul    eax, eax, 42
        mov     ecx, 42
        cdq
        idiv    ecx
        add     eax, 42
        sub     eax, 42
        ret

Interestingly, however, compiling the above with llc yields the following.

calculate:
        mov     eax, 42
        ret

As we can see, llc has optimized the code quite a bit, folding the constant arithmetic at compile time and directly returning the value 42. Let’s next look into what sorts of optimizations compilers do.