Article: Matching source-level CFG basic blocks to LLVM IR basic blocks
Source-level Control Flow Graphs (CFGs) generated by Clang follow the same structure as CFGs composed of LLVM Intermediate Representation (IR) basic blocks. With a little effort, we can create a mapping between source-level CFG basic blocks and their corresponding LLVM IR basic blocks.
Posted: 18 December 2014
Contents
- Rationale
- Basic Blocks
- Control Flow Graphs (CFGs)
- Generating Clang source-level CFGs and LLVM IR CFGs
- Processing the source-level and IR instruction-level CFGs for a function
- References
Rationale
Out of the box, Clang contains very little functionality to ascertain the relationship between parsed source code and generated LLVM intermediate representation (IR). This is because optimisation steps are performed during several phases of translation [1], and the generated instructions may be very different in structure to the source code that they represent. This is particularly true in the case of language constructs that are not procedural, which may be written declaratively in the source code but must be represented procedurally in the generated IR.
In spite of these factors, a mapping between source statements and generated IR instructions can still be useful, particularly as a heuristic for determining the complexity of a set of source statements. Although determining such a mapping at the statement level of granularity would require modification to the Clang source code [2], determining a mapping at the basic block level of granularity is relatively straightforward using the facilities already available.
Basic Blocks
During optimisation, compilers break code down into units known as basic blocks. Basic blocks can be used to represent either source-level statements or low-level machine instructions, and are utilised during the optimisation passes of multiple translation phases.
Each statement or instruction in a translation unit exists in exactly one basic block. Basic blocks serve to group statements or instructions whose execution is always sequential - that is, there are no jump instructions or jump targets between them. Basic blocks are constructed by inspecting each statement or instruction in turn and applying the following rules (note that the term “instruction” is used here to represent either a source-level statement or a low-level machine instruction, depending on the input that we are creating basic blocks for):
- When a jump target is encountered, start a new basic block with the jump target as the first instruction.
- When a jump instruction is encountered, mark it as the “terminating instruction” of the current basic block, and treat the instruction immediately following the terminating instruction as a jump target.
- In all other cases, add the instruction to the current basic block.
Note that a jump instruction can also be a jump target, as is the case for a loop header. In this instance, the first two rules are applied in order. A new basic block is created containing the instruction, which is then marked as the terminating instruction. A new basic block will be started upon encountering the instruction immediately following the loop header.
To demonstrate the application of these rules, consider source code depicted in Listing 1:
Applying the rules for basic block construction to the source code from Listing 1 yields the basic blocks depicted in Listing 2:
Instructing Clang to generate the LLVM IR for the source code from Listing 1 (using the -emit-llvm switch) yields the LLVM IR depicted in Listing 3. The basic blocks in the LLVM IR are easily identifiable, as they are explicitly marked with descriptive labels (e.g. entry, if.then, while.body, etc.).
Control Flow Graphs (CFGs)
Basic blocks are useful in representing the control flow of a piece of code because each basic block has only one entry point and one exit point. Because basic blocks can represent either source-level statements or low-level machine instructions, the flow relationships between basic blocks can provide useful perspectives on a piece of code at different stages of translation. The visual representation of control flow that is built upon basic blocks is called a control flow graph (CFG).
A CFG is a directed graph that represents all of the possible execution paths through a piece of code. Basic blocks are used to form the vertices in a CFG, and edges are added to represent the flow of control between the basic blocks. The adjacent blocks that can flow to a given basic block are referred to as its predecessors. The adjacent blocks that a given basic block can flow to are referred to as its successors.
In the case of CFGs built from source-level basic blocks, two additional vertices are added: a source node termed entry and a sink node termed exit. Note that a CFG is not a directed acyclic graph, as loops are represented by back edges, which introduce cycles into the graph.
The CFG for the source-level basic blocks listed in Listing 2 is depicted in Figure 1:
The CFG for the LLVM IR instruction-level basic blocks listed in Listing 3 is depicted in Figure 2:
Generating Clang source-level CFGs and LLVM IR CFGs
Generating Clang source-level CFGs
Clang can generate CFGs based on parsed source code, providing a source-level view of code after preprocessing has taken place. These source-level CFGs are represented by the clang::CFG class. The Clang API interface to construct a source-level CFG is clang::CFG::buildCFG(). Given a function declaration in the form of a clang::FunctionDecl AST node and a clang::ASTContext, we can build a source-level CFG like so:
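A minimal sketch follows; the helper name buildSourceCFG is purely illustrative, and the return type of clang::CFG::buildCFG() has differed between Clang releases (a raw pointer in older versions, a std::unique_ptr in newer ones), so adjust as needed:

```cpp
#include <memory>

#include "clang/AST/ASTContext.h"
#include "clang/AST/Decl.h"
#include "clang/Analysis/CFG.h"

// Build a source-level CFG for a function definition. Returns null if the
// declaration has no body (e.g. a forward declaration).
std::unique_ptr<clang::CFG> buildSourceCFG(const clang::FunctionDecl *FD,
                                           clang::ASTContext &Context) {
  if (!FD->hasBody())
    return nullptr;

  // Default build options are sufficient for a plain source-level CFG; they
  // can be tweaked to include implicit destructors, temporaries, and so on.
  clang::CFG::BuildOptions Options;
  return clang::CFG::buildCFG(FD, FD->getBody(), &Context, Options);
}
```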
Generating LLVM IR instruction-level CFGs
The Clang API provides a mechanism to trigger LLVM IR generation and retrieve the generated IR through the clang::EmitLLVMOnlyAction class. Given a clang::CompilerInstance object that has been set up, we can trigger IR generation for an input file like so:
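A minimal sketch, assuming Compiler is a fully configured clang::CompilerInstance (invocation, diagnostics and input file already set up); the wrapper name emitIRFor is purely illustrative:

```cpp
#include "clang/CodeGen/CodeGenAction.h"
#include "clang/Frontend/CompilerInstance.h"
#include "llvm/IR/Module.h"

// Run LLVM codegen in memory via EmitLLVMOnlyAction; no output files are
// written to disk.
void emitIRFor(clang::CompilerInstance &Compiler) {
  clang::EmitLLVMOnlyAction Action;
  if (!Compiler.ExecuteAction(Action))
    return; // errors will have been reported through the compiler's diagnostics

  // takeModule() transfers ownership of the generated llvm::Module; depending
  // on the Clang version it returns either a raw pointer or a std::unique_ptr.
  auto Module = Action.takeModule();
  if (!Module)
    return;

  // ... inspect the module's functions and basic blocks here ...
}
```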
After LLVM codegen has completed, the resultant IR is represented as an llvm::Module object, which can be retrieved using the takeModule() method of the clang::EmitLLVMOnlyAction object. An llvm::Module contains an llvm::Function object for each function or method with a definition in the translation unit. Each llvm::Function object contains a list of llvm::BasicBlock objects, representing the basic blocks of the generated IR. An llvm::Function object can be passed directly to LLVM's internal graph visualisation routines to visualise the llvm::BasicBlock objects as a CFG, as depicted in Figure 2.
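One way to invoke those visualisation routines is llvm::Function::viewCFG(), as in this short sketch (Graphviz must be installed for the generated graph to actually be displayed):

```cpp
#include "llvm/IR/Function.h"
#include "llvm/IR/Module.h"

// Render the CFG of every function definition in the module. viewCFG() writes
// a Graphviz .dot file of the function's basic blocks and attempts to open it.
void viewFunctionCFGs(llvm::Module &M) {
  for (llvm::Function &F : M) {
    if (F.isDeclaration())
      continue; // declarations have no body, and hence no basic blocks
    F.viewCFG();
  }
}
```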
Note that because the llvm::BasicBlock objects are being visualised directly, LLVM does not insert additional entry and exit nodes as Clang does when generating a source-level CFG. This difference is visible in the IR instruction-level CFG depicted in Figure 2.
Processing the source-level and IR instruction-level CFGs for a function
Input source code requirements
Once the source-level CFG for a function has been generated and the llvm::Function object has been retrieved, the two CFGs can be processed to determine the relationship between the source-level basic blocks and the IR instruction-level basic blocks. If no optimisations were performed by the compiler at all, there would very likely be a direct one-to-one mapping from source code to IR instructions. However, even basic optimisation techniques, such as dead code removal, will result in a different (more efficient) structure in the generated IR code. To help minimise these differences, the input code must satisfy the following constraints:
- No unreachable or unused code. Every line of code in the function should be reachable and required, since any unreachable or unused code will simply be removed during optimisation.
- No redundant control flow structures. If the condition expression of a control flow structure can be evaluated to true or false statically at compile-time, the compiler will remove the structure. This constraint is not concerned with redundant if/else structures, where one branch is always taken and the other is unreachable, because such cases already violate the first constraint. Rather, it is motivated by cases such as do { ... } while (false);, where the compiler simply replaces the surrounding structure with the loop body: a redundant control flow structure without any unreachable code. (Both constraints are illustrated in the short example after this list.)
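As an illustration only, both functions below violate these constraints, so their source-level and IR instruction-level CFGs would not line up block-for-block:

```cpp
// Violates the first constraint: the increment is unreachable and will simply
// be removed during IR generation.
int unreachable_code(int x) {
  return x * 2;
  ++x;
}

// Violates the second constraint: the loop condition is statically false, so
// the compiler replaces the whole do/while structure with just the loop body.
int redundant_structure(int x) {
  do {
    x += 1;
  } while (false);
  return x;
}
```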
Processing methodology
To process the two CFGs, we perform a breadth-first (BFS) traversal using a queue of pairs. Each pair contains a basic block from the source-level CFG and a basic block from the IR instruction-level CFG. We set the initial state of the queue by enqueueing a pair containing the first successor of the source-level CFG's entry block and the entry block of the instruction-level CFG. This is done because the instruction-level CFG does not contain added entry and exit nodes, and so the entry node of the source-level CFG has no counterpart in the instruction-level CFG.
For each pair in the queue, we must ascertain if the two blocks are counterparts. The details of the checks performed to ascertain this are described in the sections that follow. Assuming that the pair of basic blocks are counterparts, we mark them as such, assigning the instruction count of the instruction-level block to the source-level block. We then enqueue the blocks’ respective successors in matched pairs. (That is, we enqueue a pair containing the source-level block’s first successor and the instruction-level block’s first successor, a pair containing the source-level block’s second successor and the instruction-level block’s second successor, and so on.)
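A simplified sketch of this traversal is shown below. The helpers areCounterparts() and recordMatch() are hypothetical stand-ins for the checks and bookkeeping described in the next section; they are not part of the Clang or LLVM APIs.

```cpp
#include <queue>
#include <set>
#include <utility>

#include "clang/Analysis/CFG.h"
#include "llvm/IR/BasicBlock.h"
#include "llvm/IR/CFG.h"
#include "llvm/IR/Function.h"

// Hypothetical helpers: areCounterparts() applies the checks described in the
// next section, recordMatch() stores the mapping and the instruction count.
bool areCounterparts(const clang::CFGBlock *SB, const llvm::BasicBlock *IB);
void recordMatch(const clang::CFGBlock *SB, const llvm::BasicBlock *IB,
                 unsigned InstructionCount);

void matchBlocks(clang::CFG &SourceCFG, llvm::Function &IRFunction) {
  typedef std::pair<const clang::CFGBlock *, const llvm::BasicBlock *> Pair;
  std::queue<Pair> Worklist;
  std::set<Pair> Visited; // guard against revisiting pairs via back edges

  // The source-level CFG has a synthetic entry node with no IR counterpart,
  // so pair its first successor with the IR function's entry block.
  const clang::CFGBlock *SourceStart = *SourceCFG.getEntry().succ_begin();
  Worklist.push(Pair(SourceStart, &IRFunction.getEntryBlock()));

  while (!Worklist.empty()) {
    Pair Current = Worklist.front();
    Worklist.pop();
    if (!Visited.insert(Current).second)
      continue; // this pair has already been processed

    const clang::CFGBlock *SB = Current.first;
    const llvm::BasicBlock *IB = Current.second;

    if (!areCounterparts(SB, IB))
      continue;
    recordMatch(SB, IB, IB->size());

    // Enqueue successors pairwise: first with first, second with second, ...
    auto SuccSB = SB->succ_begin();
    auto SuccIB = llvm::succ_begin(IB);
    for (; SuccSB != SB->succ_end() && SuccIB != llvm::succ_end(IB);
         ++SuccSB, ++SuccIB) {
      const clang::CFGBlock *NextSB = *SuccSB; // may be null for pruned edges
      if (NextSB)
        Worklist.push(Pair(NextSB, *SuccIB));
    }
  }
}
```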
Determining if blocks are counterparts
By default, an enqueued pair of basic blocks should be counterparts, unless one of the following cases is encountered:
- Source-level CFG “entry” and “exit” nodes. Clang inserts entry and exit nodes when building a source-level CFG, as per the formal definition of a CFG. Such nodes are not added to the instruction-level CFG, however, because it is a direct visualisation of the LLVM IR basic blocks. As such, the source-level entry and exit nodes are ignored when they are encountered. In the case of the entry node, we enqueue a pair containing the entry node’s successor and the current instruction-level basic block.
- Source-level basic blocks split by exception handling. Although the generated LLVM IR for a function contains a number of basic blocks related to exception handling, they are transparently omitted when the instruction-level CFG is visualised, masking the presence of exception handling entirely. This is not the case when Clang generates a source-level CFG. When a try { ... } catch structure is encountered, Clang adds additional nodes to the source-level CFG to represent the exception handling code. These additional nodes are not reachable from the rest of the directed graph, so they present no problem. What does present a problem, however, is that the code inside the try will be split into a separate basic block from the code that follows it (this is reasonable, given that the first statement following the structure becomes a jump target). Since this split is not visible in the instruction-level CFG, it must be dealt with. We mark the current source-level basic block as having been split, and enqueue a pair containing the current source-level basic block's successor and the current instruction-level basic block. The current source-level basic block is ignored and we effectively match the current instruction-level basic block to the second half of the split block instead.
- Redundant instruction-level basic blocks. Sometimes LLVM will generate basic blocks that contain only a single branch instruction, flowing on to a single successor block immediately (minimal checks for this case and the next appear at the end of this section). In these cases, we enqueue a pair containing the current source-level basic block and the instruction-level basic block's successor. The current instruction-level basic block is ignored and we effectively match the current source-level basic block to the instruction-level block's successor instead.
- Redundant source-level basic blocks. Sometimes Clang will generate empty basic blocks, which contain no statements and have a single successor. In these cases, we perform the opposite action to that performed when redundant instruction-level blocks are encountered. We enqueue a pair containing the current source-level basic block’s successor and the current instruction-level basic block. The current source-level basic block is ignored and we effectively match the current instruction-level basic block to the source-level block’s successor instead.
- Ternary operators with constant operands. If both of a ternary operator's operands can be resolved to a concrete value at compile-time, the compiler will remove the branch, evaluating both operand expressions and propagating the result. This is because the overhead of storing a pre-determined constant is lower than branching to avoid evaluating both operand expressions. In these cases, we assign an instruction count of 1 to the source-level basic block whose terminating statement is the ternary operator's condition, and do the same for both source-level blocks representing the ternary operator's operands. We then enqueue a pair containing the successor of the left-hand operand's source-level block and the current instruction-level block. The three source-level blocks representing the ternary operator's branching behaviour are thus each treated as having an instruction count of 1, and the current instruction-level block is matched with the successor of the three mismatched source-level blocks.
Note, however, that there is one important exception to the third and fourth cases:
- Exception: switch case statements. Redundant source-level and instruction-level basic blocks can also be generated by “empty” case clauses in a switch statement (these clauses may or may not contain a break statement). In these situations, the seemingly redundant block (either source-level or instruction-level) will indeed be the counterpart of the block it has been paired with, and should be processed as such.
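As an illustration, minimal checks for the redundant-block cases (the third and fourth cases above) might look like the following; the function names are purely illustrative, and applying them is still subject to the switch/case exception just described:

```cpp
#include "clang/Analysis/CFG.h"
#include "llvm/IR/BasicBlock.h"
#include "llvm/IR/Instructions.h"

// An IR basic block is considered redundant when it holds nothing but a
// single unconditional branch to its sole successor.
bool isRedundantIRBlock(const llvm::BasicBlock *IB) {
  if (IB->size() != 1)
    return false;
  const llvm::BranchInst *Branch =
      llvm::dyn_cast<llvm::BranchInst>(IB->getTerminator());
  return Branch && Branch->isUnconditional();
}

// A source-level basic block is considered redundant when it contains no
// statements and has exactly one successor.
bool isRedundantSourceBlock(const clang::CFGBlock *SB) {
  return SB->empty() && SB->succ_size() == 1;
}
```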