Article: Matching source-level CFG basic blocks to LLVM IR basic blocks

Source-level Control Flow Graphs (CFGs) generated by Clang follow the same structure as CFGs composed of LLVM Intermediate Representation (IR) basic blocks. With a little effort, we can create a mapping between source-level CFG basic blocks and their corresponding LLVM IR basic blocks.

Tags: C++, Clang, Development, LLVM

Contents

Rationale

Out of the box, Clang contains very little functionality to ascertain the relationship between parsed source code and generated LLVM intermediate representation (IR). This is because optimisation steps are performed during several phases of translation 1, and the generated instructions may be very different in structure to the source code that they represent. This is particularly true in the case of language constructs that are not procedural, which may be written declaratively in the source code but must be represented procedurally in the generated IR.

In spite of these factors, a mapping between source statements and generated IR instructions can still be useful, particularly as a heuristic for determining the complexity of a set of source statements. Although determining such a mapping at the statement level of granularity would require modification to the Clang source code 2, determining a mapping at the basic block level of granularity is relatively straightforward using the facilities already available.

Basic Blocks

During optimisation, compilers break code down into units known as basic blocks. Basic blocks can be used to represent either source-level statements or low-level machine instructions, and are utilised during the optimisation passes of multiple translation phases.

Each statement or instruction in a translation unit exists in exactly one basic block. Basic blocks serve to group statements or instructions whose execution is always sequential - that is, there are no jump instructions or jump targets between them. Basic blocks are constructed by inspecting each statement or instruction in turn and applying the following rules (note that the term “instruction” is used here to represent either a source-level statement or a low-level machine instruction, depending on the input that we are creating basic blocks for):

Note that a jump instruction can also be a jump target, as is the case for a loop header. In this instance, the first two rules are applied in order. A new basic block is created containing the instruction, which is then marked as the terminating instruction. A new basic block will be started upon encountering the instruction immediately following the loop header.

To demonstrate the application of these rules, consider source code depicted in Listing 1:

function()
{
	int a = 1;       //Sequential instructions
	int b = 2;       //-----------------------
	
	if (b == 2)      //Jump instruction
	{
		++b;         //Jump target
	}
	
	int c = 3;       //Sequential instructions
	int d = 4;       //-----------------------
	
	while (a < 5)    //Jump instruction and jump target
	{
		++a;         //Jump target
	}
	
	int e = 5;       //Sequential instructions
	int f = 6;       //-----------------------
}

Listing 1: Example source code containing control flow structures

Applying the rules for basic block construction to the source code from Listing 1 yields the basic blocks depicted in Listing 2:

int a = 1;
int b = 2;
if (b == 2)


++b;


int c = 3;
int d = 4;


while (a < 5)


++a;


int e = 5;
int f = 6;

Listing 2: Basic blocks representing the source code from Listing 1

Instructing Clang to generate the LLVM IR for the source code from Listing 1 (using the -emit-llvm switch) yields the LLVM IR depicted in Listing 3. The basic blocks in LLVM IR instructions are easily indentifiable, as they are explicitly marked with descriptive labels (e.g. entry,if.then,while.body, etc.)

void @_Z8functionv() #0 {
entry:
	%a = alloca i32, align 4
	%b = alloca i32, align 4
	%c = alloca i32, align 4
	%d = alloca i32, align 4
	%e = alloca i32, align 4
	%f = alloca i32, align 4
	store i32 1, i32* %a, align 4
	store i32 2, i32* %b, align 4
	%0 = load i32* %b, align 4
	%cmp = icmp eq i32 %0, 2
	br i1 %cmp, label %if.then, label %if.end

if.then:
	%1 = load i32* %b, align 4
	%inc = add nsw i32 %1, 1
	store i32 %inc, i32* %b, align 4
	br label %if.end

if.end:
	store i32 3, i32* %c, align 4
	store i32 4, i32* %d, align 4
	br label %while.cond

while.cond:
	%2 = load i32* %a, align 4
	%cmp1 = icmp slt i32 %2, 5
	br i1 %cmp1, label %while.body, label %while.end

while.body:
	%3 = load i32* %a, align 4
	%inc2 = add nsw i32 %3, 1
	store i32 %inc2, i32* %a, align 4
	br label %while.cond

while.end:
	store i32 5, i32* %e, align 4
	store i32 6, i32* %f, align 4
	ret void
}

Listing 3: LLVM Intermediate Representation (IR) generated by the source code from Listing 1

Control Flow Graphs (CFGs)

Basic blocks are useful in representing the control flow of a piece of code because each basic block has only one entry point and one exit point. Because basic blocks can represent either source-level statements or low-level machine instructions, the flow relationships between basic blocks can provide useful perspectives on a piece of code at different stages of translation. The visual representation of control flow that is built upon basic blocks is called a control flow graph (CFG).

A CFG is a directed graph that represents all of the possible execution paths through a piece of code. Basic blocks are used to form the vertices in a CFG, and edges are added to represent the flow of control between the basic blocks. The adjacent blocks that can flow to a given basic block are referred to as its predecessors. The adjacent blocks that a given basic block can flow to are referred to as its successors.

In the case of CFGs built from source-level basic blocks, two additional vertices are added, a source node termed entry and a sink node termed exit. Note that a CFG is not a directed acyclic graph, as loops are represented by back edges that form cycles to represent iteration.

The CFG for the source-level basic blocks listed in Listing 2 is depicted in Figure 1:

Figure 1: Example source-level Control Flow Graph
Figure 1: Example source-level Control Flow Graph

The CFG for the LLVM IR instruction-level basic blocks listed in Listing 3 is depicted in Figure 2:

Figure 2: Example IR instruction-level Control Flow Graph
Figure 2: Example IR instruction-level Control Flow Graph

Generating Clang source-level CFGs and LLVM IR CFGs

Generating Clang source-level CFGs

Clang can generate CFGs based on parsed source code, providing a source-level view of code after preprocessing has taken place. These source-level CFGs are represented by the clang::CFG class. The Clang API interface to construct a source-level CFG is clang::CFG::buildCFG(). Given a function declaration in the form of a clang::FunctionDecl AST node and a clang::ASTContext, we can build a source-level CFG like so:

FunctionDecl* func = /* ... */ ;
ASTContext* context = /* ... */ ;
Stmt* funcBody = func->getBody();
CFG* sourceCFG = CFG::buildCFG(func, funcBody, context, clang::CFG::BuildOptions());

Listing 4: Code to create a source-level CFG using the Clang API

Generating LLVM IR instruction-level CFGs

The Clang API provides a mechanism to trigger LLVM IR generation and retrieve the generated IR through the clang::EmitLLVMOnlyAction class. Given a clang::CompilerInstance object that has been set up, we can trigger IR generation for an input file like so:

std::string inputFile = "input.cpp";
clang::CompilerInstance compInst;

clang::EmitLLVMOnlyAction codegenAction;
codegenAction.BeginSourceFile(compInst, FrontendInputFile(inputFile, IK_CXX, false));
codegenAction.Execute();
codegenAction.EndSourceFile();

llvm::Module* module = codegenAction.takeModule();

Listing 5: Code to trigger LLVM IR generation for an input file and retrieve the generated module using the Clang API

After LLVM codegen has completed, the resultant IR is represented as an llvm::Module object, which can be retrieved using the takeModule() method of the clang::EmitLLVMOnlyAction object. An llvm::Module contains an llvm::Function object for each function or method with a definition in the translation unit. Each llvm::Function object contains a list of llvm::BasicBlock objects, representing the basic blocks of the generated IR bytecode. An llvm::Function object can be passed directly to LLVM’s internal graph visualisation routines to visualise the llvm::BasicBlock objects as a CFG, as depicted in Figure 2.

Note that because the llvm::BasicBlock objects are being visualised directly, LLVM does not insert additional entry and exit nodes as Clang does when generating a source-level CFG. This difference is visible in the IR instruction-level CFG depicted in Figure 2.

Processing the source-level and IR instruction-level CFGs for a function

Input source code requirements

Once the source-level CFG for a function has been generated, and the llvm::Function object has been retrieved, the two CFGs can be processed to determine the relationship between the source-level basic blocks and the IR instruction-level basic blocks. If no optimisations were performed by the compiler at all, there would very likely be a direct one-to-one mapping from source code to IR instructions. However, even basic optimisation techniques, such as dead code removal, will result in a different (more efficient) structure in the generated IR code. To help minimise these differences, input code must satisfy the following constraints:

Processing methodology

To process the two CFGs, we perform a breadth-first (BFS) traversal by utilising a queue of pairs. Each pair contains a basic block from the source-level CFG and a basic block from the IR instruction-level CFG. We set the initial state of the queue by enqueueing a pair containing the entry block of the instruction-level CFG and the first successor of the entry block of the source-level CFG. This is done because the instruction-level CFG does not contain added entry and exit nodes, and so the entry node of the source-level CFG will have no counterpart in the instruction-level CFG.

For each pair in the queue, we must ascertain if the two blocks are counterparts. The details of the checks performed to ascertain this are described in the sections that follow. Assuming that the pair of basic blocks are counterparts, we mark them as such, assigning the instruction count of the instruction-level block to the source-level block. We then enqueue the blocks’ respective successors in matched pairs. (That is, we enqueue a pair containing the source-level block’s first successor and the instruction-level block’s first successor, a pair containing the source-level block’s second successor and the instruction-level block’s second successor, and so on.)

Determining if blocks are counterparts

By default, an enqueued pair of basic blocks should be counterparts, unless one of the following cases is encountered:

Note, however, that there is one important exception to the third and fourth cases:

References