Article: AST Instrumentation (examples by language)

How to perform AST instrumentation in numerous programming languages.

Tags: AST, C++, Clang, Development, Javascript, PHP, Python, Static Analysis

Abstract Syntax Trees (ASTs) are an important tool used internally by compilers. ASTs are a structured, tree-based representation of parsed source code, typically used for semantic analysis and code generation. ASTs are a rich source of information about parsed code, encapsulating the code’s structure and semantic meaning whilst also often retaining information about the tokens in the original source file. The standard tree structure of ASTs makes them extremely simple to traverse and manipulate.

ASTs are used heavily in static analysis tools and program transformation tools. These tools typically analyse an AST and generate either other structured information or, in the case of source-to-source transformation tools, source code, either in the original input language or another language entirely.

In addition to static manipulation, ASTs can be modified and then converted to bytecode and executed. Modifying the AST in order to collect information when the code is executed is known as instrumenting the AST. The collected information can be used for a number of purposes, including profiling the runtime behaviour of code, performing code coverage tests, and generating visualisations that incorporate source code.

The sections below provide examples for a number of different programming languages, demonstrating how to generate ASTs, instrument those ASTs, and then convert the ASTs to bytecode and execute them.

Contents

C++ and Objective-C++

Clang is the LLVM compiler frontend for the C family of programming languages: C, C++, Objective-C, and Objective-C++. Both Clang and the LLVM backend are implemented as a set of modular C++ libraries, making all parts of the compiler’s internal functionality available for use in third-party tools.

Although it is possible to programmatically modify the Clang AST and then generate LLVM bytecode from it, Clang’s real strength is in performing source-to-source transformation. The powerful Rewriter class facilitates easy modification of the original source code. The rewritten source code can then be compiled using Clang or any other C/C++/Objective-C compiler.

Creating the boilerplate code for performing source-to-source transformation with Clang is outside the scope of this article, but this article by Eli Bendersky provides all of the required details. (Note that the example code in the article is for an old version of Clang, see this source file for the up-to-date code.)

Assume that we have the following pair of template functions (note that the use of C++ templates is what prevents this method of instrumentation from working with pure C or Objective-C code):

template <typename T>
T& InstrumentationProbe(T& result)
{
	//Perform logging here, potentially involving additional arguments
	
	//Make the probe transparent to the overall program result
	return result;
}

template <typename T>
const T& InstrumentationProbe(const T& result)
{
	//Perform logging here, potentially involving additional arguments
	
	//Make the probe transparent to the overall program result
	return result;
}

We can use the Rewriter in our AST visitor to wrap the source code for any AST node in a call to the probe function:

//By implementing RecursiveASTVisitor, we can specify which AST nodes
//we're interested in by overriding relevant methods.
class MyASTVisitor : public RecursiveASTVisitor<MyASTVisitor>
{
	private:
		Rewriter &rewriter;
	
	public:
		MyASTVisitor(Rewriter &R) : rewriter(R) {}
		
		//In this example, we're only interested in variable references
		bool VisitDeclRefExpr(DeclRefExpr* expr)
		{
			//Insert the start of the function call, including the opening paren
			rewriter.InsertTextBefore(expr->getLocStart(), "InstrumentationProbe(");
			
			//Retrieve the end location, with an offset of 1 so the closing paren
			//doesn't get interleaved with the tokens of the wrapped expression
			SourceLocation locEnd = expr->getLocEnd().getLocWithOffset(1);
			
			//Insert the closing paren
			rewriter.InsertTextAfterToken(locEnd, ")");
			
			return true;
		}
};

If a single source file is being instrumented, the instrumentation probe functions can be prepended to the source file. Otherwise, place the probe declarations in a header file, and prepend an #include preprocessor directive at the start of each instrumented source file. The transformed source code can then be compiled and run.

Resources

Javascript

There are a number of parsers that support generating ASTs for Javascript code. Although the example here uses Esprima, most parsers produce identical output, following the AST format of the Mozilla SpiderMonkey Parser API.

There are numerous tools available for manipulating ASTs following this standard. Here we use estraverse for traversing the AST and escodegen for converting the AST back to Javascript code that can be evaluated.

Assume we have the following probe function:

function InstrumentationProbe(nodeResult)
{
	//Perform logging here, potentially involving additional arguments
	
	//Make the probe transparent to the overall program result
	return nodeResult;
}

We can wrap AST nodes with a call to the probe function using the replace() method of estraverse:

//Generate the AST using Esprima
var ast = esprima.parse("javascript code goes here");

//Wrap nodes in probe function calls
ast = estraverse.replace(ast,
{
	leave: function(node, parent)
	{
		//For nodes we don't want to wrap, simply return the original node:
		//return node;
		
		//For nodes we do want to wrap, generate a function call:
		return {
			"type": "CallExpression",
			"callee": {
				"type": "Identifier",
				"name": "InstrumentationProbe"
			},
			"arguments": [ node ]
		};
	}
});

The instrumented AST can then be turned back into Javascript code using escodegen, and evaluated:

var instrumentedCode = escodegen.generate(ast);
eval(instrumentedCode);

Resources

PHP 5

Prior to PHP 7, the PHP compiler did not use an AST internally. The PhpParser project provides a parser and AST API for PHP versions 5.2 through to 5.6. PhpParser provides all the functionality required to generate and instrument ASTs, and to turn ASTs back into PHP code that can be evaluated.

Assume we have the following probe function:

<?php

function InstrumentationProbe($nodeResult)
{
	//Perform logging here, potentially involving additional arguments
	
	//Make the probe transparent to the overall program result
	return $nodeResult;
}

?>

We can wrap AST nodes with calls to the probe function by implementing an AST visitor for use with PhpParser’s AST traversal functionality:

<?php

//Generate the AST using PHP-Parser
$parser = new PhpParser\Parser(new PhpParser\Lexer\Emulative);
$ast = $parser->parse("PHP code goes here");

//Our AST visitor
class AstInstrumentation extends PhpParser\NodeVisitorAbstract
{
	public function leaveNode(Node $node)
	{
		//For nodes we don't want to wrap, simply return the original node:
		//return $node;
		
		//For nodes we do want to wrap, generate a function call:
		return new PhpParser\Node\Expr\FuncCall(
			new PhpParser\Node\Name('InstrumentationProbe'),
			Array(new PhpParser\Node\Arg($node, false, false))
		);
	}
}

//Instrument the AST
$instrument = new AstInstrumentation();
$traverser = new PhpParser\NodeTraverser();
$traverser->addVisitor($instrument);
$ast = $traverser->traverse($ast);

?>

The instrumented AST can then be turned back into PHP code, and evaluated:

<?php

$prettyPrinter = new PhpParser\PrettyPrinter\Standard();
$instrumentedCode = $prettyPrinter->prettyPrint($topLevelStmts);
eval($instrumentedCode);

?>

Resources

Python 3.x

Support for generating ASTs is built-in in Python, and can be accessed through the ast module:

import ast

# Generate the AST
rootNode = ast.parse("python code goes here")

To instrument any given node, we can generate a function call to wrap it, like so:

# Assume we have the following probe function:
def InstrumentationProbe(nodeEvaluationResult):
		
		# Perform logging here, potentially involving additional arguments
		
		# Make the probe transparent to the overall program result
		return nodeEvaluationResult

# Generate the instrumentation probe function call for the node "astNode"
instrumentationCall = ast.Call(
	func=ast.Name(id='InstrumentationProbe', ctx=ast.Load()),
	args=[astNode],
	keywords=[],
	starargs=None,
	kwargs=None
)
ast.copy_location(instrumentationCall, astNode)
ast.fix_missing_locations(instrumentationCall)

# Replace the node with the function call that wraps it
# (Assume here that the child is the attribute "child")
setattr(parentNode, "child", instrumentationCall)

The instrumented AST can then be compiled to bytecode and executed, like so:

# Execute the instrumented AST
globalMemory = {}
localMemory = {}
exec(compile(rootNode, '<string>', 'exec'), globalMemory, localMemory)

Resources