Article: AST Instrumentation (examples by language)
How to perform AST instrumentation in numerous programming languages.
Posted: 29 January 2015
Abstract Syntax Trees (ASTs) are an important tool used internally by compilers. ASTs are a structured, tree-based representation of parsed source code, typically used for semantic analysis and code generation. ASTs are a rich source of information about parsed code, encapsulating the code’s structure and semantic meaning whilst also often retaining information about the tokens in the original source file. The standard tree structure of ASTs makes them extremely simple to traverse and manipulate.
ASTs are used heavily in static analysis tools and program transformation tools. These tools typically analyse an AST and generate either other structured information or, in the case of source-to-source transformation tools, source code, either in the original input language or another language entirely.
In addition to static manipulation, ASTs can be modified and then converted to bytecode and executed. Modifying the AST in order to collect information when the code is executed is known as instrumenting the AST. The collected information can be used for a number of purposes, including profiling the runtime behaviour of code, performing code coverage tests, and generating visualisations that incorporate source code.
The sections below provide examples for a number of different programming languages, demonstrating how to generate ASTs, instrument those ASTs, and then convert the ASTs to bytecode and execute them.
Contents
C++ and Objective-C++
Clang is the LLVM compiler frontend for the C family of programming languages: C, C++, Objective-C, and Objective-C++. Both Clang and the LLVM backend are implemented as a set of modular C++ libraries, making all parts of the compiler’s internal functionality available for use in third-party tools.
Although it is possible to programmatically modify the Clang AST and then generate LLVM bytecode from it, Clang’s real strength is in performing source-to-source transformation. The powerful Rewriter
class facilitates easy modification of the original source code. The rewritten source code can then be compiled using Clang or any other C/C++/Objective-C compiler.
Creating the boilerplate code for performing source-to-source transformation with Clang is outside the scope of this article, but this article by Eli Bendersky provides all of the required details. (Note that the example code in the article is for an old version of Clang, see this source file for the up-to-date code.)
Assume that we have the following pair of template functions (note that the use of C++ templates is what prevents this method of instrumentation from working with pure C or Objective-C code):
We can use the Rewriter
in our AST visitor to wrap the source code for any AST node in a call to the probe function:
If a single source file is being instrumented, the instrumentation probe functions can be prepended to the source file. Otherwise, place the probe declarations in a header file, and prepend an #include
preprocessor directive at the start of each instrumented source file. The transformed source code can then be compiled and run.
Resources
- Clang homepage
- Article on basic source-to-source transformation with Clang by Eli Bendersky
- Eli Bendersky’s Clang & LLVM code samples GitHub repository provides a number of useful examples for working with the Clang & LLVM libraries
Javascript
There are a number of parsers that support generating ASTs for Javascript code. Although the example here uses Esprima, most parsers produce identical output, following the AST format of the Mozilla SpiderMonkey Parser API.
There are numerous tools available for manipulating ASTs following this standard. Here we use estraverse for traversing the AST and escodegen for converting the AST back to Javascript code that can be evaluated.
Assume we have the following probe function:
We can wrap AST nodes with a call to the probe function using the replace()
method of estraverse:
The instrumented AST can then be turned back into Javascript code using escodegen, and evaluated:
Resources
- Esprima homepage
- Documentation for the Mozilla SpiderMonkey Parser API on MDN
- Article on using Esprima to process Javascript by Marcus Malka
PHP 5
Prior to PHP 7, the PHP compiler did not use an AST internally. The PhpParser project provides a parser and AST API for PHP versions 5.2 through to 5.6. PhpParser provides all the functionality required to generate and instrument ASTs, and to turn ASTs back into PHP code that can be evaluated.
Assume we have the following probe function:
We can wrap AST nodes with calls to the probe function by implementing an AST visitor for use with PhpParser’s AST traversal functionality:
The instrumented AST can then be turned back into PHP code, and evaluated:
Resources
- The RFC detailing the AST implementation for PHP 7, written by the author of PhpParser, which states the limitations of the old (non-AST) parser
- PhpParser repository on GitHub
- PhpParser documentation on AST traversal functionality
Python 3.x
Support for generating ASTs is built-in in Python, and can be accessed through the ast
module:
To instrument any given node, we can generate a function call to wrap it, like so:
The instrumented AST can then be compiled to bytecode and executed, like so:
Resources
- Official Python documentation for the
ast
module - Article on instrumenting the Python AST by Andrew Dalke