Syntax tree queries using Tree-sitter#

Knut uses Tree-sitter to build a syntax tree of your files and query data from it.

From the Tree-sitter website:

Tree-sitter is a parser generator tool and an incremental parsing library. It can build a concrete syntax tree for a source file and efficiently update the syntax tree as the source file is edited.

Tree-sitter aims to be:

  • General enough to parse any programming language
  • Fast enough to parse on every keystroke in a text editor
  • Robust enough to provide useful results even in the presence of syntax errors
  • Dependency-free so that the runtime library (which is written in pure C) can be embedded in any application

Tree-sitter in Knut#

Current support in user scripts is limited to:

  • Inspecting the concrete syntax tree generated by Tree-sitter using the "Tree-sitter Inspector".
  • Running Tree-sitter Queries
  • Both features are only available on C/C++ files

However, this already allows for easy and powerful analysis and modification of C++ source files.

Some of Knut's high-level methods also return Tree-sitter query results, usually as instances of the QueryMatch class.

See the QueryMatch documentation for more details of how to use this class.

Compared to the language server, Tree-sitter is much more tolerant of errors and works well even if the current code would not compile. For this reason, many Knut functions internally rely on Tree-sitter instead of the language server. This may result in false positives though, as Tree-sitter only works at the syntax level and does not understand the meaning of the code. Tree-sitter will especially struggle with overloaded symbols, as references to all overloads may be picked up.

Writing Tree-sitter queries#

Tree-sitter queries are a powerful tool, as the syntax tree parsed by Tree-sitter makes it easy to find functions, class definitions, etc. For the exhaustive specification of the query syntax, see the Tree-sitter website.

Knut supports prototyping queries with the Tree-sitter Inspector, which can be accessed in the Knut GUI under the C++ menu.

The Tree-sitter Inspector allows you to inspect the syntax tree of the current file. Note that this view does not include anonymous nodes (e.g. symbols like -> * + - / ). These nodes are still part of the syntax tree and can be queried. For details, see the query documentation.

As the syntax tree can become large very quickly, the Inspector provides a bi-directional mapping to the edited file.

  1. Clicking any node in the inspector will select the corresponding text in the document
  2. Any node that includes the cursor position is highlighted in green.

Additionally, you can test your queries in the bottom-left input field. The Tree-sitter Inspector will notify you of errors in your query and display the number of patterns, matches and captures as you type. Any captured nodes will also display their captures in the syntax tree view.

After prototyping a query, it can easily be included in a script by calling query on any LSP-capable document (e.g. C/C++). Together with JavaScript's destructuring assignment and template literals, this makes queries very ergonomic to use in your own scripts.

Example:

let className = /*...*/;

//  Destructuring Assignment with [ ] can be used if only one result is expected.
let [constructor] = cppFile.query(`
    (function_definition
        declarator: (function_declarator
            declarator: (qualified_identifier
                scope: (_) @scope
                (identifier) @name)
                (#eq? "${className}" @scope @name))
        body: (compound_statement) @body)
`);
// Note how className was inserted directly into the #eq? predicate of the
// query above using the ${...} syntax.

// The result will be an empty list if nothing was found.
// When using destructuring assignment, constructor will then be undefined.
if (!constructor) {
    Message.warning(`Cannot find constructor for class ${className}`);
    return;
}

// Get the text of a capture
let old_body = constructor.get("body").text;

// Modify a capture
constructor.get("body").replace(`{
    std::clog << "${className} Constructor" << std::endl;
}`);

// Or remove it outright
constructor.get("body").remove();
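
When the same query is needed for several classes, the template literal can be wrapped in a small function. The following is a minimal sketch: constructorQuery is a hypothetical helper name and not part of the Knut API. It only builds the query string, which would then be passed to query as in the example above.

```javascript
// Hypothetical helper (not part of Knut): builds the constructor query
// for a given class name using a template literal.
function constructorQuery(className) {
    return `
    (function_definition
        declarator: (function_declarator
            declarator: (qualified_identifier
                scope: (_) @scope
                (identifier) @name)
            (#eq? "${className}" @scope @name))
        body: (compound_statement) @body)
`;
}

// The class name ends up inside the #eq? predicate.
console.log(constructorQuery("MyClass"));
```

Note that the class name is inserted verbatim, so this sketch assumes it contains no double quotes.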

Predicates#

Tree-sitter queries can be extended by the embedding application (in this case Knut) using predicates.

The predicates provided by Knut are divided into two categories:

  1. Commands - Predicates ending with an exclamation mark (!)
    • These predicates may modify a QueryMatch, but not discard it
    • Commands are run before the filters are checked
  2. Filters - Predicates ending with a question mark (?)
    • These predicates can discard, but not modify a QueryMatch
    • A filter discards a QueryMatch if its check fails
    • Filters are checked after the commands run
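
The ordering described above can be sketched in plain JavaScript. This is a simplified model and not Knut's implementation: applyPredicates, dropComments and hasTwoArguments are hypothetical names, and a match is represented as a plain object that maps capture names to captured texts.

```javascript
// Simplified model, not Knut's implementation: a match is a plain object
// mapping capture names to arrays of captured node texts.
function applyPredicates(matches, commands, filters) {
    return matches
        // 1. Commands run first; they may modify a match but never discard it.
        .map((match) => commands.reduce((m, command) => command(m), match))
        // 2. Filters run afterwards; they may discard a match but never modify it.
        .filter((match) => filters.every((filter) => filter(match)));
}

// Hypothetical command: drop comment-like captures from "argument".
const dropComments = (match) => ({
    ...match,
    argument: match.argument.filter((text) => !text.startsWith("/*")),
});

// Hypothetical filter: keep only matches with exactly two arguments.
const hasTwoArguments = (match) => match.argument.length === 2;

const matches = [
    { name: ["myFunction"], argument: ["1", "/* note */", "2"] },
    { name: ["myFunction"], argument: ["1", "2", "3"] },
];

const result = applyPredicates(matches, [dropComments], [hasTwoArguments]);
console.log(result.length); // 1
console.log(result[0].argument);
```

Because the command runs first, the filter sees the first match after its comment was removed and keeps it. With the order reversed, both matches would have been discarded.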

Knut currently provides implementations for these predicates:

(#exclude! [capture]+ [exclusion]+)#

Exclude node types listed in exclusion from the given captures.

This is especially useful to remove unwanted nodes such as comments. It helps in particular when transforming function calls, where the order of arguments is usually relevant and inline comments should not be treated as positional arguments.

Example:

(call_expression
    function: (identifier) @name (#eq? @name myFunction)
    arguments: (argument_list
        [(_) @argument ","]*)
    (#exclude! @argument comment))

This query would return a QueryMatch with only 2 captures for "argument" on the following call:

myFunction(1, /*documentation for the second parameter*/ 2);

Where the first @argument capture would be "1" and the second "2".

Without the (#exclude!) predicate, 3 nodes would have been captured, with the comment as the second capture.
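
The effect on the captures can be modelled in a few lines of plain JavaScript. This is a sketch of the behaviour described above, not Knut's implementation: the exclude function and the hand-written capture list are assumptions made for illustration.

```javascript
// Simplified model, not Knut's implementation: each capture records the
// capture name, the node type and the node text.
function exclude(captures, captureName, excludedTypes) {
    return captures.filter(
        (c) => !(c.name === captureName && excludedTypes.includes(c.type))
    );
}

// Hand-written captures for:
// myFunction(1, /*documentation for the second parameter*/ 2);
const captures = [
    { name: "name", type: "identifier", text: "myFunction" },
    { name: "argument", type: "number_literal", text: "1" },
    { name: "argument", type: "comment", text: "/*documentation for the second parameter*/" },
    { name: "argument", type: "number_literal", text: "2" },
];

const args = exclude(captures, "argument", ["comment"])
    .filter((c) => c.name === "argument")
    .map((c) => c.text);
console.log(args.length); // 2
```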

(#eq? [args]+)#

Check if all arguments are exactly string-equal.

Example usage to find the constructor of MyClass:

(function_definition
    declarator: (function_declarator
        declarator: (qualified_identifier
            scope: (_) @scope
            (identifier) @name)
            (#eq? "MyClass" @scope @name))
    body: (compound_statement) @body)

This matches constructor definitions of the form MyClass::MyClass(...) { ... }, where both the scope and the name are equal to MyClass.

(#like? [args]+)#

Check if all arguments are "alike".

In this case "alike" means the arguments are all string-equal, after all white-space is removed.

This is very useful when comparing strings that might span multiple lines or may be indented/formatted differently depending on preference. E.g. const QString& could also be formatted as const QString &. The like? predicate would match both of these variations.

In general, prefer like? over eq? when matching anything other than a single identifier.
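
The comparison can be illustrated in plain JavaScript. This is a minimal sketch of the rule as described above (remove all white-space, then compare), not Knut's implementation; alike and stripWhitespace are hypothetical names.

```javascript
// Simplified model of the comparison described above, not Knut's code:
// remove all white-space, then compare for string equality.
const stripWhitespace = (text) => text.replace(/\s+/g, "");

function alike(...args) {
    const [first, ...rest] = args.map(stripWhitespace);
    return rest.every((text) => text === first);
}

console.log(alike("const QString&", "const QString &")); // true
console.log(alike("const QString&", "const QString *")); // false
// White-space between words is removed as well, so these are also "alike":
console.log(alike("unsigned int", "unsignedint")); // true
```

The last call shows the trade-off of this rule: since all white-space is removed, two texts that differ only in where words are separated are treated as alike.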

(#eq_except? [pattern] [capture] [exclusion]+)#

Check if the pattern and the capture are string-equal, excluding any (sub-)nodes that have a type listed in exclusion.

This is similar to the (#eq?) operator. However, the captured nodes and their child nodes are filtered. Any (child) node listed in exclusion is removed from the string before comparing for string-equality.

This is useful to ignore the identifier of a parameter_declaration, which may be nested inside arbitrarily many pointer or reference declarators.

E.g.: To check that the type of a parameter is const std::string &, you can simply exclude the identifier:

(function_definition
    declarator: (function_declarator
        parameters: (parameter_list
            (parameter_declaration) @param
            (#eq_except? "const std::string &" @param "identifier"))))
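
How the excluded nodes drop out of the compared string can be sketched in plain JavaScript. This is a simplified model under stated assumptions, not Knut's implementation: textExcept and eqExcept are hypothetical names, the tree is hand-written and coarser than what Tree-sitter produces, and white-space is folded into the leaf texts to keep the model small.

```javascript
// Simplified model, not Knut's implementation. A node has a type and either
// its own text (leaf) or a list of children.
function textExcept(node, excludedTypes) {
    if (excludedTypes.includes(node.type)) return ""; // excluded nodes contribute nothing
    if (!node.children) return node.text;
    return node.children.map((child) => textExcept(child, excludedTypes)).join("");
}

const eqExcept = (pattern, node, excludedTypes) =>
    textExcept(node, excludedTypes) === pattern;

// Hand-written tree for the parameter: const std::string &name
const param = {
    type: "parameter_declaration",
    children: [
        { type: "type_qualifier", text: "const " },
        { type: "qualified_identifier", text: "std::string " },
        {
            type: "reference_declarator",
            children: [
                { type: "&", text: "&" },
                { type: "identifier", text: "name" },
            ],
        },
    ],
};

console.log(eqExcept("const std::string &", param, ["identifier"])); // true
console.log(eqExcept("const std::string &", param, [])); // false
```

The identifier is removed even though it is nested inside the reference_declarator, which is why the depth of the declarator does not matter for the comparison.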

(#like_except? [pattern] [capture] [exclusion]+)#

Check if the pattern and capture are "alike", excluding any (sub-)nodes that have a type listed in exclusion.

See: (#eq_except?) and (#like?).

(#match? [regex] [args]+)#

Check if the given args match the given regex.

Example usage to find all member function definitions of MyClass:

(function_definition
    declarator: (function_declarator
        declarator: (_) @name
            (#match? "MyClass::" @name))
    body: (compound_statement) @body)
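
A regex can be tried out on sample strings in plain JavaScript before it goes into a query. The sketch below assumes that (#match?) searches for the regex anywhere in the captured text; this is an assumption and not stated above. The sample names are made up, and the regex dialect used by Knut may differ from JavaScript's in details.

```javascript
// Candidate texts for the @name capture (hand-written for illustration).
const names = [
    "MyClass::update",
    "MyClass::MyClass",
    "OtherMyClass::update",
    "freeFunction",
];

// Unanchored: matches "MyClass::" anywhere in the text.
const unanchored = /MyClass::/;
// Anchored: the text has to start with "MyClass::".
const anchored = /^MyClass::/;

const looseHits = names.filter((n) => unanchored.test(n));
const strictHits = names.filter((n) => anchored.test(n));

console.log(looseHits.length); // 3, includes "OtherMyClass::update"
console.log(strictHits.length); // 2
```

If the unanchored behaviour applies, anchoring the pattern with ^ avoids picking up classes whose names merely end in MyClass.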

(#not_is? [capture]+ [node_type]+)#

Check that none of the captures are of any of the given node types.

This is especially useful when using the wildcard operators (_) and _. These match any (named) node type. Combined with this predicate these can match any node type but the given types.

Example usage to find all function definitions whose return type is not a primitive type:

(function_definition
    type: (_) @type
    (#not_is? @type primitive_type)) @function

(#in_message_map? [capture]+)#

Check if the given capture is within an MFC message map.

Example usage to find all elements of the message map:

(
(expression_statement) @expr
(#in_message_map? @expr)
)

Note that for most operations, using CppDocument::mfcExtractMessageMap is probably easier.