Syntax tree queries using Tree-sitter#
Knut uses Tree-sitter to build a syntax tree of your files and query data from it.
From the Tree-sitter website:
Tree-sitter is a parser generator tool and an incremental parsing library. It can build a concrete syntax tree for a source file and efficiently update the syntax tree as the source file is edited.
Tree-sitter aims to be:
- General enough to parse any programming language
- Fast enough to parse on every keystroke in a text editor
- Robust enough to provide useful results even in the presence of syntax errors
- Dependency-free so that the runtime library (which is written in pure C) can be embedded in any application
Tree-sitter in Knut#
Knut exposes Tree-sitter via high-level APIs that allow:
- Inspecting the concrete syntax tree generated by Tree-sitter using the "Tree-sitter Inspector".
- Running Tree-sitter queries.
While this doesn't give access to all of the details exposed by Tree-sitter, it already allows for some very easy and powerful analysis and modification of code files.
Knut also returns Tree-sitter query results from many of its high-level methods, usually in the form of the QueryMatch class.
See the QueryMatch documentation for more details of how to use this class.
Compared to a language server, Tree-sitter is a lot more resistant to errors and works well even if the current code would not compile. For this reason, many Knut functions internally rely on Tree-sitter instead of the language server. This may result in false positives though, as Tree-sitter only works on the syntax level and doesn't understand the high-level structures of the code. Tree-sitter will especially struggle if symbols are overloaded, as all references may be picked up.
Viewing the Tree-sitter state#
Familiarizing yourself with Tree-sitter's state is the first step to writing your own queries.
Knut includes the Tree-sitter Inspector, which displays the concrete syntax tree (CST) generated by Tree-sitter.
To open it, click Code > Tree-sitter Inspector.
The Tree-sitter Inspector displays the CST on the left.
As the syntax tree can become large very quickly, the Inspector provides a bi-directional mapping to the edited file:
- Clicking any node in the Inspector selects the corresponding text in the document.
- Any node that includes the cursor position is highlighted in green.
Also note the Show unnamed nodes option at the bottom. The Tree-sitter CST is highly detailed and includes a lot of individual characters and keywords. Many of these nodes are hidden by default and won't be relevant for many queries. For the cases where these nodes are relevant, the checkbox makes them visible.
On the right side, you can prototype a Tree-sitter query, the results of which will immediately be displayed at the bottom and in the CST. The Tree-sitter Inspector will notify you of errors in your query and display the number of patterns, matches and captures as you type. Any captured nodes will also display their captures in the syntax tree view.
⚠️ Because the query runs immediately, testing complex queries on large files can cause Knut to freeze when typing. If this is the case, extracting the relevant code into a smaller file can be a good remedy.
Writing Tree-sitter queries#
Tree-sitter queries are a powerful tool, as the syntax tree parsed by Tree-sitter makes it easy to find functions, class definitions, etc. For the exhaustive specification of the query syntax, see the Tree-sitter website.
Once you have prototyped a query, you can easily include it in a script by calling query on any LSP-capable document (e.g. C/C++).
Together with JavaScript's Destructuring Assignment and Template Literals, this makes queries very ergonomic to use in your own scripts.
Example:
let className = /*...*/;
// Destructuring Assignment with [ ] can be used if only one result is expected.
let [constructor] = cppFile.query(`
    (function_definition
        declarator: (function_declarator
            declarator: (qualified_identifier
                scope: (_) @scope
                (identifier) @name)
            (#eq? "${className}" @scope @name))
        body: (compound_statement) @body)
`);
// ^^^^^^^^^^^^
// Note how className was entered directly into the query using the ${...} syntax.
// The result will be an empty list if nothing was found.
// When using destructuring assignment, constructor will then be undefined.
if (!constructor) {
    Message.warning(`Cannot find constructor for class ${className}`);
    return;
}
// Get the text of a capture
let old_body = constructor.get("body").text;
// Modify a capture
constructor.get("body").replace(`{
    std::cout << "${className} Constructor" << std::endl;
}`);
// Or remove it outright
constructor.get("body").remove();
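When the same query is needed for several classes, the template literal can be wrapped in a small function. The following is a minimal sketch: constructorQuery is a hypothetical helper name and not part of Knut, and the sketch assumes the class name contains no double quotes. It also shows why destructuring an empty result list is safe.

```javascript
// Hypothetical helper (not part of Knut): builds the constructor query
// from the example above for an arbitrary class name.
function constructorQuery(className) {
    return `
        (function_definition
            declarator: (function_declarator
                declarator: (qualified_identifier
                    scope: (_) @scope
                    (identifier) @name)
                (#eq? "${className}" @scope @name))
            body: (compound_statement) @body)
    `;
}

// Destructuring an empty list does not throw, the variable is simply undefined.
const [firstMatch] = [];

console.log(constructorQuery("MyClass").includes('(#eq? "MyClass" @scope @name)')); // true
console.log(firstMatch === undefined); // true
```

Inside a Knut script, the resulting string could then be passed to the query call shown above, e.g. cppFile.query(constructorQuery(className)).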
Predicates#
Tree-sitter queries can be extended by the embedding application (in this case Knut) using predicates.
The predicates provided by Knut are divided into two categories:
- Commands - Predicates ending with an exclamation mark (!)
    - These predicates may modify a QueryMatch, but not discard it
    - Commands are run before the filters are checked
- Filters - Predicates ending with a question mark (?)
    - These predicates can discard, but not modify a QueryMatch
    - A filter discards a QueryMatch if its check fails
    - Filters are checked after the commands run
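The ordering rule above matters because a filter sees the match as the commands left it. The following sketch models this with plain JavaScript objects. It is an illustration only, not Knut's implementation: applyPredicates, dropComments and hasTwoArguments are hypothetical names, and a match is modelled as an object mapping capture names to node lists.

```javascript
// Toy model: commands transform a match, filters decide whether to keep it.
// All commands run first, then every filter must pass.
function applyPredicates(matches, commands, filters) {
    return matches
        .map((match) => commands.reduce((current, command) => command(current), match))
        .filter((match) => filters.every((filter) => filter(match)));
}

// A command: returns a modified copy of the match, never discards it.
const dropComments = (match) => ({
    ...match,
    argument: match.argument.filter((node) => node.type !== "comment"),
});

// A filter: inspects the match and returns true to keep it.
const hasTwoArguments = (match) => match.argument.length === 2;

const matches = [{
    argument: [
        { type: "number_literal", text: "1" },
        { type: "comment", text: "/*documentation for the second parameter*/" },
        { type: "number_literal", text: "2" },
    ],
}];

const kept = applyPredicates(matches, [dropComments], [hasTwoArguments]);
console.log(kept.length);                                  // 1
console.log(kept[0].argument.map((node) => node.text));    // [ '1', '2' ]

// Without the command, the filter sees three nodes and discards the match.
console.log(applyPredicates(matches, [], [hasTwoArguments]).length); // 0
```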
Knut currently provides implementations for these predicates:
(#exclude! [capture]+ [exclusion]+)#
Exclude node types listed in exclusion from the given captures.
This is especially useful to remove any unwanted (comment) nodes.
This is particularly helpful when transforming function calls, as the order of arguments is often relevant there and inline comments should likely not be treated as positional arguments.
Example:
(call_expression
    function: (identifier) @name (#eq? @name myFunction)
    arguments: (argument_list
        [(_) @argument ","]*)
    (#exclude! @argument comment))
When run on the following C++ code:
myFunction(1, /*documentation for the second parameter*/ 2);
the first @argument capture would be "1" and the second "2".
Without the (#exclude!) predicate, 3 nodes would have been captured, with the comment as the second capture.
(#eq? [args]+)#
Check if all arguments are exactly string-equal.
Example usage to find the constructor of MyClass:
(function_definition
    declarator: (function_declarator
        declarator: (qualified_identifier
            scope: (_) @scope
            (identifier) @name)
        (#eq? "MyClass" @scope @name))
    body: (compound_statement) @body)
(#like? [args]+)#
Check if all arguments are "alike".
In this case "alike" means the arguments are all string-equal after all whitespace is removed.
This is very useful when comparing strings that might span multiple lines or may be indented/formatted differently depending on preference.
E.g. const QString& could also be formatted as const QString &.
The like? predicate would match both of these variations.
In general, prefer like? over eq? when matching anything other than a single identifier.
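The "alike" comparison described above can be sketched in a few lines of JavaScript. This models only the documented behaviour (string equality after removing all whitespace) and is not Knut's implementation. The function names are hypothetical.

```javascript
// Remove every whitespace character, including newlines and indentation.
function stripWhitespace(text) {
    return text.replace(/\s+/g, "");
}

// Models like?: all arguments must be equal once whitespace is removed.
function alike(...args) {
    const [first, ...rest] = args.map(stripWhitespace);
    return rest.every((arg) => arg === first);
}

console.log(alike("const QString&", "const QString &"));  // true
console.log(alike("const QString&", "const QString *"));  // false
console.log("const QString&" === "const QString &");      // false, which is what eq? would see
```

Because all whitespace is removed, this model also treats constQString& as alike to const QString&, which is one reason to keep eq? for single identifiers.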
(#eq_except? [pattern] [capture] [exclusion]+)#
Check if the pattern and the capture are string-equal, excluding any (sub-)nodes that have a type listed in exclusion.
This is similar to the (#eq?) predicate.
However, the captured nodes and their child nodes are filtered.
Any (child) node whose type is listed in exclusion is removed from the string before comparing for string-equality.
This is useful to remove the identifier from a parameter_declaration, which may have arbitrarily many pointer indirections.
E.g.: To check that the type of a parameter is const std::string &, you can simply exclude the identifier:
(function_definition
    declarator: (function_declarator
        parameters: (parameter_list
            (parameter_declaration) @param
            (#eq_except? "const std::string &" @param "identifier"))))
Note that for this exact use-case, using #like_except? would be a better fit, to match const std::string& as well as const std::string &.
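To make the difference between the two variants concrete, here is a toy model in JavaScript. Everything in it is an assumption made for illustration: the node shape ({ type, start, end, children }), the simplified parameter tree, the helper names, and in particular the trimming of the remaining text. It is not how Knut implements these predicates.

```javascript
// Returns the node's text with every (sub-)node of an excluded type cut out.
// Nodes carry [start, end) offsets into the source text.
function textExcept(source, node, exclusions) {
    let result = "";
    let position = node.start;
    const visit = (current) => {
        if (exclusions.includes(current.type)) {
            // Keep the text before the excluded node, then skip over the node.
            result += source.slice(position, current.start);
            position = current.end;
            return;
        }
        current.children.forEach(visit);
    };
    visit(node);
    result += source.slice(position, node.end);
    return result.trim(); // Trimming is an assumption of this sketch.
}

const stripWhitespace = (text) => text.replace(/\s+/g, "");

// Models eq_except?: exact string equality after the exclusion.
function eqExcept(pattern, source, node, exclusions) {
    return pattern === textExcept(source, node, exclusions);
}

// Models like_except?: equality after the exclusion, ignoring whitespace.
function likeExcept(pattern, source, node, exclusions) {
    return stripWhitespace(pattern) === stripWhitespace(textExcept(source, node, exclusions));
}

// Builds a simplified parameter_declaration tree for a reference parameter called "name".
function makeParam(source) {
    const end = source.length;
    return {
        type: "parameter_declaration", start: 0, end,
        children: [{
            type: "reference_declarator", start: source.indexOf("&"), end,
            children: [{ type: "identifier", start: source.indexOf("name"), end, children: [] }],
        }],
    };
}

const spaced = "const std::string & name";
const compact = "const std::string& name";
const pattern = "const std::string &";

console.log(eqExcept(pattern, spaced, makeParam(spaced), ["identifier"]));     // true
console.log(eqExcept(pattern, compact, makeParam(compact), ["identifier"]));   // false
console.log(likeExcept(pattern, compact, makeParam(compact), ["identifier"])); // true
```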
(#like_except? [pattern] [capture] [exclusion]+)#
Check if the pattern and capture are "alike", excluding any (sub-)nodes that have a type listed in exclusion.
See: (#eq_except?) and (#like?).
(#match? [regex] [args]+)#
Check if the given args match the given regex.
Example usage to find all member function definitions of MyClass:
(function_definition
    declarator: (function_declarator
        declarator: (_) @name
        (#match? "MyClass::" @name))
    body: (compound_statement) @body)
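When choosing a regex, it helps to check which declarator texts it would accept. The sketch below uses JavaScript's RegExp as a stand-in and assumes the regex may match anywhere in the text (an unanchored search). Both the helper name matchesAll and that assumption are illustrative, so verify the actual behaviour in the Tree-sitter Inspector.

```javascript
// Models match? under the assumption of an unanchored search:
// every given text must contain a match for the regex.
function matchesAll(pattern, ...texts) {
    const regex = new RegExp(pattern);
    return texts.every((text) => regex.test(text));
}

console.log(matchesAll("MyClass::", "MyClass::paintEvent"));      // true
console.log(matchesAll("MyClass::", "OtherClass::paintEvent"));   // false
// An unanchored pattern also accepts longer class names that end in MyClass.
console.log(matchesAll("MyClass::", "NotMyClass::paintEvent"));   // true
// Anchoring the pattern at the start avoids that.
console.log(matchesAll("^MyClass::", "NotMyClass::paintEvent"));  // false
```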
(#not_is? [capture]+ [node_type]+)#
Check that none of the captures are of any of the given node types.
This is especially useful when using the wildcard operators (_) and _.
These match any (named) node type. Combined with this predicate, they can match any node type except the given types.
Example usage to find all member functions that return any type other than a primitive type:
(function_definition
    type: (_) @type
    (#not_is? @type primitive_type)) @function
(#in_message_map? [capture]+)#
Check if the given capture is within an MFC message map.
Example usage to find all elements of the message map:
(
    (expression_statement) @expr
    (#in_message_map? @expr)
)
Note that using CppDocument::mfcExtractMessageMap is probably easier.