From source text to a parser-independent AST for JSONiq and XQuery
RumbleDB supports both JSONiq and XQuery 3.0/3.1. The language server aims to support both languages too through a similar selection rule: it looks for xquery version or jsoniq version in the document and falls back to the LSP languageId when neither is present. The VS Code extension assigns that language ID based on the file extension.
When the language server receives a document, it selects the JSONiq or XQuery parser adapter. The adapter first creates an ANTLR lexer, which turns the source text into a stream of tokens: keywords, variable names, literals, punctuation, and so on. The tokens are then passed to the ANTLR parser, which matches them against the grammar and builds a parse tree.
Internally, the document text is wrapped in a
CharStream. The tokens from the lexer are buffered by aCommonTokenStream, which is consumed by the parser. Parsing starts at the top-level rulemoduleAndThisIsIt(). A custom error listener is attached to both the lexer and parser so that syntax errors can be sent back to the editor.
Once the parse tree exists, it can be traversed directly or through the listener and visitor helpers generated by ANTLR.
Malformed input is normal because parsing happens while the user is editing. Syntax errors are collected during parsing, but ANTLR still attempts to recover and produce a usable tree so the language server can continue providing features.
For example:
let $a := 3
return $ The variable name after $ is missing, so this query cannot be executed. However, $a is still a valid declaration and should appear in the completion list on the second line.
ANTLR already handles much of this recovery. Code built on top of it must still expect missing tokens and partially constructed grammar nodes, but a separate error-tolerant parser is not required.
antlr-ng generates a TypeScript visitor with a typed method for each grammar rule. This avoids many type casts during manual tree traversal. A listener could also be used, but the visitor is a better fit because each method can return nodes for the next representation.
antlr-ng -Dlanguage=TypeScript --generate-visitor true --output-directory src/parser/adapters/jsoniq/grammar src/parser/adapters/jsoniq/grammar/JsoniqLexer.g4 src/parser/adapters/jsoniq/grammar/JsoniqParser.g4 The generated visitor class has this shape:
export class JsoniqParserVisitor<Result> extends AbstractParseTreeVisitor<Result> {
/**
* Visit a parse tree produced by `JsoniqParser.moduleAndThisIsIt`.
* @param ctx the parse tree
* @return the visitor result
*/
visitModuleAndThisIsIt?: (ctx: ModuleAndThisIsItContext) => Result;
// ...and methods for the other parser rules
} As mentioned in the first article, the Language Server Protocol is language-agnostic, but library support remains an important factor when choosing an implementation language. antlr-ng marks optional grammar elements as nullable in the generated TypeScript code. This is particularly useful for incomplete queries because the type checker requires missing children to be handled explicitly.
The implementation uses three different tree representations rather than exposing the ANTLR tree directly to the entire language server.
The first is the parser-dependent tree, which is the raw parse tree produced by ANTLR. It closely follows the grammar, including many intermediate rules that are useful for parsing but not especially useful to the rest of the language server. It is also tied to ANTLR's generated classes.
The second is the parser-independent tree, or parser AST. It still represents syntax, but its node types belong to the language server rather than to the parser generator. Both the JSONiq and XQuery adapters produce this same structure.
The third is the analysis AST, where declarations, references, scopes, and resolved names are added. That stage will be covered in the next article.
flowchart LR
A["Query document<br/>raw source text"] --> B["Parser-dependent tree<br/>ANTLR parse tree"]
B --> C["Parser-independent tree<br/>language-server parser AST"]
C --> D["Analysis AST<br/>scope and feature-oriented tree"]
B1["Generated from grammar<br/>JSONiqParser / XQueryParser"] -.-> B
C1["Normalizes language-specific grammar nodes<br/>into shared node types"] -.-> C
D1["Adds semantic meaning<br/>declarations, references, scopes, function calls"] -.-> D
This conversion adds some code, especially because JSONiq and XQuery need separate adapters. In return, the generated parser classes stay inside those adapters, and the rest of the language server uses stable internal node types.
ANTLR creates the full parse tree, but not all of it is copied into the parser AST. The parser AST is not intended to reproduce the complete executable representation used by RumbleDB, so many grammar details are unnecessary for editor features. Only nodes needed later for completion, definitions, references, rename, signature help, and related features are retained.
Consider this simple query:
let $x := 1
return fn:abs($x) The following diagram shows a simplified version of the transformation. The ANTLR side omits many expression and precedence rules, but it still shows how closely the tree follows the grammar. The parser AST keeps only the nodes needed by later stages; the literal 1, for example, is not represented because no current language-server feature requires it.
flowchart
subgraph A["Parser AST"]
direction TB
A0["module"] --> A1["flowr-expression"]
A1 --> A2["let-binding: $x"]
A1 --> A3["function-call: fn:abs#1"]
A3 --> A4["argument: index 0"]
A4 --> A5["variable-reference: $x"]
end
subgraph P["ANTLR parse tree (simplified)"]
direction TB
P0["moduleAndThisIsIt"] --> P1["module"]
P1 --> P2["mainModule"]
P2 --> P3["program"]
P3 --> P4["flworExpr"]
P4 --> P5["letClause"]
P5 --> P6["letVar"]
P6 --> P7["VarRef: $x"]
P4 --> P8["return expression"]
P8 --> P9["functionCall: fn:abs"]
P9 --> P10["argument"]
P10 --> P11["VarRef: $x"]
end
P0 -. "visitor transformation" .-> A0
Most parser AST nodes have:
function-call or variable-referenceDeclarations also store a narrower selectionRange, usually covering only the variable or function name. This avoids selecting the whole declaration when a feature only needs to point at the symbol.
Names cannot be stored as plain strings because JSONiq and XQuery support qualified names. These are all valid:
a
local:f
Q{https://example.com}a They are represented with three types:
export type Prefix = string;
export type LocalName = string;
export type UnprefixedQName = {
kind: "unprefixed-qname";
localName: LocalName;
};
export type PrefixedQName = {
kind: "prefixed-qname";
prefix: Prefix;
localName: LocalName;
};
export type UriQualifiedQName = {
kind: "uri-qualified-qname";
namespaceUri: string;
localName: LocalName;
};
export type LexicalQName = UnprefixedQName | PrefixedQName | UriQualifiedQName; Functions also require an arity. It is explicit in a named function reference such as fn:count#1; for calls and declarations, it is derived from the number of arguments or parameters. The next stage uses both the name and arity to distinguish overloaded functions.
One complication is distinguishing a declaration from a reference. In JSONiq, both use the VarRef rule for the variable name, so both eventually reach visitVarRef. The parent ANTLR grammar context is inspected to tell them apart:
private isVariableDeclarationReference(node: VarRefContext): boolean {
return (
node.parent instanceof ParamContext ||
node.parent instanceof VarDeclContext ||
node.parent instanceof ForVarContext ||
node.parent instanceof PositionalVarContext ||
node.parent instanceof LetVarContext ||
node.parent instanceof GroupByVarContext ||
node.parent instanceof CountClauseContext ||
node.parent instanceof QuantifiedExprVarContext ||
node.parent instanceof TypeswitchExprContext ||
node.parent instanceof CaseClauseContext ||
node.parent instanceof TumblingWindowClauseContext ||
node.parent instanceof SlidingWindowClauseContext ||
node.parent instanceof WindowVarsContext ||
node.parent instanceof TypeSwitchStatementContext ||
node.parent instanceof CaseStatementContext ||
node.parent instanceof VarDeclForStatementContext ||
node.parent instanceof CopyDeclContext
);
} Another approach would be to modify the grammar by introducing a declaredVarRef rule as an alias of varRef and using it wherever a variable is declared. Its visitor could then consume the declaration without visiting the nested varRef, ensuring that every remaining call to visitVarRef represents an actual reference.
However, the grammar files are kept synchronized with RumbleDB so that future updates can be merged more easily.
Visitor methods return an array containing zero, one, or several parser AST nodes. Rules without an explicit override use the default traversal and collect relevant nodes from their children. An overridden method can create a node of its own or return an empty array to drop that subtree.
For example, this is how a function call is converted:
private functionCall(node: FunctionCallContext): AstVisitResult {
const nameNode = node._fn_name;
if (nameNode === undefined) {
return [];
}
return [
{
kind: "function-call",
name: parseFunctionName(node),
selectionRange: rangeFromNode(nameNode, this.document),
range: rangeFromNode(node, this.document),
children: this.visitChildrenAsNodes(node),
},
];
} Many grammar rules disappear during this transformation. Their useful descendants move up to the nearest retained node, so the resulting tree is much smaller than the ANTLR tree.
The JSONiq and XQuery adapters do this independently because their generated context classes are different. Their output types are shared, so this duplication stops at the adapter boundary.
ANTLR tokens use source offsets, while LSP ranges use zero-based line and character positions. rangeFromNode converts between the two while the parser AST is being built by calling document.positionAt(offset) for the start and end of each node.
Some nodes need more than one range. A declaration covers the complete declaration, but its selectionRange covers only the declared name. A function call similarly stores both the full call range and the name range. Later features can then highlight the right text without reading the tokens again.
Because of error recovery, ANTLR may create a grammar context even when an expected child is missing. The AST builder cannot assume that every generated field is present.
For example, an unfinished variable reference may have a VarRefContext but no name token. In that case, parseVarName returns null and the reference is skipped:
public override visitVarRef = (node: VarRefContext): AstVisitResult => {
if (this.isVariableDeclarationReference(node)) {
return [];
}
const name = parseVarName(node);
return name === null
? []
: [{
kind: "variable-reference",
name,
range: rangeFromNode(node, this.document),
children: [],
}];
}; This only drops the incomplete reference. Everything else that ANTLR recovered remains in the tree, including declarations before the error.
Sometimes the parse-tree node is not enough. For a module variable declaration, the adapter checks whether the next token on ANTLR's default channel is a semicolon. This indicates whether the declaration is finished while ignoring whitespace and comments on the hidden channel.
The parser adapter returns four things:
export interface ParseResult {
diagnostics: Diagnostic[];
ast: ModuleAstNode;
parser: Parser;
tokens: Token[];
} Each output serves a different purpose. Diagnostics come from the error listener, the parser AST is passed to the analysis stage, and completion uses the ANTLR parser and token list to calculate valid grammar candidates at the cursor.
The parser AST now records which relevant structures are present and where they appear, but not what each name refers to. The next part will cover the analysis stage: namespace expansion, scopes, declaration visibility, and reference resolution.