Grammars
A grammar defines how source code is tokenized for a particular language. Smalto ships with 30 built-in grammars and supports custom grammar definitions.
Grammar structure
A Grammar has three fields:
| Field | Type | Description |
|---|---|---|
| name | String | Language name (e.g., "python", "javascript") |
| extends | Option(String) | Optional parent language for inheritance |
| rules | List(Rule) | Ordered list of tokenization rules |
Rules
Each Rule pairs a regex pattern with the token type assigned to whatever that pattern matches:
| Field | Type | Description |
|---|---|---|
| token | String | Token name (maps to Token variants: "keyword" -> Keyword) |
| pattern | String | PCRE regex pattern |
| greedy | Bool | If True, matches against the full source text to avoid partial matches |
| inside | Option(Inside) | Optional nested grammar for recursive tokenization |
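Taken together, the two field lists suggest record types roughly like the following. This is only a sketch inferred from the tables: the Rule constructor name and the constructor-less Inside declaration are assumptions, and the actual definitions in smalto/grammar may differ.

```gleam
import gleam/option.{type Option}

// Sketch only: mirrors the documented fields, not copied from Smalto's source.
pub type Grammar {
  Grammar(name: String, extends: Option(String), rules: List(Rule))
}

// The constructor name `Rule` is assumed; the field names come from the table.
pub type Rule {
  Rule(token: String, pattern: String, greedy: Bool, inside: Option(Inside))
}

// Declared without constructors because its variants are not documented here.
pub type Inside
```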
Using built-in grammars
Each language module exports a grammar() function:
import smalto
import smalto/languages/python
let tokens = smalto.to_tokens("print('hello')", python.grammar())
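As a slightly fuller sketch, the same built-in grammar can be passed to smalto.to_html inside a small program. The helper name highlight_python is hypothetical, and the String return type is an assumption about to_html rather than something this page confirms.

```gleam
import gleam/io
import smalto
import smalto/languages/python

// Hypothetical helper: highlight Python source with the built-in grammar.
// Assumes smalto.to_html returns a String of HTML markup.
pub fn highlight_python(source: String) -> String {
  smalto.to_html(source, python.grammar())
}

pub fn main() {
  // Print the generated markup for a one-line Python program.
  io.println(highlight_python("print('hello')"))
}
```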
Building custom grammars
Use the builder functions in smalto/grammar to define a grammar:
import gleam/option.{None}
import smalto
import smalto/grammar.{Grammar}

let my_grammar = Grammar(
  name: "my-lang",
  extends: None,
  rules: [
    grammar.greedy_rule("string", "\"[^\"]*\""),
    grammar.rule("keyword", "\\b(?:let|if|else|fn|return)\\b"),
    grammar.rule("number", "\\b\\d+(?:\\.\\d+)?\\b"),
    grammar.rule("operator", "[+\\-*/=<>!]+"),
    grammar.rule("punctuation", "[{}()\\[\\];,]"),
  ],
)

let html = smalto.to_html("let x = 42", my_grammar)
Rule builder functions
| Function | Description |
|---|---|
| grammar.rule(token, pattern) | Non-greedy rule with no nesting |
| grammar.greedy_rule(token, pattern) | Greedy rule with no nesting |
| grammar.rule_with_inside(token, pattern, rules) | Rule with inline nested grammar |
| grammar.greedy_rule_with_inside(token, pattern, rules) | Greedy rule with inline nested grammar |
| grammar.nested_rule(token, pattern, language) | Rule with a language reference for cross-language nesting |
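Of these builders, greedy_rule_with_inside is the only one without an example elsewhere on this page, so here is a hedged sketch of how it might be used for strings containing escape sequences. The argument order follows the table; the helper name string_rule and the "escape" token name are illustrative assumptions, and whether "escape" maps to a Token variant is not confirmed here.

```gleam
import smalto/grammar

// Hypothetical helper: a double-quoted string whose backslash escapes
// are tokenized separately inside the string token.
pub fn string_rule() {
  grammar.greedy_rule_with_inside(
    "string",
    // A quote, then any run of escaped characters or characters that are
    // neither a quote nor a backslash, then the closing quote.
    "\"(?:\\\\.|[^\"\\\\])*\"",
    [
      // Within the matched string, mark each backslash escape such as \n.
      grammar.rule("escape", "\\\\."),
    ],
  )
}
```

Making the outer rule greedy keeps text inside the string from being claimed by other rules, while the inner rule list only ever sees the text the string pattern matched.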
Greedy vs non-greedy rules
Greedy rules match against the full source text before token boundaries are resolved. This prevents partial matches inside already-tokenized regions. Use greedy rules for tokens that might contain text resembling other tokens:
// Greedy: prevents the keyword "if" inside a string from being matched as a keyword
grammar.greedy_rule("string", "\"[^\"]*\"")
// Non-greedy: fine for keywords since they're matched by word boundary
grammar.rule("keyword", "\\b(?:if|else|fn)\\b")
Nested rules
Use rule_with_inside to recursively tokenize matched text:
grammar.rule_with_inside("template-string", "`[^`]*`", [
  grammar.rule("interpolation", "\\$\\{[^}]+\\}"),
])
Cross-language nesting
Use nested_rule to reference another language’s grammar. This is useful for embedded languages like JavaScript inside HTML:
grammar.nested_rule("script", "<script[^>]*>[\\s\\S]*?</script>", "javascript")
Grammar inheritance
Grammars can extend other grammars. The child grammar’s rules are prepended to the parent’s rules, giving them higher priority:
import gleam/option.{Some}
import smalto/grammar.{Grammar}

let typescript_grammar = Grammar(
  name: "typescript",
  extends: Some("javascript"),
  rules: [
    grammar.rule("keyword", "\\b(?:interface|type|enum|namespace|declare|abstract|implements)\\b"),
    grammar.rule("builtin", "\\b(?:string|number|boolean|void|never|any|unknown)\\b"),
  ],
)
When Smalto resolves this grammar, it places the TypeScript-specific rules ahead of the inherited JavaScript rules, so they are tried first. The built-in language grammars handle inheritance automatically.
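Conceptually, the resolution step amounts to a list concatenation with the child's rules first. The sketch below illustrates that idea only; merge_with_parent is a hypothetical helper, not part of Smalto, and the library's actual resolution logic may differ.

```gleam
import gleam/list
import smalto/grammar.{type Grammar, Grammar}

// Hypothetical helper, not a Smalto function: combine a child grammar
// with its already-resolved parent.
pub fn merge_with_parent(child: Grammar, parent: Grammar) -> Grammar {
  Grammar(
    name: child.name,
    // Keep the parent's own parent so a longer chain can still be followed.
    extends: parent.extends,
    // Child rules come first, so they are tried before inherited rules.
    rules: list.append(child.rules, parent.rules),
  )
}
```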
Rule order
Rules are tried in order. The first rule that matches at a given position wins. Place more specific rules before general ones:
// Correct: triple-quoted strings before single-quoted
grammar.greedy_rule("string", "\"\"\"[\\s\\S]*?\"\"\""),
grammar.greedy_rule("string", "\"[^\"]*\""),
// Correct: specific keywords before general identifiers
grammar.rule("keyword", "\\b(?:if|else|fn)\\b"),
grammar.rule("function", "\\b[a-z_]\\w*(?=\\()"),
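To see why ordering matters, here is a simplified, self-contained model of first-match-wins selection. It uses literal prefixes instead of regex patterns and does not call Smalto at all; first_match and the token names are hypothetical and exist only for this illustration.

```gleam
import gleam/list
import gleam/result
import gleam/string

// Simplified model: each rule is #(token_name, literal_prefix) rather than
// a regex. Returns the token name of the first rule whose prefix matches.
pub fn first_match(
  rules: List(#(String, String)),
  input: String,
) -> Result(String, Nil) {
  rules
  |> list.find(fn(rule) { string.starts_with(input, rule.1) })
  |> result.map(fn(rule) { rule.0 })
}

pub fn main() {
  let source = "\"\"\"doc\"\"\""
  let specific_first = [#("triple-string", "\"\"\""), #("string", "\"")]
  let general_first = [#("string", "\""), #("triple-string", "\"\"\"")]

  // With the specific rule first, the triple-quote rule is chosen.
  let assert Ok("triple-string") = first_match(specific_first, source)
  // With the general rule first, it shadows the triple-quote rule.
  let assert Ok("string") = first_match(general_first, source)
  Nil
}
```

In the second ordering the triple-quote rule can never be reached for this input, which is the failure mode the ordering advice above is meant to prevent.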