Grammars

A grammar defines how source code is tokenized for a particular language. Smalto ships with 30 built-in grammars and supports custom grammar definitions.

Grammar structure

A Grammar has three fields:

| Field | Type | Description |
| --- | --- | --- |
| name | String | Language name (e.g., "python", "javascript") |
| extends | Option(String) | Optional parent language for inheritance |
| rules | List(Rule) | Ordered list of tokenization rules |

Rules

Each Rule defines a pattern that matches a token type:

| Field | Type | Description |
| --- | --- | --- |
| token | String | Token name (maps to Token variants: "keyword" -> Keyword) |
| pattern | String | PCRE regex pattern |
| greedy | Bool | If True, matches against the full source text to avoid partial matches |
| inside | Option(Inside) | Optional nested grammar for recursive tokenization |
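
Read as Gleam types, the two tables above correspond roughly to the sketch below. It is reconstructed from the field tables, not copied from the library source. In particular, the two Inside variants (InlineRules and LanguageRef) and the sample_grammar function are hypothetical names chosen for illustration.

```gleam
import gleam/option.{type Option, None, Some}

/// A named, ordered list of rules with an optional parent language.
pub type Grammar {
  Grammar(name: String, extends: Option(String), rules: List(Rule))
}

/// One tokenization rule; field names follow the table above.
pub type Rule {
  Rule(token: String, pattern: String, greedy: Bool, inside: Option(Inside))
}

/// ASSUMPTION: both variant names are hypothetical. This page only says that
/// nesting can be a list of inline rules or a reference to another language.
pub type Inside {
  InlineRules(rules: List(Rule))
  LanguageRef(language: String)
}

/// A tiny grammar spelled out field by field, without builder functions.
pub fn sample_grammar() -> Grammar {
  Grammar(name: "my-lang", extends: None, rules: [
    Rule(token: "string", pattern: "\"[^\"]*\"", greedy: True, inside: None),
    Rule(
      token: "template-string",
      pattern: "`[^`]*`",
      greedy: False,
      inside: Some(
        InlineRules([Rule("interpolation", "\\$\\{[^}]+\\}", False, None)]),
      ),
    ),
  ])
}
```

Writing rules out this way shows what each field carries. In practice the builder functions described below are the shorter route.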

Using built-in grammars

Each language module exports a grammar() function:

import smalto
import smalto/languages/python

let tokens = smalto.to_tokens("print('hello')", python.grammar())

Building custom grammars

Construct a Grammar record and use the builder functions in smalto/grammar to define its rules:

import gleam/option.{None}
import smalto
import smalto/grammar.{Grammar}

let my_grammar = Grammar(
  name: "my-lang",
  extends: None,
  rules: [
    grammar.greedy_rule("string", "\"[^\"]*\""),
    grammar.rule("keyword", "\\b(?:let|if|else|fn|return)\\b"),
    grammar.rule("number", "\\b\\d+(?:\\.\\d+)?\\b"),
    grammar.rule("operator", "[+\\-*/=<>!]+"),
    grammar.rule("punctuation", "[{}()\\[\\];,]"),
  ],
)

let html = smalto.to_html("let x = 42", my_grammar)

Rule builder functions

| Function | Description |
| --- | --- |
| grammar.rule(token, pattern) | Non-greedy rule with no nesting |
| grammar.greedy_rule(token, pattern) | Greedy rule with no nesting |
| grammar.rule_with_inside(token, pattern, rules) | Rule with inline nested grammar |
| grammar.greedy_rule_with_inside(token, pattern, rules) | Greedy rule with inline nested grammar |
| grammar.nested_rule(token, pattern, language) | Rule with a language reference for cross-language nesting |

Greedy vs non-greedy rules

Greedy rules match against the full source text before token boundaries are resolved. This prevents partial matches inside already-tokenized regions. Use greedy rules for tokens that might contain text resembling other tokens:

// Greedy: prevents the keyword "if" inside a string from being matched as a keyword
grammar.greedy_rule("string", "\"[^\"]*\"")

// Non-greedy: fine for keywords since they're matched by word boundary
grammar.rule("keyword", "\\b(?:if|else|fn)\\b")

Nested rules

Use rule_with_inside to recursively tokenize matched text:

grammar.rule_with_inside("template-string", "`[^`]*`", [
  grammar.rule("interpolation", "\\$\\{[^}]+\\}"),
])
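
Template strings can contain text that resembles other tokens, which is the case greedy rules are recommended for. The sketch below builds the same rule with greedy_rule_with_inside from the rule builder table. The wrapper function name template_string_rule is hypothetical, and the argument order is taken from the table rather than from the library source.

```gleam
import smalto/grammar

// Hypothetical helper name. The builder call follows the signature listed in
// the rule builder table: greedy_rule_with_inside(token, pattern, rules).
pub fn template_string_rule() {
  grammar.greedy_rule_with_inside("template-string", "`[^`]*`", [
    // Applied only inside the text matched by the outer pattern.
    grammar.rule("interpolation", "\\$\\{[^}]+\\}"),
  ])
}
```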

Cross-language nesting

Use nested_rule to reference another language’s grammar. This is useful for embedded languages like JavaScript inside HTML:

grammar.nested_rule("script", "<script[^>]*>[\\s\\S]*?</script>", "javascript")
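
In context, such a rule sits in the rule list of the host grammar. The sketch below assumes a hypothetical minimal markup grammar named "my-html". Only Grammar, nested_rule, and rule come from this page; the function name and the punctuation pattern are illustrative.

```gleam
import gleam/option.{None}
import smalto/grammar.{Grammar}

// Hypothetical host grammar that hands <script> blocks to the JavaScript grammar.
pub fn my_html_grammar() {
  Grammar(name: "my-html", extends: None, rules: [
    // Listed first so whole script blocks are claimed before generic rules.
    grammar.nested_rule(
      "script",
      "<script[^>]*>[\\s\\S]*?</script>",
      "javascript",
    ),
    // Illustrative catch-all for tag punctuation outside script blocks.
    grammar.rule("punctuation", "[<>/=]"),
  ])
}
```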

Grammar inheritance

Grammars can extend other grammars. The child grammar’s rules are prepended to the parent’s rules, so the child’s rules are tried first and take priority:

import gleam/option.{Some}
import smalto/grammar.{Grammar}

let typescript_grammar = Grammar(
  name: "typescript",
  extends: Some("javascript"),
  rules: [
    grammar.rule("keyword", "\\b(?:interface|type|enum|namespace|declare|abstract|implements)\\b"),
    grammar.rule("builtin", "\\b(?:string|number|boolean|void|never|any|unknown)\\b"),
  ],
)

When Smalto resolves this grammar, it places the TypeScript-specific rules ahead of the inherited JavaScript rules. The built-in language grammars handle inheritance automatically.
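
The effect of prepending can be sketched without the library. The types and the resolve function below are simplified stand-ins written for this illustration, not Smalto's implementation. They only show that appending the parent's rules after the child's puts the child's rules first in the order they are tried.

```gleam
import gleam/list

// Simplified stand-ins for Smalto's types (only the fields needed here).
pub type Rule {
  Rule(token: String, pattern: String)
}

pub type Grammar {
  Grammar(name: String, rules: List(Rule))
}

/// Hypothetical resolver: child rules first, then the parent's rules.
pub fn resolve(child: Grammar, parent: Grammar) -> Grammar {
  Grammar(name: child.name, rules: list.append(child.rules, parent.rules))
}

/// Token names in the order the rules would be tried.
pub fn token_order(grammar: Grammar) -> List(String) {
  list.map(grammar.rules, fn(rule) { rule.token })
}

pub fn demo() -> List(String) {
  let javascript =
    Grammar("javascript", [
      Rule("keyword", "\\b(?:if|else|function|return)\\b"),
      Rule("number", "\\b\\d+\\b"),
    ])
  let typescript =
    Grammar("typescript", [
      Rule("builtin", "\\b(?:string|number|boolean)\\b"),
    ])
  token_order(resolve(typescript, javascript))
  // -> ["builtin", "keyword", "number"]
}
```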

Rule order

Rules are tried in order. The first rule that matches at a given position wins. Place more specific rules before general ones:

// Correct: triple-quoted strings before single-quoted
grammar.greedy_rule("string", "\"\"\"[\\s\\S]*?\"\"\""),
grammar.greedy_rule("string", "\"[^\"]*\""),

// Correct: keywords before the more general function-call pattern, so "if(" is tagged as a keyword
grammar.rule("keyword", "\\b(?:if|else|fn)\\b"),
grammar.rule("function", "\\b[a-z_]\\w*(?=\\()"),
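
To see why order matters, here is a toy model of "first match wins". It is not Smalto's matcher: literal prefixes stand in for regex patterns, and PrefixRule, first_match, and the token names are hypothetical. With the general rule listed first, the specific rule is never reached.

```gleam
import gleam/list
import gleam/string

// Toy rule: a token name plus a literal prefix standing in for a regex.
pub type PrefixRule {
  PrefixRule(token: String, prefix: String)
}

/// Returns the token of the first rule whose prefix matches the input.
pub fn first_match(
  rules: List(PrefixRule),
  input: String,
) -> Result(String, Nil) {
  case list.find(rules, fn(rule) { string.starts_with(input, rule.prefix) }) {
    Ok(rule) -> Ok(rule.token)
    Error(Nil) -> Error(Nil)
  }
}

pub fn demo() -> #(Result(String, Nil), Result(String, Nil)) {
  let triple = PrefixRule("triple-quoted", "\"\"\"")
  let single = PrefixRule("single-quoted", "\"")
  let input = "\"\"\"doc\"\"\""
  #(
    // Specific rule first: the triple-quoted rule wins.
    first_match([triple, single], input),
    // General rule first: it shadows the specific rule.
    first_match([single, triple], input),
  )
  // -> #(Ok("triple-quoted"), Ok("single-quoted"))
}
```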