Parsing Code for Agents: ASTs, Schemas, and the Librarian
In the previous post, I argued that "Discovery" is a recurring tax on agent workflows.
To reduce that tax, I built librarian, a tool that indexes codebases into a local SQLite catalog.
But how do you actually index code in a way that is useful for an LLM? A simple grep index isn't enough (it lacks structure),
and a full Language Server Protocol (LSP) dump is too verbose (it drowns the context window).
The sweet spot is an AST-aware Symbol Index stored in a relational database. It lets an agent ask semantic questions like "Where is the `Symbol` struct defined?" or "Show me the signature of `walk_symbols`?" (And yes, I ran it on itself.)
The Schema: Relational Code
The core of librarian is its SQLite schema. It doesn't try to store the entire graph of the code.
Instead, it focuses on Blobs and Anchors.
1. Blobs (The Content)
A blob is a chunk of a file. It could be the whole file, or it could be a specific function. The key is that it is content-addressed (hashed). If the code doesn't change, the ID doesn't change.
CREATE TABLE blobs (
    blob_id TEXT PRIMARY KEY,     -- hashed stable address
    kind TEXT NOT NULL,           -- 'file' | 'chunk' | 'symbol'
    path TEXT NOT NULL,           -- repo-relative path
    language TEXT,                -- 'rs', 'py', 'go'
    start_line INTEGER,
    end_line INTEGER,
    content_hash TEXT NOT NULL,
    -- ... timestamps and size
);
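That content-addressing is what makes re-indexing cheap: hash what identifies the blob plus its content, and unchanged code maps to the same row on every run. Here is a minimal sketch of the idea, assuming a SHA-256 scheme via the sha2 crate; the actual ID derivation in librarian may differ.

use sha2::{Digest, Sha256};

// Hypothetical sketch: derive a stable blob_id from the blob's identity
// (path, kind, line span) plus a hash of its content. Same content, same ID.
fn blob_id(path: &str, kind: &str, start_line: u32, end_line: u32, content: &str) -> String {
    let mut hasher = Sha256::new();
    hasher.update(path.as_bytes());
    hasher.update(kind.as_bytes());
    hasher.update(start_line.to_le_bytes());
    hasher.update(end_line.to_le_bytes());
    hasher.update(content.as_bytes());
    // Hex-encode the digest; this becomes the TEXT primary key in `blobs`.
    hasher.finalize().iter().map(|b| format!("{:02x}", b)).collect()
}

The payoff is that re-ingesting an unchanged file is effectively a no-op: the computed IDs already exist in the table.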
2. Anchors (The Knowledge Graph)
Anchors are human-readable tags that point to one or more blobs. This is how we map concepts to code.
For example, an anchor named "auth-flow" might point to blobs in src/auth.rs,
src/middleware.rs, and config/policies.yaml.
CREATE TABLE anchor_blobs (
    anchor_id INTEGER NOT NULL,
    blob_id TEXT NOT NULL,
    rank REAL NOT NULL DEFAULT 0.0,
    -- ...
);
This many-to-many relationship allows us to build "virtual folders" of context that cut across the file system hierarchy.
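In practice, one query materializes such a virtual folder. Here is a hedged sketch using rusqlite, assuming an anchors table with anchor_id and name columns that anchor_blobs.anchor_id points at (the post only shows the join table, so the real schema may differ).

use rusqlite::{Connection, Result};

// Hypothetical sketch: list every blob attached to a named anchor, best-ranked
// first, so an agent can pull the whole "auth-flow" context in one call.
fn blobs_for_anchor(conn: &Connection, anchor: &str) -> Result<Vec<(String, String)>> {
    let mut stmt = conn.prepare(
        "SELECT b.path, b.blob_id
         FROM anchors a
         JOIN anchor_blobs ab ON ab.anchor_id = a.anchor_id
         JOIN blobs b ON b.blob_id = ab.blob_id
         WHERE a.name = ?1
         ORDER BY ab.rank DESC",
    )?;
    let rows = stmt.query_map([anchor], |row| Ok((row.get(0)?, row.get(1)?)))?;
    rows.collect()
}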
Parsing: Tree-sitter vs. Regex
To populate this schema, we need to understand the code structure. I initially tried regex, but it's brittle.
Instead, librarian uses Tree-sitter to parse source code into an Abstract Syntax Tree (AST)
and extract symbols reliably.
The ingestion logic in src/ingest/symbols.rs is generic over languages. We define the "kinds" of nodes we care about
(e.g., function_item in Rust, function_declaration in Go) and walk the tree.
// src/ingest/symbols.rs
pub fn discover_symbols_for_path(path: &std::path::Path, content: &str) -> Option<Vec<Symbol>> {
    let ext = path.extension().and_then(|e| e.to_str()).unwrap_or("");
    match ext {
        "rs" => discover_symbols_with_treesitter(
            tree_sitter_rust::language(),
            content,
            &["function_item", "struct_item", "enum_item", "trait_item"],
        ),
        "go" => discover_symbols_with_treesitter(
            tree_sitter_go::language(),
            content,
            &["function_declaration", "method_declaration", "type_spec"],
        ),
        // ... js, py, etc.
        _ => None,
    }
}
This extraction is lightweight. We aren't trying to resolve types or validate lifetimes. We just want to know: "There is a function named `process_data` at lines 50-120."
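For the curious, the walk itself is small. Below is a hedged sketch of what a helper like `discover_symbols_with_treesitter` could look like, assuming a minimal `Symbol` struct with a name, kind, and line span; the real code in src/ingest/symbols.rs may carry more detail.

use tree_sitter::{Language, Node, Parser};

// Assumed minimal shape; the real Symbol in librarian may carry more fields.
pub struct Symbol {
    pub name: String,
    pub kind: String,
    pub start_line: usize,
    pub end_line: usize,
}

fn discover_symbols_with_treesitter(
    language: Language,
    content: &str,
    kinds: &[&str],
) -> Option<Vec<Symbol>> {
    let mut parser = Parser::new();
    parser.set_language(language).ok()?;
    let tree = parser.parse(content, None)?;

    let mut symbols = Vec::new();
    walk(tree.root_node(), content, kinds, &mut symbols);
    Some(symbols)
}

// Depth-first walk: record any node whose kind is in the allow-list,
// using the `name` field child for the identifier when it exists.
fn walk(node: Node, content: &str, kinds: &[&str], out: &mut Vec<Symbol>) {
    if kinds.contains(&node.kind()) {
        let name = node
            .child_by_field_name("name")
            .and_then(|n| n.utf8_text(content.as_bytes()).ok())
            .unwrap_or("<anonymous>")
            .to_string();
        out.push(Symbol {
            name,
            kind: node.kind().to_string(),
            start_line: node.start_position().row + 1, // tree-sitter rows are 0-based
            end_line: node.end_position().row + 1,
        });
    }
    let mut cursor = node.walk();
    for child in node.children(&mut cursor) {
        walk(child, content, kinds, out);
    }
}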
Why SQLite Beats Vector Stores (For This)
It is trendy to throw everything into a Vector DB for "semantic search." But for code navigation, exactness matters more than vibes.
- Deterministic: If an agent asks for `fn:process_data`, it gets exactly that function. Vector search might return `process_date` or `process_datum` if the embeddings are close.
- Low Cost: SQLite runs locally, instantly, and with zero API cost. Vector indexes require embedding models (latency + cost) and heavy libraries.
- Structured: We can query by `path`, `mtime`, or `kind`. "Show me all Rust structs modified in the last 24 hours" is a trivial SQL query. It is impossible with vector search alone.
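To make that last point concrete, here is what such a query could look like, assuming a symbols table (mentioned below) with name, path, language, kind, and a Unix-epoch mtime column; none of those exact column names are shown above, so treat this as a sketch.

use rusqlite::{Connection, Result};

// Hypothetical sketch: "all Rust structs modified in the last 24 hours".
// Assumes `symbols` has name, path, language, kind, and a Unix-epoch mtime;
// the actual schema in librarian may differ.
fn recent_rust_structs(conn: &Connection) -> Result<Vec<(String, String)>> {
    let mut stmt = conn.prepare(
        "SELECT name, path
         FROM symbols
         WHERE language = 'rs'
           AND kind = 'struct_item'
           AND mtime >= strftime('%s', 'now') - 86400",
    )?;
    let rows = stmt.query_map([], |row| Ok((row.get(0)?, row.get(1)?)))?;
    rows.collect()
}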
The Result: A Map, Not a Search Engine
The result isn't a search engine; it's a map. When an agent enters a repo, it doesn't have to stumble around reading files at random. It can query the `symbols` table to get the lay of the land, then pull exactly the `blobs` it needs to do the job.
It turns the codebase from an opaque text blob into a queryable database. And for an agent that pays per token, that structured access is the difference between "getting lost" and "getting done."