Parsing Code for Agents: ASTs, Schemas, and the Librarian

In the previous post, I argued that "Discovery" is a recurring tax on agent workflows. To reduce that tax, I built librarian, a tool that indexes codebases into a local SQLite catalog.

But how do you actually index code in a way that is useful for an LLM? A simple grep index isn't enough (it lacks structure), and a full Language Server Protocol (LSP) dump is too verbose (it drowns the context window).

The sweet spot is an AST-aware symbol index stored in a relational database. This lets an agent ask semantic questions like "Where is the `Symbol` struct defined?" or "Show me the signature of `walk_symbols`." (And yes, those examples come from running librarian on itself.)

The Schema: Relational Code

The core of librarian is its SQLite schema. It doesn't try to store the entire graph of the code. Instead, it focuses on Blobs and Anchors.

1. Blobs (The Content)

A blob is a chunk of a file. It could be the whole file, or it could be a specific function. The key is that it is content-addressed (hashed). If the code doesn't change, the ID doesn't change.

CREATE TABLE blobs (
  blob_id      TEXT PRIMARY KEY,          -- hashed stable address
  kind         TEXT NOT NULL,              -- 'file' | 'chunk' | 'symbol'
  path         TEXT NOT NULL,              -- repo-relative path
  language     TEXT,                       -- 'rs', 'py', 'go'
  start_line   INTEGER,
  end_line     INTEGER,
  content_hash TEXT NOT NULL,
  -- ... timestamps and size
);
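
The content-addressing itself is plain hashing. A minimal sketch of the idea, assuming SHA-256 via the `sha2` crate (librarian's actual ID derivation may also mix in the path and line span):

use sha2::{Digest, Sha256};

// Sketch: hash a blob's text to get a stable content address.
// Unchanged code => unchanged hash => unchanged row.
fn content_hash(content: &str) -> String {
    let mut hasher = Sha256::new();
    hasher.update(content.as_bytes());
    hasher
        .finalize()
        .iter()
        .map(|b| format!("{:02x}", b))
        .collect()
}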

2. Anchors (The Knowledge Graph)

Anchors are human-readable tags that point to one or more blobs. This is how we map concepts to code. For example, an anchor named "auth-flow" might point to blobs in src/auth.rs, src/middleware.rs, and config/policies.yaml.

CREATE TABLE anchor_blobs (
  anchor_id INTEGER NOT NULL,
  blob_id   TEXT NOT NULL,
  rank      REAL NOT NULL DEFAULT 0.0,
  -- ...
);

This many-to-many relationship allows us to build "virtual folders" of context that cut across the file system hierarchy.
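
To make the "virtual folder" idea concrete, here is roughly what pulling one looks like. This is a sketch, not librarian's actual code: it assumes an `anchors(anchor_id, name)` table (not shown above) and the `rusqlite` crate.

use rusqlite::{Connection, Result};

// Sketch: fetch every blob tagged with an anchor, best-ranked first.
// (Assumes start_line/end_line are populated for these blobs.)
fn blobs_for_anchor(conn: &Connection, anchor: &str) -> Result<Vec<(String, i64, i64)>> {
    let mut stmt = conn.prepare(
        "SELECT b.path, b.start_line, b.end_line
         FROM anchors a
         JOIN anchor_blobs ab ON ab.anchor_id = a.anchor_id
         JOIN blobs b ON b.blob_id = ab.blob_id
         WHERE a.name = ?1
         ORDER BY ab.rank DESC",
    )?;
    let rows = stmt.query_map([anchor], |row| {
        Ok((row.get(0)?, row.get(1)?, row.get(2)?))
    })?;
    rows.collect()
}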

Parsing: Tree-sitter vs. Regex

To populate this schema, we need to understand the code structure. I initially tried regex, but it's brittle. Instead, librarian uses Tree-sitter to parse source code into an Abstract Syntax Tree (AST) and extract symbols reliably.

The ingestion logic in `src/ingest/symbols.rs` is generic over languages. We define the "kinds" of nodes we care about (e.g., `function_item` in Rust, `function_declaration` in Go) and walk the tree.

// src/ingest/symbols.rs
pub fn discover_symbols_for_path(path: &std::path::Path, content: &str) -> Option<Vec<Symbol>> {
    let ext = path.extension().and_then(|e| e.to_str()).unwrap_or("");
    match ext {
        "rs" => discover_symbols_with_treesitter(
            tree_sitter_rust::language(),
            content,
            &["function_item", "struct_item", "enum_item", "trait_item"],
        ),
        "go" => discover_symbols_with_treesitter(
            tree_sitter_go::language(),
            content,
            &["function_declaration", "method_declaration", "type_spec"],
        ),
        // ... js, py, etc.
        _ => None,
    }
}

This extraction is lightweight. We aren't trying to resolve types or validate lifetimes. We just want to know: "There is a function named `process_data` at lines 50-120."
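
Here is roughly what that walk can look like: a hedged sketch against the tree-sitter 0.20-era API used above, not librarian's actual helper, with the `Symbol` struct trimmed to the essentials.

use tree_sitter::{Language, Node, Parser};

pub struct Symbol {
    pub name: String,
    pub kind: String,
    pub start_line: usize,
    pub end_line: usize,
}

// Sketch: parse the file, then collect every node whose kind is in `kinds`,
// recording its name (via the grammar's `name` field) and its line span.
fn discover_symbols_with_treesitter(
    language: Language,
    content: &str,
    kinds: &[&str],
) -> Option<Vec<Symbol>> {
    let mut parser = Parser::new();
    parser.set_language(language).ok()?;
    let tree = parser.parse(content, None)?;

    let mut out = Vec::new();
    walk(tree.root_node(), content.as_bytes(), kinds, &mut out);
    Some(out)
}

fn walk(node: Node, src: &[u8], kinds: &[&str], out: &mut Vec<Symbol>) {
    if kinds.contains(&node.kind()) {
        let name = node
            .child_by_field_name("name")
            .and_then(|n| n.utf8_text(src).ok())
            .unwrap_or("<anonymous>")
            .to_string();
        out.push(Symbol {
            name,
            kind: node.kind().to_string(),
            start_line: node.start_position().row + 1, // tree-sitter rows are 0-based
            end_line: node.end_position().row + 1,
        });
    }
    let mut cursor = node.walk();
    for child in node.children(&mut cursor) {
        walk(child, src, kinds, out);
    }
}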

Why SQLite Beats Vector Stores (For This)

It is trendy to throw everything into a Vector DB for "semantic search." But for code navigation, exactness matters more than vibes: when an agent asks where `walk_symbols` is defined, it needs the one exact row with a path and line range, not the five nearest neighbors in embedding space.
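
A sketch of that lookup, assuming a `symbols(name, kind, path, start_line, end_line)` table (librarian's real columns may differ) and the `rusqlite` crate:

// Sketch: a point query, not a similarity search.
fn find_symbol(conn: &rusqlite::Connection, name: &str) -> rusqlite::Result<(String, i64, i64)> {
    conn.query_row(
        "SELECT path, start_line, end_line FROM symbols WHERE name = ?1",
        [name],
        |row| Ok((row.get(0)?, row.get(1)?, row.get(2)?)),
    )
}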

The Result: A Map, Not a Search Engine

The result isn't a search engine; it's a map. When an agent enters a repo, it doesn't have to stumble around reading files at random. It can query the `symbols` table to get the lay of the land, then pull exactly the `blobs` it needs to do the job.
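
The "lay of the land" query can be as dumb as a GROUP BY. Again a sketch against the same assumed `symbols` table:

// Sketch: how many symbols of each kind live in each file.
fn repo_map(conn: &rusqlite::Connection) -> rusqlite::Result<Vec<(String, String, i64)>> {
    let mut stmt = conn.prepare(
        "SELECT path, kind, COUNT(*) FROM symbols GROUP BY path, kind ORDER BY path",
    )?;
    let rows = stmt.query_map([], |row| Ok((row.get(0)?, row.get(1)?, row.get(2)?)))?;
    rows.collect()
}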

It turns the codebase from an opaque text blob into a queryable database. And for an agent that pays per token, that structured access is the difference between "getting lost" and "getting done."