BLISP — Deterministic Execution for AI-Generated Research

The problem

AI agents can reason. They cannot be trusted to execute unchecked.

Large language models propose computational pipelines from natural-language prompts. The operations they select may be structurally valid but semantically unwarranted—the operation exists in the system, but the user's request does not justify it. Schema validation catches malformed output. It does not catch valid-but-wrong execution.

Example: valid-but-unwarranted execution

User request

“Build a momentum strategy on equity futures, ranked by Sharpe ratio.”

Agent proposal

Family: MOM_REV (mean-reversion)
Metric: SRP (Sharpe)

Both are valid capabilities. Schema validation passes. The pipeline executes—and produces the opposite computational signal.

The output is correct in form and exactly wrong in substance. Constrained decoding does not prevent this: it restricts the model to the full set of valid names (all 36 family×metric pairs), not to the subset discovered for this particular prompt. The grounding gate admits only names in that discovered subset.
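The gap between the two checks can be sketched in a few lines. This is a minimal illustration, not BLISP's implementation: only MOM_REV and SRP come from the example above, and every other family or metric name (including MOM_TREND as the momentum family) is a hypothetical placeholder, as are the function names.

```python
from itertools import product

# Hypothetical registry names: only MOM_REV and SRP appear in the text;
# all other identifiers are placeholders for illustration.
FAMILIES = ["MOM_TREND", "MOM_REV", "CARRY", "BREAKOUT"]            # 4 families
METRICS = ["SRP", "M2", "M3", "M4", "M5", "M6", "M7", "M8", "M9"]   # 9 metrics

# The full schema-valid space that constrained decoding can restrict to.
VALID_PAIRS = set(product(FAMILIES, METRICS))


def schema_valid(family: str, metric: str) -> bool:
    """Schema-level check: does the pair exist in the registry at all?"""
    return (family, metric) in VALID_PAIRS


def grounded(family: str, metric: str, discovered: set) -> bool:
    """Per-prompt check: were both names discovered from the user's terms?"""
    return family in discovered and metric in discovered


# Assumed discovery result for "momentum ... ranked by Sharpe ratio".
discovered = {"MOM_TREND", "SRP"}
proposal = ("MOM_REV", "SRP")

print(len(VALID_PAIRS))                 # 36
print(schema_valid(*proposal))          # True
print(grounded(*proposal, discovered))  # False
```

The proposal is one of the 36 valid pairs, so any check defined over the whole registry accepts it. Only a check scoped to the per-prompt discovered subset rejects it.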

System design

The missing boundary between proposal and execution.

BLISP interposes a mandatory admissibility boundary—the grounding gate—between stochastic reasoning and deterministic execution. Above the boundary, agents propose. Below it, everything is deterministic, typed, and content-addressed.

01 Registry

A live capability registry (244 operations, 4 strategy families, 9 metrics): operations, families, signal blocks, and recipes. Each entry is hashed over semantic, algebraic, and implementation layers.
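One way to picture layered entry hashing is sketched below. The field names, the use of SHA-256, and the JSON encoding are assumptions made for illustration; the text only states that each entry is hashed over semantic, algebraic, and implementation layers.

```python
import hashlib
import json


def layer_hash(payload) -> str:
    """SHA-256 over a canonical JSON encoding (sorted keys, fixed separators)."""
    encoded = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(encoded.encode("utf-8")).hexdigest()


def entry_hash(semantic: dict, algebraic: dict, implementation: dict) -> dict:
    """Hash each layer separately, then hash the three layer hashes together."""
    layers = {
        "semantic": layer_hash(semantic),
        "algebraic": layer_hash(algebraic),
        "implementation": layer_hash(implementation),
    }
    combined = layer_hash(layers)
    return {**layers, "entry": combined}


# Hypothetical entry contents; v2 differs from v1 only in its implementation.
v1 = entry_hash({"name": "x", "kind": "op"}, {"commutative": True}, {"version": 1})
v2 = entry_hash({"kind": "op", "name": "x"}, {"commutative": True}, {"version": 2})

print(v1["semantic"] == v2["semantic"])              # True: key order is irrelevant
print(v1["implementation"] == v2["implementation"])  # False
print(v1["entry"] == v2["entry"])                    # False
```

Keeping the per-layer hashes next to the combined one makes it possible to tell an implementation change from a change in meaning.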

02 Discovery

Given natural-language terms, the system matches against the live registry using a four-tier cascade: exact, alias, tag, keyword. Unresolved terms cannot reach execution.
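The cascade can be sketched as an ordered list of matchers where the first tier with a hit wins. The registry contents, the Capability fields, and the matching rules below are hypothetical; only the tier order (exact, alias, tag, keyword) comes from the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Capability:
    name: str
    aliases: tuple = ()
    tags: tuple = ()
    description: str = ""


# Hypothetical two-entry registry for illustration.
REGISTRY = [
    Capability("MOM_REV", aliases=("mean reversion",), tags=("reversal",),
               description="fade recent price moves"),
    Capability("SRP", aliases=("sharpe", "sharpe ratio"), tags=("risk-adjusted",),
               description="return per unit of volatility"),
]


def discover(term: str, registry=REGISTRY):
    """Return (tier, matching names) from the first tier that matches, else None."""
    t = term.strip().lower()
    tiers = [
        ("exact", lambda c: t == c.name.lower()),
        ("alias", lambda c: t in c.aliases),
        ("tag", lambda c: t in c.tags),
        ("keyword", lambda c: t in c.description.lower().split()),
    ]
    for tier, matches in tiers:
        hits = [c.name for c in registry if matches(c)]
        if hits:
            return tier, hits
    return None  # unresolved: this term contributes no evidence


print(discover("Sharpe ratio"))  # ('alias', ['SRP'])
print(discover("volatility"))    # ('keyword', ['SRP'])
print(discover("momentum"))      # None
```

Returning the tier along with the names records how strong the match was, which a later stage could store as evidence.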

03 Grounding Gate

A deterministic function that checks whether every capability name in the agent's proposal has evidence in the discovery result. Names lacking evidence are rejected.
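A gate of this kind reduces to a pure function over two inputs. In the sketch below, the shape of the discovery result (a mapping from capability name to the user terms that matched it) and the GateResult type are assumptions, not BLISP's actual interface.

```python
from typing import Mapping, NamedTuple, Sequence


class GateResult(NamedTuple):
    admitted: bool
    rejected: tuple  # names that lack discovery evidence, sorted


def grounding_gate(proposed: Sequence[str],
                   evidence: Mapping[str, Sequence[str]]) -> GateResult:
    """Admit a proposal only if every proposed name has non-empty evidence."""
    rejected = tuple(sorted({name for name in proposed if not evidence.get(name)}))
    return GateResult(admitted=not rejected, rejected=rejected)


# Assumed discovery result: only SRP was matched from the user's terms.
evidence = {"SRP": ["sharpe ratio"]}

print(grounding_gate(["MOM_REV", "SRP"], evidence))
# GateResult(admitted=False, rejected=('MOM_REV',))
print(grounding_gate(["SRP"], evidence))
# GateResult(admitted=True, rejected=())
```

The function has no model call, no randomness, and no hidden state, so the same proposal and the same evidence always produce the same decision.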

04 Specification

Admitted proposals become typed specification records with family, metric, parameter ranges, and data source. Parameter ranges expand into a morphism grid via Cartesian product.
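The grid expansion is a plain Cartesian product over the parameter ranges. The parameter names and values here are hypothetical, and the sketch treats each grid point as a dictionary rather than as a typed morphism.

```python
from itertools import product


def expand_grid(param_ranges: dict) -> list:
    """Expand {name: [values]} into one dict per point of the Cartesian product."""
    names = sorted(param_ranges)  # fixed order keeps the grid deterministic
    value_lists = [param_ranges[name] for name in names]
    return [dict(zip(names, combo)) for combo in product(*value_lists)]


# Hypothetical parameter ranges: 3 lookbacks x 2 holding periods.
grid = expand_grid({"lookback": [20, 60, 120], "holding": [5, 10]})

print(len(grid))  # 6
print(grid[0])    # {'holding': 5, 'lookback': 20}
print(grid[-1])   # {'holding': 10, 'lookback': 120}
```

Sorting the parameter names before taking the product means the grid order does not depend on how the specification happened to be written.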

05 Canonicalization

Expressions are parsed, normalized, canonicalized, planned, and optimized through a six-stage typed compilation pipeline. Surface syntax differences collapse to canonical identity.
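As a toy illustration of surface differences collapsing to one identity, the sketch below covers a single normalization rule: sorting the arguments of commutative operators. Expressions are assumed to be already parsed into nested tuples, and the operator names and the choice of SHA-256 are hypothetical. It does not model the pipeline's stages.

```python
import hashlib

COMMUTATIVE = {"add", "mul"}  # assumed set of order-insensitive operators


def canonicalize(expr):
    """Recursively sort the arguments of commutative operators."""
    if not isinstance(expr, tuple):
        return expr  # leaf: a symbol or a literal
    op, *args = expr
    args = [canonicalize(arg) for arg in args]
    if op in COMMUTATIVE:
        args.sort(key=repr)  # any fixed total order works
    return (op, *args)


def identity(expr) -> str:
    """Content address of the canonical form."""
    return hashlib.sha256(repr(canonicalize(expr)).encode("utf-8")).hexdigest()


a = ("add", "x", ("mul", "b", "a"))
b = ("add", ("mul", "a", "b"), "x")

print(identity(a) == identity(b))                                # True
print(identity(("sub", "x", "y")) == identity(("sub", "y", "x")))  # False
```

Two expressions that differ only in argument order get the same hash, while a non-commutative operator keeps its argument order and therefore its distinct identity.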

06 Execution

Each admissible morphism runs through a typed execution engine that is deterministic: the same input and the same registry always yield the same output. There is no randomness below the boundary.

07 Provenance

Every execution produces an 8-layer hash decomposing provenance into registry, request, morphisms, plans, artifacts, score, selection, and data. Fault localization without re-execution.
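A decomposed provenance hash can be sketched as one digest per layer plus a root digest over the ordered layer digests. The layer names come from the text above; the record contents, the JSON encoding, SHA-256, and the root construction are assumptions.

```python
import hashlib
import json

LAYERS = ("registry", "request", "morphisms", "plans",
          "artifacts", "score", "selection", "data")


def digest(obj) -> str:
    """SHA-256 over a canonical JSON encoding."""
    encoded = json.dumps(obj, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(encoded.encode("utf-8")).hexdigest()


def provenance(record: dict) -> dict:
    """Return one sub-hash per layer and a root hash over all of them."""
    missing = [layer for layer in LAYERS if layer not in record]
    if missing:
        raise KeyError(f"missing provenance layers: {missing}")
    sub = {layer: digest(record[layer]) for layer in LAYERS}
    return {"layers": sub, "root": digest([sub[layer] for layer in LAYERS])}


# Hypothetical records: run_b differs from run_a only in its data layer.
run_a = {layer: {"v": 1} for layer in LAYERS}
run_b = {**run_a, "data": {"v": 2}}

pa, pb = provenance(run_a), provenance(run_b)
print(pa["root"] == pb["root"])                      # False
print(pa["layers"]["plans"] == pb["layers"]["plans"])  # True
print(pa["layers"]["data"] == pb["layers"]["data"])    # False
```

Because each layer is hashed on its own, a changed root can be traced to the layer that changed by reading hashes alone.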

08 Replay

Identical grounded requests against identical data and registry produce bit-identical hashes. Compare two hashes to verify replay. Compare sub-hashes to localize divergence.
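Localizing divergence then needs nothing more than a comparison of sub-hashes. In this sketch the hash values are shortened placeholder strings and the function name is hypothetical; the layer names are the eight listed under Provenance.

```python
LAYERS = ("registry", "request", "morphisms", "plans",
          "artifacts", "score", "selection", "data")


def diverging_layers(a: dict, b: dict) -> list:
    """Return the layers whose sub-hashes differ, in layer order."""
    return [layer for layer in LAYERS if a.get(layer) != b.get(layer)]


# Placeholder sub-hashes: the second run used different data, which in this
# made-up example also changed the artifacts and the score.
first = {layer: "aaaa" for layer in LAYERS}
second = {**first, "data": "bbbb", "artifacts": "cccc", "score": "dddd"}

print(diverging_layers(first, first))   # []
print(diverging_layers(first, second))  # ['artifacts', 'score', 'data']
```

An empty result verifies the replay. A non-empty result names the layers to inspect, and neither run has to be executed again.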

Nine papers

The scientific backbone.

BLISP is built on a nine-paper research program. Papers 1–5 establish the formal foundation: admissibility, canonicalization, quotient semantics, provenance, and fiber structure. Papers 6–7 show that a single semantic coordinate predicts optimizer behavior. Paper 8 tests whether this structure transfers to independent systems. Paper 9 asks whether agents reconstruct equivalent infrastructure under task pressure.

Paper 1

The Grounding Gate

A mandatory admissibility boundary between stochastic AI reasoning and deterministic execution. Proposals whose capability names lack evidence from the user's terms are rejected before execution.

F3 rate: 23.3% → 10.0% (p = 0.027)
Undiscoverable: 100% → 0%

Paper 2 — forthcoming

Canonical Execution Semantics

A typed specification space, canonicalization pipeline, and content-addressed hashing scheme that provides execution identity independent of surface syntax.

8-layer provenance decomposition
50/50 bit-identical replays

Paper 3

Execution Categories

Stochastic prompt variation defines an equivalence relation on the execution space. Prompts that produce the same canonical execution form an equivalence class, which is a single element of the quotient. Execution fibers bundle equivalent proposals.

Execution classes as categorical quotient objects

Paper 4

Provenance Algebra

Every execution produces a decomposable provenance record. Sub-hash comparison localizes divergence without re-execution. Drift detection isolates which semantic layer changed.

8 semantic layers · fault localization by sub-hash

Paper 5

Execution Fibers

Under stochastic prompt variation, many distinct proposals collapse into few execution identities. Synonym perturbations stay intra-fiber. Metric/family substitutions produce clean inter-fiber transitions.

2,200 proposals · synonym ≈ intra-fiber
metric/family swap = inter-fiber
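The fiber idea can be illustrated by grouping proposals by their execution identity. The synonym table, the MOM_TREND name, and the reduction of identity to a (family, metric) pair are hypothetical simplifications, not the paper's construction.

```python
from collections import defaultdict

# Hypothetical synonym table mapping surface names to canonical names.
CANONICAL = {
    "mean reversion": "MOM_REV", "mean-reversion": "MOM_REV", "mom_rev": "MOM_REV",
    "momentum": "MOM_TREND",
    "sharpe": "SRP", "sharpe ratio": "SRP", "srp": "SRP",
}


def execution_identity(proposal: tuple) -> tuple:
    """Map a (family, metric) proposal to its canonical pair."""
    family, metric = proposal
    return CANONICAL[family.lower()], CANONICAL[metric.lower()]


def fibers(proposals) -> dict:
    """Group proposals that share an execution identity."""
    groups = defaultdict(list)
    for proposal in proposals:
        groups[execution_identity(proposal)].append(proposal)
    return dict(groups)


proposals = [
    ("mean reversion", "sharpe"),
    ("mean-reversion", "Sharpe ratio"),   # synonym variation: same fiber
    ("MOM_REV", "SRP"),                   # synonym variation: same fiber
    ("momentum", "sharpe"),               # family swap: different fiber
]

result = fibers(proposals)
print(len(result))                       # 2
print(len(result[("MOM_REV", "SRP")]))   # 3
```

Three differently worded proposals land in one fiber, and the family substitution opens a second one.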

Paper 6 — Zenodo

The Semantic Structure of Execution

A single 7-valued coordinate (DependencyClass) predicts four independent optimizer behaviors (fusion eligibility, window semantics, pipeline position, and state management) with 243 of 244 behavior predictions correct, or 99.6% accuracy.

243/244 predictions correct · z = 13.0
p < 10⁻³⁸ vs random baseline

Paper 7 — Zenodo

Semantic Coordinates as Predictive Objects

Semantic coordinates function as predictive objects, not descriptive labels. A frozen taxonomy trained on 61 operations generalizes to 25 unseen operations at 100% accuracy with zero recalibration.

100/100 holdout predictions · 0 recalibration
Coordinates predict, they do not merely describe

Paper 8 — Zenodo

Cross-System Transferability

A frozen dependency-shape taxonomy, built without inspecting either target system, predicts execution behavior in Polars and DuckDB. Buffering predictions reach 96.7% in both systems. Zero errors from incorrect shape assignments.

91.1% combined accuracy · 180 predictions
Errors trace to architecture, not taxonomy

Paper 9 — Zenodo

Cross-Family Convergence

Independent frontier model families (Anthropic, OpenAI, Google), working on independent domains, reconstruct structurally equivalent execution-identity primitives under task pressure. Reconstruction is convergent, staged, and expensive.

7/8 primitives converge above 0.90 · 55 runs
~178k tokens per reconstruction · non-reusable

Why it matters

Different audiences, one execution problem.

For AI research

Agents need execution substrates, not just tool APIs.

Tool-augmented LLMs select tools directly, with no admission gate between selection and execution. A valid-but-wrong tool call fails silently. The grounding gate makes tool admission evidence-based and deterministic.

For research

Computations must be replayable, comparable, and attributable.

Two researchers running the same grounded request against the same data and registry get bit-identical results. When results differ, 8-layer sub-hash comparison localizes the divergence to a specific semantic layer without re-execution.

For finance

Systematic research needs deterministic provenance from prompt to portfolio.

Strategy families, scoring metrics, and parameter grids are content-addressed. Every research pipeline has a verifiable execution fingerprint. Six months later, the hash still validates.

For infrastructure

BLISP turns agent outputs into typed, admissible, content-addressed executions.

The execution layer is domain-independent. Finance is the first package. The architecture—discovery, grounding, canonicalization, provenance—applies to any domain where AI-generated pipelines must be validated before execution.

Infrastructure thesis

Why this can become infrastructure.

Agentic AI increases the volume of generated computations. Most will be plausible. Not all will be warranted.
Regulated and scientific domains cannot execute black-box proposals. Auditable provenance is a requirement, not a feature.
BLISP sits between LLMs and execution engines—an admission and provenance layer that neither side provides alone.
Initial wedge: quantitative research and reproducible computational workflows where execution identity already matters.
Long-term position: a deterministic execution substrate for AI agents across domains that require typed, verifiable, replayable computation.

BLISP does not make the model truthful. It prevents unwarranted proposals from silently becoming executions. The model reasons stochastically. The execution layer operates deterministically. The boundary between them is the contribution.

Formal structure

The execution pipeline, formally.

E_R ⟶ Γ ⟶ B_R/∼_R ⟶ κ_R ⟶ ε_R ⟶ P_R

E_R: Stochastic proposal space—all agent-generated proposals
Γ: Grounding gate—rejects proposals without discovery evidence
B_R/∼_R: Execution identity—equivalence classes under canonicalization
κ_R: Canonical representative—one expression per equivalence class
ε_R: Deterministic execution—same canonical input, same output
P_R: 8-layer provenance record—decomposable, content-addressed

Stochastic prompt variation generates many elements of E_R. The grounding gate Γ admits only proposals with discovery evidence. Canonicalization collapses admitted proposals into equivalence classes B_R/∼_R, each with a unique canonical representative κ_R. Execution ε_R is a function on canonical representatives—deterministic by construction. The provenance record P_R decomposes the full execution into 8 semantic layers for audit and fault localization.
