Why AI codegen needs structured outputs, not chat history
· IsoKron team · 3 min read
Chat-history-based agents lose schema. Structured-output enforcement at every stage means the compiler can't silently corrupt your project. Zod, JSON Schema, and the discipline that comes with them.
- architecture
- structured-outputs
- ai-codegen
Chat history is not a contract
The dominant pattern for AI coding tools is: maintain a conversation. Add files to context. Stream the model's response. Parse code blocks out of the prose. Apply them.
It works for one-off tasks. It falls apart the moment you need to compose. The model can return a code block with a typo in a function name. It can describe a "small change" that quietly contradicts an earlier decision. It can format the response slightly differently this time than last time, and your parser breaks.
The root cause is that chat history is unstructured. The model is supposed to follow conventions. It usually does. Sometimes it doesn't. There's no point in the pipeline that enforces "the output must conform to this schema or this is a hard error."
Schema as a hard error
The alternative is structured outputs at every stage. In IsoKron, every LLM invocation in the 9-stage compiler is wrapped in a Zod schema. The model is constrained (via Anthropic tool calling, OpenAI Structured Outputs, or equivalents) to produce JSON conforming to that schema. If it doesn't conform — wrong shape, missing fields, extra fields — the call fails immediately and retries with the schema error fed back to the model.
This is the difference between "the model usually does the right thing" and "the model cannot do the wrong thing." The schema is the trust boundary. Whatever passes through it is structurally correct. Whatever doesn't never gets persisted.
A concrete example. When the compiler is producing a Decision, the schema requires:
- title (string, max 200 chars)
- context (string)
- decision (string)
- considered_options (array of strings, parallel-aligned with two other arrays)
- consequences (array of strings)
- supersedes_id (UUID or null)
If the model returns an object without considered_options, the validator rejects. If it returns a supersedes_id that doesn't exist in the database, the foreign-key check rejects. If it returns extra fields, those are stripped. The result is that every Decision entity in the database has the exact same shape, queryable by every downstream stage, with no special cases.
What this looks like in practice
The discipline forces some non-obvious design choices. The biggest:
Parallel arrays beat nested objects for small models. The 7B-class models we run as fleet workers routinely fail at deeply nested JSON. We measured this empirically against a published benchmark and saw 17–37% degradation at 5–7 levels of nesting. So considered_options is three parallel arrays of strings (the option, its pros, its cons), not an array of objects. Index-aligned arrays of primitives are reliable. Arrays of objects with string lists at the leaves are not.
Field names cannot be generic. Two entities with a description field create a pattern called Attribute Bleed — the model starts cross-applying descriptions from one entity to another. So we don't have description. We have decision_outcome, gotcha_symptom, convention_principle. The entity-specific prefix is verbose but it doesn't bleed.
The compiler validates between every LLM stage, not just at the end. A Stage 2 LLM call that produces a Component must pass Stage 2.5 deterministic validation before Stage 3 starts. If validation fails, the stage retries with the error. The chain of LLM calls never accumulates corruption.
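A dependency-free sketch of that between-stage gate (the `Stage` interface and retry policy here are illustrative, not IsoKron's actual internals):

```typescript
// Each stage pairs its generator (standing in for an LLM call) with a
// deterministic validator. The next stage never runs on unvalidated output.
interface Stage<I, O> {
  run(input: I): O;
  validate(output: O): string[]; // list of errors; empty means valid
}

function runValidated<I, O>(stage: Stage<I, O>, input: I, maxAttempts = 3): O {
  let errors: string[] = [];
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const out = stage.run(input); // the real compiler feeds `errors` back on retry
    errors = stage.validate(out);
    if (errors.length === 0) return out; // only validated output flows onward
  }
  throw new Error(`Stage failed validation: ${errors.join("; ")}`);
}
```

Because `runValidated` is the only path from one stage to the next, corruption has nowhere to accumulate: it is caught at the boundary where it appears.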
Why this scales
Two properties fall out of structured-outputs-everywhere that don't exist in chat-history architectures:
- Determinism. The same input produces the same shape, even if the prose varies. You can run the same compile twice and get structurally identical graphs.
- Composability. Stage N+1 can rely on Stage N's output without parsing prose. The schema IS the contract between stages.
For comparison, in a chat-history architecture, the way you get Stage N+1 to use Stage N's output is to dump the prose into the context window and hope. That works for small projects. For a 9-stage compiler running against complex declarations, it doesn't.
Bottom line
Structured outputs aren't a feature. They're a discipline — every stage, every entity, every field. If you're evaluating AI codegen platforms, ask: "Show me where a Stage N output is parsed by Stage N+1." If the answer involves regex or "the model usually formats it correctly," you're looking at a chat-history platform. If the answer involves a Zod schema and a validator that rejects on mismatch, you're looking at something that can run a serious pipeline without silent corruption.
Related: Why structured graphs beat markdown for AI codegen.