Hacker News
Formal Verification Gates for AI Coding Loops
alexpotato
|next
[-]
The phases were basically:
- try out having the LLM do "a lot"
- now even more
- now run multiple agents
- back to single agents but have the agents build tools
- tools that are deterministic AND usable by both the humans (EDIT: and the LLMs)
The reasons:
1. Deterministic tools (for both deployments and testing) get you a binary answer and it's repeatable
2. In the event of an outage, you can always fall back to the tool that a human can run
3. It's faster. A quick script can run in <30 seconds but "confabulating" always seemed to take 2-3 minutes.
Really, we are back to this article: https://spawn-queue.acm.org/doi/10.1145/3194653.3197520 aka "make a list of tasks, write scripts for each task, combine the scripts into functions, functions become a system"
singron
|next
|previous
[-]
E.g. the jwt auth example has some major problems since the verification rules aren't fully specified in the spec. The jwt-token verified rule only checks that the string isn't empty, but it doesn't actually verify that it is correctly parsed, non-expired, and signed by a trusted key. The authenticated-user rule doesn't check that the user-id actually came from the jwt. If you hand-wrote your constructor, you would ensure these things. Similarly, all the other constructors allow passing in whatever values you like instead of checking the connections of the real objects.
By calling the constructor for these types, you are making an assertion about the relationship of the parameter values. If AI is calling the constructor, then it's able to make it's own assertions and derive whatever result it wants. That seems backwards. AI should use the result of tenant-access to deduce that a user is a member of tenant, but if they can directly call `(tenant-access user-id tenant-id true)`, then they can "prove" tenant-access for anything. In the past, we have named the constructors for these types `TenantAccess.newUnverified`, and then heavily scrutinized all callers (typically just jwt-parsers and specific database lookups). You can then use `TenantAccess.{userId,tenantId}` without scrutiny elsewhere.
pyrex41
|root
|parent
[-]
Your underlying point, that calling the constructor is the assertion so AI passing `true` can "prove" whatever — is true of any smart-constructor pattern, including your own `newUnverified` approach. The trust still has to live somewhere. In your pattern it lives in the small set of audited callers; in shengen's case it lives in the same place — the wrappers (like `CheckTenantAccess`) that actually establish the premise via a DB query or a JWT parse. Structurally the two approaches are doing the same thing. To harden it, you'd keep the raw constructors package-private and export only the wrappers, so the handler code the LLM is writing physically cannot call NewTenantAccess(..., true) — only CheckTenantAccess.
On the deeper question about "why codegen": the short answer is "obviously, you don't have to." But if we assume that we're using AI to write at least some of the code, now you have to either (1) describe the constructor in very precise English and have the LLM generate it, (2) inject yourself into the loop closely with the LLM, or (3) not use an LLM for this part. My proposition is that writing the core invariants as proofs that can be deterministically checked for internal consistency and written declaratively is (1) more efficient, (2) less lossy, and (3) easier for the developer to read and reason about than writing the constructor from scratch. This puts a lot of trust in the codegen, as you point out; but as a practical matter, having a formal representation of what you want plus an English prompt is stronger context to the LLM anyway.
The other reason I started down this path, which I didn't get into in the post because I haven't figured out yet if it's truly practical, comes from a property specific to Shen: it has a very small kernel that has been ported into a lot of runtimes — Lisp, C, JS, Go, Python, Erlang, Scheme, Java, etc (https://shen-language.github.io/#downloads). That opens up the possibility of writing specs whose predicates run as runtime gates from the same Shen expression, no translation step — and even mixing compile-time and runtime assertions into the same spec. I find this very interesting conceptually, but I'm not sure yet whether it's practically useful for anything.
MetaverseClub
|next
|previous
[-]
We could leverage existing FV tools for a given programming language by using an LLM to generate a translator that maps the language to the FV tool's input format. This would essentially require a finite number of "abstract interpretation" functions—one for each language construct. While the total number of constructs might be large (e.g., around 500), each function would be independent. A function would only need to reason about the abstract semantics of a single construct, assuming the others adhere to their respective semantics. We could then distribute these LLM-generated functions among a group of experts (e.g., 100 reviewers). Thanks to the modularity of the functions, reviewers could evaluate their assigned subset in parallel without bottlenecks. The end result would be a working FV tool for the target language.
cpard
|next
|previous
[-]
In a way, you push some of the reasoning into deterministic tooling which is great both for reliability and performance.
At least on what I've been working on where I ended up creating type systems over SQL to solve some of the annoying issues I was having with agents reasoning over complex data infrastructure.
max_unbearable
|next
|previous
[-]
But singron's JWT point is the real limit. Backpressure doesn't remove the judgment, it moves it. The type still has to be written by someone who understands the actual invariant, and a guard type that compiles while only checking "string is non-empty" gives you the feeling of a gate with none of the guarantee. The compiler enforces what you encoded, not what you meant.
So this reads less like formal verification and more like forcing the human judgment to land up front, in the type definitions, instead of hoping it survives in a prompt. That's still a real win: a constraint in a type is reviewable and permanent, while a prompt rule decays the moment the context window moves past it. Worth being honest, though, that the hard part doesn't go away. It just changes address.
solomonb
|next
|previous
[-]
pyrex41
|root
|parent
|next
[-]
But, for everyone else: even if you skipped the sidecar entirely, didn't use the codegen, you just had the AI interpret the spec'd application into a short Shen proof, iterate until it's internally consistent / compiles...now you have a spec that is internally consistent, straightforward for human to understand, and much stronger context for the LLM than English language spec alone.
appstorelottery
|next
|previous
[-]
eximius
|next
|previous
[-]
vrm
|next
|previous
[-]
pyrex41
|next
|previous
[-]
elfly
|root
|parent
|next
[-]
I wish Go had a serious type system. Never mind algebraic types, but one that fucking respected private values and did things like validating enum values.
ajb
|root
|parent
|next
|previous
[-]
pyrex41
|root
|parent
[-]
Now, a lot of the ports haven't been maintained. But the underlying Shen kernel is only 4-5k lines of code...remains extremely portable. More discussion here https://news.ycombinator.com/item?id=39602472
I didn't focus a ton on Shen in the blog post, because the underlying principles aren't really about the implementation. Shen is very cool tho.
zozbot234
|root
|parent
[-]
That's not a good thing! A Turing complete type system means that compilation is potentially undecidable and non-terminating. The whole unintuitive mess in dependently-typed systems about "definitional" equality (loosely speaking, a notion of equalities that are 'trivially' checked as part of evaluation) is entirely about avoiding Turing-complete type checking!
pyrex41
|root
|parent
[-]
But I mentioned its Turing completeness as a lazy shorthand to illustrate that it is far more flexible and expressive than what people think of as a "type system". https://shenlanguage.org/OSM/Recursive.html
Mikhail_K
|root
|parent
|next
|previous
[-]
pyrex41
|root
|parent
[-]
It does not do the second (formally verify the entire codebase). Outside the guard types your Go or TypeScript is just code. It can panic, race, have bugs in unrelated logic, use reflection to forge values inside the guard package, get a wrong answer from the SQL query that fed the predicate, and so on. The proof ends at the projection boundary.
Why not go further? Not really "too complex to implement," in theory; it has been done before. But verifying the whole program is much higher engineering cost, and the trades-offs to do it make sense in a much narrower set of cases than what I'm trying to target: teams shipping production code with AI in the loop, in the language they already ship.
The pragmatic choice is to spend the verification budget on the small set of invariants that genuinely matter and leave the rest as ordinary code with ordinary review and tests. That is why the claim is phrased as "practically impossible to accidentally bypass, not categorically impossible to bypass." Over-claiming "verified" when the host language is unverified would be misleading.