Hacker News
Toward automated verification of unreviewed AI-generated code
phailhaus
loloquwowndueo
This doesn’t matter in the age of AI - when you get a new requirement just tell the AI to fulfill it and the old requirements (perhaps backed by a decent test suite?) and let it figure out the details, up to and including totally trashing the old implementation and creating an entirely new one from scratch that matches all the requirements.
For performance, give the AI a benchmark and let it figure it out as well. You can create teams of agents each coming up with an implementation and killing the ones that don’t make the cut.
Or so goes the gospel in the age of AI. I’m being totally sarcastic; I don’t believe in AI coding.
Swizec
Let me guess, you've never worked in a real production environment?
When your software supports 8, 9, 10 or more zeroes of revenue, "trash the old and create new" are just about the scariest words you can say. There are people relying on this code that you've never even heard of.
Really good post about why AI is a poor fit in software environments where nobody even knows the full requirements: https://www.linkedin.com/pulse/production-telemetry-spec-sur...
person22
empath75
Well, now it'll take them 5 minutes to rewrite their code to work around your change.
Swizec
You misunderstand. It will take them 2 years to retrain 5000 people on the new process across hundreds of locations. In some fields, whole new college-level certification courses will have to be created.
In my specific experience it’s just a few dozen (maybe 100) people doing the manual process on top of our software and it takes weeks for everyone to get used to any significant change.
We still have people using pages that we deprecated a year ago. Nobody can figure out who they are or what they’re missing on the new pages we built.
baq
builtbyzac
jryio
Those of us with decades of experience who use coding agents for hours per day have learned that, even with extended context engineering, these models do not magically cover more than 50% of the testing space.
If you asked your coding agent to develop a memory allocator, it would not also 'automatically verify' the memory allocator against all failure modes. It is your responsibility as an engineer to have long-term learning and regular contact with the world to inform the testing approach.
tedivm
Again, I'm not opposed to AI coding. I know a lot of people are. I have multiple open source projects that were 100% created with AI assistants, and I wrote a blog post about it that you can see in my post history. I'm not anti-AI, but I do think that developers have some responsibility for the code they create with those tools.
Lerc
There is a subset of things for which it would be OK to do this right now: instances where the cost of utter failure is relatively low. For visual results, the benchmark is often 'does it look right?' rather than 'is it strictly accurate?'
jghn
sharkjacobs
hrmtst93837
Proving a small pure function is one thing, but once the code touches syscalls, a stateful network protocol, time, randomness, or messy I/O semantics, the work shifts from 'verify the program' to 'model the world well enough that the proof means anything,' and that is where the wheels come off.
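A rough sketch of that shift, not taken from the thread: the names `is_expired_pure`, `is_expired`, and `FakeClock` are made up for illustration. The pure rule can be checked exhaustively on a small domain, but as soon as the function reads the time, the check only covers a model of the clock, and it means something only to the extent that the real clock behaves like the model.

```python
import time


def is_expired_pure(now, issued_at, ttl):
    """Pure: the result depends only on the arguments."""
    return now - issued_at >= ttl


def check_pure_exhaustively(limit=20):
    """Compare against an independent statement of the rule on a small domain."""
    return all(
        is_expired_pure(now, issued, ttl) == (now >= issued + ttl)
        for now in range(limit)
        for issued in range(limit)
        for ttl in range(limit)
    )


def is_expired(issued_at, ttl, clock=time.monotonic):
    """Touches the world: the answer depends on whatever the clock reports."""
    return is_expired_pure(clock(), issued_at, ttl)


class FakeClock:
    """Hypothetical model of time: it moves only when advance() is called.

    Real clocks can stall, jump, or differ between machines; none of that
    is captured here, so checks against this model say nothing about it.
    """

    def __init__(self, start=0.0):
        self.now = start

    def __call__(self):
        return self.now

    def advance(self, seconds):
        self.now += seconds


if __name__ == "__main__":
    print("pure rule holds on small domain:", check_pure_exhaustively())
    clock = FakeClock()
    print("expired at t=0:", is_expired(0.0, 5.0, clock))
    clock.advance(5.0)
    print("expired at t=5:", is_expired(0.0, 5.0, clock))
```

Injecting the clock makes the code checkable at all, but the hard part is unchanged: someone still has to decide whether `FakeClock` is a good enough model of the real thing.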
duskdozer
fhd2
I'd consider shipping LLM generated code without review risky. Far riskier than shipping human-generated code without review.
But it's arguably faster in the short run. Also cheaper.
So we have a risk vs speed to market / near term cost situation. Or in other words, a risk vs gain situation.
If you want higher gains, you typically accept more risk. Technically it's a weird decision to ship something that might break, that you don't understand. But depending on the business making that decision, their situation and strategy, it can absolutely make sense.
How to balance revenue, costs and risks is pretty much what companies do. So that's how I think about this kind of stuff. Is it a stupid risk to take for questionable gains in most situations? I'd say so. But it's not my call, and I don't have all the information. I can imagine it making sense for some.
pron
Who writes the tests? It can be ok to trust code that passes tests if you can trust the tests.
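A small illustration of why the tests themselves have to be trusted (a made-up example, not from any real project; the function and test names are hypothetical): a plausible but wrong implementation passes a weak suite and fails only once the suite encodes the cases that matter.

```python
def is_leap_year_buggy(year):
    # Plausible but wrong: ignores the century rules.
    return year % 4 == 0


def is_leap_year(year):
    # Gregorian rule: divisible by 4, except centuries not divisible by 400.
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)


# Hypothetical test suites: maps year -> expected answer.
WEAK_TESTS = {2020: True, 2023: False, 2024: True}
STRONGER_TESTS = {**WEAK_TESTS, 1900: False, 2000: True}


def passes(impl, cases):
    """Return True if impl agrees with every expected answer in cases."""
    return all(impl(year) == expected for year, expected in cases.items())


if __name__ == "__main__":
    print("buggy vs weak suite:", passes(is_leap_year_buggy, WEAK_TESTS))
    print("buggy vs stronger suite:", passes(is_leap_year_buggy, STRONGER_TESTS))
    print("correct vs stronger suite:", passes(is_leap_year, STRONGER_TESTS))
```

A green run against `WEAK_TESTS` says more about the suite than about the code.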
There are, however, other problems. I frequently see agents write code that's functionally correct but that they won't be able to evolve for long. That's also what happened with Anthropic's attempt to have agents write a C compiler. They had thousands of human-written tests, but at some point the agents couldn't get the software to converge. Fixing a bug created another.
davemp
I’m actually starting to get annoyed about how much material is getting spread around about software analysis / formal methods by folks ignorant about the basics of the field.
phillipclapham
The thread's hitting on this with "who writes the tests" but I think it undersells the scope. You're not just shifting responsibility, you're also hitting a ceiling: test specs can verify behavior, not decisions. Worth thinking about what it'd even mean to verify the decision trail that produced the code, not just the code itself.
boombapoom
Ancalagon
Animats
But if you actually can specify what the program is supposed to do, this can work. It's appropriate where the task is hard to do but easy to specify. A file system or a database can be specified in terms of large arrays. Most of the complexity of a file system is in performance and reliability. What it's supposed to do from the API perspective isn't that complicated. The same can be said for garbage collectors, databases, and other complex systems that do something that's conceptually simple but hard to do right.
Probably not going to help with a web page user interface. If you had a spec for what it was supposed to do, you'd have the design.
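To make the "simple to specify, hard to do right" point concrete, here is a minimal sketch under stated assumptions: `LogStore` is a hypothetical toy store (an append-only log with compaction), and the spec is just a Python dict. The checker below does randomized comparison against the spec, which is testing rather than proof, but it shows how small the spec can be relative to the implementation.

```python
import random


class LogStore:
    """Hypothetical implementation: append-only log with periodic compaction."""

    def __init__(self):
        self._log = []  # (key, value) records; value None marks a deletion

    def put(self, key, value):
        self._log.append((key, value))
        if len(self._log) > 8:
            self._compact()

    def delete(self, key):
        self._log.append((key, None))

    def get(self, key):
        # The most recent record for the key wins.
        for k, v in reversed(self._log):
            if k == key:
                return v
        return None

    def _compact(self):
        # Keep only the latest live value for each key.
        latest = {}
        for k, v in self._log:
            latest[k] = v
        self._log = [(k, v) for k, v in latest.items() if v is not None]


def check_against_spec(seed, steps=200):
    """Run random operations on LogStore and on a dict; compare every read."""
    rng = random.Random(seed)
    spec = {}  # the entire specification: a plain mapping
    impl = LogStore()
    for _ in range(steps):
        key = rng.randrange(5)
        op = rng.choice(["put", "delete", "get"])
        if op == "put":
            value = rng.randrange(1000)
            spec[key] = value
            impl.put(key, value)
        elif op == "delete":
            spec.pop(key, None)
            impl.delete(key)
        elif impl.get(key) != spec.get(key):
            return False
    return True


if __name__ == "__main__":
    print(all(check_against_spec(seed) for seed in range(20)))
```

The dict says nothing about performance or crash safety, which is where most of the real complexity lives; it only pins down what reads should return.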
jryio
We are simply shuffling cognitive and entropic complexity around and calling it intelligence. As you said, at the end of the day the engineer - like the pilot - is ultimately the responsible party at all stages of the journey.
otabdeveloper4
Just write your business requirements in a clear, unambiguous and exhaustive manner using a formal specification language.
Bam, no coding required.
ventana
Each of these approaches is just fine and widely used, and none of them can be called "automated verification", which, if my understanding of the term is correct, is more about mathematical proof that the program works as expected.
The article mostly talks about automatic test generation.