Hacker News

Reliable Software in the LLM Era

56 points by mempirate | 21 comments

_pdp_ |next [-]

Nothing changes in terms of how to make reliable software. You need the same things as before: unit tests, integration tests, monitoring tools, etc.

Basically AI now makes every product operate as if it has a vibrant open-source community with hundreds of contributions per day and a small core team with limited capacity.

joshribakoff |root |parent |next [-]

While nothing fundamentally changes, I have found an increased need for tests and for taxonomies of tests, because the LLM can "hack" the tests. So you want more robust tests, with more ways to organize and run them. For example, instead of 200 tests maybe I have 1,200, along with some lightweight tools to run tests in different parts of the test taxonomy.

A more concrete example: maybe you have tests that show you put a highlight on the active item and tests that show you don't put the highlight on the inactive items. But with an LLM you might also want tests that wait a while and verify the highlight is not flickering on and off over time (something so absurd you wouldn't even have tested for it before AI).

The value of these tests is in catching areas of the code that are drifting toward nonsense because humans aren't reviewing as thoroughly. I don't think you can realistically have 100% coverage, prevent every single bug, and never review the code. It's just that I've found somewhat more tests are warranted if you do want to step back.
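
To make the flicker case concrete, here is roughly what such a test could look like. This is just a sketch assuming Playwright; the .item/.active selectors and the /list route are made up:

    // Rough sketch of the "no flicker" test, assuming Playwright.
    // The .item/.active markup and the /list route are hypothetical.
    import { test, expect } from '@playwright/test';

    test('active highlight does not flicker over time', async ({ page }) => {
      await page.goto('/list');
      await page.locator('.item').first().click();

      // Sample repeatedly: exactly one item should be highlighted at
      // every sample across the whole window (~5 seconds here).
      for (let i = 0; i < 20; i++) {
        await expect(page.locator('.item.active')).toHaveCount(1);
        await page.waitForTimeout(250);
      }
    });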

hrmtst93837 |root |parent |next |previous [-]

The tough part is that the "core team" can't see inside most model updates, so even if you have great tests, judgment calls by the model can change silently and break contracts you didn't even know you had. Traditional monitoring can catch obvious failures, but subtle regressions or drift in LLM outputs need their own bag of tricks. If you treat LLM integration like any other code library, you'll be chasing ghosts every time the upstream swaps a training data set or tweaks a prompt template.
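
One of the simpler tricks is to pin those implicit contracts as executable checks run against the live model, so a silent upstream change fails loudly instead of drifting. A minimal sketch, where callModel is a hypothetical stand-in for whatever provider wrapper you use and the contracts are invented:

    // Sketch of contract checks for LLM outputs, not a full eval harness.
    import assert from 'node:assert';

    // Hypothetical stand-in: wire this to your actual provider wrapper.
    async function callModel(prompt: string): Promise<string> {
      throw new Error('plug in your model provider here');
    }

    // Properties every acceptable output must satisfy, checked on a
    // schedule so drift shows up as a failing check, not a ghost hunt.
    const contracts: { prompt: string; check: (out: string) => boolean }[] = [
      {
        prompt: 'Summarize this refund policy in one sentence: ...',
        check: (out) => out.split('.').filter(Boolean).length <= 2,
      },
      {
        prompt: 'Extract the order ID from: "order #A-1234 was delayed"',
        check: (out) => out.includes('A-1234'),
      },
    ];

    for (const { prompt, check } of contracts) {
      const out = await callModel(prompt);
      assert(check(out), `contract broken for prompt: ${prompt}`);
    }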

_pdp_ |root |parent [-]

This is no different from receiving PRs from anonymous users on the Internet. Some of the more successful open-source projects are already doing this at scale.

flykespice |root |parent |next |previous [-]

> Nothing changes in terms of how to make reliable software. You need the same things like unit tests, integration tests, monitoring tools, etc.

It does change in that it doubles the work you have to do to verify your system compared to writing the code from scratch, because you first have to figure out whatever code your AI agent spat out before beginning the formal verification process.

When you have written the code from scratch, you already know it beforehand, and the verification process is much smoother.

ok123456 |root |parent |previous [-]

Exactly. NO SILVER BULLET.

sriramgonella |next |previous [-]

Generating code is easy; maintaining correctness over time is the harder problem. I’m curious whether the future stack ends up combining AI code generation with property-based testing and automated verification tools to ensure systems remain reliable even as more code is machine-generated.
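
For the property-based half, the appeal is that invariants survive regeneration even when the code doesn't. A small sketch with fast-check, where sortItems stands in for whatever the machine generated:

    // Property-based testing: assert invariants over generated inputs
    // instead of hand-picked cases. Sketch using fast-check.
    import fc from 'fast-check';

    // Stand-in for the machine-generated code under test.
    const sortItems = (xs: number[]): number[] => [...xs].sort((a, b) => a - b);

    fc.assert(
      fc.property(fc.array(fc.integer()), (xs) => {
        const out = sortItems(xs);
        if (out.length !== xs.length) return false;            // no items lost
        return out.every((v, i) => i === 0 || out[i - 1] <= v); // ordered
      })
    );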

OutOfHere |next |previous [-]

"Spec validation" is extremely underrated. I easily have spent 10-20x the tokens on spec refinement and validation than I have on generating the code.

sastraxi |next |previous [-]

The idea is interesting, but have some more respect for your potential readers and actually write the post. There's so much AI sales drivel here that it's hard to see what's interesting about your product. I'm more interested in the reasoning behind your design decisions than in being told "trust me, it'll work".

dude250711 |next |previous [-]

AI Era, Agentic Era, LLM Era...

Can we settle on Slop Decade?

bigblind |root |parent |next [-]

I guess it's too late for an "Eternal sunshine of the slopless mind"?

minraws |root |parent |next |previous [-]

Into the Slopverse.

SkyeCA |root |parent |next |previous [-]

Eternal Sloptember

duskdozer |root |parent [-]

One can dream there's a way out...

aleph_minus_one |root |parent [-]

I propose a "proof of quality" consensus mechanism. :-)

duskdozer |root |parent [-]

Sounds great -- let me fire up my agent swarm to get started on orchestrating the development of a planning spec.

prox |root |parent [-]

Have your agent contact my agent; we will never be in touch.

OutOfHere |root |parent |previous [-]

Shallow dismissals are not permitted on this site as per its rules.

forgetfreeman |root |parent [-]

Not everything presented is worthy of a 5,000-word Atlantic-style deconstruction. I think the community largely embraces Hitchens' razor. That being the case, what is the minimum word count required to issue a dismissal?

OutOfHere |root |parent [-]

Please. If you don't want to deconstruct, then don't post a shallow comment, especially a clichéd, shallow AI-hating comment.

As for Hitchens' razor, when it does apply, it applies implicitly, with no need for a comment or explicit mention.

esafak |next |previous [-]

I haven't even used TLA+ yet and now it's got derivatives... My understanding is: TLA+ but like C, functional, and typed.
