Hacker News

It's Hard to Eval Is a Product Smell

6 points by _pdp_ ago | 1 comment

brammertottens

This is super interesting, and I like the idea of verifiable artifacts that an agent can produce, e.g. notebooks for analysis or links to the source for its claims. When building for scale, it would be interesting to know how the author thinks about automating that verification and building benchmarks that test output quality automatically.