Hacker News

Launch HN: Voker (YC S24) – Analytics for AI Agents

59 points by ttpost ago | 22 comments

Hey HN, we're Alex and Tyler, co-founders of Voker.ai (https://voker.ai/), an agent analytics platform for AI product teams. Voker gives full visibility into what users are asking of your agents, and whether your agents are delivering, without having to dig through logs. Our main product is a lightweight SDK that is LLM stack agnostic and purpose-built for agent products. (https://app.voker.ai/docs)

Agent Engineers and AI product teams don’t have the right level of visibility into agent performance in production, which results in bad user experiences, churn, and hundreds of hours wasted with spot checks to find and debug issues with agent configurations.

Demo: https://www.tella.tv/video/vid_cmoukcsk1000i07jgb4j65u67/vie...

We recently conducted a survey of YC Founders and 90%+ of respondents said that the only way they know if their Agents are failing users in production is by hearing complaints from customers. They push a prompt change hoping that it fixes the problem and doesn’t break something somewhere else, and the cycle repeats.

We saw tons of observability and evals products popping up to try to address these problems, but we still felt like something was missing in the agent monitoring stack. Obs is good for individual trace debugging but is only accessible to engineers. Evals are good for testing known issues, but don't give insights into trends that teams don’t expect, so engineers are always playing catch up. Traditional product analytics tools do a good job tracking clicks and pageviews across your product surface but weren’t built ground up for agent products. Knowing what users want out of agents, and whether the agent delivered requires specific conversational intelligence / unstructured data processing techniques.

We came up with the agent analytics primitives of Intents, Corrections, and Resolutions to describe something pretty much all conversational agents had in common: a user will always come to an agent with an intent, the user might have to correct this agent on the way to getting their intent resolved, and hopefully every intent a user has is eventually resolved by the agent. Voker processes LLM calls by automatically annotating individual conversations and picking out user intent and corrections. Voker takes these and uses LLMs and hierarchical text classification to create dynamic categories that give higher level insights so you don’t have to read individual conversations to know what are the main usage patterns across your users.

The most common substitute solution we’ve seen is uploading obs logs to Claude or ChatGPT and asking for summary insights. There are a few problems with this - mainly that LLMs aren’t good at math or data science, so you don’t get accurate or consistent statistics. Its highly likely that the LLM overfits to some insights and underfits to others. The LLM isn’t programmatically reading and classifying each individual session or interaction. This is why we don’t use LLMs for any of our core data engineering (processing events, calculating statistics) so the analytics we produce are consistent, reproducible, and accurate. We have a publicly available, lightweight SDK that wraps LLM calls to OpenAI, Anthropic and Gemini in Python and Typescript. Voker handles the data engineering to turn raw data into usable analytics primitives and higher level insights. Free tier: 2,000 events / mo, requires email signup. Paid plans start at $80/mo with a 30 day free trial.

We'd love to hear how you're currently detecting trends, and if you try Voker, tell us what part of our analysis is valuable, and what still feels missing. Thanks for reading, and we’re looking forward to your thoughts in the comments!

jorisw |next [-]

Some notes on the sales website.

- To me, the line "Do you really know what your agents are saying to your users?" doesn't match at all with the screenshot directly above, which is the first screenshot on the page. On first glance, all that screenshot conveys to me is "some analytics app". Perhaps the first graphic could better express what about agents' activities, is being made easier to inspect.

- I click "How it Works" and I just get vaguely described screenshots. Only from reading the Python import line in the fourth screenshot, I get that it acts as a middleware by sitting in for the OpenAI import. Maybe this nav should link to the section above, with the 3 integration steps?

- Scrolling down and seeing Intents vs Corrections vs Resolutions, I'm actually getting a sense of what Voker does. To me, that still doesn't fully align with "Do you really know what your agents are saying to your users?"

- I'm mildly amused by the fact that whiteboard desk guy is copying roadmap suggestions from ChatGPT.

ttpost |root |parent [-]

thanks for the feedback - GREAT point, definitely going to noodle on this one - Yeah we're about to add a product demo video lower down on the page that this button will jump to - what do you think about that instead? - I see your point.... hmm. maybe: "Do you really know what your users are asking of your agents and whether they deliver?" is a more direct phrasing? - hehe you found us out! that's how we build our whole product! (JK - thats actually our linear agent taking a look at the roadmap doc that we put together so we could write out those features for a whiteboarding session) but good eye ;)

akslp2080 |next |previous [-]

How is it different than Langfuse? sorry if I am off the track but Langfuse also provides some detailed tracing of agentic behavior and decisions.

ttpost |root |parent [-]

We get this question a lot! We work hand-in-hand with obs tools like Langfuse. Langfuse is great for debugging technical issues on individual traces like timing conditions that resulted in failed API calls.

Voker focuses on product, business and user outcomes - like what intents did the user bring to your agent that you might not expect. We're built for the whole product team, whereas Langfuse focuses on engineers specifically.

One way to think about it would be: a PM notices in Voker that a new intent category is coming up frequently and the agent isn't handling it well. The PM can dig into the data with visualizations or our conversation reconstructions. Once they confirm its a real issue worth addressing, they can link their investigation to the AI engineer - who can use Voker AND Langfuse to debug and implement a fix/improvement.

zwaps |root |parent |next [-]

For instance, I have intent classifiers running on my traces and most tools offer some sort of analysis agents or API so it's claude sdk and go.

Maybe let's take Langsmith. Now I know my gripes with that product. How do you see it? What do you add, specifically?

zwaps |root |parent |next [-]

Maybe as a comment, you really put weight on intent classification. I am not sure why. For it to work, you are gonna need my expert domain input. And given that, I feel like the classification bit is basically solved. I wonder a bit why this is the feature you seem to put front and center (e.g. screenshots)

ttpost |root |parent |previous [-]

tl;dr: Langsmith + homegrown intents doesn't scale with contributors and agent usage as an Analytics solution. Voker adds trend and usage insights on collaborative dashboards that work for the whole AI product team.

Nice, sounds like you've set up your own solution in house. We definitely see some teams do that, and for some it works perfectly, for others, its too expensive to maintain - they get new requests for new dashboards or different subcuts of data from product or design teams, or they run into an issue like way too many intents generated to be useful, and its not worth the tradeoff of investing time in building internal tooling. But for some it makes sense to roll your own! It also really depends on how many people on the team are involved in building the agent products, and how much volume your agents have. If you have millions of conversations a month with thousands of unique intents, you have to set up data eng pipelines just to process categorize, and store all that data in a way thats usable for the whole team.

When it comes to Langsmith, we hear about them a lot from our customers, pretty much all of them love it as an obs tool, but most say that only the engineers have access or spend time in it, and they've told us the strength of Langsmith is its technical tracing, not its visualizations, ui, or usability. They've told us any "insights" are very canned (because thats not Langsmith's key focus).

We add self-serve analytics - like how Google Analytics lets marketers see how their website is performing without needing to ask engineers to write SQL queries on cloudwatch logs.

Ex: PM can self-serve and look at trends in what users are asking of agents, notice a problem, do a quick RCA, look for reproducibility across other sessions - before deciding to assign as an issue to engineer. Old way would be: PM hears a complaint from a customer, asks the engineer to "look into it" and the eng spends 4 hours combing through Langsmith logs to hunt down one session without even knowing if its actually a widespread issues

bfeynman |root |parent |previous [-]

do you have experience as PMs? Looking at website, it looks like you just use llms to guess what categories are? Seems like trap for garbage in garbage out. Otherwise you would need someone technical to figure out how to setup the proper KPI monitoring things...

ttpost |root |parent [-]

We do! We have combined experience as PMs, ml engs, and data scientists across many verticals. We also have experience helping PMs and AI eng teams build agents across over 100 customers from our first product.

You're totally right, the analytics annotation primitives we detect (intents, corrections, resolutions) are the cornerstone to all the other analysis in our platform. It's critical that we get those right or all the data and insights in the world are useless.

LLMs are a core part of that detection, but we also do things like hierarchical classification, (https://voker.ai/blog/hierarchical-text-classification-with-...) and will eventually add in other ML methods where applicable. On top of our automated detections, we're building ways for the annotations to improve and adapt to your specific agent product, your data, and your feedback on our annotations.

Our SDK is architected to eventually accept any type of event you want to send as additional information like add to carts, or other conversion metrics that are valuable for analysis on agent performance.

You're definitely right, we don't expect a PM to instrument this all themselves - similar to web analytics or product analytics tools, the engineering team instruments and maintains the integration, and then our app makes the insights and data accessible to not just the engineer but the whole product team.

holoduke |root |parent [-]

Your response is AI. It's a bit ai sloppy as well. Sorry to say that. But as a business owner you can and should do better.

ttpost |root |parent |next [-]

haha oof that's embarrassing if I really sound like AI, but unfortunately it was written by hand by yours truly!

Guess its human-sloppy :(

zwaps |root |parent |previous [-]

how so? genuine question

Damianf19 |next |previous [-]

What's the data model that lets you compare agents that differ a lot in tools/policies? Curious if you normalize on the "what did the user actually accomplish" layer or on raw token/turn metrics, because the two paint completely different pictures of "is this agent working." We struggle with this on the eval side of our own product (email pipeline outcomes, not agents, but same shape).

alrudolph |root |parent [-]

For the agent working, we're focusing on the user outcome, we think that the raw usage, number of turns, function calls are useful operationally but think of those as more observability than the core evaluation target. We do show some of these stats in our conversation view but don't aggregate to compare agents. Longer term we will look to add in more of these features so we can compare quality vs cost, for example

Damianf19 |root |parent [-]

[dead]

Ozzie_osman |next |previous [-]

If the team is here, would love to understand how it compares to something like Amplitude's agent analytics (https://amplitude.com/ai-agents).

ttpost |root |parent [-]

Yeah, this is a confusing one on wording. TLDR: Amplitude is analytics for your web/product data, Voker is analytics for your agent data.

We call Amplitude's feature an "AI Analyst". Essentially Amplitude is layering a LLM copilot on top of their own product - so you don't have to click the buttons or write reports to get insights.

We're an analytics platform built for tracking your agents. Different products with different problems they're solving.

Not sure if this helps, but essentially Amplitude could use Voker to track how well their AI Analyst agent product is actually working!

adrianisbored |root |parent [-]

I think the link is off above, but they're thinking of Amplitude's not yet GA agent analytics, not their general analytics: https://amplitude.com/blog/agent-analytics

ttpost |root |parent [-]

Thanks for clarifying - yes this is a much closer analog to what we're building. That being said, we haven't heard from anyone using it or tried it ourselves yet so can't speak to quality comparison.

From what I can tell in this video, it still seems like Amplitude is focusing on the obs trace details (latency, tokens, etc).

They don't seem to go as deep (or at least don't highlight) as much of the semantic data processing and detection we're doing (intents, corrections, resolutions) - and creating higher level classifications and insights from those. We're completely purpose built for monitoring agent products, so we're striving to do more than just visualizations, we intend to be best-in-category at the actual automated annotations and analysis of agent<>user interaction data.

arm32 |next |previous [-]

Hey, Alex and Tyler! I love your idea—can you reach out to me via email? I'd love to chat about working on it with you.

ggamecrazy |next |previous [-]

> High interaction volume (1k+ chat sessions per month)

I don't mean to be that typical HN commenter but you did lose me a bit there.

I know a lot of people are just getting started with agents but even for a lot of scrappy startups usage is a lot higher than that!

If I may suggest focusing on explaining how you can add value even when usage is super low to controlling costs even when usage can get super high?

I can validate you that it is a true problem that's solved by large companies but you have to hand-roll yourself @ startups (via airflow or queues, etc). But unfortunately one where I am not sure that a lot of stakeholders understand the benefits of (yet!). I think value has to be shown a bit more clearly here, sadly.

Congrats on the launch!

ttpost |root |parent [-]

thanks for the honest insight! Very helpful to hear that you feel the problem is real, that's whats most important to us, we'll keep working on getting the solution and messaging right!

- we say >1K because two reasons: 1. its still feasible (although tedious) to put the full burden on analyzing agent performance and usage on your engineers equipped with obs tools and logs (not ideal but its what most AI teams we see do) 2. you're spot on - it actually surprised us too how few companies (even the really big public ones that have promised to build 200+ agents back in 2024) still have barely one or two agents in prod with only hundreds of convos. 1K convos was our best first guess at the cutoff point where the manual work of digging through traces, and the insights you can get start to make sense enough to need a tool for this. We're definitely planning to tune that number as time goes on!

-We definitely don't have pricing figured out yet, we plan to continue to iterate on the event volumes etc to make sure our product gives clear positive roi for the teams using it at every level. But in general we look at other analytics products as our early barometer. Yes the higher your event volume the more cost, but in my experience (back in Ecomm using Heap analytics) it was so incredibly worth it to pay more as our business and data volumes scaled (we ran on data) I think this is the challenge with all analytics products, they're not useful if you only have 5 site visitors. We see the progression as start with obs/logs -> evals -> analytics as your usage scales.

Curious to hear, whats the session volume of the agent products you run? Every datapoint helps us tune our tiers, pricing, and most importantly our product!

maxrumpf |next |previous [-]

congratulations!

AnouarBoussif |next |previous [-]

[dead]

lukassbrad |next |previous [-]

[flagged]

raymondchau |next |previous [-]

[flagged]

KaiShips |next |previous [-]

[flagged]

robatorvi |next |previous [-]

[flagged]

mnvibe26x7 |next |previous [-]

[flagged]

theuniverseson |next |previous [-]

[flagged]

marsulta |next |previous [-]

[dead]

abhijithbabu |next |previous [-]

[flagged]

WhoffAgents |next |previous [-]

[flagged]

ksi23 |previous [-]

[dead]