How do I know if my AI support is working?

Q: How do I know if my AI support is working?

Look at five numbers together: resolution rate , deflection rate , answer quality (correct and cited), escalation accuracy , and factual error rate. If resolution and deflection are climbing while quality and CSAT hold steady, your AI support is working.

Written by

Kira

Reviewed by

Katelin Teen

Last edited June 17, 2026

Expert Verified

Illustration of a support dashboard measuring whether an AI support agent is working

TL;DR

"Is my AI support working?" comes down to five numbers read together, not one: resolution rate, deflection rate, answer quality, escalation accuracy, and factual error rate. If resolution and deflection are climbing while quality and CSAT hold, it's working. If volume is up but satisfaction is sliding, or the bot is answering confidently without citing anything, it isn't, no matter how busy it looks.

The trap is judging an AI agent by activity ("it replied to 4,000 tickets!") instead of outcomes. A bot can be enormously busy and quietly wrong. The fix is to watch the small set of metrics below, learn to spot the green and red flags, and ideally simulate against your real past tickets before you ever trust it live.

I've spent the last three-plus years putting AI agents on live support queues at eesel, so this is the scorecard I'd actually use.

Why "is it working?" is harder than it looks

Here's the thing most dashboards won't tell you. The scary failure mode for AI support isn't the bot going silent, it's the bot sounding great while being wrong.

I've watched a confident-sounding agent tell a customer "yes, we support your car model" for brands that weren't in the knowledge base at all, simply because someone had written "we support all models" in a help doc. The bot wasn't broken. It was doing exactly what it was told, and it looked perfectly fluent doing it. That's the reason we now simulate every rollout against historical tickets first, instead of flipping a switch and hoping.

So before we get to numbers: "working" means the AI is resolving the right tickets correctly, handing off the rest cleanly, and not inventing things in between. Activity counts are vanity. Outcomes are the truth. If you only take one idea from this post, make it that one.

The five numbers that actually tell you

When teams ask me how to read their AI agent, I point them at the same five customer support metrics. Read together, they catch almost every failure mode.

A support health scorecard showing the five metrics that tell you if AI support is working: resolution rate, deflection rate, answer quality, escalation accuracy, and factual error rate

1. Resolution rate

The headline number: what percentage of tickets the AI closed end to end with no human touching them. This is the one that ties straight to cost, because every resolved ticket is one an agent didn't have to open.

What's "good"? It depends entirely on your ticket mix, but tier-1 is where AI earns its keep first. One gig-economy driver-analytics app on Zendesk told us, in a public G2 review, that eesel was resolving 73% of their tier-1 requests in the first month, with results showing inside a 7-day trial. At the other end, an internal IT helpdesk running on Jira started at 15% deflection and set a 55% target. Both are "working." The point isn't a universal benchmark, it's the trend: is the resolution rate climbing as you feed the agent more knowledge?

2. Deflection rate

Deflection and resolution get used interchangeably, and they shouldn't be. Resolution is a ticket closed without a human. Deflection is a customer who got their answer and never opened a ticket at all, usually through a chat widget or self-service before the conversation ever became a support case.

It's worth tracking on its own, because a high deflection rate is what quietly shrinks your queue. If you want the precise definition and the formula, we wrote up what deflection rate is and how to improve it, and a separate piece on measuring AI deflection versus human deflection so you don't double-count.

3. Answer quality

Volume without quality is the trap. So the real question under resolution is: when the AI answered, was it right, and did it show its work?

This is measurable. In one week-long sample across 581 chats, we scored chat quality at 96%. In another sample of 434 chats, the breakdown was 86% good, 7% partial, 6% deflected, and 1% outright failed, and on real webhook-triggered tickets (the harder test) it was 79% good. The exact method matters less than having one: grade a sample of answers for correctness and whether they carried a citation. An answer with no source attached is an answer you can't trust. A legal-tech founder we work with put it well, that with eesel they could "set exact guardrails on sourcing and it always provides transparent citations", which in their world is the difference between helpful and a lawsuit.

4. Escalation accuracy

A good AI agent knows what it doesn't know. So escalation accuracy is really a measure of judgment: when the AI was unsure, did it hand off to a human instead of guessing?

This is the most underrated number on the list. You want an agent that resolves confidently and escalates honestly, not one that answers everything. A support lead at an SMS platform captured the ideal in a G2 review: the AI "answers confidently but not too confidently." That second half is the whole game. Track your escalation rate and, more importantly, whether escalations land at the right moment.

5. Factual error rate

Finally, the number that should keep trending toward zero: how often the AI says something untrue. This is separate from "partial" answers. A factual error is the bot stating a wrong fact as if it were certain.

In one trial on a German jewelry retailer's real Zendesk traffic (around 1,000 tickets a month), we measured 93% triage accuracy and 100% spam detection with zero false positives, but also a 7% factual error rate that told us exactly where the knowledge base had gaps. That 7% wasn't a reason to abandon the rollout. It was a map. Every factual error points at a missing or contradictory doc, which is usually fixable, and is the bulk of what our guide on preventing AI hallucinations in support is about.

The green flags (and the red flags)

Numbers tell you the trend. But there are qualitative signals you can read in a single afternoon of skimming transcripts, and they're often faster than waiting for a month of data.

A two-column comparison of green flags versus red flags for AI support: working signs include citing sources, escalating when unsure, and a climbing resolution rate; warning signs include confident wrong answers, over-promising to customers, and volume up while CSAT is down

The green flags are the easy ones. The AI cites its sources on every answer. It escalates when it's actually unsure rather than bluffing. Your resolution rate is trending up week over week, not flat. And, the simplest tell of all, your agents stop complaining about repetitive tickets.

The red flags are where I'd spend your attention, because they hide behind good-looking dashboards:

Confident wrong answers. The car-model example from earlier. The bot is fluent, certain, and incorrect. This is the single most damaging pattern, because customers believe it.
Over-promising. I've seen agents reassure customers in ways the business can't back. One support manager flagged it bluntly to us, telling the AI to "stop telling customers that we will get them sorted. You don't know that," and "stop promising customers things we can't do." If your bot is making commitments about delivery dates or outcomes, that's a control problem, not a knowledge one.
Volume up, CSAT down. The agent is handling more, and customers are less happy. That divergence is the clearest sign that "busy" and "working" have come apart.
No citations. If you can't see where an answer came from, neither can the customer, and neither can you when you're auditing it later.

Most of these trace back to knowledge gaps or missing guardrails, which is good news, because both are fixable without ripping anything out. The common AI chatbot problems post goes deeper on the usual culprits.

Where to actually look

All of this assumes you can see what your AI agent is doing. If your tool only shows you a total count, that's the first thing to fix, because you can't manage what you can't read.

The two views I check most are the reports dashboard and the raw activity log. The reports view is where the trend lives: task volume over time, how tasks were triggered (chat, email, internal note), and how many AI actions were approved, rejected, or are still awaiting a human decision. That approval-versus-rejection ratio is a fast proxy for trust.

The eesel AI reports dashboard showing task volume, trigger events by type, and approval and rejection usage per tool

The activity log is where you read the actual work. Every conversation, its channel, the linked ticket, and whether it ended resolved or pending. This is where you go to spot-check answer quality and catch the confident-wrong cases the aggregate numbers smooth over. I'd skim ten of these a week, minimum.

The eesel AI activity log showing individual conversations with their channel, linked ticket, and resolved or pending status

If your helpdesk already runs CSAT surveys on closed conversations, tie those scores back to AI-handled tickets specifically. That single cut, CSAT on AI-resolved tickets versus human-resolved ones, settles most "is it actually good?" debates faster than anything else.

Don't wait until it's live to find out

Here's the part most teams skip, and the one I'd push hardest on: you don't have to discover whether your AI works by testing it on real customers.

The mistake is treating go-live as an on/off switch. The better model is a ramp. You graduate the AI through stages of autonomy as the numbers earn it, not before.

A trust ramp showing three stages: draft for review where a human sends every reply, semi-autonomous where the AI auto-replies only above a confidence threshold, and fully autonomous where the AI resolves tier-1 tickets on its own

You start in draft mode, where the AI writes replies but a human sends every one, so you're scoring quality with zero customer risk. As the answers prove out, you move to semi-autonomous, letting the AI auto-reply only above a confidence threshold and routing the rest to a person. Once the numbers hold, you let it run fully autonomously on the tier-1 cases it has earned.

Even before draft mode, you can run a simulation over thousands of your real past tickets to see exactly how the AI would have answered each one, with a predicted resolution rate, before a single live reply goes out. That German jewelry trial I mentioned ran a 100-ticket cross-validation precisely this way. Simulating first is how you answer "is it working?" before you've risked a single customer. It's also, honestly, the part I wish more teams did, because it turns a leap of faith into a measurement.

One fair caveat, since I work on this: eesel integrates deeply with helpdesks like Zendesk, Freshdesk, and Help Scout, so I'm not a neutral observer of the category. But the ramp-and-simulate approach holds whatever tool you use. If your AI vendor can't show you a dry run against your own tickets, that itself is a yellow flag worth asking about.

Try eesel

If you want to actually answer "is my AI support working?" with numbers instead of vibes, that's the problem eesel is built to solve. You connect your helpdesk and knowledge sources, brief the agent in plain language, and run a simulation on your past tickets to get a predicted resolution rate before going live, then ramp from draft mode to autonomy with the reporting to back every step.

The eesel AI helpdesk dashboard overview, showing connected integrations and AI activity

It runs on usage-based pricing with no per-seat fees, and there's a free tier to test it against your own queue. You can see how other teams measured their rollouts on the customer stories page, or read our practical guide to AI in customer support for the full playbook. Try eesel and find out what your real resolution rate would be.

Frequently Asked Questions

How do I know if my AI support is working?

Look at five numbers together: resolution rate, deflection rate, answer quality (correct and cited), escalation accuracy, and factual error rate. If resolution and deflection are climbing while quality and CSAT hold steady, your AI support is working.

What is a good resolution rate for an AI support agent?

It depends on your ticket mix, but for tier-1 support, 50% or more is a strong target. One eesel customer hit 73% of tier-1 requests resolved in the first month, while an internal IT helpdesk started at 15% and worked toward 55%. Start where your repetitive tickets are and grow from there with tier-1 deflection.

How is AI deflection different from AI resolution?

Deflection means the customer got their answer without ever reaching an agent; resolution means a ticket was fully closed without human help. They overlap but are not the same, which is why it helps to measure AI deflection separately from human deflection and read both alongside your core support metrics.

What are the warning signs that my AI support agent is failing?

The big one is confident wrong answers, when the bot states something untrue without flagging uncertainty. Watch for it over-promising to customers, ticket volume rising while CSAT drops, and answers with no citations. Most of these trace back to knowledge gaps, so read our guide on preventing AI hallucinations in support.

Can I test my AI support before letting it reply to customers?

Yes, and you should. eesel lets you run a simulation against thousands of your past tickets to see exactly how the AI would have answered, with a predicted resolution rate, before a single live reply. Then you ramp from draft mode to full autonomy as the numbers earn it. See the full approach in our practical guide to AI in support.

Hire your AI teammate

Set up in minutes. No credit card required.

Try for free Book a demo

Share this article

Article by

Kira

Kira is a writer at eesel AI with a Computer Science background and over a year of hands-on experience evaluating AI-powered customer service tools. She focuses on breaking down how helpdesk platforms and AI agents actually work so that support teams can make better buying decisions.