What is AA-Briefcase? The AI benchmark for real knowledge work, explained

Written by Alicia Kirana Utomo

Reviewed by Katelin Teen

Last edited June 22, 2026

Expert Verified
An open briefcase spilling documents, spreadsheets, emails and chat messages while an AI figure grades them on a scorecard

What AA-Briefcase actually measures

Most AI benchmarks ask short, self-contained questions: a math problem, a coding puzzle, a multiple-choice quiz. That's fine for measuring raw reasoning, but it's nothing like how people actually use these models at work. Real knowledge work is long, ambiguous, and buried in mess.

AA-Briefcase was built to close that gap. Instead of a prompt, each model is dropped into a multi-week business project with many linked tasks and thousands of source files, and asked to produce the kind of deliverables a real analyst or PM would: financial models, board presentations, design mock-ups, strategy memos. The scenarios were developed over months by industry experts from companies including Google, McKinsey and Boston Consulting Group, so the work resembles what those firms actually do.

The numbers give a sense of the scale. There are four held-out project scenarios and 91 tasks in total, drawn from data science, product management, and corporate strategy. Across them sit nearly 2,000 source files, which between them hold more than 3,500 emails and 25,000 Slack messages, deliberately fragmented and full of realistic contradictions. The four scoring scenarios are a Data Science project, a Product Management project, a Banking Operations transformation, and a Heavy Industry Strategy build; a fifth Due Diligence scenario is public and doesn't count toward scores.

That framing matters because it mirrors the failure mode of every AI agent I've ever shipped: the model rarely struggles with the idea, it struggles with finding the one requirement hidden in file 1,400 and not contradicting the email that quietly overrode it.

How AA-Briefcase grades a model

Here's where AA-Briefcase gets clever. A single score would hide the most interesting thing about AI output, which is that looking professional and being correct are two completely different skills. So every task is graded on three separate dimensions.

How AA-Briefcase grades a model: messy files feed an AI agent in a sandbox, which produces deliverables that are scored on a rubric, analytical quality and presentation, then combined into one Elo

The first is a binary rubric: pass or fail on each check, no partial credit. Did the model follow instructions, dig out requirements scattered across files, use the right evidence, and reach the correct conclusion? The second is analytical quality, judged by pairwise comparison against another model's submission: which deliverable is more thorough and better supported? The third is presentation, also pairwise: which output is more professionally put together?

Those three feed into one headline number, the AA-Briefcase Elo, which blends analytical-quality Elo, presentation Elo, and rubric pass rate using maximum-likelihood Elo aggregation. To keep any one model family from grading itself favourably, every comparison is decided by a panel of three judges: Claude Opus 4.8, GPT-5.5, and Gemini 3.1 Pro Preview.
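
To make "maximum-likelihood Elo aggregation" a bit more concrete, here's a minimal sketch of one common way to do it: fit a Bradley-Terry model to pairwise outcomes, then report the strengths on an Elo-style scale. Everything in it is an assumption for illustration. The majority vote across three judges, the 400-point scale, the 1000-point baseline and the model names are mine, not Artificial Analysis' published method, and the sketch covers a single pairwise dimension only. It doesn't attempt the blend with rubric pass rate.

```python
import math
from collections import Counter, defaultdict


def majority_winner(votes):
    """Pick the model most judges preferred (three votes over two models cannot tie)."""
    return Counter(votes).most_common(1)[0][0]


def fit_elo(comparisons, iterations=200, base=1000.0):
    """Maximum-likelihood Bradley-Terry fit, reported on an Elo-style scale.

    comparisons: list of (model_a, model_b, winner) with model_a != model_b.
    """
    models = sorted({m for a, b, _ in comparisons for m in (a, b)})
    wins = defaultdict(float)
    games = defaultdict(float)  # comparisons played per unordered pair
    for a, b, winner in comparisons:
        wins[winner] += 1.0
        games[frozenset((a, b))] += 1.0

    strength = {m: 1.0 for m in models}
    for _ in range(iterations):
        updated = {}
        for m in models:
            denom = 0.0
            for pair, n in games.items():
                if m in pair:
                    (other,) = pair - {m}
                    denom += n / (strength[m] + strength[other])
            # Tiny floor so a model with zero wins keeps a finite rating.
            updated[m] = max(wins[m], 1e-6) / denom
        # Strengths are only defined up to scale, so pin the geometric mean at 1.
        log_mean = sum(math.log(s) for s in updated.values()) / len(updated)
        strength = {m: s / math.exp(log_mean) for m, s in updated.items()}

    return {m: base + 400.0 * math.log10(s) for m, s in strength.items()}


# Made-up judge votes for illustration: (model_a, model_b, [vote per judge]).
panel_votes = [
    ("model-x", "model-y", ["model-x", "model-x", "model-y"]),
    ("model-y", "model-z", ["model-y", "model-z", "model-y"]),
    ("model-x", "model-z", ["model-x", "model-x", "model-x"]),
    ("model-y", "model-x", ["model-y", "model-y", "model-x"]),
    ("model-z", "model-y", ["model-z", "model-z", "model-y"]),
]
comparisons = [(a, b, majority_winner(votes)) for a, b, votes in panel_votes]
ratings = fit_elo(comparisons)
for model, rating in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{model}: {rating:.0f}")
```

The useful property of fitting all comparisons at once, rather than updating ratings match by match, is that the result doesn't depend on the order the comparisons happened to be judged in.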

The plumbing is open too. Models run on Stirrup, Artificial Analysis' open-source agent harness, inside an offline sandbox with no internet, for up to 500 turns per task. It's a genuinely demanding setup, and it's a fair bit closer to a real agentic workflow than a chat window is.

What the results actually say

The leaderboard up top tells the happy story (Claude Fable 5 in front, capability tiers neatly stacked). The harder story is in the pass rates.

Bar chart: pass rate falls from 55% on prompt-only checks to 40% on checks needing five or more files, with a callout that the top model passes all checks on just 3% of tasks

Even the leading model satisfies all rubric criteria on just 3% of tasks, and on 31 of the 91 tasks no model scores above 50%. Difficulty also scales with the number of required files: high-intelligence models drop from around 55% on prompt-only checks to about 40% once a check needs five or more files. The more a task looks like real work, the worse everyone does.
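
It can look odd that a model passes roughly half of its checks yet almost never aces a task, so here's a small sketch of the arithmetic. The data is made up, and the layout (each check recorded with a pass flag and the number of files it depends on) and the middle "1-4 files" bucket are my assumptions, not the benchmark's actual schema. The point is only that a pooled pass rate and an "every check passed" rate are very different measurements.

```python
from collections import defaultdict

# Made-up results for illustration: task -> list of (check_passed, files_needed).
results = {
    "task-01": [(True, 0), (True, 2), (False, 6)],
    "task-02": [(True, 0), (True, 1)],
    "task-03": [(False, 0), (True, 3), (False, 7), (False, 9)],
}


def check_pass_rate(results):
    """Share of individual rubric checks passed, pooled over all tasks."""
    checks = [passed for task in results.values() for passed, _ in task]
    return sum(checks) / len(checks)


def perfect_task_rate(results):
    """Share of tasks where every single check passed."""
    perfect = sum(all(passed for passed, _ in task) for task in results.values())
    return perfect / len(results)


def bucket(files_needed):
    """Group checks by how many source files they depend on (cut-offs are assumed)."""
    if files_needed == 0:
        return "prompt-only"
    return "5+ files" if files_needed >= 5 else "1-4 files"


def pass_rate_by_bucket(results):
    """Pass rate of checks within each file-count bucket."""
    tally = defaultdict(lambda: [0, 0])  # bucket -> [passed, total]
    for task in results.values():
        for passed, files_needed in task:
            tally[bucket(files_needed)][0] += int(passed)
            tally[bucket(files_needed)][1] += 1
    return {name: passed / total for name, (passed, total) in tally.items()}


print(f"check pass rate:   {check_pass_rate(results):.0%}")
print(f"perfect-task rate: {perfect_task_rate(results):.0%}")
print(pass_rate_by_bucket(results))
```

Because a task with many checks only counts as perfect when all of them pass, one missed requirement in a deeply buried file is enough to sink it, which is why the all-criteria figure sits so far below the per-check one.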

The leaderboard has a few takeaways worth pulling out. GLM-5.2 is the clear open-weight leader and the price/performance standout, landing roughly 90 Elo below Claude Opus 4.8 for less than a quarter of the cost. MiniMax-M3 and GLM-5.2 both overperform relative to their general intelligence scores, while Google's Gemini models actually underperform on AA-Briefcase compared to where they sit on broad intelligence rankings. And as the cost view in the widget shows, the gap between the priciest and cheapest model is more than 800×, which is a useful reminder to weigh what an AI agent really costs against what it actually delivers.
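
If you're wondering what a 90-point gap means in practice, the conventional Elo curve gives a rough feel. This is a sketch under assumptions: it uses the standard logistic formula with a 400-point scale, which the write-up doesn't confirm for AA-Briefcase, and the headline score also blends in rubric pass rate, so treat the output as an approximate reading rather than an official win rate.

```python
def expected_win_rate(elo_gap, scale=400.0):
    """Expected head-to-head win rate for the higher-rated side under the standard Elo curve."""
    return 1.0 / (1.0 + 10.0 ** (-elo_gap / scale))


gap = 90  # "roughly 90 Elo", the leaderboard gap quoted above
print(f"{expected_win_rate(gap):.1%}")
```

Under those assumptions, a 90-point lead works out to the higher-rated model being preferred in roughly 63% of comparisons. That's a real edge, but a long way from a sweep, which is what makes the cost difference interesting.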

The "looks right but is wrong" problem

My favourite finding in the whole release is a behavioural one, and it explains a lot about why AI work can feel untrustworthy.

Bar chart of view-image calls per task: Claude Fable 5 at 21, Claude Opus 4.8 at 12, GPT-5.4 Mini at 2, and Gemini 3.1 Pro at 0.1, which submits files it never looked at

The models that score best on presentation are the ones that actually look at their own rendered output. Claude Fable 5 made about 21 view-image calls per task and Opus 4.8 about 12, while some models submitted files they'd barely glanced at (Gemini 3.1 Pro Preview averaged roughly 0.1 view-image calls per task). It turns out "check your work before you hand it in" is as good advice for an AI as it is for a person.

There's a deeper point underneath. AA-Briefcase separates polish from correctness precisely because a confident, well-formatted answer that's quietly wrong is more dangerous than an obviously incomplete one. That's the exact risk that shows up when an AI chatbot answers a customer, and it's why preventing hallucinations is the whole ballgame in support, not a nice-to-have.

Why a leaderboard score isn't a deployment plan

So a frontier model can do real knowledge work, sometimes brilliantly, and still whiff most of the time on the hardest, most file-heavy tasks. If you take one thing from AA-Briefcase, take this: a benchmark rank is a general capability signal, not a promise about how a model behaves on your messy data.

I've watched this play out firsthand. We've spent years putting AI agents on live support queues, and the thing that bites teams isn't whether the underlying model is smart enough in the abstract; it's whether it stays accurate on their specific tickets, their product quirks, and their edge cases. A model that tops every public leaderboard can still confidently misquote your refund policy on day one, long before it ever gets to automated ticket resolution. That's not a knock on the model; it's the difference between a benchmark and production.

The fix is the same instinct AA-Briefcase is built on: grade the work against ground truth before you trust it. For a helpdesk, that means running the AI against your own historical tickets and seeing exactly what it would have replied, rather than reading a spec sheet and hoping. Think of it as running your own private AA-Briefcase, where the test set is your real support history.
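
Here's roughly what that replay loop could look like, as a minimal sketch rather than anyone's real product API. The `Ticket` fields, the `replay` function and the stub agent are all hypothetical names I've made up, and grading by "does the reply mention the required phrases" is a deliberately crude stand-in for proper ground-truth checks.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Ticket:
    ticket_id: str
    question: str
    must_mention: list[str]  # phrases the known-good reply contained


def replay(tickets: list[Ticket], agent: Callable[[str], str]) -> dict:
    """Run the agent over past tickets and record every reply that misses a required phrase."""
    failures = []
    for ticket in tickets:
        reply = agent(ticket.question)
        missing = [p for p in ticket.must_mention if p.lower() not in reply.lower()]
        if missing:
            failures.append({"ticket_id": ticket.ticket_id, "reply": reply, "missing": missing})
    passed = len(tickets) - len(failures)
    return {"pass_rate": passed / len(tickets), "failures": failures}


def stub_agent(question: str) -> str:
    """Stand-in for a real model call; always quotes the same policy."""
    return "You can request a refund within 30 days of purchase."


# Made-up ticket history for illustration.
history = [
    Ticket("T-1", "How long do I have to get a refund?", ["30 days"]),
    Ticket("T-2", "Can I get a refund on an annual plan?", ["14 days", "annual"]),
]
report = replay(history, stub_agent)
print(f"pass rate: {report['pass_rate']:.0%}")
for failure in report["failures"]:
    print(failure["ticket_id"], "missing:", failure["missing"])
```

Keeping the exact reply next to each failure matters more than the headline rate: it's the confidently wrong answers you want to read before a customer does.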

Try eesel for AI support you can actually trust

If AA-Briefcase convinced you that capability and reliability aren't the same thing, that's the exact problem eesel AI is built around. eesel works like a new support teammate that plugs into your existing helpdesk and knowledge base in minutes, then lets you simulate it on thousands of your past tickets before it ever talks to a customer, so you see its real resolution rate and exact answers up front instead of guessing from a leaderboard.

eesel AI's reports dashboard, where teams forecast resolution rates and review how the AI would have handled past tickets before going live

You stay in control of what it's allowed to answer and when it escalates, and it's free to try on your own data. If you're evaluating AI for customer service, that simulate-first approach is the closest thing to bringing AA-Briefcase's "prove it on real work" rigour to your own queue.

Frequently Asked Questions

What is the AA-Briefcase benchmark?
AA-Briefcase is a benchmark from Artificial Analysis that tests AI models on realistic, multi-week knowledge-work projects rather than one-off questions. Each project hands the model thousands of messy source files and asks for real deliverables like financial models and board decks, then grades whether the work is actually correct. It's one of the closest public proxies for how an AI agent performs on genuine office work.

Which AI model is best at AA-Briefcase?
At launch on 18 June 2026, Claude Fable 5 tops the AA-Briefcase Elo with roughly 1587, ahead of Claude Opus 4.8 and the open-weight leader GLM-5.2. The full ranking is in the interactive leaderboard near the top of this post, and you can re-check the live numbers on the Artificial Analysis evaluation page.

How is AA-Briefcase scored?
Each task is graded on three dimensions: a binary rubric for verifiable correctness, a pairwise Elo for analytical quality, and a pairwise Elo for presentation. Those combine into a single AA-Briefcase Elo, with a three-model judge panel deciding each comparison to limit same-family bias.

Why do AI models score so low on AA-Briefcase?
The work is genuinely hard: the top model passes every rubric check on only 3% of tasks, and on 31 of 91 tasks no model clears 50%. Difficulty also rises with the number of files a task needs, which is exactly the kind of fragmented context that trips up AI in production.

Does a high AA-Briefcase score mean the model is safe to deploy?
No. A leaderboard rank tells you a model is capable in general, not that it'll be reliable on your data and workflows. The safer path is to test against your own historical work first, the way eesel lets support teams simulate an AI agent on past tickets before it ever replies to a customer.

How is AA-Briefcase different from other AI benchmarks?
Most benchmarks score short, self-contained questions. AA-Briefcase scores long-horizon projects with linked tasks and contradictory source files, and it separates outputs that look polished from outputs that are actually right. That makes it more relevant to anyone weighing AI versus human work on real business tasks.

Can I use AA-Briefcase to pick an AI tool for customer support?
It's a useful capability signal, but support tools are more than a raw model. What matters for AI customer service is how the system retrieves your knowledge, escalates, and avoids confident wrong answers. Pair the benchmark with a real trial on your own tickets, like the simulation in eesel AI, before committing.

Article by Alicia Kirana Utomo

Kira is a writer at eesel AI with a Computer Science background and over a year of hands-on experience evaluating AI-powered customer service tools. She focuses on breaking down how helpdesk platforms and AI agents actually work so that support teams can make better buying decisions.
