What is AA-Briefcase? The AI benchmark for real knowledge work, explained

Written by Alicia Kirana Utomo

Reviewed by Katelin Teen

Last edited June 22, 2026


TL;DR

AA-Briefcase is a new benchmark from Artificial Analysis that grades AI models on real, multi-week knowledge work (financial models, board decks, product specs) instead of clean one-off questions. Each model gets thousands of messy files (emails, Slack threads, spreadsheets) and has to produce actual deliverables, which get scored for correctness, analytical quality, and presentation.

The headline finding is humbling: even the best model passes every rubric check on only 3% of tasks, and on 31 of 91 tasks no model clears 50%. Claude Fable 5 tops the leaderboard, with the open-weight GLM-5.2 punching well above its price.

Here's the part most coverage skips: a high benchmark score tells you a model is capable in general, not that it's safe on your data. That gap is the whole reason I think anyone shopping for AI customer service should test on their own historical work before going live, not just trust a leaderboard.

I build AI agents for a living at eesel, so a benchmark that finally measures messy real work instead of trivia is the kind of thing I drop everything to read. Below is what AA-Briefcase actually measures, how it grades, who's winning, and the one lesson I'd take from it into any AI agent rollout.

AA-Briefcase leaderboard

Approximate launch values, 18 June 2026.

Model                 Capability (Elo)
Claude Fable 5        1587
Claude Opus 4.8       1356
GLM-5.2 (open)        1266
GPT-5.5               1159
MiniMax-M3 (open)     1116
Claude Sonnet 4.6     1081
Gemini 3.5 Flash      870

What AA-Briefcase actually measures

Most AI benchmarks ask short, self-contained questions: a math problem, a coding puzzle, a multiple-choice quiz. That's fine for measuring raw reasoning, but it's nothing like how people actually use these models at work. Real knowledge work is long, ambiguous, and buried in mess.

AA-Briefcase was built to close that gap. Instead of a prompt, each model is dropped into a multi-week business project with many linked tasks and thousands of source files, and asked to produce the kind of deliverables a real analyst or PM would: financial models, board presentations, design mock-ups, strategy memos. The scenarios were developed over months by industry experts from companies including Google, McKinsey and Boston Consulting Group, so the work resembles what those firms actually do.

The numbers give a sense of the scale. There are four held-out project scenarios and 91 tasks in total, drawn from data science, product management, and corporate strategy. Across them sit nearly 2,000 source files, including more than 3,500 emails and 25,000 Slack messages, deliberately fragmented and full of realistic contradictions. The four scoring scenarios are a Data Science project, a Product Management project, a Banking Operations transformation, and a Heavy Industry Strategy build; a fifth Due Diligence scenario is public and doesn't count toward scores.

That framing matters because it mirrors the failure mode of every AI agent I've ever shipped: the model rarely struggles with the idea; it struggles with finding the one requirement hidden in file 1,400 and with not contradicting the email that quietly overrode it.

How AA-Briefcase grades a model

Here's where AA-Briefcase gets clever. A single score would hide the most interesting thing about AI output, which is that looking professional and being correct are two completely different skills. So every task is graded on three separate dimensions.

The first is a binary rubric: pass or fail on each check, no partial credit. Did the model follow instructions, dig out requirements scattered across files, use the right evidence, and reach the correct conclusion? The second is analytical quality, judged by pairwise comparison against another model's submission: which deliverable is more thorough and better supported? The third is presentation, also pairwise: which output is more professionally put together?
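To make the all-or-nothing rubric concrete, here is a minimal Python sketch. The data layout (a list of booleans per task) and the function names are my own illustrative assumptions, not Artificial Analysis' grading code. It shows how a model can pass most individual checks and still fully pass very few tasks.

```python
from typing import Dict, List

def check_pass_rate(checks: List[bool]) -> float:
    """Fraction of individual rubric checks passed on one task."""
    return sum(checks) / len(checks)

def fully_passed(checks: List[bool]) -> bool:
    """A task counts as fully passed only if every binary check passes."""
    return all(checks)

# Hypothetical results (made up for illustration): task id -> outcome of each check.
results: Dict[str, List[bool]] = {
    "task_a": [True, True, True, False],
    "task_b": [True, False, True, True],
    "task_c": [True, True, True, True],
    "task_d": [False, True, False, True],
}

mean_check_rate = sum(check_pass_rate(c) for c in results.values()) / len(results)
full_pass_share = sum(fully_passed(c) for c in results.values()) / len(results)

print(f"mean check pass rate: {mean_check_rate:.0%}")  # 75%
print(f"tasks fully passed:   {full_pass_share:.0%}")  # 25%
```

In this toy data the model passes three quarters of the checks but fully passes only one task in four, which is the same shape as the gap between per-check pass rates and the "every rubric check" figure.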

Those three feed into one headline number, the AA-Briefcase Elo, which blends analytical-quality Elo, presentation Elo, and rubric pass rate using maximum-likelihood Elo aggregation. To keep any one model family from grading itself favourably, every comparison is decided by a panel of three judges: Claude Opus 4.8, GPT-5.5, and Gemini 3.1 Pro Preview.
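For readers who haven't met maximum-likelihood Elo before, the sketch below shows the general idea in Python. It is not the benchmark's implementation: I'm assuming each comparison is settled by a simple majority of the three judges, I fit a standard Bradley-Terry model with MM updates, and I leave out how the rubric pass rate is folded in, since that isn't spelled out here. All names and votes are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple

def majority_winner(votes: List[str]) -> str:
    """Model picked by most judges (three judges, two models, so no ties)."""
    return Counter(votes).most_common(1)[0][0]

def fit_elo(matches: List[Tuple[str, str]], iters: int = 200,
            base: float = 1000.0) -> Dict[str, float]:
    """Fit Bradley-Terry strengths by maximum likelihood (MM updates) and map
    them to an Elo-style scale centred on `base`. Each match is (winner, loser).
    Assumes every model wins at least once, otherwise its strength hits zero."""
    models = sorted({m for pair in matches for m in pair})
    wins = Counter(w for w, _ in matches)
    games = Counter(frozenset(pair) for pair in matches)  # games played per pair
    strength = {m: 1.0 for m in models}
    for _ in range(iters):
        new = {}
        for m in models:
            denom = sum(
                n / (strength[m] + strength[o])
                for pair, n in games.items() if m in pair
                for o in pair - {m}
            )
            new[m] = wins[m] / denom
        # Rescale so the geometric mean strength is 1 (fixes the free scale).
        log_mean = sum(math.log(v) for v in new.values()) / len(new)
        strength = {m: v / math.exp(log_mean) for m, v in new.items()}
    return {m: base + 400 * math.log10(s) for m, s in strength.items()}

# Hypothetical comparisons: (model_x, model_y, [pick of each of three judges]).
comparisons = (
    [("A", "B", ["A", "A", "B"])] * 3 + [("A", "B", ["B", "B", "A"])]
    + [("B", "C", ["B", "C", "B"])] * 3 + [("B", "C", ["C", "C", "C"])]
    + [("A", "C", ["A", "A", "A"])] * 3 + [("A", "C", ["C", "A", "C"])]
)

matches = []
for x, y, votes in comparisons:
    winner = majority_winner(votes)
    loser = y if winner == x else x
    matches.append((winner, loser))

elo = fit_elo(matches)
for model, rating in sorted(elo.items(), key=lambda kv: -kv[1]):
    print(f"{model}: {rating:.0f}")
```

The useful property of fitting all comparisons at once, rather than updating ratings match by match, is that the result doesn't depend on the order in which the comparisons happened.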

The plumbing is open too. Models run on Stirrup, Artificial Analysis' open-source agent harness, inside an offline sandbox with no internet, for up to 500 turns per task. It's a genuinely demanding setup, and it's a fair bit closer to a real agentic workflow than a chat window is.

What the results actually say

The leaderboard up top tells the happy story (Claude Fable 5 in front, capability tiers neatly stacked). The harder story is in the pass rates.

Figure: Bar chart showing pass rate falling from 55% on prompt-only checks to 40% on checks needing five or more files, with a callout that the top model passes all checks on just 3% of tasks.

Even the leading model satisfies all rubric criteria on just 3% of tasks, and on 31 of the 91 tasks no model scores above 50%. Difficulty also scales with the number of required files: high-intelligence models drop from around 55% on prompt-only checks to about 40% once a task needs five or more. The more a task looks like real work, the worse everyone does.

The leaderboard has a few takeaways worth pulling out. GLM-5.2 is the clear open-weight leader and the price/performance standout, landing roughly 90 Elo below Claude Opus 4.8 for less than a quarter of the cost. MiniMax-M3 and GLM-5.2 both overperform relative to their general intelligence scores, while Google's Gemini models underperform on AA-Briefcase compared to where they sit on broad intelligence rankings. And the cost-per-task spread between the priciest and cheapest model runs over 800×, which is a useful reminder when you're weighing the real cost of an AI agent against the metrics it actually moves.
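To get a feel for what a 90-point gap means, here's a quick back-of-the-envelope calculation. It assumes AA-Briefcase Elo follows the conventional 400-point logistic scale used in chess-style Elo, which the release as summarised here doesn't confirm, so treat the result as a rough intuition rather than an official figure.

```python
def expected_win_probability(elo_gap: float) -> float:
    """Conventional Elo logistic curve: the chance the higher-rated side
    is preferred in a single head-to-head comparison."""
    return 1.0 / (1.0 + 10 ** (-elo_gap / 400.0))

# Approximate launch values from the leaderboard.
opus_elo, glm_elo = 1356, 1266

p = expected_win_probability(opus_elo - glm_elo)
print(f"Opus 4.8 preferred over GLM-5.2 in about {p:.0%} of comparisons")
```

Under that assumption, a 90-point lead works out to the stronger model being preferred a bit more often than three times in five, which is a real edge but far from a sweep.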

The "looks right but is wrong" problem

My favourite finding in the whole release is a behavioural one, and it explains a lot about why AI work can feel untrustworthy.

Figure: Bar chart of view-image calls per task, with Claude Fable 5 at 21, Claude Opus 4.8 at 12, GPT-5.4 Mini at 2, and Gemini 3.1 Pro at 0.1, which submits files it never looked at.

The models that score best on presentation are the ones that actually look at their own rendered output. Claude Fable 5 made about 21 view-image calls per task and Opus 4.8 about 12, while some models submitted files they'd barely glanced at (Gemini 3.1 Pro Preview averaged roughly 0.1 view-image calls). It turns out "check your work before you hand it in" is as good advice for an AI as it is for a person.

There's a deeper point underneath. AA-Briefcase separates polish from correctness precisely because a confident, well-formatted answer that's quietly wrong is more dangerous than an obviously incomplete one. That's the exact risk that shows up when an AI chatbot answers a customer, and it's why preventing hallucinations is the whole ballgame in support, not a nice-to-have.

Why a leaderboard score isn't a deployment plan

So a frontier model can do real knowledge work, sometimes brilliantly, and still whiff most of the time on the hardest, most file-heavy tasks. If you take one thing from AA-Briefcase, take this: a benchmark rank is a general capability signal, not a promise about how a model behaves on your messy data.

I've watched this play out firsthand. We've spent years putting AI agents on live support queues, and the thing that bites teams isn't whether the underlying model is smart enough in the abstract. It's whether it stays accurate on their specific tickets, their product quirks, and their edge cases. A model that tops every public leaderboard can still confidently misquote your refund policy on day one, long before it ever gets to automated ticket resolution. That's not a knock on the model; it's the difference between a benchmark and production.

The fix is the same instinct AA-Briefcase is built on: grade the work against ground truth before you trust it. For a helpdesk, that means running the AI against your own historical tickets and seeing exactly what it would have replied, rather than reading a spec sheet and hoping. Think of it as running your own private AA-Briefcase, where the test set is your real support history.
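Here's a minimal sketch of what that replay loop can look like. Everything in it is hypothetical: `PastTicket`, `replay`, and `toy_agent` are names I made up for illustration, the agent is a stub rather than a real model call, and checking for required phrases is a deliberately crude stand-in for a proper rubric.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class PastTicket:
    question: str
    required_facts: List[str]  # phrases a correct reply must contain

def replay(tickets: List[PastTicket],
           agent: Callable[[str], str]) -> Tuple[float, List[str]]:
    """Run the agent over historical tickets. Returns the share of replies
    containing every required fact, plus the questions it got wrong."""
    failures = []
    for ticket in tickets:
        reply = agent(ticket.question).lower()
        if not all(fact.lower() in reply for fact in ticket.required_facts):
            failures.append(ticket.question)
    pass_rate = 1 - len(failures) / len(tickets)
    return pass_rate, failures

def toy_agent(question: str) -> str:
    """Stub agent (assumption): confidently states the wrong refund window."""
    if "refund" in question.lower():
        return "Refunds are available within 14 days of purchase."
    return "Please contact support for help."

# Hypothetical support history with the facts a correct answer needs.
history = [
    PastTicket("Can I get a refund?", ["30 days"]),
    PastTicket("How do I reach a human?", ["contact support"]),
]

rate, failed = replay(history, toy_agent)
print(f"pass rate: {rate:.0%}")
print("needs review:", failed)
```

The point of the failure list is that a pass rate alone hides which answers were wrong; the misquoted refund window is exactly the kind of fluent, confident error you want to catch before a customer does.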

Try eesel for AI support you can actually trust

If AA-Briefcase convinced you that capability and reliability aren't the same thing, that's the exact problem eesel AI is built around. eesel works like a new support teammate that plugs into your existing helpdesk and knowledge base in minutes, then lets you simulate it on thousands of your past tickets before it ever talks to a customer, so you see its real resolution rate and exact answers up front instead of guessing from a leaderboard.

Figure: eesel AI's reports dashboard, where teams forecast resolution rates and review how the AI would have handled past tickets before going live.

You stay in control of what it's allowed to answer and when it escalates, and it's free to try on your own data. If you're evaluating AI for customer service, that simulate-first approach is the closest thing to bringing AA-Briefcase's "prove it on real work" rigour to your own queue.

Frequently Asked Questions

What is the AA-Briefcase benchmark?

AA-Briefcase is a benchmark from Artificial Analysis that tests AI models on realistic, multi-week knowledge-work projects rather than one-off questions. Each project hands the model thousands of messy source files and asks for real deliverables like financial models and board decks, then grades whether the work is actually correct. It's one of the closest public proxies for how an AI agent performs on genuine office work.

Which AI model is best at AA-Briefcase?

At launch on 18 June 2026, Claude Fable 5 tops the AA-Briefcase Elo with roughly 1587, ahead of Claude Opus 4.8 and the open-weight leader GLM-5.2. The full ranking is in the leaderboard near the top of this post, and you can re-check the live numbers on the Artificial Analysis evaluation page.

How is AA-Briefcase scored?

Each task is graded on three dimensions: a binary rubric for verifiable correctness, a pairwise Elo for analytical quality, and a pairwise Elo for presentation. Those combine into a single AA-Briefcase Elo, with a three-model judge panel deciding each comparison to limit same-family bias.

Why do AI models score so low on AA-Briefcase?

The work is genuinely hard: the top model passes every rubric check on only 3% of tasks, and on 31 of 91 tasks no model clears 50%. Difficulty also rises with the number of files a task needs, which is exactly the kind of fragmented context that trips up AI in production.

Does a high AA-Briefcase score mean the model is safe to deploy?

No. A leaderboard rank tells you a model is capable in general, not that it'll be reliable on your data and workflows. The safer path is to test against your own historical work first, the way eesel lets support teams simulate an AI agent on past tickets before it ever replies to a customer.

How is AA-Briefcase different from other AI benchmarks?

Most benchmarks score short, self-contained questions. AA-Briefcase scores long-horizon projects with linked tasks and contradictory source files, and it separates outputs that look polished from outputs that are actually right. That makes it more relevant to anyone weighing AI versus human work on real business tasks.

Can I use AA-Briefcase to pick an AI tool for customer support?

It's a useful capability signal, but support tools are more than a raw model. What matters for AI customer service is how the system retrieves your knowledge, escalates, and avoids confident wrong answers. Pair the benchmark with a real trial on your own tickets, like the simulation in eesel AI, before committing.

Article by Alicia Kirana Utomo

Kira is a writer at eesel AI with a Computer Science background and over a year of hands-on experience evaluating AI-powered customer service tools. She focuses on breaking down how helpdesk platforms and AI agents actually work so that support teams can make better buying decisions.