Anthropic's AI Was 21% Accurate. More Data Didn't Help

Anthropic connected their data to Claude and asked it real questions. The answers looked great. The kind you’d forward to your team.

Then they checked. Claude was right 21% of the time.

So they fed it thousands of past queries and confirmed it read them. Accuracy moved less than a point. For about 80% of the misses, the answer was in the pile. The model grabbed the wrong thing anyway.

Piped Salesforce, Gong, and your tickets into an AI and asked what to build next? This is your story. You just haven’t checked the answers yet.

The number that should stop you

A few weeks ago Anthropic published a post about how their data team runs self-service analytics with Claude. It reads like an internal engineering note. Buried inside it is the clearest argument I’ve seen for why most AI product discovery setups quietly fail, and the people who wrote it had no reason to make that argument. That’s what makes it worth your time.

Without a structured layer in front of the data, Claude answered analytics questions correctly 21% of the time. One in five.

Then they ran the experiment everyone reaches for first. They gave the agent direct access to thousands of past queries, every dashboard and notebook and analysis the company had ever written. They checked the transcripts to confirm Claude read the material before answering. Accuracy moved less than a single point.

So they dug into the misses. For about 80% of the questions it got wrong, the correct answer was sitting right there in the corpus the whole time. The agent saw it and used the wrong thing anyway.
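It helps to put those two numbers side by side. The arithmetic below is a back-of-the-envelope sketch, assuming the roughly 21% accuracy and the roughly 80% figure describe the same set of questions. Both inputs are approximate, and the function name is mine, not Anthropic’s.

```python
def available_but_missed(accuracy: float, share_of_misses_with_answer: float) -> float:
    """Fraction of ALL questions that were missed although the answer was in the corpus."""
    miss_rate = 1.0 - accuracy
    return miss_rate * share_of_misses_with_answer


# Approximate figures quoted in the post: ~21% accurate, ~80% of misses had the answer available.
share = available_but_missed(accuracy=0.21, share_of_misses_with_answer=0.80)
print(f"{share:.0%} of all questions")
```

Under those assumptions, roughly 63% of all questions were cases where the right answer was within reach and the agent still got it wrong.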

Their conclusion, in their words: the bottleneck wasn’t access to the information. It was structure.

Why this should bother every product team

Read that setup again and swap the nouns. You connect Salesforce, Gong, Zendesk tickets, maybe a Slack channel and a Snowflake table, and you pipe the whole pile into Claude. You ask it what to build next. You get answers. Some of them sound good.

Anthropic just told you what the model does when you do that. It reads the pile. The pile holds the answer. It still picks the wrong thing, because reading a million signals is not the same as knowing which one answers your question.

They named three failure modes, and every one of them hits product feedback harder than it hits analytics.

Ambiguity. A request for “better reporting” lives in forty tickets, three sales calls, and a churn survey, each phrased differently. Which ones are the same need? Generic AI guesses.

Staleness. The roadmap moved, the feature shipped, the segment got renamed, and the feedback describing the old world still sits in the pile looking authoritative.

Retrieval failure. The decisive piece of evidence exists. It’s one row in a thousand. The agent walks right past it.
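To make the first two failure modes concrete, here is a toy sketch of what handling them explicitly could look like. Everything in it is hypothetical: the `Signal` shape, the hand-maintained `ALIASES` map, and the rule that feedback older than a ship date counts as stale are my assumptions for illustration, not anything from Anthropic’s post.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Signal:
    source: str      # e.g. "ticket", "sales_call", "survey"
    text: str
    received: date


# Hypothetical, hand-maintained map from customer phrasing to one canonical need.
ALIASES = {
    "better reporting": "reporting",
    "custom reports": "reporting",
    "export dashboards": "reporting",
}


def group_by_need(signals, shipped_on=None):
    """Group signals under canonical needs, skipping any that predate a ship date."""
    shipped_on = shipped_on or {}
    grouped = defaultdict(list)
    for signal in signals:
        text = signal.text.lower()
        need = next((n for phrase, n in ALIASES.items() if phrase in text), None)
        if need is None:
            continue  # ambiguity: unmapped phrasing is left for a human, not guessed
        cutoff = shipped_on.get(need)
        if cutoff is not None and signal.received < cutoff:
            continue  # staleness: this describes the world before the feature shipped
        grouped[need].append(signal)
    return dict(grouped)


signals = [
    Signal("ticket", "We need better reporting", date(2024, 1, 5)),
    Signal("survey", "Custom reports please", date(2024, 6, 1)),
    Signal("sales_call", "Love the new UI", date(2024, 6, 2)),
]
current = group_by_need(signals, shipped_on={"reporting": date(2024, 3, 1)})
```

The point is not the substring matching, which is deliberately crude. It is that the mapping and the cutoff are written down somewhere the agent cannot ignore, instead of being left for it to infer from the pile.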

This is the part of the analytics post that travels. Anthropic solved it for metrics and dashboards. Product discovery has the same shape of problem, and nobody at Anthropic was thinking about discovery when they wrote it. The analogy is mine. I think it holds.

What actually moved the number

Once they accepted that structure was the problem, the fixes followed.

The headline one: they built skills, procedural layers that tell the agent which source to trust, in what order, and how to handle the gotchas a senior analyst knows by heart. Skills took accuracy from that 21% to consistently above 95%, and near 99% in some domains.
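As a toy illustration of “which source to trust, in what order,” a skill can be pictured as an ordered trust list plus the gotchas that go with it. This is a sketch of the idea, not Anthropic’s actual skill format, and the class, function, and source names are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Skill:
    domain: str
    trusted_sources: List[str]                        # highest trust first
    gotchas: List[str] = field(default_factory=list)  # notes a senior analyst would give


def pick_source(skill: Skill, candidates: List[str]) -> Optional[str]:
    """Return the most trusted source among the candidates, or None if none is trusted."""
    for source in skill.trusted_sources:
        if source in candidates:
            return source
    return None


revenue_skill = Skill(
    domain="revenue",
    trusted_sources=["finance_mart", "billing_events", "ad_hoc_notebooks"],
    gotchas=["Exclude internal test accounts before aggregating."],
)

# Two sources look relevant; the skill, not the model, decides which one wins.
chosen = pick_source(revenue_skill, ["ad_hoc_notebooks", "billing_events"])
```

Here `chosen` is `"billing_events"`, because it ranks above the notebooks in the trust order. A source the skill does not list returns `None`, which is a cue to stop and ask rather than improvise.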

Underneath the skills sat a governed layer. One canonical definition of each metric, owned by a human, so “active users” resolved to a single answer instead of forty plausible ones. They were blunt about one thing here: they tried having an LLM auto-generate those definitions, and it backfired. It produced confident-looking definitions that baked in the exact ambiguity they were trying to kill. The working split: Claude writes the documentation, a human owns the definition.
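The “one definition, one owner” rule is easy to picture as a small registry. The sketch below is an assumption-laden illustration: the `MetricDefinition` and `MetricRegistry` names, the fields, and the sample SQL are mine, and a real governed layer would live in a warehouse or semantic-layer tool, not in a Python dict.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class MetricDefinition:
    name: str
    sql: str         # the single canonical query for this metric
    owner: str       # the human accountable for the definition
    docs: str = ""   # documentation, which may be machine-drafted


class MetricRegistry:
    """Holds at most one definition per metric name."""

    def __init__(self) -> None:
        self._definitions: Dict[str, MetricDefinition] = {}

    def register(self, definition: MetricDefinition) -> None:
        if not definition.owner:
            raise ValueError(f"{definition.name!r} needs a human owner")
        if definition.name in self._definitions:
            raise ValueError(f"{definition.name!r} is already defined")
        self._definitions[definition.name] = definition

    def resolve(self, name: str) -> MetricDefinition:
        # A missing metric is an error, not an invitation to guess.
        if name not in self._definitions:
            raise KeyError(f"no governed definition for {name!r}")
        return self._definitions[name]


registry = MetricRegistry()
registry.register(MetricDefinition(
    name="active_users",
    sql="SELECT COUNT(DISTINCT user_id) FROM events WHERE event_date >= CURRENT_DATE - 30",
    owner="analytics-lead",
    docs="Distinct users with at least one event in the trailing 30 days.",
))
```

Registering a second `active_users` raises an error, and so does a definition with no owner. That is the whole design choice: the registry refuses to hold forty plausible answers.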

And the layer reached the agent over MCP. That’s not a detail I’m adding for flavor. Anthropic served their skills as resources over the Model Context Protocol so the same governed answer showed up in Slack, in the IDE, and in standalone agent sessions. The structure only counts if it reaches the agent at the moment of the question. MCP is how it gets there.

Anthropic built all of this themselves

Everything above is real and it works. It’s also the product of a data engineering team at one of the best-resourced AI companies on earth, building bespoke infrastructure for their own internal use, maintaining it daily. The one month they stopped maintaining it, their accuracy drifted from 95% to 65%.

That’s the part the success story skips. You don’t buy a structured layer once. You build it, govern it, and feed it every day, or it rots. Anthropic can afford a team whose job is to keep the pile sorted. Your product org has a PM, a pile of feedback, and a Claude seat.

You are not going to stand up a governed metric layer, a skills repo with CI hooks, and an adversarial review sub-agent between now and your next planning cycle. You shouldn’t have to. The lesson from Anthropic is sound. The build is the problem.

What the structured layer looks like for product decisions

This is the gap we built Bagel to close. Bagel is an AI-Native Product Velocity Platform, and the short version is that it is the structured layer Anthropic describes, built for product feedback instead of warehouse metrics, and built so you don’t assemble it yourself.

Map it back to the three failure modes.

Against ambiguity, Bagel resolves every incoming signal to a single feature or opportunity before you ask. The “better reporting” request scattered across forty tickets, three Gong calls, and a churn survey collapses into one mapped need with the evidence attached. Anthropic did this for metrics by hand, one canonical definition at a time. Bagel does it for raw customer voice across Gong, Salesforce, Zendesk, Slack, Jira, your product analytics, and 100+ more sources, and it does it as the signal lands.

Against staleness, Bagel’s Discovery OS validates against live signal as it lands, instead of in scheduled research sprints. Anthropic learned that a structured layer rots without daily maintenance. Discovery OS does that maintenance for you. When a feature ships, a segment gets renamed, or last quarter’s bet lands or flops, the evidence updates on its own. Staying current is the product here, not a chore you have to remember.

Against retrieval failure, every answer arrives with the case already built: the decision, the revenue context, and the customer quotes behind it. You get the conclusion, not a pointer to one row in a thousand. Anthropic’s fix for “the agent saw the answer and used the wrong thing” was to narrow the search space to a few curated, governed files before a query ran. Bagel narrows a million raw signals to one sourced, quantified decision before you ask the question.

That output has a name in our pipeline: a Decision Artifact. Signal comes in raw and messy. It moves through MCP orchestration and the intelligence layer, and it comes out as a scoped, sourced, dev-ready decision. For deeper work, Discovery OS drafts the artifact itself, the PRD or research brief or strategy doc, built from your signal so you finish the document instead of starting it.

Where this leaves the agent

Anthropic’s whole stack assumed one thing: the structured layer has to reach the agent at the moment of the question, or none of it matters. They served their skills over MCP for exactly that reason.

Bagel works the same way. Every Decision Artifact is available through the Bagel MCP, so Claude, Cursor, Codex, and any agent in your stack build against the same governed evidence. The PM asks what to build and gets a sourced decision. The coding agent picks it up and builds against the same decision. No second pile to re-interpret, no drift between the answer the PM saw and the thing that ships.

The model was never the problem. Anthropic’s own numbers make the case: more access moved accuracy less than a point, while skills on top of a governed layer moved it from 21% to above 95%. The raw pile is what breaks you, and a structured layer between the question and the pile is what fixes it.

Anthropic built theirs in-house, for analytics, with a team paid to maintain it. You can see how we built ours, for product decisions, so you don’t have to.

The Anthropic post is worth reading in full: How Anthropic enables self-service data analytics with Claude.
