Amazon Killed Its AI Leaderboard. How to Get Ahead by Measuring AI Value, Not AI Usage

The tokenmaxxing scandal is a symptom. The disease has a 40-year historical precedent — and the companies that understand it are already building the moat.

Jun 02, 2026

Two weeks ago, I posted on LinkedIn: “Can someone please kill the internal AI leaderboards? Or at least put the right KPIs in place?”

Last week, Amazon did:

From the article:
Dave Treadwell, Senior VP, told staff to stop using AI “just for the sake of using AI” after employees “tokenmaxxed” their scores on Kiro Rank — running meaningless tasks through the system to inflate usage numbers. Amazon replaced it with “normalized deployments”: AI output that is actually shipped.

The coverage called it a leaderboard problem. To me, the deeper issue is that the appearance of efficiency from AI use was treated as more important than the return on AI spending, measured, for example, in ARR.

That is a measurement pattern, and it has now repeated three times in roughly 150 years.

The gaming is only half the problem

According to Goodhart’s Law, when a metric becomes a target, it ceases to be a good metric. In other words, once a specific metric is used to reward performance, people will manipulate the system to optimize that metric, inadvertently undermining its original purpose. Amazon set an 80%+ weekly AI usage target. Developers met it by running AI on tasks that didn’t need it (no wonder 🤷‍♂️).

Victorino Group has documented the same failure at three enterprises in 2026 — different tools, identical result.

But the deliberate gaming is only half the problem. METR’s May 2026 study found that workers with no leaderboard to game still overestimated AI’s effect on their own productivity by 40 percentage points on average.

So, the usage metric is wrong in two directions simultaneously — gamers inflate it deliberately, and everyone else inflates it unconsciously.

The moat always comes from what you measure

Paul David’s 1990 research on electrification documented a roughly 40-year productivity lag: commercial electric power arrived in the early 1880s, factories were installing electric motors by the 1890s, and broad productivity gains did not show up until the 1920s. The factories that captured the gains didn’t measure kilowatts installed. They reorganized their floors around electricity’s properties (unit drive instead of group drive) and measured output per reorganized unit.

Cloud/SaaS compressed the same pattern to 15 years: the companies that invented LTV/CAC and NRR in 2000–2015 built durable moats while the rest measured seat counts.

AI is in year three. As I explored in The Future Cost of TaxTech, the compliance cost stack has already changed, and most enterprises haven’t noticed. Now is the time to establish the right metrics.

Those measuring outcomes reap enormous benefits from AI

PwC’s 2026 AI Performance Study (n=1,217 executives, 25 sectors) found that leaders generate 7.2× more revenue and efficiency gains than the average competitor. What separates them: they are twice as likely to redesign workflows around AI rather than layer tools on top, and 2.8× more likely to automate decisions without human intervention. The single strongest predictor of AI financial performance is identifying growth opportunities, not cost reduction and not usage frequency. The leaders are measuring outcomes, not activity.

Tax and Finance have two value categories that are structurally invisible

BCG’s first systematic AI ROI study in finance found that only 29% of finance executives are able to confidently measure AI ROI. The root cause is that standard measurement captures only labor (effort) savings. It misses two categories that matter in Tax and Finance: audit risk reduction (exposure reduced before the filing) and compliance failures avoided (the penalty never incurred). Both have real expected value. Neither shows up on any dashboard.
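To make "real expected value" concrete, here is a minimal sketch of how the two invisible categories could be added to an ROI calculation. It assumes the simplest possible model: the value of a risk reduction is the drop in the event's probability multiplied by the cost of the event. The names (RiskItem, roi) and every figure are hypothetical and chosen only for illustration. They do not come from BCG or any other study cited here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RiskItem:
    """One exposure whose likelihood changes once AI is in the workflow."""
    name: str
    p_before: float  # assumed annual probability of the event without AI
    p_after: float   # assumed annual probability with AI in the workflow
    impact: float    # assumed cost if the event occurs, in currency units

    def expected_value_avoided(self) -> float:
        # Expected loss avoided = drop in probability x cost of the event.
        return (self.p_before - self.p_after) * self.impact


def roi(benefit: float, cost: float) -> float:
    """Simple ROI: net benefit divided by cost."""
    return (benefit - cost) / cost


# All figures below are made up for illustration.
ai_cost = 200_000.0
labor_savings = 260_000.0
risk_items = [
    RiskItem("audit adjustment", p_before=0.10, p_after=0.04, impact=1_500_000.0),
    RiskItem("late-filing penalty", p_before=0.05, p_after=0.01, impact=500_000.0),
]

# Sum the expected value of the risks that standard dashboards leave out.
invisible_value = sum(item.expected_value_avoided() for item in risk_items)

labor_only_roi = roi(labor_savings, ai_cost)
full_roi = roi(labor_savings + invisible_value, ai_cost)

print(f"Labor-only ROI: {labor_only_roi:.0%}")
print(f"ROI incl. expected risk value: {full_roi:.0%}")
```

With these invented inputs, the labor-only view shows a much smaller return than the view that includes expected risk value. The hard part in practice is estimating the before and after probabilities, which is exactly the measurement work most dashboards skip.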

The Tax teams that redesign workflows around AI and build frameworks for invisible value will report 2–3× higher ROI — and will know exactly where to invest next. That is what separates the 10% of AI Tax projects that actually deliver from the 90% that don’t.

The leaderboard was never the problem. The metric was.

What does your AI ROI calculation include?

REFERENCES & FURTHER READING

Financial Times — “Amazon scraps AI leaderboard to stop workers chasing usage scores” — ft.com — May 2026
The Decoder — “Amazon kills internal AI leaderboard after employees gamed it with pointless tasks” — https://the-decoder.com/amazon-kills-internal-ai-leaderboard-after-employees-gamed-it-with-pointless-tasks/ — Accessed June 2026
Victorino Group — “Three Companies, One Failure Mode: Goodhart’s Law Comes for AI Adoption” — https://victorinollc.com/thinking/goodhart-ai-adoption-tokenmaxxing-mainstream — Accessed June 2026
METR — “Measuring the Self-Reported Impact of Early-2026 AI on Technical Worker Productivity” — https://metr.org/blog/2026-05-11-ai-usage-survey/ — May 2026
PwC — “Three-quarters of AI’s economic gains are being captured by just 20% of companies” — https://www.pwc.com/gx/en/news-room/press-releases/2026/pwc-2026-ai-performance-study.html — April 2026
PwC — “Want ROI from AI? Go for growth” (full study PDF) — https://www.pwc.com/gx/en/so-you-can/2026/content/roi-from-ai.pdf — 2026
BCG — “How Finance Leaders Can Get ROI from AI” — https://www.bcg.com/publications/2025/how-finance-leaders-can-get-roi-from-ai — 2025
Paul David — “The Dynamo and the Computer: An Historical Perspective on the Modern Productivity Paradox” — 1990. Accessible via AEI updated commentary: https://www.aei.org/articles/the-dynamo-the-computer-and-chatgpt-explaining-todays-productivity-paradox/
CNBC — “Almost every Fortune 500 is tracking overall AI usage: What that means for employees” — https://www.cnbc.com/2026/05/05/ai-use-work-employee-monitoring-tech-surveillance.html — May 2026
