
500 Bankers Tested Top AI Models. None Passed Client Review

Huma Shazia · 26 April 2026 at 3:13 pm · 5 min read

Key Takeaways

  • Zero AI outputs from any tested model were rated ready for client delivery
  • 41% of AI outputs needed major rework, 27% were completely unusable
  • GPT-5.4 scored highest but still failed nearly half the evaluation criteria

The finance industry keeps asking: can AI replace junior bankers? A new benchmark from Handshake AI and McGill University offers an answer, and it's no.

BankerToolBench tested nine top AI models, including GPT-5.4, Claude Opus 4.6, and Gemini 3.1 Pro Preview, against 100 real investment banking tasks. The tasks weren't trivia questions or summarization exercises. They were actual deliverables: Excel financial models with working formulas, PowerPoint decks for client meetings, PDF reports, and Word memos.

The verdict from the roughly 500 current and former investment bankers who reviewed the outputs? Not a single result was ready to send to a client.

0%
of AI outputs across all nine models were rated client-ready by investment bankers

How the benchmark works

The research team recruited bankers from Goldman Sachs, JPMorgan, Evercore, Morgan Stanley, and Lazard. Of the 500 participants, 172 designed the tasks themselves, work that took more than 5,700 hours. Each of the 100 tasks took a human banker an average of five hours to complete; some ran as long as 21 hours.

These weren't simplified test cases. The AI agents had to dig through data rooms, pull information from market data platforms like FactSet and Capital IQ, and parse SEC filings. A single task could trigger up to 539 calls to the language model. According to the paper, 97% of those calls involved tool use or code execution.
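For a concrete picture of what that means, here is a minimal sketch of how an agentic tool-use loop of this kind typically runs. The tool names, message format, and llm() interface below are illustrative assumptions, not BankerToolBench's actual harness.

```python
# Minimal sketch of an agentic tool-use loop. Tool names, the message format,
# and the llm() interface are illustrative assumptions, not the actual
# BankerToolBench harness.
import json

TOOLS = {
    "search_sec_filings": lambda args: {"filing": "..."},   # parse SEC filings
    "query_market_data": lambda args: {"quote": "..."},     # FactSet / Capital IQ-style lookups
    "read_data_room_file": lambda args: {"content": "..."}, # provided deal documents
    "run_code": lambda args: {"stdout": "..."},             # build the Excel model, deck, etc.
}

def run_task(llm, task_prompt, max_llm_calls=539):
    """Loop until the model stops requesting tools or the call budget runs out."""
    messages = [{"role": "user", "content": task_prompt}]
    for _ in range(max_llm_calls):
        reply = llm(messages)                      # one call to the language model
        if reply.get("tool") is None:              # no tool requested: treat as the final deliverable
            return reply["content"]
        result = TOOLS[reply["tool"]](reply["args"])
        messages.append({"role": "assistant", "content": json.dumps(reply)})
        messages.append({"role": "tool", "content": json.dumps(result)})
    return None                                    # budget exhausted without a finished deliverable
```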

BankerToolBench workflow: investment bankers create example deliverables and define grading rubrics, while AI agents receive the same prompts in an environment with four tools (SEC filings, market data, company profiles, and provided files). The agents' output files go to a verifier that computes a weighted pass rate across all rubric items, roughly 150 criteria per task.

Each deliverable was graded against a rubric averaging 150 individual criteria. The criteria covered six areas, including technical correctness, client readiness, compliance, auditability, and consistency across files. An AI verifier called Gandalf, built on Gemini 3 Flash Preview, handled the grading. It agreed with human reviewers 88.2% of the time, slightly better than the 84.6% agreement rate between two human reviewers.
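The workflow diagram above describes the verifier as computing a weighted pass rate over all rubric items. As a rough sketch of what that calculation could look like (the criterion names and weights below are invented for illustration, not taken from the paper):

```python
# Rough sketch of a weighted rubric score. Criterion names and weights are
# invented for illustration; the paper's rubric averages about 150 criteria.
def rubric_score(results):
    """results: list of (passed, weight) pairs, one entry per rubric criterion."""
    total = sum(weight for _, weight in results)
    earned = sum(weight for passed, weight in results if passed)
    return 100 * earned / total  # score out of 100, comparable to GPT-5.4's 58.1

example = [
    (True, 3.0),   # formula links in the Excel model resolve correctly
    (False, 3.0),  # balance sheet balances in every scenario
    (True, 1.0),   # deck follows the requested slide layout
    (False, 2.0),  # every figure is traceable to a source document
]
print(round(rubric_score(example), 1))  # 44.4
```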

The results: failing grades across the board

The bankers sorted AI outputs into five categories. Here's how the work landed:

  • 0% ready to send to a client as-is
  • 13% could pass with light edits
  • 19% needed moderate revisions
  • 41% needed major rework
  • 27% were completely unusable
Banker evaluation of AI outputs: 0% are ready to send as-is, 13% need light edits, 19% moderate revisions, 41% extensive rework, and 27% are unusable. Still, 69% of reviewers would at least build on the AI output, while 55% put the risk of a bad outcome above 99% if the work were submitted unchanged.

GPT-5.4 performed best among the tested models but still failed nearly half the evaluation criteria. Only 16% of its outputs cleared the bar at which bankers would accept them as a useful starting point. Require three consistent runs, and that figure drops to 13%.
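Those two figures reflect different acceptance rates: one counts a task as passed if a single run clears the bar, while the stricter one requires success on every one of three runs. A minimal sketch of the distinction (the paper's exact metric definitions may differ):

```python
# Sketch of the two acceptance rates: single-run success versus requiring
# success on all three runs of the same task. The paper's exact metric
# definitions may differ.
def single_run_rate(runs_per_task):
    """runs_per_task: one list of boolean run outcomes per task, e.g. [[True, False, True], ...]"""
    return sum(runs[0] for runs in runs_per_task) / len(runs_per_task)

def all_three_runs_rate(runs_per_task):
    return sum(all(runs[:3]) for runs in runs_per_task) / len(runs_per_task)

tasks = [[True, True, True], [True, False, True], [False, False, False], [True, True, True]]
print(single_run_rate(tasks))      # 0.75 - more tasks clear the bar on a single attempt
print(all_three_runs_rate(tasks))  # 0.5  - fewer survive the consistency requirement
```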

Model-by-model comparison

The team tested nine models in total: GPT-5.2, GPT-5.4, Claude Opus 4.5, Claude Opus 4.6, Gemini 2.5 Pro, Gemini 3.1 Pro Preview, Grok 4, Qwen-3.5-397B, and GLM-5.

Pass rates across the nine models on BankerToolBench: GPT-5.4 leads with 16% pass@1 and 23% pass@3, followed by Gemini 3.1 Pro at 10% and Claude Opus 4.6 at 9%; Gemini 2.5 Pro sits at zero. None approach production quality.

The rubric scores tell a similar story. GPT-5.4 scored 58.1 out of 100, with GPT-5.2 close behind at 56.1. The open-source models Qwen-3.5-397B and GLM-5 trailed the proprietary options.

Rubric scores out of 100: GPT-5.4 (58.1), GPT-5.2 (56.1), Gemini 3.1 Pro (53.6), Claude Opus 4.6 (53.2), Claude Opus 4.5 (52.3), GLM-5 (46.8), Qwen-3.5-397B (42.6), Grok 4 (31.4), and Gemini 2.5 Pro (29.4); no model breaks 60 points. The right panel shows pairwise win rates between the models.

Across the six evaluation categories, no model showed consistent strength. Technical correctness and auditability proved especially difficult. Client readiness scores were low across the board.

The silver lining: starting points, not replacements

Despite the poor scores, more than half the bankers said they would use AI outputs as a starting point. The models can generate rough drafts, pull together data, and handle some formatting. The problem is reliability. Bankers can't trust the output without checking every cell, every formula, every number.

For tasks that take five to 21 hours, that's still meaningful. An AI draft that needs two hours of cleanup beats starting from scratch. But it's far from the autonomous AI agents that some vendors promise.


Logicity's Take

What this means for AI deployment in finance

BankerToolBench is open-source, so other researchers can replicate and extend the work. Handshake AI, the business arm of the career platform Handshake, built the benchmark to help AI labs understand real-world performance gaps.

The benchmark matters because it tests what actually matters in finance: deliverables, not chat responses. A model that can discuss DCF analysis in conversation might still produce an Excel model with broken formulas. BankerToolBench catches that gap.

For banks evaluating AI tools, the takeaway is clear. Current models can accelerate work but require substantial human oversight. The ROI case depends on how much time review takes versus starting fresh.
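As a back-of-envelope illustration of that trade-off (the 5-to-21-hour range comes from the benchmark tasks; the two-hour review echoes the cleanup example above and is otherwise an assumption):

```python
# Back-of-envelope time savings: manual effort minus the review an AI draft
# still needs. The 5-to-21-hour range comes from the benchmark tasks; the
# two-hour review is an illustrative assumption.
def hours_saved(manual_hours, review_hours=2):
    return max(manual_hours - review_hours, 0)

for manual in (5, 12, 21):
    print(f"{manual}h task -> {hours_saved(manual)}h saved after a 2h review")
```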

Also Read
Anthropic Ran a Secret AI Marketplace Where Agents Traded Goods

Another look at how AI agents perform in complex, real-world tasks

Frequently Asked Questions

Which AI model performed best on investment banking tasks?

GPT-5.4 scored highest with 58.1 out of 100 on the rubric, but still failed nearly half the evaluation criteria. No model produced client-ready output.

What percentage of AI outputs were usable for investment banking?

Only 13% of outputs could pass with light edits. Another 19% needed moderate revisions, 41% needed major rework, 27% were completely unusable, and 0% were ready to send to clients as-is.

What tasks did BankerToolBench test AI models on?

The benchmark tested real investment banking deliverables: Excel financial models with working formulas, PowerPoint decks, PDF reports, and Word memos. Tasks involved parsing SEC filings and pulling data from platforms like FactSet and Capital IQ.

Who created the BankerToolBench AI benchmark?

Handshake AI and McGill University created BankerToolBench. About 500 current and former investment bankers from firms including Goldman Sachs, JPMorgan, and Morgan Stanley participated in designing tasks and reviewing outputs.

Can AI replace junior investment bankers?

Not yet. While 53% of bankers said AI outputs are useful as a starting point, none of the outputs were client-ready. AI can accelerate drafting but requires substantial human review.


Source: The Decoder / Jonathan Kemper


Huma Shazia

Senior AI & Tech Writer