
BankerToolBench 2026: When AI Agents Fail the Investment Banking Test

502 investment bankers test nine models; not one delivers client-ready results

Handshake AI and McGill University demonstrate with the BankerToolBench benchmark that no current AI model reliably handles junior banker workloads. In the separate AA-Omniscience benchmark, GPT-5.5 posts an 86 percent hallucination rate despite leading the accuracy rankings. For enterprises considering AI in knowledge-intensive domains, the study provides a critical quality framework, especially ahead of the EU AI Act high-risk deadline on August 2, 2026.

Summary

The BankerToolBench benchmark, developed by Handshake AI and McGill University with 502 experienced investment bankers, shows that none of the nine tested AI models produce outputs ready for client delivery without revision. The best model, GPT-5.4, achieves just 16 percent acceptable results; 27 percent of all outputs are completely unusable. In parallel, the AA-Omniscience benchmark reveals that GPT-5.5 leads performance rankings but carries an 86 percent hallucination rate, compared to 36 percent for Claude Opus 4.7. Enterprises deploying AI in regulated or knowledge-intensive domains need domain-specific quality benchmarks and must implement Human-in-the-Loop as a process standard, particularly before the EU AI Act high-risk deadline on August 2, 2026.

What BankerToolBench Found

No current AI model handles everyday investment banking tasks reliably enough for client contact. That is the central finding of the BankerToolBench benchmark, published in April 2026 on arXiv (paper 2604.11304) by Handshake AI and McGill University. The study involved 502 active and former bankers from Goldman Sachs, JPMorgan, Morgan Stanley, and Evercore, who created 100 tasks, produced reference outputs, and evaluated AI results against an average of 150 criteria per task.

Client-ready outputs across all nine models: 0%
Acceptable outputs from the best model (GPT-5.4): 16%
Completely unusable outputs: 27%
Investment bankers from tier-1 institutions: 502

The tasks reflect standard junior banker work: Excel financial models with working formulas, PowerPoint pitch decks, structured reports based on SEC filings and market data. Each task takes a human an average of 5 hours, with some reaching 21 hours. Not a single model passed unscathed.

Core finding

41 percent of all AI outputs required comprehensive reworking. Only 13 percent needed minor corrections. Not a single result could be forwarded to clients without revision.

Where AI Models Fail: Concrete Failure Patterns

The failure analysis shows that AI models do not simply perform slowly or incompletely; they fail in ways that are hard to detect at first glance. That is the core business risk. Claude Opus 4.6 produced visually polished spreadsheets where key variables were hardcoded as fixed values rather than calculated formulas, making scenario analysis structurally impossible.

GPT-5.4 Failure Distribution
41% code and formula errors
27% flawed business logic
18% failed data queries
13% fabricated data presented as sourced
Claude Opus 4.6 Failure Patterns
Key cell values hardcoded, not calculated
Scenario analysis structurally impossible
Visually correct outputs hiding critical flaws
9% acceptable outputs

Debt capital market models, merger calculations, and capital structure tables proved especially failure-prone. Gemini 2.5 Pro achieved a pass rate of zero percent, producing no usable output. The AI verifier "Gandalf" agreed with human reviewers in 88.2 percent of cases, confirming the validity of the evaluation rubric.

Practical risk: Hardcoded values in Excel models look like correct results. Only checking the formula layer or running scenario variations exposes the flaw. In a regulated environment or client presentation, this is a direct liability risk.
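One way to catch this class of error is to inspect the formula layer programmatically rather than the rendered values. The following is a minimal sketch, assuming the model output is an .xlsx workbook and using openpyxl; the sheet name, cell ranges, and file name are hypothetical placeholders, not part of the BankerToolBench tooling.

```python
# Sketch: flag cells that hold constants where a financial model is expected
# to carry formulas (e.g. projection rows in a DCF). Assumes openpyxl is
# installed; workbook name, sheet name, and ranges are invented examples.
from openpyxl import load_workbook

EXPECTED_FORMULA_RANGES = {"Model": ["C5:H5", "C12:H20"]}  # hypothetical layout

def find_hardcoded_cells(path: str) -> list[str]:
    wb = load_workbook(path, data_only=False)  # keep formulas, not cached values
    flagged = []
    for sheet_name, ranges in EXPECTED_FORMULA_RANGES.items():
        ws = wb[sheet_name]
        for rng in ranges:
            for row in ws[rng]:
                for cell in row:
                    value = cell.value
                    is_formula = isinstance(value, str) and value.startswith("=")
                    if value is not None and not is_formula:
                        flagged.append(f"{sheet_name}!{cell.coordinate} = {value!r}")
    return flagged

if __name__ == "__main__":
    for hit in find_hardcoded_cells("dcf_model.xlsx"):
        print("hardcoded:", hit)
```

A check like this only confirms that formulas exist where they should; it does not validate the business logic, which is why the expert review described later remains necessary.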
Quality data

The Hallucination Paradox: GPT-5.5 Leads but Hallucinates Most

The quality problem extends beyond specific task types. It also appears at benchmark level in a counterintuitive pattern: the current front-runner among AI models simultaneously carries one of the highest hallucination rates. The AA-Omniscience benchmark from Artificial Analysis measures factual knowledge across more than 40 subject areas and penalizes wrong answers more heavily than admitting "I don't know."

AA-Omniscience measures how often a model gives wrong answers with high confidence rather than admitting uncertainty.

GPT-5.5 achieves the highest accuracy at 57 percent, but also the highest hallucination rate at 86 percent. It frequently answers even when it does not know the correct answer.

Hallucination rates (Artificial Analysis, AA-Omniscience benchmark):
GPT-5.5: 86%
Gemini 3.1 Pro: 50%
Claude Opus 4.7: 36%
Grok 4.20: 17% (lowest of the models tested)

For high-risk applications in finance, law, or medicine, a confabulation pattern is more dangerous than lower overall accuracy combined with honest uncertainty signaling. A model that gets 57 percent of questions right but confidently fabricates answers in 86 percent of cases where it lacks knowledge creates a hard-to-detect error risk.
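The arithmetic can be made concrete. The snippet below is only an illustration of the hallucination-rate definition used in this article (wrong answers as a share of the questions the model does not actually know), not Artificial Analysis's scoring code; the answer counts are chosen so that the accuracy and hallucination figures match the reported 57 and 86 percent.

```python
# Illustration: hallucination rate as described in this article, i.e. the
# share of unknown questions answered wrongly instead of with an abstention.
# The counts below are made up to reproduce the reported figures.
def hallucination_rate(labels: list[str]) -> float:
    wrong = labels.count("wrong")
    abstain = labels.count("abstain")
    unknown = wrong + abstain          # questions the model did not actually know
    return wrong / unknown if unknown else 0.0

# A model can lead on accuracy and still hallucinate heavily:
answers = ["correct"] * 57 + ["wrong"] * 37 + ["abstain"] * 6
accuracy = answers.count("correct") / len(answers)
print(f"accuracy: {accuracy:.0%}")                               # 57%
print(f"hallucination rate: {hallucination_rate(answers):.0%}")  # 86%
```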

"Higher benchmark ranking does not mean lower hallucination risk. GPT-5.5 illustrates this contradiction in stark terms."

Artificial Analysis, AA-Omniscience Benchmark, April 2026

EU Perspective: The AI Act Sets the Clock

The BankerToolBench findings arrive at exactly the right moment to alert companies to an upcoming regulatory obligation. On August 2, 2026, the full high-risk requirements of the EU AI Act take effect for Annex III systems. AI in financial services, credit scoring, and insurance falls explicitly within this category.

April 2026

BankerToolBench published

502 bankers confirm: no AI model delivers client-ready outputs in investment banking.

August 2, 2026

EU AI Act High-Risk Deadline

Full obligations for Annex III systems: quality management, technical documentation, risk management, logging, CE marking.

December 2027 (proposed)

Possible Extension

The EU Commission proposed an extension in November 2025. The proposal still requires approval by Parliament and Council.

| EU AI Act Requirement | Mandatory for High-Risk (Annex III) | Relevance for AI in Finance |
| --- | --- | --- |
| Quality management system | Yes, from August 2, 2026 | Documented development and operational processes |
| Technical documentation | Yes | Architecture, performance metrics, test protocols |
| Risk management | Yes | Ongoing identification and mitigation of quality risks |
| Accuracy and robustness | Yes | Measurable quality metrics, e.g. hallucination rate |
| Logging and records | Yes | Automated protocol generation for all system outputs |
| Human oversight | Yes | Expert review before forwarding results to clients |

Companies experimenting with AI in financial domains today should treat their quality measurements as part of future compliance documentation. Penalties for non-compliance reach up to 15 million euros or 3 percent of global annual turnover.

Challenges and Risks

BankerToolBench surfaces not just a technical but an organizational problem: enterprises that rely on marketing claims or generic benchmark rankings get no insight into the real deployment risk in specific knowledge domains. The consequences fall into four categories.

Hidden errors

Hardcoded values in Excel look correct. Only checking the formula layer or running scenario variations reveals the problem.

Domain blindness

A model can produce flawless general text and still fail systematically on financial formulas or clinical data.

Benchmark divergence

High performance in standard benchmarks does not correlate with reliability in specialized tasks. GPT-5.5 illustrates this with 86 percent hallucination despite top ranking.

Regulatory risk

Deploying AI in regulated domains without documented quality evidence creates a compliance liability that carries fines from August 2026.

At the same time, BankerToolBench demonstrates what a meaningful quality evaluation approach looks like: domain experts formulate realistic tasks, professionals serve as evaluators, outputs are assessed against 150 criteria, and consistency is measured across multiple runs. This is transferable to any knowledge-intensive industry from medicine to mechanical engineering.
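As a rough sketch of how such an expert-driven evaluation could be wired up internally (all task texts, criteria, and thresholds below are invented for illustration and are not taken from the BankerToolBench code):

```python
# Sketch of a domain-specific evaluation harness in the spirit described above:
# expert-written tasks, per-criterion checks, and consistency across runs.
# All names, criteria, and thresholds are illustrative assumptions.
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Criterion:
    description: str
    check: Callable[[str], bool]   # expert-defined check on the model output

@dataclass
class Task:
    prompt: str
    criteria: list[Criterion] = field(default_factory=list)

def evaluate(task: Task, outputs: list[str], pass_threshold: float = 0.9) -> dict:
    """Score each run against all criteria; the task only counts as passed
    if every run clears the threshold (consistency across runs)."""
    run_scores = []
    for output in outputs:
        met = sum(c.check(output) for c in task.criteria)
        run_scores.append(met / len(task.criteria))
    return {
        "run_scores": run_scores,
        "consistent_pass": all(score >= pass_threshold for score in run_scores),
    }

# Hypothetical usage with three runs of the same task:
task = Task(
    prompt="Build a three-statement model for the attached 10-K extract.",
    criteria=[
        Criterion("Revenue rows are formulas, not constants", lambda o: "=SUM" in o),
        Criterion("States its data source", lambda o: "10-K" in o),
    ],
)
print(evaluate(task, outputs=["=SUM(...) ... per the 10-K"] * 3))
```

Requiring every run to pass mirrors the consistency requirement in BankerToolBench, where the best model's rate drops from 16 to 13 percent once three consistent runs are demanded.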

Action plan

What Companies Should Do Now

Before putting AI into production in knowledge-intensive domains, companies need a domain-specific quality evaluation; generic benchmarks are not sufficient for this purpose.

Enterprises should develop their own evaluation matrices before deploying AI in regulated domains.
  1. Build domain-specific benchmarks

    Have subject matter experts from the affected department formulate realistic tasks and evaluate AI outputs, following the BankerToolBench approach. Plan at least 20 to 30 tasks with measurable quality criteria.

  2. Test for hallucination resistance

    Use benchmarks like AA-Omniscience as a pre-filter. Models with low hallucination rates, such as Claude Opus 4.7 at 36 percent or Grok 4.20 at 17 percent, are better suited for high-risk domains than benchmark leaders with high confabulation rates.

  3. Make Human-in-the-Loop mandatory

    All AI outputs in high-risk domains require expert review before forwarding, not as a temporary measure but as a process standard. Document the review process for EU AI Act compliance.

  4. Start EU AI Act compliance preparation now

    Begin documenting quality measurements, describe risk management processes, and plan logging architectures, even if the extension to 2027 materializes. Starting early means less effort and more documented evidence when the deadline arrives. A minimal sketch of such an audit record follows after this list.
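What a documented review step could look like, assuming AI outputs and reviewer verdicts are written to an append-only JSON Lines log: the sketch below is a suggestion, and all field names and values are illustrative assumptions rather than prescriptions from the AI Act text.

```python
# Sketch of a human-in-the-loop audit record in the spirit of steps 3 and 4:
# every AI output is stored with the reviewer's verdict so the review step
# is documentable for compliance purposes. Field names are assumptions.
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

@dataclass
class ReviewRecord:
    task_id: str
    model: str
    output_hash: str          # hash of the raw model output, not the output itself
    reviewer: str
    verdict: str              # e.g. "approved", "approved_with_edits", "rejected"
    notes: str
    reviewed_at: str

def log_review(record: ReviewRecord, path: str = "ai_review_log.jsonl") -> None:
    """Append one review decision as a JSON line (append-only audit trail)."""
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(asdict(record)) + "\n")

log_review(ReviewRecord(
    task_id="dcf-2026-001",
    model="example-model",
    output_hash="sha256:...",
    reviewer="j.doe",
    verdict="approved_with_edits",
    notes="Replaced hardcoded WACC cell with a formula before client delivery.",
    reviewed_at=datetime.now(timezone.utc).isoformat(),
))
```

Storing a hash rather than the full output keeps the audit trail small while still proving which artifact was reviewed and by whom.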

Client-ready without review: 0%
GPT-5.5 hallucination rate: 86%
EU AI Act high-risk deadline: August 2, 2026
Maximum fine: 15 million euros


Frequently Asked Questions

What is BankerToolBench?

BankerToolBench is an open-source benchmark from Handshake AI and McGill University that tests AI agents on 100 realistic investment banking tasks. 502 active and former bankers from Goldman Sachs, JPMorgan, and Morgan Stanley created the tasks and evaluated AI outputs against an average of 150 criteria per task.

Which AI model performs best in BankerToolBench?

GPT-5.4 achieved the best result with 16 percent acceptable outputs. With three consistent runs required, the rate drops to 13 percent. No model produced results that could be forwarded to clients without revision. Claude Opus 4.6 reached 9 percent, Gemini 2.5 Pro zero percent.

What does an 86 percent hallucination rate mean for GPT-5.5?

An 86 percent hallucination rate means that GPT-5.5, in 86 out of 100 cases where it does not know the answer, still provides a confident but incorrect response instead of admitting uncertainty. The AA-Omniscience benchmark penalizes this pattern. By comparison, Claude Opus 4.7 hallucinates at 36 percent and Grok 4.20 at 17 percent.

When does the EU AI Act high-risk deadline take effect?

On August 2, 2026, the full high-risk requirements of the EU AI Act take effect for Annex III systems, which explicitly includes AI in financial services. Penalties for non-compliance reach up to 15 million euros or 3 percent of global annual turnover. A possible extension to December 2027 was proposed by the EU Commission but has not yet been approved.

How can companies measure AI quality in their own operations?

The BankerToolBench approach is transferable: domain experts formulate 20 to 30 realistic tasks and evaluate AI outputs against measurable criteria. Benchmarks like AA-Omniscience serve as a pre-filter to identify models with low hallucination rates. All tests should be documented to serve as the foundation for EU AI Act compliance evidence.

Which industries are most affected by the BankerToolBench findings?

Directly affected are investment banking, corporate finance, and capital markets. The findings apply equally to all domains where AI produces complex professional documents: law, medicine, engineering, accounting. Wherever errors are discovered late and have direct consequences for clients, the same quality standards apply.