BankerToolBench 2026: When AI Agents Fail the Investment Banking Test
Handshake AI and McGill University demonstrate with the BankerToolBench benchmark that no current AI model reliably handles junior banker workloads. Meanwhile, GPT-5.5 posts an 86 percent hallucination rate despite leading the rankings. For enterprises considering AI in knowledge-intensive domains, the study provides a critical quality framework, especially ahead of the EU AI Act high-risk deadline on August 2, 2026.
The BankerToolBench benchmark, developed by Handshake AI and McGill University with 502 experienced investment bankers, shows that none of the nine tested AI models produce outputs ready for client delivery without revision. The best model, GPT-5.4, achieves just 16 percent acceptable results; 27 percent of all outputs are completely unusable. In parallel, the AA-Omniscience benchmark reveals that GPT-5.5 leads performance rankings but carries an 86 percent hallucination rate, compared to 36 percent for Claude Opus 4.7. Enterprises deploying AI in regulated or knowledge-intensive domains need domain-specific quality benchmarks and must implement Human-in-the-Loop as a process standard, particularly before the EU AI Act high-risk deadline on August 2, 2026.
What BankerToolBench Found
No current AI model handles everyday investment banking tasks reliably enough for client contact. That is the central finding of the BankerToolBench benchmark, published in April 2026 on arXiv (paper 2604.11304) by Handshake AI and McGill University. The study involved 502 active and former bankers from Goldman Sachs, JPMorgan, Morgan Stanley, and Evercore, who created 100 tasks, produced reference outputs, and evaluated AI results against an average of 150 criteria per task.
The tasks reflect standard junior banker work: Excel financial models with working formulas, PowerPoint pitch decks, structured reports based on SEC filings and market data. Each task takes a human an average of 5 hours, with some reaching 21 hours. Not a single model passed unscathed.
41 percent of all AI outputs required comprehensive reworking. Only 13 percent needed minor corrections. Not a single result could be forwarded to clients without revision.
Where AI Models Fail: Concrete Failure Patterns
The failure analysis shows that AI models do not simply perform slowly or incompletely; they fail in ways that are hard to detect at first glance. That is the core business risk. Claude Opus 4.6 produced visually polished spreadsheets in which key variables were hardcoded as fixed values rather than calculated formulas, making scenario analysis structurally impossible.
Debt capital market models, merger calculations, and capital structure tables proved especially failure-prone. Gemini 2.5 Pro achieved a pass rate of zero percent, producing no usable output. The AI verifier "Gandalf" agreed with human reviewers in 88.2 percent of cases, confirming the validity of the evaluation rubric.
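The hardcoded-value failure mode is mechanical to detect once you inspect the formula layer rather than the rendered values. The sketch below illustrates the idea with cells held in a plain dict; in a real spreadsheet library such as openpyxl, the same check would look at whether a cell carries an "=..." formula string. The cell references and the list of cells expected to be derived are illustrative assumptions, not part of the benchmark.

```python
# Sketch: flag cells that should hold formulas but contain literal values.
# Cells are modeled as a dict of reference -> value; a formula cell holds
# its "=..." string, a hardcoded cell holds a bare number.
def find_hardcoded(cells, derived_cells):
    """Return the derived cells whose value is a literal, not a formula."""
    flagged = []
    for ref in derived_cells:
        value = cells.get(ref)
        is_formula = isinstance(value, str) and value.startswith("=")
        if not is_formula:
            flagged.append(ref)
    return flagged

cells = {
    "B2": 1_000_000,    # revenue input: a literal here is fine
    "B3": "=B2*0.35",   # margin line: formula, as expected
    "B4": 280_000,      # tax line: hardcoded -- scenario analysis breaks here
}
print(find_hardcoded(cells, derived_cells=["B3", "B4"]))  # ['B4']
```

Running a check like this against the model's output template is exactly the kind of formula-layer inspection that visual review misses.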
The Hallucination Paradox: GPT-5.5 Leads but Hallucinates Most
The quality problem extends beyond specific task types. It also appears at benchmark level in a counterintuitive pattern: the current front-runner among AI models simultaneously carries one of the highest hallucination rates. The AA-Omniscience benchmark from Artificial Analysis measures factual knowledge across more than 40 subject areas and penalizes wrong answers more heavily than admitting "I don't know."
GPT-5.5 achieves the highest accuracy at 57 percent, but also the highest hallucination rate at 86 percent. It frequently answers even when it does not know the correct answer.
For high-risk applications in finance, law, or medicine, a confabulation pattern is more dangerous than lower overall accuracy combined with honest uncertainty signaling. A model that gets 57 percent of questions right but confidently fabricates answers in 86 percent of cases where it lacks knowledge creates a hard-to-detect error risk.
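The scoring logic can be made concrete. The sketch below uses illustrative weights (+1 correct, 0 abstain, -1 wrong) to capture the AA-Omniscience idea of penalizing confident wrong answers more heavily than abstention; the weights and the example answer distributions are assumptions for illustration, not the benchmark's published formula.

```python
# Sketch of a scoring rule that rewards abstention over fabrication.
# Weights are illustrative assumptions, not AA-Omniscience's exact formula.
def score_answers(answers):
    """answers: list of 'correct', 'wrong', or 'abstain' labels."""
    weights = {"correct": 1.0, "abstain": 0.0, "wrong": -1.0}
    return sum(weights[a] for a in answers) / len(answers)

def hallucination_rate(answers):
    """Share of wrong answers among the not-correct cases: how often the
    model fabricates instead of admitting uncertainty."""
    not_correct = [a for a in answers if a != "correct"]
    if not not_correct:
        return 0.0
    return sum(a == "wrong" for a in not_correct) / len(not_correct)

# A model that answers 57 of 100 correctly but fabricates on 37 of the
# remaining 43 (an 86 percent hallucination rate) scores worse under this
# rule than a more cautious model with lower raw accuracy.
confident = ["correct"] * 57 + ["wrong"] * 37 + ["abstain"] * 6
cautious = ["correct"] * 50 + ["wrong"] * 18 + ["abstain"] * 32

print(score_answers(confident))        # 0.2
print(score_answers(cautious))         # 0.32
print(hallucination_rate(cautious))    # 0.36
```

Under a penalty-weighted score, the cautious model overtakes the confident one despite answering fewer questions correctly, which is the paradox the benchmark makes visible.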
"Higher benchmark ranking does not mean lower hallucination risk. GPT-5.5 illustrates this contradiction in stark terms."
Artificial Analysis, AA-Omniscience Benchmark, April 2026

EU Perspective: The AI Act Sets the Clock
The BankerToolBench findings arrive at exactly the right moment to alert companies to an upcoming regulatory obligation. On August 2, 2026, the full high-risk requirements of the EU AI Act take effect for Annex III systems. AI in financial services, credit scoring, and insurance falls explicitly within this category.
BankerToolBench published
502 bankers confirm: no AI model delivers client-ready outputs in investment banking.
EU AI Act High-Risk Deadline
Full obligations for Annex III systems: quality management, technical documentation, risk management, logging, CE marking.
Possible Extension
The EU Commission proposed an extension in November 2025. The proposal still requires approval by Parliament and Council.
| EU AI Act Requirement | Mandatory for High-Risk (Annex III) | Relevance for AI in Finance |
|---|---|---|
| Quality management system | Yes, from August 2, 2026 | Documented development and operational processes |
| Technical documentation | Yes | Architecture, performance metrics, test protocols |
| Risk management | Yes | Ongoing identification and mitigation of quality risks |
| Accuracy and robustness | Yes | Measurable quality metrics, e.g. hallucination rate |
| Logging and records | Yes | Automated protocol generation for all system outputs |
| Human oversight | Yes | Expert review before forwarding results to clients |
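The logging and human-oversight rows in the table translate naturally into an audit trail that records every model output together with the reviewer verdict before anything reaches a client. The sketch below shows one minimal shape for such a trail; the field names, digesting scheme, and release rule are assumptions for illustration, not requirements prescribed by the AI Act text.

```python
# Minimal sketch of an audit trail for AI outputs: every output is
# logged, and release requires a named expert's approval. Field names
# and the release rule are illustrative assumptions.
import hashlib
from datetime import datetime, timezone

class OutputLog:
    def __init__(self):
        self.records = []

    def record(self, task_id, model, output, reviewer=None, approved=False):
        entry = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "task_id": task_id,
            "model": model,
            # store a digest rather than raw client data
            "output_digest": hashlib.sha256(output.encode()).hexdigest(),
            "reviewer": reviewer,
            "approved": approved,
        }
        self.records.append(entry)
        return entry

    def releasable(self, task_id):
        """An output may reach the client only if a named expert approved it."""
        return any(r["task_id"] == task_id and r["approved"] and r["reviewer"]
                   for r in self.records)

log = OutputLog()
log.record("dcf-001", "model-a", "=B2*1.05", reviewer="analyst_a", approved=True)
log.record("dcf-002", "model-a", "280000")  # logged, but never reviewed
print(log.releasable("dcf-001"))  # True
print(log.releasable("dcf-002"))  # False
```

The point of the sketch is the invariant, not the storage: every output is logged whether or not it is released, and release is impossible without a recorded reviewer.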
Companies experimenting with AI in financial domains today should treat their quality measurements as part of future compliance documentation. Penalties for non-compliance reach up to 15 million euros or 3 percent of global annual turnover.
Challenges and Risks
BankerToolBench surfaces not just a technical but an organizational problem: enterprises that rely on marketing claims or generic benchmark rankings get no insight into the real deployment risk in specific knowledge domains. The consequences fall into four categories.
Hidden errors
Hardcoded values in Excel look correct. Only checking the formula layer or running scenario variations reveals the problem.
Domain blindness
A model can produce flawless general text and still fail systematically on financial formulas or clinical data.
Benchmark divergence
High performance in standard benchmarks does not correlate with reliability in specialized tasks. GPT-5.5 illustrates this with 86 percent hallucination despite top ranking.
Regulatory risk
Deploying AI in regulated domains without documented quality evidence creates a compliance liability that carries fines from August 2026.
At the same time, BankerToolBench demonstrates what a meaningful quality evaluation approach looks like: domain experts formulate realistic tasks, professionals serve as evaluators, outputs are assessed against 150 criteria, and consistency is measured across multiple runs. This is transferable to any knowledge-intensive industry from medicine to mechanical engineering.
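The consistency requirement in this evaluation approach is worth making explicit: a task counts as passed only if every one of several independent runs meets the quality bar, which is why pass rates drop when consistency is demanded. The sketch below illustrates the mechanics; the run data is made up for illustration.

```python
# Sketch of a consistency-gated pass rate: a task passes only if all
# k independent runs meet the quality bar. Run data is illustrative.
def single_run_rate(run_results):
    """Pass rate if only the first run of each task is counted."""
    return sum(runs[0] for runs in run_results.values()) / len(run_results)

def consistent_pass_rate(run_results, k=3):
    """run_results: {task_id: [bool, ...]} with at least k runs per task."""
    passed = sum(all(runs[:k]) for runs in run_results.values())
    return passed / len(run_results)

runs = {
    "t1": [True, True, True],
    "t2": [True, True, False],   # passes once, fails the consistency bar
    "t3": [False, False, False],
    "t4": [True, True, True],
}
print(single_run_rate(runs))       # 0.75
print(consistent_pass_rate(runs))  # 0.5
```

This is the same effect BankerToolBench reports at scale: the best model's 16 percent acceptance drops to 13 percent once three consistent runs are required.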
What Companies Should Do Now
Before putting AI into production in knowledge-intensive domains, companies need a domain-specific quality evaluation. Generic benchmarks are not sufficient for this purpose.
- Build domain-specific benchmarks: Have subject matter experts from the affected department formulate realistic tasks and evaluate AI outputs, following the BankerToolBench approach. Plan at least 20 to 30 tasks with measurable quality criteria.
- Test for hallucination resistance: Use benchmarks like AA-Omniscience as a pre-filter. Models with low hallucination rates, such as Claude Opus 4.7 at 36 percent or Grok 4.20 at 17 percent, are better suited for high-risk domains than benchmark leaders with high confabulation rates.
- Make Human-in-the-Loop mandatory: All AI outputs in high-risk domains require expert review before forwarding, not as a temporary measure but as a process standard. Document the review process for EU AI Act compliance.
- Start EU AI Act compliance preparation now: Begin documenting quality measurements, describe risk management processes, and plan logging architectures, even if the extension to 2027 materializes. Starting early means less effort and more documented evidence when the deadline arrives.
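For the first step, building an in-house benchmark, it helps to pin down a data structure early. The sketch below shows one possible shape for an expert-authored task with measurable criteria and a multi-run evaluation; all field names and the scoring rule are assumptions for illustration, not a published schema.

```python
# One possible shape for a domain-specific benchmark task: an expert-
# written prompt, measurable pass/fail criteria, and evaluation over
# several runs. Field names and scoring rule are illustrative.
from dataclasses import dataclass

@dataclass
class BenchmarkTask:
    task_id: str
    prompt: str          # written by a subject matter expert
    criteria: list       # measurable pass/fail checks on the output
    reference_output: str = ""
    min_consistent_runs: int = 3

def evaluate(task, outputs):
    """Fraction of criteria met per run; a task scores its worst run."""
    scores = []
    for out in outputs[: task.min_consistent_runs]:
        met = sum(check(out) for check in task.criteria)
        scores.append(met / len(task.criteria))
    return min(scores)

task = BenchmarkTask(
    task_id="ib-001",
    prompt="Build a 5-year DCF with linked formulas",
    criteria=[lambda o: "=NPV" in o, lambda o: "hardcoded" not in o],
)
print(evaluate(task, ["=NPV(...)"] * 3))  # 1.0
```

Scoring a task by its worst run enforces the same consistency discipline the benchmark applies, and the serialized task records double as compliance documentation.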
Frequently Asked Questions
**What is BankerToolBench?**

BankerToolBench is an open-source benchmark from Handshake AI and McGill University that tests AI agents on 100 realistic investment banking tasks. 502 active and former bankers from Goldman Sachs, JPMorgan, and Morgan Stanley created the tasks and evaluated AI outputs against an average of 150 criteria per task.
**How did the tested models perform?**

GPT-5.4 achieved the best result with 16 percent acceptable outputs. With three consistent runs required, the rate drops to 13 percent. No model produced results that could be forwarded to clients without revision. Claude Opus 4.6 reached 9 percent, Gemini 2.5 Pro zero percent.
**What does an 86 percent hallucination rate mean?**

An 86 percent hallucination rate means that GPT-5.5, in 86 out of 100 cases where it does not know the answer, still provides a confident but incorrect response instead of admitting uncertainty. The AA-Omniscience benchmark penalizes this pattern. By comparison, Claude Opus 4.7 hallucinates at 36 percent and Grok 4.20 at 17 percent.
**What does the EU AI Act require from August 2026?**

On August 2, 2026, the full high-risk requirements of the EU AI Act take effect for Annex III systems, which explicitly includes AI in financial services. Penalties for non-compliance reach up to 15 million euros or 3 percent of global annual turnover. A possible extension to December 2027 was proposed by the EU Commission but has not yet been approved.
**How can companies evaluate AI quality in their own domain?**

The BankerToolBench approach is transferable: domain experts formulate 20 to 30 realistic tasks and evaluate AI outputs against measurable criteria. Benchmarks like AA-Omniscience serve as a pre-filter to identify models with low hallucination rates. All tests should be documented to serve as the foundation for EU AI Act compliance evidence.
**Which industries do the findings affect?**

Directly affected are investment banking, corporate finance, and capital markets. The findings apply equally to all domains where AI produces complex professional documents: law, medicine, engineering, accounting. Wherever errors are discovered late and have direct consequences for clients, the same quality standards apply.