Qwen3.7-Max: 35 Hours of Autonomous Coding and Alibaba's Vertical AI Strategy
At the Alibaba Cloud Summit in Hangzhou on May 20, 2026, Alibaba presented a result that sets a new reference point for autonomous AI coding agents: over 35 hours, Qwen3.7-Max made 1,158 tool calls and sped up a chip kernel by a factor of 10, without a single human intervention.
Alibaba's Qwen3.7-Max is the first AI coding agent publicly shown running autonomously for 35 hours, making 1,158 tool calls and accelerating software for a self-developed AI chip by a factor of 10. The model scores 80.4 on SWE-Verified, statistically tied with Claude Opus 4.6 Max (80.8), but it is available only via API, and the 35-hour run has not been independently verified. For European enterprises, using Chinese AI APIs requires a GDPR Data Protection Impact Assessment and legal analysis under China's 2017 National Intelligence Law before any production deployment.
An AI Agent Running 35 Hours Without Intervention
At its Cloud Summit on May 20-21, 2026, in Hangzhou, Alibaba presented a result not previously demonstrated publicly for coding agents: Qwen3.7-Max ran autonomously for 35 hours, made 1,158 tool calls, and optimized the operating software of a self-developed chip by a factor of 10. Alibaba frames this not as a laboratory benchmark but as a practical demonstration of so-called long-horizon tasks.
Alibaba launched three products simultaneously: Qwen3.7-Max as the language model, the Zhenwu M890 chip, and the Panjiu AL128 rack-scale system with 128 accelerators. The figures from the 35-hour run are Alibaba's own first-party claims; independent reproductions have not yet been published.
The 35-hour run is not a benchmark score but a demonstration of the capacity for long-horizon reasoning: an autonomous decision chain that runs for hours. This is structurally new, even if the specific figures still await external verification.
Vertical Integration: Chip, Model, and Server From One Source
Alibaba is pursuing an approach that Western providers have only partially achieved: complete control over the processor, language model, and server infrastructure. The Zhenwu M890 carries 144 GB of HBM3 memory, 50% more than its predecessor, and achieves 800 GB/s chip-to-chip bandwidth. Combined with Qwen3.7-Max, this enables hardware-software optimizations that are structurally impossible for models that run on third-party Nvidia hardware.
The Panjiu AL128 rack system links 128 M890 units. Qwen3.7-Max natively supports the most widely used agent frameworks: OpenClaw, Hermes Agent, Claude Code, Qwen Paw, and Coder. Even after partial easing of US export restrictions in 2025, Alibaba continues this chip development track, not as a workaround but as a durable strategic advantage.
| Component | Product | Key specs |
|---|---|---|
| AI Chip | Zhenwu M890 | 144 GB HBM3, 800 GB/s inter-chip bandwidth, 3x the performance of its predecessor (Alibaba claim) |
| Rack System | Panjiu AL128 | 128 M890 accelerators linked |
| Language Model | Qwen3.7-Max | 1M token context, API-only |
Benchmarks: Strong, But No Clear Lead
On standardized benchmarks, Qwen3.7-Max delivers results on par with leading Western models without clearly outperforming them. More relevant for enterprise use is a separate weakness: factual reliability has declined compared to the predecessor, which can create blockers in agent workflows that require decisions.
On the LM Arena leaderboard, Qwen3.7-Max ranks 13th globally, 7th in math, 9th in coding. On SWE-Verified, the most widely cited benchmark for real-world software engineering tasks, performance is statistically identical to Claude Opus 4.6 Max. An attempt rate of only 48% on factual queries means the model declines to answer rather than hallucinate in more than half of uncertain factual cases. This improves safety but creates blockers in workflows that require a decision.
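To make "statistically tied" concrete, here is a rough sketch of a two-proportion z-test on the two SWE-Verified scores quoted in this article (80.4 and 80.8). The task count of 500 is an assumption for illustration, not a figure from Alibaba. The test also treats the two runs as independent samples, which is a simplification: both models are scored on the same tasks, so a paired test on per-task results would be more appropriate if that data were available.

```python
import math


def two_proportion_z(p1: float, p2: float, n1: int, n2: int) -> float:
    """z statistic for the difference between two independent proportions."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / standard_error


# Scores are taken from the article. N_TASKS is an ASSUMPTION for illustration.
N_TASKS = 500
z = two_proportion_z(0.808, 0.804, N_TASKS, N_TASKS)

# |z| far below 1.96 means the gap is not significant at the 5% level.
print(f"z = {z:.2f}, significant at 5%: {abs(z) > 1.96}")
```

Under these assumptions a 0.4-point gap is well inside the noise, so the two scores give no basis for ranking one model above the other.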
On the Artificial Analysis Intelligence Index, Qwen3.7-Max scores 57, behind GPT-5.5 (60.2) and Claude Opus 4.7 (57.3).
European Perspective
For European enterprises, Qwen3.7-Max is a technically interesting but legally complex model to evaluate. Two risk dimensions require attention that do not apply to European providers in the same way.
GDPR Assessment
Any API use by a Chinese provider requires a Data Protection Impact Assessment (DPIA) and a documented legal basis for third-country transfers under GDPR Art. 44 ff. before going live with personal data.
EU AI Act
High-risk applications in finance, healthcare, or critical infrastructure require verifiable human oversight. Agents running for 35 hours need defined checkpoints, termination conditions, and audit trails built into the architecture.
Entry-Level Alternative
Qwen3.6 is available with open weights and known pricing (from $1.30 per million input tokens). That is the safer current choice until Qwen3.7-Max API terms and prices are published.
The EU AI Act requires verifiable human oversight for high-risk applications. An AI agent that runs for 35 hours is compatible with this requirement only if termination conditions, escalation paths, and audit logs are embedded in the system architecture from the start. That is a design requirement, not a compliance checkbox.
Within the competitive landscape, Kimi K2.6 and now Qwen3.7-Max show that Chinese AI companies are systematically challenging Western models on agentic benchmarks. Simultaneously, DeepSeek is assembling a team to build a harness tool competing directly with Claude Code and OpenAI's Codex.
Challenges and Risks
The impressive figures from the Alibaba launch deserve both respect and careful scrutiny. Three limitations are critical for a realistic assessment.
No External Replication
The 35-hour run is documented exclusively by Alibaba. External researchers have not yet reproduced the figures. This is not a reason to dismiss the result, but it is not proof of generally available performance either.
No Open Weights, Unknown Pricing
Qwen3.7-Max is not available for local deployment. API pricing was not announced at launch. Agentic workflows with thousands of tool calls can become very expensive without accurate cost modeling in advance.
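A back-of-the-envelope cost model shows why this matters. The sketch below uses the 1,158 tool calls from Alibaba's run and, as a stand-in, the Qwen3.6 input price of $1.30 per million tokens mentioned in this article, since Qwen3.7-Max pricing is unpublished. The per-call token counts and the output price are pure assumptions and should be replaced with measured values from your own workloads.

```python
from dataclasses import dataclass


@dataclass
class RunEstimate:
    """Rough cost estimate for one agent run; all rates are per million tokens."""

    tool_calls: int
    input_tokens_per_call: int
    output_tokens_per_call: int
    usd_per_m_input: float
    usd_per_m_output: float

    def cost_usd(self) -> float:
        input_cost = self.tool_calls * self.input_tokens_per_call * self.usd_per_m_input / 1e6
        output_cost = self.tool_calls * self.output_tokens_per_call * self.usd_per_m_output / 1e6
        return input_cost + output_cost


estimate = RunEstimate(
    tool_calls=1158,                # from the article
    input_tokens_per_call=50_000,   # ASSUMPTION: context re-sent on each call
    output_tokens_per_call=2_000,   # ASSUMPTION
    usd_per_m_input=1.30,           # Qwen3.6 price from the article, used as a stand-in
    usd_per_m_output=5.00,          # ASSUMPTION: hypothetical output price
)
print(f"Estimated cost per run: ${estimate.cost_usd():.2f}")
```

Because long-running agents typically resend a growing context with every call, input tokens tend to dominate the bill, so the per-call input figure is the assumption most worth measuring.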
48% Attempt Rate on Factual Queries
The model prefers refusing to answer over hallucinating under uncertainty. This reduces error rates but creates blockers in decision-requiring workflows. For many enterprise processes, this is a critical point to test before deployment.
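How quickly refusals compound can be illustrated with simple arithmetic. The sketch below assumes, purely for illustration, that every step in a workflow is an independent factual query with the same 48% attempt rate reported above. Real workflows will differ, because their queries are not drawn from the benchmark's distribution.

```python
ATTEMPT_RATE = 0.48  # attempt rate on factual queries, as reported in the article


def p_all_answered(steps: int, attempt_rate: float = ATTEMPT_RATE) -> float:
    """Probability that none of `steps` queries is refused.

    ASSUMPTION: refusals are independent and equally likely at every step.
    """
    return attempt_rate ** steps


for steps in (1, 3, 5):
    print(f"{steps} factual steps: {p_all_answered(steps):.1%} chance of no refusal")
```

Under this simplified model, a chain of three dependent factual lookups completes without a refusal only about one time in nine, which is why fallback paths or human escalation belong in the test plan.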
A model that declines to answer more than half of all factual queries is structurally unsuitable for high-decision-density workflows, even if the answers it does give are rarely wrong.
innobu analysis, May 2026
What Companies Should Do Now
The Qwen3.7-Max launch is a prompt to clarify two fundamental questions about your agentic AI strategy: How long can an autonomous agent run in your environment? And what vendor legal review have you completed for non-EU services?
Define Maximum Agent Run Times
35 hours without human oversight is not acceptable in most European enterprise environments. Establish checkpoint intervals, termination conditions, and escalation paths before deploying agents in production. The EU AI Act makes this mandatory for high-risk applications.
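One way to build checkpoint intervals, termination conditions, and an audit trail into an agent loop is a small guard object that is consulted before every tool call. The following is a minimal sketch, not a compliance-ready implementation. `RunGuard`, its field names, and the `approve` callback are hypothetical names introduced here, and the limits are placeholders.

```python
import json
import time
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class RunGuard:
    """Hypothetical guard that enforces limits and records an audit trail."""

    max_runtime_s: float
    max_tool_calls: int
    checkpoint_every: int
    approve: Callable[[int], bool]  # human or policy decision at each checkpoint
    clock: Callable[[], float] = time.monotonic
    audit_log: list = field(default_factory=list)
    started_at: float = field(default=0.0, init=False)
    calls: int = field(default=0, init=False)

    def __post_init__(self) -> None:
        self.started_at = self.clock()

    def _audit(self, event: str, **details) -> None:
        elapsed = round(self.clock() - self.started_at, 3)
        self.audit_log.append(json.dumps({"t": elapsed, "event": event, **details}))

    def before_tool_call(self, tool_name: str) -> Optional[str]:
        """Return a stop reason, or None if the call may proceed."""
        at_checkpoint = self.calls > 0 and self.calls % self.checkpoint_every == 0
        if self.clock() - self.started_at > self.max_runtime_s:
            reason = "max_runtime"
        elif self.calls >= self.max_tool_calls:
            reason = "max_tool_calls"
        elif at_checkpoint and not self.approve(self.calls):
            reason = "checkpoint_rejected"
        else:
            self.calls += 1
            self._audit("tool_call", tool=tool_name, n=self.calls)
            return None
        self._audit("stop", reason=reason)
        return reason


# Usage: the reviewer approves the first checkpoint and rejects the second.
guard = RunGuard(max_runtime_s=4 * 3600, max_tool_calls=200,
                 checkpoint_every=25, approve=lambda n: n < 50)
stop = None
while stop is None:
    stop = guard.before_tool_call("run_tests")
print(stop, guard.calls)
```

In a real deployment the `approve` callback would block on a human decision or an escalation policy, and the audit log would go to append-only storage rather than an in-memory list.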
Complete a DPIA and Third-Country Analysis
Before integrating any non-European provider's API, complete a structured Data Protection Impact Assessment and document a legal basis for data transfer. Chinese providers require particular attention due to the 2017 National Intelligence Law.
Use Qwen3.6 for Initial Testing
Open weights, known pricing, no API dependency risk. This is the practically deployable Qwen option today. Hold off on migrating to Qwen3.7-Max until its API pricing and terms are published.
Validate Benchmarks on Your Own Codebases
The SWE-Verified parity with Claude Opus 4.6 Max indicates equivalence, not superiority. Test on your actual tasks, not published benchmarks. Alibaba's demonstration figures come from a controlled Alibaba-specific scenario.
Build Model Portability Into Your Architecture
API-only models can change pricing, access terms, and availability at any time. Design model exchangeability into your agentic architecture from the start. This applies broadly but especially to non-European providers.
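Model exchangeability usually comes down to having agent code depend on a narrow interface rather than on a vendor SDK. The sketch below shows one possible shape. `ChatModel`, `ModelRouter`, and the two stand-in backends are hypothetical names invented for this example, and no real vendor API is called.

```python
from dataclasses import dataclass
from typing import Protocol


class ChatModel(Protocol):
    """The only surface the agent code is allowed to depend on."""

    name: str

    def complete(self, prompt: str) -> str: ...


@dataclass
class EchoModel:
    """Stand-in backend; a real adapter would wrap a vendor SDK here."""

    name: str

    def complete(self, prompt: str) -> str:
        return f"[{self.name}] {prompt}"


@dataclass
class FailingModel:
    """Stand-in for a backend whose API is unavailable or whose terms changed."""

    name: str

    def complete(self, prompt: str) -> str:
        raise ConnectionError("backend unavailable")


class ModelRouter:
    """Tries backends in order and falls back to the next one on failure."""

    def __init__(self, backends: list[ChatModel]) -> None:
        if not backends:
            raise ValueError("at least one backend is required")
        self.backends = backends

    def complete(self, prompt: str) -> str:
        errors = []
        for backend in self.backends:
            try:
                return backend.complete(prompt)
            except Exception as exc:  # broad on purpose: any failure triggers fallback
                errors.append(f"{backend.name}: {exc}")
        raise RuntimeError("all backends failed: " + "; ".join(errors))


router = ModelRouter([FailingModel("primary"), EchoModel("fallback")])
print(router.complete("summarize the diff"))
```

With this structure, swapping providers means writing one new adapter class. Prompts and tool schemas may still need per-model tuning, which the interface alone does not solve.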
Frequently Asked Questions
What is Qwen3.7-Max?
Qwen3.7-Max is Alibaba's flagship language model for the agentic era, unveiled on May 20, 2026, at the Alibaba Cloud Summit in Hangzhou. It features a 1-million-token context window, is designed for autonomous long-horizon tasks, and in an internal test ran for 35 hours without human intervention, making 1,158 tool calls.
What did the 35-hour autonomous run achieve?
In Alibaba's internal test, Qwen3.7-Max ran autonomously for 35 hours, made 1,158 tool calls and 432 kernel evaluations, and optimized the Extend Attention kernel of the new Zhenwu M890 chip by a factor of 10 (geometric mean). These figures are Alibaba's first-party claims and have not yet been independently reproduced.
How does Qwen3.7-Max compare with Western models on benchmarks?
On SWE-Verified, the leading software engineering benchmark, Qwen3.7-Max scores 80.4, statistically tied with Claude Opus 4.6 Max (80.8). On the Artificial Analysis Intelligence Index it scores 57, behind GPT-5.5 (60.2) and Claude Opus 4.7 (57.3). On GPQA Diamond (92.4 vs. 91.3) and HLE (41.4 vs. 40.0) it scores marginally higher than Claude Opus 4.6.
What risks should European companies consider?
There are two main risks. First, China's National Intelligence Law (2017, Art. 7) obliges Chinese companies to support state intelligence agencies, which may affect international API customers. Second, GDPR requires a Data Protection Impact Assessment and a legal basis for third-country transfers (Art. 44 ff.) before any production deployment involving personal data.
Is Qwen3.7-Max available with open weights?
No. Qwen3.7-Max is exclusively available via API; model weights are not publicly released. For local deployment, Qwen3.6 is the current option with open weights and known pricing (from $1.30 per million input tokens via OpenRouter).
What is the Zhenwu M890?
The Zhenwu M890 is Alibaba's new AI accelerator, developed by its chip subsidiary T-Head. It carries 144 GB of HBM3 memory (50% more than the predecessor), achieves 800 GB/s inter-chip bandwidth, and Alibaba claims it delivers three times the performance of the previous Zhenwu 810E. Combined with Qwen3.7-Max, it enables hardware-software co-optimization without third-party processors.