Qwen3.7-Max: 35 Hours of Autonomous Coding and Alibaba's Vertical AI Strategy

Alibaba's new flagship model ran for 35 hours without human intervention, writing software for a custom-built chip along the way

At the Alibaba Cloud Summit in Hangzhou on May 20, 2026, Alibaba presented a result that sets a new reference point for autonomous AI coding agents: over 35 hours, Qwen3.7-Max made 1,158 tool calls and sped up a chip kernel by a factor of 10, all without human intervention.

Summary

Alibaba's Qwen3.7-Max is the first publicly demonstrated AI coding agent to run autonomously for 35 hours, making 1,158 tool calls and accelerating software for an in-house AI chip by a factor of 10. The model scores 80.4 on SWE-Verified, statistically tied with Claude Opus 4.6 Max (80.8), but it is available only via API, and the 35-hour run has not been independently verified. For European enterprises, using Chinese AI APIs requires a GDPR Data Protection Impact Assessment and legal analysis under China's 2017 National Intelligence Law before any production deployment.

An AI Agent Running 35 Hours Without Intervention

At its Cloud Summit on May 20-21, 2026, in Hangzhou, Alibaba presented a result not previously demonstrated publicly for coding agents: Qwen3.7-Max ran autonomously for 35 hours, made 1,158 tool calls, and optimized the operating software of a self-developed chip by a factor of 10. Alibaba positions this not as a laboratory benchmark but as a practical demonstration of so-called long-horizon tasks.

Long-horizon tasks are tasks requiring hundreds or thousands of sequential decision steps before a result is produced. Classical AI assistant interactions take seconds to minutes; long-horizon tasks can span hours or days.

35

Hours of continuous autonomous execution

1,158

Tool calls in the test run

10x

Speedup on the Extend Attention kernel

432

Kernel evaluations, 5 architecture redesigns

Alibaba launched three products simultaneously: Qwen3.7-Max as the language model, the Zhenwu M890 chip, and the Panjiu AL128 rack-scale system with 128 accelerators. The figures from the 35-hour run are Alibaba's own first-party claims; independent reproductions have not yet been published.

Key Insight

The 35-hour run is not a benchmark score but a demonstration of the capacity for long-horizon reasoning: an autonomous decision chain that runs for hours. This is structurally new, even if the specific figures still await external verification.

Vertical Integration: Chip, Model, and Server From One Source

Alibaba is pursuing an approach that Western providers have only partially achieved: complete control over the processor, language model, and server infrastructure. The Zhenwu M890 carries 144 GB of HBM3 memory, 50% more than its predecessor, and achieves 800 GB/s chip-to-chip bandwidth. Combined with Qwen3.7-Max, this enables hardware-software optimizations that are structurally impossible for models running on third-party Nvidia hardware.

Vertical Integration (Alibaba)

Chip, model, and server from one source

Hardware-software co-optimization possible

No dependency on Nvidia export licenses

Long-term competitive advantage within China

Model on Third-Party Chips (Standard)

Chip and model from separate vendors

Optimization limited to software layer

Subject to export and procurement rules

Broader tool ecosystem compatibility

The Panjiu AL128 rack system links 128 M890 units. Qwen3.7-Max natively supports the most widely used agent frameworks: OpenClaw, Hermes Agent, Claude Code, Qwen Paw, and Coder. Even after partial easing of US export restrictions in 2025, Alibaba continues this chip development track, not as a workaround but as a durable strategic advantage.

Component	Product	Key specs
AI Chip	Zhenwu M890	144 GB HBM3, 800 GB/s inter-chip bandwidth, 3x performance vs. predecessor Zhenwu 810E (Alibaba claim)
Rack System	Panjiu AL128	128 M890 accelerators linked
Language Model	Qwen3.7-Max	1M token context, API-only

Benchmarks: Strong, But No Clear Lead

On standardized benchmarks, Qwen3.7-Max delivers results on par with leading Western models without clearly outperforming them. More relevant for enterprise use is a separate weakness: factual reliability has declined compared to the predecessor, which can create blockers in agent workflows that require decisions.

Benchmark comparisons between Qwen3.7-Max, Claude Opus 4.6 Max, and GPT-5.5 show parity on SWE-Verified, the leading software engineering measure.

80.4

SWE-Verified (Claude Opus 4.6 Max: 80.8)

92.4

GPQA Diamond (Claude Opus 4.6: 91.3)

57

Artificial Analysis Intelligence Index (GPT-5.5: 60.2)

48%

Factual attempt rate (lowest among frontier models)

On the LM Arena leaderboard, Qwen3.7-Max ranks 13th globally, 7th in math, 9th in coding. On SWE-Verified, the most widely cited benchmark for real-world software engineering tasks, performance is statistically identical to Claude Opus 4.6 Max. An attempt rate of only 48% on factual queries means the model declines to answer rather than hallucinate in more than half of uncertain factual cases. This improves safety but creates blockers in workflows that require a decision.
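The "statistically identical" claim can be checked with simple arithmetic. The sketch below assumes that "SWE-Verified" refers to SWE-bench Verified with its 500 tasks, and it treats the two scores as independent binomial samples, which is a simplification (both models are scored on the same tasks). Under those assumptions the 0.4-point gap is far smaller than the sampling noise of roughly 2.5 points.

```python
import math


def pass_rate_std_error(score_pct: float, n_tasks: int) -> float:
    """Binomial standard error of a pass rate, in percentage points."""
    p = score_pct / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n_tasks)


# Assumption: the benchmark is SWE-bench Verified, which contains 500 tasks.
N_TASKS = 500

qwen, claude = 80.4, 80.8  # scores quoted in this article

se_qwen = pass_rate_std_error(qwen, N_TASKS)
se_claude = pass_rate_std_error(claude, N_TASKS)

# Standard error of the difference, treating the two runs as independent.
se_diff = math.sqrt(se_qwen**2 + se_claude**2)
gap = claude - qwen

print(f"gap = {gap:.1f} points, standard error of gap = {se_diff:.1f} points")
print("within noise" if abs(gap) < 1.96 * se_diff else "significant")
```

A paired comparison on per-task results would be more precise, but the per-task data is not published in the figures cited here.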

Qwen3.7-Max achieves an Intelligence Index score of 57, behind GPT-5.5 (60.2) and Claude Opus 4.7 (57.3).

Artificial Analysis, May 2026

European Perspective

For European enterprises, Qwen3.7-Max is a technically interesting but legally complex model to evaluate. Two risk dimensions require attention that do not apply to European providers in the same way.

Legal Risk: China's National Intelligence Law (2017), Article 7, obliges Chinese companies to cooperate with state intelligence agencies. The precise enforcement scope for international API customers is legally contested. No official clarification from Alibaba exists. European companies handling sensitive data should assess this before any integration.

GDPR Assessment

Any use of a Chinese provider's API requires a Data Protection Impact Assessment (DPIA) and a documented legal basis for third-country transfers under GDPR Art. 44 ff. before going live with personal data.

EU AI Act

High-risk applications in finance, healthcare, or critical infrastructure require verifiable human oversight. Agents running for 35 hours need defined checkpoints, termination conditions, and audit trails built into the architecture.

Entry-Level Alternative

Qwen3.6 is available with open weights and known pricing (from $1.30 per million input tokens). That is the safer current choice until Qwen3.7-Max API terms and prices are published.

The EU AI Act requires verifiable human oversight for high-risk applications. An AI agent running 35 hours without a checkpoint is only compatible with this requirement if termination conditions, escalation paths, and audit logs are embedded in the system architecture from the start. That is a design requirement, not a compliance checkbox.
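One way to make these requirements concrete is a small guard that every tool call must pass through. The following is a minimal sketch, not a compliance implementation: `RunPolicy`, `GuardedRun`, and the `approve_checkpoint` callback are hypothetical names introduced here for illustration, and the limits are placeholder values.

```python
import json
import time
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class RunPolicy:
    max_runtime_s: float         # termination condition: wall-clock budget
    max_tool_calls: int          # termination condition: call budget
    checkpoint_every_calls: int  # how often a reviewer must sign off


@dataclass
class GuardedRun:
    policy: RunPolicy
    # Escalation path (hypothetical): a human or automated reviewer decides
    # whether the run may continue past a checkpoint.
    approve_checkpoint: Callable[[int], bool]
    clock: Callable[[], float] = time.monotonic
    audit_log: list = field(default_factory=list)
    calls: int = 0
    started_at: Optional[float] = None

    def _record(self, event: str, **details) -> None:
        # Audit trail: one JSON line per event.
        self.audit_log.append(json.dumps({"t": self.clock(), "event": event, **details}))

    def allow_next_call(self, tool_name: str) -> bool:
        if self.started_at is None:
            self.started_at = self.clock()
        if self.clock() - self.started_at > self.policy.max_runtime_s:
            self._record("terminated", reason="max_runtime")
            return False
        if self.calls >= self.policy.max_tool_calls:
            self._record("terminated", reason="max_tool_calls")
            return False
        if self.calls > 0 and self.calls % self.policy.checkpoint_every_calls == 0:
            approved = self.approve_checkpoint(self.calls)
            self._record("checkpoint", calls=self.calls, approved=approved)
            if not approved:
                return False
        self.calls += 1
        self._record("tool_call", tool=tool_name, n=self.calls)
        return True


# Usage: a tiny budget and an auto-approving reviewer, for demonstration only.
run = GuardedRun(
    RunPolicy(max_runtime_s=3600, max_tool_calls=5, checkpoint_every_calls=2),
    approve_checkpoint=lambda n: True,
)
executed = 0
while run.allow_next_call("compile_kernel"):
    executed += 1
print(executed, "calls executed;", len(run.audit_log), "audit entries")
```

In a production setting the approval callback would block on a real review step and the audit log would go to durable, append-only storage rather than an in-memory list.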

Within the competitive landscape, Kimi K2.6 and now Qwen3.7-Max show that Chinese AI companies are systematically challenging Western models on agentic benchmarks. Simultaneously, DeepSeek is assembling a team to build a harness tool competing directly with Claude Code and OpenAI's Codex.

Challenges and Risks

The impressive figures from the Alibaba launch deserve both respect and careful scrutiny. Three limitations are critical for a realistic assessment.

First-Party Result

No External Replication

The 35-hour run is documented exclusively by Alibaba. External researchers have not yet reproduced the figures. This is not a reason to dismiss the result, but it is not proof of generally available performance either.

API-Only

No Open Weights, Unknown Pricing

Qwen3.7-Max is not available for local deployment. API pricing was not announced at launch. Agentic workflows with thousands of tool calls can become very expensive without accurate cost modeling in advance.
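A rough cost model shows why the number of tool calls matters. The sketch below is illustrative only: Qwen3.7-Max pricing is unpublished, so it uses the Qwen3.6 entry price of $1.30 per million input tokens as a stand-in, and the per-call token counts and the output price are pure assumptions that should be replaced with measurements from your own workload.

```python
from dataclasses import dataclass


@dataclass
class RunCostModel:
    tool_calls: int
    avg_input_tokens_per_call: int   # assumption: context resent on each call
    avg_output_tokens_per_call: int  # assumption
    input_price_per_million: float   # USD
    output_price_per_million: float  # USD

    def estimate_usd(self) -> float:
        input_tokens = self.tool_calls * self.avg_input_tokens_per_call
        output_tokens = self.tool_calls * self.avg_output_tokens_per_call
        total = (input_tokens * self.input_price_per_million
                 + output_tokens * self.output_price_per_million)
        return total / 1_000_000


model = RunCostModel(
    tool_calls=1_158,                   # from Alibaba's reported run
    avg_input_tokens_per_call=200_000,  # hypothetical value
    avg_output_tokens_per_call=2_000,   # hypothetical value
    input_price_per_million=1.30,       # Qwen3.6 entry price, used as a stand-in
    output_price_per_million=5.00,      # placeholder, not a published price
)
print(f"Estimated cost for one run: ${model.estimate_usd():,.2f}")
```

Because an agent typically resends a large context on every call, input tokens dominate the estimate, which makes the average context size the first parameter worth measuring.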

Safety Trade-Off

48% Attempt Rate on Factual Queries

The model prefers refusing to answer over hallucinating under uncertainty. This reduces error rates but creates blockers in decision-requiring workflows. For many enterprise processes, this is a critical point to test before deployment.

A model that declines to answer more than half of all factual queries is structurally unsuitable for high-decision-density workflows, even if the answers it does give are rarely wrong.

innobu analysis, May 2026

What Companies Should Do Now

The Qwen3.7-Max launch is a prompt to clarify two fundamental questions about your agentic AI strategy: How long can an autonomous agent run in your environment? And what vendor legal review have you completed for non-EU services?

Before integrating Chinese AI APIs, enterprises should complete a structured vendor risk assessment including GDPR analysis and review of the National Intelligence Law implications.

Define Maximum Agent Run Times

35 hours without human oversight is not acceptable in most European enterprise environments. Establish checkpoint intervals, termination conditions, and escalation paths before deploying agents in production. The EU AI Act makes this mandatory for high-risk applications.

Complete a DPIA and Third-Country Analysis

Before integrating any non-European provider's API, complete a structured Data Protection Impact Assessment and document a legal basis for data transfer. Chinese providers require particular attention due to the 2017 National Intelligence Law.

Use Qwen3.6 for Initial Testing

Qwen3.6 offers open weights, known pricing, and no API dependency risk, which makes it the practically deployable Qwen option today. Hold off on migrating to Qwen3.7-Max until its API pricing and terms are published.

Validate Benchmarks on Your Own Codebases

The SWE-Verified parity with Claude Opus 4.6 Max indicates equivalence, not superiority. Test on your actual tasks, not published benchmarks. Alibaba's demonstration figures come from a controlled Alibaba-specific scenario.

Build Model Portability Into Your Architecture

API-only models can change pricing, access terms, and availability at any time. Design model exchangeability into your agentic architecture from the start. This applies broadly but especially to non-European providers.
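Model exchangeability usually comes down to keeping vendor-specific calls behind a narrow interface. The sketch below shows one possible shape under stated assumptions: `ChatModel`, `StubModel`, and `ModelRegistry` are hypothetical names, and the stub backend stands in for real adapters that would wrap each vendor's SDK.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Protocol


class ChatModel(Protocol):
    """Minimal interface the agent code depends on (hypothetical)."""
    name: str

    def complete(self, prompt: str) -> str: ...


@dataclass
class StubModel:
    """Stand-in backend; a real adapter would wrap a vendor SDK call here."""
    name: str

    def complete(self, prompt: str) -> str:
        return f"[{self.name}] {prompt}"


class ModelRegistry:
    """Routes requests to the active backend so callers never name a vendor."""

    def __init__(self) -> None:
        self._backends: Dict[str, ChatModel] = {}
        self._active: Optional[str] = None

    def register(self, model: ChatModel) -> None:
        self._backends[model.name] = model
        if self._active is None:
            self._active = model.name

    def switch_to(self, name: str) -> None:
        if name not in self._backends:
            raise KeyError(f"unknown model backend: {name}")
        self._active = name

    def complete(self, prompt: str) -> str:
        if self._active is None:
            raise RuntimeError("no model backend registered")
        return self._backends[self._active].complete(prompt)


registry = ModelRegistry()
registry.register(StubModel("provider-a"))
registry.register(StubModel("provider-b"))

first = registry.complete("refactor module X")
registry.switch_to("provider-b")  # e.g. after a pricing or terms change
second = registry.complete("refactor module X")
print(first)
print(second)
```

The agent logic only ever calls `registry.complete`, so swapping providers becomes a configuration change rather than a code change.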

For your AI coding strategy in 2026: Qwen3.7-Max is a signal, not a ready-to-deploy tool. The signal is: long-horizon autonomy is no longer theoretical. The practical question is not whether, but how you build governance structures for autonomous agents that run for hours or days.

Frequently Asked Questions

What is Qwen3.7-Max?

Qwen3.7-Max is Alibaba's flagship language model for the agentic era, unveiled on May 20, 2026, at the Alibaba Cloud Summit in Hangzhou. It features a 1-million-token context window, is designed for autonomous long-horizon tasks, and in an internal test ran for 35 hours without human intervention, making 1,158 tool calls.

What did Qwen3.7-Max achieve in the 35-hour run?

In Alibaba's internal test, Qwen3.7-Max ran autonomously for 35 hours, made 1,158 tool calls and 432 kernel evaluations, and optimized the Extend Attention kernel of the new Zhenwu M890 chip by a factor of 10 (geometric mean). These figures are Alibaba's first-party claims and have not yet been independently reproduced.
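The "geometric mean" qualifier matters because speedups are ratios, and a single outlier case can inflate an arithmetic average. The sketch below shows how such a figure is typically computed. The timings are invented for illustration; Alibaba's per-case measurements are not given in this article.

```python
import math


def geometric_mean_speedup(baseline_ms, optimized_ms):
    """Geometric mean of per-case speedup ratios (baseline / optimized)."""
    ratios = [b / o for b, o in zip(baseline_ms, optimized_ms)]
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


# Hypothetical kernel timings for three input shapes (not Alibaba's data).
baseline = [100.0, 400.0, 90.0]
optimized = [20.0, 40.0, 4.5]

ratios = [b / o for b, o in zip(baseline, optimized)]  # 5x, 10x, 20x
gm = geometric_mean_speedup(baseline, optimized)
am = sum(ratios) / len(ratios)

print(f"geometric mean speedup:  {gm:.2f}x")
print(f"arithmetic mean speedup: {am:.2f}x")
```

With these made-up numbers the geometric mean is 10x while the arithmetic mean is higher, which is why a geometric mean is the more conservative way to summarize ratios.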

How does Qwen3.7-Max compare to Claude and GPT?

On SWE-Verified, the leading software engineering benchmark, Qwen3.7-Max scores 80.4, statistically tied with Claude Opus 4.6 Max (80.8). On the Artificial Analysis Intelligence Index it scores 57, behind GPT-5.5 (60.2) and Claude Opus 4.7 (57.3). On GPQA Diamond (92.4 vs. 91.3) and HLE (41.4 vs. 40.0) it scores marginally higher than Claude Opus 4.6.

What legal risks does Qwen3.7-Max pose for European enterprises?

Two main risks: First, China's National Intelligence Law (2017, Art. 7) obliges Chinese companies to support state intelligence agencies, which may affect international API customers. Second, GDPR requires a Data Protection Impact Assessment and a legal basis for third-country transfers (Art. 44 ff.) before any production deployment involving personal data.

Can I run Qwen3.7-Max locally?

No. Qwen3.7-Max is exclusively available via API; model weights are not publicly released. For local deployment, Qwen3.6 is the current option with open weights and known pricing (from $1.30 per million input tokens via OpenRouter).

What is the Zhenwu M890?

The Zhenwu M890 is Alibaba's new AI accelerator, developed by its chip subsidiary T-Head. It carries 144 GB of HBM3 memory (50% more than the predecessor), achieves 800 GB/s inter-chip bandwidth, and Alibaba claims it delivers three times the performance of the previous Zhenwu 810E. Combined with Qwen3.7-Max, it enables hardware-software co-optimization without third-party processors.