Server hardware being delivered to the loading dock of a European data center where Chinese open-weight models can be self-hosted

China's AI Models 2026: When Benchmark Promises Meet Independent Tests

Four open models in 18 days, all claiming frontier parity. What remains when independent tests count instead of vendor figures.

In April 2026 four Chinese labs released open AI models within 18 days, all claiming top-tier performance and undercutting Western models on price by a factor of six to thirty. The independent NIST evaluation of DeepSeek V4 paints a more sober picture. This article sorts out what the models actually deliver and under which conditions European companies should deploy them.

Summary

Between April 7 and 24, 2026, Z.ai (GLM-5.1), Moonshot (Kimi K2.6), MiniMax (M2.7) and DeepSeek (V4) released open models that cluster within about three points on the coding benchmark SWE-Bench Pro and narrowly surpass GPT-5.4 and Claude Opus 4.6 on individual scores. The independent evaluation of DeepSeek V4 Pro in May 2026 by CAISI, a center at the US institute NIST, however, shows a gap of about eight months to the frontier: an estimated Elo score of 800 versus 1260 for GPT-5.5. The real lever is price, not absolute performance. DeepSeek V4 Pro costs $1.74 (input) and $3.48 (output) per million tokens, versus $5 and $30 for GPT-5.5. For European companies the question shifts from model strength to deployment location: running open weights on-premise on EU servers is seen as the safest path to GDPR compliance, but the EU AI Act holds the operator responsible, not the maker.

What actually happened in April 2026

Four Chinese labs released open AI models between April 7 and 24, 2026, all claiming top-tier performance. Z.ai shipped GLM-5.1, Moonshot the Kimi K2.6 model, MiniMax the M2.7 variant, and DeepSeek the two versions V4 Pro and V4 Flash. On SWE-Bench Pro, a benchmark for real coding tasks, these models cluster within about three points, and two of them temporarily led Artificial Analysis' open intelligence ranking.

4: open frontier models in 18 days (April 7 to 24, 2026)
58.4: GLM-5.1 on SWE-Bench Pro (ahead of GPT-5.4 at 57.7)
754B: parameters in GLM-5.1, MoE (MIT license, weights on Hugging Face)
80.2 %: Kimi K2.6 on SWE-Bench Verified (first open model ahead of GPT-5.4)
1.6T: parameters in DeepSeek V4 Pro (context window of 1 million tokens)
~1,509: Chinese language models in 2025 (about 40 % of all new models worldwide)

Context matters: the releases were not a single event but part of a rapid sequence, and the Western frontier moved on in the same period. Anthropic released Claude Opus 4.7 on April 16, and OpenAI shipped GPT-5.5 on April 23, one day before DeepSeek V4. Anyone reading the April wave as proof that China has closed the gap overlooks that the target moved too.

April 7, 2026

GLM-5.1 by Z.ai

754 billion parameters, MIT license, 58.4 on SWE-Bench Pro and thus narrowly ahead of GPT-5.4 and Claude Opus 4.6.

April 16, 2026

Claude Opus 4.7 by Anthropic

The Western frontier moves on in the middle of the Chinese release wave.

April 20, 2026

Kimi K2.6 by Moonshot

First open model ahead of GPT-5.4 on SWE-Bench Verified, at 80.2 percent.

April 23, 2026

GPT-5.5 by OpenAI

One day before DeepSeek V4. The gap to the frontier is remeasured before the next Chinese model appears.

April 24, 2026

DeepSeek V4 Pro and Flash

1.6 trillion parameters, one million tokens of context, prices far below the Western frontier models.

The benchmark gap narrowed. It did not close, because the Western frontier moved on in the same period.

Paraphrased from the analysis by Can Demir, May 2026

Benchmark promises and independent tests

The self-reported numbers and the independent evaluation diverge sharply. The Center for AI Standards and Innovation (CAISI) at the US institute NIST tested DeepSeek V4 Pro independently in May 2026 and found a gap of about eight months to the frontier. Where DeepSeek's own tables place the model around GPT-5.4 and Opus 4.6, CAISI ranks it closer to GPT-5, the model that appeared eight months earlier.

SWE-Bench Pro is a benchmark that measures AI models on real software tasks drawn from actual code repositories. A high score points to strong coding ability but says little about security, reasoning under uncertainty, or behavior in production.

On familiar tasks the gap is small, on hard ones it grows. The table below shows the CAISI scores for DeepSeek V4 Pro compared with two Western models. The gap on ARC-AGI-2 and on the security benchmark CTF-Archive-Diamond stands out.

Benchmark | DeepSeek V4 Pro | GPT-5.5 | Opus 4.6
GPQA-Diamond | 90 % | 96 % | 91 %
SWE-Bench Verified | 74 % | 81 % | 79 %
FrontierScience | 74 % | 79 % | 72 %
ARC-AGI-2 (semi-private) | 46 % | 79 % | 63 %
CTF-Archive-Diamond (security) | 32 % | 71 % | 46 %
Estimated Elo score | 800 | 1260 | 999

Key point

Vendor benchmarks are marketing, not a seal of approval. The gap between self-reported and independently measured performance can amount to several months of development lead. Do not rely on the provider's tables when selecting a model for production use.
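
The size of the gaps in the CAISI table can be computed directly. The following is a minimal sketch using the DeepSeek V4 Pro and GPT-5.5 figures quoted in the table. The Elo conversion assumes the conventional logistic Elo formula with a 400-point scale; the article does not state how CAISI derives its estimate, so that part is an assumption and the function names are illustrative.

```python
# Scores in percent as quoted in the table: (DeepSeek V4 Pro, GPT-5.5).
SCORES = {
    "GPQA-Diamond": (90, 96),
    "SWE-Bench Verified": (74, 81),
    "FrontierScience": (74, 79),
    "ARC-AGI-2": (46, 79),
    "CTF-Archive-Diamond": (32, 71),
}


def gaps_in_points(scores):
    """Return {benchmark: GPT-5.5 score minus DeepSeek V4 Pro score}."""
    return {name: frontier - deepseek for name, (deepseek, frontier) in scores.items()}


def elo_expected_score(rating_a, rating_b, scale=400.0):
    """Expected score of A against B under the conventional Elo formula (assumed)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))


if __name__ == "__main__":
    # Print the benchmarks from the smallest to the largest gap.
    for name, gap in sorted(gaps_in_points(SCORES).items(), key=lambda kv: kv[1]):
        print(f"{name:<22} {gap:>3} points behind GPT-5.5")
    print(f"Expected score at Elo 800 vs 1260: {elo_expected_score(800, 1260):.3f}")
```

Under that assumption, a 460-point Elo gap corresponds to an expected score of roughly 7 percent for the lower-rated model in a head-to-head comparison. The per-benchmark gaps range from 5 points on FrontierScience to 39 points on CTF-Archive-Diamond.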

That does not mean the models are weak. On knowledge and coding tasks they perform close to the frontier. But the blanket claim of frontier parity does not survive scrutiny once hard reasoning and security tasks are added. A similar finding on the gap between perception and measured performance is described in the article on the Stanford AI Index 2026 and the trust gap.

Cost and performance compared

The real lever of Chinese models is price, not absolute top performance. DeepSeek V4 Pro costs $1.74 and $3.48 per million tokens for input and output, the Flash variant $0.14 and $0.28. By comparison, GPT-5.5 sits at $5 and $30, and Claude Opus 4.7 at $5 and $25. On many tasks that means costs six to thirty times lower at an only slightly lower hit rate.
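
How large the multiple is depends on which two models are paired and on the mix of input and output tokens. The sketch below computes the list-price ratio from the prices quoted above. The `output_share` values are hypothetical traffic mixes chosen for illustration, and the function names are not from any vendor tool.

```python
# List prices in USD per 1M tokens (input, output), as quoted in the article.
PRICES = {
    "DeepSeek V4 Pro": (1.74, 3.48),
    "DeepSeek V4 Flash": (0.14, 0.28),
    "GPT-5.5": (5.00, 30.00),
    "Claude Opus 4.7": (5.00, 25.00),
}


def blended_price(model, output_share):
    """Price per 1M tokens when output_share (0.0 to 1.0) of the tokens are output."""
    if not 0.0 <= output_share <= 1.0:
        raise ValueError("output_share must be between 0 and 1")
    price_in, price_out = PRICES[model]
    return (1.0 - output_share) * price_in + output_share * price_out


def price_ratio(expensive, cheap, output_share):
    """How many times more the expensive model costs at list price."""
    return blended_price(expensive, output_share) / blended_price(cheap, output_share)


if __name__ == "__main__":
    for share in (0.0, 0.25, 0.5, 1.0):  # hypothetical traffic mixes
        ratio = price_ratio("GPT-5.5", "DeepSeek V4 Pro", share)
        print(f"output share {share:.2f}: GPT-5.5 costs {ratio:.1f}x DeepSeek V4 Pro")
```

Because the output-price gap is wider than the input-price gap, the ratio grows as a workload becomes more output-heavy.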

6 to 30x: cheaper than Western frontier models
54: Intelligence Index Kimi K2.6, top of the open models
$5.6M: training cost DeepSeek R1, versus $80 to 100M

Model | Input / output per 1M tokens | Context window | Notable
DeepSeek V4 Pro | $1.74 / $3.48 | 1M tokens | 1.6T parameters, Flash variant from $0.14
Kimi K2.6 | $0.95 / $4.00 | 262K tokens | Intelligence Index 54, top of the open models
MiniMax M2.7 | $0.30 / $1.20 | 196K tokens | only 10B active parameters, under a third of GLM-5
GPT-5.5 | $5.00 / $30.00 | not stated | Western frontier model, CAISI Elo 1260, reference for the measurement

Be careful with list prices: the raw price per token says little about total cost in production. If a model generates longer answers and more reasoning tokens for the same task, the cost advantage shrinks. The honest calculation only emerges when you run the same workload across several models and count the actual tokens, not the table values.
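
The effect of longer answers can be made concrete with a small calculation. This is a sketch under stated assumptions: the prices are the list prices quoted in this article, while the token counts are invented for illustration and would come from your own measured workload in practice.

```python
from dataclasses import dataclass


@dataclass
class Usage:
    """Token counts one model actually consumed on the shared workload."""
    input_tokens: int
    output_tokens: int  # include reasoning tokens if the provider bills them


def workload_cost(price_in, price_out, usage):
    """Cost in USD, with prices given per 1M tokens."""
    return (usage.input_tokens * price_in + usage.output_tokens * price_out) / 1_000_000


if __name__ == "__main__":
    # Hypothetical counts: identical prompts, but the cheaper model is
    # assumed to produce three times as many output tokens.
    frontier = workload_cost(5.00, 30.00, Usage(2_000_000, 500_000))
    open_model = workload_cost(1.74, 3.48, Usage(2_000_000, 1_500_000))
    print(f"GPT-5.5:         ${frontier:.2f}")
    print(f"DeepSeek V4 Pro: ${open_model:.2f}")
    print(f"effective ratio: {frontier / open_model:.1f}x")
```

With these made-up counts the workload costs $25.00 versus $8.70, so the advantage falls to about 2.9x. At equal token counts it would have been about 4.8x.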

European perspective

For European companies the question shifts from model strength to deployment location. Open weights can run on your own or European servers without data flowing to China. That is precisely what is seen as the safest way to use the cost advantage while staying GDPR compliant. More than 180,000 models derived from Alibaba's Qwen already run on European infrastructure.

A systems administrator slides a GPU server into a rack in a European on-premise server room, with a terminal showing a running model
Running open weights on-premise on EU servers is seen as the safest way to use the cost advantage of Chinese models while staying GDPR compliant.

Deployment location beats origin: If you self-host an open model, you alone decide which data it processes and where that data flows. The Chinese origin of the weights does not change that, as long as the model runs on your infrastructure and sends no telemetry outward.
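
One way to back this up at the application layer is an allowlist that rejects any inference endpoint outside your own infrastructure before a request is built. The sketch below is illustrative only: the hostnames are hypothetical placeholders, and a check like this complements network-level egress controls rather than replacing them.

```python
from urllib.parse import urlparse

# Hypothetical internal hostnames; replace with your own inference servers.
ALLOWED_HOSTS = {"llm.internal.example", "localhost", "127.0.0.1"}


def is_allowed_endpoint(url, allowed_hosts=ALLOWED_HOSTS):
    """True only if the URL is http(s) and its host is on the allowlist."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        return False
    # hostname is the parsed host without port or credentials, lowercased.
    return parsed.hostname in allowed_hosts


def checked_endpoint(url):
    """Return the URL unchanged, or raise before any data is sent."""
    if not is_allowed_endpoint(url):
        raise ValueError(f"endpoint not on the allowlist: {url}")
    return url
```

Calling `checked_endpoint()` wherever a client is configured means a misconfigured hosted-API URL fails loudly instead of silently sending data off-premise.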

The legal knot lies elsewhere. China's Personal Information Protection Law partly exempts authorities, which critics read as a de facto right of access to data stored in China. With on-premise deployment on EU servers this path disappears, because no data reaches Chinese data centers. When using a Chinese provider's hosted API, by contrast, this question is real.

The operator is liable, not the maker: The EU AI Act applies extraterritorially and holds the company that deploys an AI system responsible, regardless of whether the model comes from China, the US, or Europe. Fines reach up to 35 million euros or 7 percent of global annual turnover. New models will be reviewed by the EU AI Office from 2026, existing ones from 2027.

For high-risk applications such as hiring, lending, or critical infrastructure, full documentation and human oversight are required, whichever model sits underneath. Set up the architecture cleanly and you can run a Chinese open-weight model just as compliantly as a Western one. For more depth, see the article on the EU AI Act high-risk deadlines 2026 and the piece on local AI models on your own hardware.
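
A simple use-case register can make these requirements checkable. The following is a deliberately simplified sketch: the area list contains only the three examples named above and is not a complete legal classification, and all field and function names are hypothetical.

```python
from dataclasses import dataclass

# Areas the article names as high-risk examples; not a complete legal list.
HIGH_RISK_AREAS = {"hiring", "lending", "critical infrastructure"}


@dataclass
class UseCase:
    name: str
    area: str
    documented: bool
    human_oversight: bool


def open_items(use_case):
    """Return the safeguards still missing for a use case in a high-risk area."""
    if use_case.area not in HIGH_RISK_AREAS:
        return []
    missing = []
    if not use_case.documented:
        missing.append("documentation")
    if not use_case.human_oversight:
        missing.append("human oversight")
    return missing


if __name__ == "__main__":
    screening = UseCase("CV pre-screening", "hiring", documented=True, human_oversight=False)
    print(screening.name, "->", open_items(screening))
```

The check is independent of which model is deployed, which mirrors the point that the obligations attach to the use case rather than to the origin of the weights.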

Challenges and risks

Alongside price there are real risks that call for sober scrutiny. Vendor benchmarks are marketing, and the CAISI results show how far self-reported and independently measured figures can diverge. On top come security, license, and supply-chain questions.

Benchmark gaming

Models can be optimized for known test sets without performing as well in real use. A top score on a public benchmark is no proof that the model solves your specific task. That is exactly why your own measurement with real data is indispensable.

Security

On the security benchmark CTF-Archive-Diamond, DeepSeek V4 Pro scored 32 percent, well below GPT-5.5 at 71 percent. For security-critical automation, such as handling vulnerabilities or code review, this gap is relevant and should be measured before deployment.

Supply chain

According to reports, bottlenecks in Huawei chips have temporarily delayed the schedules of Chinese labs. Anyone who relies operationally on a specific model should include its dependence on hardware and update supply in the risk assessment.

License details

The MIT license for GLM-5.1 is clear and permits commercial use and fine-tuning. Other models have usage clauses that must be checked before production use. The license is not a detail but decides whether you may legally build the model into your product at all.

What companies should do now

The most important step is your own measurement instead of trusting vendor numbers. When selecting a model, test it on real, internal tasks and align the choice with use case, data-protection needs, and deployment model. Four steps help.

  1. Set up your own evaluation

    Assemble representative tasks from your own daily work and let several models process the same workload. Measure hit rate, token consumption, and latency. Public benchmarks serve the shortlist, not the decision.

  2. Set the deployment model by data-protection needs

    For sensitive data, open weights belong on your own or European servers. For non-critical tasks a cheap hosted API can make sense. Make this decision per use case, not blanket for the whole company.

  3. Clarify EU AI Act governance early

    Classify each use case by risk level, document the processing of personal data, and plan human oversight from the start. That is cheaper when it happens before deployment, not after.

  4. Stay multi-track

    Keep open Chinese models as a cost option alongside Western frontier models rather than locking in to one. That way you can pick the right model per task and react to price or license changes without rebuilding your architecture.
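
The evaluation in step 1 can be sketched as a small harness that runs the same tasks through several models and records hit rate, token consumption, and latency. Everything here is an assumption for illustration: the `Model` interface (a callable returning an answer and a token count) is hypothetical, and the stub models stand in for real API or local inference calls.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical interface: a model is any callable prompt -> (answer, tokens_used).
Model = Callable[[str], Tuple[str, int]]
Task = Tuple[str, Callable[[str], bool]]  # (prompt, checker for the answer)


@dataclass
class Result:
    hit_rate: float
    total_tokens: int
    mean_latency_s: float


def evaluate(model: Model, tasks: List[Task]) -> Result:
    """Run every task once and aggregate the three metrics."""
    if not tasks:
        raise ValueError("tasks must not be empty")
    hits, tokens, elapsed = 0, 0, 0.0
    for prompt, is_correct in tasks:
        start = time.perf_counter()
        answer, used = model(prompt)
        elapsed += time.perf_counter() - start
        tokens += used
        hits += int(is_correct(answer))
    return Result(hits / len(tasks), tokens, elapsed / len(tasks))


def compare(models: Dict[str, Model], tasks: List[Task]) -> Dict[str, Result]:
    """Evaluate several models on the identical task list."""
    return {name: evaluate(model, tasks) for name, model in models.items()}


if __name__ == "__main__":
    # Stub models with fixed answers; replace with real inference calls.
    def stub_a(prompt: str) -> Tuple[str, int]:
        return ("4", 12)

    def stub_b(prompt: str) -> Tuple[str, int]:
        return ("5", 40)

    tasks = [("What is 2 + 2?", lambda answer: answer.strip() == "4")]
    for name, result in compare({"model-a": stub_a, "model-b": stub_b}, tasks).items():
        print(name, result)
```

Keeping the checker per task lets you mix exact-match tasks with custom validators, and the recorded token totals can be combined with list prices to estimate cost per workload.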

Close-up of a developer's hands at the keyboard with a GPU card on an anti-static mat during a model test
Your own tests with real tasks instead of public benchmarks: only running a model on your own hardware shows its actual performance.

For concrete assessments of individual models, see the articles on Qwen3.7-Max and Alibaba's vertical AI strategy and on Kimi K2.6 as an open-weight agent in the enterprise.

Key point

The right question is not which Chinese model is strongest, but which model solves your specific task at acceptable cost and where it is allowed to run. You answer both with your own measurements and clean governance, not with the provider's tables.

Frequently asked questions

Which Chinese AI models launched in April 2026?

Between April 7 and 24, 2026, four Chinese labs released open models: Z.ai with GLM-5.1 (April 7), Moonshot with Kimi K2.6 (April 20), MiniMax with M2.7, and DeepSeek with the V4 Pro and V4 Flash variants (April 24). On the coding benchmark SWE-Bench Pro they cluster within about three points and narrowly surpass GPT-5.4 and Claude Opus 4.6 on individual scores.

Do the benchmark claims of Chinese AI models hold up to independent tests?

Only partly. The US institute NIST evaluated DeepSeek V4 Pro independently in May 2026 through its Center for AI Standards and Innovation (CAISI) and found a gap of about eight months to the frontier. The estimated Elo score is 800 versus 1260 for GPT-5.5. Where DeepSeek's own tables place the model around GPT-5.4 level, CAISI ranks it closer to GPT-5.

How much cheaper are Chinese AI models?

Substantially. DeepSeek V4 Pro costs $1.74 and $3.48 per million tokens for input and output, versus $5 and $30 for GPT-5.5. Across many tasks that means costs six to thirty times lower. MiniMax M2.7 costs $0.30 and $1.20, less than a third of GLM-5. The advantage can shrink in production when models generate longer answers and more reasoning tokens.

Can European companies deploy Chinese AI models?

Yes, with the right architecture. Open weights can run on your own or European servers without data flowing to China, which is seen as the safest path to GDPR compliance. More than 180,000 models derived from Alibaba's Qwen already run on European infrastructure. The EU AI Act holds the operator responsible regardless of the model's origin, with fines up to 35 million euros or 7 percent of global annual turnover.

What should companies do before deploying a Chinese AI model?

Run your own measurement instead of trusting vendor numbers. When selecting a model, test it on real internal tasks rather than public benchmarks, set the deployment model by data-protection needs (on-premise or EU hosting for sensitive data), and clarify EU AI Act governance early. It is sensible to keep open Chinese models as a cost option alongside Western frontier models rather than locking in to one.