Open-weights models like Qwen3.5 and Kimi 2.5 now run on hardware that fits under your desk. For businesses, that raises a concrete question: Is local inference a viable alternative or complement to the cloud?
Just a year ago, local LLM inference was mostly frustrating. The models were noticeably worse than their commercial counterparts, the hardware expensive or loud, the setup cumbersome. Anyone serious about working with AI had no choice but to rely on OpenAI, Anthropic, or Google.
2026 looks different. Models like Qwen3.5-35B from Alibaba deliver results on many standard tasks that approach commercial cloud models. At the same time, dedicated inference devices like NVIDIA's DGX Spark or the Asus GX10 are available from around EUR 3,000. Small, quiet, with pre-installed Linux. Plug in, load a model, done.
Open-weights means: the model weights are freely available. No API key needed, no subscription, no third-party terms of service. The model runs on your own device. Data never leaves your own network.
For a range of tasks, local inference is already practical for daily use:
- Source code reviews, refactoring suggestions, and documentation run reliably on local models. With proprietary code in particular, that is a clear advantage: nothing leaves your network.
- Summaries, reviews, brainstorming, and drafts for internal documents. Many everyday tasks that previously went to ChatGPT or Claude can be handled locally.
- Automated processes with clearly defined context deliver usable results, particularly for recurring, structured tasks.
- Email drafts, meeting summaries, research notes. The bulk of daily AI usage can be covered locally.
Local models hit their limits with highly complex reasoning tasks over long contexts. Multimodal applications (image analysis, video) are barely usable locally. And if you need the best available model for a specific task, you will still end up with the major cloud providers.
Claude, GPT, or Gemini in their strongest variants are still ahead of local alternatives for demanding tasks.
That is not a flaw. It describes the current state of the art. The question is not whether local models can fully replace the cloud. Rather, it is about what share of daily work can sensibly be handled locally.
Every request to an external API transfers data to the provider. With coding assistants, that means your entire source code. With chat tools, the full conversation history. With agents, files and system contexts on top. Many users are not aware of how much data is actually transmitted. With local inference, the question does not arise.
Cloud inference is billed per token. With heavy use, costs scale up. A local device has fixed acquisition costs and manageable operating costs. A GB10 device runs at about EUR 500 per year in electricity under full load. Significantly less in normal operation.
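The trade-off between fixed and usage-based costs is easy to sketch as a back-of-envelope calculation. The figures below are illustrative assumptions (roughly 240 W full load, EUR 0.25/kWh, a three-year hardware lifetime, and a placeholder API price), not quotes:

```python
# Back-of-envelope cost comparison: local inference box vs. per-token cloud API.
# All prices and power figures are illustrative assumptions, not vendor quotes.

def local_annual_cost(device_eur=3000.0, lifetime_years=3.0,
                      watts=240.0, hours_per_day=24.0, eur_per_kwh=0.25):
    """Amortised hardware cost plus electricity per year."""
    electricity = watts / 1000.0 * hours_per_day * 365 * eur_per_kwh
    return device_eur / lifetime_years + electricity

def cloud_annual_cost(tokens_per_day=2_000_000, eur_per_million_tokens=2.0):
    """Pure usage-based API billing, scaling linearly with volume."""
    return tokens_per_day / 1_000_000 * eur_per_million_tokens * 365

print(f"local: EUR {local_annual_cost():,.0f}/year")
print(f"cloud: EUR {cloud_annual_cost():,.0f}/year")
```

With these assumptions, electricity alone lands near the EUR 500/year figure above; the key difference is that the local curve is flat while the cloud curve scales with every additional token.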
No rate limiting, no API outages, no unilateral price changes. Your model runs when you need it, as often as you need it.
Local processing eliminates third-country transfers and the complexity of data processing agreements (DPAs). This simplifies the data protection assessment considerably, especially for businesses in regulated industries.
Local inference is not a turnkey solution. A few points that tend to get overlooked in the enthusiasm:
A local device needs administration. OS updates, model changes, monitoring, network configuration. This requires expertise that not every company has in-house. Cloud APIs abstract away this complexity.
That Qwen3.5-35B approaches commercial models on benchmarks does not mean the results are equivalent in every situation. In daily work with complex prompts or niche topics, the gaps can be more noticeable.
Local hardware is an investment in today's state of the art. Cloud providers continuously roll out new models without requiring you to swap hardware. In two years, the hardware may be outdated.
A single device is enough for a team of two. Not for 50 concurrent users. Cluster solutions like exo exist but increase both cost and complexity significantly.
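Why one box serves a small team but not 50 concurrent users comes down to simple throughput arithmetic. The numbers below are illustrative assumptions (a single-stream throughput of ~40 tokens/s and a modest per-user request rate), and the estimate ignores batching gains and latency spikes:

```python
# Rough concurrency estimate: how many active users can one inference box serve?
# Throughput and demand figures are illustrative assumptions, not measurements.

def max_concurrent_users(device_tokens_per_s=40.0,
                         tokens_per_request=500,
                         requests_per_user_per_min=1.0):
    """Average-load estimate; ignores batching gains and latency spikes."""
    demand_per_user = tokens_per_request * requests_per_user_per_min / 60.0
    return device_tokens_per_s / demand_per_user

print(round(max_concurrent_users()), "active users on average load")
```

Under these assumptions, a single device sustains a handful of active users; serving dozens concurrently requires batching-optimised serving stacks or clustering, which is where the extra cost and complexity come in.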
An argument gaining weight: the major AI providers are based in the US and China. Regulatory interventions, export restrictions, or political conflicts can affect the availability of AI services.
Anyone who builds their business processes on a single provider operating in a foreign jurisdiction takes on a risk that is not technical in nature.
A related caveat: open-weights models currently come predominantly from China (Alibaba, Moonshot AI, MiniMax). Local inference reduces the operational dependency on any single provider, but the strategic dependency on the manufacturers of models and hardware (NVIDIA, Apple) remains.
The market for local inference hardware is moving fast. An overview of the relevant options:
| Hardware | Price from | Memory | Suited for | Limitations |
|---|---|---|---|---|
| NVIDIA GB10 (DGX Spark, Asus GX10) | approx. EUR 3,000 | 128 GB | LLM inference, entry-level, teams | Linux knowledge helpful |
| Apple Mac (M-chip, 16-64 GB) | approx. EUR 1,500 | 16-64 GB | Smaller models up to 14B parameters | Limited to smaller models |
| Apple Mac Studio (256-512 GB) | approx. EUR 8,000 | 256-512 GB | Large models, high bandwidth | High price |
| AMD Strix Halo Mini-PCs | approx. EUR 2,000 | variable | Experimental, early adopters | No CUDA, immature ecosystem |
| Used RTX 3090 (2-3x) | approx. EUR 1,500 | 48-72 GB VRAM | Startups, Linux-experienced teams | Loud, power-hungry, high-maintenance |
| exo cluster | approx. EUR 15,000 | variable | Very large models, teams | High cost and complexity |
For getting started, NVIDIA GB10 devices are currently the most practical option: compact, quiet, optimised for LLM inference, and with a comparatively low entry barrier.
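The memory columns in the table map directly onto model size. A rough rule of thumb: quantised weights take parameters times bits-per-weight, plus headroom for KV cache and runtime. The 4-bit quantisation and ~20% overhead below are common working assumptions, not exact figures:

```python
# Rough memory estimate for quantised LLM weights: why a 16 GB Mac tops out
# around 14B parameters while a 128 GB GB10 fits far larger models.
# Quantisation level and overhead factor are working assumptions.

def model_memory_gb(params_billion, bits_per_weight=4, overhead=1.2):
    """Weights at the given quantisation, plus ~20% for KV cache and runtime."""
    return params_billion * bits_per_weight / 8 * overhead

for size in (14, 35, 70):
    print(f"{size}B @ 4-bit: ~{model_memory_gb(size):.0f} GB")
```

Under these assumptions, a 14B model at 4-bit needs roughly 8 GB and fits a 16 GB Mac with room for context, while a 35B model already wants 20+ GB, which is where the 128 GB devices earn their price.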
Local inference does not have to replace cloud AI. A pragmatic split based on data classification works better in practice than an either-or approach:
- **Keep local:** HR data, contracts, customer data, internal strategy documents, proprietary code. Everything your company would not want in someone else's hands.
- **Send to the cloud:** Public research, marketing copy, generic code tasks without sensitive context. Tasks where the best available model makes the difference.
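The split described above can be enforced in code rather than by convention: tag each request with a sensitivity label and resolve the endpoint from it. A minimal sketch, where the labels and endpoint URLs are illustrative assumptions (a local Ollama-style port and a placeholder cloud URL):

```python
# Minimal sketch of data-classification routing: internal data goes to the
# local inference box, public data may go to a cloud API. Labels and endpoint
# URLs are illustrative assumptions, not a fixed standard.
from enum import Enum

class Sensitivity(Enum):
    INTERNAL = "internal"   # HR data, contracts, proprietary code
    PUBLIC = "public"       # public research, marketing copy

ENDPOINTS = {
    Sensitivity.INTERNAL: "http://localhost:11434/v1",   # local device
    Sensitivity.PUBLIC: "https://api.example.com/v1",    # cloud provider
}

def route(sensitivity: Sensitivity) -> str:
    """Internal data never leaves the network; public data may use the cloud."""
    return ENDPOINTS[sensitivity]

print(route(Sensitivity.INTERNAL))
```

Centralising the rule in one function makes the "which data goes where" policy auditable instead of depending on each employee's judgment.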
If you want cloud inference but prefer to avoid US providers: Nebius (data centres in France and Finland) or AKI.IO (German and European servers) offer open-weights models via API in full GDPR compliance.
Local AI inference is practical in 2026. Not for everything, but for a relevant share of daily AI usage in businesses. The hardware is affordable, the models are good enough, and the arguments for data sovereignty and cost control are real.
At the same time, local inference is neither a turnkey solution nor a silver bullet. It requires technical expertise, ties up resources for operations and maintenance, and cannot match the best cloud models for complex tasks.
The strategically smart decision is probably not an either-or choice. Rather, it is a deliberate split: local inference for everyday work and sensitive data, cloud models for the heavy lifting. With clear rules about which data goes where.
If you are interested in getting started, do not wait for perfect hardware. The GB10 devices are a good starting point for gaining experience. The real work is not the setup. It is deciding which tasks and data should be processed locally going forward.
**What hardware do I need to get started?** NVIDIA GB10 devices like the DGX Spark or Asus GX10 are available from around EUR 3,000. They offer 128 GB of memory and are optimised for LLM inference. Apple Macs with M-chips work for smaller models up to 14B parameters, while larger models require a Mac Studio with 256 or 512 GB of Unified Memory.
**How good are local open-weights models compared to cloud models?** For many standard tasks such as code analysis, text work, or brainstorming, open-weights models like Qwen3.5-35B deliver comparable results. However, for very complex reasoning, long contexts, or multimodal tasks, cloud models from OpenAI, Anthropic, and Google still lead the field.
**What does local inference cost?** Electricity for a GB10 device runs at about EUR 500 per year under full load, significantly less in normal operation. Add one-time acquisition costs starting from around EUR 3,000. Compared to cloud APIs that charge per token, costs are often lower with heavy use and, more importantly, much more predictable.
**Does local inference simplify GDPR compliance?** Yes, considerably. It eliminates third-country transfers and the complexity of data processing agreements (DPAs). Data never leaves your own network, making the data protection assessment significantly easier.
**Can local inference fully replace cloud AI?** No, a complete replacement is not practical at this point. The best strategy is a deliberate split: sensitive data such as HR records, contracts, or proprietary code is processed locally. For public research, marketing copy, or particularly demanding tasks, cloud models remain the better choice.
**What are open-weights models?** Open-weights models are AI language models whose trained weights are freely available. You can download them and run them on your own hardware without an API key, subscription, or third-party terms of service. Well-known examples include Qwen3.5 from Alibaba, Kimi 2.5 from Moonshot AI, and Llama from Meta.