
Local AI Models on Your Own Hardware

What works, what doesn't, and who benefits

Open-weights models like Qwen3.5 and Kimi 2.5 now run on hardware that fits under your desk. For businesses, that raises a concrete question: Is local inference a viable alternative or complement to the cloud?

What has changed

Just a year ago, local LLM inference was mostly frustrating. The models were noticeably worse than their commercial counterparts, the hardware expensive or loud, the setup cumbersome. Anyone serious about working with AI had no choice but to rely on OpenAI, Anthropic, or Google.

2026 looks different. Models like Qwen3.5-35B from Alibaba deliver results on many standard tasks that approach commercial cloud models. At the same time, dedicated inference devices like NVIDIA's DGX Spark or the Asus GX10 are available from around EUR 3,000. Small, quiet, with pre-installed Linux. Plug in, load a model, done.

Open-weights means: the model weights are freely available. No API key needed, no subscription, no third-party terms of service. The model runs on your own device. Data never leaves your own network.

EUR 3,000: Entry price for dedicated inference hardware (NVIDIA GB10)
128 GB: Unified memory in GB10 devices for large language models
EUR 500/yr: Approximate electricity cost of a GB10 device under full load

Where local inference works

For a range of tasks, local inference is already practical for daily use:

Code analysis and refactoring

Source code reviews, refactoring suggestions, and documentation run reliably on local models. With proprietary code in particular, this is a clear advantage: nothing leaves your network.

Text work

Summaries, reviews, brainstorming, and drafts for internal documents. Many everyday tasks that previously went to ChatGPT or Claude can be handled locally.

Agentic workflows

Automated processes with clearly defined context deliver usable results. Particularly suited for recurring, structured tasks.

Everyday tasks

Email drafts, meeting summaries, research notes. The bulk of daily AI usage can be covered locally.

Where the cloud still leads

Local models hit their limits on highly complex reasoning tasks with long contexts. Multimodal applications (image analysis, video) are barely usable locally. And if you need the best available model for a specific task, you will still end up with the major cloud providers.

Claude, GPT, or Gemini in their strongest variants are still ahead of local alternatives for demanding tasks.

That is not a flaw. It describes the current state of the art. The question is not whether local models can fully replace the cloud. Rather, it is about what share of daily work can sensibly be handled locally.

The arguments for local inference

Data sovereignty

Every request to an external API transfers data to the provider. With coding assistants, that means your entire source code. With chat tools, the full conversation history. With agents, files and system contexts on top. Many users are not aware of how much data is actually transmitted. With local inference, the question does not arise.

Predictable costs

Cloud inference is billed per token. With heavy use, costs scale up. A local device has fixed acquisition costs and manageable operating costs. A GB10 device runs at about EUR 500 per year in electricity under full load. Significantly less in normal operation.
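The EUR 500/yr figure is easy to sanity-check with back-of-the-envelope arithmetic. The ~200 W average draw and EUR 0.30/kWh electricity price below are illustrative assumptions, not manufacturer or market figures:

```python
# Back-of-the-envelope electricity cost for a local inference box.
# The 200 W draw and EUR 0.30/kWh price are illustrative assumptions.

def annual_electricity_cost_eur(avg_watts: float, eur_per_kwh: float,
                                hours: float = 24 * 365) -> float:
    """Annual cost = average draw (kW) * hours per year * price per kWh."""
    kwh_per_year = avg_watts / 1000 * hours
    return kwh_per_year * eur_per_kwh

full_load = annual_electricity_cost_eur(avg_watts=200, eur_per_kwh=0.30)
print(f"Full load, year-round: ~EUR {full_load:.0f}")  # ~EUR 526, in line with the figure above
```

At half that average draw in normal operation, the bill drops proportionally, which is what makes the per-token cloud pricing comparison favour local hardware under heavy use.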

Availability

No rate limiting, no API outages, no unilateral price changes. Your model runs when you need it, as often as you need it.

GDPR compliance

Local processing eliminates third-country transfers and the complexity of data processing agreements (DPAs). This simplifies the data protection assessment considerably, especially for businesses in regulated industries.

Counterarguments and limitations

Local inference is not a turnkey solution. A few points that tend to get overlooked in the enthusiasm:

Maintenance and operations

A local device needs administration. OS updates, model changes, monitoring, network configuration. This requires expertise that not every company has in-house. Cloud APIs abstract away this complexity.

Model quality is a spectrum

That Qwen3.5-35B approaches commercial models on benchmarks does not mean the results are equivalent in every situation. In daily work with complex prompts or niche topics, the gaps can be more noticeable.

Pace of development

Local hardware is an investment in today's state of the art. Cloud providers continuously roll out new models without requiring you to swap hardware. In two years, the hardware may be outdated.

Scaling

A single device is enough for a team of two. Not for 50 concurrent users. Cluster solutions like exo exist but increase both cost and complexity significantly.


Geopolitical dimension

An argument gaining weight: the major AI providers are based in the US and China. Regulatory interventions, export restrictions, or political conflicts can affect the availability of AI services.

Anyone who builds their business processes on a single provider operating in a foreign jurisdiction takes on a risk that is not technical in nature.

The caveat cuts both ways: open-weights models currently come predominantly from China (Alibaba, Moonshot AI, MiniMax). Local inference reduces the operational dependency, but the strategic dependency on the makers of models and hardware (NVIDIA, Apple) remains.

Hardware overview

The market for local inference hardware is moving fast. An overview of the relevant options:

| Hardware | Price from | Memory | Suited for | Limitations |
| --- | --- | --- | --- | --- |
| NVIDIA GB10 (DGX Spark, Asus GX10) | approx. EUR 3,000 | 128 GB | LLM inference, entry-level, teams | Linux knowledge helpful |
| Apple Mac (M-chip) | approx. EUR 1,500 | 16-64 GB | Smaller models up to 14B parameters | Limited to smaller models |
| Apple Mac Studio | approx. EUR 8,000 | 256-512 GB | Large models, high bandwidth | High price |
| AMD Strix Halo Mini-PCs | approx. EUR 2,000 | variable | Experimental, early adopters | No CUDA, immature ecosystem |
| Used RTX 3090 (2-3x) | approx. EUR 1,500 | 48-72 GB VRAM | Startups, Linux-experienced teams | Loud, power-hungry, high-maintenance |
| exo cluster | approx. EUR 15,000 | variable | Very large models, teams | High cost and complexity |

For getting started, NVIDIA GB10 devices are currently the most practical option: compact, quiet, optimised for LLM inference, and with a comparatively low entry barrier.
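Why a 14B model fits a mid-range Mac while a GB10 handles much larger ones comes down to a simple rule of thumb: quantized weights take roughly params × (bits / 8) bytes, plus runtime overhead for the KV cache and buffers. The 20% overhead factor below is a simplifying assumption, not a measured value:

```python
# Rough memory estimate for running a quantized model: weights take about
# params * (bits / 8) bytes; KV cache and runtime buffers add more on top.
# The 20% overhead factor is a simplifying assumption, not a measured value.

def model_memory_gb(params_billion: float, quant_bits: int,
                    overhead: float = 1.2) -> float:
    weight_bytes = params_billion * 1e9 * quant_bits / 8
    return weight_bytes * overhead / 1e9

print(f"14B at 4-bit: ~{model_memory_gb(14, 4):.0f} GB")  # fits a 16-64 GB Mac
print(f"35B at 4-bit: ~{model_memory_gb(35, 4):.0f} GB")  # comfortable within 128 GB
```

The same arithmetic explains the table's upper end: only the 256-512 GB Mac Studio or a cluster comfortably holds the largest open-weights models.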

A sensible split

Local inference does not have to replace cloud AI. A pragmatic split based on data classification works better in practice than an either-or approach:

Process locally

HR data, contracts, customer data, internal strategy documents, proprietary code. Everything your company would not want in someone else's hands.

Keep in the cloud

Public research, marketing copy, generic code tasks without sensitive context. Tasks where the best available model makes the difference.
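A split like this only works if it is enforced mechanically rather than left to individual judgment. One minimal sketch is a routing rule keyed on data-classification labels; the labels and endpoint names here are hypothetical placeholders, not a real API:

```python
# Sketch of a data-classification router: tasks touching sensitive data stay
# on the local endpoint, everything else may use a cloud model.
# Labels and endpoint names are hypothetical placeholders, not a real API.

SENSITIVE_LABELS = {"hr", "contract", "customer_data", "strategy", "proprietary_code"}

def route(task_labels: set[str]) -> str:
    """Return 'local' if any label marks the task as sensitive, else 'cloud'."""
    return "local" if task_labels & SENSITIVE_LABELS else "cloud"

print(route({"proprietary_code"}))   # local
print(route({"marketing_copy"}))     # cloud
```

The design choice worth noting: the rule defaults to the cloud only when no sensitive label is present, so an unlabelled sensitive document is the failure mode to guard against. In practice the default should be conservative.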

European cloud alternatives

If you want cloud inference but prefer to avoid US providers: Nebius (data centres in France and Finland) or AKI.IO (German and European servers) offer open-weights models via API in full GDPR compliance.

Conclusion without euphoria

Local AI inference is practical in 2026. Not for everything, but for a relevant share of daily AI usage in businesses. The hardware is affordable, the models are good enough, and the arguments for data sovereignty and cost control are real.

At the same time, local inference is neither a turnkey solution nor a silver bullet. It requires technical expertise, ties up resources for operations and maintenance, and cannot match the best cloud models for complex tasks.

The strategically smart decision is probably not an either-or choice. Rather, it is a deliberate split: local inference for everyday work and sensitive data, cloud models for the heavy lifting. With clear rules about which data goes where.

If you are interested in getting started, do not wait for perfect hardware. The GB10 devices are a good starting point for gaining experience. The real work is not the setup. It is deciding which tasks and data should be processed locally going forward.


Frequently asked questions

What hardware do I need for local AI models?

For getting started, NVIDIA GB10 devices like the DGX Spark or Asus GX10 are available from around EUR 3,000. They offer 128 GB of memory and are optimised for LLM inference. Apple Macs with M-chips work for smaller models up to 14B parameters, while larger models require a Mac Studio with 256 or 512 GB of Unified Memory.

Are local AI models as good as ChatGPT or Claude?

For many standard tasks such as code analysis, text work, or brainstorming, open-weights models like Qwen3.5-35B deliver comparable results. However, for very complex reasoning, long contexts, or multimodal tasks, cloud models from OpenAI, Anthropic, and Google still lead the field.

What does local AI inference cost to run?

Electricity costs for a GB10 device run at about EUR 500 per year under full load, significantly less in normal operation. Add one-time acquisition costs starting from around EUR 3,000. Compared to cloud APIs that charge per token, costs are often lower with heavy use and, more importantly, much more predictable.

Is local AI inference GDPR-compliant?

Yes. Local processing simplifies GDPR compliance considerably: it eliminates third-country transfers and the complexity of data processing agreements (DPAs). Data never leaves your own network, making the data protection assessment significantly easier.

Can local AI completely replace the cloud?

No, a complete replacement is not practical at this point. The best strategy is a deliberate split: sensitive data such as HR records, contracts, or proprietary code is processed locally. For public research, marketing copy, or particularly demanding tasks, cloud models remain the better choice.

What are open-weights models?

Open-weights models are AI language models whose trained weights are freely available. You can download them and run them on your own hardware without an API key, subscription, or third-party terms of service. Well-known examples include Qwen3.5 from Alibaba, Kimi 2.5 from Moonshot AI, and Llama from Meta.