Huawei’s CloudMatrix-Infer outpaces Nvidia H100 with 384 AI chips

Huawei unveiled CloudMatrix-Infer, a massive AI inference system featuring 384 Ascend 910C NPUs and 192 Kunpeng CPUs interconnected through a 2.8Tbps Unified Bus architecture. The system represents Huawei’s most ambitious challenge to Nvidia’s AI infrastructure dominance, offering superior throughput performance while demonstrating the company’s complete vertical AI stack from silicon to software.

What you should know: CloudMatrix-Infer delivers impressive performance metrics that outpace Nvidia’s H100 chips on similar workloads.

  • The system achieves 6,688 tokens per second of prefill and 1,943 tokens per second of decode per NPU, surpassing Nvidia H100 performance on comparable tasks.
  • Huawei compensates for individual chip limitations by networking five times more processors than typical Nvidia configurations, turning quantity into a performance advantage.
  • The architecture supports advanced features like disaggregated prefill-decode-caching, INT8 quantization, and microbatch pipelining optimized for large-scale inference.
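INT8 quantization is one of the throughput levers named above. As a rough, generic illustration only (Huawei's exact scheme isn't detailed here), the sketch below shows per-channel symmetric INT8 weight quantization: each output channel is scaled so its largest weight maps to 127, stored as int8, and rescaled back at matmul time.

```python
import numpy as np

def quantize_int8_per_channel(weights: np.ndarray):
    """Symmetric per-channel INT8 quantization for a [out_features, in_features] matrix."""
    # One scale per output channel: the largest |w| in the row maps to the int8 limit 127.
    scales = np.abs(weights).max(axis=1, keepdims=True) / 127.0
    scales = np.where(scales == 0.0, 1.0, scales)  # guard all-zero rows
    q = np.clip(np.round(weights / scales), -127, 127).astype(np.int8)
    return q, scales.astype(np.float32)

def int8_linear(x: np.ndarray, q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """Apply a linear layer whose weights are stored as INT8 plus per-channel scales."""
    # Toy version: matmul in float, then undo the quantization scaling.
    # Production kernels keep the int8 operands and accumulate in int32 instead.
    return (x @ q.astype(np.float32).T) * scales.T

# Quick check that the quantized layer tracks the full-precision one closely.
rng = np.random.default_rng(0)
w = rng.normal(size=(8, 16)).astype(np.float32)
x = rng.normal(size=(4, 16)).astype(np.float32)
q, s = quantize_int8_per_channel(w)
print(np.max(np.abs(x @ w.T - int8_linear(x, q, s))))  # small quantization error
```

Storing weights in 8 bits halves their memory traffic relative to FP16, which is typically the binding constraint during large-model decode.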

The big picture: Huawei’s vertical integration strategy positions the company as a credible alternative to Nvidia’s CUDA ecosystem, particularly for massive AI model deployment.

  • The company’s CANN (Compute Architecture for Neural Networks) stack has matured to version 7.0, mirroring CUDA’s layered structure across driver, runtime, and libraries.
  • CloudMatrix demonstrates system-level optimization specifically designed for next-generation inference workloads, especially models exceeding 700B parameters.
  • The hybrid design pairs Kunpeng CPUs with Ascend NPUs to handle control-plane tasks like distributed resource scheduling and fault recovery, preventing non-AI workloads from bottlenecking performance.
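To make that division of labor concrete, here is a minimal, purely illustrative sketch (not Huawei's scheduler, whose internals aren't public): request placement and fault recovery live in plain CPU-side code, so the accelerator workers do nothing but model execution.

```python
from dataclasses import dataclass, field

@dataclass
class Worker:
    """Stand-in for one NPU worker; in a real system this would wrap device execution."""
    worker_id: int
    healthy: bool = True
    queue: list = field(default_factory=list)

class ControlPlane:
    """CPU-side control plane: request placement and fault recovery, no model math."""

    def __init__(self, workers: list[Worker]):
        self.workers = workers

    def schedule(self, request_id: str) -> int:
        # Least-loaded placement across the workers that are still healthy.
        target = min((w for w in self.workers if w.healthy), key=lambda w: len(w.queue))
        target.queue.append(request_id)
        return target.worker_id

    def recover(self, failed_id: int) -> None:
        # Mark the worker down and replay its queued requests onto healthy peers.
        failed = self.workers[failed_id]
        failed.healthy = False
        pending, failed.queue = failed.queue, []
        for request_id in pending:
            self.schedule(request_id)

cp = ControlPlane([Worker(i) for i in range(4)])
for r in ("req-0", "req-1", "req-2", "req-3", "req-4"):
    cp.schedule(r)
cp.recover(failed_id=0)
print([(w.worker_id, w.queue) for w in cp.workers])
```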

Technical advantages: The Unified Bus architecture represents a fundamental departure from Nvidia’s NVLink approach, implementing an all-to-all interconnect topology.

  • Huawei uses 6,912 optical modules to create a “flat” network connecting all 384 NPUs and 192 CPUs across 16 racks, eliminating hierarchical communication hops.
  • CANN supports massive-scale expert parallelism (EP320), allowing one expert per NPU die, a capability that even CUDA ecosystems struggle to achieve (sketched in code after this list).
  • The architecture is purpose-built for massive Mixture of Experts (MoE) models with bandwidth-first design principles.
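Expert parallelism at that scale reduces to an all-to-all token exchange: each device hosts one expert, routes every token it holds to whichever device owns that token's chosen expert, and gets the processed result back. Below is a minimal single-process sketch with a toy router and toy experts; nothing in it comes from CANN or CloudMatrix.

```python
import numpy as np

rng = np.random.default_rng(0)
num_devices = 8        # stands in for NPU dies; EP320 would use 320 of them
d_model = 16
tokens_per_device = 4

# One tiny "expert" (a weight matrix) per device, mirroring a one-expert-per-die layout.
experts = [rng.normal(size=(d_model, d_model)) for _ in range(num_devices)]
# Each device starts with its own local batch of tokens.
local_tokens = [rng.normal(size=(tokens_per_device, d_model)) for _ in range(num_devices)]
# Router: a destination expert per token (random here; real routers are learned top-k gates).
routes = [rng.integers(0, num_devices, size=tokens_per_device) for _ in range(num_devices)]

# Dispatch all-to-all: every token travels to the device that owns its chosen expert.
inbox = [[] for _ in range(num_devices)]
for src, (toks, dests) in enumerate(zip(local_tokens, routes)):
    for slot, dest in enumerate(dests):
        inbox[dest].append((src, slot, toks[slot]))

# Each device applies its single expert, then results are returned to the
# originating device and slot (the second, "combine" all-to-all).
outputs = [np.zeros_like(t) for t in local_tokens]
for expert_id, items in enumerate(inbox):
    for src, slot, tok in items:
        outputs[src][slot] = tok @ experts[expert_id]

print(outputs[0].shape)  # every token is back home, processed by its expert
```

Both the dispatch and the return trip are dense all-to-all traffic, which is exactly the pattern a flat, bandwidth-first interconnect is built to serve.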

In plain English: Think of traditional AI chip networks like a corporate hierarchy where messages must travel up and down through managers. Huawei’s Unified Bus creates a flat organization where every chip can talk directly to every other chip, eliminating the bottlenecks. This is particularly useful for massive AI models that split their “expertise” across many different specialized components—like having a team of specialists who can all communicate instantly rather than passing notes through a chain of command.

Remaining challenges: Despite impressive performance gains, Huawei faces significant hurdles in competing with Nvidia’s established ecosystem.

  • CloudMatrix consumes 3.9× more power than Nvidia’s GB200, creating viability concerns in regions with expensive energy or strict carbon targets.
  • CUDA maintains advantages in ecosystem maturity, documentation quality, third-party library support, and global developer community size.
  • Code migration from CUDA to CANN still requires significant effort, despite existing PyTorch and TensorFlow adapters (a minimal device-selection sketch follows this list).
  • Community trust and familiarity continue to favor Nvidia, particularly outside Asia, with some organizations legally restricted from using Huawei technology.
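One practical way teams soften that migration cost is to keep model code device-agnostic and confine the backend choice to a single function. The sketch below assumes Huawei's torch_npu PyTorch adapter is installed and exposes an "npu" device type; treat it as a pattern, not a verified CloudMatrix recipe.

```python
import torch

def pick_device() -> torch.device:
    """Prefer an Ascend NPU (via the torch_npu adapter), then CUDA, then CPU."""
    try:
        import torch_npu  # noqa: F401  Huawei's PyTorch adapter; assumed installed
        if torch.npu.is_available():
            return torch.device("npu")
    except (ImportError, AttributeError):
        pass  # adapter missing or didn't register the "npu" backend
    if torch.cuda.is_available():
        return torch.device("cuda")
    return torch.device("cpu")

device = pick_device()
model = torch.nn.Linear(1024, 1024).to(device)
x = torch.randn(8, 1024, device=device)
print(model(x).shape, "running on", device)
```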

Why this matters: The development signals intensifying competition in AI infrastructure as geopolitical tensions drive demand for alternatives to Nvidia’s dominance.

  • Chinese organizations increasingly view Huawei as the de facto choice for AI infrastructure, while Western markets remain largely committed to Nvidia.
  • The system’s optimization for massive MoE models positions Huawei to compete in the next generation of AI applications requiring unprecedented computational scale.
  • For regions where CloudMatrix is available, the architecture presents a viable option for deploying 700B+ parameter models, potentially reshaping AI infrastructure decision-making.

What the experts say: Forrester analysts emphasize that winning the AI race requires more than just faster chips.

  • “Huawei isn’t just catching up with its CANN stack and the new CloudMatrix architecture, it’s redefining how AI infrastructure works,” the analysts noted.
  • “As someone who’s built applications (in my past life), I’d rely on CUDA’s mature, frictionless ecosystem,” they acknowledged, highlighting the practical challenges facing Huawei’s adoption.
  • The analysts concluded that diversifying AI infrastructure “is no longer optional — it’s a strategic imperative” given ongoing geopolitical uncertainty.
  • The commentary comes from Forrester’s analysis “AI Race: Can Huawei Close The AI Gap?”
