
Maia 200: Microsoft's In-House AI Inference Accelerator for Azure


Maia 200 is Microsoft's new in-house AI accelerator designed for inference in Azure datacenters. It targets the cost of token generation for large language models and other reasoning workloads by combining narrow-precision compute, a dense on-chip memory hierarchy, and an Ethernet-based scale-up fabric.

Why Microsoft built a dedicated inference chip

Training and inference stress hardware in different ways. Training needs very large all-to-all communication and long-running jobs. Inference cares about tokens per second, latency, and tokens per dollar. Microsoft positions Maia 200 as its most efficient inference system, with about 30% better performance per dollar than the latest hardware in its fleet.
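
As a rough illustration of the tokens-per-dollar framing, the Python sketch below converts a throughput and an hourly price into a cost per million tokens and applies the reported 30% performance-per-dollar gain. The throughput and price values are made-up placeholders, not Microsoft figures.

# Hypothetical illustration of the "tokens per dollar" framing. Only the
# ~30% performance-per-dollar claim comes from the article; the baseline
# throughput and hourly price below are placeholders.

def cost_per_million_tokens(tokens_per_second: float, dollars_per_hour: float) -> float:
    """Serving cost in dollars per one million generated tokens."""
    tokens_per_hour = tokens_per_second * 3600
    return dollars_per_hour / tokens_per_hour * 1_000_000

baseline = cost_per_million_tokens(tokens_per_second=5_000, dollars_per_hour=10.0)
# 30% better performance per dollar means ~1.3x the tokens for the same spend,
# so the cost per token drops by a factor of 1/1.3.
improved = baseline / 1.30

print(f"baseline:      ${baseline:.3f} per 1M tokens")
print(f"+30% perf/$:   ${improved:.3f} per 1M tokens")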

Maia 200 is part of a heterogeneous Azure stack. It will serve a number of models, including the latest GPT-5.2 models from OpenAI, and will power workloads in Microsoft Foundry and Microsoft 365 Copilot. The Microsoft Superintelligence team will use the chip for synthetic data generation and reinforcement learning to improve in-house models.

Core silicon and numeric specs

Each Maia 200 die is fabricated on TSMC's 3-nanometer process. The chip integrates more than 140 billion transistors.

The compute pipeline is built around native FP8 and FP4 tensor cores. A single chip delivers more than 10 petaFLOPS in FP4 and more than 5 petaFLOPS in FP8, within a 750 W SoC TDP envelope.
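
Taking the quoted lower bounds at face value, a quick back-of-the-envelope calculation gives the compute density per watt. This is simple arithmetic on the published figures, not an additional claim about the silicon.

# Compute density implied by the figures above. Only the >10 PF FP4,
# >5 PF FP8 and 750 W numbers come from the article.

FP4_PFLOPS = 10.0   # peak FP4, petaFLOPS (lower bound, "more than 10")
FP8_PFLOPS = 5.0    # peak FP8, petaFLOPS (lower bound, "more than 5")
TDP_WATTS = 750.0   # SoC TDP envelope

fp4_tflops_per_watt = FP4_PFLOPS * 1000 / TDP_WATTS
fp8_tflops_per_watt = FP8_PFLOPS * 1000 / TDP_WATTS

print(f"FP4: ~{fp4_tflops_per_watt:.1f} TFLOPS per watt")  # ~13.3
print(f"FP8: ~{fp8_tflops_per_watt:.1f} TFLOPS per watt")  # ~6.7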

Memory is split between stacked HBM and on-die SRAM. Maia 200 provides 216 GB of HBM3e with about 7 TB per second of bandwidth and 272 MB of on-die SRAM. The SRAM is organized into tile-level SRAM and cluster-level SRAM and is fully software managed. Compilers and runtimes can place working sets explicitly to keep attention and GEMM kernels close to compute.
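
A simple roofline-style calculation on these numbers shows why software-managed SRAM matters: kernels need very high arithmetic intensity before the chip is compute-bound rather than HBM-bandwidth-bound. The sketch below uses only the figures quoted above.

# Roofline-style sketch: how many FP8 operations a kernel must perform per
# byte fetched from HBM before peak compute, not bandwidth, is the limit.

PEAK_FP8_FLOPS = 5.0e15     # >5 petaFLOPS FP8
HBM_BYTES_PER_S = 7.0e12    # ~7 TB/s HBM3e bandwidth

ridge_point = PEAK_FP8_FLOPS / HBM_BYTES_PER_S
print(f"Arithmetic intensity at the ridge point: ~{ridge_point:.0f} FLOP/byte")
# Kernels below this intensity (e.g. memory-bound decode-time attention) are
# limited by the 7 TB/s of HBM bandwidth, which is why keeping hot working
# sets in the software-managed 272 MB of SRAM pays off.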

Tile-based microarchitecture and memory hierarchy

The Maia 200 microarchitecture is hierarchical. The base unit is the tile, the smallest autonomous compute and storage unit on the chip. Each tile includes a Tile Tensor Unit for high-throughput matrix operations and a Tile Vector Processor as a programmable SIMD engine. Tile SRAM feeds both units, and tile DMA engines move data in and out of SRAM without stalling compute. A Tile Control Processor orchestrates the sequence of tensor and DMA work.
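
The sketch below shows the kind of schedule a Tile Control Processor would sequence: DMA-in, tensor compute, and DMA-out overlapped with double buffering so the tensor unit is not stalled by data movement. Every function name here is a stand-in stub; Maia's actual programming interface is not described in the article.

# Minimal double-buffering sketch under the assumptions stated above.
# All functions are print-only stubs standing in for real hardware commands.

def dma_in(chunk: int, buf: int) -> None:
    print(f"DMA in  : chunk {chunk} -> tile SRAM buffer {buf}")

def ttu_matmul(chunk: int, buf: int) -> None:
    print(f"TTU     : matmul on chunk {chunk} (buffer {buf})")

def tvp_post(chunk: int, buf: int) -> None:
    print(f"TVP     : activation/postprocess on chunk {chunk} (buffer {buf})")

def dma_out(chunk: int, buf: int) -> None:
    print(f"DMA out : chunk {chunk} results -> Cluster SRAM / HBM")

def tile_control_processor(num_chunks: int) -> None:
    """Double-buffered schedule: load chunk i+1 while computing chunk i."""
    for chunk in range(num_chunks):
        buf = chunk % 2                  # ping-pong between two tile SRAM buffers
        dma_in(chunk, buf)               # next working set streams in ...
        if chunk > 0:                    # ... while the previous one is computed
            prev = chunk - 1
            ttu_matmul(prev, prev % 2)
            tvp_post(prev, prev % 2)
            dma_out(prev, prev % 2)
    last = num_chunks - 1                # drain the final chunk
    ttu_matmul(last, last % 2)
    tvp_post(last, last % 2)
    dma_out(last, last % 2)

tile_control_processor(num_chunks=4)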

Multiple tiles form a cluster. Each cluster exposes a larger multi-banked Cluster SRAM that is shared across tiles in that cluster. Cluster-level DMA engines move data between Cluster SRAM and the co-packaged HBM stacks. A cluster core coordinates multi-tile execution and uses redundancy schemes for tiles and SRAM to improve yield while keeping the same programming model.
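
One way to picture the resulting hierarchy is as a capacity ladder that a compiler walks when it places a working set. In the hedged model below, only the 216 GB HBM figure comes from the article; the per-tile and per-cluster SRAM sizes are placeholders, since Microsoft has not published that breakdown.

# Hedged data model of the tile / cluster / HBM hierarchy. Tile and cluster
# SRAM capacities are placeholders, NOT published Maia 200 numbers.

from dataclasses import dataclass

@dataclass
class Tier:
    name: str
    capacity_bytes: int
    scope: str                   # which unit of the hierarchy owns it

HIERARCHY = [
    Tier("tile SRAM (per tile)",       2 * 2**20,   "tile"),     # placeholder
    Tier("Cluster SRAM (per cluster)", 16 * 2**20,  "cluster"),  # placeholder
    Tier("HBM3e (per chip)",           216 * 2**30, "chip"),     # from the article
]

def first_tier_that_fits(working_set_bytes: int) -> Tier:
    """Pick the tier closest to compute that can hold the working set."""
    for tier in HIERARCHY:
        if working_set_bytes <= tier.capacity_bytes:
            return tier
    raise ValueError("working set exceeds on-package memory")

print(first_tier_that_fits(1 * 2**20).name)   # small tensor  -> tile SRAM
print(first_tier_that_fits(8 * 2**20).name)   # larger payload -> Cluster SRAM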

This hierarchy lets the software stack pin different parts of the model in different tiers. For example, attention kernels can hold Q, K, and V tensors in tile SRAM, while collective communication kernels can stage payloads in Cluster SRAM and reduce HBM pressure. The design goal is sustained high utilization as models grow in size and sequence length.
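
To make the working-set reasoning concrete, the sketch below sizes a block of FP8 Q/K/V activations for one attention layer against the total on-chip SRAM. The model dimensions are hypothetical; only the 272 MB SRAM and 216 GB HBM figures come from the article.

# Worked example: size of one Q/K/V block in FP8. Model shape is hypothetical.

BYTES_FP8 = 1

def qkv_block_bytes(seq_block: int, num_heads: int, head_dim: int) -> int:
    """Q, K and V tiles for one block of the sequence, FP8 activations."""
    per_tensor = seq_block * num_heads * head_dim * BYTES_FP8
    return 3 * per_tensor

block = qkv_block_bytes(seq_block=2048, num_heads=64, head_dim=128)  # hypothetical shape
total_sram = 272e6   # 272 MB on-chip SRAM, from the article

print(f"Q/K/V block: {block / 1e6:.1f} MB "
      f"({100 * block / total_sram:.1f}% of the 272 MB on-chip SRAM)")
# Blocks of this size can live in tile / Cluster SRAM, while the much larger
# full KV cache for long sequences stays in the 216 GB of HBM3e.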

On-chip data movement and Ethernet scale-up fabric

Inference is often limited by data movement, not peak compute. Maia 200 uses a custom Network on Chip together with a hierarchy of DMA engines. The Network on Chip spans tiles, clusters, memory controllers, and I/O units. It has separate planes for large tensor traffic and for small control messages. This separation keeps synchronization and small outputs from being blocked behind large transfers.
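
A toy model of that two-plane split is shown below: bulk tensor payloads and small control or synchronization messages are routed to separate planes so small messages never queue behind large transfers. The size threshold and message classes are illustrative, not Maia's actual routing policy.

# Illustrative plane selection for a two-plane NoC, under assumed classes
# and an assumed size cutoff.

from dataclasses import dataclass

BULK_THRESHOLD_BYTES = 64 * 1024   # hypothetical cutoff

@dataclass
class Message:
    kind: str          # "tensor", "sync", "result"
    size_bytes: int

def select_plane(msg: Message) -> str:
    """Send large tensor payloads to the bulk plane, everything else to the control plane."""
    if msg.kind == "tensor" and msg.size_bytes >= BULK_THRESHOLD_BYTES:
        return "bulk-data plane"
    return "control plane"

for m in [Message("tensor", 8 * 2**20), Message("sync", 64), Message("result", 4 * 1024)]:
    print(f"{m.kind:>6} ({m.size_bytes:>8} B) -> {select_plane(m)}")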

Beyond the chip boundary, Maia 200 integrates its own NIC and an Ethernet-based scale-up network that runs the AI Transport Layer protocol. The on-die NIC exposes about 1.4 TB per second in each direction, or 2.8 TB per second of bidirectional bandwidth, and scales to 6,144 accelerators in a two-tier domain.
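
For a feel of what that link bandwidth means, the bandwidth-only estimate below times a hypothetical activation payload over one direction of the NIC. Only the ~1.4 TB per second figure comes from the article; real transfers also pay latency and protocol overhead that this ignores.

# Bandwidth-only transfer-time estimate over the scale-up NIC.

NIC_BYTES_PER_S = 1.4e12        # ~1.4 TB/s each direction, from the article

def transfer_us(payload_bytes: float) -> float:
    """Time to push a payload over one link, ignoring latency and overhead."""
    return payload_bytes / NIC_BYTES_PER_S * 1e6

payload = 64 * 2**20            # 64 MiB of activations (hypothetical size)
print(f"64 MiB over one link: ~{transfer_us(payload):.1f} microseconds")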

Within each tray, four Maia accelerators form a Fully Connected Quad. These four devices have direct, non-switched links to one another. Most tensor-parallel traffic stays within this group, while only lighter collective traffic goes out to switches. This improves latency and reduces switch port count for typical inference collectives.
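
The sketch below shows how a scheduler might exploit that topology: keep each tensor-parallel group inside one Fully Connected Quad so its heavy collectives use the direct links, and let pipeline or data parallelism span quads over the switched fabric. The placement function is illustrative, not Azure's actual scheduling logic.

# Illustrative placement of tensor-parallel (TP) groups onto quads.

QUAD_SIZE = 4   # accelerators per Fully Connected Quad, from the article

def place_tp_groups(num_accelerators: int, tp_degree: int) -> list[list[int]]:
    """Assign TP groups of size tp_degree without crossing quad boundaries."""
    if tp_degree > QUAD_SIZE or QUAD_SIZE % tp_degree != 0:
        raise ValueError("TP group must fit evenly inside one quad")
    groups = []
    for start in range(0, num_accelerators, tp_degree):
        group = list(range(start, start + tp_degree))
        # every member shares the same quad index -> traffic stays non-switched
        assert len({dev // QUAD_SIZE for dev in group}) == 1
        groups.append(group)
    return groups

print(place_tp_groups(num_accelerators=8, tp_degree=4))
# [[0, 1, 2, 3], [4, 5, 6, 7]]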

Azure system integration and cooling

At the system level, Maia 200 follows the same rack, power, and mechanical standards as Azure GPU servers. It supports air-cooled and liquid-cooled configurations and uses a second-generation closed-loop liquid cooling Heat Exchanger Unit for high-density racks. This allows mixed deployments of GPUs and Maia accelerators in the same datacenter footprint.

The accelerator integrates with the Azure control plane. Firmware management, health monitoring, and telemetry use the same workflows as other Azure compute services. This enables fleet-wide rollouts and maintenance without disrupting running AI workloads.

Key Takeaways

Here are the key technical takeaways:

  • Inference-first design: Maia 200 is Microsoft's first silicon and system platform built solely for AI inference, optimized for large-scale token generation in modern reasoning models and large language models.
  • Numeric specs and memory hierarchy: The chip is fabricated on TSMC's 3 nm process, integrates more than 140 billion transistors, and delivers more than 10 PFLOPS FP4 and more than 5 PFLOPS FP8, with 216 GB of HBM3e at about 7 TB per second alongside 272 MB of on-chip SRAM split into tile SRAM and cluster SRAM and managed in software.
  • Performance versus other cloud accelerators: Microsoft reports about 30% better performance per dollar than the latest Azure inference systems and claims 3 times the FP4 performance of third-generation Amazon Trainium and higher FP8 performance than Google TPU v7 at the accelerator level.
  • Tile-based architecture and Ethernet fabric: Maia 200 organizes compute into tiles and clusters with local SRAM, DMA engines, and a Network on Chip, and exposes an integrated NIC with about 1.4 TB per second per direction of Ethernet bandwidth that scales to 6,144 accelerators, using Fully Connected Quad groups as the local tensor-parallel domain.


Michal Sutter is a data science professional with a Master of Science in Data Science from the University of Padova. With a solid foundation in statistical analysis, machine learning, and data engineering, Michal excels at transforming complex datasets into actionable insights.
