The new LiteRT NeuroPilot Accelerator from Google and MediaTek is a concrete step toward running real generative models on phones, laptops, and IoT hardware without shipping every request to a data center. It takes the existing LiteRT runtime and wires it directly into MediaTek's NeuroPilot NPU stack, so developers can deploy LLMs and embedding models with a single API surface instead of per-chip custom code.

What Is the LiteRT NeuroPilot Accelerator?

LiteRT is the successor to TensorFlow Lite. It is a high-performance runtime that sits on device, runs models in the .tflite FlatBuffer format, and can target CPU, GPU, and now NPU backends through a unified hardware acceleration layer.

The LiteRT NeuroPilot Accelerator is the new NPU path for MediaTek hardware. It replaces the older TFLite NeuroPilot delegate with a direct integration into the NeuroPilot compiler and runtime. Instead of treating the NPU as a thin delegate, LiteRT now uses a Compiled Model API that understands both ahead-of-time (AOT) compilation and on-device compilation, and exposes both through the same C++ and Kotlin APIs.

On the hardware side, the integration currently targets MediaTek Dimensity 7300, 8300, 9000, 9200, 9300, and 9400 SoCs, which together cover a large part of the Android mid-range and flagship device space.

Why Developers Care: A Unified Workflow for Fragmented NPUs

Historically, on-device ML stacks were CPU- and GPU-first. NPU SDKs shipped as vendor-specific toolchains that required separate compilation flows per SoC, custom delegates, and manual runtime packaging. The result was a combinatorial explosion of binaries and a lot of device-specific debugging.

The LiteRT NeuroPilot Accelerator replaces that with a three-step workflow that is the same regardless of which MediaTek NPU is present:

  • Convert or load a .tflite model as usual.
  • Optionally use the LiteRT Python tools to run AOT compilation and produce an AI Pack tied to one or more target SoCs.
  • Ship the AI Pack through Play for On-device AI (PODAI), then select Accelerator.NPU at runtime. LiteRT handles device targeting and runtime loading, and falls back to GPU or CPU if the NPU is not available.

For you as an engineer, the main change is that device-targeting logic moves into a structured configuration file and Play delivery, while the app code mostly interacts with CompiledModel and Accelerator.NPU.

Both AOT and on-device compilation are supported. AOT compiles for a known SoC ahead of time and is recommended for larger models because it removes the cost of compiling on the user's device. On-device compilation is better suited to small models and generic .tflite distribution, at the cost of higher first-run latency. The blog post shows that for a model such as Gemma-3-270M, pure on-device compilation can take more than a minute, which makes AOT the practical option for production LLM use.

Gemma, Qwen, and Embedding Models on MediaTek NPUs

The stack is built around open-weight models rather than a single proprietary NLU path. Google and MediaTek list explicit, production-oriented support for:

  • Qwen3 0.6B, for text generation in markets such as mainland China.
  • Gemma-3-270M, a compact base model that is easy to fine-tune for tasks like sentiment analysis and entity extraction.
  • Gemma-3-1B, a multilingual text-only model for summarization and general reasoning.
  • Gemma-3n E2B, a multimodal model that handles text, audio, and vision for tasks like real-time translation and visual question answering.
  • EmbeddingGemma 300M, a text embedding model for retrieval-augmented generation, semantic search, and classification.

On the latest Dimensity 9500, running on a Vivo X300 Pro, the Gemma 3n E2B variant reaches more than 1600 tokens per second in prefill and 28 tokens per second in decode at a 4K context length when executed on the NPU.

For text generation use cases, LiteRT-LM sits on top of LiteRT and exposes a stateful engine with a text-in, text-out API. A typical C++ flow is to create ModelAssets, build an Engine with litert::lm::Backend::NPU, then create a Session and call GenerateContent per conversation, as sketched below. For embedding workloads, EmbeddingGemma uses the lower-level LiteRT CompiledModel API in a tensor-in, tensor-out configuration, again with the NPU selected through the hardware accelerator options.
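A minimal sketch of that flow, assuming the API shapes used in the LiteRT-LM samples (the model file name below is hypothetical, and exact signatures may differ):

using litert::lm::Backend;
using litert::lm::Engine;
using litert::lm::EngineSettings;
using litert::lm::InputText;
using litert::lm::ModelAssets;
using litert::lm::SessionConfig;

// Point the engine at a model bundle (hypothetical file name) and pick the NPU.
auto model_assets = ModelAssets::Create("gemma3-1b.litertlm");
auto settings = EngineSettings::CreateDefault(*model_assets, Backend::NPU);

// One Engine per model, one Session per conversation.
auto engine = Engine::CreateEngine(std::move(*settings));
auto session = (*engine)->CreateSession(SessionConfig::CreateDefault());

// Text in, text out; the session keeps the conversation state.
auto responses = (*session)->GenerateContent(
    {InputText("Summarize the release notes in two sentences.")});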

Developer Experience: C++ Pipeline and Zero-Copy Buffers

LiteRT introduces a new C++ API that replaces the older C entry points and is designed around explicit Environment, Model, CompiledModel, and TensorBuffer objects.

For MediaTek NPUs, this API integrates tightly with Android's AHardwareBuffer and GPU buffers. You can construct input TensorBuffer instances directly from OpenGL or OpenCL buffers with TensorBuffer::CreateFromGlBuffer, which lets image-processing code feed NPU inputs without an intermediate copy through CPU memory. That matters for real-time camera and video processing, where multiple copies per frame quickly saturate memory bandwidth.
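As a rough illustration, the zero-copy path looks something like the sketch below. TensorBuffer::CreateFromGlBuffer is the documented entry point, but the argument list and the surrounding glue shown here are assumptions for illustration, not the verified signature:

// Assumed: an OpenGL SSBO already filled by the camera or a GL preprocessing
// pass, plus a tensor type queried from the model (names are illustrative).
GLuint gl_buffer_id = preprocessed_frame_ssbo;
size_t frame_bytes = 640 * 480 * 3 * sizeof(float);

// Wrap the GL buffer as a LiteRT input without copying through CPU memory.
auto input_buffer = TensorBuffer::CreateFromGlBuffer(
    *env, input_tensor_type, GL_SHADER_STORAGE_BUFFER, gl_buffer_id,
    frame_bytes, /*offset=*/0);

// The buffer aliases the GL memory and is passed to Run like any other input.
std::vector<TensorBuffer> inputs;
inputs.push_back(std::move(*input_buffer));
compiled->Run(inputs, output_buffers);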

A typical high-level C++ path on device looks like this, omitting error handling for clarity:

// Load a model compiled for the NPU
auto model = Model::CreateFromFile("model.tflite");
auto options = Options::Create();
options->SetHardwareAccelerators(kLiteRtHwAcceleratorNpu);

// Create the compiled model
auto compiled = CompiledModel::Create(*env, *model, *options);

// Allocate buffers and run
auto input_buffers = compiled->CreateInputBuffers();
auto output_buffers = compiled->CreateOutputBuffers();
input_buffers[0].Write<float>(input_span);
compiled->Run(input_buffers, output_buffers);
output_buffers[0].Read(output_span);

The same Compiled Model API is used whether you are targeting the CPU, GPU, or the MediaTek NPU, which reduces the amount of conditional logic in application code.
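In practice, that means the accelerator can be reduced to a single parameter. A small sketch, assuming the Cpu and Gpu flags mirror the kLiteRtHwAcceleratorNpu flag used above and that CompiledModel::Create returns litert::Expected:

// One compile path for every backend; only the accelerator flag changes.
litert::Expected<CompiledModel> CompileFor(Environment& env, Model& model,
                                           LiteRtHwAccelerators accelerator) {
  auto options = Options::Create();
  options->SetHardwareAccelerators(accelerator);
  return CompiledModel::Create(env, model, *options);
}

// Identical call sites for NPU, GPU, and CPU targets.
auto on_npu = CompileFor(*env, *model, kLiteRtHwAcceleratorNpu);
auto on_gpu = CompileFor(*env, *model, kLiteRtHwAcceleratorGpu);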

Key Takeaways

  1. LiteRT NeuroPilot Accelerator is the new, first-class NPU integration between LiteRT and MediaTek NeuroPilot, replacing the old TFLite delegate and exposing a unified Compiled Model API with both AOT and on-device compilation on supported Dimensity SoCs.
  2. The stack targets concrete open-weight models, including Qwen3-0.6B, Gemma-3-270M, Gemma-3-1B, Gemma-3n-E2B, and EmbeddingGemma-300M, and runs them through LiteRT and LiteRT-LM on MediaTek NPUs with a single accelerator abstraction.
  3. AOT compilation is strongly recommended for LLMs: Gemma-3-270M, for example, can take more than a minute to compile on device, so production deployments should compile once in the pipeline and ship AI Packs via Play for On-device AI.
  4. On a Dimensity 9500-class NPU, Gemma-3n-E2B can reach more than 1600 tokens per second in prefill and 28 tokens per second in decode at 4K context, with measured throughput up to 12x the CPU and 10x the GPU for LLM workloads.
  5. For developers, the C++ and Kotlin LiteRT APIs provide a common path to select Accelerator.NPU, manage compiled models, and use zero-copy tensor buffers, so CPU, GPU, and MediaTek NPU targets can share one code path and one deployment workflow.



Michal Sutter is a data science professional with a Master of Science in Data Science from the University of Padova. With a solid foundation in statistical analysis, machine learning, and data engineering, Michal excels at transforming complex datasets into actionable insights.
