The landscape of generative audio is shifting toward efficiency. A new open-source contender, Kani-TTS-2, has been released by the team at nineninesix.ai. The model marks a departure from heavy, compute-expensive TTS systems. Instead, it treats audio as a language, delivering high-fidelity speech synthesis with a remarkably small footprint.
Kani-TTS-2 offers a lean, high-performance alternative to closed-source APIs. It is currently available on Hugging Face in both English (EN) and Portuguese (PT) versions.
The Architecture: LFM2 and NanoCodec
Kani-TTS-2 follows the 'Audio-as-Language' philosophy. The model does not use a conventional mel-spectrogram pipeline. Instead, it converts raw audio into discrete tokens using a neural codec.
The system relies on a two-stage process:
- The Language Backbone: The model is built on LiquidAI's LFM2 (350M) architecture. This backbone generates 'audio intent' by predicting the next audio tokens. Because LFMs (Liquid Foundation Models) are designed for efficiency, they provide a faster alternative to standard transformers.
- The Neural Codec: It uses the NVIDIA NanoCodec to turn those tokens into 22kHz waveforms.
By using this architecture, the model captures human-like prosody (the rhythm and intonation of speech) without the 'robotic' artifacts found in older TTS systems. A minimal sketch of this two-stage flow is shown below.
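To make the token-based pipeline concrete, here is a minimal, purely illustrative sketch of the two stages. The function names and token/frame sizes are hypothetical placeholders, not the actual Kani-TTS-2 API; the point is only to show the flow from text to discrete audio tokens to a 22kHz waveform.

```python
# Conceptual sketch of the 'Audio-as-Language' pipeline.
# All names and shapes here are illustrative assumptions, not the real Kani-TTS-2 interface.
import numpy as np

SAMPLE_RATE = 22_050  # NanoCodec decodes tokens into 22 kHz waveforms

def backbone_generate_tokens(text: str) -> list[int]:
    """Stage 1 (hypothetical): the LFM2-based backbone autoregressively
    predicts a sequence of discrete audio tokens ('audio intent') for the text."""
    # Placeholder: a real model conditions on the text (and optionally a speaker
    # embedding) and samples tokens from its codec vocabulary.
    rng = np.random.default_rng(0)
    return rng.integers(0, 4096, size=200).tolist()

def codec_decode(tokens: list[int]) -> np.ndarray:
    """Stage 2 (hypothetical): the neural codec (NVIDIA NanoCodec in Kani-TTS-2)
    maps the discrete tokens back to a raw waveform."""
    frames_per_token = 512  # assumed hop size, for illustration only
    return np.zeros(len(tokens) * frames_per_token, dtype=np.float32)

waveform = codec_decode(backbone_generate_tokens("Audio as a language: tokens in, waveform out."))
print(f"{waveform.size / SAMPLE_RATE:.2f} seconds of audio at {SAMPLE_RATE} Hz")
```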
Efficiency: 10,000 Hours in 6 Hours
The training metrics for Kani-TTS-2 are a masterclass in optimization. The English model was trained on 10,000 hours of high-quality speech data.
While that scale is impressive, the speed of training is the real story. The research team trained the model in only 6 hours using a cluster of 8 NVIDIA H100 GPUs. This demonstrates that large datasets no longer require weeks of compute time when paired with efficient architectures like LFM2.
Zero-Shot Voice Cloning and Performance
The standout feature for developers is zero-shot voice cloning. Unlike traditional models that require fine-tuning for new voices, Kani-TTS-2 uses speaker embeddings, as sketched in the example after this list.
- How it works: You provide a short reference audio clip.
- The result: The model extracts the distinctive characteristics of that voice and applies them to the generated text instantly.
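The following sketch illustrates that workflow under stated assumptions: `extract_speaker_embedding` and `synthesize` are hypothetical stand-ins for whatever the released pipeline actually exposes, and the bodies are placeholders. It only shows the shape of zero-shot cloning: embed a short reference clip once, then condition generation on that embedding with no fine-tuning.

```python
# Minimal sketch of zero-shot cloning via speaker embeddings.
# Function names and internals are assumptions, not the actual Kani-TTS-2 API.
import numpy as np

def extract_speaker_embedding(reference_waveform: np.ndarray) -> np.ndarray:
    """Hypothetical: compress a short reference clip into a fixed-size
    speaker embedding that captures timbre and speaking style."""
    # Placeholder: a real encoder is a learned network, not a simple mean.
    return np.full(256, reference_waveform.mean(), dtype=np.float32)

def synthesize(text: str, speaker_embedding: np.ndarray) -> np.ndarray:
    """Hypothetical: generate speech for `text` conditioned on the embedding,
    so the output matches the reference speaker without any fine-tuning."""
    return np.zeros(22_050 * 3, dtype=np.float32)  # placeholder: ~3 s of silence

reference = np.zeros(22_050 * 5, dtype=np.float32)  # ~5 s reference clip
embedding = extract_speaker_embedding(reference)
audio = synthesize("Cloned in one shot, no fine-tuning required.", embedding)
```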
From a deployment perspective, the model is highly accessible:
- Parameter Count: 400M (0.4B) parameters.
- Speed: It features a Real-Time Factor (RTF) of 0.2, meaning it can generate 10 seconds of speech in roughly 2 seconds (see the quick check after this list).
- Hardware: It requires only 3GB of VRAM, making it compatible with consumer-grade GPUs like the RTX 3060 or 4050.
- License: Released under the Apache 2.0 license, allowing for commercial use.
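A quick arithmetic check of the RTF claim: RTF is generation time divided by audio duration, so an RTF of 0.2 corresponds to about 2 seconds of compute per 10 seconds of speech.

```python
# Real-Time Factor: generation_time = audio_duration * RTF
rtf = 0.2
audio_seconds = 10.0
print(f"Estimated generation time: {audio_seconds * rtf:.1f} s")  # -> 2.0 s
```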
Key Takeaways
- Efficient Architecture: The model uses a 400M parameter backbone based on LiquidAI's LFM2 (350M). This 'Audio-as-Language' approach treats speech as discrete tokens, allowing for faster processing and more human-like intonation compared to traditional architectures.
- Rapid Training at Scale: Kani-TTS-2-EN was trained on 10,000 hours of high-quality speech data in just 6 hours using 8 NVIDIA H100 GPUs.
- Instant Zero-Shot Cloning: There is no need for fine-tuning to replicate a specific voice. By providing a short reference audio clip, the model uses speaker embeddings to instantly synthesize text in the target speaker's voice.
- High Performance on Edge Hardware: With a Real-Time Factor (RTF) of 0.2, the model can generate 10 seconds of audio in roughly 2 seconds. It requires only 3GB of VRAM, making it fully functional on consumer-grade GPUs like the RTX 3060.
- Developer-Friendly Licensing: Released under the Apache 2.0 license, Kani-TTS-2 is ready for commercial integration. It offers a local-first, low-latency alternative to expensive closed-source TTS APIs.
Check out the model weights on Hugging Face.

