
Image by Author | Canva
# Introduction
Open-source AI is having a major moment. With advances in large language models, general machine learning, and now speech technologies, open-source models are rapidly narrowing the gap with proprietary systems. One of the most exciting entrants in this space is Microsoft's open-source voice stack, VibeVoice. This model family is designed for natural, expressive, and interactive conversation, rivaling the quality of top-tier commercial offerings.
In this article, we will explore VibeVoice, download the model, and run inference on Google Colab using the GPU runtime. We will also cover troubleshooting for common issues that may arise while running model inference.
# Introduction to VibeVoice
VibeVoice is a next-generation Text-to-Speech (TTS) framework for creating expressive, long-form, multi-speaker audio such as podcasts and dialogues. Unlike traditional TTS, it excels at scalability, speaker consistency, and natural turn-taking.
Its core innovation lies in continuous acoustic and semantic tokenizers operating at 7.5 Hz, paired with a Large Language Model (Qwen2.5-1.5B) and a diffusion head for generating high-fidelity audio. This design enables up to 90 minutes of speech with 4 distinct speakers, surpassing prior systems.
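To get an intuition for why the low 7.5 Hz frame rate makes long-form generation feasible, here is a quick back-of-the-envelope calculation (my own rough estimate, not an official figure from the model card):

# Rough estimate: acoustic frames produced for a maximum-length session,
# assuming one frame per tokenizer step at 7.5 Hz (an assumption for illustration).
frame_rate_hz = 7.5      # tokenizer frame rate reported for VibeVoice
max_minutes = 90         # maximum supported speech duration
frames = frame_rate_hz * max_minutes * 60
print(f"{frames:,.0f} acoustic frames for {max_minutes} minutes of audio")  # ~40,500

At roughly 40,500 frames for a full-length session, the sequence stays short enough for a 1.5B-parameter language model to process, which is what makes the 90-minute figure practical.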
VibeVoice is available as an open-source model on Hugging Face, with community-maintained code for easy experimentation and use.

# Getting Started with VibeVoice-1.5B
In this guide, we will learn how to clone the VibeVoice repository and run the demo by providing it with a text file to generate multi-speaker natural speech. It only takes around 5 minutes from setup to generating the audio.
// 1. Clone the community repository & install
First, clone the community version of the VibeVoice repository (vibevoice-community/VibeVoice), install the required Python packages, and also install the Hugging Face Hub library to download the model using the Python API.
Note: Before starting the Colab session, ensure your runtime type is set to T4 GPU.
!git clone -q --depth 1 https://github.com/vibevoice-community/VibeVoice.git /content/VibeVoice
%pip install -q -e /content/VibeVoice
%pip install -q -U huggingface_hub
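Before moving on, it is worth confirming that the notebook can actually see the GPU. This quick check is optional and assumes PyTorch was installed as a dependency of the VibeVoice package:

# Optional sanity check: confirm that PyTorch detects the Colab GPU.
import torch

print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))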
// 2. Download the model snapshot from Hugging Face
Download the model repository using the Hugging Face snapshot API. This will download all the files from the microsoft/VibeVoice-1.5B repository.
from huggingface_hub import snapshot_download

snapshot_download(
    "microsoft/VibeVoice-1.5B",
    local_dir="/content/models/VibeVoice-1.5B",
    local_dir_use_symlinks=False
)
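Once the snapshot finishes, you can quickly verify that the weights and config files landed in the expected folder. This is just a sanity check and is not required by the demo:

# List the downloaded model files to confirm the snapshot is complete.
import os

model_dir = "/content/models/VibeVoice-1.5B"
for name in sorted(os.listdir(model_dir)):
    size_mb = os.path.getsize(os.path.join(model_dir, name)) / 1e6
    print(f"{name:40s} {size_mb:9.1f} MB")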
// 3. Create a transcript with speakers
We will create a text file inside Google Colab. For that, we will use the %%writefile magic function to provide the content. Below is a sample conversation between two speakers about KDnuggets.
%%writefile /content/my_transcript.txt
Speaker 1: Have you read the latest article on KDnuggets?
Speaker 2: Yes, it's one of the best resources for data science and AI.
Speaker 1: I like how KDnuggets always keeps up with the latest trends.
Speaker 2: Absolutely, it's a go-to platform for anyone in the AI community.
// 4. Run inference (multi-speaker)
Now, we will run the demo Python script from the VibeVoice repository. The script requires the model path, text file path, and speaker names.
Run #1: Map Speaker 1 → Alice, Speaker 2 → Frank
!python /content/VibeVoice/demo/inference_from_file.py \
    --model_path /content/models/VibeVoice-1.5B \
    --txt_path /content/my_transcript.txt \
    --speaker_names Alice Frank
As a result, you will see the following output. The model will use CUDA to generate the audio, with Frank and Alice as the two speakers. It will also print a summary that you can use for analysis.
Using device: cuda
Found 9 voice files in /content/VibeVoice/demo/voices
Available voices: en-Alice_woman, en-Carter_man, en-Frank_man, en-Mary_woman_bgm, en-Maya_woman, in-Samuel_man, zh-Anchen_man_bgm, zh-Bowen_man, zh-Xinran_woman
Reading script from: /content/my_transcript.txt
Found 4 speaker segments:
1. Speaker 1
Text preview: Speaker 1: Have you read the latest article on KDnuggets?...
2. Speaker 2
Text preview: Speaker 2: Yes, it's one of the best resources for data science and AI....
3. Speaker 1
Text preview: Speaker 1: I like how KDnuggets always keeps up with the latest trends....
4. Speaker 2
Text preview: Speaker 2: Absolutely, it's a go-to platform for anyone in the AI community....
Speaker mapping:
Speaker 2 -> Frank
Speaker 1 -> Alice
Speaker 1 ('Alice') -> Voice: en-Alice_woman.wav
Speaker 2 ('Frank') -> Voice: en-Frank_man.wav
Loading processor & model from /content/models/VibeVoice-1.5B
==================================================
GENERATION SUMMARY
==================================================
Input file: /content/my_transcript.txt
Output file: ./outputs/my_transcript_generated.wav
Speaker names: ['Alice', 'Frank']
Number of unique speakers: 2
Number of segments: 4
Prefilling tokens: 368
Generated tokens: 118
Total tokens: 486
Generation time: 28.27 seconds
Audio duration: 15.47 seconds
RTF (Real Time Factor): 1.83x
==================================================
Play the audio in the notebook:
We will now use the IPython Audio function to listen to the generated audio inside Colab.
from IPython.display import Audio, display

out_path = "/content/outputs/my_transcript_generated.wav"
display(Audio(out_path))
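If you want to keep the clip after the Colab session ends, you can also download it to your machine. This step is optional and simply uses Colab's built-in file download helper on the output path printed in the generation summary:

# Optional: download the generated WAV file from Colab to your local machine.
from google.colab import files

files.download("/content/outputs/my_transcript_generated.wav")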

It took 28 seconds to generate the audio, and it sounds clear, natural, and smooth. I love it!
Try again with different voice actors.
Run #2: Try different voices (Mary for Speaker 1, Carter for Speaker 2)
!python /content/VibeVoice/demo/inference_from_file.py \
    --model_path /content/models/VibeVoice-1.5B \
    --txt_path /content/my_transcript.txt \
    --speaker_names Mary Carter
The generated audio was even better, with background music at the start and a smooth transition between speakers.
Found 9 voice files in /content/VibeVoice/demo/voices
Available voices: en-Alice_woman, en-Carter_man, en-Frank_man, en-Mary_woman_bgm, en-Maya_woman, in-Samuel_man, zh-Anchen_man_bgm, zh-Bowen_man, zh-Xinran_woman
Reading script from: /content/my_transcript.txt
Found 4 speaker segments:
1. Speaker 1
Text preview: Speaker 1: Have you read the latest article on KDnuggets?...
2. Speaker 2
Text preview: Speaker 2: Yes, it's one of the best resources for data science and AI....
3. Speaker 1
Text preview: Speaker 1: I like how KDnuggets always keeps up with the latest trends....
4. Speaker 2
Text preview: Speaker 2: Absolutely, it's a go-to platform for anyone in the AI community....
Speaker mapping:
Speaker 2 -> Carter
Speaker 1 -> Mary
Speaker 1 ('Mary') -> Voice: en-Mary_woman_bgm.wav
Speaker 2 ('Carter') -> Voice: en-Carter_man.wav
Loading processor & model from /content/models/VibeVoice-1.5B
Tip: If you are unsure which names are available, the script prints "Available voices:" on startup.
Common ones include:
en-Alice_woman, en-Carter_man, en-Frank_man, en-Mary_woman_bgm, en-Maya_woman, in-Samuel_man, zh-Anchen_man_bgm, zh-Bowen_man, zh-Xinran_woman
# Troubleshooting
// 1. Repo Doesn't Have Demo Scripts?
The official Microsoft VibeVoice repository has been pulled and reset. Community reports indicate that some code and demos were removed or are no longer accessible at the original location. If you find that the official repository is missing inference examples, check the community mirror that has preserved the original demos and instructions: https://github.com/vibevoice-community/VibeVoice
// 2. Slow Generation or CUDA Errors in Colab
Verify that you are on a GPU runtime: Runtime → Change runtime type → Hardware accelerator: GPU (T4 or any available GPU).
// 3. CUDA OOM (Out of Memory)
To reduce memory pressure, you can take several steps. Begin by shortening the input text and reducing the generation length. Consider lowering the audio sample rate and/or adjusting internal chunk sizes if the script allows it. Set the batch size to 1 and opt for a smaller model variant. If a run has already failed, freeing the cached GPU memory before retrying can also help, as in the sketch below.
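A minimal sketch of that last step, assuming PyTorch is the backend (as the demo output above indicates):

# Free cached GPU memory and inspect usage before retrying a failed run.
import gc
import torch

gc.collect()
torch.cuda.empty_cache()
print(f"Allocated: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
print(f"Reserved:  {torch.cuda.memory_reserved() / 1e9:.2f} GB")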
// 4. No Audio or Missing Outputs Folder
The script usually prints the final output path in the console; scroll up to find the exact location, or search for the file directly:
!find /content -name "*generated.wav"
// 5. Voice Names Not Found?
Copy the exact names listed under Available voices, or use the alias names (Alice, Frank, Mary, Carter) shown in the demo. They correspond to the .wav assets bundled with the repository.
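You can also list the bundled voice presets directly, since they are just .wav files in the repository's demo/voices folder:

# List the voice preset files shipped with the community repository.
import os

voices_dir = "/content/VibeVoice/demo/voices"
for wav in sorted(os.listdir(voices_dir)):
    print(wav)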
# Final Thoughts
For many projects, I would choose an open-source stack like VibeVoice over paid APIs for several compelling reasons. First, it is easy to integrate and offers flexibility for customization, making it suitable for a wide range of applications. Additionally, it is surprisingly light on GPU requirements, which can be a significant advantage in resource-constrained environments.
VibeVoice is open source, which means that in the future you can expect improved frameworks that enable faster generation, even on CPUs.
Abid Ali Awan (@1abidaliawan) is a certified data scientist professional who loves building machine learning models. Currently, he is focusing on content creation and writing technical blogs on machine learning and data science technologies. Abid holds a Master's degree in technology management and a bachelor's degree in telecommunication engineering. His vision is to build an AI product using a graph neural network for students struggling with mental illness.