HomeSample Page

Sample Page Title


Microsoft has launched VibeVoice-ASR as a part of the VibeVoice household of open supply frontier voice AI fashions. VibeVoice-ASR is described as a unified speech-to-text mannequin that may deal with 60-minute long-form audio in a single go and output structured transcriptions that encode Who, When, and What, with help for Custom-made Hotwords.

VibeVoice sits in a single repository that hosts Textual content-to-Speech, actual time TTS, and Automated Speech Recognition fashions beneath an MIT license. VibeVoice makes use of steady speech tokenizers that run at 7.5 Hz and a next-token diffusion framework the place a Massive Language Mannequin causes over textual content and dialogue and a diffusion head generates acoustic element. This framework is principally documented for TTS, but it surely defines the general design context wherein VibeVoice-ASR lives.

https://huggingface.co/microsoft/VibeVoice-ASR

Lengthy type ASR with a single world context

Not like typical ASR (Automated Speech Recognition) methods that first reduce audio into quick segments after which run diarization and alignment as separate elements, VibeVoice-ASR is designed to just accept as much as 60 minutes of steady audio enter inside a 64K token size price range. The mannequin retains one world illustration of the total session. This implies the mannequin can preserve speaker identification and matter context throughout all the hour as a substitute of resetting each few seconds.

60-minute Single-Go Processing

The first key characteristic is that many typical ASR methods course of lengthy audio by slicing it into quick segments, which may lose world context. VibeVoice-ASR as a substitute takes as much as 60 minutes of steady audio inside a 64K token window so it could possibly preserve constant speaker monitoring and semantic context throughout all the recording.

That is vital for duties like assembly transcription, lectures, and lengthy help calls. A single go over the entire sequence simplifies the pipeline. There is no such thing as a must implement customized logic to merge partial hypotheses or restore speaker labels at boundaries between audio chunks.

Custom-made Hotwords for area accuracy

Custom-made Hotwords are the second key characteristic. Customers can present hotwords resembling product names, group names, technical phrases, or background context. The mannequin makes use of these hotwords to information the popularity course of.

This lets you bias decoding towards the proper spelling and pronunciation for area particular tokens with out retraining the mannequin. For instance, a dev-user can go inner undertaking names or buyer particular phrases at inference time. That is helpful when deploying the identical base mannequin throughout a number of merchandise that share comparable acoustic situations however very completely different vocabularies.

Microsoft additionally ships a finetuning-asr listing with LoRA based mostly tremendous tuning scripts for VibeVoice-ASR. Collectively, hotwords and LoRA tremendous tuning give a path for each gentle weight adaptation and deeper area specialization.

Wealthy Transcription, diarization, and timing

The third characteristic is Wealthy Transcription with Who, When, and What. The mannequin collectively performs ASR, diarization, and timestamping, and returns a structured output that signifies who mentioned what and when.

See under the three analysis figures named DER, cpWER, and tcpWER.

https://huggingface.co/microsoft/VibeVoice-ASR
  • DER is Diarization Error Price, it measures how properly the mannequin assigns speech segments to the proper speaker
  • cpWER and tcpWER are phrase error fee metrics computed beneath conversational settings

These graphs summarize how properly the mannequin performs on multi speaker lengthy type information, which is the first goal setting for this ASR system.

The structured output format is properly fitted to downstream processing like speaker particular summarization, motion merchandise extraction, or analytics dashboards. Since segments, audio system, and timestamps already come from a single mannequin, downstream code can deal with the transcript as a time aligned occasion log.

Key Takeaways

  • VibeVoice-ASR is a unified speech to textual content mannequin that handles 60 minute lengthy type audio in a single go inside a 64K token context.
  • The mannequin collectively performs ASR, diarization, and timestamping so it outputs structured transcripts that encode Who, When, and What in a single inference step.
  • Custom-made Hotwords let customers inject area particular phrases resembling product names or technical jargon to enhance recognition accuracy with out retraining the mannequin.
  • Analysis with DER, cpWER, and tcpWER focuses on multi speaker conversational eventualities which aligns the mannequin with conferences, lectures, and lengthy calls.
  • VibeVoice-ASR is launched within the VibeVoice open supply stack beneath MIT license with official weights, tremendous tuning scripts, and an internet Playground for experimentation.

Try the Mannequin Weights, Repo and Playground. Additionally, be at liberty to observe us on Twitter and don’t neglect to affix our 100k+ ML SubReddit and Subscribe to our E-newsletter. Wait! are you on telegram? now you possibly can be part of us on telegram as properly.


Related Articles

LEAVE A REPLY

Please enter your comment!
Please enter your name here

Latest Articles