The Google Health AI team has released MedASR, an open weights medical speech-to-text model that targets medical dictation and doctor-patient conversations and is designed to plug directly into modern AI workflows.

What MedASR is and where it fits

MedASR is a speech-to-text model based on the Conformer architecture and is pre-trained for medical dictation and transcription. It is positioned as a starting point for developers who want to build healthcare voice applications such as radiology dictation tools or visit note capture systems.

The model has 105 million parameters and accepts mono-channel audio at 16,000 hertz with 16-bit integer waveforms. It produces text-only output, so it drops directly into downstream natural language processing or generative models such as MedGemma.
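The input format described above can be checked before audio is sent to a model. The following is a minimal sketch using only the Python standard library. The function name `check_wav_format` and the constants are illustrative assumptions and are not part of MedASR or any Google tooling.

```python
import os
import tempfile
import wave

# Values taken from the article's description of the expected input.
EXPECTED_RATE_HZ = 16_000      # 16,000 hertz sample rate
EXPECTED_CHANNELS = 1          # mono
EXPECTED_SAMPLE_WIDTH = 2      # bytes per sample, i.e. 16-bit integers


def check_wav_format(path):
    """Return a list of format problems; an empty list means the file matches.

    Hypothetical helper for illustration only.
    """
    problems = []
    with wave.open(path, "rb") as wav:
        if wav.getnchannels() != EXPECTED_CHANNELS:
            problems.append(f"expected mono, found {wav.getnchannels()} channels")
        if wav.getframerate() != EXPECTED_RATE_HZ:
            problems.append(f"expected {EXPECTED_RATE_HZ} Hz, found {wav.getframerate()} Hz")
        if wav.getsampwidth() != EXPECTED_SAMPLE_WIDTH:
            problems.append(f"expected 16-bit samples, found {8 * wav.getsampwidth()}-bit")
    return problems


# Demo: write one second of 16 kHz mono 16-bit silence, then check it.
demo_path = os.path.join(tempfile.mkdtemp(), "silence.wav")
with wave.open(demo_path, "wb") as out:
    out.setnchannels(1)
    out.setsampwidth(2)
    out.setframerate(16_000)
    out.writeframes(b"\x00\x00" * 16_000)

print(check_wav_format(demo_path))
```

A file that fails the check would need to be downmixed or resampled first. This sketch only inspects the WAV header and does not convert anything.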

MedASR sits inside the Health AI Developer Foundations portfolio, alongside MedGemma, MedSigLIP and other domain-specific medical models that share common terms of use and a consistent governance story.

Training data and domain specialization

MedASR is trained on a diverse corpus of de-identified medical speech. The dataset includes about 5,000 hours of doctor dictations and medical conversations across radiology, internal medicine and family medicine.

The training data pairs audio segments with transcripts and metadata. Subsets of the conversational data are annotated with medical named entities including symptoms, medications and conditions. This gives the model strong coverage of medical vocabulary and phrasing patterns that appear in routine documentation.

The model is English only, and most training audio comes from speakers for whom English is a first language and who were raised in the United States. The documentation notes that performance may be lower for other speaker profiles or noisy microphones and recommends fine-tuning for such settings.

Architecture and decoding

MedASR follows the Conformer encoder design. Conformer combines convolution blocks with self-attention layers so it can capture local acoustic patterns and longer-range temporal dependencies in the same stack.

The model is exposed as an automatic speech recognizer with a CTC-style interface. In the reference implementation, developers use AutoProcessor to create input features from waveform audio and AutoModelForCTC to produce token sequences. Decoding uses greedy decoding by default. The model can also be paired with an external six-gram language model with beam search of size 8 to improve word error rate.
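To make the greedy decoding step concrete, here is a minimal sketch of the general CTC greedy rule: pick the highest-scoring token in each frame, collapse consecutive repeats, then drop blanks. The toy vocabulary, the blank index of 0 and the function name `ctc_greedy_decode` are assumptions for illustration. MedASR's real tokenizer and blank index are not described in this article.

```python
def ctc_greedy_decode(frame_scores, vocab, blank_id=0):
    """Greedy CTC decoding over per-frame scores.

    frame_scores: list of frames, each a list of scores (one per vocab entry).
    vocab: list mapping token id to its string (hypothetical toy vocabulary).
    blank_id: index of the CTC blank token (assumed to be 0 here).
    """
    # Step 1: best token id per frame (argmax).
    best_ids = [max(range(len(frame)), key=frame.__getitem__) for frame in frame_scores]

    # Step 2: collapse consecutive repeats, then drop blanks.
    decoded = []
    previous = None
    for token_id in best_ids:
        if token_id != previous and token_id != blank_id:
            decoded.append(vocab[token_id])
        previous = token_id
    return "".join(decoded)


toy_vocab = ["<blank>", "c", "a", "t"]
toy_frames = [
    [0.1, 0.7, 0.1, 0.1],    # c
    [0.1, 0.6, 0.2, 0.1],    # c (repeat, collapsed)
    [0.8, 0.1, 0.05, 0.05],  # blank
    [0.1, 0.1, 0.7, 0.1],    # a
    [0.1, 0.1, 0.6, 0.2],    # a (repeat, collapsed)
    [0.1, 0.1, 0.1, 0.7],    # t
    [0.9, 0.05, 0.03, 0.02], # blank
]
print(ctc_greedy_decode(toy_frames, toy_vocab))  # prints "cat"
```

Beam search with a language model keeps several candidate sequences per frame instead of only the single best one, which is why it can recover words that greedy decoding gets wrong.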

MedASR training uses JAX and ML Pathways on TPUv4p, TPUv5p and TPUv5e hardware. These systems provide the scale needed for large speech models and align with Google's broader foundation model training stack.

Performance on medical speech tasks

Key results, reported as word error rate with greedy decoding and with a six-gram language model, are:

  • RAD DICT, radiologist dictation: MedASR greedy 6.6 percent, MedASR plus language model 4.6 percent, Gemini 2.5 Pro 10.0 percent, Gemini 2.5 Flash 24.4 percent, Whisper v3 Large 25.3 percent.
  • GENERAL DICT, general and internal medicine: MedASR greedy 9.3 percent, MedASR plus language model 6.9 percent, Gemini 2.5 Pro 16.4 percent, Gemini 2.5 Flash 27.1 percent, Whisper v3 Large 33.1 percent.
  • FM DICT, family medicine: MedASR greedy 8.1 percent, MedASR plus language model 5.8 percent, Gemini 2.5 Pro 14.6 percent, Gemini 2.5 Flash 19.9 percent, Whisper v3 Large 32.5 percent.
  • Eye Gaze, dictation on 998 MIMIC chest X-ray cases: MedASR greedy 6.6 percent, MedASR plus language model 5.2 percent, Gemini 2.5 Pro 5.9 percent, Gemini 2.5 Flash 9.3 percent, Whisper v3 Large 12.5 percent.
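For readers unfamiliar with the metric behind these figures, word error rate is the word-level edit distance (substitutions, insertions and deletions) between a reference transcript and the model output, divided by the number of reference words. The sketch below shows the standard calculation. The function name `word_error_rate` and the example sentences are made up for illustration, and any text normalization applied in the published benchmarks is not described here, so this will not reproduce the numbers above.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference transcript must contain at least one word")

    # previous_row[j] = edit distance between the reference words seen so far
    # and the first j hypothesis words.
    previous_row = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current_row = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            current_row.append(min(
                previous_row[j] + 1,                      # deletion
                current_row[j - 1] + 1,                   # insertion
                previous_row[j - 1] + substitution_cost,  # match or substitution
            ))
        previous_row = current_row
    return previous_row[-1] / len(ref_words)


# Illustrative sentences, not taken from any benchmark.
reference = "no acute cardiopulmonary abnormality"
hypothesis = "no acute cardio pulmonary abnormality"
print(word_error_rate(reference, hypothesis))  # one substitution + one insertion over 4 words = 0.5
```

Because insertions count as errors, the rate can exceed 100 percent when the hypothesis is much longer than the reference.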

Developer workflow and deployment options

A minimal pipeline example is:

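from transformers import pipeline
import huggingface_hub

# Download the sample audio file from the model repository
audio = huggingface_hub.hf_hub_download("google/medasr", "test_audio.wav")
# Build a speech recognition pipeline around the MedASR checkpoint
pipe = pipeline("automatic-speech-recognition", model="google/medasr")
# Transcribe long audio in 20 second chunks with a 2 second stride
result = pipe(audio, chunk_length_s=20, stride_length_s=2)
print(result)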
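The chunk_length_s=20 and stride_length_s=2 arguments split long recordings into overlapping windows. The sketch below illustrates the arithmetic under the assumption that the stride applies to both sides of each chunk, so consecutive windows advance by chunk length minus twice the stride. It is a simplified illustration with a hypothetical function name, `chunk_windows`, and is not the library's actual implementation.

```python
def chunk_windows(total_s, chunk_length_s=20.0, stride_length_s=2.0):
    """Return (start, end) times in seconds for overlapping chunks.

    Assumes the stride is applied on both the left and right of every chunk,
    so each step forward is chunk_length_s - 2 * stride_length_s.
    """
    if total_s <= 0:
        raise ValueError("total_s must be positive")
    step = chunk_length_s - 2 * stride_length_s
    if step <= 0:
        raise ValueError("chunk_length_s must be larger than twice stride_length_s")

    windows = []
    start = 0.0
    while True:
        end = min(start + chunk_length_s, total_s)
        windows.append((start, end))
        if end >= total_s:
            break  # the last window reaches the end of the recording
        start += step
    return windows


# A 50 second recording with the settings from the example above.
print(chunk_windows(50))  # [(0.0, 20.0), (16.0, 36.0), (32.0, 50)]
```

The overlapping regions give the model context on both sides of each chunk boundary, which helps avoid cutting words in half when the per-chunk transcripts are merged.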

For more control, developers load AutoProcessor and AutoModelForCTC, resample audio to 16,000 hertz with librosa, move tensors to CUDA if available and call model.generate followed by processor.batch_decode.

Key Takeaways

  1. MedASR is a lightweight, open weights, Conformer-based medical ASR model: It has 105M parameters, is trained specifically for medical dictation and transcription, and is released under the Health AI Developer Foundations program as an English-only model for healthcare developers.
  2. Domain-specific training on about 5,000 hours of de-identified medical audio: MedASR is pre-trained on doctor dictations and medical conversations across specialties like radiology, internal medicine and family medicine, which gives it strong coverage of medical terminology compared to general-purpose ASR systems.
  3. Competitive or better word error rates on medical dictation benchmarks: On internal radiology, general medicine, family medicine and Eye Gaze datasets, MedASR with greedy or language model decoding matches or outperforms large general models such as Gemini 2.5 Pro, Gemini 2.5 Flash and Whisper v3 Large on word error rate for English medical speech.
