Waymo is introducing the Waymo World Model, a frontier generative model that drives its next generation of autonomous driving simulation. The system is built on top of Genie 3, Google DeepMind's general-purpose world model, and adapts it to produce photorealistic, controllable, multi-sensor driving scenes at scale.
Waymo already reports nearly 200 million fully autonomous miles on public roads. Behind the scenes, the Waymo Driver trains and is evaluated on billions of additional miles in virtual worlds. The Waymo World Model is now the primary engine producing these worlds, with the explicit goal of exposing the stack to rare, safety-critical "long-tail" events that are almost impossible to encounter often enough in reality.
From Genie 3 to a driving-specific world model
Genie 3 is a general-purpose world model that turns text prompts into interactive environments you can navigate in real time at roughly 24 frames per second, typically at 720p resolution. It learns the dynamics of scenes directly from large video corpora and supports fluid control through user inputs.
Waymo uses Genie 3 as the backbone and post-trains it for the driving domain. The Waymo World Model retains Genie 3's ability to generate coherent 3D worlds, but aligns the outputs with Waymo's sensor suite and operating constraints. It generates high-fidelity camera images and lidar point clouds that evolve consistently over time, matching how the Waymo Driver actually perceives the environment.
This is not just video rendering. The model produces multi-sensor, temporally consistent observations that downstream autonomous driving systems can consume under the same conditions as real-world logs.
Emergent multimodal world knowledge
Most AV simulators are trained solely on on-road fleet data. That limits them to the weather, infrastructure, and traffic patterns a fleet actually encountered. Waymo instead leverages Genie 3's pre-training on an extremely large and diverse set of videos to import broad "world knowledge" into the simulator.
Waymo then applies specialized post-training to transfer this knowledge from 2D video into 3D lidar outputs tailored to its hardware. Cameras provide rich appearance and lighting. Lidar contributes precise geometry and depth. The Waymo World Model jointly generates these modalities, so a simulated scene comes with both RGB streams and realistic 4D point clouds.
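The idea of jointly generated, temporally consistent multi-sensor output can be sketched with a toy data structure. Everything below is illustrative: the `SensorFrame` class, field layouts, and the fixed-timestep check are assumptions for the sketch, not Waymo's actual formats.

```python
from dataclasses import dataclass

@dataclass
class SensorFrame:
    # Illustrative sketch of one simulated timestep, not Waymo's real format.
    timestamp_s: float
    rgb: list      # camera image, e.g. [row][col] = (r, g, b)
    points: list   # lidar returns, e.g. [(x, y, z, intensity), ...]

def is_temporally_consistent(frames: list, dt: float = 0.1) -> bool:
    """True if frames advance at a fixed step, as a log consumer would expect."""
    return all(
        abs((b.timestamp_s - a.timestamp_s) - dt) < 1e-6
        for a, b in zip(frames, frames[1:])
    )

# A short rollout where camera and lidar share one clock per frame.
rollout = [SensorFrame(t * 0.1, rgb=[], points=[]) for t in range(5)]
print(is_temporally_consistent(rollout))  # True
```

The point of bundling both modalities in one frame is that a downstream consumer can treat simulation output exactly like a recorded log.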
Because of the diversity of the pre-training data, the model can synthesize conditions that Waymo's fleet has not directly seen. The Waymo team shows examples such as light snow on the Golden Gate Bridge, tornadoes, flooded cul-de-sacs, tropical streets unusually covered in snow, and driving out of a roadway fire. It also handles unusual objects and edge cases like elephants, Texas longhorns, lions, pedestrians dressed as T-rexes, and car-sized tumbleweeds.
The important point is that these behaviors are emergent. The model is not explicitly programmed with rules for elephants or tornado fluid dynamics. Instead, it reuses generic spatiotemporal structure learned from videos and adapts it to driving scenes.
Three axes of controllability
A key design goal is strong simulation controllability. The Waymo World Model exposes three main control mechanisms: driving action control, scene layout control, and language control.
Driving action control: The simulator responds to specific driving inputs, enabling "what if" counterfactuals on top of recorded logs. Developers can ask whether the Waymo Driver could have driven more assertively instead of yielding in a past scene, and then simulate that alternative behavior. Because the model is fully generative, it maintains realism even when the simulated route diverges far from the original trajectory, where purely reconstructive methods like 3D Gaussian Splatting (3DGS) would suffer from missing viewpoints.
Scene layout control: The model can be conditioned on modified road geometry, traffic signal states, and other road users. Waymo can insert or reposition vehicles and pedestrians, or apply mutations to road layouts, to synthesize targeted interaction scenarios. This supports systematic stress testing of yielding, merging, and negotiation behaviors beyond what appears in raw logs.
Language control: Natural language prompts act as a flexible, high-level interface for editing time of day and weather, and even for generating entirely synthetic scenes. The Waymo team demonstrates "World Mutation" sequences where the same base city scene is rendered at dawn, morning, noon, afternoon, evening, and night, and then under cloudy, foggy, rainy, snowy, and sunny conditions.
This tri-axis control is close to a structured API: numeric driving actions, structural layout edits, and semantic text prompts all steer the same underlying world model.
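To make the "structured API" framing concrete, here is a minimal sketch of what a request combining the three control axes might look like. All class names, fields, and edit kinds here are hypothetical, invented for illustration; Waymo has not published such an interface.

```python
from dataclasses import dataclass, field

@dataclass
class DrivingAction:
    # Axis 1: numeric low-level control (hypothetical units shown).
    steering: float = 0.0       # radians
    acceleration: float = 0.0   # m/s^2

@dataclass
class SceneEdit:
    # Axis 2: structural layout edit, e.g. inserting another road user.
    kind: str                   # e.g. "insert_vehicle", "move_pedestrian"
    params: dict = field(default_factory=dict)

@dataclass
class SimRequest:
    # Hypothetical request steering one underlying world model via all three axes.
    actions: list
    scene_edits: list
    prompt: str = ""            # Axis 3: semantic text control

req = SimRequest(
    actions=[DrivingAction(steering=0.05, acceleration=1.2)],
    scene_edits=[SceneEdit(kind="insert_vehicle", params={"lane": "oncoming"})],
    prompt="light snow at dawn",
)
print(len(req.actions), req.prompt)
```

The design point is that the three axes are orthogonal: a counterfactual trajectory, a layout mutation, and a weather prompt can all be applied to the same base scene in one request.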
Turning ordinary videos into multimodal simulations
The Waymo World Model can convert ordinary mobile or dashcam recordings into multimodal simulations that show how the Waymo Driver would perceive the same scene.
Waymo showcases examples from scenic drives in Norway, Arches National Park, and Death Valley. Given only the video, the model reconstructs a simulation with aligned camera images and lidar output. This yields scenarios with strong realism and factuality, because the generated world is anchored to actual footage, while still being controllable via the three mechanisms above.
Practically, this means a large corpus of consumer-style video can be reused as structured simulation input without requiring lidar recordings at those locations.
Scalable inference and long rollouts
Long-horizon maneuvers such as threading a narrow lane against oncoming traffic or navigating dense neighborhoods require many simulation steps. Naive generative models suffer from quality drift and high compute cost over long rollouts.
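Why quality drifts over long rollouts can be shown with a toy numerical model: an autoregressive generator feeds its own slightly imperfect output back in, so errors compound with horizon length. This is purely a didactic sketch of the failure mode, not a model of Waymo's system; the error rates and amplification factor are made up.

```python
import random

def naive_rollout_error(steps: int, per_step_error: float = 0.01, seed: int = 0) -> float:
    """Toy drift model: each autoregressive step adds fresh error and
    slightly amplifies the error already baked into its input frame."""
    random.seed(seed)
    error = 0.0
    for _ in range(steps):
        error = error * 1.01 + per_step_error * random.random()
    return error

short = naive_rollout_error(50)
long_horizon = naive_rollout_error(1000)
print(f"accumulated error: {short:.3f} (50 steps) vs {long_horizon:.3f} (1000 steps)")
```

The compounding term is why long-horizon simulation needs either periodic re-anchoring or a more efficient model variant, rather than simply running the naive generator for more steps.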
The Waymo team reports an efficient variant of the Waymo World Model that supports long sequences with a dramatic reduction in compute while maintaining realism. They show 4x-speed playback of extended scenes such as freeway navigation around an in-lane obstruction, busy neighborhood driving, climbing steep streets around motorcyclists, and handling SUV U-turns.
For training and regression testing, this reduces the hardware budget per scenario and makes large test suites more tractable.
Key Takeaways
- Genie 3–based world model: The Waymo World Model adapts Google DeepMind's Genie 3 into a driving-specific world model that generates photorealistic, interactive, multi-sensor 3D environments for AV simulation.
- Multi-sensor, 4D outputs aligned with the Waymo Driver: The simulator jointly produces temporally consistent camera imagery and lidar point clouds, aligned with Waymo's actual sensor stack, so downstream autonomy systems can consume simulation like real logs.
- Emergent coverage of rare and long-tail scenarios: By leveraging large-scale video pre-training, the model can synthesize unusual conditions and objects, such as snow on unlikely roads, floods, fires, and animals like elephants or lions, that the fleet has never directly observed.
- Tri-axis controllability for targeted stress testing: Driving action control, scene layout control, and language control let developers run counterfactuals, edit road geometry and traffic participants, and mutate time of day or weather via text prompts in the same generative environment.
- Efficient long-horizon and video-anchored simulation: An optimized variant supports long rollouts at reduced compute cost, and the system can also convert ordinary dashcam or mobile videos into controllable multimodal simulations, expanding the pool of realistic scenarios.

