In this tutorial, we build and run a complete pipeline for Netflix's VOID model. We set up the environment, install all required dependencies, clone the repository, download the official base model and VOID checkpoint, and prepare the sample inputs needed for video object removal. We also make the workflow more practical by allowing secure terminal-style secret input for tokens and optionally using an OpenAI model to generate a cleaner background prompt. As we move through the tutorial, we load the model components, configure the pipeline, run inference on a built-in sample, and visualize both the generated result and a side-by-side comparison, giving us a full hands-on understanding of how VOID works in practice.

import os, sys, json, shutil, subprocess, textwrap, gc
from pathlib import Path
from getpass import getpass


def run(cmd, check=True):
    print(f"\n[RUN] {cmd}")
    result = subprocess.run(cmd, shell=True, text=True)
    if check and result.returncode != 0:
        raise RuntimeError(f"Command failed with exit code {result.returncode}: {cmd}")


print("=" * 100)
print("VOID — ADVANCED GOOGLE COLAB TUTORIAL")
print("=" * 100)


try:
    import torch
    gpu_name = torch.cuda.get_device_name(0) if torch.cuda.is_available() else "CPU"
    print(f"PyTorch already available. CUDA: {torch.cuda.is_available()} | Device: {gpu_name}")
except Exception:
    run(f"{sys.executable} -m pip install -q torch torchvision torchaudio")
    import torch
    gpu_name = torch.cuda.get_device_name(0) if torch.cuda.is_available() else "CPU"
    print(f"CUDA: {torch.cuda.is_available()} | Device: {gpu_name}")


if not torch.cuda.is_available():
    raise RuntimeError("This tutorial needs a GPU runtime. In Colab, go to Runtime > Change runtime type > GPU.")


print("\nThis repo is heavy. The official notebook notes 40GB+ VRAM is recommended.")
print("A100 works best. T4/L4 may fail or be extremely slow even with CPU offload.\n")


HF_TOKEN = getpass("Enter your Hugging Face token (input hidden, press Enter if already logged in): ").strip()
OPENAI_API_KEY = getpass("Enter your OpenAI API key for OPTIONAL prompt assistance (press Enter to skip): ").strip()


run(f"{sys.executable} -m pip install -q --upgrade pip")
run(f"{sys.executable} -m pip install -q huggingface_hub hf_transfer")
run("apt-get -qq update && apt-get -qq install -y ffmpeg git")
run("rm -rf /content/void-model")
run("git clone https://github.com/Netflix/void-model.git /content/void-model")
os.chdir("/content/void-model")


if HF_TOKEN:
   os.environ["HF_TOKEN"] = HF_TOKEN
   os.environ["HUGGINGFACE_HUB_TOKEN"] = HF_TOKEN


os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"


run(f"{sys.executable} -m pip install -q -r requirements.txt")


if OPENAI_API_KEY:
    run(f"{sys.executable} -m pip install -q openai")
    os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY


from huggingface_hub import snapshot_download, hf_hub_download

We set up the full Colab environment and prepare the system for running the VOID pipeline. We install the required tools, check whether GPU support is available, securely collect the Hugging Face and optional OpenAI API keys, and clone the official repository into the Colab workspace. We also configure environment variables and install project dependencies so the rest of the workflow can run smoothly without manual setup later.
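Since the notebook warns that 40GB+ VRAM is recommended, it can help to decide up front how aggressively to offload before loading anything. Below is an illustrative sketch only; the thresholds and strategy names are our assumptions, not part of the official repo:

```python
def choose_offload_strategy(vram_gb: float) -> str:
    """Pick a loading strategy from the detected VRAM.

    Thresholds are illustrative: 40GB+ matches the official guidance
    for running fully on GPU; anything smaller falls back to offload.
    """
    if vram_gb >= 40:
        return "full_gpu"            # e.g. A100
    if vram_gb >= 16:
        return "model_cpu_offload"   # e.g. L4; may still be slow
    return "sequential_cpu_offload"  # e.g. T4 and below


# On a CUDA machine, VRAM could be read with:
#   vram_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
print(choose_offload_strategy(80))  # full_gpu
```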

print("\nDownloading base CogVideoX inpainting model...")
snapshot_download(
    repo_id="alibaba-pai/CogVideoX-Fun-V1.5-5b-InP",
    local_dir="./CogVideoX-Fun-V1.5-5b-InP",
    token=HF_TOKEN if HF_TOKEN else None,
    local_dir_use_symlinks=False,
    resume_download=True,
)


print("\nDownloading VOID Pass 1 checkpoint...")
hf_hub_download(
    repo_id="netflix/void-model",
    filename="void_pass1.safetensors",
    local_dir=".",
    token=HF_TOKEN if HF_TOKEN else None,
    local_dir_use_symlinks=False,
)


sample_options = ["lime", "moving_ball", "pillow"]
print(f"\nAvailable built-in samples: {sample_options}")
sample_name = input("Choose a sample [lime/moving_ball/pillow] (default: lime): ").strip() or "lime"
if sample_name not in sample_options:
    print("Invalid sample chosen. Falling back to 'lime'.")
    sample_name = "lime"


use_openai_prompt_helper = False
custom_bg_prompt = None


if OPENAI_API_KEY:
    ans = input("\nUse OpenAI to generate an alternate background prompt for the chosen sample? [y/N]: ").strip().lower()
    use_openai_prompt_helper = ans == "y"

We download the base CogVideoX inpainting model and the VOID Pass 1 checkpoint required for inference. We then present the available built-in sample options and let ourselves choose which sample video we want to process. We also initialize the optional prompt-helper flow to decide whether to generate a refined background prompt with OpenAI.

if use_openai_prompt_helper:
    from openai import OpenAI
    client = OpenAI(api_key=OPENAI_API_KEY)


    sample_context = {
        "lime": {
            "removed_object": "the glass",
            "scene_hint": "A lime falls on the table."
        },
        "moving_ball": {
            "removed_object": "the rubber duckie",
            "scene_hint": "A ball rolls off the table."
        },
        "pillow": {
            "removed_object": "the kettlebell being placed on the pillow",
            "scene_hint": "Two pillows are on the table."
        },
    }


    helper_prompt = f"""
You are helping prepare a clean background prompt for a video object removal model.


Rules:
- Describe only what should remain in the scene after removing the target object/action.
- Do not mention removal, deletion, masks, editing, or inpainting.
- Keep it short, concrete, and physically plausible.
- Return just one sentence.


Sample name: {sample_name}
Target being removed: {sample_context[sample_name]['removed_object']}
Known scene hint from the repo: {sample_context[sample_name]['scene_hint']}
"""
    try:
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            temperature=0.2,
            messages=[
                {"role": "system", "content": "You write short, precise scene descriptions for video generation pipelines."},
                {"role": "user", "content": helper_prompt},
            ],
        )
        custom_bg_prompt = response.choices[0].message.content.strip()
        print(f"\nOpenAI-generated background prompt:\n{custom_bg_prompt}\n")
    except Exception as e:
        print(f"OpenAI prompt helper failed: {e}")
        custom_bg_prompt = None


prompt_json_path = Path(f"./sample/{sample_name}/prompt.json")
if custom_bg_prompt:
    backup_path = prompt_json_path.with_suffix(".json.bak")
    if not backup_path.exists():
        shutil.copy(prompt_json_path, backup_path)
    with open(prompt_json_path, "w") as f:
        json.dump({"bg": custom_bg_prompt}, f)
    print(f"Updated prompt.json for sample '{sample_name}'.")

We use the optional OpenAI prompt helper to generate a cleaner and more focused background description for the chosen sample. We define the scene context, send it to the model, capture the generated prompt, and then update the sample's prompt.json file when a custom prompt is available. This lets us make the pipeline a bit more flexible while still keeping the original sample structure intact.

import numpy as np
import torch.nn.functional as F
from safetensors.torch import load_file
from diffusers import DDIMScheduler
from IPython.display import Video, display


from videox_fun.models import (
    AutoencoderKLCogVideoX,
    CogVideoXTransformer3DModel,
    T5EncoderModel,
    T5Tokenizer,
)
from videox_fun.pipeline import CogVideoXFunInpaintPipeline
from videox_fun.utils.fp8_optimization import convert_weight_dtype_wrapper
from videox_fun.utils.utils import get_video_mask_input, save_videos_grid, save_inout_row


BASE_MODEL_PATH = "./CogVideoX-Fun-V1.5-5b-InP"
TRANSFORMER_CKPT = "./void_pass1.safetensors"
DATA_ROOTDIR = "./sample"
SAMPLE_NAME = sample_name


SAMPLE_SIZE = (384, 672)
MAX_VIDEO_LENGTH = 197
TEMPORAL_WINDOW_SIZE = 85
NUM_INFERENCE_STEPS = 50
GUIDANCE_SCALE = 1.0
SEED = 42
DEVICE = "cuda"
WEIGHT_DTYPE = torch.bfloat16


print("\nLoading VAE...")
vae = AutoencoderKLCogVideoX.from_pretrained(
   BASE_MODEL_PATH,
   subfolder="vae",
).to(WEIGHT_DTYPE)


video_length = int(
   (MAX_VIDEO_LENGTH - 1) // vae.config.temporal_compression_ratio * vae.config.temporal_compression_ratio
) + 1
print(f"Effective video length: {video_length}")


print("\nLoading base transformer...")
transformer = CogVideoXTransformer3DModel.from_pretrained(
   BASE_MODEL_PATH,
   subfolder="transformer",
   low_cpu_mem_usage=True,
   use_vae_mask=True,
).to(WEIGHT_DTYPE)

We import the deep learning, diffusion, video display, and VOID-specific modules required for inference. We define key configuration values, such as model paths, sample dimensions, video length, inference steps, seed, device, and data type, and then load the VAE and base transformer components. This section presents the core model objects that underpin the inpainting pipeline.
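The effective video length computed above snaps the frame budget onto the VAE's temporal compression grid: (n - 1) must be divisible by the compression ratio. As a standalone sketch (the ratio of 4 in the example is an assumption for illustration; the real value comes from vae.config.temporal_compression_ratio):

```python
def effective_video_length(max_frames: int, compression_ratio: int) -> int:
    """Largest n <= max_frames such that (n - 1) is divisible by compression_ratio."""
    return (max_frames - 1) // compression_ratio * compression_ratio + 1


print(effective_video_length(197, 4))  # 197: already on the grid (196 = 49 * 4)
print(effective_video_length(200, 4))  # 197: snapped down to the grid
```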

print(f"Loading VOID checkpoint from {TRANSFORMER_CKPT} ...")
state_dict = load_file(TRANSFORMER_CKPT)


param_name = "patch_embed.proj.weight"
if state_dict[param_name].size(1) != transformer.state_dict()[param_name].size(1):
    latent_ch, feat_scale = 16, 8
    feat_dim = latent_ch * feat_scale
    new_weight = transformer.state_dict()[param_name].clone()
    new_weight[:, :feat_dim] = state_dict[param_name][:, :feat_dim]
    new_weight[:, -feat_dim:] = state_dict[param_name][:, -feat_dim:]
    state_dict[param_name] = new_weight
    print(f"Adapted {param_name} channels for VAE mask.")


missing_keys, unexpected_keys = transformer.load_state_dict(state_dict, strict=False)
print(f"Missing keys: {len(missing_keys)}, Unexpected keys: {len(unexpected_keys)}")


print("\nLoading tokenizer, text encoder, and scheduler...")
tokenizer = T5Tokenizer.from_pretrained(BASE_MODEL_PATH, subfolder="tokenizer")
text_encoder = T5EncoderModel.from_pretrained(
   BASE_MODEL_PATH,
   subfolder="text_encoder",
   torch_dtype=WEIGHT_DTYPE,
)
scheduler = DDIMScheduler.from_pretrained(BASE_MODEL_PATH, subfolder="scheduler")


print("\nBuilding pipeline...")
pipe = CogVideoXFunInpaintPipeline(
   tokenizer=tokenizer,
   text_encoder=text_encoder,
   vae=vae,
   transformer=transformer,
   scheduler=scheduler,
)


convert_weight_dtype_wrapper(pipe.transformer, WEIGHT_DTYPE)
pipe.enable_model_cpu_offload(device=DEVICE)
generator = torch.Generator(device=DEVICE).manual_seed(SEED)


print("\nPreparing sample input...")
input_video, input_video_mask, prompt, _ = get_video_mask_input(
   SAMPLE_NAME,
   sample_size=SAMPLE_SIZE,
   keep_fg_ids=[-1],
   max_video_length=video_length,
   temporal_window_size=TEMPORAL_WINDOW_SIZE,
   data_rootdir=DATA_ROOTDIR,
   use_quadmask=True,
   dilate_width=11,
)


negative_prompt = (
    "Watermark present in every frame. The background is solid. "
    "Strange body and strange trajectory. Distortion."
)


print(f"\nPrompt: {prompt}")
print(f"Input video tensor shape: {tuple(input_video.shape)}")
print(f"Mask video tensor shape: {tuple(input_video_mask.shape)}")


print("\nDisplaying input video...")
input_video_path = os.path.join(DATA_ROOTDIR, SAMPLE_NAME, "input_video.mp4")
display(Video(input_video_path, embed=True, width=672))

We load the VOID checkpoint, align the transformer weights when needed, and initialize the tokenizer, text encoder, scheduler, and final inpainting pipeline. We then enable CPU offloading, seed the generator for reproducibility, and prepare the input video, mask video, and prompt from the chosen sample. By the end of this section, we have everything ready for actual inference, including the negative prompt and the input video preview.
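The strict=False load used for the checkpoint silently tolerates mismatched keys; the two lists it returns are simply set differences between the model's parameter names and the checkpoint's. A pure-Python sketch of that bookkeeping (the function is our own illustration, not part of PyTorch):

```python
def split_state_dict_keys(model_keys, ckpt_keys):
    """Reproduce what strict=False load_state_dict reports:
    'missing'    = names in the model but absent from the checkpoint,
    'unexpected' = names in the checkpoint but absent from the model."""
    model_keys, ckpt_keys = set(model_keys), set(ckpt_keys)
    return sorted(model_keys - ckpt_keys), sorted(ckpt_keys - model_keys)


missing, unexpected = split_state_dict_keys(
    ["patch_embed.proj.weight", "norm.weight"],
    ["patch_embed.proj.weight", "extra.bias"],
)
print(missing)     # ['norm.weight']
print(unexpected)  # ['extra.bias']
```

A large count in either list after loading usually signals a wrong checkpoint or base model rather than a harmless mismatch.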

print("\nRunning VOID Pass 1 inference...")
with torch.no_grad():
    sample = pipe(
        prompt,
        num_frames=TEMPORAL_WINDOW_SIZE,
        negative_prompt=negative_prompt,
        height=SAMPLE_SIZE[0],
        width=SAMPLE_SIZE[1],
        generator=generator,
        guidance_scale=GUIDANCE_SCALE,
        num_inference_steps=NUM_INFERENCE_STEPS,
        video=input_video,
        mask_video=input_video_mask,
        strength=1.0,
        use_trimask=True,
        use_vae_mask=True,
    ).videos


print(f"Output shape: {tuple(sample.shape)}")


output_dir = Path("/content/void_outputs")
output_dir.mkdir(parents=True, exist_ok=True)


output_path = str(output_dir / f"{SAMPLE_NAME}_void_pass1.mp4")
comparison_path = str(output_dir / f"{SAMPLE_NAME}_comparison.mp4")


print("\nSaving output video...")
save_videos_grid(sample, output_path, fps=12)


print("Saving side-by-side comparison...")
save_inout_row(input_video, input_video_mask, sample, comparison_path, fps=12)


print(f"\nSaved output to: {output_path}")
print(f"Saved comparison to: {comparison_path}")


print("\nDisplaying generated result...")
display(Video(output_path, embed=True, width=672))


print("\nDisplaying comparison (input | mask | output)...")
display(Video(comparison_path, embed=True, width=1344))


print("\nDone.")

We run the actual VOID Pass 1 inference on the chosen sample using the prepared prompt, mask, and model pipeline. We save the generated output video and also create a side-by-side comparison video so we can examine the input, mask, and final result together. We display the generated videos directly in Colab, which helps us verify that the full video object-removal workflow works end to end.
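After a run, Colab sessions often need memory reclaimed before a second experiment fits on the GPU. A hedged cleanup sketch (the function is ours; torch.cuda.empty_cache() only has an effect on a CUDA machine):

```python
import gc


def release_gpu_memory(namespace, names):
    """Drop named objects (e.g. the pipeline) from a namespace such as
    globals(), run garbage collection, and empty the CUDA cache if possible."""
    for name in names:
        namespace.pop(name, None)
    gc.collect()
    try:
        import torch
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
    except Exception:
        pass


# Example: release_gpu_memory(globals(), ["pipe", "transformer", "vae"])
```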

In conclusion, we created a complete, Colab-ready implementation of the VOID model and ran an end-to-end video inpainting workflow within a single, streamlined pipeline. We went beyond basic setup by handling model downloads, prompt preparation, checkpoint loading, mask-aware inference, and output visualization in a way that is practical for experimentation and adaptation. We also saw how the different model components come together to remove objects from video while preserving the surrounding scene as naturally as possible. In the end, we successfully ran the official sample and built a solid working foundation that helps us extend the pipeline for custom videos, prompts, and more advanced research use cases.

