InfiniteTalk ComfyUI Guide
InfiniteTalk is an audio-driven video generation tool for creating talking avatar videos. It follows the MultiTalk family of methods and focuses on speech-driven motion: clear lip syncing with natural head and body movement. A notable capability is long-duration generation. You can extend beyond short clips into minutes, provided your hardware has enough RAM and VRAM. InfiniteTalk supports both image-to-video and video-to-video.
What You Will Build
You will run InfiniteTalk inside ComfyUI through custom nodes based on a Wan video wrapper. The workflow loads a base image-to-video model, the InfiniteTalk model weights, your image or video, and your audio. It then generates a talking portrait that follows the voice track.
Key Idea
Audio drives motion. InfiniteTalk synchronizes lip shapes and coordinates head and upper‑body motion with the speech rhythm. The pipeline supports long output by processing the video in chunks with overlap, which helps preserve continuity across segments.
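To make the chunking idea concrete, here is a minimal sketch in plain Python (not the actual InfiniteTalk code) that splits a long frame range into overlapping windows; the chunk length and overlap are placeholder values in line with the settings discussed later in this guide.

```python
# Minimal sketch of overlapped chunking (not the actual InfiniteTalk code).
# The real pipeline also conditions each chunk on the previous one so that
# identity and motion carry across the seam.
def plan_chunks(total_frames: int, chunk_len: int = 81, overlap: int = 25):
    """Return (start, end) frame indices covering total_frames with overlap."""
    chunks, start = [], 0
    while start < total_frames:
        end = min(start + chunk_len, total_frames)
        chunks.append((start, end))
        if end == total_frames:
            break
        start = end - overlap  # the next chunk re-generates the overlapped frames
    return chunks

# 250 frames -> [(0, 81), (56, 137), (112, 193), (168, 249), (224, 250)]
print(plan_chunks(250))
```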
Before You Start
- Update the Wan video wrapper custom nodes in ComfyUI to a recent version that includes InfiniteTalk support.
- Download the InfiniteTalk model files prepared for ComfyUI. There are two variants: Single (one speaking subject) and Multi (multiple subjects).
- Place the InfiniteTalk weights in your ComfyUI models directory (for example, under a diffusion models subfolder).
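As an illustration only, a typical layout might look like the tree below; the subfolder and file names are placeholders, so check the node pack's documentation for the exact locations and filenames it expects.

```
ComfyUI/
└── models/
    └── diffusion_models/                      # subfolder name depends on your setup
        ├── <wan-i2v-base>.safetensors         # base image-to-video model
        └── <infinitetalk-single>.safetensors  # InfiniteTalk Single (or Multi) weights
```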
Model Choices
- InfiniteTalk Single: for a single talking avatar.
- InfiniteTalk Multi: for multiple speaking subjects with separate audio inputs and masks.
- Base image‑to‑video model: e.g., Wan image‑to‑video, often used at 480p for reliability; 720p is available on stronger hardware.
Recommended Workflow in ComfyUI
- Load the base image‑to‑video model and VAE, along with text and audio encoders if required by your nodes.
- Load the InfiniteTalk Single model for a single speaker scenario.
- Provide inputs:
- Image‑to‑video: a single portrait image and a clean speech audio file.
- Video‑to‑video: a short reference video and the new speech audio.
- Set generation size to 480p initially. Increase to 720p after validating your setup.
- Pick sampling steps (for example, 4–8 with acceleration methods or 40 for standard runs).
- Choose guidance scales:
- Without LoRA: text ≈ 5, audio ≈ 4.
- With LoRA: text ≈ 1, audio ≈ 2.
- Enable chunk overlap (for example, ~25 frames of overlap for segments around ~80 frames) to improve continuity; these starting values are gathered in the sketch after this list.
- Render a short preview. If lip timing or motion needs improvement, adjust steps or guidance values and preview again.
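As a compact reference, the hypothetical helper below collects the starting values from the list above. The keys are not actual ComfyUI node inputs, only a summary of what to enter in your sampler and InfiniteTalk nodes.

```python
# Hypothetical summary of the starting values above; the keys are not real
# ComfyUI node inputs, just a compact reference for what to type into them.
def starting_values(accel_lora: bool) -> dict:
    return {
        "resolution": (832, 480),              # one common 480p size; match your source aspect
        "steps": 6 if accel_lora else 40,      # ~4-8 with acceleration, ~40 standard
        "text_guidance": 1.0 if accel_lora else 5.0,
        "audio_guidance": 2.0 if accel_lora else 4.0,
        "chunk_frames": 81,                    # segments of roughly 80 frames
        "chunk_overlap": 25,                   # overlap between segments for continuity
    }

print(starting_values(accel_lora=True))
```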
Observed Behavior and Tuning Notes
- Image‑to‑video is a good baseline for evaluating lip syncing and head motion. Start at 480p and moderate steps.
- For very long videos, rely on chunking with overlap to connect segments smoothly and reduce motion jumps.
- If eye blinking appears too frequent, modest FPS interpolation can help smooth motion (illustrated in the sketch after this list).
- Video‑to‑video can mimic the overall feel of the source camera movement. For strict camera behavior on short clips, a stronger editing mode can help, but may introduce color shift; prefer standard mode for longer content.
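Purely to illustrate what frame interpolation does, here is a naive 2x blend in NumPy; a real workflow would use a dedicated interpolation node (for example one based on RIFE) rather than simple averaging.

```python
import numpy as np

def double_fps(frames: list[np.ndarray]) -> list[np.ndarray]:
    """Naive 2x frame interpolation by averaging neighbouring frames.

    Only meant to show the idea; dedicated nodes (e.g. RIFE-based) produce
    far better motion than simple blending.
    """
    out = []
    for a, b in zip(frames[:-1], frames[1:]):
        out.append(a)
        mid = ((a.astype(np.float32) + b.astype(np.float32)) / 2).astype(a.dtype)
        out.append(mid)
    out.append(frames[-1])
    return out

# Example: 81 frames at 25 fps become 161 frames, playable at ~50 fps.
```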
Example Use Case Walkthrough
Suppose you want to produce a ~50 second talking avatar based on a prepared script. Provide a clear voice sample as the audio input and a single well‑lit portrait as the image input. Generate the audio via a text‑to‑speech node if needed, then compute its duration so the frame count in the sampler stays in sync with the audio length. Use 480p, 4 steps with an acceleration method or around 40 steps in a standard pipeline. Keep overlap between chunks and monitor the preview. If timing appears slightly behind, increase audio guidance within the suggested range and try again. Once satisfied, export the result and optionally apply frame interpolation to increase FPS and reduce flicker.
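Here is a rough sketch of that duration math using only the Python standard library; the 25 fps output rate and the 81/25 chunk sizes are assumptions carried over from the settings above, so verify them against your model and nodes.

```python
import math
import wave

FPS = 25            # assumed output frame rate; check your model's native rate
CHUNK_FRAMES = 81   # ~80-frame segments, as suggested above
OVERLAP = 25        # overlapped frames between consecutive segments

def audio_duration_s(path: str) -> float:
    """Length of a WAV file in seconds (standard library only)."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()

def video_plan(path: str):
    seconds = audio_duration_s(path)
    total_frames = round(seconds * FPS)
    # Each chunk after the first re-generates OVERLAP frames, so it only
    # advances the timeline by CHUNK_FRAMES - OVERLAP new frames.
    extra = max(0, total_frames - CHUNK_FRAMES)
    num_chunks = 1 + math.ceil(extra / (CHUNK_FRAMES - OVERLAP))
    return seconds, total_frames, num_chunks

# ~50 s of speech at 25 fps -> about 1250 frames, roughly 22 chunks
print(video_plan("speech.wav"))
```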
Hardware Tips
- For low VRAM, set the number of persistent (non‑offloaded) DiT parameters to zero where the node supports it, and prefer 480p.
- For faster throughput, consider multi‑GPU execution. Quantized variants can reduce memory pressure for single‑GPU runs.
- Always validate with a short preview before long renders.
Summary
InfiniteTalk in ComfyUI offers a practical path to speech‑aligned portrait animation. Start simple, tune guidance and steps, keep chunk overlap for continuity, and scale up once previews look right. The Single model is ideal for one speaker; the Multi model extends the workflow to multiple subjects and audio tracks.