Avatar Models

Two avatar models are available for both companions and digital twins. Both produce real-time video avatars. They differ in the reference input, the generation time, and the resulting quality.

Feature	v1	v2
Input	Still image (`pictureUrl`)	Video of the person speaking (`videoUrl`)
Speed	Fast generation	Slower generation (can take up to several hours)
Quality	Good quality, strong photorealism	Highest quality
Lip sync	Good	Best accuracy
Expressions	Standard	Natural facial expressions
Facial hair	Standard	Better results

Set the `version` field to "v1" or "v2" when creating a companion or digital twin. If the field is omitted, it defaults to "v1".

v1 — Image input

Provide a `pictureUrl`: a publicly reachable HTTPS URL of a reference image. For best results, use a well-lit 16:9 image in which the person looks into the camera and is framed from the waist up.

{
  "pictureUrl": "https://example.com/images/avatar.png"
}

v1 is fast and produces strong photorealism. It works well for quick iteration, internal testing, and use cases where speed matters more than the most accurate lip sync and the most natural expressions.
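The 16:9 recommendation can be checked before uploading a reference image. The following is a minimal sketch, not part of the API: the function name `is_16_by_9` and the 1% tolerance are assumptions chosen for illustration, and the pixel dimensions are expected to come from whatever image tool you already use.

```python
def is_16_by_9(width, height, tolerance=0.01):
    """Return True if width:height is within `tolerance` (relative) of 16:9.

    Hypothetical helper: the 1% default tolerance is an assumption,
    not a documented limit.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    target = 16 / 9
    # Compare the relative deviation so the check is resolution-independent.
    return abs(width / height - target) / target <= tolerance
```

A 1920x1080 image passes this check, while a square or portrait image does not. The other recommendations (eye contact, framing, lighting) still need a visual review.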

v2 — Video input

Provide a `videoUrl`: a publicly reachable HTTPS URL of a reference video that shows the person speaking.

{
  "videoUrl": "https://example.com/videos/person-speaking.mp4",
  "version": "v2"
}
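The two request fragments above differ only in which URL field is sent and in the `version` value. As a rough sketch, a small helper can build either one. The function name `build_avatar_payload` is hypothetical, and the sketch assumes that v1 is always paired with `pictureUrl` and v2 with `videoUrl`, as in the comparison table. It only checks that the URL is an absolute HTTPS URL; it cannot verify that the URL is publicly reachable.

```python
from urllib.parse import urlparse


def build_avatar_payload(reference_url, version="v1"):
    """Return a JSON-ready dict for a v1 (image) or v2 (video) avatar.

    Hypothetical helper: the field names come from the examples above,
    everything else is an illustrative assumption.
    """
    if version not in ("v1", "v2"):
        raise ValueError(f"version must be 'v1' or 'v2', got {version!r}")

    # Both models expect an HTTPS reference URL.
    parsed = urlparse(reference_url)
    if parsed.scheme != "https" or not parsed.netloc:
        raise ValueError("reference URL must be an absolute HTTPS URL")

    # v1 takes a still image, v2 takes a video of the person speaking.
    url_field = "pictureUrl" if version == "v1" else "videoUrl"
    # The version is sent explicitly instead of relying on the "v1" default.
    return {url_field: reference_url, "version": version}
```

For example, `build_avatar_payload("https://example.com/videos/person-speaking.mp4", version="v2")` returns the same fields as the v2 fragment above. Any other fields that the create request needs are outside the scope of this sketch.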

Why v2 produces better results

A still image gives the system a single frame to work from, so it has to infer how the person's face moves when speaking. A video shows that motion directly. The system learns the person's unique mouth movements, facial micro-expressions, head motion patterns, and how their features shift during natural speech. As a result:

Lip sync stays accurate even during fast speech, complex words, and language switching
Expressions feel natural — subtle brow raises, eye movements, and smiles that match the tone of what's being said
Facial hair renders cleanly — beards, mustaches, and stubble that move naturally with the jaw instead of clipping or blurring

When to choose v2

Use v2 when the avatar represents someone your users will interact with directly — customer-facing support agents, executive avatars, brand representatives, sales leads, or any use case where the realism of the avatar affects trust and engagement.

Tips for a great reference video

The quality of the output depends heavily on the quality of the input. For the best v2 results:

Duration: 1–5 minutes of continuous natural speech
Content: The person should speak naturally, not read a script — conversational tone produces better results than stiff delivery
Framing: Head and shoulders, centered, with some space above the head
Lighting: Even, front-facing light. Avoid harsh shadows, backlighting, or overhead fluorescents
Background: Clean and uncluttered. A plain wall or blurred background works best
Camera: Eye-level, stable (tripod or resting on a surface). Avoid handheld footage
Audio: Clear audio helps the system align lip movements — use a quiet room
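Of the tips above, only the duration is a hard number, so it is the easiest one to check automatically. A minimal sketch follows, assuming you already know the clip length in seconds. The name `check_reference_duration` is hypothetical, and the 60 and 300 second bounds are simply the 1–5 minute recommendation converted to seconds. Whether the service rejects clips outside that range is not stated here, so the function returns a warning instead of raising an error.

```python
MIN_SECONDS = 60   # 1 minute, lower bound of the recommended duration
MAX_SECONDS = 300  # 5 minutes, upper bound of the recommended duration


def check_reference_duration(duration_seconds):
    """Return a warning string, or None if the clip is 1 to 5 minutes long."""
    if duration_seconds < MIN_SECONDS:
        return f"Video is {duration_seconds:.0f}s; record at least {MIN_SECONDS}s."
    if duration_seconds > MAX_SECONDS:
        return f"Video is {duration_seconds:.0f}s; trim to at most {MAX_SECONDS}s."
    return None
```

Framing, lighting, background, camera stability, and audio quality still need a manual review of the footage.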

v2 generation can take up to several hours. Plan accordingly — create the avatar ahead of time, not right before you need it.
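Because generation can run for hours, any code that waits for a v2 avatar needs a long timeout and a slow polling interval. The sketch below is generic and deliberately does not call any API: how readiness is reported is not described in this section, so `is_ready` is a hypothetical callable that you supply. The 6 hour timeout and 5 minute interval are assumptions, not documented values.

```python
import time


def wait_until_ready(is_ready, timeout_seconds=6 * 60 * 60, poll_seconds=300,
                     sleep=time.sleep, clock=time.monotonic):
    """Poll is_ready() until it returns True or the timeout would be exceeded.

    `is_ready` is a caller-supplied function (hypothetical). `sleep` and
    `clock` are injectable so the loop can be tested without real waiting.
    Returns True if the avatar became ready, False on timeout.
    """
    deadline = clock() + timeout_seconds
    while True:
        if is_ready():
            return True
        # Stop if the next poll would land after the deadline.
        if clock() + poll_seconds > deadline:
            return False
        sleep(poll_seconds)
```

Even with a loop like this, creating the avatar well ahead of time remains the safer plan.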