Avatar Models
Choose between v1 and v2 for avatar generation
Two avatar models, v1 and v2, are available for both companions and digital twins. Both produce real-time video avatars. They differ in the reference input they take and in the quality of the result.
| | v1 | v2 |
|---|---|---|
| Input | Still image (pictureUrl) | Video of the person speaking (videoUrl) |
| Speed | Fast generation | Slower generation (can take up to several hours) |
| Quality | Good quality, strong photorealism | Highest quality |
| Lip sync | Good | Best accuracy |
| Expressions | Standard | Natural facial expressions |
| Facial hair | Standard | Better results |
Set the version field to "v1" or "v2" when creating a companion or digital twin. If omitted, it defaults to "v1".
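The pairing of version and input field can be sketched in Python as follows. This is a minimal sketch, not the platform's client: `build_avatar_fields` is a hypothetical helper, and it only assembles the avatar-related fields shown on this page (`pictureUrl`, `videoUrl`, `version`). How the API treats a mismatched combination, such as `pictureUrl` with "v2", is not documented here, so the helper simply refuses to build one.

```python
from typing import Optional

ALLOWED_VERSIONS = ("v1", "v2")


def build_avatar_fields(
    picture_url: Optional[str] = None,
    video_url: Optional[str] = None,
    version: Optional[str] = None,
) -> dict:
    """Assemble the avatar fields of a create request (hypothetical helper)."""
    # Per the docs, an omitted version defaults to "v1".
    effective = version if version is not None else "v1"
    if effective not in ALLOWED_VERSIONS:
        raise ValueError(f"version must be one of {ALLOWED_VERSIONS}, got {effective!r}")

    fields: dict = {}
    if effective == "v1":
        # v1 takes a still image.
        if not picture_url:
            raise ValueError("v1 needs a pictureUrl (still image)")
        fields["pictureUrl"] = picture_url
    else:
        # v2 takes a video of the person speaking.
        if not video_url:
            raise ValueError("v2 needs a videoUrl (video of the person speaking)")
        fields["videoUrl"] = video_url

    # Send "version" only when the caller set it; otherwise rely on the default.
    if version is not None:
        fields["version"] = version
    return fields
```

Calling `build_avatar_fields(picture_url=...)` with no version yields a body with only `pictureUrl`, which relies on the documented "v1" default.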
v1 — Image input
Provide a pictureUrl: a publicly reachable HTTPS URL of a reference image. For best results, use a well-lit 16:9 image with the person looking into the camera and framed from the waist up.
{
  "pictureUrl": "https://example.com/images/avatar.png"
}
v1 is fast and produces strong photorealism. It works well for quick iteration, internal testing, and use cases where speed matters more than pixel-perfect accuracy.
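Two of the image recommendations above, an HTTPS URL and a 16:9 aspect ratio, can be checked locally before submitting. The sketch below is illustrative only: the function names and the 0.02 tolerance are assumptions, and it cannot confirm that the URL is publicly reachable or that the framing and lighting are good.

```python
from urllib.parse import urlparse


def is_https_url(url: str) -> bool:
    """True if the URL uses the https scheme and names a host."""
    parsed = urlparse(url)
    return parsed.scheme == "https" and bool(parsed.netloc)


def is_roughly_16_9(width: int, height: int, tolerance: float = 0.02) -> bool:
    """True if width/height is within `tolerance` of 16:9 (tolerance is an assumed value)."""
    if width <= 0 or height <= 0:
        return False
    return abs(width / height - 16 / 9) <= tolerance
```

For example, a 1920x1080 image passes the ratio check, while a square 1024x1024 image does not.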
v2 — Video input
Provide a videoUrl — a publicly reachable HTTPS URL of a reference video showing the person speaking.
{
  "videoUrl": "https://example.com/videos/person-speaking.mp4",
  "version": "v2"
}
Why v2 produces better results
A still image gives the system a single frame to work from — it has to guess how the person's face moves when speaking. A video gives it the real thing. The system learns the person's unique mouth movements, facial micro-expressions, head motion patterns, and how their features shift during natural speech. This means:
- Lip sync stays accurate even during fast speech, complex words, and language switching
- Expressions feel natural — subtle brow raises, eye movements, and smiles that match the tone of what's being said
- Facial hair renders cleanly — beards, mustaches, and stubble that move naturally with the jaw instead of clipping or blurring
When to choose v2
Use v2 when the avatar represents someone your users will interact with directly — customer-facing support agents, executive avatars, brand representatives, sales leads, or any use case where the realism of the avatar affects trust and engagement.
Tips for a great reference video
The quality of the output depends heavily on the quality of the input. For the best v2 results:
- Duration: 1–5 minutes of continuous natural speech
- Content: The person should speak naturally, not read a script — conversational tone produces better results than stiff delivery
- Framing: Head and shoulders, centered, with some space above the head
- Lighting: Even, front-facing light. Avoid harsh shadows, backlighting, or overhead fluorescents
- Background: Clean and uncluttered. A plain wall or blurred background works best
- Camera: Eye-level, stable (tripod or resting on a surface). Avoid handheld footage
- Audio: Clear audio helps the system align lip movements — use a quiet room
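The checklist above can be turned into a simple pre-submission review. This is a sketch under stated assumptions: `ReferenceVideo` and `review_reference_video` are hypothetical names, the flags are self-reported by whoever recorded the video, and the tips are recommendations rather than limits the platform is known to enforce. Only the duration range (1 to 5 minutes, so 60 to 300 seconds) comes directly from the list.

```python
from dataclasses import dataclass
from typing import List

MIN_SECONDS = 60   # 1 minute
MAX_SECONDS = 300  # 5 minutes


@dataclass
class ReferenceVideo:
    duration_seconds: float
    scripted: bool    # True if the person reads from a script
    handheld: bool    # True if the footage is handheld
    quiet_room: bool  # True if recorded in a quiet room


def review_reference_video(video: ReferenceVideo) -> List[str]:
    """Return a warning for each tip the video does not follow."""
    warnings: List[str] = []
    if not MIN_SECONDS <= video.duration_seconds <= MAX_SECONDS:
        warnings.append("duration: aim for 1-5 minutes of continuous natural speech")
    if video.scripted:
        warnings.append("content: conversational speech works better than a script")
    if video.handheld:
        warnings.append("camera: use a tripod or rest the camera on a surface")
    if not video.quiet_room:
        warnings.append("audio: record in a quiet room for clear audio")
    return warnings
```

An empty list means the video follows every tip the sketch covers. Framing, lighting, and background still need a visual check.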
v2 generation can take up to several hours. Plan accordingly — create the avatar ahead of time, not right before you need it.