Avatar Models

Choose between v1 and v2 for avatar generation

Two avatar models are available for both companions and digital twins. Both produce real-time video avatars; they differ in the reference input and in the resulting quality.

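|             | v1                                | v2                                               |
|-------------|-----------------------------------|--------------------------------------------------|
| Input       | Still image (pictureUrl)          | Video of the person speaking (videoUrl)          |
| Speed       | Fast generation                   | Slower generation (can take up to several hours) |
| Quality     | Good quality, strong photorealism | Highest quality                                  |
| Lip sync    | Good                              | Best accuracy                                    |
| Expressions | Standard                          | Natural facial expressions                       |
| Facial hair | Standard                          | Better results                                   |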

Set the version field to "v1" or "v2" when creating a companion or digital twin. If omitted, it defaults to "v1".
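As a rough illustration of how the version field pairs with the reference input, the sketch below assembles the avatar-related fields of a create request. The helper name build_avatar_payload is hypothetical and not part of any SDK. The sketch also assumes that v1 always takes pictureUrl and v2 always takes videoUrl, as the comparison table suggests, and it leaves out every other field a real request may need.

```python
from typing import Optional

VALID_VERSIONS = {"v1", "v2"}


def build_avatar_payload(
    picture_url: Optional[str] = None,
    video_url: Optional[str] = None,
    version: str = "v1",  # mirrors the documented default when the field is omitted
) -> dict:
    """Hypothetical helper: build only the avatar fields of a create request."""
    if version not in VALID_VERSIONS:
        raise ValueError(f'version must be "v1" or "v2", got {version!r}')
    if version == "v1":
        # Assumption: v1 uses a still image as its reference input.
        if not picture_url:
            raise ValueError("v1 needs a pictureUrl (still image)")
        return {"pictureUrl": picture_url, "version": "v1"}
    # Assumption: v2 uses a video of the person speaking.
    if not video_url:
        raise ValueError("v2 needs a videoUrl (video of the person speaking)")
    return {"videoUrl": video_url, "version": "v2"}
```

The helper always writes the version field explicitly. Since the field defaults to "v1" when omitted, leaving it out of a v1 request should be equivalent.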

v1 — Image input

Provide a pictureUrl — a publicly reachable HTTPS URL of a reference image. For best results, use a 16:9 image with the person looking into the camera, framed waist up, and well-lit.

{
  "pictureUrl": "https://example.com/images/avatar.png"
}

v1 is fast and produces strong photorealism. It works well for quick iteration, internal testing, and use cases where speed matters more than pixel-perfect accuracy.
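Because the reference URL has to be HTTPS, a quick client-side check before submitting can catch obvious mistakes. The following is a minimal sketch using Python's standard urllib.parse. The function name is_https_url is an illustrative assumption. It only inspects the URL's scheme and host, so it cannot confirm that the file is actually publicly reachable.

```python
from urllib.parse import urlparse


def is_https_url(url: str) -> bool:
    """Return True if the URL uses the https scheme and names a host.

    This is a syntactic check only; it makes no network request.
    """
    parsed = urlparse(url)
    return parsed.scheme == "https" and bool(parsed.netloc)
```

The same check applies equally to a videoUrl for v2.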

v2 — Video input

Provide a videoUrl — a publicly reachable HTTPS URL of a reference video showing the person speaking.

{
  "videoUrl": "https://example.com/videos/person-speaking.mp4",
  "version": "v2"
}

Why v2 produces better results

A still image gives the system a single frame to work from — it has to guess how the person's face moves when speaking. A video gives it the real thing. The system learns the person's unique mouth movements, facial micro-expressions, head motion patterns, and how their features shift during natural speech. This means:

  • Lip sync stays accurate even during fast speech, complex words, and language switching
  • Expressions feel natural — subtle brow raises, eye movements, and smiles that match the tone of what's being said
  • Facial hair renders cleanly — beards, mustaches, and stubble that move naturally with the jaw instead of clipping or blurring

When to choose v2

Use v2 when the avatar represents someone your users will interact with directly — customer-facing support agents, executive avatars, brand representatives, sales leads, or any use case where the realism of the avatar affects trust and engagement.

Tips for a great reference video

The quality of the output depends heavily on the quality of the input. For the best v2 results:

  • Duration: 1–5 minutes of continuous natural speech
  • Content: The person should speak naturally, not read a script — conversational tone produces better results than stiff delivery
  • Framing: Head and shoulders, centered, with some space above the head
  • Lighting: Even, front-facing light. Avoid harsh shadows, backlighting, or overhead fluorescents
  • Background: Clean and uncluttered. A plain wall or blurred background works best
  • Camera: Eye-level, stable (tripod or resting on a surface). Avoid handheld footage
  • Audio: Clear audio helps the system align lip movements — use a quiet room
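Of the tips above, duration is the one that is easy to check automatically. The sketch below flags videos outside the recommended 1 to 5 minute range. The function name is hypothetical, and the duration is assumed to be already known in seconds (for example from your own media tooling); treating the range as a warning rather than a hard limit is also an assumption.

```python
from typing import List

MIN_SECONDS = 1 * 60  # recommended minimum: 1 minute
MAX_SECONDS = 5 * 60  # recommended maximum: 5 minutes


def check_reference_video_duration(duration_seconds: float) -> List[str]:
    """Return warnings for a reference video outside the recommended length."""
    warnings: List[str] = []
    if duration_seconds < MIN_SECONDS:
        warnings.append("Video is shorter than the recommended 1 minute of speech.")
    elif duration_seconds > MAX_SECONDS:
        warnings.append("Video is longer than the recommended 5 minutes of speech.")
    return warnings
```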

v2 generation can take up to several hours. Plan accordingly — create the avatar ahead of time, not right before you need it.
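Since generation may run for hours, a long-interval polling loop is one way to wait for the result without blocking on a single request. The sketch below is generic and makes no claim about the actual API: fetch_status is a callable you would supply yourself, and the status strings "ready" and "failed", the 5-minute interval, and the 6-hour timeout are all placeholder assumptions.

```python
import time
from typing import Callable


def wait_for_avatar(
    fetch_status: Callable[[], str],  # hypothetical: returns the current status
    poll_interval_s: float = 300.0,   # assumed interval: check every 5 minutes
    timeout_s: float = 6 * 60 * 60,   # assumed upper bound: give up after 6 hours
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Poll until the status is terminal, or raise TimeoutError."""
    deadline = clock() + timeout_s
    while True:
        status = fetch_status()
        # "ready" and "failed" are placeholder names for terminal states.
        if status in ("ready", "failed"):
            return status
        if clock() >= deadline:
            raise TimeoutError("avatar generation did not finish in time")
        sleep(poll_interval_s)
```

The sleep and clock parameters are injectable so the loop can be exercised in tests without real waiting.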
