How to Build a Real-Time AI Avatar Assistant with Streaming APIs
Real-time AI avatars have shifted from research demos to production features. Users now expect to see a digital face respond to them in under a second, with lip-synced speech and contextual conversation. The technical challenge is not whether this is possible (it clearly is), but how to wire together the streaming pieces without introducing perceptible latency.
This guide walks through the architecture of a real-time AI avatar assistant, from the WebSocket connection layer through LLM streaming, text-to-speech synthesis, and lip-sync rendering. We will use concrete code examples, discuss the tradeoffs at each layer, and build something you can actually ship.
Architecture Overview: The Four Streaming Layers
A real-time avatar assistant is essentially four streaming systems chained together, each feeding into the next:
- Input capture – user speech (via browser MediaRecorder or typed text) sent to the server
- LLM inference – the user's message hits a language model that streams tokens back
- Text-to-speech (TTS) – streamed tokens are converted to audio chunks in real time
- Avatar rendering – audio chunks drive lip-sync on a 3D or 2D avatar in the browser
The critical insight is that these layers overlap in time. You do not wait for the LLM to finish generating before starting TTS. You do not wait for TTS to finish before starting lip-sync. Each layer begins processing the moment it receives its first chunk from the layer above. This pipelining is what makes sub-second response times possible even with a 2-3 second total generation time.
Layer 1: The WebSocket Connection
HTTP request-response is the wrong transport for this use case. You need a persistent, bidirectional channel. WebSockets give you exactly that: a single TCP connection where both client and server can push messages at any time.
Here is a minimal server setup using Node.js and the ws library:
import { WebSocketServer } from 'ws';
const wss = new WebSocketServer({ port: 8080 });
wss.on('connection', (ws) => {
  console.log('Client connected');
  ws.on('message', async (data) => {
    const message = JSON.parse(data.toString());
    if (message.type === 'user-input') {
      // Start the streaming pipeline
      await streamAvatarResponse(ws, message.text);
    }
  });
  ws.on('close', () => console.log('Client disconnected'));
});
console.log('Avatar WebSocket server running on ws://localhost:8080');
On the client side, the connection is straightforward:
const ws = new WebSocket('ws://localhost:8080');
ws.onopen = () => {
  ws.send(JSON.stringify({
    type: 'user-input',
    text: 'Tell me about your product features'
  }));
};
ws.onmessage = (event) => {
  const chunk = JSON.parse(event.data);
  handleStreamChunk(chunk);
};
One important production consideration: use a reconnection strategy. WebSocket connections drop. Mobile networks are flaky. Implement exponential backoff with a maximum retry count, and queue unsent messages during disconnection so the user does not lose their input.
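A minimal sketch of that pattern on the client, reusing the handleStreamChunk handler and message shape from above (the retry limits here are arbitrary placeholders):
function connectWithRetry(url, { maxRetries = 8, baseDelayMs = 500 } = {}) {
  let ws;
  let retries = 0;
  const pendingMessages = []; // queued while disconnected
  function open() {
    ws = new WebSocket(url);
    ws.onopen = () => {
      retries = 0;
      // Flush anything the user sent while we were offline
      while (pendingMessages.length) ws.send(pendingMessages.shift());
    };
    ws.onmessage = (event) => handleStreamChunk(JSON.parse(event.data));
    ws.onclose = () => {
      if (retries >= maxRetries) return; // give up and surface an error in the UI
      const delay = baseDelayMs * 2 ** retries++; // exponential backoff
      setTimeout(open, delay);
    };
  }
  open();
  return {
    send(payload) {
      const message = JSON.stringify(payload);
      if (ws.readyState === WebSocket.OPEN) ws.send(message);
      else pendingMessages.push(message); // queue until the socket reopens
    },
  };
}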
Layer 2: Streaming LLM Inference
Most major LLM providers now support streaming responses. OpenAI, Anthropic, Google, and open-source models via vLLM or Ollama all emit tokens as they are generated rather than waiting for the complete response.
Here is how you stream from OpenAI's API, chunk the tokens into sentences, and forward them over the WebSocket. The onSentence callback hands each finished sentence to the orchestration layer (shown later) so TTS can start on it right away:
import OpenAI from 'openai';
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
async function streamLLMResponse(ws, userMessage, systemPrompt, onSentence) {
  const stream = await openai.chat.completions.create({
    model: 'gpt-4o',
    messages: [
      { role: 'system', content: systemPrompt },
      { role: 'user', content: userMessage }
    ],
    stream: true,
  });
  let buffer = '';
  let sentenceCount = 0;
  for await (const chunk of stream) {
    const token = chunk.choices[0]?.delta?.content || '';
    buffer += token;
    // Flush on sentence boundaries for natural TTS chunks
    const sentenceEnd = buffer.match(/[.!?]\s/);
    if (sentenceEnd) {
      const sentence = buffer.substring(0, sentenceEnd.index + 1).trim();
      buffer = buffer.substring(sentenceEnd.index + 2);
      // Hand the finished sentence to the caller so TTS can start on it immediately
      onSentence?.(sentence);
      ws.send(JSON.stringify({
        type: 'llm-sentence',
        text: sentence,
        index: sentenceCount++,
        done: false
      }));
    }
  }
  // Flush remaining buffer
  if (buffer.trim()) {
    onSentence?.(buffer.trim());
    ws.send(JSON.stringify({
      type: 'llm-sentence',
      text: buffer.trim(),
      index: sentenceCount++,
      done: true
    }));
  }
}
The key detail here is sentence-level chunking. Raw token streaming produces fragments that are too small for TTS ("The", " product", " has", " three"). Accumulating into full sentences gives the TTS engine enough context to produce natural prosody while keeping latency low. The first complete sentence typically arrives a few hundred milliseconds after the model's first token.
Choosing Your Sentence Boundary Strategy
Simple regex on punctuation works for English but breaks on abbreviations ("Dr. Smith"), URLs, and decimal numbers. A more robust approach uses a small finite state machine that tracks whether a period is likely a sentence terminator or part of an abbreviation. For most avatar use cases, the simple approach is fine because the LLM's output is conversational, not technical documentation.
You can also chunk on commas for even faster first-byte time, at the cost of slightly less natural TTS output. This is a useful tradeoff for applications where perceived responsiveness matters more than perfect speech quality, like customer support avatars handling high volumes.
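As a lighter-weight alternative to a full state machine, a lookup of common abbreviations before accepting a period as a boundary covers most of the failure cases. This is an illustrative sketch, not a complete sentence tokenizer; it would replace the buffer.match() check in streamLLMResponse:
// Treat ". " as a sentence end unless the word before the period is a known abbreviation
const ABBREVIATIONS = new Set(['dr', 'mr', 'mrs', 'ms', 'prof', 'vs', 'etc', 'e.g', 'i.e']);
function findSentenceBoundary(buffer) {
  const match = /[.!?]\s/.exec(buffer);
  if (!match) return -1;
  if (buffer[match.index] === '.') {
    const lastWord = buffer.slice(0, match.index).split(/\s+/).pop().toLowerCase();
    if (ABBREVIATIONS.has(lastWord)) return -1; // e.g. "Dr. Smith": keep accumulating
  }
  return match.index; // index of the terminator character
}
Returning -1 simply means keep buffering; anything else is treated like the regex match index in the earlier code.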
Layer 3: Real-Time Text-to-Speech
This is where the pipeline gets interesting. You need a TTS system that can accept text chunks and return audio chunks without waiting for the full text. Not all TTS APIs support this.
The three main approaches in 2026:
Option A: Streaming TTS APIs
Services like ElevenLabs, Play.ht, and Cartesia offer streaming endpoints that return audio as PCM or MP3 chunks while still processing. ElevenLabs' streaming API is the most mature:
async function streamTTS(ws, sentence, voiceId, sentenceIndex) {
  const response = await fetch(
    `https://api.elevenlabs.io/v1/text-to-speech/${voiceId}/stream?output_format=pcm_24000`,
    {
      method: 'POST',
      headers: {
        'xi-api-key': process.env.ELEVENLABS_API_KEY,
        'Content-Type': 'application/json',
      },
      body: JSON.stringify({
        text: sentence,
        model_id: 'eleven_turbo_v2_5',
      }),
    }
  );
  if (!response.ok) {
    throw new Error(`TTS request failed: ${response.status}`);
  }
  const reader = response.body.getReader();
  let chunkIndex = 0;
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    // Send raw PCM audio chunk to client
    ws.send(JSON.stringify({
      type: 'audio-chunk',
      audio: Buffer.from(value).toString('base64'),
      sentenceIndex,
      chunkIndex: chunkIndex++,
      format: 'pcm_24000'
    }));
  }
}
Option B: Local TTS with Coqui or Piper
If you want to avoid per-character API costs, open-source TTS models like Piper (fast, lightweight) or Coqui XTTS (higher quality, voice cloning) can run on your own GPU. Piper processes a sentence in 50-100ms on a modern CPU, making it viable for real-time streaming without a GPU. The quality gap has narrowed significantly in the past year.
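If you go this route, one way to slot Piper into the same pipeline is to spawn the CLI per sentence and forward its raw audio to the client with the same audio-chunk message shape used above. This is a sketch under a few assumptions: a locally installed piper binary, a downloaded voice model at the example path, and a build that supports streaming raw PCM to stdout (the flag is --output-raw in recent releases; check piper --help for your version):
import { spawn } from 'node:child_process';
// Sketch: synthesize one sentence with a local Piper voice and stream its raw
// 16-bit PCM output to the client as it is produced.
function streamLocalTTS(ws, sentence, sentenceIndex) {
  return new Promise((resolve, reject) => {
    const piper = spawn('piper', [
      '--model', '/models/en_US-lessac-medium.onnx', // example path, not bundled
      '--output-raw', // raw PCM samples on stdout instead of a wav file
    ]);
    let chunkIndex = 0;
    piper.stdout.on('data', (chunk) => {
      ws.send(JSON.stringify({
        type: 'audio-chunk',
        audio: chunk.toString('base64'),
        sentenceIndex,
        chunkIndex: chunkIndex++,
        format: 'pcm_22050', // most medium-quality Piper voices run at 22.05 kHz
      }));
    });
    piper.on('error', reject);
    piper.on('close', (code) => (code === 0 ? resolve() : reject(new Error(`piper exited with ${code}`))));
    piper.stdin.end(sentence + '\n');
  });
}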
Option C: Browser-Native TTS
The Web Speech API (speechSynthesis) is free and requires no server-side processing, but the voice quality is noticeably worse than neural TTS, and you have limited control over timing for lip-sync. It is useful for prototyping but rarely suitable for production avatar experiences.
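For quick demos, it is only a few lines in the browser:
// Browser-native fallback: no per-chunk audio access for lip-sync, but fine for prototyping
function speakWithBrowserTTS(text) {
  const utterance = new SpeechSynthesisUtterance(text);
  utterance.rate = 1.0;
  utterance.onend = () => console.log('Finished speaking');
  window.speechSynthesis.speak(utterance);
}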
Layer 4: Avatar Rendering and Lip-Sync
The final layer takes audio chunks and drives a visual avatar. This happens entirely in the browser using WebGL (Three.js, Babylon.js) or a 2D canvas.
Viseme-Based Lip-Sync
The standard approach extracts visemes (visual mouth shapes corresponding to phonemes) from the audio stream. There are roughly 15 distinct visemes in English. You map each viseme to a blend shape on your 3D model's face mesh.
// Using the Web Audio API to analyze audio for viseme extraction
class LipSyncAnalyzer {
  constructor(audioContext) {
    this.analyser = audioContext.createAnalyser();
    this.analyser.fftSize = 256;
    this.sampleRate = audioContext.sampleRate;
    this.dataArray = new Float32Array(this.analyser.frequencyBinCount);
  }
  getViseme() {
    this.analyser.getFloatFrequencyData(this.dataArray);
    // Simplified: map frequency energy bands to mouth openness
    const lowEnergy = this.getEnergyInRange(80, 300); // jaw open
    const midEnergy = this.getEnergyInRange(300, 3000); // lip shape
    const highEnergy = this.getEnergyInRange(3000, 8000); // fricatives
    return {
      jawOpen: Math.min(lowEnergy / 50, 1),
      mouthWidth: Math.min(midEnergy / 40, 1),
      lipsTight: highEnergy > 30 ? 0.5 : 0,
    };
  }
  getEnergyInRange(minHz, maxHz) {
    // Analyser bins span 0..Nyquist (sampleRate / 2), not the full sample rate
    const binSize = (this.sampleRate / 2) / this.analyser.frequencyBinCount;
    const startBin = Math.floor(minHz / binSize);
    const endBin = Math.floor(maxHz / binSize);
    let sum = 0;
    for (let i = startBin; i <= endBin; i++) {
      sum += Math.pow(10, this.dataArray[i] / 20);
    }
    return sum / (endBin - startBin + 1);
  }
}
This frequency-analysis approach is simpler than full phoneme detection but produces surprisingly convincing results. The avatar's mouth opens and closes in sync with the audio, and the different frequency bands create enough variation to avoid the "puppet mouth" effect.
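One piece the server-side examples leave implicit is how the browser turns the incoming base64 PCM into audible, analysable sound. A minimal sketch, assuming the 16-bit little-endian PCM at 24 kHz requested in the ElevenLabs example and that each chunk contains whole samples (production code should buffer any odd trailing byte), routes playback through the same analyser:
// Decode incoming audio-chunk messages, schedule gapless playback, and route the
// signal through the analyser so getViseme() reflects what is currently audible.
const audioContext = new AudioContext({ sampleRate: 24000 });
const lipSync = new LipSyncAnalyzer(audioContext);
lipSync.analyser.connect(audioContext.destination);
let nextStartTime = 0;
function playAudioChunk(base64Pcm) {
  const bytes = Uint8Array.from(atob(base64Pcm), (c) => c.charCodeAt(0));
  const samples = new Int16Array(bytes.buffer, 0, Math.floor(bytes.length / 2));
  const floats = Float32Array.from(samples, (s) => s / 32768); // 16-bit int to -1..1
  const buffer = audioContext.createBuffer(1, floats.length, 24000);
  buffer.copyToChannel(floats, 0);
  const source = audioContext.createBufferSource();
  source.buffer = buffer;
  source.connect(lipSync.analyser); // analyser sits between the source and the speakers
  nextStartTime = Math.max(nextStartTime, audioContext.currentTime);
  source.start(nextStartTime); // schedule back-to-back so chunks play without gaps
  nextStartTime += buffer.duration;
}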
Using Ready Player Me Avatars
If you are working with Ready Player Me (RPM) avatars, which use a standard set of ARKit-compatible blend shapes, you can drive lip-sync directly on the blend shape targets:
function applyVisemeToAvatar(avatar, viseme) {
  const head = avatar.getObjectByName('Wolf3D_Head');
  if (!head?.morphTargetInfluences) return;
  const targets = head.morphTargetDictionary;
  // Reset all mouth shapes
  head.morphTargetInfluences[targets['viseme_aa']] = 0;
  head.morphTargetInfluences[targets['viseme_O']] = 0;
  head.morphTargetInfluences[targets['jawOpen']] = 0;
  // Apply current viseme
  head.morphTargetInfluences[targets['jawOpen']] = viseme.jawOpen;
  head.morphTargetInfluences[targets['viseme_aa']] = viseme.mouthWidth * 0.7;
  head.morphTargetInfluences[targets['viseme_O']] = viseme.lipsTight;
}
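Tying the two pieces together is a per-frame update; here is a sketch assuming a standard Three.js render loop, where avatar, renderer, scene, and camera come from however you loaded the RPM model:
// Sample the analyser and drive the blend shapes once per rendered frame
function startLipSyncLoop(avatar, lipSync, renderer, scene, camera) {
  function tick() {
    applyVisemeToAvatar(avatar, lipSync.getViseme());
    renderer.render(scene, camera);
    requestAnimationFrame(tick);
  }
  requestAnimationFrame(tick);
}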
Avatarium uses Ready Player Me models internally, so if you are building on the Avatarium SDK, the lip-sync pipeline is already integrated. You pass audio data to the avatar component and it handles viseme extraction and blend shape animation automatically.
Putting It All Together: The Full Pipeline
Here is the orchestration function that ties all four layers into a single streaming pipeline:
async function streamAvatarResponse(ws, userText) {
  const systemPrompt = `You are a helpful assistant. Keep responses
conversational and under 4 sentences per turn.`;
  // Sentences produced by the LLM stream, consumed by the TTS loop below
  const sentences = [];
  let llmDone = false;
  // Start LLM streaming; each completed sentence is pushed onto the queue
  const llmPromise = streamLLMResponse(ws, userText, systemPrompt, (sentence) => {
    sentences.push(sentence);
  }).finally(() => {
    llmDone = true;
  });
  // Process sentences as they arrive (TTS + send audio), in order
  let processingIndex = 0;
  while (!llmDone || processingIndex < sentences.length) {
    if (processingIndex < sentences.length) {
      await streamTTS(ws, sentences[processingIndex], 'your-voice-id', processingIndex);
      processingIndex++;
    } else {
      // Nothing queued yet; poll briefly (see the async-queue note below)
      await new Promise((resolve) => setTimeout(resolve, 50));
    }
  }
  await llmPromise;
  ws.send(JSON.stringify({ type: 'response-complete' }));
}
In production, you would replace the polling with an event-driven approach using an async queue. But the core idea remains: LLM generation and TTS synthesis run concurrently, with sentences flowing from one to the other as they become available.
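A sketch of what that event-driven version can look like: a tiny async queue where the LLM callback pushes sentences and the TTS loop awaits them, with no timer at all (the class and usage here are illustrative, not from any SDK):
// Minimal async queue: push() wakes a pending shift() immediately, close() drains waiters
class AsyncQueue {
  constructor() {
    this.items = [];
    this.waiters = [];
    this.closed = false;
  }
  push(item) {
    const waiter = this.waiters.shift();
    if (waiter) waiter(item);
    else this.items.push(item);
  }
  close() {
    this.closed = true;
    this.waiters.forEach((resolve) => resolve(null));
    this.waiters = [];
  }
  shift() {
    if (this.items.length) return Promise.resolve(this.items.shift());
    if (this.closed) return Promise.resolve(null);
    return new Promise((resolve) => this.waiters.push(resolve));
  }
}
// Inside streamAvatarResponse, the polling loop becomes:
//   const queue = new AsyncQueue();
//   streamLLMResponse(ws, userText, systemPrompt, (s) => queue.push(s)).finally(() => queue.close());
//   let index = 0;
//   let sentence;
//   while ((sentence = await queue.shift()) !== null) {
//     await streamTTS(ws, sentence, 'your-voice-id', index++);
//   }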
Latency Budget: Where the Milliseconds Go
For a good user experience, the avatar should start speaking within 1.5 seconds of the user finishing their input. Here is a realistic latency budget:
- Speech-to-text (if using voice input): 200-400ms with Whisper or Deepgram streaming
- LLM first token: 200-600ms depending on model and provider
- Sentence accumulation: 300-800ms (depends on sentence length)
- TTS first audio chunk: 150-300ms with ElevenLabs streaming
- Client audio playback start: 50-100ms (buffering)
Total time to first audible word: roughly 900ms to 2.2 seconds. The lower end is achievable with a fast model (GPT-4o mini, Claude 3.5 Haiku), streaming TTS, and good network conditions. The upper end is what you get with a larger model and non-streaming TTS.
The single biggest latency win is switching from "generate full response then synthesize" to "synthesize as sentences arrive." This alone typically cuts perceived latency by 60-70%.
Handling Edge Cases in Production
Interruptions
Users will speak while the avatar is still talking. You need a barge-in system that detects user speech, immediately stops the current avatar audio, and processes the new input. On the server side, this means cancelling any in-flight TTS requests and aborting the current LLM stream:
ws.on('message', (data) => {
  const msg = JSON.parse(data.toString());
  if (msg.type === 'user-interrupt') {
    // Cancel current generation
    if (currentAbortController) {
      currentAbortController.abort();
    }
    // Start new response pipeline
    streamAvatarResponse(ws, msg.text);
  }
});
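The piece this snippet assumes is where currentAbortController comes from. One way to wire it, relying on the fact that both fetch and the OpenAI Node SDK accept an AbortSignal (threading the signal through streamAvatarResponse is an assumption of this sketch, not something the earlier code does yet):
// Per-connection abort handle; in practice this lives in the session state described below
let currentAbortController = null;
async function startCancellableResponse(ws, userText) {
  currentAbortController?.abort(); // stop whatever is still streaming from the previous turn
  currentAbortController = new AbortController();
  const { signal } = currentAbortController;
  try {
    await streamAvatarResponse(ws, userText, signal); // assumes the pipeline forwards the signal
  } catch (err) {
    if (signal.aborted) return; // expected when the user barges in
    throw err;
  }
}
// Downstream, the signal is handed to both calls:
//   openai.chat.completions.create({ ...params, stream: true }, { signal });
//   fetch(elevenLabsUrl, { method: 'POST', signal, headers, body });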
Error Recovery
When a TTS API call fails mid-sentence, you have two options: skip the sentence (the avatar goes briefly silent then continues) or fall back to browser TTS for that chunk. In practice, skipping is less jarring than a sudden voice quality change.
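A sketch of the skip approach, wrapping the earlier streamTTS call (the tts-skip message type is made up here so the client can show the text as a caption instead):
// A failed sentence degrades to a brief silence instead of ending the whole turn
async function streamTTSWithRecovery(ws, sentence, voiceId, sentenceIndex) {
  try {
    await streamTTS(ws, sentence, voiceId, sentenceIndex);
  } catch (err) {
    console.error(`TTS failed for sentence ${sentenceIndex}:`, err.message);
    ws.send(JSON.stringify({ type: 'tts-skip', sentenceIndex, text: sentence }));
  }
}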
Concurrency and Session Management
Each connected user needs their own LLM conversation history, TTS voice state, and streaming pipeline. Use a session map keyed by WebSocket connection ID, and clean up resources on disconnect. Memory usage per session is minimal (a few KB for conversation history), so a single server can handle hundreds of concurrent avatar sessions.
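A sketch of that bookkeeping, extending the connection handler from earlier; the fields mirror what the pipeline above needs:
// One entry per connected client, created on connection and torn down on close
const sessions = new Map();
let nextSessionId = 0;
wss.on('connection', (ws) => {
  const sessionId = ++nextSessionId;
  sessions.set(sessionId, {
    ws,
    history: [], // LLM conversation turns for this user
    voiceId: 'your-voice-id', // per-user TTS voice
    abortController: null, // in-flight generation, for barge-in
  });
  ws.on('close', () => {
    sessions.get(sessionId)?.abortController?.abort(); // stop any in-flight work
    sessions.delete(sessionId); // free conversation history and pipeline state
  });
});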
Cost Considerations
Running a real-time avatar assistant at scale involves three variable costs:
- LLM tokens: at $0.15-3.00 per million input tokens depending on model, a typical 4-sentence response costs $0.001-0.005
- TTS characters: ElevenLabs charges roughly $0.30 per 1,000 characters. A 4-sentence response (about 80 words, 400 characters) costs about $0.12
- Compute: WebSocket server, minimal. GPU for local TTS, more significant but amortized across sessions
TTS is the dominant cost by far. This is why many production deployments use a hybrid approach: neural TTS for the first response (when perceived quality matters most) and a lighter model for subsequent turns. Or they use local Piper TTS entirely and accept the slight quality tradeoff.
Where Avatarium Fits In
If building this entire pipeline from scratch sounds like a lot of plumbing, that is because it is. Avatarium's SDK handles the WebSocket management, audio streaming, lip-sync rendering, and avatar display as a single integration. You bring your LLM and system prompt; Avatarium handles everything from TTS through to the animated avatar in the browser.
A basic integration looks like this:
import { AvatarSession } from '@avatarium/sdk';
const session = new AvatarSession({
  apiKey: 'your-avatarium-key',
  avatarId: 'avatar_abc123',
  containerId: 'avatar-container',
});
await session.connect();
// Send text, avatar speaks it with lip-sync automatically
await session.speak('Welcome! How can I help you today?');
// Or connect to your LLM for fully interactive conversation
session.onUserSpeech(async (transcript) => {
  // Your LLM logic here
  const response = await getAIResponse(transcript);
  session.speak(response);
});
You can explore the full SDK documentation at docs.avatarium.ai and start building with a free tier on the Avatarium dashboard.
What to Build Next
Once you have a working real-time avatar assistant, the interesting extensions are:
- Multi-modal input – accept images, documents, or screen shares alongside voice
- Emotion rendering – map sentiment from the LLM response to facial expressions (smile, concern, enthusiasm)
- Memory and personalization – store conversation history per user so the avatar remembers previous interactions
- Multi-language – detect the user's language and switch TTS voice dynamically
- Analytics – track conversation completion rates, common questions, and user satisfaction to improve your system prompt
Real-time AI avatars are moving fast. The gap between a demo and a production-quality experience is mostly in the streaming pipeline engineering, and now you have the blueprint to close that gap.