Iseer Live API — Complete Reference

The Iseer Live API enables real-time, bidirectional interaction with the arete-live model, supporting audio, video, and text inputs with low-latency native audio output. This reference covers its features, configuration options, and best practices.


Overview

The Iseer Live API uses WebSockets for persistent, duplex communication. Unlike traditional request-response APIs, a Live session maintains continuous state, enabling natural conversations with interruptions, tool calls, and multimodal streaming.

Key Concepts

  • Session: A persistent, stateful WebSocket connection to the arete-live model.
  • Setup Message: The initial JSON payload sent immediately after connecting that defines modalities, voice, system instructions, tools, and all session parameters.
  • Real-time Input: Client-side streaming of PCM audio, JPEG/PNG video frames, or text.
  • Server Content: Model responses streamed back as native audio, text transcriptions, or tool call requests.
  • Turn: One cycle of user input → model response. Billing and context accumulation happen per turn.

Establishing a Connection

Step 1: Request a Session

Send a POST request to /api/live-connect on https://api.iseer.co. The endpoint returns a secure WebSocket URL (with an ephemeral token) and the setup payload.

curl -X POST https://api.iseer.co/api/live-connect \
  -H "Content-Type: application/json" \
  -d '{
    "system_instruction": "You are a helpful Iseer assistant.",
    "response_modalities": ["AUDIO"]
  }'

Response:

{
  "url": "wss://genai.api.iseer.co?access_token=YOUR_EPHEMERAL_TOKEN",
  "setup": {
    "setup": {
      "model": "models/arete-live",
      "generationConfig": {
        "responseModalities": ["AUDIO"]
      },
      "systemInstruction": {
        "parts": [{ "text": "You are a helpful Iseer assistant." }]
      }
    }
  }
}

Step 2: Connect via WebSocket

Open the WebSocket and immediately send the setup payload. Wait for the setupComplete acknowledgement before streaming any data.

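// `response` is the parsed JSON body returned by POST /api/live-connect in Step 1.
const ws = new WebSocket(response.url);

ws.onopen = () => {
  // The setup payload must be the first message sent on the socket.
  ws.send(JSON.stringify(response.setup));
};

ws.onmessage = (event) => {
  const msg = JSON.parse(event.data);
  if (msg.setupComplete) {
    // Safe to start streaming audio, video, or text from this point on.
    console.log("Session established. Ready to stream.");
  }
};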
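
One way to enforce the "wait for setupComplete" rule is to queue outgoing messages until the acknowledgement arrives. The sketch below is illustrative only: createGatedSender and its method names are hypothetical helpers, not part of the Iseer API, and it assumes you pass it a function that writes a string to the socket (for example, (s) => ws.send(s)).

```javascript
// Hypothetical helper: buffers outgoing payloads until setupComplete is seen.
function createGatedSender(sendRaw) {
  let ready = false;
  const pending = [];
  return {
    // Call this with every parsed server message.
    handleServerMessage(msg) {
      if (msg.setupComplete && !ready) {
        ready = true;
        // Flush anything queued before the session was ready, in order.
        while (pending.length > 0) sendRaw(JSON.stringify(pending.shift()));
      }
    },
    // Sends immediately once ready; otherwise queues the payload.
    send(payload) {
      if (ready) sendRaw(JSON.stringify(payload));
      else pending.push(payload);
    },
    isReady: () => ready,
  };
}
```

In the onmessage handler you would call handleServerMessage(msg) first, then route the message to the rest of your application.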

Interaction Modalities

Once setupComplete is received, you can begin streaming data in any combination of the following modalities.

Sending Audio (Speech-to-Speech)

Input audio must be raw, little-endian, 16-bit PCM. The native sample rate is 16kHz mono, but the API will resample if you send a different rate (set via the MIME type).

{
  "realtimeInput": {
    "audio": {
      "mimeType": "audio/pcm;rate=16000",
      "data": "BASE64_ENCODED_PCM_DATA"
    }
  }
}

Best Practice: Chunk your audio into 20ms–100ms segments (320–1600 samples, or 640–3200 bytes, at 16kHz 16-bit mono) and send them continuously. Do NOT buffer more than ~100ms before sending; smaller chunks minimize latency. If your microphone captures at 44.1kHz or 48kHz, resample to 16kHz on the client before transmission.
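
As a rough illustration of the chunking arithmetic: at 16kHz, 16-bit mono, 20ms of audio is 320 samples, or 640 bytes. The sketch below converts Float32 samples (as produced by the Web Audio API) to little-endian 16-bit PCM and wraps them in realtimeInput messages. The function names are hypothetical, the input is assumed to be already resampled to 16kHz, and btoa is assumed to be available (browsers and Node.js 16+).

```javascript
const SAMPLE_RATE = 16000;   // Hz, the native input rate described above
const BYTES_PER_SAMPLE = 2;  // 16-bit PCM

// Converts Float32 samples in [-1, 1] to little-endian 16-bit PCM bytes.
function floatTo16BitPCM(samples) {
  const view = new DataView(new ArrayBuffer(samples.length * BYTES_PER_SAMPLE));
  for (let i = 0; i < samples.length; i++) {
    const s = Math.max(-1, Math.min(1, samples[i])); // clamp to avoid overflow
    view.setInt16(i * 2, s < 0 ? s * 0x8000 : s * 0x7fff, true); // true = little-endian
  }
  return new Uint8Array(view.buffer);
}

// Base64-encodes a byte array.
function toBase64(bytes) {
  let binary = "";
  for (const b of bytes) binary += String.fromCharCode(b);
  return btoa(binary);
}

// Splits PCM bytes into chunkMs-sized realtimeInput messages.
function toAudioMessages(pcmBytes, chunkMs = 20) {
  const chunkBytes = (SAMPLE_RATE * BYTES_PER_SAMPLE * chunkMs) / 1000; // 640 at 20 ms
  const messages = [];
  for (let offset = 0; offset < pcmBytes.length; offset += chunkBytes) {
    const chunk = pcmBytes.subarray(offset, offset + chunkBytes);
    messages.push({
      realtimeInput: {
        audio: { mimeType: `audio/pcm;rate=${SAMPLE_RATE}`, data: toBase64(chunk) },
      },
    });
  }
  return messages;
}
```

Each returned message can be passed to JSON.stringify and sent as-is; the final chunk may be shorter than chunkMs.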

Audio Stream End

When the audio stream is paused for more than one second (e.g., the user mutes their mic), send an audioStreamEnd event to flush any buffered audio on the server. You can resume sending audio data at any time afterwards.

{
  "realtimeInput": {
    "audioStreamEnd": true
  }
}

Sending Video (Vision)

The arete-live model supports real-time vision. Send individual image frames (JPEG or PNG) at a maximum rate of 1 frame per second.

{
  "realtimeInput": {
    "video": {
      "mimeType": "image/jpeg",
      "data": "BASE64_ENCODED_JPEG_DATA"
    }
  }
}
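
Because frames above 1 per second exceed the stated maximum, a client capturing at a higher rate needs to drop frames before sending. A minimal throttle sketch follows; createFrameThrottle is a hypothetical helper, and the frame is assumed to be already JPEG-encoded and base64-encoded.

```javascript
// Hypothetical throttle: builds a video message for at most one frame per interval.
function createFrameThrottle(minIntervalMs = 1000) {
  let lastSentAt = -Infinity;
  // nowMs is injectable so the logic can be exercised without real timers.
  return function maybeBuildFrameMessage(base64Jpeg, nowMs = Date.now()) {
    if (nowMs - lastSentAt < minIntervalMs) return null; // too soon: drop this frame
    lastSentAt = nowMs;
    return {
      realtimeInput: { video: { mimeType: "image/jpeg", data: base64Jpeg } },
    };
  };
}
```

Call the returned function for every captured frame and send only the non-null results.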

Sending Text

You can inject text into the conversation via realtimeInput:

{
  "realtimeInput": {
    "text": "Hello, how are you?"
  }
}

Incremental Content Updates (Context Seeding)

Use clientContent to seed initial conversation history or inject context. This is useful for restoring a session or providing the model with background information before the user speaks.

{
  "clientContent": {
    "turns": [
      { "role": "user", "parts": [{ "text": "What is the capital of France?" }] },
      { "role": "model", "parts": [{ "text": "Paris." }] },
      { "role": "user", "parts": [{ "text": "And Germany?" }] }
    ],
    "turnComplete": true
  }
}

Note: For arete-live, clientContent is primarily supported for seeding initial context history. During active conversation, use realtimeInput with the text field instead.
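
If your application stores history as simple role/text pairs, a small builder can produce the clientContent shape shown above. This is a sketch; buildSeedMessage and the { role, text } input shape are assumptions made for illustration.

```javascript
// Hypothetical builder: turns [{ role, text }] history into a clientContent message.
function buildSeedMessage(history) {
  return {
    clientContent: {
      turns: history.map(({ role, text }) => ({ role, parts: [{ text }] })),
      turnComplete: true,
    },
  };
}
```

For example, buildSeedMessage([{ role: "user", text: "What is the capital of France?" }, { role: "model", text: "Paris." }]) yields a payload ready for JSON.stringify.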


Receiving Output

The model streams responses as JSON payloads containing serverContent.

Receiving Native Audio

Output audio is raw, little-endian, 16-bit PCM at 24kHz mono. A single server event may contain multiple content parts simultaneously (e.g., inlineData and transcript together). Ensure your code processes all parts in each event.

ws.onmessage = (event) => {
  const msg = JSON.parse(event.data);
  const parts = msg.serverContent?.modelTurn?.parts;
  if (parts) {
    for (const part of parts) {
      if (part.inlineData?.mimeType?.includes("audio/pcm")) {
        playAudioBuffer(part.inlineData.data); // base64 → PCM → AudioContext
      }
    }
  }
};
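
The playAudioBuffer call above is left to the application. As one possible first step, the sketch below decodes the base64 payload into Float32 samples suitable for an AudioBuffer created at 24kHz. All function names here are hypothetical, and atob is assumed to be available (browsers and Node.js 16+).

```javascript
const OUTPUT_SAMPLE_RATE = 24000; // Hz, the output format described above

// Decodes a base64 string into raw bytes.
function base64ToBytes(b64) {
  const binary = atob(b64);
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) bytes[i] = binary.charCodeAt(i);
  return bytes;
}

// Converts little-endian 16-bit PCM bytes to Float32 samples in [-1, 1).
function pcm16ToFloat32(bytes) {
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  const samples = new Float32Array(Math.floor(bytes.byteLength / 2));
  for (let i = 0; i < samples.length; i++) {
    samples[i] = view.getInt16(i * 2, true) / 0x8000; // true = little-endian
  }
  return samples;
}

// Returns one Float32Array per audio part found in a server message.
function extractAudio(msg) {
  const parts = msg.serverContent?.modelTurn?.parts ?? [];
  return parts
    .filter((p) => p.inlineData?.mimeType?.includes("audio/pcm"))
    .map((p) => pcm16ToFloat32(base64ToBytes(p.inlineData.data)));
}
```

In a browser, each returned array could be copied into an AudioBuffer created with OUTPUT_SAMPLE_RATE and scheduled for playback.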

Turn Completion

When the model finishes its response, it sends a turnComplete signal:

{ "serverContent": { "turnComplete": true } }

Generation Complete

Distinct from turnComplete, the generationComplete flag signals that the model has finished all generation for the current request (important for multi-part or tool-assisted responses):

{ "serverContent": { "generationComplete": true } }

Audio Transcriptions

You can request real-time text transcriptions of both input audio (what the user says) and output audio (what the model says). This is essential for displaying subtitles, logging conversations, or feeding text to downstream systems.

Enabling Transcriptions

Include the transcription flags in your setup configuration:

"config": {
  "responseModalities": ["AUDIO"],
  "inputAudioTranscription": {},
  "outputAudioTranscription": {}
}

Receiving Transcriptions

Transcriptions arrive interwoven with audio chunks:

{
  "serverContent": {
    "inputTranscription": { "text": "What's the weather like today?" }
  }
}
{
  "serverContent": {
    "outputTranscription": { "text": "It's currently 72 degrees and sunny." }
  }
}
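
For subtitles or logging, you will typically want to collect these events per turn. The sketch below assumes that transcription text may arrive in several fragments that should be concatenated, and that turnComplete marks the end of a user/model exchange; both are assumptions, and createTranscriptLog is a hypothetical helper.

```javascript
// Hypothetical collector: groups transcription fragments into per-turn records.
function createTranscriptLog() {
  const turns = []; // finished { user, model } records
  let user = "";
  let model = "";
  return {
    turns,
    // Call this with every parsed server message.
    handleServerMessage(msg) {
      const sc = msg.serverContent;
      if (!sc) return;
      if (sc.inputTranscription?.text) user += sc.inputTranscription.text;
      if (sc.outputTranscription?.text) model += sc.outputTranscription.text;
      if (sc.turnComplete) {
        turns.push({ user, model });
        user = "";
        model = "";
      }
    },
  };
}
```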

Billing Note: When transcription is enabled, you are charged for the text tokens generated at the text output rate in addition to the standard audio token costs.


Voice & Language Configuration

Changing the Voice

The arete-live model supports a wide range of voices. Specify a voice in your setup config:

"config": {
  "responseModalities": ["AUDIO"],
  "speechConfig": {
    "voiceConfig": {
      "prebuiltVoiceConfig": { "voiceName": "Kore" }
    }
  }
}

Language Support

The model supports 97 languages and can switch between them naturally during conversation. You can restrict which languages it speaks in via your system instructions. If you need the model to respond in a non-English language, include in your system instructions:

RESPOND IN {LANGUAGE}. YOU MUST RESPOND UNMISTAKABLY IN {LANGUAGE}.
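
Language   | Code | Language   | Code | Language   | Code
Arabic     | ar   | English    | en   | Hindi      | hi
Bengali    | bn   | French     | fr   | Japanese   | ja
Chinese    | zh   | German     | de   | Korean     | ko
Dutch      | nl   | Indonesian | id   | Portuguese | pt
Filipino   | fil  | Italian    | it   | Russian    | ru
Spanish    | es   | Turkish    | tr   | Vietnamese | vi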

...and 79 more. The model automatically detects and responds in the user's language.


Thinking & Reasoning

The arete-live model uses dynamic internal thinking to reason through complex problems before speaking. You can control the depth of reasoning using the thinkingLevel parameter.

"config": {
  "thinkingConfig": {
    "thinkingLevel": "low"
  }
}

Level   | Behavior
minimal | Fastest responses, least reasoning. Default.
low     | Light reasoning with low latency.
medium  | Balanced reasoning and speed.
high    | Deep reasoning, higher latency.

Thought Summaries

You can also receive summaries of the model's internal thought process by setting includeThoughts: true:

"config": {
  "thinkingConfig": {
    "thinkingLevel": "medium",
    "includeThoughts": true
  }
}

Voice Activity Detection (VAD)

VAD allows the model to recognize when a person is speaking. This is essential for natural conversations—it lets the user interrupt the model at any time.

When VAD detects an interruption, the ongoing generation is cancelled and discarded. Only information already sent to the client is retained. The server sends:

{ "serverContent": { "interrupted": true } }

Client Implementation: When you receive interrupted: true, you must immediately stop playback and clear your client-side audio buffer to prevent the model from talking over the user.
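
A minimal sketch of that client-side behavior, assuming decoded audio chunks are held in a queue before playback. createPlaybackQueue is hypothetical, and stopPlayback stands in for whatever your audio layer uses to halt the currently playing source.

```javascript
// Hypothetical playback queue that is flushed when the server reports an interruption.
function createPlaybackQueue(stopPlayback) {
  const queue = [];
  return {
    enqueue(chunk) { queue.push(chunk); },
    next() { return queue.shift(); },
    size() { return queue.length; },
    // Call this with every parsed server message.
    handleServerMessage(msg) {
      if (msg.serverContent?.interrupted) {
        queue.length = 0; // drop audio that has not been played yet
        stopPlayback();   // halt whatever is currently sounding
      }
    },
  };
}
```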

Automatic VAD (Default)

By default, the model automatically performs VAD on the incoming audio stream. You can fine-tune it:

"config": {
  "realtimeInputConfig": {
    "automaticActivityDetection": {
      "disabled": false,
      "startOfSpeechSensitivity": "START_SENSITIVITY_LOW",
      "endOfSpeechSensitivity": "END_SENSITIVITY_LOW",
      "prefixPaddingMs": 20,
      "silenceDurationMs": 800
    }
  }
}

Understanding VAD Parameters

Parameter | Description | Recommendation
prefixPaddingMs | Audio included before speech is detected (look-back buffer). 0 may clip first syllables. | 20ms
silenceDurationMs | How long to wait through silence before ending a speech turn. | 500–800ms
startOfSpeechSensitivity | How eagerly the system detects the start of speech. | START_SENSITIVITY_LOW for noisy environments
endOfSpeechSensitivity | How eagerly the system detects the end of speech. | END_SENSITIVITY_LOW for users who pause often

Warning: Setting silenceDurationMs below 300ms will cause the model to split natural sentences at every breath or pause, resulting in fragmented context and degraded response quality. We strongly recommend 500ms–800ms.

Manual VAD (Client-Side)

You can disable automatic VAD and manage speech boundaries yourself using activityStart and activityEnd signals:

"config": {
  "realtimeInputConfig": {
    "automaticActivityDetection": {
      "disabled": true
    }
  }
}

Then send explicit boundary signals:

{ "realtimeInput": { "activityStart": {} } }
{ "realtimeInput": { "activityEnd": {} } }

Best Practice for Manual VAD: Use an end-of-speech silence threshold of at least 500ms in your client-side detector. Below this, audio becomes fragmented and response quality degrades significantly.
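
To make the threshold concrete, here is a deliberately simple energy-based detector that emits the boundary signals above. It is a sketch, not a recommended VAD algorithm: createEnergyVad, the RMS threshold of 0.01, and the fixed 20ms frame size are all assumptions chosen for illustration. It reports the end of speech only after 500ms of continuous silence.

```javascript
// Hypothetical energy-based VAD. Feed it fixed-size Float32 frames; it returns
// an activityStart / activityEnd message when a boundary is crossed, else null.
function createEnergyVad({ threshold = 0.01, silenceMs = 500, frameMs = 20 } = {}) {
  let speaking = false;
  let silentFor = 0; // milliseconds of continuous silence while speaking
  return function processFrame(samples) {
    let sum = 0;
    for (const s of samples) sum += s * s;
    const rms = samples.length ? Math.sqrt(sum / samples.length) : 0;
    if (rms >= threshold) {
      silentFor = 0;
      if (!speaking) {
        speaking = true;
        return { realtimeInput: { activityStart: {} } };
      }
    } else if (speaking) {
      silentFor += frameMs;
      if (silentFor >= silenceMs) {
        speaking = false;
        silentFor = 0;
        return { realtimeInput: { activityEnd: {} } };
      }
    }
    return null;
  };
}
```

With the defaults, 25 consecutive quiet frames (25 × 20ms = 500ms) are needed before activityEnd is emitted, so short pauses inside a sentence do not split the turn.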


Advanced Audio Features

Affective Dialog

This feature lets the model adapt its response style—tone, pace, emotion—to match the expression and tone of the user's vocal input. Enable it in your setup:

"config": {
  "enableAffectiveDialog": true
}

Proactive Audio

When enabled, the model can intelligently decide not to respond if the content it hears is not relevant (e.g., background conversation, TV noise). This prevents unnecessary interruptions.

"config": {
  "proactivity": {
    "proactiveAudio": true
  }
}

Billing Note: When Proactive Audio is enabled, input tokens are charged the entire time the API is listening, while output tokens are only charged when the model actually responds.


Media Resolution

You can control the resolution of input media (images/video frames) to trade off between quality and token consumption:

"config": {
  "mediaResolution": "MEDIA_RESOLUTION_LOW"
}

Tool Calling (Function Execution)

The Iseer Live API natively supports bidirectional tool calling. You declare tools during setup, and the model can invoke them mid-conversation.

Defining Tools

Include function declarations in your setup config. Each function needs a name, description, and parameters schema:

"tools": [
  {
    "functionDeclarations": [
      {
        "name": "get_weather",
        "description": "Gets the current weather for a location. Invoke when the user asks about weather.",
        "parameters": {
          "type": "OBJECT",
          "properties": {
            "location": {
              "type": "STRING",
              "description": "City name or coordinates"
            }
          },
          "required": ["location"]
        }
      }
    ]
  }
]

Best Practice: Include an Invocation Condition in your tool descriptions (e.g., "Invoke this tool only after the user provides their location"). This dramatically improves when and how the model uses each tool.

Handling Tool Calls

When the model needs to call a tool, it pauses audio generation and sends a toolCall:

{
  "toolCall": {
    "functionCalls": [
      {
        "id": "call_abc123",
        "name": "get_weather",
        "args": { "location": "San Francisco" }
      }
    ]
  }
}

Responding to Tool Calls

Execute the function locally and send the result back immediately:

{
  "toolResponse": {
    "functionResponses": [
      {
        "id": "call_abc123",
        "name": "get_weather",
        "response": {
          "temperature": "72°F",
          "condition": "Sunny",
          "humidity": "45%"
        }
      }
    ]
  }
}

The model will then resume speaking and incorporate the tool result into its response.
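
One way to wire this up is a small dispatcher that maps function names to local handlers and echoes each call's id and name back in the response. This is a sketch under assumptions: buildToolResponse and the handlers map are hypothetical, and the { error: ... } shape returned for unknown functions or thrown exceptions is an illustrative choice, not a documented format.

```javascript
// Hypothetical dispatcher: runs local handlers for a toolCall and builds the toolResponse.
async function buildToolResponse(toolCall, handlers) {
  const functionResponses = [];
  for (const call of toolCall.functionCalls ?? []) {
    const handler = handlers[call.name];
    let response;
    try {
      response = handler
        ? await handler(call.args ?? {})
        : { error: `Unknown function: ${call.name}` }; // assumed error shape
    } catch (err) {
      response = { error: String(err?.message ?? err) }; // assumed error shape
    }
    // The id must match the id from the original call.
    functionResponses.push({ id: call.id, name: call.name, response });
  }
  return { toolResponse: { functionResponses } };
}
```

In the message handler, something like buildToolResponse(msg.toolCall, handlers).then((payload) => ws.send(JSON.stringify(payload))) would return the results to the model.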

Synchronous vs Asynchronous Function Calling

By default, function calling is synchronous—the model pauses all interaction until your tool response arrives. For non-blocking (asynchronous) execution, set behavior: "NON_BLOCKING" on the function declaration:

{
  "functionDeclarations": [
    {
      "name": "send_email",
      "description": "Sends an email to the user.",
      "behavior": "NON_BLOCKING"
    }
  ]
}

For non-blocking functions, control how the model reacts when your response arrives using the scheduling parameter in the toolResponse:

Scheduling | Behavior
INTERRUPT | The model stops what it's saying and immediately addresses the tool result.
WHEN_IDLE | The model waits until it finishes its current thought, then addresses the result.
SILENT | The model absorbs the information silently and uses it later in the conversation.

{
  "toolResponse": {
    "functionResponses": [
      {
        "id": "call_xyz",
        "name": "send_email",
        "response": {
          "result": "sent",
          "scheduling": "WHEN_IDLE"
        }
      }
    ]
  }
}
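
A small builder can guard against typos in the scheduling value. This is a sketch; buildScheduledResponse is a hypothetical helper, and it places scheduling inside response to match the payload shown above.

```javascript
const SCHEDULING_VALUES = new Set(["INTERRUPT", "WHEN_IDLE", "SILENT"]);

// Hypothetical builder for a non-blocking function's toolResponse.
function buildScheduledResponse(call, result, scheduling) {
  if (!SCHEDULING_VALUES.has(scheduling)) {
    throw new RangeError(`Unsupported scheduling value: ${scheduling}`);
  }
  return {
    toolResponse: {
      functionResponses: [
        { id: call.id, name: call.name, response: { ...result, scheduling } },
      ],
    },
  };
}
```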

Iseer Web Search (Grounding)

Give the arete-live model access to real-time web information via the Iseer Web Search tool. Grounding responses in retrieved results helps reduce hallucinations and enables up-to-date answers.

"tools": [
  { "iseerSearch": {} }
]

When enabled, the model will autonomously decide when to search the web, execute retrieval, and incorporate the results into its spoken response.

Combining Multiple Tools

You can pass multiple tools simultaneously. The model will use them intelligently based on context:

"tools": [
  { "iseerSearch": {} },
  {
    "functionDeclarations": [
      { "name": "get_user_profile", "description": "Fetches user profile data." },
      { "name": "schedule_meeting", "description": "Schedules a meeting." }
    ]
  }
]

Session Management

Audio tokens accumulate at approximately 25 tokens per second. Without management, sessions hit hard limits quickly.

Session Type | Default Limit
Audio-only | 15 minutes
Audio + Video | 2 minutes
With Context Compression | Unlimited

Context Window Compression

Enable a sliding-window mechanism that automatically evicts old context when the token count exceeds a configurable threshold:

"config": {
  "contextWindowCompression": {
    "slidingWindow": {},
    "triggerTokens": 25000
  }
}

When the context window hits triggerTokens, the API compresses older content and retains only the most recent window. This enables unlimited session duration.
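
As a back-of-the-envelope check on choosing triggerTokens: at roughly 25 audio tokens per second, input audio alone reaches a 25,000-token trigger after 25,000 / 25 = 1,000 seconds, about 16.7 minutes. The helpers below only restate that arithmetic; the function names are hypothetical, and model output, text, and video tokens are ignored, so real sessions will reach the trigger sooner.

```javascript
const AUDIO_TOKENS_PER_SECOND = 25; // approximate rate stated in this guide

// Audio tokens accumulated after a given number of seconds of audio.
function audioTokens(seconds, tokensPerSecond = AUDIO_TOKENS_PER_SECOND) {
  return seconds * tokensPerSecond;
}

// Seconds of audio needed to reach triggerTokens, counting audio tokens only.
function secondsUntilCompression(triggerTokens, tokensPerSecond = AUDIO_TOKENS_PER_SECOND) {
  return triggerTokens / tokensPerSecond;
}

console.log(secondsUntilCompression(25000)); // 1000
console.log(audioTokens(5 * 60));            // 7500
```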

Session Resumption

The server may periodically reset the WebSocket connection (~every 10 minutes). Session resumption lets you seamlessly reconnect without losing any conversational context.

How it works:

  1. Include sessionResumption: {} in your setup config.
  2. The server periodically sends sessionResumptionUpdate messages containing a newHandle.
  3. Store the latest newHandle.
  4. When reconnecting, pass it as the handle in your new setup config.

"config": {
  "sessionResumption": {
    "handle": "PREVIOUS_HANDLE_HERE"
  }
}

Receiving updates:

{
  "sessionResumptionUpdate": {
    "resumable": true,
    "newHandle": "eyJhbGciOiJS..."
  }
}

Resumption handles are valid for 2 hours after the last session terminates. You can reuse the same ephemeral token to reconnect, even if it was configured with uses: 1.
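
The four steps can be sketched as a small tracker that remembers the most recent handle and produces the sessionResumption block for the next setup config. createResumptionTracker is a hypothetical helper, and ignoring updates where resumable is false is an assumption made here.

```javascript
// Hypothetical tracker for session resumption handles.
function createResumptionTracker() {
  let handle = null;
  return {
    // Call this with every parsed server message.
    handleServerMessage(msg) {
      const update = msg.sessionResumptionUpdate;
      if (update?.resumable && update.newHandle) handle = update.newHandle;
    },
    // Returns the config fragment to merge into the next setup config.
    nextConfig() {
      return { sessionResumption: handle ? { handle } : {} };
    },
  };
}
```

Before the first update arrives, nextConfig() returns { sessionResumption: {} }, which matches step 1.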

GoAway Messages

Before the server terminates a connection, it sends a goAway message with a timeLeft field indicating remaining time. Use this to gracefully wrap up or initiate session resumption:

{
  "goAway": {
    "timeLeft": "30s"
  }
}
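
To act on timeLeft you need it as a number. The sketch below assumes the value is always a decimal number of seconds followed by "s", as in the example above; other formats are not handled and yield null. parseTimeLeftMs is a hypothetical helper.

```javascript
// Hypothetical parser: converts a timeLeft string such as "30s" to milliseconds.
// Returns null when the value does not match the assumed "<seconds>s" format.
function parseTimeLeftMs(timeLeft) {
  const match = /^(\d+(?:\.\d+)?)s$/.exec(timeLeft ?? "");
  return match ? Math.round(parseFloat(match[1]) * 1000) : null;
}
```

A client might start reconnecting with its latest resumption handle once parseTimeLeftMs(msg.goAway.timeLeft) drops below a margin of its choosing.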

Token Counting & Analytics

The server periodically streams usageMetadata messages so you can track consumption in real-time:

{
  "usageMetadata": {
    "totalTokenCount": 4250,
    "responseTokensDetails": [
      { "modality": "AUDIO", "tokenCount": 3800 },
      { "modality": "TEXT", "tokenCount": 450 }
    ]
  }
}
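
For dashboards or logging, it can help to flatten each message into a per-modality summary. This sketch reads only the fields shown above and makes no assumption about whether counts are cumulative or per-message; summarizeUsage is a hypothetical helper.

```javascript
// Hypothetical summarizer for a single usageMetadata message.
function summarizeUsage(msg) {
  const usage = msg.usageMetadata;
  if (!usage) return null;
  const byModality = {};
  for (const detail of usage.responseTokensDetails ?? []) {
    byModality[detail.modality] = (byModality[detail.modality] ?? 0) + detail.tokenCount;
  }
  return { total: usage.totalTokenCount ?? 0, byModality };
}
```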

Billing Model

The Live API bills per turn for all tokens in the active context window:

  • Accumulation: Each turn's cost includes new tokens plus all accumulated tokens from previous turns (re-processed for context).
  • Audio tokens: Billed at the audio input rate (~25 tokens/sec of audio).
  • Transcription surcharge: When transcription is enabled, text tokens are billed additionally at the text output rate.
  • Cost control: Use contextWindowCompression with a triggerTokens limit to cap per-turn costs.
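
To see why accumulation matters, consider the tokens billed on each turn when nothing is evicted. The sketch below is a simplified model, not the actual billing formula: it assumes each turn is billed for all tokens accumulated so far, ignores compression, and does not apply any prices. billedTokensPerTurn is a hypothetical helper.

```javascript
// Simplified model: tokens billed on each turn = new tokens + everything before.
function billedTokensPerTurn(newTokensPerTurn) {
  let context = 0;
  return newTokensPerTurn.map((n) => {
    context += n;
    return context;
  });
}

const perTurn = billedTokensPerTurn([500, 750, 250]);
console.log(perTurn);                             // [ 500, 1250, 1500 ]
console.log(perTurn.reduce((a, b) => a + b, 0));  // 3250
```

Under this model, three turns that add 1,500 new tokens in total are billed for 3,250 tokens, which is the growth that contextWindowCompression is meant to bound.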

Security: Ephemeral Tokens

Ephemeral tokens secure client-to-API WebSocket connections. They are short-lived, single-use by default (see the uses parameter below), and keep your long-lived API keys out of client-side code.

How It Works

  1. Your backend authenticates the user.
  2. Your backend requests an ephemeral token from the Iseer provisioning API.
  3. The token is sent to the client.
  4. The client connects directly to wss://genai.api.iseer.co using the token as access_token.
  5. The token expires automatically (default: 30 minutes).

Token Configuration

Parameter | Description | Default
uses | Number of sessions this token can initiate. | 1
expireTime | Absolute expiration time (ISO 8601). | 30 minutes from creation
newSessionExpireTime | How long the token can be used to start new sessions. | 1 minute from creation

Locking Tokens to Configurations

For maximum security, you can lock an ephemeral token to a specific set of configurations. This guarantees that even if a malicious user intercepts the token, they cannot change the system instruction, temperature, tools, or model:

{
  "uses": 1,
  "liveConnectConstraints": {
    "model": "models/arete-live",
    "config": {
      "sessionResumption": {},
      "temperature": 0.7,
      "responseModalities": ["AUDIO"],
      "systemInstruction": {
        "parts": [{ "text": "You are a secure Iseer agent." }]
      }
    }
  }
}

The ephemeral token must be passed as the access_token query parameter in the WebSocket URL, or in an HTTP Authorization header prefixed with the auth-scheme Token.
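
For the query-parameter option, building the URL with the standard URL class ensures the token is percent-encoded correctly. This is a sketch; buildLiveUrl is a hypothetical helper, and the default host is the one used in the examples in this guide.

```javascript
// Hypothetical helper: attaches an ephemeral token as the access_token query parameter.
function buildLiveUrl(token, base = "wss://genai.api.iseer.co") {
  const url = new URL(base);
  url.searchParams.set("access_token", token); // handles percent-encoding
  return url.toString();
}
```

The result can be passed directly to new WebSocket(...). Note that the URL class normalizes the empty path to "/", so the output contains "/?access_token=" rather than "?access_token=".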


Best Practices

System Instructions

  1. Define the persona first: Name, role, characteristics, accent, preferred language.
  2. Specify conversational rules: Separate one-time steps (e.g., "gather user info") from loops (e.g., "discuss topics freely").
  3. Specify tool invocation conditions: Describe when each tool should be called in clear, separate sentences.
  4. Add guardrails last: Define what the model should never do.

Streaming Audio

  • Chunk size: Send 20ms–40ms audio chunks for the lowest latency, and never buffer more than ~100ms before sending.
  • Interruption handling: When you receive interrupted: true, immediately stop playback and clear your audio buffer.
  • Resampling: Always resample microphone input (44.1kHz/48kHz) to 16kHz before transmission.

Context Management

  • Enable contextWindowCompression for any session expected to last more than a few minutes.
  • Audio tokens accumulate at ~25 tokens/second. A 5-minute conversation without compression consumes ~7,500 tokens.
  • For longer contexts, provide a single summary message rather than full turn-by-turn history.

Prompt Design

  • Use clear, concise prompts. Provide examples of what the model should and shouldn't do.
  • Limit prompts to one persona per system instruction. Use prompt chaining for multi-role scenarios.
  • To have the model initiate the conversation, include a prompt asking it to greet the user.

Session Limits & Context Window

Parameter | Value
Context window | 128k tokens
Audio-only session (no compression) | 15 minutes
Audio+Video session (no compression) | 2 minutes
With compression | Unlimited
Connection lifetime | ~10 minutes (use session resumption)
Audio token rate | ~25 tokens/second
Max video frame rate | 1 FPS
Input audio format | PCM 16-bit LE, 16kHz mono
Output audio format | PCM 16-bit LE, 24kHz mono

Need help? If you experience connectivity issues, ensure your system clock is synced—ephemeral tokens are time-sensitive. Contact developer@iseer.co for support.