Iseer Live API — Complete Reference

The Iseer Live API enables real-time, bidirectional interaction with the arete-live model, supporting audio, video, and text inputs with low-latency native audio output. This reference covers its features, configuration options, and recommended practices.

Overview

The Iseer Live API uses WebSockets for persistent, duplex communication. Unlike traditional request-response APIs, a Live session maintains continuous state, enabling natural conversations with interruptions, tool calls, and multimodal streaming.

Key Concepts

Session: A persistent, stateful WebSocket connection to the arete-live model.
Setup Message: The initial JSON payload sent immediately after connecting that defines modalities, voice, system instructions, tools, and all session parameters.
Real-time Input: Client-side streaming of PCM audio, JPEG/PNG video frames, or text.
Server Content: Model responses streamed back as native audio, text transcriptions, or tool call requests.
Turn: One cycle of user input → model response. Billing and context accumulation happen per turn.

Authentication

All requests to the Iseer Live API require an Iseer API Key. API keys use the format iseer_live_<random_string> and are distinct from any third-party credentials.

Obtaining an API Key

Contact developer@iseer.co to request API access. You will receive a key in the format:

iseer_live_a1B2c3D4e5F6g7H8i9J0k1L2m3N4o5P6

Using Your API Key

Pass the key either as a query parameter or in an Authorization header:

Query parameter:

wss://genai.api.iseer.co?key=iseer_live_YOUR_API_KEY

Authorization header:

Authorization: Bearer iseer_live_YOUR_API_KEY

Security: Never expose your API key in client-side code in production. Use a backend service to broker WebSocket connections or provision ephemeral tokens.

Establishing a Connection

Direct WebSocket Connection

Open a WebSocket to wss://genai.api.iseer.co with your API key, then immediately send the setup payload:

const ws = new WebSocket('wss://genai.api.iseer.co?key=iseer_live_YOUR_API_KEY');
 
ws.onopen = () => {
  ws.send(JSON.stringify({
    setup: {
      model: "models/arete-live",
      generationConfig: {
        responseModalities: ["AUDIO"]
      },
      systemInstruction: {
        parts: [{ text: "You are a helpful Iseer assistant." }]
      }
    }
  }));
};
 
ws.onmessage = (event) => {
  const msg = JSON.parse(event.data);
  if (msg.setupComplete) {
    console.log("Session established. Ready to stream.");
  }
};

Using the Provisioning API

For production applications, request a session via the provisioning API:

curl -X POST https://api.iseer.co/api/live-connect \
  -H "Authorization: Bearer iseer_live_YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "system_instruction": "You are a helpful Iseer assistant.",
    "response_modalities": ["AUDIO"]
  }'

Response:

{
  "url": "wss://genai.api.iseer.co?access_token=YOUR_EPHEMERAL_TOKEN",
  "setup": {
    "setup": {
      "model": "models/arete-live",
      "generationConfig": {
        "responseModalities": ["AUDIO"]
      },
      "systemInstruction": {
        "parts": [{ "text": "You are a helpful Iseer assistant." }]
      }
    }
  }
}
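
One plausible way for a client to consume this response is sketched below. It assumes that url is the endpoint to open and that the setup field holds the complete first message, to be sent unchanged; the response shape above suggests this but does not state it. planConnection is a hypothetical helper name, not part of the API.

```javascript
// Hypothetical helper: turns a provisioning response into the two things the
// client needs: the WebSocket URL and the first message to send after onopen.
// Assumption: the `setup` field is sent verbatim as the setup message.
function planConnection(provisioningResponse) {
  const { url, setup } = provisioningResponse;
  if (typeof url !== "string" || !url.startsWith("wss://")) {
    throw new Error("provisioning response is missing a wss:// url");
  }
  if (!setup || typeof setup !== "object" || !setup.setup) {
    throw new Error("provisioning response is missing the setup message");
  }
  return { url, firstMessage: JSON.stringify(setup) };
}

// Example using the response shape shown above.
const plan = planConnection({
  url: "wss://genai.api.iseer.co?access_token=YOUR_EPHEMERAL_TOKEN",
  setup: {
    setup: {
      model: "models/arete-live",
      generationConfig: { responseModalities: ["AUDIO"] },
    },
  },
});

// In a browser (not executed here):
// const ws = new WebSocket(plan.url);
// ws.onopen = () => ws.send(plan.firstMessage);
```

Keeping this step a pure function makes it easy to validate the backend response before any socket is opened.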

Interaction Modalities

Once setupComplete is received, you can begin streaming data in any combination of the following modalities.

Sending Audio (Speech-to-Speech)

Input audio must be raw, little-endian, 16-bit PCM. The native input format is 16kHz mono. If you send a different sample rate, declare it in the MIME type (for example, audio/pcm;rate=48000) and the API will resample it.

{
  "realtimeInput": {
    "audio": {
      "mimeType": "audio/pcm;rate=16000",
      "data": "BASE64_ENCODED_PCM_DATA"
    }
  }
}

Best Practice: Chunk your audio into 20ms–100ms segments (320–1600 samples, or 640–3200 bytes of 16-bit mono PCM at 16kHz) and send them continuously. Do NOT buffer more than ~100ms before sending; smaller chunks minimize latency. If your microphone captures at 44.1kHz or 48kHz, resample to 16kHz on the client before transmission rather than relying on server-side resampling.
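
As a rough sketch of the chunking described above: a 20ms chunk at 16kHz is 320 samples, which is 640 bytes of 16-bit PCM. The helper names below are hypothetical, and the input is assumed to be a Float32Array in the range [-1, 1], as the Web Audio API produces.

```javascript
const SAMPLE_RATE = 16000; // Hz, the native input rate
const CHUNK_MS = 20;       // chunk duration in milliseconds
const SAMPLES_PER_CHUNK = (SAMPLE_RATE * CHUNK_MS) / 1000; // 320 samples

// Convert Float32 samples in [-1, 1] to little-endian 16-bit PCM bytes.
function floatToPcm16(samples) {
  const bytes = new Uint8Array(samples.length * 2);
  const view = new DataView(bytes.buffer);
  for (let i = 0; i < samples.length; i++) {
    const clamped = Math.max(-1, Math.min(1, samples[i]));
    // Scale asymmetrically so -1 maps to -32768 and +1 maps to 32767.
    const int16 = clamped < 0 ? clamped * 0x8000 : clamped * 0x7fff;
    view.setInt16(i * 2, Math.round(int16), true); // true = little-endian
  }
  return bytes;
}

// Base64-encode a byte array (btoa exists in browsers and Node 16+).
function toBase64(bytes) {
  let binary = "";
  for (const b of bytes) binary += String.fromCharCode(b);
  return btoa(binary);
}

// Split a Float32Array into 20 ms chunks, one realtimeInput message each.
function toRealtimeAudioMessages(samples) {
  const messages = [];
  for (let i = 0; i < samples.length; i += SAMPLES_PER_CHUNK) {
    const chunk = samples.subarray(i, i + SAMPLES_PER_CHUNK);
    messages.push({
      realtimeInput: {
        audio: {
          mimeType: `audio/pcm;rate=${SAMPLE_RATE}`,
          data: toBase64(floatToPcm16(chunk)),
        },
      },
    });
  }
  return messages;
}
```

In a live client you would call these helpers per capture callback and send each message with ws.send(JSON.stringify(message)) as soon as it is built, rather than collecting a whole array first.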

Audio Stream End

When the audio stream is paused for more than one second (e.g., the user mutes their mic), send an audioStreamEnd event to flush any buffered audio on the server. You can resume sending audio data at any time afterwards.

{
  "realtimeInput": {
    "audioStreamEnd": true
  }
}

Sending Video (Vision)

The arete-live model supports real-time vision. Send individual image frames (JPEG or PNG) at a maximum rate of 1 frame per second.

{
  "realtimeInput": {
    "video": {
      "mimeType": "image/jpeg",
      "data": "BASE64_ENCODED_JPEG_DATA"
    }
  }
}
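
To stay within the 1 frame per second limit, a client can gate frames through a small throttle. The sketch below is illustrative only; createFrameThrottle and toVideoMessage are hypothetical names, and timestamps are passed in explicitly so the logic does not depend on a particular clock.

```javascript
// Returns a function that reports whether a frame captured at `nowMs`
// may be sent without exceeding one frame per `minIntervalMs`.
function createFrameThrottle(minIntervalMs = 1000) {
  let lastSentMs = -Infinity;
  return function shouldSend(nowMs) {
    if (nowMs - lastSentMs < minIntervalMs) return false;
    lastSentMs = nowMs;
    return true;
  };
}

// Wrap an already base64-encoded JPEG frame as a realtimeInput message.
function toVideoMessage(base64Jpeg) {
  return {
    realtimeInput: { video: { mimeType: "image/jpeg", data: base64Jpeg } },
  };
}
```

A capture loop would call shouldSend(performance.now()) for each available frame and drop frames for which it returns false.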

Sending Text

You can inject text into the conversation via realtimeInput:

{
  "realtimeInput": {
    "text": "Hello, how are you?"
  }
}

Incremental Content Updates (Context Seeding)

Use clientContent to seed initial conversation history or inject context. This is useful for restoring a session or providing the model with background information before the user speaks.

{
  "clientContent": {
    "turns": [
      { "role": "user", "parts": [{ "text": "What is the capital of France?" }] },
      { "role": "model", "parts": [{ "text": "Paris." }] },
      { "role": "user", "parts": [{ "text": "And Germany?" }] }
    ],
    "turnComplete": true
  }
}

Note: For arete-live, clientContent is primarily supported for seeding initial context history. During active conversation, use realtimeInput with the text field instead.

Receiving Output

The model streams responses as JSON payloads containing serverContent.

Receiving Native Audio

Output audio is raw, little-endian, 16-bit PCM at 24kHz mono. A single server event may contain multiple content parts simultaneously (e.g., inlineData and transcript together). Ensure your code processes all parts in each event.

ws.onmessage = (event) => {
  const msg = JSON.parse(event.data);
  const parts = msg.serverContent?.modelTurn?.parts;
  if (parts) {
    for (const part of parts) {
      if (part.inlineData?.mimeType?.includes("audio/pcm")) {
        playAudioBuffer(part.inlineData.data); // base64 → PCM → AudioContext
      }
    }
  }
};
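
The playAudioBuffer call above is a placeholder. Its decoding step might look like the following sketch, which converts base64 little-endian 16-bit PCM into Float32 samples suitable for the Web Audio API. decodePcm16 is a hypothetical helper name; the commented playback lines are browser-only and schedule nothing beyond a single buffer.

```javascript
const OUTPUT_SAMPLE_RATE = 24000; // Hz, per the output format described above

// Decode base64 little-endian 16-bit PCM into Float32 samples in [-1, 1).
function decodePcm16(base64) {
  const binary = atob(base64); // available in browsers and Node 16+
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) bytes[i] = binary.charCodeAt(i);
  const view = new DataView(bytes.buffer);
  const samples = new Float32Array(Math.floor(bytes.length / 2));
  for (let i = 0; i < samples.length; i++) {
    samples[i] = view.getInt16(i * 2, true) / 32768; // true = little-endian
  }
  return samples;
}

// Browser-only usage sketch (not executed here):
// const ctx = new AudioContext({ sampleRate: OUTPUT_SAMPLE_RATE });
// const samples = decodePcm16(part.inlineData.data);
// const buffer = ctx.createBuffer(1, samples.length, OUTPUT_SAMPLE_RATE);
// buffer.copyToChannel(samples, 0);
// const node = ctx.createBufferSource();
// node.buffer = buffer;
// node.connect(ctx.destination);
// node.start();
```

For gapless playback a real client would also schedule consecutive buffers back to back instead of starting each one immediately.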

Turn Completion

When the model finishes its response, it sends a turnComplete signal:

{ "serverContent": { "turnComplete": true } }

Generation Complete

Distinct from turnComplete, the generationComplete flag signals that the model has finished all generation for the current request (important for multi-part or tool-assisted responses):

{ "serverContent": { "generationComplete": true } }

Audio Transcriptions

You can request real-time text transcriptions of both input audio (what the user says) and output audio (what the model says). This is essential for displaying subtitles, logging conversations, or feeding text to downstream systems.

Enabling Transcriptions

Include the transcription flags in your setup configuration:

"config": {
  "responseModalities": ["AUDIO"],
  "outputAudioTranscription": {},
  "inputAudioTranscription": {}
}

Receiving Transcriptions

Transcriptions arrive interwoven with audio chunks:

{
  "serverContent": {
    "inputTranscription": { "text": "What's the weather like today?" }
  }
}

{
  "serverContent": {
    "outputTranscription": { "text": "It's currently 72 degrees and sunny." }
  }
}
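
For subtitles or logging, the fragments can be collected into per-speaker lines. The sketch below assumes that transcription text arrives as ordered fragments to be concatenated as-is and that turnComplete closes the current line; neither detail is specified above. createTranscriptLog is a hypothetical helper.

```javascript
// Collects transcription fragments into per-speaker lines.
// Assumption: fragments arrive in order and are concatenated unchanged.
function createTranscriptLog() {
  const lines = [];   // finished lines: { speaker, text }
  let current = null; // line currently being built

  function flush() {
    if (current && current.text.length > 0) lines.push(current);
    current = null;
  }

  function append(speaker, text) {
    if (current && current.speaker !== speaker) flush(); // speaker changed
    if (!current) current = { speaker, text: "" };
    current.text += text;
  }

  return {
    lines,
    handle(msg) {
      const sc = msg.serverContent;
      if (!sc) return;
      if (sc.inputTranscription?.text) append("user", sc.inputTranscription.text);
      if (sc.outputTranscription?.text) append("model", sc.outputTranscription.text);
      if (sc.turnComplete) flush();
    },
  };
}
```

Call log.handle(msg) from the same onmessage handler that processes audio parts, since both kinds of content can arrive interleaved.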

Voice & Language Configuration

Changing the Voice

The arete-live model supports a wide range of voices. Specify a voice in your setup config:

"config": {
  "responseModalities": ["AUDIO"],
  "speechConfig": {
    "voiceConfig": {
      "prebuiltVoiceConfig": { "voiceName": "Kore" }
    }
  }
}

Language Support

The model supports 97 languages and can switch between them naturally during conversation. You can restrict which languages it speaks in via your system instructions. If you need the model to respond in a non-English language, include in your system instructions:

RESPOND IN {LANGUAGE}. YOU MUST RESPOND UNMISTAKABLY IN {LANGUAGE}.

Language	Code	Language	Code	Language	Code
Arabic	`ar`	English	`en`	Hindi	`hi`
Bengali	`bn`	French	`fr`	Japanese	`ja`
Chinese	`zh`	German	`de`	Korean	`ko`
Dutch	`nl`	Indonesian	`id`	Portuguese	`pt`
Filipino	`fil`	Italian	`it`	Russian	`ru`
Spanish	`es`	Turkish	`tr`	Vietnamese	`vi`

...and 79 more. The model automatically detects and responds in the user's language.

Thinking & Reasoning

The arete-live model uses dynamic internal thinking to reason through complex problems before speaking. You can control the depth of reasoning using the thinkingLevel parameter.

"config": {
  "thinkingConfig": {
    "thinkingLevel": "low"
  }
}

Level	Behavior
`minimal`	Fastest responses, least reasoning. Default.
`low`	Light reasoning with low latency.
`medium`	Balanced reasoning and speed.
`high`	Deep reasoning, higher latency.

Thought Summaries

You can also receive summaries of the model's internal thought process by setting includeThoughts: true:

"config": {
  "thinkingConfig": {
    "thinkingLevel": "medium",
    "includeThoughts": true
  }
}

Voice Activity Detection (VAD)

VAD allows the model to recognize when a person is speaking. This is essential for natural conversations—it lets the user interrupt the model at any time.

When VAD detects an interruption, the ongoing generation is cancelled and discarded. Only the content already sent to the client is retained in the session context. The server sends:

{ "serverContent": { "interrupted": true } }

Client Implementation: When you receive interrupted: true, you must immediately stop playback and clear your client-side audio buffer to prevent the model from talking over the user.
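
A minimal sketch of that client behavior follows. It keeps undelivered audio in a queue and empties it on interruption. createPlaybackQueue and the stopPlayback callback are hypothetical; how audio is actually stopped depends on your player.

```javascript
// Minimal playback queue: audio chunks wait here until the player takes them.
// `stopPlayback` is a hypothetical callback that halts what is audible now.
function createPlaybackQueue(stopPlayback) {
  const pending = [];
  return {
    pending,
    handle(msg) {
      const sc = msg.serverContent;
      if (!sc) return;
      if (sc.interrupted) {
        stopPlayback();     // stop the chunk that is currently playing
        pending.length = 0; // drop everything queued but not yet played
        return;
      }
      for (const part of sc.modelTurn?.parts ?? []) {
        if (part.inlineData?.mimeType?.includes("audio/pcm")) {
          pending.push(part.inlineData.data);
        }
      }
    },
  };
}
```

Clearing the queue matters as much as stopping the current chunk: otherwise stale audio resumes as soon as the player asks for the next buffer.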

Automatic VAD (Default)

By default, the model automatically performs VAD on the incoming audio stream. You can fine-tune it:

"config": {
  "realtimeInputConfig": {
    "automaticActivityDetection": {
      "disabled": false,
      "startOfSpeechSensitivity": "START_SENSITIVITY_LOW",
      "endOfSpeechSensitivity": "END_SENSITIVITY_LOW",
      "prefixPaddingMs": 20,
      "silenceDurationMs": 800
    }
  }
}

Understanding VAD Parameters

Parameter	Description	Recommendation
`prefixPaddingMs`	Audio included before speech is detected (look-back buffer). `0` may clip first syllables.	`20`ms
`silenceDurationMs`	How long to wait through silence before ending a speech turn.	`500`–`800`ms
`startOfSpeechSensitivity`	How eagerly the system detects the start of speech.	`START_SENSITIVITY_LOW` for noisy environments
`endOfSpeechSensitivity`	How eagerly the system detects the end of speech.	`END_SENSITIVITY_LOW` for users who pause often

Warning: Setting silenceDurationMs below 300ms will cause the model to split natural sentences at every breath or pause, resulting in fragmented context and degraded response quality. We strongly recommend 500ms–800ms.

Manual VAD (Client-Side)

You can disable automatic VAD and manage speech boundaries yourself using activityStart and activityEnd signals:

"config": {
  "realtimeInputConfig": {
    "automaticActivityDetection": {
      "disabled": true
    }
  }
}

Then send explicit boundary signals:

{ "realtimeInput": { "activityStart": {} } }

{ "realtimeInput": { "activityEnd": {} } }

Best Practice for Manual VAD: Use an end-of-speech silence threshold of at least 500ms in your client-side detector. Below this, audio becomes fragmented and response quality degrades significantly.
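
As an illustration only, a very simple energy-based detector with a 500ms end-of-speech threshold could look like this. The RMS threshold of 0.02 and the 20ms frame length are assumed values to tune for your microphone, not values defined by the API, and createManualVad is a hypothetical helper. Production clients typically use a more robust detector.

```javascript
// Energy-based detector. Feed it fixed-length frames of Float32 samples;
// it returns the boundary messages (if any) to send for that frame.
// `rmsThreshold` is an illustrative value, not one defined by the API.
function createManualVad({ frameMs = 20, silenceMs = 500, rmsThreshold = 0.02 } = {}) {
  let speaking = false;
  let silentForMs = 0;

  return function processFrame(samples) {
    let sumSquares = 0;
    for (const s of samples) sumSquares += s * s;
    const rms = samples.length > 0 ? Math.sqrt(sumSquares / samples.length) : 0;
    const out = [];

    if (rms >= rmsThreshold) {
      silentForMs = 0; // any loud frame resets the silence timer
      if (!speaking) {
        speaking = true;
        out.push({ realtimeInput: { activityStart: {} } });
      }
    } else if (speaking) {
      silentForMs += frameMs;
      if (silentForMs >= silenceMs) {
        speaking = false;
        silentForMs = 0;
        out.push({ realtimeInput: { activityEnd: {} } });
      }
    }
    return out;
  };
}
```

Send any returned messages over the socket in order, alongside the audio chunks for the same frames.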

Advanced Audio Features

Affective Dialog

This feature lets the model adapt its response style—tone, pace, emotion—to match the expression and tone of the user's vocal input. Enable it in your setup:

"config": {
  "enableAffectiveDialog": true
}

Proactive Audio

When enabled, the model can intelligently decide not to respond if the content it hears is not relevant (e.g., background conversation, TV noise). This prevents unnecessary interruptions.

"config": {
  "proactivity": {
    "proactiveAudio": true
  }
}

Media Resolution

You can control the resolution of input media (images/video frames) to trade off between quality and token consumption:

"config": {
  "mediaResolution": "MEDIA_RESOLUTION_LOW"
}

Tool Calling (Function Execution)

The Iseer Live API natively supports bidirectional tool calling. You declare tools during setup, and the model can invoke them mid-conversation.

Defining Tools

Include function declarations in your setup config. Each function needs a name, description, and parameters schema:

"tools": [
  {
    "functionDeclarations": [
      {
        "name": "get_weather",
        "description": "Gets the current weather for a location. Invoke when the user asks about weather.",
        "parameters": {
          "type": "OBJECT",
          "properties": {
            "location": {
              "type": "STRING",
              "description": "City name or coordinates"
            }
          },
          "required": ["location"]
        }
      }
    ]
  }
]

Best Practice: Include an invocation condition in each tool description (e.g., "Invoke this tool only after the user provides their location"). This helps the model decide when to call each tool.

Handling Tool Calls

When the model needs to call a tool, it pauses audio generation and sends a toolCall:

{
  "toolCall": {
    "functionCalls": [
      {
        "id": "call_abc123",
        "name": "get_weather",
        "args": { "location": "San Francisco" }
      }
    ]
  }
}

Responding to Tool Calls

Execute the function locally and send the result back immediately:

{
  "toolResponse": {
    "functionResponses": [
      {
        "id": "call_abc123",
        "name": "get_weather",
        "response": {
          "temperature": "72°F",
          "condition": "Sunny",
          "humidity": "45%"
        }
      }
    ]
  }
}

The model will then resume speaking and incorporate the tool result into its response.
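
The round trip can be sketched as a function that maps a toolCall message to a toolResponse message using a local registry of handlers. buildToolResponse is a hypothetical helper; handlers are synchronous here for brevity, and the error field used for failures is an assumption rather than a documented convention.

```javascript
// Maps a toolCall message to a toolResponse message using a local registry.
// Handlers are synchronous here for brevity; real handlers are often async.
function buildToolResponse(msg, handlers) {
  const calls = msg.toolCall?.functionCalls ?? [];
  const functionResponses = calls.map(({ id, name, args }) => {
    let response;
    try {
      const known = Object.prototype.hasOwnProperty.call(handlers, name);
      if (!known || typeof handlers[name] !== "function") {
        throw new Error(`unknown tool: ${name}`);
      }
      response = handlers[name](args ?? {});
    } catch (err) {
      // Assumption: reporting failures in an `error` field lets the model
      // explain the problem instead of waiting for a result.
      response = { error: String(err.message ?? err) };
    }
    return { id, name, response }; // echo the id so the call can be matched
  });
  return { toolResponse: { functionResponses } };
}
```

Every functionCalls entry gets exactly one response with the same id, including calls that fail, so the model is never left waiting on a blocking call.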

Synchronous vs Asynchronous Function Calling

By default, function calling is synchronous—the model pauses all interaction until your tool response arrives. For non-blocking (asynchronous) execution, set behavior: "NON_BLOCKING" on the function declaration:

{
  "functionDeclarations": [
    {
      "name": "send_email",
      "description": "Sends an email to the user.",
      "behavior": "NON_BLOCKING"
    }
  ]
}

For non-blocking functions, control how the model reacts when your response arrives by setting the scheduling field inside the function's response object in the toolResponse:

Scheduling	Behavior
`INTERRUPT`	The model stops what it's saying and immediately addresses the tool result.
`WHEN_IDLE`	The model waits until it finishes its current thought, then addresses the result.
`SILENT`	The model absorbs the information silently and uses it later in the conversation.

{
  "toolResponse": {
    "functionResponses": [
      {
        "id": "call_xyz",
        "name": "send_email",
        "response": {
          "result": "sent",
          "scheduling": "WHEN_IDLE"
        }
      }
    ]
  }
}

Iseer Web Search (Grounding)

Give the arete-live model access to real-time web information via the Iseer Web Search tool. This grounds answers in current information and helps reduce hallucinations.

"tools": [
  { "iseerSearch": {} }
]

When enabled, the model will autonomously decide when to search the web, execute retrieval, and incorporate the results into its spoken response.

Combining Multiple Tools

You can pass multiple tools simultaneously. The model will use them intelligently based on context:

"tools": [
  { "iseerSearch": {} },
  {
    "functionDeclarations": [
      { "name": "get_user_profile", "description": "Fetches user profile data." },
      { "name": "schedule_meeting", "description": "Schedules a meeting." }
    ]
  }
]

Session Management

Audio tokens accumulate at approximately 25 tokens per second. Without management, sessions hit hard limits quickly.

Session Type	Default Limit
Audio-only	15 minutes
Audio + Video	2 minutes
With Context Compression	Unlimited

Context Window Compression

Enable a sliding-window mechanism that automatically evicts old context when the token count exceeds a configurable threshold:

"config": {
  "contextWindowCompression": {
    "slidingWindow": {},
    "triggerTokens": 25000
  }
}

When the token count reaches triggerTokens, the sliding window evicts the oldest content and retains only the most recent window. This enables unlimited session duration.
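
As a back-of-the-envelope check, the stated rate of ~25 audio tokens per second gives the following estimates. This sketch counts audio tokens only and ignores text, video, and system-instruction tokens, so treat the results as rough lower bounds; both function names are hypothetical.

```javascript
const AUDIO_TOKENS_PER_SECOND = 25; // approximate rate stated above

// Estimated tokens contributed by `seconds` of audio.
function audioTokens(seconds) {
  return seconds * AUDIO_TOKENS_PER_SECOND;
}

// Estimated seconds of audio before the context reaches `triggerTokens`.
function secondsUntilTrigger(triggerTokens, alreadyUsedTokens = 0) {
  return Math.max(0, triggerTokens - alreadyUsedTokens) / AUDIO_TOKENS_PER_SECOND;
}

console.log(audioTokens(5 * 60));        // 7500 tokens for 5 minutes of audio
console.log(secondsUntilTrigger(25000)); // 1000 seconds, about 16.7 minutes
```

Under these assumptions, a triggerTokens value of 25000 corresponds to roughly 16.7 minutes of audio-only conversation before the window starts sliding.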

Session Resumption

The server may periodically reset the WebSocket connection (~every 10 minutes). Session resumption lets you seamlessly reconnect without losing any conversational context.

How it works:

Include sessionResumption: {} in your setup config.
The server periodically sends sessionResumptionUpdate messages containing a newHandle.
Store the latest newHandle.
When reconnecting, pass it as the handle in your new setup config.

"config": {
  "sessionResumption": {
    "handle": "PREVIOUS_HANDLE_HERE"
  }
}

Receiving updates:

{
  "sessionResumptionUpdate": {
    "resumable": true,
    "newHandle": "eyJhbGciOiJS..."
  }
}

Resumption handles are valid for 2 hours after the last session terminates.
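
The steps above can be sketched as a small tracker that remembers the latest handle and produces the sessionResumption fragment for the next setup config. createResumptionTracker is a hypothetical helper, and ignoring updates where resumable is false is an assumption about how that flag should be treated.

```javascript
// Tracks the most recent resumption handle across reconnects.
function createResumptionTracker() {
  let latestHandle = null;
  return {
    handle(msg) {
      const update = msg.sessionResumptionUpdate;
      // Assumption: only keep handles the server marks as resumable.
      if (update?.resumable && update.newHandle) latestHandle = update.newHandle;
    },
    // Fragment to merge into the next setup config: an empty object on the
    // first connection, or the stored handle when reconnecting.
    nextConfig() {
      return { sessionResumption: latestHandle ? { handle: latestHandle } : {} };
    },
  };
}
```

Persist the handle outside the page or process if you need to resume after a reload.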

GoAway Messages

Before the server terminates a connection, it sends a goAway message with a timeLeft field indicating remaining time. Use this to gracefully wrap up or initiate session resumption:

{
  "goAway": {
    "timeLeft": "30s"
  }
}
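
One way to act on this message is to parse timeLeft and schedule the reconnect slightly before the deadline. The sketch assumes timeLeft is always a number of seconds with an "s" suffix, as in the example above; other formats are treated as unknown. Both helper names and the 5-second margin are hypothetical.

```javascript
// Parse a duration such as "30s" or "1.5s" into milliseconds.
// Assumption: timeLeft is expressed in seconds with an "s" suffix;
// anything else returns null.
function parseTimeLeftMs(timeLeft) {
  const match = /^(\d+(?:\.\d+)?)s$/.exec(String(timeLeft).trim());
  return match ? Math.round(parseFloat(match[1]) * 1000) : null;
}

// Decide how long to wait before reconnecting, leaving a safety margin.
function reconnectDelayMs(goAwayMsg, marginMs = 5000) {
  const leftMs = parseTimeLeftMs(goAwayMsg.goAway?.timeLeft);
  if (leftMs === null) return 0; // unknown format: reconnect right away
  return Math.max(0, leftMs - marginMs);
}
```

A client might call setTimeout(reconnect, reconnectDelayMs(msg)), where reconnect opens a new socket using the latest resumption handle.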

Authentication: Iseer API Keys vs Ephemeral Tokens

The Iseer API uses a dual-token system for maximum security:

Iseer API Key (iseer_live_...): A long-lived, secret API key used by your backend server to authenticate with the Iseer API. Never expose this key in client-side code.
Ephemeral Token: A short-lived token (single-use by default) returned by the provisioning API. This token is passed to the client (browser/mobile app) to securely establish the WebSocket connection.

How It Works

Your backend authenticates the user.
Your backend requests an ephemeral token from the Iseer provisioning API, authenticating with your secret iseer_live_ API key.
The token is sent to the client.
The client connects directly to wss://genai.api.iseer.co using the token as access_token.
The token expires automatically (default: 30 minutes after creation; by default it can start a new session only within the first minute).

Token Configuration

Parameter	Description	Default
`uses`	Number of sessions this token can initiate.	`1`
`expireTime`	Absolute expiration time (ISO 8601).	30 minutes from creation
`newSessionExpireTime`	How long the token can be used to start new sessions.	1 minute from creation

Locking Tokens to Configurations

For maximum security, you can lock an ephemeral token to a specific set of configurations. This guarantees that even if a malicious user intercepts the token, they cannot change the system instruction, temperature, tools, or model:

{
  "uses": 1,
  "liveConnectConstraints": {
    "model": "models/arete-live",
    "config": {
      "sessionResumption": {},
      "temperature": 0.7,
      "responseModalities": ["AUDIO"],
      "systemInstruction": {
        "parts": [{ "text": "You are a secure Iseer agent." }]
      }
    }
  }
}

The ephemeral token must be passed as the access_token query parameter in the WebSocket URL, or in an HTTP Authorization header prefixed with the auth-scheme Token (that is, Authorization: Token YOUR_EPHEMERAL_TOKEN, not the Bearer scheme used for API keys).

Token Counting & Analytics

The server periodically streams usageMetadata messages so you can track consumption in real-time:

{
  "usageMetadata": {
    "totalTokenCount": 4250,
    "responseTokensDetails": [
      { "modality": "AUDIO", "tokenCount": 3800 },
      { "modality": "TEXT", "tokenCount": 450 }
    ]
  }
}

Billing Model

The Live API bills per turn for all tokens in the active context window:

Accumulation: Each turn's cost includes new tokens plus all accumulated tokens from previous turns (re-processed for context).
Audio tokens: Billed at the audio input rate (~25 tokens/sec of audio).
Transcription surcharge: When transcription is enabled, text tokens are billed additionally at the text output rate.
Cost control: Use contextWindowCompression with a triggerTokens limit to cap per-turn costs.
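
To see how accumulation affects cost, the following sketch totals the tokens billed across turns. It assumes every turn re-bills the entire accumulated context and models compression as a simple cap at triggerTokens; actual compression behavior and pricing may differ. billedTokens is a hypothetical helper.

```javascript
// tokensPerTurn[i] is the number of new tokens added in turn i.
// Returns the tokens billed for each turn and the session total, assuming
// each turn re-bills the whole accumulated context and, when `triggerTokens`
// is set, that compression caps the billed context at that value.
function billedTokens(tokensPerTurn, triggerTokens = Infinity) {
  let context = 0;
  const perTurn = tokensPerTurn.map((added) => {
    context = Math.min(context + added, triggerTokens);
    return context;
  });
  const total = perTurn.reduce((sum, n) => sum + n, 0);
  return { perTurn, total };
}

console.log(billedTokens([500, 500, 500, 500]).total);       // 5000
console.log(billedTokens([500, 500, 500, 500], 1200).total); // 3900
```

Four turns of 500 new tokens each are billed as 500 + 1000 + 1500 + 2000 = 5000 tokens without a cap, which shows why per-turn cost grows with session length.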

Best Practices

System Instructions

Define the persona first: Name, role, characteristics, accent, preferred language.
Specify conversational rules: Separate one-time steps (e.g., "gather user info") from loops (e.g., "discuss topics freely").
Specify tool invocation conditions: Describe when each tool should be called in clear, separate sentences.
Add guardrails last: Define what the model should never do.

Streaming Audio

Chunk size: Send 20ms–40ms audio chunks for the lowest latency, and never buffer more than ~100ms before sending.
Interruption handling: When you receive interrupted: true, immediately stop playback and clear your audio buffer.
Resampling: Always resample microphone input (44.1kHz/48kHz) to 16kHz before transmission.
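
For the resampling step, a crude sketch for integer ratios such as 48kHz to 16kHz (factor 3) is shown below. Averaging each group of samples acts as a rough low-pass filter only; a production resampler would use a proper anti-aliasing filter, and 44.1kHz input needs a non-integer resampler that this sketch does not cover. downsampleByFactor is a hypothetical helper.

```javascript
// Downsample by an integer factor, averaging each group of input samples.
// Averaging is only a crude low-pass filter. 48 kHz -> 16 kHz is factor 3.
function downsampleByFactor(samples, factor) {
  if (!Number.isInteger(factor) || factor < 1) {
    throw new Error("factor must be a positive integer");
  }
  const out = new Float32Array(Math.floor(samples.length / factor));
  for (let i = 0; i < out.length; i++) {
    let sum = 0;
    for (let j = 0; j < factor; j++) sum += samples[i * factor + j];
    out[i] = sum / factor;
  }
  return out; // trailing samples that do not fill a group are dropped
}
```

In a browser, creating the AudioContext with a 16kHz sample rate where supported can avoid manual resampling altogether.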

Context Management

Enable contextWindowCompression for any session expected to last more than a few minutes.
Audio tokens accumulate at ~25 tokens/second. A 5-minute conversation without compression consumes ~7,500 tokens.
For longer contexts, provide a single summary message rather than full turn-by-turn history.

Prompt Design

Use clear, concise prompts. Provide examples of what the model should and shouldn't do.
Limit prompts to one persona per system instruction. Use prompt chaining for multi-role scenarios.
To have the model initiate the conversation, include a prompt asking it to greet the user.

Session Limits & Context Window

Parameter	Value
Context window	128k tokens
Audio-only session (no compression)	15 minutes
Audio+Video session (no compression)	2 minutes
With compression	Unlimited
Connection lifetime	~10 minutes (use session resumption)
Audio token rate	~25 tokens/second
Max video frame rate	1 FPS
Input audio format	PCM 16-bit LE, 16kHz mono
Output audio format	PCM 16-bit LE, 24kHz mono

Need help? Contact developer@iseer.co for support.