Iseer Live API — Complete Reference
The Iseer Live API enables real-time, bidirectional interaction with the arete-live model, supporting audio, video, and text inputs with ultra-low latency native audio outputs. This is the definitive guide to every feature, configuration, and best practice available.
Overview
The Iseer Live API uses WebSockets for persistent, duplex communication. Unlike traditional request-response APIs, a Live session maintains continuous state, enabling natural conversations with interruptions, tool calls, and multimodal streaming.
Key Concepts
- Session: A persistent, stateful WebSocket connection to the arete-live model.
- Setup Message: The initial JSON payload sent immediately after connecting that defines modalities, voice, system instructions, tools, and all session parameters.
- Real-time Input: Client-side streaming of PCM audio, JPEG/PNG video frames, or text.
- Server Content: Model responses streamed back as native audio, text transcriptions, or tool call requests.
- Turn: One cycle of user input → model response. Billing and context accumulation happen per turn.
Establishing a Connection
Step 1: Request a Session
Send a POST request to /api/live-connect on https://api.iseer.co. The endpoint returns a secure WebSocket URL (with an ephemeral token) and the setup payload.
curl -X POST https://api.iseer.co/api/live-connect \
-H "Content-Type: application/json" \
-d '{
"system_instruction": "You are a helpful Iseer assistant.",
"response_modalities": ["AUDIO"]
}'
Response:
{
"url": "wss://genai.api.iseer.co?access_token=YOUR_EPHEMERAL_TOKEN",
"setup": {
"setup": {
"model": "models/arete-live",
"generationConfig": {
"responseModalities": ["AUDIO"]
},
"systemInstruction": {
"parts": [{ "text": "You are a helpful Iseer assistant." }]
}
}
}
}
Step 2: Connect via WebSocket
Open the WebSocket and immediately send the setup payload. Wait for the setupComplete acknowledgement before streaming any data.
const ws = new WebSocket(response.url);
ws.onopen = () => {
ws.send(JSON.stringify(response.setup));
};
ws.onmessage = (event) => {
const msg = JSON.parse(event.data);
if (msg.setupComplete) {
console.log("Session established. Ready to stream.");
}
};
Interaction Modalities
Once setupComplete is received, you can begin streaming data in any combination of the following modalities.
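One way to enforce the "wait for setupComplete" rule is to queue outgoing messages until the acknowledgement arrives. The sketch below is a minimal illustration, not part of the API: `createGatedSender` and its method names are hypothetical, and a plain callback stands in for `ws.send` so the example runs without a network connection.

```javascript
// Minimal sketch: hold outgoing messages until the server confirms setup.
// createGatedSender and its method names are hypothetical helpers.
function createGatedSender(send) {
  let ready = false;
  const pending = [];

  return {
    // Call this from ws.onmessage with the parsed server message.
    handleServerMessage(msg) {
      if (msg.setupComplete && !ready) {
        ready = true;
        // Flush everything queued while the session was being set up.
        while (pending.length > 0) send(JSON.stringify(pending.shift()));
      }
    },
    // Queue the message before setupComplete, send it directly afterwards.
    sendWhenReady(message) {
      if (ready) send(JSON.stringify(message));
      else pending.push(message);
    },
  };
}

// Usage with a stand-in for ws.send.
const sent = [];
const sender = createGatedSender((data) => sent.push(data));
sender.sendWhenReady({ realtimeInput: { text: "Hello" } });        // queued
sender.handleServerMessage({ setupComplete: {} });                  // flushes the queue
sender.sendWhenReady({ realtimeInput: { text: "Still there?" } });  // sent directly
```

In a real client you would pass `(data) => ws.send(data)` as the callback and call `handleServerMessage` from `ws.onmessage`.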
Sending Audio (Speech-to-Speech)
Input audio must be raw, little-endian, 16-bit PCM. The native sample rate is 16kHz mono, but the API will resample if you send a different rate (set via the MIME type).
{
"realtimeInput": {
"audio": {
"mimeType": "audio/pcm;rate=16000",
"data": "BASE64_ENCODED_PCM_DATA"
}
}
}
Best Practice: Chunk your audio into 20ms–100ms segments (320–1600 samples, or 640–3200 bytes, at 16kHz 16-bit mono) and send them continuously. Do NOT buffer more than ~100ms before sending; smaller chunks minimize latency. If your microphone captures at 44.1kHz or 48kHz, resample to 16kHz on the client before transmission.
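The chunking guidance can be sketched as follows. This is an illustrative example assuming a Node.js environment (it uses `Buffer` for byte handling and base64) and audio that is already at 16kHz; the helper names `floatTo16BitPcm` and `toAudioMessages` are hypothetical, and resampling is not shown.

```javascript
const SAMPLE_RATE = 16000;   // Hz, the native input rate described above
const BYTES_PER_SAMPLE = 2;  // 16-bit PCM

// Convert Float32 samples in [-1, 1] to little-endian 16-bit PCM bytes.
function floatTo16BitPcm(samples) {
  const out = Buffer.alloc(samples.length * BYTES_PER_SAMPLE);
  for (let i = 0; i < samples.length; i++) {
    const s = Math.max(-1, Math.min(1, samples[i]));
    const value = Math.round(s < 0 ? s * 0x8000 : s * 0x7fff);
    out.writeInt16LE(value, i * BYTES_PER_SAMPLE);
  }
  return out;
}

// Split PCM bytes into realtimeInput messages of chunkMs milliseconds each.
function toAudioMessages(pcm, chunkMs = 20) {
  const chunkBytes = Math.round((SAMPLE_RATE * chunkMs) / 1000) * BYTES_PER_SAMPLE;
  const messages = [];
  for (let offset = 0; offset < pcm.length; offset += chunkBytes) {
    const chunk = pcm.subarray(offset, offset + chunkBytes);
    messages.push({
      realtimeInput: {
        audio: {
          mimeType: `audio/pcm;rate=${SAMPLE_RATE}`,
          data: chunk.toString("base64"),
        },
      },
    });
  }
  return messages;
}

// Example input: one second of a 440 Hz tone at half amplitude.
const tone = Float32Array.from(
  { length: SAMPLE_RATE },
  (_, i) => 0.5 * Math.sin((2 * Math.PI * 440 * i) / SAMPLE_RATE)
);
const audioMessages = toAudioMessages(floatTo16BitPcm(tone));
```

With the default 20ms chunk size, each message carries 320 samples (640 bytes before base64 encoding), so one second of audio becomes 50 messages. In a browser you would replace `Buffer` with a `DataView` over an `ArrayBuffer` and encode with `btoa`.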
Audio Stream End
When the audio stream is paused for more than one second (e.g., the user mutes their mic), send an audioStreamEnd event to flush any buffered audio on the server. You can resume sending audio data at any time afterwards.
{
"realtimeInput": {
"audioStreamEnd": true
}
}
Sending Video (Vision)
The arete-live model supports real-time vision. Send individual image frames (JPEG or PNG) at a maximum rate of 1 frame per second.
{
"realtimeInput": {
"video": {
"mimeType": "image/jpeg",
"data": "BASE64_ENCODED_JPEG_DATA"
}
}
}
Sending Text
You can inject text into the conversation via realtimeInput:
{
"realtimeInput": {
"text": "Hello, how are you?"
}
}
Incremental Content Updates (Context Seeding)
Use clientContent to seed initial conversation history or inject context. This is useful for restoring a session or providing the model with background information before the user speaks.
{
"clientContent": {
"turns": [
{ "role": "user", "parts": [{ "text": "What is the capital of France?" }] },
{ "role": "model", "parts": [{ "text": "Paris." }] },
{ "role": "user", "parts": [{ "text": "And Germany?" }] }
],
"turnComplete": true
}
}
Note: For arete-live, clientContent is primarily supported for seeding initial context history. During active conversation, use realtimeInput with the text field instead.
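If you store conversation history as simple role/text pairs, a small helper can turn it into a clientContent message. This is a sketch under that assumption; `buildClientContent` and the `{ role, text }` history shape are hypothetical and not part of the API.

```javascript
// Hypothetical helper: convert stored { role, text } pairs into a clientContent message.
function buildClientContent(history, turnComplete = true) {
  const turns = history.map(({ role, text }) => {
    // The examples in this reference use only the "user" and "model" roles.
    if (role !== "user" && role !== "model") {
      throw new Error(`Unsupported role: ${role}`);
    }
    return { role, parts: [{ text }] };
  });
  return { clientContent: { turns, turnComplete } };
}

const seedMessage = buildClientContent([
  { role: "user", text: "What is the capital of France?" },
  { role: "model", text: "Paris." },
  { role: "user", text: "And Germany?" },
]);
```

Send `JSON.stringify(seedMessage)` over the WebSocket once setupComplete has been received.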
Receiving Output
The model streams responses as JSON payloads containing serverContent.
Receiving Native Audio
Output audio is raw, little-endian, 16-bit PCM at 24kHz mono. A single server event may contain multiple content parts simultaneously (e.g., inlineData and transcript together). Ensure your code processes all parts in each event.
ws.onmessage = (event) => {
const msg = JSON.parse(event.data);
const parts = msg.serverContent?.modelTurn?.parts;
if (parts) {
for (const part of parts) {
if (part.inlineData?.mimeType?.includes("audio/pcm")) {
playAudioBuffer(part.inlineData.data); // base64 → PCM → AudioContext
}
}
}
};
Turn Completion
When the model finishes its response, it sends a turnComplete signal:
{ "serverContent": { "turnComplete": true } }
Generation Complete
Distinct from turnComplete, the generationComplete flag signals that the model has finished all generation for the current request (important for multi-part or tool-assisted responses):
{ "serverContent": { "generationComplete": true } }
Audio Transcriptions
You can request real-time text transcriptions of both input audio (what the user says) and output audio (what the model says). This is essential for displaying subtitles, logging conversations, or feeding text to downstream systems.
Enabling Transcriptions
Include the transcription flags in your setup configuration:
"config": {
"responseModalities": ["AUDIO"],
"inputAudioTranscription": {},
"outputAudioTranscription": {}
}
Receiving Transcriptions
Transcriptions arrive interwoven with audio chunks:
{
"serverContent": {
"inputTranscription": { "text": "What's the weather like today?" }
}
}
{
"serverContent": {
"outputTranscription": { "text": "It's currently 72 degrees and sunny." }
}
}
Billing Note: When transcription is enabled, you are charged for the text tokens generated at the text output rate in addition to the standard audio token costs.
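For subtitles or logging, transcription events can be collected into per-speaker lines. The sketch below assumes that transcription text arrives in fragments that should be concatenated until turnComplete; that behavior, and the helper name `createTranscriptLog`, are assumptions of this example rather than documented guarantees.

```javascript
// Hypothetical helper: accumulate transcription fragments into finished lines.
function createTranscriptLog() {
  const entries = [];                            // finalized { speaker, text } lines
  const pendingText = { user: "", model: "" };   // text collected for the current turn

  function flush() {
    for (const speaker of ["user", "model"]) {
      if (pendingText[speaker]) {
        entries.push({ speaker, text: pendingText[speaker] });
        pendingText[speaker] = "";
      }
    }
  }

  return {
    entries,
    // Call this with every parsed server message.
    handle(msg) {
      const content = msg.serverContent;
      if (!content) return;
      if (content.inputTranscription?.text) pendingText.user += content.inputTranscription.text;
      if (content.outputTranscription?.text) pendingText.model += content.outputTranscription.text;
      if (content.turnComplete) flush();
    },
  };
}

const log = createTranscriptLog();
log.handle({ serverContent: { inputTranscription: { text: "What's the weather " } } });
log.handle({ serverContent: { inputTranscription: { text: "like today?" } } });
log.handle({ serverContent: { outputTranscription: { text: "It's currently 72 degrees and sunny." } } });
log.handle({ serverContent: { turnComplete: true } });
```

After the final call, `log.entries` holds one line for the user and one for the model, in that order.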
Voice & Language Configuration
Changing the Voice
The arete-live model supports a wide range of voices. Specify a voice in your setup config:
"config": {
"responseModalities": ["AUDIO"],
"speechConfig": {
"voiceConfig": {
"prebuiltVoiceConfig": { "voiceName": "Kore" }
}
}
}
Language Support
The model supports 97 languages and can switch between them naturally during conversation. You can restrict which languages it speaks in via your system instructions. If you need the model to respond in a non-English language, include in your system instructions:
RESPOND IN {LANGUAGE}. YOU MUST RESPOND UNMISTAKABLY IN {LANGUAGE}.
| Language | Code | Language | Code | Language | Code |
|---|---|---|---|---|---|
| Arabic | ar | English | en | Hindi | hi |
| Bengali | bn | French | fr | Japanese | ja |
| Chinese | zh | German | de | Korean | ko |
| Dutch | nl | Indonesian | id | Portuguese | pt |
| Filipino | fil | Italian | it | Russian | ru |
| Spanish | es | Turkish | tr | Vietnamese | vi |
...and 79 more. The model automatically detects and responds in the user's language.
Thinking & Reasoning
The arete-live model uses dynamic internal thinking to reason through complex problems before speaking. You can control the depth of reasoning using the thinkingLevel parameter.
"config": {
"thinkingConfig": {
"thinkingLevel": "low"
}
}
| Level | Behavior |
|---|---|
| minimal | Fastest responses, least reasoning. Default. |
| low | Light reasoning with low latency. |
| medium | Balanced reasoning and speed. |
| high | Deep reasoning, higher latency. |
Thought Summaries
You can also receive summaries of the model's internal thought process by setting includeThoughts: true:
"config": {
"thinkingConfig": {
"thinkingLevel": "medium",
"includeThoughts": true
}
}
Voice Activity Detection (VAD)
VAD allows the model to recognize when a person is speaking. This is essential for natural conversations—it lets the user interrupt the model at any time.
When VAD detects an interruption, the ongoing generation is cancelled and discarded. Only information already sent to the client is retained. The server sends:
{ "serverContent": { "interrupted": true } }
Client Implementation: When you receive interrupted: true, you must immediately stop playback and clear your client-side audio buffer to prevent the model from talking over the user.
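A client-side playback queue makes the interruption rule straightforward to implement. The following is a minimal sketch: `PlaybackQueue`, `handleServerContent`, and the `stopPlayback` callback are hypothetical names, and the `audio/pcm;rate=24000` MIME string in the usage lines is illustrative.

```javascript
// Hypothetical queue of base64 PCM chunks waiting to be played.
class PlaybackQueue {
  constructor() { this.chunks = []; }
  enqueue(base64Pcm) { this.chunks.push(base64Pcm); }
  next() { return this.chunks.shift(); }
  clear() { this.chunks.length = 0; }
  get size() { return this.chunks.length; }
}

// Route one parsed server message: queue audio, or stop and clear on interruption.
function handleServerContent(msg, queue, stopPlayback) {
  const content = msg.serverContent;
  if (!content) return;
  if (content.interrupted) {
    stopPlayback();  // in a browser, stop the currently playing audio source here
    queue.clear();   // drop audio the user has talked over
    return;
  }
  for (const part of content.modelTurn?.parts ?? []) {
    if (part.inlineData?.mimeType?.includes("audio/pcm")) {
      queue.enqueue(part.inlineData.data);
    }
  }
}

// Usage with a counter standing in for a real audio player.
const queue = new PlaybackQueue();
let stopCalls = 0;
const stopPlayback = () => { stopCalls += 1; };
const audioEvent = {
  serverContent: {
    modelTurn: { parts: [{ inlineData: { mimeType: "audio/pcm;rate=24000", data: "AAAA" } }] },
  },
};
handleServerContent(audioEvent, queue, stopPlayback);
handleServerContent(audioEvent, queue, stopPlayback);
const queuedBeforeInterrupt = queue.size;
handleServerContent({ serverContent: { interrupted: true } }, queue, stopPlayback);
```

Clearing the queue and stopping the active source are both needed: clearing alone would let the chunk that is already playing finish.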
Automatic VAD (Default)
By default, the model automatically performs VAD on the incoming audio stream. You can fine-tune it:
"config": {
"realtimeInputConfig": {
"automaticActivityDetection": {
"disabled": false,
"startOfSpeechSensitivity": "START_SENSITIVITY_LOW",
"endOfSpeechSensitivity": "END_SENSITIVITY_LOW",
"prefixPaddingMs": 20,
"silenceDurationMs": 800
}
}
}
Understanding VAD Parameters
| Parameter | Description | Recommendation |
|---|---|---|
| prefixPaddingMs | Audio included before speech is detected (look-back buffer). 0 may clip first syllables. | 20ms |
| silenceDurationMs | How long to wait through silence before ending a speech turn. | 500–800ms |
| startOfSpeechSensitivity | How eagerly the system detects the start of speech. | START_SENSITIVITY_LOW for noisy environments |
| endOfSpeechSensitivity | How eagerly the system detects the end of speech. | END_SENSITIVITY_LOW for users who pause often |
Warning: Setting silenceDurationMs below 300ms will cause the model to split natural sentences at every breath or pause, resulting in fragmented context and degraded response quality. We strongly recommend 500ms–800ms.
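The recommendations above can be checked before a session starts. This sketch builds the automatic VAD configuration and collects warnings for values outside the ranges in the table; `buildVadConfig` and its warning messages are hypothetical, and the thresholds are taken directly from this section.

```javascript
// Hypothetical helper: build the automatic VAD config and flag risky values.
function buildVadConfig(options = {}) {
  const {
    silenceDurationMs = 800,
    prefixPaddingMs = 20,
    startOfSpeechSensitivity = "START_SENSITIVITY_LOW",
    endOfSpeechSensitivity = "END_SENSITIVITY_LOW",
  } = options;

  const warnings = [];
  if (silenceDurationMs < 300) {
    warnings.push("silenceDurationMs below 300ms is likely to fragment sentences");
  } else if (silenceDurationMs < 500 || silenceDurationMs > 800) {
    warnings.push("silenceDurationMs is outside the recommended 500-800ms range");
  }
  if (prefixPaddingMs === 0) {
    warnings.push("prefixPaddingMs of 0 may clip the first syllable");
  }

  return {
    config: {
      realtimeInputConfig: {
        automaticActivityDetection: {
          disabled: false,
          startOfSpeechSensitivity,
          endOfSpeechSensitivity,
          prefixPaddingMs,
          silenceDurationMs,
        },
      },
    },
    warnings,
  };
}

const recommended = buildVadConfig();                      // no warnings
const tooShort = buildVadConfig({ silenceDurationMs: 200 }); // one warning
```

Logging the warnings during development is usually enough; the helper does not reject any value, since the API itself accepts them.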
Manual VAD (Client-Side)
You can disable automatic VAD and manage speech boundaries yourself using activityStart and activityEnd signals:
"config": {
"realtimeInputConfig": {
"automaticActivityDetection": {
"disabled": true
}
}
}
Then send explicit boundary signals:
{ "realtimeInput": { "activityStart": {} } }
{ "realtimeInput": { "activityEnd": {} } }
Best Practice for Manual VAD: Use an end-of-speech silence threshold of at least 500ms in your client-side detector. Below this, audio becomes fragmented and response quality degrades significantly.
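As an illustration of a client-side detector, the sketch below uses a simple energy (RMS) threshold to decide when to emit the boundary signals. This is one possible approach, not the method the API uses internally; `createManualVad`, the 20ms frame size, and the 0.02 energy threshold are all assumptions that you would tune for your microphone and environment.

```javascript
// Hypothetical energy-based detector that returns boundary messages to send.
function createManualVad({ frameMs = 20, energyThreshold = 0.02, endSilenceMs = 500 } = {}) {
  let speaking = false;
  let silentMs = 0;

  // Feed one frame of Float32 samples; returns a message to send, or null.
  return function processFrame(samples) {
    let sumSquares = 0;
    for (const s of samples) sumSquares += s * s;
    const rms = Math.sqrt(sumSquares / samples.length);
    const loud = rms >= energyThreshold;

    if (!speaking && loud) {
      speaking = true;
      silentMs = 0;
      return { realtimeInput: { activityStart: {} } };
    }
    if (speaking) {
      // Count continuous silence; any loud frame resets the counter.
      silentMs = loud ? 0 : silentMs + frameMs;
      if (silentMs >= endSilenceMs) {
        speaking = false;
        silentMs = 0;
        return { realtimeInput: { activityEnd: {} } };
      }
    }
    return null;
  };
}

// Usage: 5 loud frames followed by 25 silent frames (500ms of silence).
const processFrame = createManualVad();
const loudFrame = new Float32Array(320).fill(0.5);
const silentFrame = new Float32Array(320);
const events = [];
for (let i = 0; i < 30; i++) {
  const message = processFrame(i < 5 ? loudFrame : silentFrame);
  if (message) events.push({ frame: i, message });
}
```

The detector emits activityStart on the first loud frame and activityEnd only after a full 500ms of continuous silence, which matches the minimum threshold recommended above.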
Advanced Audio Features
Affective Dialog
This feature lets the model adapt its response style—tone, pace, emotion—to match the expression and tone of the user's vocal input. Enable it in your setup:
"config": {
"enableAffectiveDialog": true
}
Proactive Audio
When enabled, the model can intelligently decide not to respond if the content it hears is not relevant (e.g., background conversation, TV noise). This prevents unnecessary interruptions.
"config": {
"proactivity": {
"proactiveAudio": true
}
}
Billing Note: When Proactive Audio is enabled, input tokens are charged the entire time the API is listening, while output tokens are only charged when the model actually responds.
Media Resolution
You can control the resolution of input media (images/video frames) to trade off between quality and token consumption:
"config": {
"mediaResolution": "MEDIA_RESOLUTION_LOW"
}
Tool Calling (Function Execution)
The Iseer Live API natively supports bidirectional tool calling. You declare tools during setup, and the model can invoke them mid-conversation.
Defining Tools
Include function declarations in your setup config. Each function needs a name, description, and parameters schema:
"tools": [
{
"functionDeclarations": [
{
"name": "get_weather",
"description": "Gets the current weather for a location. Invoke when the user asks about weather.",
"parameters": {
"type": "OBJECT",
"properties": {
"location": {
"type": "STRING",
"description": "City name or coordinates"
}
},
"required": ["location"]
}
}
]
}
]
Best Practice: Include an Invocation Condition in your tool descriptions (e.g., "Invoke this tool only after the user provides their location"). This dramatically improves when and how the model uses each tool.
Handling Tool Calls
When the model needs to call a tool, it pauses audio generation and sends a toolCall:
{
"toolCall": {
"functionCalls": [
{
"id": "call_abc123",
"name": "get_weather",
"args": { "location": "San Francisco" }
}
]
}
}
Responding to Tool Calls
Execute the function locally and send the result back immediately:
{
"toolResponse": {
"functionResponses": [
{
"id": "call_abc123",
"name": "get_weather",
"response": {
"temperature": "72°F",
"condition": "Sunny",
"humidity": "45%"
}
}
]
}
}
The model will then resume speaking and incorporate the tool result into its response.
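The request/response cycle above can be handled with a small dispatcher that maps function names to local implementations. This is a sketch only: `handleToolCall` and `localTools` are hypothetical, the weather data is hard-coded, and reporting failures through an `error` field in the response is a convention of this example rather than a documented format.

```javascript
// Local implementations keyed by function name (hard-coded example data).
const localTools = {
  get_weather: async ({ location }) => ({
    location,
    temperature: "72°F",
    condition: "Sunny",
  }),
};

// Build a toolResponse message for a toolCall message, or return null.
async function handleToolCall(msg, tools) {
  if (!msg.toolCall) return null;
  const functionResponses = await Promise.all(
    msg.toolCall.functionCalls.map(async ({ id, name, args }) => {
      let response;
      try {
        const fn = tools[name];
        if (!fn) throw new Error(`Unknown tool: ${name}`);
        response = await fn(args ?? {});
      } catch (err) {
        // Assumption: an "error" field is this sketch's way of reporting failure.
        response = { error: String(err.message) };
      }
      // Echo the id and name so the model can match the result to its call.
      return { id, name, response };
    })
  );
  return { toolResponse: { functionResponses } };
}

const exampleCall = {
  toolCall: {
    functionCalls: [{ id: "call_abc123", name: "get_weather", args: { location: "San Francisco" } }],
  },
};
const replyPromise = handleToolCall(exampleCall, localTools);
```

Once `replyPromise` resolves, send the result with `ws.send(JSON.stringify(reply))`. Because synchronous function calling pauses the model until the response arrives, keep local tool execution fast or use the non-blocking mode.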
Synchronous vs Asynchronous Function Calling
By default, function calling is synchronous—the model pauses all interaction until your tool response arrives. For non-blocking (asynchronous) execution, set behavior: "NON_BLOCKING" on the function declaration:
{
"functionDeclarations": [
{
"name": "send_email",
"description": "Sends an email to the user.",
"behavior": "NON_BLOCKING"
}
]
}
For non-blocking functions, control how the model reacts when your response arrives using the scheduling parameter in the toolResponse:
| Scheduling | Behavior |
|---|---|
| INTERRUPT | The model stops what it's saying and immediately addresses the tool result. |
| WHEN_IDLE | The model waits until it finishes its current thought, then addresses the result. |
| SILENT | The model absorbs the information silently and uses it later in the conversation. |
{
"toolResponse": {
"functionResponses": [
{
"id": "call_xyz",
"name": "send_email",
"response": {
"result": "sent",
"scheduling": "WHEN_IDLE"
}
}
]
}
}
Iseer Web Search (Grounding)
Give the arete-live model access to real-time web information via the Iseer Web Search tool. This dramatically reduces hallucinations and enables up-to-date answers.
"tools": [
{ "iseerSearch": {} }
]
When enabled, the model will autonomously decide when to search the web, execute retrieval, and incorporate the results into its spoken response.
Combining Multiple Tools
You can pass multiple tools simultaneously. The model will use them intelligently based on context:
"tools": [
{ "iseerSearch": {} },
{
"functionDeclarations": [
{ "name": "get_user_profile", "description": "Fetches user profile data." },
{ "name": "schedule_meeting", "description": "Schedules a meeting." }
]
}
]
Session Management
Audio tokens accumulate at approximately 25 tokens per second. Without management, sessions hit hard limits quickly.
| Session Type | Default Limit |
|---|---|
| Audio-only | 15 minutes |
| Audio + Video | 2 minutes |
| With Context Compression | Unlimited |
Context Window Compression
Enable a sliding-window mechanism that automatically evicts old context when the token count exceeds a configurable threshold:
"config": {
"contextWindowCompression": {
"slidingWindow": {},
"triggerTokens": 25000
}
}
When the context window hits triggerTokens, the API compresses older content and retains only the most recent window. This enables unlimited session duration.
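To get a feel for when compression first triggers, you can divide the threshold by the approximate audio token rate. The sketch below counts audio tokens only (an assumption; text, video, and tool traffic would bring the threshold closer), and `secondsUntilCompression` is a hypothetical helper.

```javascript
const AUDIO_TOKENS_PER_SECOND = 25; // approximate rate stated in this reference

// Estimate seconds of audio until the context reaches triggerTokens.
function secondsUntilCompression(triggerTokens, startingTokens = 0) {
  const remaining = Math.max(0, triggerTokens - startingTokens);
  return remaining / AUDIO_TOKENS_PER_SECOND;
}

const seconds = secondsUntilCompression(25000); // 25000 / 25 = 1000 seconds
const minutes = seconds / 60;                   // roughly 16.7 minutes
```

With the triggerTokens value of 25000 shown above, an audio-only session would reach the threshold after about 1000 seconds of accumulated audio under this estimate.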
Session Resumption
The server may periodically reset the WebSocket connection (~every 10 minutes). Session resumption lets you seamlessly reconnect without losing any conversational context.
How it works:
- Include sessionResumption: {} in your setup config.
- The server periodically sends sessionResumptionUpdate messages containing a newHandle.
- Store the latest newHandle.
- When reconnecting, pass it as the handle in your new setup config.
"config": {
"sessionResumption": {
"handle": "PREVIOUS_HANDLE_HERE"
}
}
Receiving updates:
{
"sessionResumptionUpdate": {
"resumable": true,
"newHandle": "eyJhbGciOiJS..."
}
}
Resumption tokens are valid for 2 hours after the last session terminates. You can use the same ephemeral token for reconnection even if it was configured with uses: 1.
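The resumption steps can be sketched as a small tracker that remembers the latest handle and produces the sessionResumption fragment for the next setup config. `createResumptionTracker` is a hypothetical helper, and ignoring updates where `resumable` is not true is an assumption of this example.

```javascript
// Hypothetical helper: remember the newest resumption handle.
function createResumptionTracker() {
  let latestHandle = null;

  return {
    // Call this with every parsed server message.
    handle(msg) {
      const update = msg.sessionResumptionUpdate;
      // Assumption: only keep handles the server marks as resumable.
      if (update?.resumable && update.newHandle) latestHandle = update.newHandle;
    },
    // Value to place under "sessionResumption" in the next setup config.
    resumptionConfig() {
      return latestHandle ? { handle: latestHandle } : {};
    },
  };
}

const tracker = createResumptionTracker();
const beforeUpdate = tracker.resumptionConfig(); // {} on a first connection
tracker.handle({ sessionResumptionUpdate: { resumable: true, newHandle: "handle-1" } });
tracker.handle({ sessionResumptionUpdate: { resumable: true, newHandle: "handle-2" } });
const afterUpdate = tracker.resumptionConfig();  // uses the most recent handle
```

Persisting the handle outside of memory (for example in session storage) lets the client resume after a page reload as well as after a server-initiated reset.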
GoAway Messages
Before the server terminates a connection, it sends a goAway message with a timeLeft field indicating remaining time. Use this to gracefully wrap up or initiate session resumption:
{
"goAway": {
"timeLeft": "30s"
}
}
Token Counting & Analytics
The server periodically streams usageMetadata messages so you can track consumption in real-time:
{
"usageMetadata": {
"totalTokenCount": 4250,
"responseTokensDetails": [
{ "modality": "AUDIO", "tokenCount": 3800 },
{ "modality": "TEXT", "tokenCount": 450 }
]
}
}
Billing Model
The Live API bills per turn for all tokens in the active context window:
- Accumulation: Each turn's cost includes new tokens plus all accumulated tokens from previous turns (re-processed for context).
- Audio tokens: Billed at the audio input rate (~25 tokens/sec of audio).
- Transcription surcharge: When transcription is enabled, text tokens are billed additionally at the text output rate.
- Cost control: Use contextWindowCompression with a triggerTokens limit to cap per-turn costs.
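The accumulation rule is easier to see with numbers. The sketch below models each turn's billed tokens as the running context size, and treats compression as a hard cap at the trigger value. That cap is a simplification made for this example (the size of the retained window after compression is not specified here), and `billedTokensPerTurn` is a hypothetical helper.

```javascript
// Billed tokens per turn: new tokens plus everything accumulated so far.
// contextCap is a simplified stand-in for context window compression.
function billedTokensPerTurn(newTokensPerTurn, contextCap = Infinity) {
  let context = 0;
  return newTokensPerTurn.map((newTokens) => {
    context = Math.min(context + newTokens, contextCap);
    return context;
  });
}

const sum = (values) => values.reduce((total, v) => total + v, 0);

// Four turns, each adding 10 seconds of audio (10 s x 25 tokens/s = 250 tokens).
const turns = [250, 250, 250, 250];
const uncapped = billedTokensPerTurn(turns);    // [250, 500, 750, 1000]
const capped = billedTokensPerTurn(turns, 600); // [250, 500, 600, 600]
const uncappedTotal = sum(uncapped);            // 2500
const cappedTotal = sum(capped);                // 1950
```

Without a cap, billed tokens grow with every turn, so total cost grows roughly quadratically with the number of turns. With a cap, per-turn cost levels off once the context reaches the limit.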
Security: Ephemeral Tokens
Ephemeral tokens secure client-to-API WebSocket connections. They are short-lived, single-use, and prevent your long-lived API keys from being exposed in client-side code.
How It Works
- Your backend authenticates the user.
- Your backend requests an ephemeral token from the Iseer provisioning API.
- The token is sent to the client.
- The client connects directly to wss://genai.api.iseer.co using the token as access_token.
- The token expires automatically (default: 30 minutes).
Token Configuration
| Parameter | Description | Default |
|---|---|---|
| uses | Number of sessions this token can initiate. | 1 |
| expireTime | Absolute expiration time (ISO 8601). | 30 minutes from creation |
| newSessionExpireTime | How long the token can be used to start new sessions. | 1 minute from creation |
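If you prefer to set these values explicitly, the timestamps can be computed on your backend. The sketch below mirrors the defaults in the table; `buildTokenRequest` is a hypothetical helper, and expressing newSessionExpireTime as an absolute ISO 8601 timestamp (like expireTime) is an assumption. How this payload is submitted to the provisioning API is not shown.

```javascript
// Hypothetical helper: compute token parameters relative to a creation time.
function buildTokenRequest({
  now = new Date(),
  uses = 1,
  lifetimeMinutes = 30,
  newSessionWindowMinutes = 1,
} = {}) {
  const plusMinutes = (minutes) =>
    new Date(now.getTime() + minutes * 60_000).toISOString();

  return {
    uses,
    expireTime: plusMinutes(lifetimeMinutes),
    // Assumption: formatted as an absolute timestamp, like expireTime.
    newSessionExpireTime: plusMinutes(newSessionWindowMinutes),
  };
}

const tokenRequest = buildTokenRequest({ now: new Date("2025-01-01T00:00:00.000Z") });
```

Passing `now` explicitly keeps the function deterministic for testing; in production you would omit it and rely on a synchronized server clock.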
Locking Tokens to Configurations
For maximum security, you can lock an ephemeral token to a specific set of configurations. This guarantees that even if a malicious user intercepts the token, they cannot change the system instruction, temperature, tools, or model:
{
"uses": 1,
"liveConnectConstraints": {
"model": "models/arete-live",
"config": {
"sessionResumption": {},
"temperature": 0.7,
"responseModalities": ["AUDIO"],
"systemInstruction": {
"parts": [{ "text": "You are a secure Iseer agent." }]
}
}
}
}
The ephemeral token must be passed as the access_token query parameter in the WebSocket URL, or in an HTTP Authorization header prefixed with the auth-scheme Token.
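Both ways of passing the token can be sketched with standard APIs. The helper names `buildLiveUrl` and `buildAuthHeader` are hypothetical; the endpoint is the one used throughout this reference.

```javascript
const LIVE_ENDPOINT = "wss://genai.api.iseer.co";

// Option 1: pass the token as the access_token query parameter.
function buildLiveUrl(token) {
  const url = new URL(LIVE_ENDPOINT);
  url.searchParams.set("access_token", token); // percent-encodes the token if needed
  return url.toString();
}

// Option 2: pass the token in an Authorization header with the Token scheme.
function buildAuthHeader(token) {
  return { Authorization: `Token ${token}` };
}

const liveUrl = buildLiveUrl("YOUR_EPHEMERAL_TOKEN");
const authHeader = buildAuthHeader("YOUR_EPHEMERAL_TOKEN");
```

The browser WebSocket constructor does not accept custom headers, so browser clients would normally use the query parameter. The header form is an option for server-side or native clients whose WebSocket library supports setting request headers.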
Best Practices
System Instructions
- Define the persona first: Name, role, characteristics, accent, preferred language.
- Specify conversational rules: Separate one-time steps (e.g., "gather user info") from loops (e.g., "discuss topics freely").
- Specify tool invocation conditions: Describe when each tool should be called in clear, separate sentences.
- Add guardrails last: Define what the model should never do.
Streaming Audio
- Chunk size: Send 20ms–40ms audio chunks for optimal latency.
- Interruption handling: When you receive interrupted: true, immediately stop playback and clear your audio buffer.
- Resampling: Always resample microphone input (44.1kHz/48kHz) to 16kHz before transmission.
Context Management
- Enable contextWindowCompression for any session expected to last more than a few minutes.
- Audio tokens accumulate at ~25 tokens/second. A 5-minute conversation without compression consumes ~7,500 tokens.
- For longer contexts, provide a single summary message rather than full turn-by-turn history.
Prompt Design
- Use clear, concise prompts. Provide examples of what the model should and shouldn't do.
- Limit prompts to one persona per system instruction. Use prompt chaining for multi-role scenarios.
- To have the model initiate the conversation, include a prompt asking it to greet the user.
Session Limits & Context Window
| Parameter | Value |
|---|---|
| Context window | 128k tokens |
| Audio-only session (no compression) | 15 minutes |
| Audio+Video session (no compression) | 2 minutes |
| With compression | Unlimited |
| Connection lifetime | ~10 minutes (use session resumption) |
| Audio token rate | ~25 tokens/second |
| Max video frame rate | 1 FPS |
| Input audio format | PCM 16-bit LE, 16kHz mono |
| Output audio format | PCM 16-bit LE, 24kHz mono |
Need help? If you experience connectivity issues, ensure your system clock is synced, because ephemeral tokens are time-sensitive. Contact developer@iseer.co for support.