Documentation

Streaming

SM-AI-MODELS supports real-time audio streaming for low-latency applications such as voice bots, IVR systems, and live assistants.

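Why Streaming? A standard REST request returns only after the entire audio has been generated. Streaming sends audio chunks as they are produced, so playback can begin once the first chunk arrives (typically tens to hundreds of milliseconds; see Latency Expectations), which reduces perceived latency.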

Streaming Methods

SM-AI-MODELS provides two streaming protocols:

Method	Streaming Mode	Port	Best For
WebSocket	Bidirectional	`9999` (TTS), `8088` (ASR)	Real-time apps, voice bots, live transcription
gRPC	Server streaming (TTS), bidirectional (ASR)	`50051` (TTS), `50052` (ASR)	High-performance, MRCP, enterprise systems

WebSocket Streaming

SM-AI-MODELS provides WebSocket endpoints for real-time, bidirectional audio streaming with the lowest possible latency.

When to Use WebSockets: Use WebSocket streaming for real-time conversational applications, live transcription, voice bots, and scenarios requiring instant audio feedback.

WebSocket Endpoints

Service	Endpoint	Direction	Best For
TTS	`ws://host:9999/v1/audio/stream`	Text→Audio	Voice bots, real-time speech synthesis
ASR	`ws://host:8088/v1/audio/stream`	Audio→Text	Live transcription, voice assistants

Text-to-Speech WebSocket

Connection

Code
 
ws://YOUR_HOST:9999/v1/audio/stream

Protocol

Client → Server: JSON messages with synthesis requests

Server → Client:

JSON status messages
Binary PCM audio chunks

Message Flow

Code
 
sequenceDiagram
    Client->>Server: Connect WebSocket
    Server->>Client: Connection accepted
    Client->>Server: JSON: {"text": "...", "voice": "Yara"}
    Server->>Client: JSON: {"type": "synthesis_start"}
    Server->>Client: Binary: PCM audio chunk
    Server->>Client: Binary: PCM audio chunk
    Server->>Client: JSON: {"type": "synthesis_complete"}

Request Format


Code
 
{
  "text": "مرحباً بكم في يونيكود سولوشنز",
  "voice": "Yara",
  "speed": 1.0,
  "sample_rate": 24000,
  "chunk_size": 1024
}

Parameters:

Parameter	Type	Default	Description
`text`	string	required	Text to synthesize (max 5,000 characters)
`voice`	string	`"Yara"`	Voice name: `Yara`, `Nouf`, `Yara_en`
`speed`	float	`1.0`	Speech speed (0.25 - 4.0)
`sample_rate`	integer	`22050`	Output sample rate in Hz (8000, 16000, 22050, 24000)
`chunk_size`	integer	`1024`	Audio chunk size in bytes
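
The limits in the table can be checked on the client before a request is sent, which avoids a round trip for obviously invalid input. The following is a minimal sketch; `build_tts_request` is a hypothetical helper, not part of any SDK, and the server remains the authority on what it accepts.

```python
import json

# Limits copied from the parameter table above.
MAX_TEXT_CHARS = 5000
VOICES = {"Yara", "Nouf", "Yara_en"}
SAMPLE_RATES = {8000, 16000, 22050, 24000}


def build_tts_request(text, voice="Yara", speed=1.0, sample_rate=22050, chunk_size=1024):
    """Validate parameters locally and return the JSON request string."""
    if not text:
        raise ValueError("text is required")
    if len(text) > MAX_TEXT_CHARS:
        raise ValueError(f"text exceeds {MAX_TEXT_CHARS} characters")
    if voice not in VOICES:
        raise ValueError(f"unknown voice: {voice!r}")
    if not 0.25 <= speed <= 4.0:
        raise ValueError("speed must be between 0.25 and 4.0")
    if sample_rate not in SAMPLE_RATES:
        raise ValueError(f"unsupported sample_rate: {sample_rate}")
    if chunk_size <= 0:
        raise ValueError("chunk_size must be a positive number of bytes")
    # ensure_ascii=False keeps Arabic text readable in logs.
    return json.dumps(
        {
            "text": text,
            "voice": voice,
            "speed": speed,
            "sample_rate": sample_rate,
            "chunk_size": chunk_size,
        },
        ensure_ascii=False,
    )
```

The returned string can be passed directly to `websocket.send()`.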

Response Events

synthesis_start — Synthesis has begun


Code
 
{
  "type": "synthesis_start",
  "session_id": "abc12345"
}

synthesis_complete — Synthesis finished


Code
 
{
  "type": "synthesis_complete",
  "session_id": "abc12345",
  "total_bytes": 48000,
  "model_id": "primary"
}

error — An error occurred


Code
 
{
  "type": "error",
  "session_id": "abc12345",
  "message": "Text exceeds maximum length"
}
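
Because JSON events and binary audio arrive on the same connection, a client has to dispatch on the frame type. The sketch below shows one way to track a single synthesis request; `TTSSession` is a hypothetical class name, and it assumes text frames are always JSON events shaped like the examples above.

```python
import json


class TTSSession:
    """Accumulates audio and tracks state for one synthesis request."""

    def __init__(self):
        self.session_id = None
        self.audio = bytearray()
        self.done = False

    def handle(self, message):
        """Process one WebSocket frame; return True once synthesis is complete."""
        if isinstance(message, (bytes, bytearray)):
            # Binary frame: a PCM audio chunk.
            self.audio.extend(message)
            return False
        event = json.loads(message)
        kind = event.get("type")
        if kind == "synthesis_start":
            self.session_id = event.get("session_id")
        elif kind == "synthesis_complete":
            self.done = True
        elif kind == "error":
            raise RuntimeError(event.get("message", "unknown error"))
        return self.done
```

Calling `handle()` on every received frame and stopping when it returns `True` keeps the receive loop free of protocol details.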

Python WebSocket TTS Client


Code
 
import asyncio
import websockets
import json

async def stream_tts(text, voice="Yara"):
    """Stream TTS via WebSocket and receive audio chunks."""
    uri = "ws://YOUR_HOST:9999/v1/audio/stream"

    async with websockets.connect(uri) as websocket:
        # Send synthesis request
        request = {
            "text": text,
            "voice": voice,
            "sample_rate": 16000,
            "speed": 1.0
        }
        await websocket.send(json.dumps(request))

        # Receive responses
        audio_chunks = []
        async for message in websocket:
            # Check if message is JSON or binary
            if isinstance(message, str):
                event = json.loads(message)
                print(f"Event: {event['type']}")

                if event['type'] == 'synthesis_complete':
                    print(f"Total bytes: {event['total_bytes']}")
                    break
                elif event['type'] == 'error':
                    print(f"Error: {event['message']}")
                    break
            else:
                # Binary audio chunk
                audio_chunks.append(message)

        return b''.join(audio_chunks)

# Usage
audio_data = asyncio.run(stream_tts("مرحباً بكم"))
with open("output.pcm", "wb") as f:
    f.write(audio_data)
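
Raw `.pcm` files carry no header, so most players cannot open them. A minimal sketch for wrapping the received bytes in a WAV container with the standard library follows. It assumes the stream is 16-bit mono PCM at the sample rate that was requested; the bit depth of TTS output is not stated on this page, so verify it against your deployment.

```python
import wave


def pcm_to_wav(pcm_bytes, destination, sample_rate=16000, channels=1, sample_width=2):
    """Write raw PCM bytes to a WAV file (path or writable binary file object)."""
    with wave.open(destination, "wb") as wav_file:
        wav_file.setnchannels(channels)
        wav_file.setsampwidth(sample_width)  # bytes per sample; 2 = 16-bit (assumed)
        wav_file.setframerate(sample_rate)   # must match the requested sample_rate
        wav_file.writeframes(pcm_bytes)
```

For example, `pcm_to_wav(audio_data, "output.wav", sample_rate=16000)` matches the request sent by the client above.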

JavaScript WebSocket TTS Client


Code
 
const WebSocket = require('ws');

function streamTTS(text, voice = 'Yara') {
  return new Promise((resolve, reject) => {
    const ws = new WebSocket('ws://YOUR_HOST:9999/v1/audio/stream');
    const audioChunks = [];

    ws.on('open', () => {
      // Send synthesis request
      ws.send(JSON.stringify({
        text: text,
        voice: voice,
        sample_rate: 16000,
        speed: 1.0
      }));
    });

    ws.on('message', (data, isBinary) => {
      // ws v8+ delivers every message as a Buffer and passes isBinary;
      // older versions deliver text frames as strings.
      const isText = typeof data === 'string' || isBinary === false;

      if (isText) {
        // JSON event
        const event = JSON.parse(data.toString());
        console.log(`Event: ${event.type}`);

        if (event.type === 'synthesis_complete') {
          console.log(`Total bytes: ${event.total_bytes}`);
          const audioBuffer = Buffer.concat(audioChunks);
          ws.close();
          resolve(audioBuffer);
        } else if (event.type === 'error') {
          ws.close();
          reject(new Error(event.message));
        }
      } else {
        // Binary audio chunk
        audioChunks.push(data);
      }
    });

    ws.on('error', (error) => {
      reject(error);
    });
  });
}

// Usage
streamTTS('مرحباً بكم')
  .then(audioData => {
    require('fs').writeFileSync('output.pcm', audioData);
  })
  .catch(console.error);

Multi-Turn Synthesis

The WebSocket connection supports multiple synthesis requests on a single connection:


Code
 
import asyncio
import json
import websockets

async def multi_turn_tts(texts):
    """Send multiple texts on one WebSocket connection, yielding audio per text."""
    uri = "ws://YOUR_HOST:9999/v1/audio/stream"

    async with websockets.connect(uri) as websocket:
        for text in texts:
            # Send request
            await websocket.send(json.dumps({
                "text": text,
                "voice": "Yara"
            }))

            # Collect audio for this utterance
            audio_chunks = []
            while True:
                message = await websocket.recv()
                if isinstance(message, str):
                    event = json.loads(message)
                    if event['type'] == 'synthesis_complete':
                        break
                    if event['type'] == 'error':
                        # Without this, an error would leave the loop waiting forever
                        raise RuntimeError(event['message'])
                else:
                    audio_chunks.append(message)

            yield b''.join(audio_chunks)

# Usage: an async generator must be consumed with "async for"
async def main():
    texts = ["مرحباً", "كيف حالك؟", "إلى اللقاء"]
    i = 0
    async for audio in multi_turn_tts(texts):
        with open(f"output_{i}.pcm", "wb") as f:
            f.write(audio)
        i += 1

asyncio.run(main())

Speech Recognition WebSocket

Connection

Code
 
ws://YOUR_HOST:8088/v1/audio/stream

Legacy endpoint (backward compatibility):

Code
 
ws://YOUR_HOST:8088/audio

Protocol

Client → Server: Binary PCM audio frames (int16, 16kHz, mono)

Server → Client: JSON events (speech detection, transcription results)

Message Flow

Code
 
sequenceDiagram
    Client->>Server: Connect WebSocket
    Server->>Client: Connection accepted
    Client->>Server: Binary: Audio chunk
    Client->>Server: Binary: Audio chunk
    Server->>Client: JSON: {"type": "speech_start"}
    Client->>Server: Binary: Audio chunk
    Server->>Client: JSON: {"type": "transcript", "text": "...", "is_final": false}
    Client->>Server: Binary: Audio chunk (silence)
    Server->>Client: JSON: {"type": "speech_end"}
    Server->>Client: JSON: {"type": "transcript", "text": "...", "is_final": true}

Audio Requirements

Parameter	Value	Description
Encoding	LINEAR16 (int16)	16-bit signed PCM
Sample rate	16,000 Hz	16kHz mono audio
Channels	1 (mono)	Mono audio only
Chunk size	1024-4096 bytes	Recommended 2048 bytes
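
The chunk sizes in the table translate directly into audio durations, which is useful when pacing a file upload or sizing buffers. The arithmetic is sketched below; `chunk_duration_ms` and `split_pcm` are hypothetical helper names.

```python
SAMPLE_RATE = 16000     # Hz, from the table above
BYTES_PER_SAMPLE = 2    # LINEAR16 = 16-bit samples
CHANNELS = 1            # mono


def chunk_duration_ms(chunk_bytes):
    """Duration of audio contained in a chunk of the given size."""
    return chunk_bytes * 1000 / (SAMPLE_RATE * BYTES_PER_SAMPLE * CHANNELS)


def split_pcm(pcm, chunk_bytes=2048):
    """Yield consecutive chunks of raw PCM; the last chunk may be shorter."""
    if chunk_bytes <= 0 or chunk_bytes % BYTES_PER_SAMPLE:
        raise ValueError("chunk_bytes must be a positive multiple of the sample size")
    for offset in range(0, len(pcm), chunk_bytes):
        yield pcm[offset:offset + chunk_bytes]


print(chunk_duration_ms(1024))  # 32.0
print(chunk_duration_ms(2048))  # 64.0
print(chunk_duration_ms(4096))  # 128.0
```

The recommended 2048-byte chunk therefore carries 64 ms of audio.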

Response Events

speech_start — Speech detected


Code
 
{
  "type": "speech_start",
  "timestamp_ms": 1234
}

transcript — Transcription result (interim or final)


Code
 
{
  "type": "transcript",
  "text": "مرحباً بكم",
  "is_final": true,
  "confidence": 0.95
}

speech_end — End of speech detected


Code
 
{
  "type": "speech_end",
  "timestamp_ms": 5678
}

false_detection — Audio was too short (not speech)


Code
 
{
  "type": "false_detection",
  "timestamp_ms": 1234
}

error — An error occurred


Code
 
{
  "type": "error",
  "message": "Invalid audio format"
}
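
The events above can be folded into a small state object so that application code only sees the current partial text and the list of final transcripts. This is a minimal sketch; `TranscriptCollector` is a hypothetical name, and discarding the partial text on `false_detection` is a design choice of the example rather than documented server behavior.

```python
import json


class TranscriptCollector:
    """Tracks speech state and transcripts from ASR JSON events."""

    def __init__(self):
        self.finals = []      # final transcripts, in arrival order
        self.partial = ""     # latest interim transcript
        self.speaking = False

    def handle(self, message):
        """Process one JSON event received from the ASR WebSocket."""
        event = json.loads(message)
        kind = event.get("type")
        if kind == "speech_start":
            self.speaking = True
        elif kind == "speech_end":
            self.speaking = False
        elif kind == "false_detection":
            # The detected audio was too short to be speech.
            self.speaking = False
            self.partial = ""
        elif kind == "transcript":
            if event.get("is_final"):
                self.finals.append(event["text"])
                self.partial = ""
            else:
                self.partial = event["text"]
        elif kind == "error":
            raise RuntimeError(event.get("message", "unknown error"))

    @property
    def text(self):
        """All final transcripts joined into one string."""
        return " ".join(self.finals)
```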

Python WebSocket ASR Client


Code
 
import asyncio
import websockets
import pyaudio
import json

async def stream_asr():
    """Stream microphone audio to ASR via WebSocket."""
    uri = "ws://YOUR_HOST:8088/v1/audio/stream"

    # Audio settings
    CHUNK = 1024  # frames per read: 1024 frames x 2 bytes = the recommended 2048 bytes
    FORMAT = pyaudio.paInt16
    CHANNELS = 1
    RATE = 16000

    # Initialize audio input
    audio = pyaudio.PyAudio()
    stream = audio.open(
        format=FORMAT,
        channels=CHANNELS,
        rate=RATE,
        input=True,
        frames_per_buffer=CHUNK
    )

    try:
        async with websockets.connect(uri) as websocket:
            print("Connected. Start speaking...")

            async def send_audio():
                """Read from microphone and send to server."""
                while True:
                    # stream.read() blocks, so run it in a worker thread to
                    # keep the event loop free to receive results.
                    data = await asyncio.to_thread(
                        stream.read, CHUNK, exception_on_overflow=False
                    )
                    await websocket.send(data)

            async def receive_results():
                """Receive transcription results."""
                async for message in websocket:
                    event = json.loads(message)

                    if event['type'] == 'speech_start':
                        print("\n[Speech detected]")

                    elif event['type'] == 'transcript':
                        if event['is_final']:
                            print(f"\n[Final] {event['text']}")
                        else:
                            print(f"[Partial] {event['text']}", end='\r')

                    elif event['type'] == 'speech_end':
                        print("\n[Speech ended]")

                    elif event['type'] == 'false_detection':
                        print("\n[Ignored: audio too short]")

                    elif event['type'] == 'error':
                        print(f"\n[Error] {event['message']}")
                        break

            # Send in the background; stop sending once receiving ends
            sender = asyncio.create_task(send_audio())
            try:
                await receive_results()
            finally:
                sender.cancel()
                await asyncio.gather(sender, return_exceptions=True)
    finally:
        # Release the microphone even if the connection fails
        stream.stop_stream()
        stream.close()
        audio.terminate()

# Usage
asyncio.run(stream_asr())

JavaScript WebSocket ASR Client


Code
 
const WebSocket = require('ws');
const fs = require('fs');

function streamASR(audioFilePath) {
  return new Promise((resolve, reject) => {
    const ws = new WebSocket('ws://YOUR_HOST:8088/v1/audio/stream');
    const results = [];

    ws.on('open', () => {
      console.log('Connected. Streaming audio...');

      // Read audio file (16-bit PCM, 16kHz, mono)
      const audioData = fs.readFileSync(audioFilePath);
      const chunkSize = 2048;

      // Send audio in chunks
      for (let i = 0; i < audioData.length; i += chunkSize) {
        const chunk = audioData.slice(i, i + chunkSize);
        ws.send(chunk);
      }
    });

    ws.on('message', (data) => {
      const event = JSON.parse(data.toString());

      if (event.type === 'speech_start') {
        console.log('[Speech detected]');
      } else if (event.type === 'transcript') {
        if (event.is_final) {
          console.log(`\n[Final] ${event.text}`);
          results.push(event.text);
          // The final transcript arrives after speech_end (see Message Flow),
          // so finish here rather than on speech_end. This returns after the
          // first utterance; keep the socket open to collect more.
          ws.close();
          resolve(results.join(' '));
        } else {
          process.stdout.write(`\r[Partial] ${event.text}`);
        }
      } else if (event.type === 'speech_end') {
        console.log('\n[Speech ended]');
      } else if (event.type === 'error') {
        console.error(`[Error] ${event.message}`);
        ws.close();
        reject(new Error(event.message));
      }
    });

    ws.on('error', (error) => {
      reject(error);
    });
  });
}

// Usage
streamASR('audio.pcm')
  .then(transcript => {
    console.log(`\nFull transcript: ${transcript}`);
  })
  .catch(console.error);

gRPC Streaming

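gRPC streaming provides high-performance, type-safe streaming ideal for MRCP servers and enterprise voice applications. The TTS service uses server streaming (one request, a stream of audio chunks); the ASR service uses bidirectional streaming.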

TTS gRPC Streaming

Proto Definition

Code
 
syntax = "proto3";

package smtts;

service TextToSpeech {
  // Unary: send text, receive complete audio
  rpc Synthesize (SynthesizeRequest) returns (SynthesizeResponse);

  // Server streaming: send text, receive audio chunks
  rpc StreamingSynthesize (SynthesizeRequest) returns (stream AudioChunk);
}

message SynthesizeRequest {
  string input = 1;
  string voice = 2;         // Yara, Nouf, Yara_en
  string format = 3;        // pcm, opus, mp3
  float speed = 4;          // 0.25 - 4.0
  int32 sample_rate = 5;    // 8000, 16000, 22050, 24000
}

message SynthesizeResponse {
  bytes audio = 1;
  string format = 2;
  int32 sample_rate = 3;
  float duration_seconds = 4;
}

message AudioChunk {
  bytes audio_data = 1;
  int32 chunk_index = 2;
  bool is_final = 3;
}

Python — gRPC Streaming Client


Code
 
import grpc
import smtts_pb2
import smtts_pb2_grpc

API_KEY = "YOUR_API_KEY"  # placeholder: replace with your key

def grpc_stream_tts(text, voice="Yara"):
    """Stream TTS audio via gRPC."""
    # The channel is closed automatically when the generator finishes
    with grpc.insecure_channel("YOUR_HOST:50051") as channel:
        stub = smtts_pb2_grpc.TextToSpeechStub(channel)

        request = smtts_pb2.SynthesizeRequest(
            input=text,
            voice=voice,
            format="pcm",
            sample_rate=16000
        )

        metadata = [('authorization', f'Bearer {API_KEY}')]

        # Receive audio chunks as they are generated
        for chunk in stub.StreamingSynthesize(request, metadata=metadata):
            yield chunk.audio_data

            if chunk.is_final:
                break

# Usage
audio_chunks = []
for chunk in grpc_stream_tts("مرحباً بكم في يونيكود سولوشنز"):
    audio_chunks.append(chunk)

full_audio = b"".join(audio_chunks)

ASR gRPC Streaming

Proto Definition

Code
 
service SpeechRecognition {
  // Unary: send audio file, receive transcription
  rpc Transcribe (TranscribeRequest) returns (TranscribeResponse);

  // Bidirectional streaming: send audio chunks, receive partial results
  rpc StreamingTranscribe (stream AudioChunk) returns (stream TranscriptionResult);
}

message TranscribeRequest {
  bytes audio = 1;
  string format = 2;
}

message TranscribeResponse {
  string text = 1;
  float confidence = 2;
  float duration_seconds = 3;
}

message TranscriptionResult {
  string text = 1;
  bool is_final = 2;
  float confidence = 3;
  repeated WordInfo words = 4;
}

message WordInfo {
  string word = 1;
  float start_time = 2;
  float end_time = 3;
  float confidence = 4;
}

Python — Real-Time ASR Streaming


Code
 
import grpc
import smasr_pb2
import smasr_pb2_grpc

API_KEY = "YOUR_API_KEY"  # placeholder: replace with your key

def stream_microphone_audio():
    """Generator that yields audio chunks from microphone."""
    import sounddevice as sd

    SAMPLE_RATE = 16000
    CHUNK_DURATION = 0.1  # 100ms chunks
    CHUNK_SIZE = int(SAMPLE_RATE * CHUNK_DURATION)  # frames per read

    with sd.InputStream(samplerate=SAMPLE_RATE, channels=1, dtype='int16') as stream:
        chunk_index = 0
        while True:
            audio_data, _ = stream.read(CHUNK_SIZE)
            yield smasr_pb2.AudioChunk(
                audio_data=audio_data.tobytes(),
                chunk_index=chunk_index,  # increases with every chunk sent
                is_final=False
            )
            chunk_index += 1

def realtime_transcribe():
    """Real-time transcription from microphone."""
    channel = grpc.insecure_channel("YOUR_HOST:50052")
    stub = smasr_pb2_grpc.SpeechRecognitionStub(channel)

    metadata = [('authorization', f'Bearer {API_KEY}')]

    responses = stub.StreamingTranscribe(
        stream_microphone_audio(),
        metadata=metadata
    )

    for result in responses:
        if result.is_final:
            print(f"[Final] {result.text} (confidence: {result.confidence:.2f})")
        else:
            print(f"[Partial] {result.text}", end='\r')

realtime_transcribe()

Latency Expectations

Scenario	WebSocket TTFC	gRPC TTFC	Notes
Short text (under 50 chars)	~50-100ms	~100-150ms	WebSocket lowest latency
Medium text (50-200 chars)	~100-150ms	~150-200ms	Optimal for most use cases
Long text (>500 chars)	~150-250ms	~200-300ms	Consider chunking text

TTFC = Time to First Chunk. Actual values depend on hardware (GPU model), network conditions, and server load.
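
Since the figures depend on the deployment, it is worth measuring TTFC in your own environment. A minimal sketch follows; `measure_ttfc` is a hypothetical helper that works with any lazy iterable of audio chunks, such as a generator that sends the request when iteration starts. The `clock` parameter exists only to make the function testable.

```python
import time


def measure_ttfc(chunks, clock=time.perf_counter):
    """Consume audio chunks; return (ttfc_ms, total_ms, total_bytes).

    ttfc_ms is None when the iterable yields no chunks.
    """
    start = clock()
    ttfc_ms = None
    total_bytes = 0
    for chunk in chunks:
        if ttfc_ms is None:
            # First chunk received: this is the time to first chunk.
            ttfc_ms = (clock() - start) * 1000
        total_bytes += len(chunk)
    total_ms = (clock() - start) * 1000
    return ttfc_ms, total_ms, total_bytes
```

Note that the timer starts when `measure_ttfc` is called, so the measurement is only meaningful if the request is issued no earlier than that point.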

Protocol Comparison

Feature	WebSocket	gRPC
Latency	⚡ Lowest (50-100ms)	⚡ Very Low (100-150ms)
Bidirectional	✓ Yes	✓ Yes
Connection overhead	Low (persistent)	Low (persistent)
Browser support	✓ Native	⚠ Requires proxy
Implementation	Medium	Complex
Type safety	Manual (JSON)	✓ Protocol Buffers
Best for	Real-time web apps	Enterprise systems

Best Practices

For Lowest Latency

Use WebSocket for real-time conversational applications and web-based voice bots
Use gRPC for high-performance enterprise systems and MRCP integration
Use pcm format — it skips audio encoding entirely
Use sample_rate: 16000 for telephony, 22050 for general playback
Keep text inputs short (under 200 characters) for real-time conversations
Pre-warm connections — establish WebSocket/gRPC connections before first request
Use connection pooling for high-throughput scenarios

Audio Format Selection

For TTS:

Use pcm format for lowest latency
Use opus for bandwidth-constrained networks
Set sample rate to 16000 for telephony, 22050 for general use

For ASR:

Always send 16-bit PCM at 16kHz
Use mono audio (single channel)
Send audio in 2048-byte chunks for optimal VAD performance
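
If the capture source is stereo, it has to be reduced to one channel before sending. The sketch below averages the two channels using only the standard library. It assumes the input is interleaved 16-bit little-endian PCM (left, right, left, right, ...); `stereo_to_mono` is a hypothetical helper, and sample-rate conversion is out of scope here.

```python
import sys
from array import array


def stereo_to_mono(pcm_stereo):
    """Average interleaved 16-bit stereo samples into a mono stream."""
    samples = array("h")
    samples.frombytes(pcm_stereo)
    if sys.byteorder == "big":
        samples.byteswap()  # input is assumed little-endian
    if len(samples) % 2:
        raise ValueError("stereo data must contain an even number of samples")
    mono = array(
        "h",
        ((samples[i] + samples[i + 1]) // 2 for i in range(0, len(samples), 2)),
    )
    if sys.byteorder == "big":
        mono.byteswap()  # emit little-endian output
    return mono.tobytes()
```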

Connection Management

Handle Connection Loss:


Code
 
import asyncio
import websockets

async def reliable_websocket(uri, max_retries=3):
    """WebSocket with automatic reconnection."""
    for attempt in range(max_retries):
        try:
            async with websockets.connect(uri) as ws:
                # Your WebSocket logic here
                pass
            return  # finished without a dropped connection: stop retrying
        except (websockets.exceptions.ConnectionClosed, OSError):
            # OSError covers refused or unreachable connections
            if attempt < max_retries - 1:
                wait_time = 2 ** attempt
                print(f"Connection lost. Retrying in {wait_time}s...")
                await asyncio.sleep(wait_time)
            else:
                raise

Text Chunking for Conversations


Code
 
def chunk_text_for_streaming(text, max_chars=150):
    """Split text at sentence boundaries for streaming TTS."""
    import re

    # Split on Arabic and English sentence boundaries
    sentences = re.split(r'(?<=[.!?،؟])\s+', text)

    current_chunk = ""
    for sentence in sentences:
        if len(current_chunk) + len(sentence) > max_chars and current_chunk:
            yield current_chunk.strip()
            current_chunk = sentence
        else:
            current_chunk += " " + sentence

    if current_chunk.strip():
        yield current_chunk.strip()

Latency Monitoring


Code
 
import time

class LatencyMonitor:
    """Monitor streaming latency."""
    def __init__(self):
        self.send_time = None
        self.latencies = []

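    def mark_send(self):
        """Mark when request was sent."""
        # perf_counter() is monotonic, so system clock changes cannot skew results
        self.send_time = time.perf_counter()

    def mark_receive(self):
        """Mark when response was received; return latency in ms, or None."""
        if self.send_time is None:
            return None
        latency = (time.perf_counter() - self.send_time) * 1000
        self.latencies.append(latency)
        self.send_time = None
        return latency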

    def get_stats(self):
        """Get latency statistics."""
        if not self.latencies:
            return None
        return {
            'min': min(self.latencies),
            'max': max(self.latencies),
            'avg': sum(self.latencies) / len(self.latencies),
            'p95': sorted(self.latencies)[int(len(self.latencies) * 0.95)]
        }

Troubleshooting

Connection Issues

Problem: WebSocket/gRPC connection fails

Solution:

Check firewall settings — ensure ports 9999, 8088, 50051, 50052 are open
Verify service is running with health check
Test with telnet YOUR_HOST 9999 to check connectivity
Review server logs for connection errors
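
Where `telnet` is not installed, the same reachability check can be scripted. The sketch below only tests whether a TCP connection can be opened on each documented port; it does not verify that the service behind the port is healthy. `port_is_open` and `check_ports` are hypothetical helper names.

```python
import socket

# Ports listed in the Streaming Methods table.
PORTS = {
    "TTS WebSocket": 9999,
    "ASR WebSocket": 8088,
    "TTS gRPC": 50051,
    "ASR gRPC": 50052,
}


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, timed out, unreachable, or name resolution failed.
        return False


def check_ports(host, ports=PORTS):
    """Return a {service name: reachable} mapping for the given host."""
    return {name: port_is_open(host, port) for name, port in ports.items()}
```

For example, `check_ports("YOUR_HOST")` returns a dictionary showing which of the four ports accepted a connection.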

Audio Quality Issues

Problem: Choppy or distorted audio

Solution:

Ensure audio is 16-bit PCM at 16kHz (for ASR)
Check network stability
Increase chunk size to 4096 bytes
Use a buffer to smooth playback
Monitor network jitter and packet loss

No Transcription Results (ASR)

Problem: ASR WebSocket not returning results

Solution:

Verify VAD is detecting speech (check for speech_start events)
Ensure audio volume is adequate
Check audio format (must be LINEAR16, 16kHz, mono)
Send continuous audio stream (don't pause between chunks)
Check for error events in WebSocket messages

Next Steps

gRPC Guide — Complete gRPC API documentation with proto files
Performance — Latency optimization techniques
SDKs — Official client libraries with streaming support
API Limits — Rate limits and streaming constraints

Last modified on February 7, 2026

REST API Reference Models & Specifications

Documentation

Streaming

SM-AI-MODELS supports real-time audio streaming for low-latency applications such as voice bots, IVR systems, and live assistants.

Why Streaming? Standard REST APIs wait for the entire audio to generate before responding. Streaming starts sending audio chunks within milliseconds, dramatically reducing perceived latency.

Streaming Methods

SM-AI-MODELS provides two streaming protocols:

Method	Protocol	Port	Best For
WebSocket	Bidirectional	`9999` (TTS), `8088` (ASR)	Real-time apps, voice bots, live transcription
gRPC	Bidirectional	`50051` (TTS), `50052` (ASR)	High-performance, MRCP, enterprise systems

WebSocket Streaming

SM-AI-MODELS provides WebSocket endpoints for real-time, bidirectional audio streaming with the lowest possible latency.

When to Use WebSockets: Use WebSocket streaming for real-time conversational applications, live transcription, voice bots, and scenarios requiring instant audio feedback.

WebSocket Endpoints

Service	Endpoint	Protocol	Best For
TTS	`ws://host:9999/v1/audio/stream`	Text→Audio	Voice bots, real-time speech synthesis
ASR	`ws://host:8088/v1/audio/stream`	Audio→Text	Live transcription, voice assistants

Text-to-Speech WebSocket

Connection

Code
 
ws://YOUR_HOST:9999/v1/audio/stream

Protocol

Client → Server: JSON messages with synthesis requests

Server → Client:

JSON status messages
Binary PCM audio chunks

Message Flow

Code
 
sequenceDiagram
    Client->>Server: Connect WebSocket
    Server->>Client: Connection accepted
    Client->>Server: JSON: {"text": "...", "voice": "Yara"}
    Server->>Client: JSON: {"type": "synthesis_start"}
    Server->>Client: Binary: PCM audio chunk
    Server->>Client: Binary: PCM audio chunk
    Server->>Client: JSON: {"type": "synthesis_complete"}

Request Format


Code
 
{
  "text": "مرحباً بكم في يونيكود سولوشنز",
  "voice": "Yara",
  "speed": 1.0,
  "sample_rate": 24000,
  "chunk_size": 1024
}

Parameters:

Parameter	Type	Default	Description
`text`	string	required	Text to synthesize (max 5,000 characters)
`voice`	string	`"Yara"`	Voice name: `Yara`, `Nouf`, `Yara_en`
`speed`	float	`1.0`	Speech speed (0.25 - 4.0)
`sample_rate`	integer	`22050`	Output sample rate in Hz (8000, 16000, 22050, 24000)
`chunk_size`	integer	`1024`	Audio chunk size in bytes

Response Events

synthesis_start — Synthesis has begun


Code
 
{
  "type": "synthesis_start",
  "session_id": "abc12345"
}

synthesis_complete — Synthesis finished


Code
 
{
  "type": "synthesis_complete",
  "session_id": "abc12345",
  "total_bytes": 48000,
  "model_id": "primary"
}

error — An error occurred


Code
 
{
  "type": "error",
  "session_id": "abc12345",
  "message": "Text exceeds maximum length"
}

Python WebSocket TTS Client


Code
 
import asyncio
import websockets
import json

async def stream_tts(text, voice="Yara"):
    """Stream TTS via WebSocket and receive audio chunks."""
    uri = "ws://YOUR_HOST:9999/v1/audio/stream"

    async with websockets.connect(uri) as websocket:
        # Send synthesis request
        request = {
            "text": text,
            "voice": voice,
            "sample_rate": 16000,
            "speed": 1.0
        }
        await websocket.send(json.dumps(request))

        # Receive responses
        audio_chunks = []
        async for message in websocket:
            # Check if message is JSON or binary
            if isinstance(message, str):
                event = json.loads(message)
                print(f"Event: {event['type']}")

                if event['type'] == 'synthesis_complete':
                    print(f"Total bytes: {event['total_bytes']}")
                    break
                elif event['type'] == 'error':
                    print(f"Error: {event['message']}")
                    break
            else:
                # Binary audio chunk
                audio_chunks.append(message)

        return b''.join(audio_chunks)

# Usage
audio_data = asyncio.run(stream_tts("مرحباً بكم"))
with open("output.pcm", "wb") as f:
    f.write(audio_data)

JavaScript WebSocket TTS Client


Code
 
const WebSocket = require('ws');

function streamTTS(text, voice = 'Yara') {
  return new Promise((resolve, reject) => {
    const ws = new WebSocket('ws://YOUR_HOST:9999/v1/audio/stream');
    const audioChunks = [];

    ws.on('open', () => {
      // Send synthesis request
      ws.send(JSON.stringify({
        text: text,
        voice: voice,
        sample_rate: 16000,
        speed: 1.0
      }));
    });

    ws.on('message', (data) => {
      if (typeof data === 'string') {
        // JSON event
        const event = JSON.parse(data);
        console.log(`Event: ${event.type}`);

        if (event.type === 'synthesis_complete') {
          console.log(`Total bytes: ${event.total_bytes}`);
          const audioBuffer = Buffer.concat(audioChunks);
          ws.close();
          resolve(audioBuffer);
        } else if (event.type === 'error') {
          reject(new Error(event.message));
        }
      } else {
        // Binary audio chunk
        audioChunks.push(data);
      }
    });

    ws.on('error', (error) => {
      reject(error);
    });
  });
}

// Usage
streamTTS('مرحباً بكم')
  .then(audioData => {
    require('fs').writeFileSync('output.pcm', audioData);
  })
  .catch(console.error);

Multi-Turn Synthesis

The WebSocket connection supports multiple synthesis requests on a single connection:


Code
 
async def multi_turn_tts(texts):
    """Send multiple texts on one WebSocket connection."""
    uri = "ws://YOUR_HOST:9999/v1/audio/stream"

    async with websockets.connect(uri) as websocket:
        for text in texts:
            # Send request
            await websocket.send(json.dumps({
                "text": text,
                "voice": "Yara"
            }))

            # Collect audio for this utterance
            audio_chunks = []
            while True:
                message = await websocket.recv()
                if isinstance(message, str):
                    event = json.loads(message)
                    if event['type'] == 'synthesis_complete':
                        break
                else:
                    audio_chunks.append(message)

            yield b''.join(audio_chunks)

# Usage
texts = ["مرحباً", "كيف حالك؟", "إلى اللقاء"]
for i, audio in enumerate(multi_turn_tts(texts)):
    with open(f"output_{i}.pcm", "wb") as f:
        f.write(audio)

Speech Recognition WebSocket

Connection

Code
 
ws://YOUR_HOST:8088/v1/audio/stream

Legacy endpoint (backward compatibility):

Code
 
ws://YOUR_HOST:8088/audio

Protocol

Client → Server: Binary PCM audio frames (int16, 16kHz, mono)

Server → Client: JSON events (speech detection, transcription results)

Message Flow

Code
 
sequenceDiagram
    Client->>Server: Connect WebSocket
    Server->>Client: Connection accepted
    Client->>Server: Binary: Audio chunk
    Client->>Server: Binary: Audio chunk
    Server->>Client: JSON: {"type": "speech_start"}
    Client->>Server: Binary: Audio chunk
    Server->>Client: JSON: {"type": "transcript", "text": "...", "is_final": false}
    Client->>Server: Binary: Audio chunk (silence)
    Server->>Client: JSON: {"type": "speech_end"}
    Server->>Client: JSON: {"type": "transcript", "text": "...", "is_final": true}

Audio Requirements

Parameter	Value	Description
Encoding	LINEAR16 (int16)	16-bit signed PCM
Sample rate	16,000 Hz	16kHz mono audio
Channels	1 (mono)	Mono audio only
Chunk size	1024-4096 bytes	Recommended 2048 bytes

Response Events

speech_start — Speech detected


Code
 
{
  "type": "speech_start",
  "timestamp_ms": 1234
}

transcript — Transcription result (interim or final)


Code
 
{
  "type": "transcript",
  "text": "مرحباً بكم",
  "is_final": true,
  "confidence": 0.95
}

speech_end — End of speech detected


Code
 
{
  "type": "speech_end",
  "timestamp_ms": 5678
}

false_detection — Audio was too short (not speech)


Code
 
{
  "type": "false_detection",
  "timestamp_ms": 1234
}

error — An error occurred


Code
 
{
  "type": "error",
  "message": "Invalid audio format"
}

Python WebSocket ASR Client


Code
 
import asyncio
import websockets
import pyaudio
import json

async def stream_asr():
    """Stream microphone audio to ASR via WebSocket."""
    uri = "ws://YOUR_HOST:8088/v1/audio/stream"

    # Audio settings
    CHUNK = 2048
    FORMAT = pyaudio.paInt16
    CHANNELS = 1
    RATE = 16000

    # Initialize audio input
    audio = pyaudio.PyAudio()
    stream = audio.open(
        format=FORMAT,
        channels=CHANNELS,
        rate=RATE,
        input=True,
        frames_per_buffer=CHUNK
    )

    async with websockets.connect(uri) as websocket:
        print("Connected. Start speaking...")

        async def send_audio():
            """Read from microphone and send to server."""
            try:
                while True:
                    data = stream.read(CHUNK, exception_on_overflow=False)
                    await websocket.send(data)
                    await asyncio.sleep(0.01)
            except Exception as e:
                print(f"Audio send error: {e}")

        async def receive_results():
            """Receive transcription results."""
            try:
                async for message in websocket:
                    event = json.loads(message)

                    if event['type'] == 'speech_start':
                        print("\n[Speech detected]")

                    elif event['type'] == 'transcript':
                        if event['is_final']:
                            print(f"\n[Final] {event['text']}")
                        else:
                            print(f"[Partial] {event['text']}", end='\r')

                    elif event['type'] == 'speech_end':
                        print("\n[Speech ended]")

                    elif event['type'] == 'error':
                        print(f"\n[Error] {event['message']}")
                        break
            except Exception as e:
                print(f"Receive error: {e}")

        # Run sending and receiving concurrently
        await asyncio.gather(
            send_audio(),
            receive_results()
        )

    stream.stop_stream()
    stream.close()
    audio.terminate()

# Usage
asyncio.run(stream_asr())

JavaScript WebSocket ASR Client


Code
 
const WebSocket = require('ws');
const fs = require('fs');

function streamASR(audioFilePath) {
  return new Promise((resolve, reject) => {
    const ws = new WebSocket('ws://YOUR_HOST:8088/v1/audio/stream');
    const results = [];

    ws.on('open', () => {
      console.log('Connected. Streaming audio...');

      // Read audio file (16-bit PCM, 16kHz, mono)
      const audioData = fs.readFileSync(audioFilePath);
      const chunkSize = 2048;

      // Send audio in chunks
      for (let i = 0; i < audioData.length; i += chunkSize) {
        const chunk = audioData.slice(i, i + chunkSize);
        ws.send(chunk);
      }
    });

    ws.on('message', (data) => {
      const event = JSON.parse(data.toString());

      if (event.type === 'speech_start') {
        console.log('[Speech detected]');
      } else if (event.type === 'transcript') {
        if (event.is_final) {
          console.log(`[Final] ${event.text}`);
          results.push(event.text);
        } else {
          process.stdout.write(`\r[Partial] ${event.text}`);
        }
      } else if (event.type === 'speech_end') {
        console.log('\n[Speech ended]');
        ws.close();
        resolve(results.join(' '));
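      } else if (event.type === 'error') {
        console.error(`[Error] ${event.message}`);
        ws.close();
        reject(new Error(event.message));
      }
    });

    ws.on('error', (error) => {
      reject(error);
    });

    // Settle the promise if the server closes before a speech_end event
    ws.on('close', () => {
      resolve(results.join(' '));
    });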
  });
}

// Usage
streamASR('audio.pcm')
  .then(transcript => {
    console.log(`\nFull transcript: ${transcript}`);
  })
  .catch(console.error);

gRPC Streaming

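gRPC streaming provides high-performance, type-safe streaming, ideal for MRCP servers and enterprise voice applications. As the proto definitions below show, TTS uses server streaming (one request, a stream of audio chunks), while ASR uses bidirectional streaming (audio chunks in, transcription results out).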

TTS gRPC Streaming

Proto Definition

Code
 
syntax = "proto3";

package smtts;

service TextToSpeech {
  // Unary: send text, receive complete audio
  rpc Synthesize (SynthesizeRequest) returns (SynthesizeResponse);

  // Server streaming: send text, receive audio chunks
  rpc StreamingSynthesize (SynthesizeRequest) returns (stream AudioChunk);
}

message SynthesizeRequest {
  string input = 1;
  string voice = 2;         // Yara, Nouf, Yara_en
  string format = 3;        // pcm, opus, mp3
  float speed = 4;          // 0.25 - 4.0
  int32 sample_rate = 5;    // 8000, 16000, 22050, 24000
}

message SynthesizeResponse {
  bytes audio = 1;
  string format = 2;
  int32 sample_rate = 3;
  float duration_seconds = 4;
}

message AudioChunk {
  bytes audio_data = 1;
  int32 chunk_index = 2;
  bool is_final = 3;
}

Python — gRPC Streaming Client


Code
 
import grpc
import smtts_pb2
import smtts_pb2_grpc

API_KEY = "YOUR_API_KEY"  # Placeholder: replace with your key

def grpc_stream_tts(text, voice="Yara"):
    """Stream TTS audio via gRPC, yielding raw audio chunks."""
    # The context manager closes the channel when the stream ends
    with grpc.insecure_channel("YOUR_HOST:50051") as channel:
        stub = smtts_pb2_grpc.TextToSpeechStub(channel)

        request = smtts_pb2.SynthesizeRequest(
            input=text,
            voice=voice,
            format="pcm",
            sample_rate=16000
        )

        # Send the API key as call metadata
        metadata = [('authorization', f'Bearer {API_KEY}')]

        # Receive audio chunks as they are generated
        for chunk in stub.StreamingSynthesize(request, metadata=metadata):
            yield chunk.audio_data

            if chunk.is_final:
                break

# Usage
audio_chunks = []
for chunk in grpc_stream_tts("مرحباً بكم في يونيكود سولوشنز"):
    audio_chunks.append(chunk)

full_audio = b"".join(audio_chunks)
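
The joined bytes above are raw PCM with no header, so most players will not open them directly. A minimal sketch of wrapping them in a WAV container with Python's standard `wave` module follows. It assumes the PCM is 16-bit signed little-endian mono at the requested `sample_rate` (16000 in the example); the sample width and channel count are assumptions, not something the proto states, and `pcm_to_wav_bytes` is a hypothetical helper name.

```python
import io
import wave

def pcm_to_wav_bytes(pcm: bytes, sample_rate: int = 16000,
                     channels: int = 1, sample_width: int = 2) -> bytes:
    """Wrap raw PCM samples in a WAV container and return the file bytes."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(sample_width)  # 2 bytes = 16-bit samples (assumed)
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)
    return buf.getvalue()

if __name__ == "__main__":
    # One second of silence stands in for the streamed audio
    silence = b"\x00\x00" * 16000
    with open("output.wav", "wb") as f:
        f.write(pcm_to_wav_bytes(silence))
```

In practice you would pass `full_audio` and the same `sample_rate` you sent in the request.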

ASR gRPC Streaming

Proto Definition

Code
 
service SpeechRecognition {
  // Unary: send audio file, receive transcription
  rpc Transcribe (TranscribeRequest) returns (TranscribeResponse);

  // Bidirectional streaming: send audio chunks, receive partial results
  rpc StreamingTranscribe (stream AudioChunk) returns (stream TranscriptionResult);
}

message TranscribeRequest {
  bytes audio = 1;
  string format = 2;
}

message TranscribeResponse {
  string text = 1;
  float confidence = 2;
  float duration_seconds = 3;
}

message TranscriptionResult {
  string text = 1;
  bool is_final = 2;
  float confidence = 3;
  repeated WordInfo words = 4;
}

message WordInfo {
  string word = 1;
  float start_time = 2;
  float end_time = 3;
  float confidence = 4;
}

Python — Real-Time ASR Streaming


Code
 
import itertools

import grpc
import sounddevice as sd
import smasr_pb2
import smasr_pb2_grpc

API_KEY = "YOUR_API_KEY"  # Placeholder: replace with your key

def stream_microphone_audio():
    """Generator that yields audio chunks from microphone."""
    SAMPLE_RATE = 16000
    CHUNK_DURATION = 0.1  # 100ms chunks
    CHUNK_SIZE = int(SAMPLE_RATE * CHUNK_DURATION)  # Frames per chunk

    with sd.InputStream(samplerate=SAMPLE_RATE, channels=1, dtype='int16') as stream:
        # Number chunks sequentially so each one carries a distinct index
        for chunk_index in itertools.count():
            audio_data, _ = stream.read(CHUNK_SIZE)  # NumPy int16 array
            yield smasr_pb2.AudioChunk(
                audio_data=audio_data.tobytes(),
                chunk_index=chunk_index,
                is_final=False
            )

def realtime_transcribe():
    """Real-time transcription from microphone."""
    channel = grpc.insecure_channel("YOUR_HOST:50052")
    stub = smasr_pb2_grpc.SpeechRecognitionStub(channel)

    metadata = [('authorization', f'Bearer {API_KEY}')]

    responses = stub.StreamingTranscribe(
        stream_microphone_audio(),
        metadata=metadata
    )

    for result in responses:
        if result.is_final:
            print(f"[Final] {result.text} (confidence: {result.confidence:.2f})")
        else:
            print(f"[Partial] {result.text}", end='\r')

realtime_transcribe()

Latency Expectations

Scenario	WebSocket TTFC	gRPC TTFC	Notes
Short text (under 50 chars)	~50-100ms	~100-150ms	WebSocket has the lowest latency
Medium text (50-200 chars)	~100-150ms	~150-200ms	Optimal for most use cases
Long text (over 500 chars)	~150-250ms	~200-300ms	Consider chunking text

All values are time to first chunk (TTFC): the delay between sending a request and receiving the first audio chunk. Actual values depend on hardware (GPU model), network conditions, and server load.
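
To compare your own deployment against these figures, TTFC can be measured around any chunk iterator. The sketch below is one way to do it, not an official tool: `measure_ttfc` and `fake_stream` are hypothetical names, and `fake_stream` only simulates a server so the example runs without a network connection.

```python
import time
from typing import Iterable, Iterator, Tuple

def measure_ttfc(chunks: Iterable[bytes]) -> Tuple[float, float, int]:
    """Consume a chunk stream; return (ttfc_ms, total_ms, total_bytes)."""
    start = time.perf_counter()
    ttfc_ms = None
    total_bytes = 0
    for chunk in chunks:
        if ttfc_ms is None:
            # First chunk arrived: record time to first chunk
            ttfc_ms = (time.perf_counter() - start) * 1000
        total_bytes += len(chunk)
    total_ms = (time.perf_counter() - start) * 1000
    if ttfc_ms is None:
        raise ValueError("stream produced no chunks")
    return ttfc_ms, total_ms, total_bytes

def fake_stream(first_delay: float = 0.05, n: int = 3) -> Iterator[bytes]:
    """Stand-in for a real TTS stream (simulated, for illustration only)."""
    time.sleep(first_delay)  # Simulated synthesis delay before the first chunk
    for _ in range(n):
        yield b"\x00" * 1024

if __name__ == "__main__":
    ttfc, total, size = measure_ttfc(fake_stream())
    print(f"TTFC: {ttfc:.1f} ms, total: {total:.1f} ms, bytes: {size}")
```

Because Python generators are lazy, passing a generator-based client such as `grpc_stream_tts(text)` starts the request on the first iteration, so the timer covers the full request-to-first-chunk interval.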

Protocol Comparison

Feature	WebSocket	gRPC
Latency	⚡ Lowest (50-100ms)	⚡ Very Low (100-150ms)
Bidirectional	✓ Yes	✓ Yes
Connection overhead	Low (persistent)	Low (persistent)
Browser support	✓ Native	⚠ Requires proxy
Implementation	Medium	Complex
Type safety	Manual (JSON)	✓ Protocol Buffers
Best for	Real-time web apps	Enterprise systems

Best Practices

For Lowest Latency

Use WebSocket for real-time conversational applications and web-based voice bots
Use gRPC for high-performance enterprise systems and MRCP integration
Use `pcm` format, which skips audio encoding entirely
Use `sample_rate: 16000` for telephony, `22050` for general playback
Keep text inputs short (under 200 characters) for real-time conversations
Pre-warm connections: establish WebSocket/gRPC connections before the first request
Use connection pooling for high-throughput scenarios

Audio Format Selection

For TTS:

Use pcm format for lowest latency
Use opus for bandwidth-constrained networks
Set sample rate to 16000 for telephony, 22050 for general use

For ASR:

Always send 16-bit PCM at 16kHz
Use mono audio (single channel)
Send audio in 2048-byte chunks for optimal VAD performance
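
For reference, a 2048-byte chunk of 16-bit, 16kHz mono PCM holds 1024 samples, or 64 ms of audio. The sketch below shows the arithmetic and a simple chunking helper; `chunk_duration_ms` and `iter_chunks` are hypothetical names used for illustration.

```python
def chunk_duration_ms(chunk_bytes: int, sample_rate: int = 16000,
                      sample_width: int = 2, channels: int = 1) -> float:
    """Playback duration of a PCM chunk in milliseconds."""
    bytes_per_second = sample_rate * sample_width * channels
    return chunk_bytes * 1000 / bytes_per_second

def iter_chunks(pcm: bytes, chunk_bytes: int = 2048):
    """Yield successive fixed-size chunks; the last one may be shorter."""
    for offset in range(0, len(pcm), chunk_bytes):
        yield pcm[offset:offset + chunk_bytes]

if __name__ == "__main__":
    print(chunk_duration_ms(2048))  # 64.0
    print(chunk_duration_ms(4096))  # 128.0
```

When streaming a pre-recorded file rather than a live microphone, sleeping for roughly one chunk duration between sends approximates real-time delivery.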

Connection Management

Handle Connection Loss:


Code
 
import asyncio
import websockets

async def reliable_websocket(uri, max_retries=3):
    """WebSocket with automatic reconnection and exponential backoff."""
    for attempt in range(max_retries):
        try:
            async with websockets.connect(uri) as ws:
                # Your WebSocket logic here
                pass
            return  # Session finished cleanly; do not reconnect
        except (websockets.exceptions.ConnectionClosed, OSError):
            # ConnectionClosed: dropped mid-session; OSError: connect failed
            if attempt < max_retries - 1:
                wait_time = 2 ** attempt  # 1s, 2s, 4s, ...
                print(f"Connection lost. Retrying in {wait_time}s...")
                await asyncio.sleep(wait_time)
            else:
                raise

Text Chunking for Conversations


Code
 
import re

def chunk_text_for_streaming(text, max_chars=150):
    """Split text at sentence boundaries for streaming TTS."""
    # Split after ., !, ? and the Arabic comma (،) and question mark (؟)
    sentences = re.split(r'(?<=[.!?،؟])\s+', text.strip())

    current_chunk = ""
    for sentence in sentences:
        # The +1 accounts for the space that joins two sentences
        if current_chunk and len(current_chunk) + 1 + len(sentence) > max_chars:
            yield current_chunk
            current_chunk = sentence
        elif current_chunk:
            current_chunk += " " + sentence
        else:
            current_chunk = sentence

    # A single sentence longer than max_chars is emitted whole
    if current_chunk:
        yield current_chunk

Latency Monitoring


Code
 
import time

class LatencyMonitor:
    """Monitor streaming latency."""
    def __init__(self):
        self.send_time = None
        self.latencies = []

    def mark_send(self):
        """Mark when request was sent."""
        # perf_counter() is monotonic, so it is safe for measuring intervals
        self.send_time = time.perf_counter()

    def mark_receive(self):
        """Mark when response was received; return latency in ms or None."""
        if self.send_time is None:
            return None
        latency = (time.perf_counter() - self.send_time) * 1000
        self.latencies.append(latency)
        self.send_time = None
        return latency

    def get_stats(self):
        """Get latency statistics."""
        if not self.latencies:
            return None
        return {
            'min': min(self.latencies),
            'max': max(self.latencies),
            'avg': sum(self.latencies) / len(self.latencies),
            'p95': sorted(self.latencies)[int(len(self.latencies) * 0.95)]
        }

Troubleshooting

Connection Issues

Problem: WebSocket/gRPC connection fails

Solution:

Check firewall settings: ensure ports 9999, 8088, 50051, and 50052 are open
Verify that the service is running by using its health check
Test connectivity with `telnet YOUR_HOST 9999`
Review the server logs for connection errors
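
Where `telnet` is not installed, the same TCP reachability test can be scripted. The following is a minimal sketch using only the standard library; `port_is_open` and `check_streaming_ports` are hypothetical helper names, and the port numbers are the ones listed on this page. A successful TCP connection only shows that the port is reachable, not that the service behind it is healthy.

```python
import socket

def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures
        return False

def check_streaming_ports(host: str) -> dict:
    """Check the four streaming ports documented on this page."""
    ports = {
        "TTS WebSocket": 9999,
        "ASR WebSocket": 8088,
        "TTS gRPC": 50051,
        "ASR gRPC": 50052,
    }
    return {name: port_is_open(host, port) for name, port in ports.items()}

if __name__ == "__main__":
    for name, ok in check_streaming_ports("127.0.0.1").items():
        print(f"{name}: {'open' if ok else 'unreachable'}")
```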

Audio Quality Issues

Problem: Choppy or distorted audio

Solution:

Ensure audio is 16-bit PCM at 16kHz (for ASR)
Check network stability
Increase chunk size to 4096 bytes
Use a buffer to smooth playback
Monitor network jitter and packet loss
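
One simple way to buffer playback is to hold back audio until a minimum amount has been queued, and to re-fill after an underrun. The class below is an illustrative sketch only: `PlaybackBuffer` is a hypothetical name, and the 6400-byte default (200 ms at 16-bit, 16kHz mono) is an assumed starting point rather than a recommended value.

```python
from collections import deque

class PlaybackBuffer:
    """Hold back playback until a minimum amount of audio is queued."""

    def __init__(self, prefill_bytes: int = 6400):
        self.prefill_bytes = prefill_bytes
        self._chunks = deque()
        self._buffered = 0
        self._started = False

    def push(self, chunk: bytes) -> None:
        """Queue a chunk received from the network."""
        self._chunks.append(chunk)
        self._buffered += len(chunk)
        if self._buffered >= self.prefill_bytes:
            self._started = True

    def pop(self):
        """Return the next chunk, or None while pre-filling or empty."""
        if not self._started or not self._chunks:
            return None
        chunk = self._chunks.popleft()
        self._buffered -= len(chunk)
        if not self._chunks:
            self._started = False  # Underrun: pre-fill again before resuming
        return chunk

    def drain(self):
        """Yield everything left, e.g. once the stream has completed."""
        while self._chunks:
            chunk = self._chunks.popleft()
            self._buffered -= len(chunk)
            yield chunk
        self._started = False
```

A larger pre-fill tolerates more network jitter at the cost of added startup delay, so the value is a trade-off to tune per deployment.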

No Transcription Results (ASR)

Problem: ASR WebSocket not returning results

Solution:

Verify that VAD is detecting speech (check for `speech_start` events)
Ensure the audio volume is adequate
Check the audio format (must be LINEAR16, i.e. 16-bit linear PCM, at 16kHz, mono)
Send a continuous audio stream (don't pause between chunks)
Check for `error` events in the WebSocket messages
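
To check whether audio volume is adequate before sending, the RMS level of a PCM buffer can be computed locally. This sketch assumes 16-bit signed little-endian samples; `pcm16_level_dbfs` is a hypothetical helper, and the level at which the server's VAD triggers is not documented here, so any threshold you compare against is your own assumption.

```python
import array
import math
import sys

def pcm16_level_dbfs(pcm: bytes) -> float:
    """RMS level of 16-bit little-endian PCM in dB relative to full scale."""
    samples = array.array("h")
    # Drop a trailing odd byte so the buffer holds whole samples
    samples.frombytes(pcm[: len(pcm) - len(pcm) % 2])
    if sys.byteorder == "big":
        samples.byteswap()  # Input is little-endian regardless of host
    if not samples:
        return float("-inf")
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    if rms == 0:
        return float("-inf")  # Digital silence
    return 20 * math.log10(rms / 32768)

if __name__ == "__main__":
    quiet = b"\x21\x00" * 1600  # Constant sample value of 33
    print(f"{pcm16_level_dbfs(quiet):.1f} dBFS")
```

A result near 0 dBFS indicates audio close to clipping, while very negative values indicate near-silent input, which is one possible reason for missing `speech_start` events.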

Next Steps

gRPC Guide — Complete gRPC API documentation with proto files
Performance — Latency optimization techniques
SDKs — Official client libraries with streaming support
API Limits — Rate limits and streaming constraints

Last modified on February 7, 2026
