SM-AI-MODELS supports real-time audio streaming for low-latency applications such as voice bots, IVR systems, and live assistants.
Why Streaming? A standard REST API waits for the entire audio file to be generated before responding. Streaming starts delivering audio chunks within milliseconds, dramatically reducing perceived latency.
Streaming Methods
SM-AI-MODELS provides two streaming protocols:
| Method | Protocol | Port | Best For |
|--------|----------|------|----------|
| WebSocket | Bidirectional | 9999 (TTS), 8088 (ASR) | Real-time apps, voice bots, live transcription |
| gRPC | Bidirectional | 50051 (TTS), 50052 (ASR) | High-performance, MRCP, enterprise systems |
WebSocket Streaming
SM-AI-MODELS provides WebSocket endpoints for real-time, bidirectional audio streaming with the lowest possible latency.
When to Use WebSockets: Use WebSocket streaming for real-time conversational applications, live transcription, voice bots, and scenarios requiring instant audio feedback.
WebSocket Endpoints
| Service | Endpoint | Direction | Best For |
|---------|----------|-----------|----------|
| TTS | `ws://host:9999/v1/audio/stream` | Text → Audio | Voice bots, real-time speech synthesis |
| ASR | `ws://host:8088/v1/audio/stream` | Audio → Text | Live transcription, voice assistants |
Text-to-Speech WebSocket
Connection:

```
ws://YOUR_HOST:9999/v1/audio/stream
```

Protocol:

- Client → Server: JSON messages with synthesis requests
- Server → Client: binary audio chunks, followed by a JSON `synthesis_complete` event per utterance
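For orientation, here is a minimal single-request sketch. It uses only the message shapes that appear in the multi-turn example below (`text`, `voice`, binary audio frames, and the `synthesis_complete` event); `YOUR_HOST` is a placeholder.

```python
import asyncio
import json

import websockets

async def synthesize_once(text: str) -> bytes:
    """Send one synthesis request and collect the resulting audio."""
    uri = "ws://YOUR_HOST:9999/v1/audio/stream"
    async with websockets.connect(uri) as ws:
        await ws.send(json.dumps({"text": text, "voice": "Yara"}))
        audio_chunks = []
        while True:
            message = await ws.recv()
            if isinstance(message, str):
                # JSON status event; synthesis_complete ends the utterance
                if json.loads(message)["type"] == "synthesis_complete":
                    break
            else:
                audio_chunks.append(message)  # binary audio frame
        return b"".join(audio_chunks)

audio = asyncio.run(synthesize_once("مرحباً"))
```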
A single WebSocket connection can carry multiple synthesis requests in sequence:
```python
import asyncio
import json

import websockets

async def multi_turn_tts(texts):
    """Send multiple texts on one WebSocket connection."""
    uri = "ws://YOUR_HOST:9999/v1/audio/stream"
    async with websockets.connect(uri) as websocket:
        for text in texts:
            # Send request
            await websocket.send(json.dumps({
                "text": text,
                "voice": "Yara"
            }))

            # Collect audio for this utterance
            audio_chunks = []
            while True:
                message = await websocket.recv()
                if isinstance(message, str):
                    event = json.loads(message)
                    if event['type'] == 'synthesis_complete':
                        break
                else:
                    audio_chunks.append(message)

            yield b''.join(audio_chunks)

# Usage: multi_turn_tts is an async generator, so iterate with `async for`
async def main():
    texts = ["مرحباً", "كيف حالك؟", "إلى اللقاء"]
    index = 0
    async for audio in multi_turn_tts(texts):
        with open(f"output_{index}.pcm", "wb") as f:
            f.write(audio)
        index += 1

asyncio.run(main())
```
Speech Recognition gRPC

The ASR service exposes a bidirectional gRPC stream on port 50052: the `StreamingTranscribe` RPC consumes a stream of audio chunks and returns partial and final transcription results. The following example streams microphone audio:

```python
import grpc
import smasr_pb2
import smasr_pb2_grpc

API_KEY = "YOUR_API_KEY"

def stream_microphone_audio():
    """Generator that yields audio chunks from the microphone."""
    import sounddevice as sd

    SAMPLE_RATE = 16000
    CHUNK_DURATION = 0.1  # 100ms chunks
    CHUNK_SIZE = int(SAMPLE_RATE * CHUNK_DURATION)

    with sd.InputStream(samplerate=SAMPLE_RATE, channels=1, dtype='int16') as stream:
        while True:
            audio_data, _ = stream.read(CHUNK_SIZE)
            yield smasr_pb2.AudioChunk(
                audio_data=audio_data.tobytes(),
                chunk_index=0,
                is_final=False
            )

def realtime_transcribe():
    """Real-time transcription from the microphone."""
    channel = grpc.insecure_channel("YOUR_HOST:50052")
    stub = smasr_pb2_grpc.SpeechRecognitionStub(channel)
    metadata = [('authorization', f'Bearer {API_KEY}')]

    responses = stub.StreamingTranscribe(
        stream_microphone_audio(),
        metadata=metadata
    )

    for result in responses:
        if result.is_final:
            print(f"[Final] {result.text} (confidence: {result.confidence:.2f})")
        else:
            print(f"[Partial] {result.text}", end='\r')

realtime_transcribe()
```
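The tables above also list a TTS gRPC service on port 50051, but no stubs for it appear on this page. The sketch below therefore assumes hypothetical `smtts_pb2` stub names patterned on the ASR stubs; the service, message, and RPC names are all placeholders to be replaced with your generated identifiers.

```python
import grpc

# HYPOTHETICAL stub modules, named by analogy with smasr_pb2;
# substitute the identifiers from your generated TTS stubs.
import smtts_pb2
import smtts_pb2_grpc

API_KEY = "YOUR_API_KEY"

def grpc_tts(text: str) -> bytes:
    """Collect streamed TTS audio over gRPC (sketch with assumed names)."""
    channel = grpc.insecure_channel("YOUR_HOST:50051")
    stub = smtts_pb2_grpc.SpeechSynthesisStub(channel)  # assumed service name
    metadata = [('authorization', f'Bearer {API_KEY}')]

    request = smtts_pb2.SynthesisRequest(text=text, voice="Yara")  # assumed message
    audio = b''
    for chunk in stub.StreamingSynthesize(request, metadata=metadata):  # assumed RPC
        audio += chunk.audio_data
    return audio
```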
Latency Expectations
| Scenario | WebSocket | gRPC | Notes |
|----------|-----------|------|-------|
| Short text (<50 chars) | ~50-100ms | ~100-150ms | WebSocket lowest latency |
| Medium text (50-200 chars) | ~100-150ms | ~150-200ms | Optimal for most use cases |
| Long text (>500 chars) | ~150-250ms | ~200-300ms | Consider chunking text |
All values are TTFC (Time to First Chunk); actual numbers depend on hardware (GPU model), network conditions, and server load.
Protocol Comparison
| Feature | WebSocket | gRPC |
|---------|-----------|------|
| Latency | ⚡ Lowest (50-100ms) | ⚡ Very low (100-150ms) |
| Bidirectional | ✓ Yes | ✓ Yes |
| Connection overhead | Low (persistent) | Low (persistent) |
| Browser support | ✓ Native | ⚠ Requires proxy |
| Implementation | Medium | Complex |
| Type safety | Manual (JSON) | ✓ Protocol Buffers |
| Best for | Real-time web apps | Enterprise systems |
Best Practices
For Lowest Latency
- Use WebSocket for real-time conversational applications and web-based voice bots
- Use gRPC for high-performance enterprise systems and MRCP integration
- Use the `pcm` format; it skips audio encoding entirely
- Use `sample_rate: 16000` for telephony, 22050 for general playback
- Keep text inputs short (under 200 characters) for real-time conversations
- Pre-warm connections: establish WebSocket/gRPC connections before the first request (see the sketch after this list)
- Use connection pooling for high-throughput scenarios
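A minimal pre-warming and pooling sketch, assuming the `websockets` library and the TTS endpoint above; the pool class, its size, and the hand-out policy are illustrative choices, not part of the SM-AI-MODELS API.

```python
import asyncio

import websockets

TTS_URI = "ws://YOUR_HOST:9999/v1/audio/stream"

class WebSocketPool:
    """Illustrative pool: open N connections up front, hand them out on demand."""

    def __init__(self, uri: str, size: int = 4):
        self.uri = uri
        self.size = size
        self._pool: asyncio.Queue = asyncio.Queue()

    async def warm_up(self):
        # Pay the TCP + WebSocket handshake cost before the first real request
        for _ in range(self.size):
            await self._pool.put(await websockets.connect(self.uri))

    async def acquire(self):
        return await self._pool.get()

    async def release(self, ws):
        await self._pool.put(ws)

async def main():
    pool = WebSocketPool(TTS_URI)
    await pool.warm_up()       # connections are ready before any request
    ws = await pool.acquire()  # no connect latency on the hot path
    # ... send synthesis requests on ws ...
    await pool.release(ws)

asyncio.run(main())
```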
Audio Format Selection
For TTS:
- Use the `pcm` format for lowest latency
- Use `opus` for bandwidth-constrained networks
- Set the sample rate to 16000 for telephony, 22050 for general use (sketched below)
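The TTS requests shown earlier set only `text` and `voice`. This page does not spell out the parameter names for format and sample rate, so the `format` and `sample_rate` fields below are assumed names, shown only to illustrate the choice:

```python
import json

# NOTE: "format" and "sample_rate" are assumed field names, not confirmed
# by this page; check the TTS API reference for the actual request schema.
request = json.dumps({
    "text": "مرحباً",
    "voice": "Yara",
    "format": "pcm",       # raw PCM skips encoding for the lowest latency
    "sample_rate": 16000,  # 16 kHz for telephony playback
})
# await websocket.send(request)
```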
For ASR:
- Always send 16-bit PCM at 16 kHz
- Use mono audio (single channel)
- Send audio in 2048-byte chunks for optimal VAD performance (see the sketch after this list)
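A sketch of the chunking rule against the ASR WebSocket endpoint from the table above. Sending raw binary frames is an assumption here; this page does not document the ASR WebSocket message protocol or the shape of returned results, so the receive side is illustrative only.

```python
import asyncio

import websockets

async def stream_pcm_file(path: str):
    """Send a 16-bit, 16 kHz, mono PCM file in 2048-byte chunks."""
    uri = "ws://YOUR_HOST:8088/v1/audio/stream"
    async with websockets.connect(uri) as ws:
        with open(path, "rb") as f:
            # 2048-byte chunks, per the VAD recommendation above
            while chunk := f.read(2048):
                await ws.send(chunk)
        # Assumed: the server replies with transcription events; the actual
        # event format is not specified on this page.
        print(await ws.recv())

asyncio.run(stream_pcm_file("audio.pcm"))
```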
Connection Management
Handle Connection Loss:
```python
import asyncio

import websockets

async def reliable_websocket(uri, max_retries=3):
    """WebSocket with automatic reconnection."""
    for attempt in range(max_retries):
        try:
            async with websockets.connect(uri) as ws:
                # Your WebSocket logic here
                pass
            return  # session finished cleanly; stop retrying
        except websockets.exceptions.ConnectionClosed:
            if attempt < max_retries - 1:
                wait_time = 2 ** attempt  # exponential backoff: 1s, 2s, 4s, ...
                print(f"Connection lost. Retrying in {wait_time}s...")
                await asyncio.sleep(wait_time)
            else:
                raise
```
Text Chunking for Conversations
```python
import re

def chunk_text_for_streaming(text, max_chars=150):
    """Split text at sentence boundaries for streaming TTS."""
    # Split on Arabic and English sentence boundaries
    sentences = re.split(r'(?<=[.!?،؟])\s+', text)

    current_chunk = ""
    for sentence in sentences:
        if len(current_chunk) + len(sentence) > max_chars and current_chunk:
            yield current_chunk.strip()
            current_chunk = sentence
        else:
            current_chunk += " " + sentence

    if current_chunk.strip():
        yield current_chunk.strip()
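```

The chunker composes directly with the `multi_turn_tts` generator defined earlier on this page; for example:

```python
import asyncio

async def speak_long_text(long_text: str):
    """Chunk a long text, then synthesize each chunk on one connection."""
    chunks = list(chunk_text_for_streaming(long_text, max_chars=150))
    index = 0
    async for audio in multi_turn_tts(chunks):
        with open(f"chunk_{index}.pcm", "wb") as f:
            f.write(audio)
        index += 1

asyncio.run(speak_long_text("Your long paragraph of text here..."))
```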