SM-AI-MODELS supports real-time audio streaming for low-latency applications such as voice bots, IVR systems, and live assistants.
Why Streaming? Standard REST APIs respond only after the entire audio has been generated. Streaming begins sending audio chunks within milliseconds, which sharply reduces perceived latency.
Authentication: All streaming endpoints require an API key.
WebSocket: pass the key via header (X-API-Key: YOUR_API_KEY or Authorization: Bearer YOUR_API_KEY) or query parameter (?api_key=YOUR_API_KEY or ?token=YOUR_JWT).
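The two authentication styles can be sketched as small helpers. This is a minimal illustration, assuming the TTS endpoint URL listed in this guide; the helper names `auth_headers` and `url_with_api_key` are hypothetical, and how a headers dict is handed to a WebSocket client depends on the client library and its version.

```python
from urllib.parse import urlencode

TTS_STREAM_URL = "wss://api.withsm.ai/v1/tts/stream"


def auth_headers(api_key: str) -> dict:
    """Header-based authentication using the X-API-Key header."""
    return {"X-API-Key": api_key}


def url_with_api_key(base_url: str, api_key: str) -> str:
    """Query-parameter authentication: append ?api_key=... to the endpoint."""
    return f"{base_url}?{urlencode({'api_key': api_key})}"


print(url_with_api_key(TTS_STREAM_URL, "YOUR_API_KEY"))
# → wss://api.withsm.ai/v1/tts/stream?api_key=YOUR_API_KEY
```

Headers keep the key out of URLs that may end up in logs; the query parameter is useful where custom headers cannot be set, such as browser WebSocket clients.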
SM-AI-MODELS provides WebSocket endpoints for real-time, bidirectional audio streaming with the lowest possible latency.
When to Use WebSockets: Use WebSocket streaming for real-time conversational applications, live transcription, voice bots, and scenarios requiring instant audio feedback.
WebSocket Endpoints
| Service | Endpoint | Protocol | Best For |
|---------|----------|----------|----------|
| TTS | wss://api.withsm.ai/v1/tts/stream | Text→Audio | Voice bots, real-time speech synthesis |
| ASR | wss://api.withsm.ai/v1/asr/stream | Audio→Text | Live transcription, voice assistants |
Text-to-Speech WebSocket
Connection
```
wss://api.withsm.ai/v1/tts/stream
```
Protocol
Client → Server: JSON messages with synthesis requests
The WebSocket connection supports multiple synthesis requests on a single connection:
```python
import asyncio
import json

import websockets


async def multi_turn_tts(texts):
    """Send multiple texts on one WebSocket connection."""
    uri = "wss://api.withsm.ai/v1/tts/stream"
    async with websockets.connect(uri) as websocket:
        # Authenticate first
        await websocket.send(json.dumps({"x-api-key": "YOUR_API_KEY"}))

        for text in texts:
            # Send request
            await websocket.send(json.dumps({"text": text, "voice": "Yara"}))

            # Collect audio for this utterance: binary frames carry audio,
            # text frames carry JSON events
            audio_chunks = []
            while True:
                message = await websocket.recv()
                if isinstance(message, str):
                    event = json.loads(message)
                    if event.get("type") == "synthesis_complete":
                        break
                else:
                    audio_chunks.append(message)

            yield b"".join(audio_chunks)


async def main():
    # Usage: "Hello", "How are you?", "Goodbye"
    texts = ["مرحباً", "كيف حالك؟", "إلى اللقاء"]
    i = 0
    # multi_turn_tts is an async generator, so it must be consumed with `async for`
    async for audio in multi_turn_tts(texts):
        with open(f"output_{i}.pcm", "wb") as f:
            f.write(audio)
        i += 1


asyncio.run(main())
```
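The receive loop above separates binary audio frames from JSON event frames. The same logic can be sketched as a pure function that is easy to test without a network connection. The function name `collect_utterance` and the simulated frames are illustrative assumptions; only the `synthesis_complete` event type comes from the example above.

```python
import json


def collect_utterance(frames):
    """Return (audio_bytes, events) for one utterance.

    `frames` is any iterable of received WebSocket messages: bytes for audio,
    str for JSON events. Reading stops at the first event whose "type" is
    "synthesis_complete".
    """
    audio = bytearray()
    events = []
    for frame in frames:
        if isinstance(frame, str):
            event = json.loads(frame)
            events.append(event)
            if event.get("type") == "synthesis_complete":
                break
        else:
            audio.extend(frame)
    return bytes(audio), events


# Simulated frames; real ones would come from websocket.recv().
frames = [b"\x00\x01", b"\x02\x03", json.dumps({"type": "synthesis_complete"})]
audio, events = collect_utterance(frames)
print(len(audio), events[-1]["type"])
# → 4 synthesis_complete
```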
Speech Recognition WebSocket
Connection
```
wss://api.withsm.ai/v1/asr/stream
```
Legacy endpoint (backward compatibility):
```
wss://api.withsm.ai/v1/asr/audio
```
Connection Parameters
All ASR WebSocket parameters are passed as URL query parameters (not headers, not JSON messages):
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| language | string | auto | Language routing: ar (Arabic), en (English), or auto (auto-detect). Explicit values skip language classification (~50–100ms faster). |
| sample_rate | integer | 16000 | Client audio sample rate in Hz. Allowed: 8000, 16000, 22050, 44100, 48000. Audio is resampled server-side to 16kHz if different. |
| request_id | string | auto-generated | Client-provided request ID for end-to-end tracing. Format: alphanumeric, dash, underscore (1–64 chars). Server generates an 8-char UUID if omitted. |
| api_key | string | (none) | API key (alternative to X-API-Key header). |
| token | string | (none) | JWT token (alternative to Authorization: Bearer header). |
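Because every ASR WebSocket option travels in the query string, it can help to build and validate the URL in one place. The sketch below assumes only the parameter names and allowed values listed above; the helper name `build_asr_url` and its client-side validation are illustrative and not part of the API.

```python
import re
from urllib.parse import urlencode

ASR_STREAM_URL = "wss://api.withsm.ai/v1/asr/stream"
ALLOWED_LANGUAGES = {"ar", "en", "auto"}
ALLOWED_SAMPLE_RATES = {8000, 16000, 22050, 44100, 48000}
# Alphanumeric, dash, underscore, 1-64 characters
REQUEST_ID_PATTERN = re.compile(r"[A-Za-z0-9_-]{1,64}")


def build_asr_url(api_key, language="auto", sample_rate=16000, request_id=None):
    """Build the ASR WebSocket URL, rejecting values the server would not accept."""
    if language not in ALLOWED_LANGUAGES:
        raise ValueError(f"unsupported language: {language!r}")
    if sample_rate not in ALLOWED_SAMPLE_RATES:
        raise ValueError(f"unsupported sample_rate: {sample_rate!r}")
    params = {"language": language, "sample_rate": sample_rate, "api_key": api_key}
    if request_id is not None:
        if not REQUEST_ID_PATTERN.fullmatch(request_id):
            raise ValueError(f"invalid request_id: {request_id!r}")
        params["request_id"] = request_id
    return f"{ASR_STREAM_URL}?{urlencode(params)}"


print(build_asr_url("YOUR_API_KEY", language="ar", request_id="call-42"))
# → wss://api.withsm.ai/v1/asr/stream?language=ar&sample_rate=16000&api_key=YOUR_API_KEY&request_id=call-42
```

Setting `language` explicitly rather than leaving it on `auto` skips language classification, as noted in the table.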
```python
import grpc
import sounddevice as sd

import asr_pb2
import asr_pb2_grpc

API_KEY = "YOUR_API_KEY"


def request_iterator():
    """First yield the RecognitionConfig, then microphone audio chunks."""
    SAMPLE_RATE = 16000
    CHUNK_DURATION = 0.1  # 100ms chunks
    CHUNK_SIZE = int(SAMPLE_RATE * CHUNK_DURATION)

    # 1. Send config FIRST (required by the server before any audio)
    yield asr_pb2.RecognizeRequest(
        config=asr_pb2.RecognitionConfig(
            encoding=asr_pb2.LINEAR16,
            sample_rate_hz=SAMPLE_RATE,
            channels=1,
            language_code="ar",  # "ar", "en", or "" for auto-detect
            enable_interim_results=True,
            enable_word_timestamps=False,
            vad_config=asr_pb2.VadConfig(enabled=True),
        )
    )

    # 2. Stream audio chunks
    with sd.InputStream(samplerate=SAMPLE_RATE, channels=1, dtype="int16") as stream:
        while True:
            audio_data, _ = stream.read(CHUNK_SIZE)
            yield asr_pb2.RecognizeRequest(audio_content=audio_data.tobytes())


def realtime_transcribe():
    """Real-time transcription from microphone via gRPC bidi stream."""
    channel = grpc.secure_channel("api.withsm.ai:9101", grpc.ssl_channel_credentials())
    stub = asr_pb2_grpc.SpeechRecognitionStub(channel)

    # Pass API key + (optional) request_id as gRPC metadata
    metadata = [
        ("x-api-key", API_KEY),
        ("x-request-id", "grpc-mic-session-1"),
    ]

    responses = stub.Recognize(request_iterator(), metadata=metadata)
    for response in responses:
        for result in response.results:
            if result.alternatives:
                alt = result.alternatives[0]
                tag = "[Final]" if result.is_final else "[Partial]"
                print(f"{tag} {alt.transcript} (confidence: {alt.confidence:.2f})")


realtime_transcribe()
```
Note: Send RecognitionConfig as the first RecognizeRequest, then audio bytes in subsequent messages. The server rejects audio that arrives before config. See grpc_service/protos/asr.proto for the full schema.
Latency Expectations
| Scenario | WebSocket | gRPC | Notes |
|----------|-----------|------|-------|
| Short text (under 50 chars) | ~50-100ms | ~100-150ms | WebSocket lowest latency |
| Medium text (50-200 chars) | ~100-150ms | ~150-200ms | Optimal for most use cases |
| Long text (>500 chars) | ~150-250ms | ~200-300ms | Consider chunking text |
Latencies are TTFC (Time to First Chunk). Actual values depend on hardware (GPU model), network conditions, and server load.
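Since these figures vary with hardware, network, and load, it is worth measuring TTFC in your own environment. The sketch below times the arrival of the first chunk from any iterable of audio chunks; `measure_ttfc` and `fake_stream` are hypothetical names, and the fake stream stands in for a real streaming response.

```python
import time


def measure_ttfc(chunks):
    """Return (ttfc_seconds, total_bytes) for an iterable of audio chunks.

    ttfc_seconds is None if the stream produced no chunks at all.
    """
    start = time.perf_counter()
    ttfc = None
    total = 0
    for chunk in chunks:
        if ttfc is None:
            ttfc = time.perf_counter() - start  # first chunk has arrived
        total += len(chunk)
    return ttfc, total


def fake_stream(delay=0.05, n=3):
    """Simulated stream: waits `delay` seconds, then yields n chunks of 320 bytes."""
    time.sleep(delay)
    for _ in range(n):
        yield b"\x00" * 320


ttfc, total = measure_ttfc(fake_stream())
print(f"TTFC: {ttfc * 1000:.0f} ms, received {total} bytes")
```

For an async WebSocket client the same idea applies: record the time just before sending the request and again when the first binary frame is received.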
Protocol Comparison
| Feature | WebSocket | gRPC |
|---------|-----------|------|
| Latency | ⚡ Lowest (50-100ms) | ⚡ Very Low (100-150ms) |
| Bidirectional | ✓ Yes | ✓ Yes |
| Connection overhead | Low (persistent) | Low (persistent) |
| Browser support | ✓ Native | ⚠ Requires proxy |
| Implementation | Medium | Complex |
| Type safety | Manual (JSON) | ✓ Protocol Buffers |
| Best for | Real-time web apps | Enterprise systems |
Best Practices
For Lowest Latency
- Use WebSocket for real-time conversational applications and web-based voice bots
- Use gRPC for high-performance enterprise systems and MRCP integration
- Use pcm format; it skips audio encoding entirely
- Use sample_rate: 16000 for telephony, 22050 for general playback
- Keep text inputs short (under 200 characters) for real-time conversations
- Pre-warm connections: establish WebSocket/gRPC connections before the first request
- Use connection pooling for high-throughput scenarios
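Pre-warming and pooling can be combined: open a fixed number of connections at startup, then hand them out per request. The following is a minimal sketch, not a production pool; `ConnectionPool` is a hypothetical class, `connect` is any async callable that opens one connection, and the demo uses a fake connector in place of a real WebSocket or gRPC channel. It does not handle broken connections or health checks.

```python
import asyncio


class ConnectionPool:
    """Tiny pool of pre-opened connections (illustrative only)."""

    def __init__(self, connect, size):
        self._connect = connect  # async callable that opens one connection
        self._size = size
        self._idle = None  # created in warm_up(), inside the running event loop

    async def warm_up(self):
        """Open all connections up front so the first request pays no handshake cost."""
        self._idle = asyncio.Queue()
        for _ in range(self._size):
            self._idle.put_nowait(await self._connect())

    async def acquire(self):
        """Wait for an idle connection and hand it out."""
        return await self._idle.get()

    def release(self, conn):
        """Return a connection to the pool for reuse."""
        self._idle.put_nowait(conn)


async def demo():
    opened = []

    async def fake_connect():  # stand-in for a real WebSocket or gRPC connect call
        opened.append(f"conn-{len(opened) + 1}")
        return opened[-1]

    pool = ConnectionPool(fake_connect, size=2)
    await pool.warm_up()          # both connections exist before any request
    conn = await pool.acquire()
    pool.release(conn)
    return opened, conn


print(asyncio.run(demo()))
```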
Audio Format Selection
For TTS:
- Use pcm format for lowest latency
- Use opus for bandwidth-constrained networks
- Set sample rate to 16000 for telephony, 22050 for general use

For ASR:
- Send 16-bit PCM, ideally at 16kHz (other supported sample rates are resampled server-side to 16kHz)
- Use mono audio (single channel)
- Send audio in 2048-byte chunks for optimal VAD performance
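To make the ASR chunk size concrete: with 16-bit mono PCM at 16kHz, a 2048-byte chunk holds 1024 samples, which is 64 ms of audio. The sketch below computes that duration and slices a PCM buffer into chunks of that size; the function names are illustrative.

```python
def chunk_duration_ms(chunk_bytes=2048, sample_rate=16000, bytes_per_sample=2, channels=1):
    """Duration in milliseconds of one PCM chunk of the given size."""
    samples = chunk_bytes / (bytes_per_sample * channels)
    return 1000 * samples / sample_rate


def iter_chunks(pcm, chunk_bytes=2048):
    """Yield consecutive slices of `pcm`; the last one may be shorter."""
    for offset in range(0, len(pcm), chunk_bytes):
        yield pcm[offset:offset + chunk_bytes]


print(chunk_duration_ms())  # → 64.0
print([len(c) for c in iter_chunks(b"\x00" * 5000)])  # → [2048, 2048, 904]
```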
Connection Management
Handle Connection Loss:
```python
import asyncio

import websockets


async def reliable_websocket(uri, max_retries=3):
    """WebSocket with automatic reconnection."""
    for attempt in range(max_retries):
        try:
            async with websockets.connect(uri) as ws:
                # Your WebSocket logic here
                pass
            return  # Session finished cleanly, so do not reconnect
        except (websockets.exceptions.ConnectionClosed, OSError):
            # ConnectionClosed: dropped mid-session; OSError: could not connect
            if attempt < max_retries - 1:
                wait_time = 2 ** attempt  # Exponential backoff: 1s, 2s, 4s, ...
                print(f"Connection lost. Retrying in {wait_time}s...")
                await asyncio.sleep(wait_time)
            else:
                raise
```
Text Chunking for Conversations
```python
import re


def chunk_text_for_streaming(text, max_chars=150):
    """Split text at sentence boundaries for streaming TTS."""
    # Split after English and Arabic punctuation (. ! ? ، ؟) followed by whitespace
    sentences = re.split(r'(?<=[.!?،؟])\s+', text)

    current_chunk = ""
    for sentence in sentences:
        if len(current_chunk) + len(sentence) > max_chars and current_chunk:
            yield current_chunk.strip()
            current_chunk = sentence
        else:
            current_chunk += " " + sentence

    if current_chunk.strip():
        yield current_chunk.strip()
```