Maximum file size: contact your administrator for the specific limit.

Optimization tips:

- Use FLAC for the best compression without quality loss.
- Compress audio before uploading if using WAV.
- Remove silence from the beginning and end of recordings (see the sketch after this list).
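For the last two tips, a minimal sketch using pydub (an assumed dependency; it requires ffmpeg) that trims leading and trailing silence and re-encodes a WAV recording as FLAC before upload:

```python
from pydub import AudioSegment
from pydub.silence import detect_leading_silence

def trim_silence(sound: AudioSegment, threshold_dbfs: float = -50.0) -> AudioSegment:
    """Strip silence from both ends of an AudioSegment."""
    start = detect_leading_silence(sound, silence_threshold=threshold_dbfs)
    end = detect_leading_silence(sound.reverse(), silence_threshold=threshold_dbfs)
    return sound[start:len(sound) - end]

# Trim a WAV recording and save it as FLAC before uploading
sound = trim_silence(AudioSegment.from_wav("recording.wav"))
sound.export("recording.flac", format="flac")
```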
## Timeout Recommendations

| Operation | Recommended Timeout | Notes |
|---|---|---|
| TTS | 30 seconds | Longer for large text |
| ASR | 60 seconds | Depends on file size |
| Health check | 5 seconds | Quick response expected |
Example with timeout:

```python
import requests

# TTS with timeout
response = requests.post(
    "http://localhost:9999/v1/audio/speech",
    json={"input": "مرحباً بكم"},
    timeout=30,
)

# ASR with timeout
with open("audio.wav", "rb") as f:
    response = requests.post(
        "http://localhost:8088/v1/audio/transcriptions",
        files={"file": f},
        timeout=60,
    )
```
## Performance & Latency

Optimize SM-AI-MODELS for real-time voice applications, IVR systems, and production workloads.
### Latency Benchmarks

#### TTS Latency (SM-TTS-V1)

Measured on an NVIDIA A100 (40GB) with the default configuration:
| Input Length | TTFC (gRPC) | TTFC (HTTP) | Total Generation | Notes |
|---|---|---|---|---|
| Short (under 50 chars) | ~150ms | ~200ms | ~400ms | Single sentence |
| Medium (50-200 chars) | ~200ms | ~300ms | ~800ms | Short paragraph |
| Long (200-500 chars) | ~250ms | ~350ms | ~1.5s | Full paragraph |
| Very long (500-2000 chars) | ~300ms | ~400ms | ~3-5s | Multiple paragraphs |
TTFC = Time to First Chunk (streaming mode). This is the most critical metric for real-time applications — it determines how quickly the user hears the first audio.
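To verify TTFC in your own environment, a hedged sketch that times the first streamed chunk over HTTP (it assumes the endpoint streams the response body; the payload mirrors the examples above):

```python
import time
import requests

start = time.monotonic()
with requests.post(
    "http://localhost:9999/v1/audio/speech",
    json={"input": "مرحباً بكم"},
    stream=True,   # read the body incrementally as it arrives
    timeout=30,
) as response:
    response.raise_for_status()
    next(response.iter_content(chunk_size=4096))  # block until the first chunk
print(f"TTFC: {(time.monotonic() - start) * 1000:.0f} ms")
```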
#### ASR Latency (SM-STT-V1)

| Audio Duration | Processing Time (REST) | Real-time Factor | Notes |
|---|---|---|---|
| 1 second | ~300ms | 0.3x | Near real-time |
| 5 seconds | ~800ms | 0.16x | Fast |
| 30 seconds | ~3s | 0.1x | Optimal segment length |
| 60 seconds | ~6s | 0.1x | Acceptable |
| 300 seconds (max) | ~25-30s | 0.08-0.1x | Use streaming for long audio |
#### gRPC ASR Streaming Latency

| Metric | Value | Notes |
|---|---|---|
| Partial result latency | ~100-200ms | From speech to first partial transcript |
| Final result latency | ~300-500ms | From end of speech to confirmed transcript |
| End-of-utterance detection | ~500-800ms | Silence-based endpoint detection |
### Optimization Guide

#### 1. Choose the Right Protocol

| Protocol | TTFC | Throughput | Best For |
|---|---|---|---|
| gRPC Streaming | ⚡ ~150ms | High | MRCP, voice bots, real-time apps |
| HTTP Streaming | 🔶 ~200-300ms | Medium | Web apps, simple integrations |
| HTTP (non-streaming) | 🔴 Full wait | Medium | Batch processing, file generation |
#### 2. Choose the Right Audio Format

| Format | Encoding Overhead | File Size | Best For |
|---|---|---|---|
| `pcm` | ⚡ None | Large | Lowest latency, telephony (8kHz/16kHz) |
| `opus` | ⚡ Minimal | Small | WebRTC, streaming, bandwidth-constrained |
| `mp3` | 🔶 ~20-50ms | Medium | Web playback, downloads |
| `wav` | 🔶 Minimal | Large | Editing, archival, high quality |
| `flac` | 🔴 ~30-60ms | Medium | Lossless archival |
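To request a specific output format, a minimal sketch; the `response_format` field name follows the OpenAI-style speech API convention and is an assumption about this deployment:

```python
import requests

# Hypothetical: request Opus output for a bandwidth-constrained client
response = requests.post(
    "http://localhost:9999/v1/audio/speech",
    json={"input": "مرحباً بكم", "voice": "Yara", "response_format": "opus"},
    timeout=30,
)
with open("speech.opus", "wb") as f:
    f.write(response.content)
```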
#### 3. Optimize Sample Rate

| Sample Rate | Use Case | Quality | Latency Impact |
|---|---|---|---|
| 8,000 Hz | Telephony (G.711) | Acceptable | ⚡ Fastest |
| 16,000 Hz | Telephony (wideband), voice bots | Good | ⚡ Fast |
| 22,050 Hz | General playback (default) | High | 🔶 Default |
| 24,000 Hz | High-quality applications | Highest | 🔶 Slightly slower |
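For telephony pipelines, pairing `pcm` with a low sample rate minimizes both encoding overhead and synthesis time. A sketch, assuming a `sample_rate` request field (hypothetical, like `response_format` above):

```python
import requests

# Hypothetical: 8 kHz PCM output for a G.711 telephony pipeline
response = requests.post(
    "http://localhost:9999/v1/audio/speech",
    json={
        "input": "مرحباً بكم",
        "voice": "Yara",
        "response_format": "pcm",  # assumed field, see note above
        "sample_rate": 8000,       # assumed field
    },
    timeout=30,
)
```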
#### 4. Optimize Text Input

```python
# ❌ Bad: Sending entire page at once — high TTFC, high latency
long_text = "مرحباً بكم في يونيكود سولوشنز... (2000+ characters)"
response = generate_speech(long_text)  # Waits 3-5 seconds

# ✅ Good: Stream sentence by sentence — fast TTFC per segment
sentences = split_into_sentences(long_text)
for sentence in sentences:
    for chunk in stream_tts(sentence):  # TTFC ~200ms per sentence
        play_audio(chunk)
```
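The example above assumes a `split_into_sentences` helper. A minimal sketch that splits on Latin and Arabic sentence-final punctuation (a naive approach, not the service's own segmenter):

```python
import re

def split_into_sentences(text: str) -> list[str]:
    """Naively split text after sentence-final punctuation (. ! ? ؟)."""
    parts = re.split(r"(?<=[.!?؟])\s+", text.strip())
    return [p for p in parts if p]
```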
#### 5. Connection Management

```python
# ❌ Bad: New connection per request
for text in texts:
    response = requests.post(url, json={"input": text})

# ✅ Good: Reuse session (connection pooling)
session = requests.Session()
session.headers.update({"Authorization": f"Bearer {API_KEY}"})
for text in texts:
    response = session.post(url, json={"input": text})
```

```python
# ✅ Best: gRPC with persistent channel
channel = grpc.insecure_channel("YOUR_HOST:50051")
stub = smtts_pb2_grpc.TextToSpeechStub(channel)

# Reuse stub for all requests — channel stays open
for text in texts:
    response = stub.StreamingSynthesize(request)
```
#### 6. Pre-warm the Model

The first request after service startup may be slower due to model loading. Send a warm-up request during deployment:

```bash
# Warm-up script — run after service starts
curl -s -X POST http://YOUR_HOST:9999/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"input": "warmup", "voice": "Yara"}' \
  --output /dev/null
echo "TTS model warmed up"
```
### Performance Tuning Configuration

#### Environment Variables

| Variable | Default | Description |
|---|---|---|
| `TTS_MAX_WORKERS` | `4` | Number of concurrent TTS worker threads |
| `ASR_MAX_WORKERS` | `4` | Number of concurrent ASR worker threads |
| `TTS_BATCH_SIZE` | `1` | Batch size for TTS inference (increase for throughput) |
| `TTS_MAX_TEXT_LENGTH` | `5000` | Maximum input text length |
| `ASR_MAX_AUDIO_DURATION` | `300` | Maximum audio duration in seconds |
| `GPU_MEMORY_FRACTION` | `0.9` | Fraction of GPU memory to allocate |
| `ENABLE_FP16` | `true` | Use FP16 inference (faster, slightly less precise) |
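A hedged sketch of applying these settings from a Python deploy script; the variable names come from the table above, the values are illustrative, and the service entry point is hypothetical:

```python
import os
import subprocess

# Copy the current environment and apply tuning overrides
env = os.environ.copy()
env.update({
    "TTS_MAX_WORKERS": "8",        # more parallel requests per instance
    "TTS_BATCH_SIZE": "4",         # trades per-request latency for throughput
    "GPU_MEMORY_FRACTION": "0.8",  # leave headroom for other processes
    "ENABLE_FP16": "true",         # faster inference, slightly lower precision
})

# Hypothetical entry point; replace with your actual service command
subprocess.run(["./start_tts_service.sh"], env=env, check=True)
```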
#### Troubleshooting

| Symptom | Likely Cause | Fix |
|---|---|---|
| Out-of-memory errors | GPU memory exhaustion | Reduce `GPU_MEMORY_FRACTION`, check for memory leaks |
| Latency spikes | Thermal throttling | Check GPU temperature with `nvidia-smi` |
| Slow first request | Model cold start | Add warm-up request to deployment script |
| Degrading over time | Memory fragmentation | Schedule periodic service restarts |
| High latency on long text | Single-threaded processing | Split text into shorter segments |
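For the thermal-throttling row, a small sketch that polls GPU temperature via `nvidia-smi`'s query interface:

```python
import subprocess

# Query the GPU temperature (°C) using nvidia-smi's CSV output
temp = subprocess.check_output(
    ["nvidia-smi", "--query-gpu=temperature.gpu", "--format=csv,noheader"],
    text=True,
).strip()
print(f"GPU temperature: {temp} °C")
```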
## Best Practices

### 1. Validate Before Sending

```python
def validate_tts_input(text, voice, speed):
    """Validate TTS inputs before API call."""
    errors = []
    if not text or not text.strip():
        errors.append("Text is required")
    if voice not in ['Yara', 'Nouf', 'Yara_en']:
        errors.append(f"Invalid voice: {voice}")
    if not (0.25 <= speed <= 4.0):
        errors.append(f"Speed must be between 0.25 and 4.0, got {speed}")
    return errors

# Usage
errors = validate_tts_input("مرحباً", "Yara", 1.5)
if errors:
    print("Validation failed:", errors)
else:
    # Make API call
    pass
```
### 2. Implement Retry Logic

```python
import time
import requests

def request_with_backoff(url, data, max_retries=3):
    """Make request with exponential backoff."""
    for attempt in range(max_retries):
        try:
            response = requests.post(url, json=data, timeout=30)
            if response.status_code == 200:
                return response
            # Retry on server errors
            if response.status_code >= 500:
                wait_time = (2 ** attempt) + 1
                print(f"Retry {attempt + 1}/{max_retries} in {wait_time}s")
                time.sleep(wait_time)
                continue
            # Don't retry client errors
            response.raise_for_status()
        except requests.Timeout:
            if attempt < max_retries - 1:
                continue
            raise
    raise Exception("Max retries exceeded")
```
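Usage mirrors the earlier examples, e.g. `request_with_backoff("http://localhost:9999/v1/audio/speech", {"input": "مرحباً"})`: transient 5xx responses and timeouts are retried with growing waits, while 4xx client errors fail immediately.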