Maximum file size: contact your administrator for the specific limit.

Optimization tips:

- Use FLAC for the best compression without quality loss.
- Compress audio before uploading if using WAV.
- Remove silence from the beginning and end of recordings (see the sketch after this list).
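For the last two tips, a minimal sketch using pydub (an assumed dependency; it requires ffmpeg) that trims leading and trailing silence and re-encodes a WAV recording as FLAC before upload:

```python
from pydub import AudioSegment
from pydub.silence import detect_leading_silence

def trim_silence(sound: AudioSegment, threshold_dbfs: float = -50.0) -> AudioSegment:
    """Strip silence from both ends of an AudioSegment."""
    start = detect_leading_silence(sound, silence_threshold=threshold_dbfs)
    end = detect_leading_silence(sound.reverse(), silence_threshold=threshold_dbfs)
    return sound[start:len(sound) - end]

# Trim a WAV recording and save it as FLAC before uploading
sound = trim_silence(AudioSegment.from_wav("recording.wav"))
sound.export("recording.flac", format="flac")
```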
## Timeout Recommendations

| Operation | Recommended Timeout | Notes |
|---|---|---|
| TTS | 30 seconds | Longer for large text |
| ASR | 60 seconds | Depends on file size |
| Health check | 5 seconds | Quick response expected |
Example with timeout:

```python
import requests

# TTS with timeout
response = requests.post(
    "http://localhost:9999/v1/audio/speech",
    json={"input": "مرحباً بكم"},
    timeout=30,
)

# ASR with timeout
with open("audio.wav", "rb") as f:
    response = requests.post(
        "http://localhost:8088/v1/audio/transcriptions",
        files={"file": f},
        timeout=60,
    )
```
## Performance & Latency

Optimize SM-AI-MODELS for real-time voice applications, IVR systems, and production workloads.
### Latency Benchmarks

#### TTS Latency (SM-TTS-V1)

Measured on an NVIDIA A100 (40GB) with the default configuration:
| Input Length | TTFC (gRPC) | TTFC (HTTP) | Total Generation | Notes |
|---|---|---|---|---|
| Short (under 50 chars) | ~150ms | ~200ms | ~400ms | Single sentence |
| Medium (50-200 chars) | ~200ms | ~300ms | ~800ms | Short paragraph |
| Long (200-500 chars) | ~250ms | ~350ms | ~1.5s | Full paragraph |
| Very long (500-2000 chars) | ~300ms | ~400ms | ~3-5s | Multiple paragraphs |
TTFC = Time to First Chunk (streaming mode). This is the most critical metric for real-time applications — it determines how quickly the user hears the first audio.
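To verify TTFC in your own environment, a hedged sketch that times the first streamed chunk over HTTP (it assumes the endpoint streams the response body; the payload mirrors the examples above):

```python
import time
import requests

start = time.monotonic()
with requests.post(
    "http://localhost:9999/v1/audio/speech",
    json={"input": "مرحباً بكم"},
    stream=True,   # read the body incrementally as it arrives
    timeout=30,
) as response:
    response.raise_for_status()
    next(response.iter_content(chunk_size=4096))  # block until the first chunk
print(f"TTFC: {(time.monotonic() - start) * 1000:.0f} ms")
```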
#### ASR Latency (SM-STT-V1)

| Audio Duration | Processing Time (REST) | Real-time Factor | Notes |
|---|---|---|---|
| 1 second | ~300ms | 0.3x | Near real-time |
| 5 seconds | ~800ms | 0.16x | Fast |
| 30 seconds | ~3s | 0.1x | Optimal segment length |
| 60 seconds | ~6s | 0.1x | Acceptable |
| 300 seconds (max) | ~25-30s | 0.08-0.1x | Use streaming for long audio |
#### gRPC ASR Streaming Latency

| Metric | Value | Notes |
|---|---|---|
| Partial result latency | ~100-200ms | From speech to first partial transcript |
| Final result latency | ~300-500ms | From end of speech to confirmed transcript |
| End-of-utterance detection | ~500-800ms | Silence-based endpoint detection |
### Optimization Guide

#### 1. Choose the Right Protocol

| Protocol | TTFC | Throughput | Best For |
|---|---|---|---|
| gRPC Streaming | ⚡ ~150ms | High | MRCP, voice bots, real-time apps |
| HTTP Streaming | 🔶 ~200-300ms | Medium | Web apps, simple integrations |
| HTTP (non-streaming) | 🔴 Full wait | Medium | Batch processing, file generation |
#### 2. Choose the Right Audio Format

| Format | Encoding Overhead | File Size | Best For |
|---|---|---|---|
| `pcm` | ⚡ None | Large | Lowest latency, telephony (8kHz/16kHz) |
| `opus` | ⚡ Minimal | Small | WebRTC, streaming, bandwidth-constrained |
| `mp3` | 🔶 ~20-50ms | Medium | Web playback, downloads |
| `wav` | 🔶 Minimal | Large | Editing, archival, high quality |
| `flac` | 🔴 ~30-60ms | Medium | Lossless archival |
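To request a specific output format, a minimal sketch; the `response_format` field name follows the OpenAI-style speech API convention and is an assumption about this deployment:

```python
import requests

# Hypothetical: request Opus output for a bandwidth-constrained client
response = requests.post(
    "http://localhost:9999/v1/audio/speech",
    json={"input": "مرحباً بكم", "voice": "Yara", "response_format": "opus"},
    timeout=30,
)
with open("speech.opus", "wb") as f:
    f.write(response.content)
```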
#### 3. Optimize Sample Rate

| Sample Rate | Use Case | Quality | Latency Impact |
|---|---|---|---|
| 8,000 Hz | Telephony (G.711) | Acceptable | ⚡ Fastest |
| 16,000 Hz | Telephony (wideband), voice bots | Good | ⚡ Fast |
| 22,050 Hz | General playback (default) | High | 🔶 Default |
| 24,000 Hz | High-quality applications | Highest | 🔶 Slightly slower |
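For telephony pipelines, pairing `pcm` with a low sample rate minimizes both encoding overhead and synthesis time. A sketch, assuming a `sample_rate` request field (hypothetical, like `response_format` above):

```python
import requests

# Hypothetical: 8 kHz PCM output for a G.711 telephony pipeline
response = requests.post(
    "http://localhost:9999/v1/audio/speech",
    json={
        "input": "مرحباً بكم",
        "voice": "Yara",
        "response_format": "pcm",  # assumed field, see note above
        "sample_rate": 8000,       # assumed field
    },
    timeout=30,
)
```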
#### 4. Optimize Text Input

```python
# ❌ Bad: Sending entire page at once — high TTFC, high latency
long_text = "مرحباً بكم في يونيكود سولوشنز... (2000+ characters)"
response = generate_speech(long_text)  # Waits 3-5 seconds

# ✅ Good: Stream sentence by sentence — fast TTFC per segment
sentences = split_into_sentences(long_text)
for sentence in sentences:
    for chunk in stream_tts(sentence):  # TTFC ~200ms per sentence
        play_audio(chunk)
```
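The example above assumes a `split_into_sentences` helper. A minimal sketch that splits on Latin and Arabic sentence-final punctuation (a naive approach, not the service's own segmenter):

```python
import re

def split_into_sentences(text: str) -> list[str]:
    """Naively split text after sentence-final punctuation (. ! ? ؟)."""
    parts = re.split(r"(?<=[.!?؟])\s+", text.strip())
    return [p for p in parts if p]
```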
#### 5. Connection Management

```python
# ❌ Bad: New connection per request
for text in texts:
    response = requests.post(url, json={"input": text})

# ✅ Good: Reuse session (connection pooling)
session = requests.Session()
session.headers.update({"Authorization": f"Bearer {API_KEY}"})
for text in texts:
    response = session.post(url, json={"input": text})
```

```python
# ✅ Best: gRPC with persistent channel
channel = grpc.insecure_channel("YOUR_HOST:50051")
stub = smtts_pb2_grpc.TextToSpeechStub(channel)

# Reuse stub for all requests — channel stays open
for text in texts:
    response = stub.StreamingSynthesize(request)
```
#### 6. Pre-warm the Model

The first request after service startup may be slower due to model loading. Send a warm-up request during deployment:

```bash
# Warm-up script — run after service starts
curl -s -X POST http://YOUR_HOST:9999/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"input": "warmup", "voice": "Yara"}' \
  --output /dev/null
echo "TTS model warmed up"
```
### Performance Tuning Configuration

#### Environment Variables

| Variable | Default | Description |
|---|---|---|
| `TTS_MAX_WORKERS` | `4` | Number of concurrent TTS worker threads |
| `ASR_MAX_WORKERS` | `4` | Number of concurrent ASR worker threads |
| `TTS_BATCH_SIZE` | `1` | Batch size for TTS inference (increase for throughput) |
| `TTS_MAX_TEXT_LENGTH` | `5000` | Maximum input text length |
| `ASR_MAX_AUDIO_DURATION` | `300` | Maximum audio duration in seconds |
| `GPU_MEMORY_FRACTION` | `0.9` | Fraction of GPU memory to allocate |
| `ENABLE_FP16` | `true` | Use FP16 inference (faster, slightly less precise) |
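A hedged sketch of applying these settings from a Python deploy script; the variable names come from the table above, the values are illustrative, and the service entry point is hypothetical:

```python
import os
import subprocess

# Copy the current environment and apply tuning overrides
env = os.environ.copy()
env.update({
    "TTS_MAX_WORKERS": "8",        # more parallel requests per instance
    "TTS_BATCH_SIZE": "4",         # trades per-request latency for throughput
    "GPU_MEMORY_FRACTION": "0.8",  # leave headroom for other processes
    "ENABLE_FP16": "true",         # faster inference, slightly lower precision
})

# Hypothetical entry point; replace with your actual service command
subprocess.run(["./start_tts_service.sh"], env=env, check=True)
```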
#### Troubleshooting

| Symptom | Likely Cause | Fix |
|---|---|---|
| Out-of-memory errors | GPU memory exhaustion | Reduce `GPU_MEMORY_FRACTION`, check for memory leaks |
| Latency spikes | Thermal throttling | Check GPU temperature with `nvidia-smi` |
| Slow first request | Model cold start | Add warm-up request to deployment script |
| Degrading over time | Memory fragmentation | Schedule periodic service restarts |
| High latency on long text | Single-threaded processing | Split text into shorter segments |
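For the thermal-throttling row, a small sketch that polls GPU temperature via `nvidia-smi`'s query interface:

```python
import subprocess

# Query the GPU temperature (°C) using nvidia-smi's CSV output
temp = subprocess.check_output(
    ["nvidia-smi", "--query-gpu=temperature.gpu", "--format=csv,noheader"],
    text=True,
).strip()
print(f"GPU temperature: {temp} °C")
```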
## Best Practices

### 1. Validate Before Sending

```python
def validate_tts_input(text, voice, speed):
    """Validate TTS inputs before API call."""
    errors = []
    if not text or not text.strip():
        errors.append("Text is required")
    if voice not in ['Yara', 'Nouf', 'Yara_en']:
        errors.append(f"Invalid voice: {voice}")
    if not (0.25 <= speed <= 4.0):
        errors.append(f"Speed must be between 0.25 and 4.0, got {speed}")
    return errors

# Usage
errors = validate_tts_input("مرحباً", "Yara", 1.5)
if errors:
    print("Validation failed:", errors)
else:
    # Make API call
    pass
```
### 2. Implement Retry Logic

```python
import time
import requests

def request_with_backoff(url, data, max_retries=3):
    """Make request with exponential backoff."""
    for attempt in range(max_retries):
        try:
            response = requests.post(url, json=data, timeout=30)
            if response.status_code == 200:
                return response
            # Retry on server errors
            if response.status_code >= 500:
                wait_time = (2 ** attempt) + 1
                print(f"Retry {attempt + 1}/{max_retries} in {wait_time}s")
                time.sleep(wait_time)
                continue
            # Don't retry client errors
            response.raise_for_status()
        except requests.Timeout:
            if attempt < max_retries - 1:
                continue
            raise
    raise Exception("Max retries exceeded")
```
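Usage mirrors the earlier examples, e.g. `request_with_backoff("http://localhost:9999/v1/audio/speech", {"input": "مرحباً"})`: transient 5xx responses and timeouts are retried with growing waits, while 4xx client errors fail immediately.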