The SM-AI-MODELS REST API provides simple HTTP endpoints for Text-to-Speech and Speech Recognition.
Base URL: `https://api.withsm.ai`, the single gateway for all services. TTS routes are namespaced under `/v1/tts/...`, ASR routes under `/v1/asr/...`. All requests require `X-API-Key` (see Authentication).
For streaming, see Streaming API for WebSocket and gRPC.
Text-to-Speech (TTS)
Convert text to natural-sounding speech with our neural TTS engine.
Endpoint
Code
Service: SM-TTS-V1
Request Body
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| `input` | string | Yes | - | Text to synthesize (see limits for maximum length) |
| `voice` | string | No | Yara | Voice name. One of: Yara, Nouf, Atheer, Yara_en |
| `response_format` | string | No | mp3 | Audio format. One of: mp3, wav, opus, flac, pcm, aac |
| `speed` | number | No | 1.0 | Speech speed (0.5 to 2.0) |
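The constraints in the table can be checked on the client before a request is sent. The following is a minimal sketch, not part of the API itself: the function name `build_tts_payload` is hypothetical, and only the field names, allowed values, and defaults come from the table above.

```python
# Allowed values and defaults copied from the Request Body table.
VOICES = ("Yara", "Nouf", "Atheer", "Yara_en")
FORMATS = ("mp3", "wav", "opus", "flac", "pcm", "aac")


def build_tts_payload(text, voice="Yara", response_format="mp3", speed=1.0):
    """Return the TTS request fields as a dict, raising ValueError on bad input.

    Hypothetical helper: it only mirrors the documented constraints.
    """
    if not isinstance(text, str) or not text.strip():
        raise ValueError("input must be a non-empty string")
    if voice not in VOICES:
        raise ValueError(f"voice must be one of {VOICES}")
    if response_format not in FORMATS:
        raise ValueError(f"response_format must be one of {FORMATS}")
    if not 0.5 <= speed <= 2.0:
        raise ValueError("speed must be between 0.5 and 2.0")
    return {
        "input": text,
        "voice": voice,
        "response_format": response_format,
        "speed": speed,
    }
```

Validating locally avoids a round trip for requests the service would presumably reject anyway. The maximum length of `input` is not checked here because it is defined on the limits page.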
Available Voices
| Voice | Language | Description |
|---|---|---|
| Yara | Arabic | Female voice — Natural and clear |
| Nouf | Arabic | Female voice — Warm and expressive |
| Atheer | Arabic | Female voice — Soft, conversational |
| Yara_en | English | Female voice — Professional |
Audio Formats
| Format | Content-Type | Use Case |
|---|---|---|
| `mp3` | audio/mpeg | Web playback, small file size |
| `wav` | audio/wav | High quality, editing |
| `opus` | audio/opus | Streaming, low latency |
| `flac` | audio/flac | Lossless compression |
| `pcm` | audio/pcm | Raw PCM samples; best for piping into another encoder |
| `aac` | audio/aac | Mobile/HLS-friendly compressed audio |
Examples
Basic Arabic TTS
Code
English with Custom Speed
Code
WAV Format Output
Code
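As a rough illustration of the three cases named above, the sketch below builds the HTTP request with Python's standard library. Several things here are assumptions rather than documented facts: the body is sent as JSON, the method is POST, and `X-API-Key` is sent as a request header. The full TTS route is not spelled out in this section, so `endpoint_url` is left as a caller-supplied argument. The sample texts, the 1.25 speed, and the choice of Nouf for the WAV case are purely illustrative.

```python
import json
import urllib.request


def build_tts_request(endpoint_url, api_key, payload):
    """Build (but do not send) a request carrying the TTS fields.

    Assumptions: JSON body, POST method, X-API-Key sent as a header.
    """
    return urllib.request.Request(
        endpoint_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"X-API-Key": api_key, "Content-Type": "application/json"},
        method="POST",
    )


# The three example cases expressed as request fields (illustrative values).
EXAMPLES = {
    "basic_arabic": {"input": "مرحبا بكم", "voice": "Yara"},
    "english_custom_speed": {
        "input": "Hello, world.",
        "voice": "Yara_en",
        "speed": 1.25,
    },
    "wav_output": {
        "input": "مرحبا بكم",
        "voice": "Nouf",
        "response_format": "wav",
    },
}
```

Passing the returned object to `urllib.request.urlopen` sends it, and the response body can then be read as bytes.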
Response
On success, returns binary audio data with the appropriate Content-Type header.
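Because the body is raw audio, a client usually decides how to store it from the `Content-Type` header. A minimal sketch follows, assuming the header values listed in the Audio Formats table. The file extensions and the helper names `extension_for` and `save_audio` are this example's own choices.

```python
# Content-Type values from the Audio Formats table; extensions are assumed.
CONTENT_TYPE_TO_EXT = {
    "audio/mpeg": ".mp3",
    "audio/wav": ".wav",
    "audio/opus": ".opus",
    "audio/flac": ".flac",
    "audio/pcm": ".pcm",
    "audio/aac": ".aac",
}


def extension_for(content_type):
    """Map a Content-Type header value to a file extension.

    Any parameters after ';' are ignored and matching is case-insensitive.
    """
    media_type = content_type.split(";", 1)[0].strip().lower()
    try:
        return CONTENT_TYPE_TO_EXT[media_type]
    except KeyError:
        raise ValueError(f"unexpected Content-Type: {content_type!r}") from None


def save_audio(body, content_type, stem="speech"):
    """Write the response bytes to '<stem><ext>' and return the path used."""
    path = stem + extension_for(content_type)
    with open(path, "wb") as fh:
        fh.write(body)
    return path
```

Raising on an unknown type is deliberate: an error response is unlikely to carry an audio Content-Type, so this stops it from being saved as if it were audio.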
Error Response
Code
See Error Handling for the full error-response reference.
Handling Long Text
For text exceeding the maximum length, split it into smaller chunks:
Code
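One way to split is to break at sentence boundaries and fall back to a hard cut only when a single sentence is itself too long. The sketch below takes that approach. The limit is passed in as `max_chars` because the actual maximum is defined on the limits page, and counting characters rather than bytes is an assumption.

```python
import re


def split_text(text, max_chars):
    """Split text into chunks of at most max_chars, preferring sentence ends."""
    if max_chars < 1:
        raise ValueError("max_chars must be at least 1")
    # Break after Latin or Arabic sentence terminators followed by whitespace.
    sentences = re.split(r"(?<=[.!?؟])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # Hard-cut any single sentence that cannot fit in one chunk.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        if not sentence:
            continue
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be synthesized separately and the audio joined in order. Cutting at sentence ends keeps the joins on natural pauses, which tends to sound better than cutting mid-sentence.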
Performance Tips
Audio Format Selection
| Format | File Size | Quality | Best For |
|---|---|---|---|
| mp3 | Smallest | Good | Web playback, storage efficiency |
| opus | Small | Excellent | Streaming, low-latency apps |
| flac | Medium | Lossless | Archiving, post-processing |
| wav | Largest | Lossless | Editing, highest quality |
Speed Optimization
Code
Concurrent Requests
Code
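A small thread pool is a simple way to issue several synthesis requests at once. The sketch below is generic on purpose: `synthesize` stands for any caller-supplied function that takes a text and returns audio bytes, and the default of 4 workers is an arbitrary placeholder that should be set according to the documented API limits.

```python
from concurrent.futures import ThreadPoolExecutor


def synthesize_all(texts, synthesize, max_workers=4):
    """Call synthesize(text) for each text concurrently.

    Results are returned in the same order as the inputs. If any call
    raises, the exception propagates to the caller.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(synthesize, texts))
```

Threads suit this workload because each call spends most of its time waiting on the network rather than using the CPU.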
Speech Recognition (ASR)
Convert audio to text with our high-accuracy, multi-language speech recognition engine.
Endpoint
Code
Service: SM-STT-V1
Request
Send audio as multipart/form-data:
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| `file` | file | Yes | - | Audio file to transcribe |
| `language` | string | No | auto | Language routing: ar (Arabic), en (English), or auto (auto-detect). Explicit values skip language classification (~50–100ms faster). |
| `model` | string | No | sm-asr | Model identifier. Currently ignored; automatic routing is used. |
| `prompt` | string | No | - | Optional text to guide transcription style. |
| `response_format` | string | No | json | Output format: json, text, verbose_json, srt, vtt. |
| `temperature` | float | No | - | Sampling temperature, 0–1. |
| `timestamp_granularities[]` | array of string | No | [segment] | For verbose_json only. Allowed values: segment, word. |
The response includes an `X-Model-ID` header indicating which model handled the request (`sm-ASR-1` for Arabic, `sm-ASR-2` for English).
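The field rules and the `X-Model-ID` mapping above can be captured in two small helpers. This is a sketch under assumptions: the helper names are hypothetical, and sending an array parameter as one repeated form field per value is a common convention that this page does not confirm.

```python
RESPONSE_FORMATS = ("json", "text", "verbose_json", "srt", "vtt")
LANGUAGES = ("ar", "en", "auto")
GRANULARITIES = ("segment", "word")
# Header values as documented above.
MODEL_LANGUAGE = {"sm-ASR-1": "ar", "sm-ASR-2": "en"}


def build_asr_fields(language="auto", response_format="json",
                     timestamp_granularities=None, prompt=None,
                     temperature=None):
    """Return the non-file form fields as a list of (name, value) pairs."""
    if language not in LANGUAGES:
        raise ValueError(f"language must be one of {LANGUAGES}")
    if response_format not in RESPONSE_FORMATS:
        raise ValueError(f"response_format must be one of {RESPONSE_FORMATS}")
    fields = [("language", language), ("response_format", response_format)]
    if prompt is not None:
        fields.append(("prompt", prompt))
    if temperature is not None:
        if not 0 <= temperature <= 1:
            raise ValueError("temperature must be between 0 and 1")
        fields.append(("temperature", str(temperature)))
    if timestamp_granularities:
        if response_format != "verbose_json":
            raise ValueError("timestamp_granularities[] needs verbose_json")
        for value in timestamp_granularities:
            if value not in GRANULARITIES:
                raise ValueError(f"granularity must be one of {GRANULARITIES}")
            # Assumed convention: one repeated form field per array value.
            fields.append(("timestamp_granularities[]", value))
    return fields


def language_from_model_id(model_id):
    """Map an X-Model-ID header value to its language code, or None."""
    return MODEL_LANGUAGE.get(model_id)
```

With `language=auto`, reading the header back is a cheap way to learn which language the service detected.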
Language Support
The SM-STT-V1 engine supports Arabic and English with optional explicit routing:
| Language | Support Level | language value |
|---|---|---|
| Arabic (MSA, Gulf, Egyptian, Levantine) | ✓ Full support | ar |
| English | ✓ Full support | en |
| Auto-detect | Default | auto (or omit the field) |
Key features:
- Automatic language detection when `language` is omitted or set to `auto`
- Set `language=ar` or `language=en` to skip classification and shave ~50–100ms off latency
- Mixed-language content is handled by the auto-detect path
Supported Audio Formats
| Format | Extension | Notes |
|---|---|---|
| FLAC | .flac | Recommended for quality |
| MP3 | .mp3 | Common format |
| WAV | .wav | Uncompressed audio |
| OGG | .ogg | Open format |
| WebM | .webm | Web recordings |
Examples
Auto-detect language (default):
Code
Explicit Arabic (skips classification, ~50–100ms faster):
Code
Verbose JSON with word-level timestamps:
Code
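The three cases above differ only in their form fields, so one encoder can serve all of them. The sketch below builds a `multipart/form-data` body with the standard library. The `file` field name comes from the request table, while the `application/octet-stream` part type and the function name `encode_multipart` are assumptions of this example. The ASR route itself is not spelled out in this section, so no URL appears here.

```python
import uuid

# Form fields for the three example cases (the file part is added separately).
EXAMPLE_FIELDS = {
    "auto_detect": [],
    "explicit_arabic": [("language", "ar")],
    "verbose_words": [
        ("response_format", "verbose_json"),
        ("timestamp_granularities[]", "word"),
    ],
}


def encode_multipart(fields, filename, file_bytes, boundary=None):
    """Encode text fields plus one audio file as multipart/form-data.

    Returns (body_bytes, content_type_header_value).
    """
    boundary = boundary or uuid.uuid4().hex
    parts = []
    for name, value in fields:
        parts.append(
            (f"--{boundary}\r\n"
             f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
             f"{value}\r\n").encode("utf-8")
        )
    # The audio goes in the documented "file" field.
    parts.append(
        (f"--{boundary}\r\n"
         f'Content-Disposition: form-data; name="file"; '
         f'filename="{filename}"\r\n'
         "Content-Type: application/octet-stream\r\n\r\n").encode("utf-8")
    )
    parts.append(file_bytes)
    parts.append(f"\r\n--{boundary}--\r\n".encode("utf-8"))
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"
```

The returned header value must be sent as the request's `Content-Type` so the server can locate the part boundaries.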
Response
Code
Best Practices
Tips for best results:
- Audio quality: Use clear audio with minimal background noise
- Sample rate: 16kHz or higher recommended for best accuracy
- Speakers: Single speaker audio works best
- Duration: Keep audio segments under 30 seconds for optimal performance
- File size: Check API Limits for maximum file size
- Format: Use FLAC for best quality-to-size ratio
- Bit depth: 16-bit minimum, higher is better
For long recordings:
Code
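For WAV input, the 30-second guideline above can be met with the standard `wave` module alone. The sketch below cuts at fixed intervals. That is a simplification: a fixed cut can land in the middle of a word, so splitting at silences may give better transcripts. The function name `split_wav` is hypothetical.

```python
import io
import wave


def split_wav(wav_bytes, max_seconds=30.0):
    """Split WAV data into WAV byte strings of at most max_seconds each."""
    segments = []
    with wave.open(io.BytesIO(wav_bytes), "rb") as src:
        params = src.getparams()
        frames_per_segment = int(src.getframerate() * max_seconds)
        if frames_per_segment < 1:
            raise ValueError("max_seconds is too small for this sample rate")
        while True:
            frames = src.readframes(frames_per_segment)
            if not frames:
                break
            out = io.BytesIO()
            with wave.open(out, "wb") as dst:
                # Copy channels, sample width, and rate from the source.
                # The frame count in the header is corrected on write.
                dst.setparams(params)
                dst.writeframes(frames)
            segments.append(out.getvalue())
    return segments
```

Each returned item is a complete WAV file that can be uploaded on its own. The transcripts can then be joined in order.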
Error Response
Code
See Error Handling for the full error-response reference.
Advanced ASR Endpoint
Transcribe with Diarization
For advanced transcription with speaker identification, use the /v1/asr/transcribe endpoint:
Code
Request fields (multipart/form-data):
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| `file` | file | Yes | - | Audio file to transcribe |
| `diarize` | boolean | No | false | Enable speaker diarization (adds ~2s/min of audio). |
| `language` | string | No | auto | Language routing: ar, en, or auto. Explicit values skip classification (~50–100ms faster). |
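The diarization cost in the table scales with audio length, which matters when choosing client timeouts. The helper below is a back-of-the-envelope sketch: it assumes the overhead grows linearly at the approximate rate quoted above, and the function name is hypothetical.

```python
def estimate_diarization_overhead(audio_seconds, seconds_per_minute=2.0):
    """Estimate the extra processing time, in seconds, added by diarize=true.

    Assumes linear scaling at the approximate documented rate of about
    2 seconds per minute of audio.
    """
    return audio_seconds / 60.0 * seconds_per_minute


# A 30-minute meeting (1800 s of audio) adds roughly one minute.
print(estimate_diarization_overhead(1800))  # 60.0
```

By comparison, the 50–100ms saved by setting `language` explicitly is small for long recordings and mostly matters for short clips.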
Example — Arabic meeting with diarization:
Code
Response:
Code
Features:
- Speaker identification and labeling
- Word-level timestamps
- Segment-level speaker attribution
- Language auto-detect by default; explicit `ar`/`en` for lower latency
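Segment-level speaker attribution makes simple analytics possible, such as total talk time per speaker. The sketch below illustrates the idea only. The response schema is not shown in this section, so the segment keys `speaker`, `start`, and `end` are hypothetical and should be replaced with the field names the endpoint actually returns.

```python
from collections import defaultdict


def talk_time_by_speaker(segments):
    """Sum segment durations per speaker label.

    Hypothetical schema: each segment is a dict with 'speaker' (label),
    'start' and 'end' (seconds). Adjust the keys to the real response.
    """
    totals = defaultdict(float)
    for seg in segments:
        totals[seg["speaker"]] += seg["end"] - seg["start"]
    return dict(totals)


# Illustrative data, not real API output.
sample = [
    {"speaker": "A", "start": 0.0, "end": 2.5},
    {"speaker": "B", "start": 2.5, "end": 4.0},
    {"speaker": "A", "start": 4.0, "end": 5.0},
]
print(talk_time_by_speaker(sample))  # {'A': 3.5, 'B': 1.5}
```

The same grouping pattern applies to the use cases listed below, for example measuring agent versus customer talk time in call center recordings.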
Use Cases:
- Meeting transcriptions
- Multi-speaker conversations
- Call center analytics
Try It
Go to the API Reference to test these endpoints interactively with the API playground.
Next Steps
- Optimization & Limits — API limits and performance tuning
- Streaming API — WebSocket and gRPC for real-time audio
- cURL Examples — Comprehensive cURL examples
- gRPC Guide — Use gRPC APIs for high performance
- Error Handling — Handle errors properly
- Python Integration — Python code examples
- Node.js Integration — JavaScript/TypeScript examples
