REST API Reference

The SM-AI-MODELS REST API provides simple HTTP endpoints for Text-to-Speech (TTS) and Speech Recognition (ASR).

Service ports: TTS on port 9999, ASR on port 8088.

For streaming, see the Streaming API (WebSocket and gRPC).

Text-to-Speech (TTS)

Convert text to natural-sounding speech with our neural TTS engine.

Endpoint

Code
 
POST /v1/audio/speech

Service: SM-TTS-V1 on port 9999

Request Body

Parameter	Type	Required	Default	Description
`input`	string	Yes	—	Text to synthesize (see limits for maximum length)
`voice`	string	No	`Yara`	Voice name (`Yara`, `Nouf`, or `Yara_en`)
`response_format`	string	No	`mp3`	Audio format (`mp3`, `wav`, `opus`, or `flac`)
`speed`	number	No	`1.0`	Speech speed (0.25 to 4.0)
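
Checking these parameters on the client can catch mistakes before a request is sent. The sketch below assumes the table above lists every constraint. `build_tts_payload` is a hypothetical helper, not part of the API.

```python
ALLOWED_VOICES = {"Yara", "Nouf", "Yara_en"}
ALLOWED_FORMATS = {"mp3", "wav", "opus", "flac"}


def build_tts_payload(text, voice="Yara", response_format="mp3", speed=1.0):
    """Return a JSON-serializable body for POST /v1/audio/speech.

    Hypothetical helper: it mirrors the documented parameter table and
    raises ValueError for values outside it.
    """
    if not isinstance(text, str) or not text.strip():
        raise ValueError("input must be a non-empty string")
    if voice not in ALLOWED_VOICES:
        raise ValueError(f"unknown voice: {voice!r}")
    if response_format not in ALLOWED_FORMATS:
        raise ValueError(f"unknown response_format: {response_format!r}")
    if not 0.25 <= speed <= 4.0:
        raise ValueError("speed must be between 0.25 and 4.0")
    return {
        "input": text,
        "voice": voice,
        "response_format": response_format,
        "speed": speed,
    }
```

The returned dictionary can be passed as the `json=` argument of an HTTP client call. The maximum text length is not checked here because this page does not state it.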

Available Voices

Voice	Language	Description
Yara	Arabic	Female voice — Natural and clear
Nouf	Arabic	Female voice — Warm and expressive
Yara_en	English	Female voice — Professional

Audio Formats

Format	Content-Type	Use Case
`mp3`	audio/mpeg	Web playback, small file size
`wav`	audio/wav	High quality, editing
`opus`	audio/opus	Streaming, low latency
`flac`	audio/flac	Lossless compression
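
When the output file name is chosen after the response arrives, the Content-Type header can drive the extension. This is a minimal sketch based only on the table above. `extension_for` is a hypothetical helper name.

```python
# Format name -> Content-Type, copied from the Audio Formats table.
CONTENT_TYPES = {
    "mp3": "audio/mpeg",
    "wav": "audio/wav",
    "opus": "audio/opus",
    "flac": "audio/flac",
}


def extension_for(content_type):
    """Map a Content-Type header value to a file extension.

    Returns None when the media type is not one of the documented formats.
    """
    # Drop any parameters such as "; charset=..." before comparing.
    media_type = content_type.split(";", 1)[0].strip().lower()
    for fmt, ctype in CONTENT_TYPES.items():
        if ctype == media_type:
            return "." + fmt
    return None
```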

Examples

Basic Arabic TTS

Code
 
curl -X POST http://localhost:9999/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"input": "مرحباً بكم في يونيكود سولوشنز"}' \
  --output speech.mp3

English with Custom Speed

Code
 
curl -X POST http://localhost:9999/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "input": "Welcome to Unicode Solutions",
    "voice": "Yara_en",
    "speed": 1.2
  }' \
  --output english.mp3

WAV Format Output

Code
 
curl -X POST http://localhost:9999/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "input": "Testing WAV format",
    "voice": "Nouf",
    "response_format": "wav"
  }' \
  --output audio.wav

Response

On success, returns binary audio data with the appropriate Content-Type header.

Error Response


Code
 
{
  "error": {
    "code": "invalid_request",
    "message": "Input text is required"
  }
}
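
A client needs to tell audio apart from the JSON error envelope before writing anything to disk. The sketch below assumes that errors arrive with an HTTP status of 400 or higher and a non-audio Content-Type. This page does not state either, so treat both as assumptions. `TTSError` and `audio_or_raise` are hypothetical names.

```python
import json


class TTSError(Exception):
    """Carries the code and message from the documented error envelope."""

    def __init__(self, code, message):
        super().__init__(f"{code}: {message}")
        self.code = code
        self.message = message


def audio_or_raise(status_code, content_type, body):
    """Return the audio bytes, or raise TTSError for anything else.

    Assumption: a successful response has status < 400 and an audio/* type.
    """
    if status_code < 400 and content_type.lower().startswith("audio/"):
        return body
    try:
        error = json.loads(body.decode("utf-8"))["error"]
        code, message = error["code"], error["message"]
    except (ValueError, KeyError, TypeError):
        # Body was not the documented envelope; fall back to the status code.
        code, message = "unknown_error", f"HTTP {status_code}"
    raise TTSError(code, message)
```

With the `requests` library, this could be called as `audio_or_raise(response.status_code, response.headers.get("Content-Type", ""), response.content)`.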

Handling Long Text

For text exceeding the maximum length, split it into smaller chunks:


Code
 
import requests

def split_text(text, max_length=5000):
    """Split text on whitespace into chunks of at most max_length characters.

    A single word longer than max_length becomes its own chunk.
    """
    words = text.split()
    chunks = []
    current_chunk = []
    current_length = 0

    for word in words:
        word_length = len(word) + 1  # +1 for the joining space
        # Start a new chunk when this word would overflow the current one.
        # The current_chunk check avoids emitting an empty first chunk.
        if current_chunk and current_length + word_length > max_length:
            chunks.append(' '.join(current_chunk))
            current_chunk = [word]
            current_length = word_length
        else:
            current_chunk.append(word)
            current_length += word_length

    if current_chunk:
        chunks.append(' '.join(current_chunk))

    return chunks

# Generate speech for each chunk
long_text = "Your very long text here..."
chunks = split_text(long_text)

for i, chunk in enumerate(chunks):
    response = requests.post(
        "http://localhost:9999/v1/audio/speech",
        json={"input": chunk, "voice": "Yara"}
    )
    response.raise_for_status()  # Do not save an error body as audio
    with open(f"part_{i}.mp3", "wb") as f:
        f.write(response.content)

Performance Tips

Audio Format Selection

Format	File Size	Quality	Best For
mp3	Smallest	Good	Web playback, storage efficiency
opus	Small	Excellent	Streaming, low-latency apps
flac	Medium	Lossless	Archiving, post-processing
wav	Largest	Lossless	Editing, highest quality

Speed Optimization


Code
 
import requests

# For fastest response, use mp3 format
response = requests.post(
    "http://localhost:9999/v1/audio/speech",
    json={
        "input": "مرحباً",
        "response_format": "mp3"  # Fastest encoding
    }
)

Concurrent Requests


Code
 
import concurrent.futures
import requests

def generate_speech(text, voice="Yara"):
    """Request speech for one text and return the audio bytes."""
    response = requests.post(
        "http://localhost:9999/v1/audio/speech",
        json={"input": text, "voice": voice}
    )
    response.raise_for_status()  # Surface HTTP errors instead of returning an error body
    return response.content

# Generate multiple audio files in parallel
texts = ["مرحباً", "صباح الخير", "مساء الخير"]

with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    futures = [executor.submit(generate_speech, text) for text in texts]
    # Results are collected in the same order as texts
    results = [future.result() for future in futures]

Speech Recognition (ASR)

Convert audio to text with our high-accuracy, multi-language speech recognition engine.

Endpoint

Code
 
POST /v1/audio/transcriptions

Service: SM-STT-V1 on port 8088

Request

Send audio as multipart/form-data:

Parameter	Type	Required	Description
`file`	file	Yes	Audio file to transcribe

Language Support

The SM-ASR-v2 engine provides full multi-language support with high accuracy across all supported languages:

Language	Support Level	Notes
Arabic	✓ Full support	Excellent accuracy for all dialects
English	✓ Full support	Native and non-native speakers
Other languages	✓ Full support	Equal quality across all languages

Key features:

Automatic language detection
No need to specify language in API request
Consistent accuracy across all supported languages
Support for mixed-language content

Supported Audio Formats

Format	Extension	Notes
FLAC	`.flac`	Recommended for quality
MP3	`.mp3`	Common format
WAV	`.wav`	Uncompressed audio
OGG	`.ogg`	Open format
WebM	`.webm`	Web recordings
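
A quick extension check before uploading avoids a round trip for files that are clearly unsupported. The sketch below looks only at the file name, not the file contents. `has_supported_extension` is a hypothetical helper.

```python
from pathlib import Path

# Extensions copied from the Supported Audio Formats table.
SUPPORTED_EXTENSIONS = {".flac", ".mp3", ".wav", ".ogg", ".webm"}


def has_supported_extension(path):
    """Return True if the file name ends in a documented audio extension."""
    return Path(path).suffix.lower() in SUPPORTED_EXTENSIONS
```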

Example

Code
 
curl -X POST http://localhost:8088/v1/audio/transcriptions \
  -F "file=@recording.wav"

Response


Code
 
{
  "text": "مرحباً بكم في خدمات يونيكود سولوشنز"
}
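
The same request from Python might look like the sketch below. It assumes the third-party `requests` library and the local URL used in the cURL example. `transcribe` and its `post` parameter are hypothetical. The `post` parameter exists only so the HTTP call can be swapped out in tests.

```python
def transcribe(path, post=None,
               url="http://localhost:8088/v1/audio/transcriptions"):
    """Upload one audio file as multipart/form-data and return the transcript."""
    if post is None:
        import requests  # third-party; imported lazily so a stub can replace it
        post = requests.post
    with open(path, "rb") as audio:
        # The form field must be named "file", matching the request table.
        response = post(url, files={"file": audio})
    response.raise_for_status()
    return response.json()["text"]
```

For example, `print(transcribe("recording.wav"))` would print the `text` field of the response shown above.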

Best Practices

Tips for best results:

Audio quality: Use clear audio with minimal background noise
Sample rate: 16 kHz or higher recommended for best accuracy
Speakers: Single-speaker audio works best
Duration: Keep audio segments under 30 seconds for optimal performance
File size: Check API Limits for maximum file size
Format: Use FLAC for best quality-to-size ratio
Bit depth: 16-bit minimum, higher is better
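
For WAV input, the sample rate, bit depth, and duration tips can be checked locally with Python's standard `wave` module. This is a minimal sketch. `check_wav` is a hypothetical helper, and its thresholds simply restate the tips above.

```python
import wave


def check_wav(path, min_rate_hz=16000, min_sample_width_bytes=2,
              max_seconds=30.0):
    """Return a list of warnings for a WAV file; an empty list means none."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        width = wav.getsampwidth()  # bytes per sample
        seconds = wav.getnframes() / rate

    warnings = []
    if rate < min_rate_hz:
        warnings.append(f"sample rate {rate} Hz is below {min_rate_hz} Hz")
    if width < min_sample_width_bytes:
        warnings.append(
            f"bit depth {width * 8} is below {min_sample_width_bytes * 8}"
        )
    if seconds >= max_seconds:
        warnings.append(
            f"duration {seconds:.1f} s is not under {max_seconds:.0f} s"
        )
    return warnings
```

Other formats such as FLAC or MP3 need a different reader. The `wave` module handles only WAV files.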

For long recordings:


Code
 
from pydub import AudioSegment

def split_audio(file_path, segment_length_ms=30000):
    """Split long audio into 30-second segments."""
    audio = AudioSegment.from_file(file_path)
    segments = []

    for i in range(0, len(audio), segment_length_ms):
        segment = audio[i:i + segment_length_ms]
        segment_path = f"segment_{i//segment_length_ms}.wav"
        segment.export(segment_path, format="wav")
        segments.append(segment_path)

    return segments

Error Response


Code
 
{
  "error": {
    "code": "invalid_file",
    "message": "Unsupported audio format"
  }
}

Advanced ASR Endpoint

Transcribe with Diarization

For advanced transcription with speaker identification, use the /api/transcribe endpoint:

Code
 
POST /api/transcribe

Request:

Code
 
curl -X POST http://localhost:8088/api/transcribe \
  -F "file=@meeting.wav" \
  -F "diarize=true"

Response:


Code
 
{
  "segments": [
    {
      "text": "مرحباً بكم في الاجتماع",
      "speaker": "SPEAKER_00",
      "start_time": 0.0,
      "end_time": 2.5,
      "words": [
        {
          "word": "مرحباً",
          "start_time": 0.0,
          "end_time": 0.8,
          "confidence": 0.98
        }
      ]
    }
  ],
  "model_id": "primary"
}
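
The response can be turned into a readable transcript by walking `segments`. The sketch below relies only on the fields shown in the example response. `format_transcript` and `low_confidence_words` are hypothetical helpers, and the 0.9 threshold is an arbitrary illustration.

```python
def format_transcript(result):
    """Render each segment as 'SPEAKER [start-end] text', one per line."""
    lines = []
    for segment in result["segments"]:
        start = segment["start_time"]
        end = segment["end_time"]
        lines.append(
            f"{segment['speaker']} [{start:.1f}s-{end:.1f}s] {segment['text']}"
        )
    return "\n".join(lines)


def low_confidence_words(result, threshold=0.9):
    """Return words whose confidence is below threshold, in order."""
    return [
        word["word"]
        for segment in result["segments"]
        for word in segment.get("words", [])
        if word["confidence"] < threshold
    ]
```

Flagging low-confidence words is one way to direct manual review in meeting or call center transcripts.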

Features:

Speaker identification and labeling
Word-level timestamps
Segment-level speaker attribution
Automatic language detection

Use Cases:

Meeting transcriptions
Multi-speaker conversations
Call center analytics

Try It

Go to the API Reference to test these endpoints interactively with the API playground.

Next Steps

Optimization & Limits — API limits and performance tuning
Streaming API — WebSocket and gRPC for real-time audio
cURL Examples — Comprehensive cURL examples
gRPC Guide — Use gRPC APIs for high performance
Error Handling — Handle errors properly
Python Integration — Python code examples
Node.js Integration — JavaScript/TypeScript examples

Last modified on February 7, 2026
