
Models & Specifications

SM-AI-MODELS is powered by custom-tuned neural speech engines optimized for Arabic and English.

Text-to-Speech Engine

SM-TTS-V1

Neural TTS engine delivering natural, expressive Arabic and English speech with low latency.

Property	Value
Engine	SM-TTS-V1
Architecture	Neural TTS (Proprietary)
Languages	Arabic, English
Voices	Yara (AR), Nouf (AR), Yara_en (EN)
Sample Rate	22,050 Hz (default), configurable: 8,000 – 24,000 Hz
Bit Depth	16-bit
Max Input	5,000 characters per request
Latency (TTFC)	~200-350ms (REST), ~150-250ms (gRPC streaming)
Output Formats	MP3, WAV, OPUS, FLAC, PCM
REST Port	9999
gRPC Port	50051
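
Because each request accepts at most 5,000 characters, longer scripts have to be split on the client side. The sketch below shows one possible way to do that in Python. The function name `split_for_tts` and the choice to break at sentence punctuation are illustrative assumptions, not part of SM-TTS-V1.

```python
import re

MAX_INPUT_CHARS = 5000  # per-request limit from the specification table

# Split after Latin or Arabic sentence-ending punctuation followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?؟؛])\s+")


def split_for_tts(text: str, limit: int = MAX_INPUT_CHARS) -> list[str]:
    """Split text into chunks of at most `limit` characters, preferring sentence breaks."""
    chunks: list[str] = []
    current = ""
    for sentence in _SENTENCE_END.split(text.strip()):
        if not sentence:
            continue
        # A single sentence longer than the limit is hard-cut into pieces.
        while len(sentence) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:limit])
            sentence = sentence[limit:]
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= limit:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each returned chunk can then be sent as its own request and the audio concatenated in order. Hard cuts inside a very long sentence may produce an audible seam, so sentence-level punctuation in the source text is worth keeping.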

Capabilities

Feature	Supported	Notes
Arabic (MSA)	✅	Modern Standard Arabic — primary optimization
Arabic (Gulf dialect)	✅	Optimized for Saudi/Gulf pronunciation
English	✅	Via Yara_en voice
Mixed Arabic/English	✅	Auto-detects language switches within text
Speed control	✅	0.25x – 4.0x
Streaming	✅	HTTP chunked + gRPC server streaming
SSML	🔶 Partial	`<break>`, `<say-as>` tags supported (see SSML Reference)
Diacritics (Tashkeel)	✅	Improves Arabic pronunciation accuracy
Number normalization	✅	Automatic — numbers read as spoken words
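
The limits in the two tables above can be checked on the client before a request is sent. The following is a minimal sketch, not the official client: only the `input` and `voice` fields appear in this documentation, so the field names `speed`, `sample_rate`, and `response_format` (and the MP3 default) are assumptions made for illustration.

```python
from dataclasses import asdict, dataclass

VOICES = {"Yara", "Nouf", "Yara_en"}
FORMATS = {"mp3", "wav", "opus", "flac", "pcm"}


@dataclass(frozen=True)
class TTSRequest:
    """Request body checked against the documented limits before it is sent."""

    input: str
    voice: str = "Yara"           # documented default voice
    speed: float = 1.0            # hypothetical field name
    sample_rate: int = 22050      # hypothetical field name; documented default rate
    response_format: str = "mp3"  # hypothetical field name and default

    def __post_init__(self) -> None:
        if not 1 <= len(self.input) <= 5000:
            raise ValueError("input must be 1 to 5,000 characters")
        if self.voice not in VOICES:
            raise ValueError(f"unknown voice: {self.voice}")
        if not 0.25 <= self.speed <= 4.0:
            raise ValueError("speed must be between 0.25 and 4.0")
        if not 8000 <= self.sample_rate <= 24000:
            raise ValueError("sample_rate must be between 8,000 and 24,000 Hz")
        if self.response_format not in FORMATS:
            raise ValueError(f"unknown format: {self.response_format}")


# Build a plain dict that can be serialized to JSON.
body = asdict(TTSRequest(input="مرحباً بكم", voice="Nouf", speed=1.1))
```

Validating locally gives an immediate, readable error instead of a round trip to the server. The server remains the authority on what it accepts.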

Voice Details

Yara (Arabic — Default)

Property	Value
Language	Arabic
Gender	Female
Tone	Natural, clear, professional
Best For	IVR systems, customer service, announcements
Dialect	Modern Standard Arabic with Gulf accent
Speed Sweet Spot	0.9 – 1.2x

Nouf (Arabic)

Property	Value
Language	Arabic
Gender	Female
Tone	Warm, expressive, conversational
Best For	Conversational agents, educational content, storytelling
Dialect	Modern Standard Arabic
Speed Sweet Spot	0.8 – 1.1x

Yara_en (English)

Property	Value
Language	English
Gender	Female
Tone	Professional, neutral
Best For	English prompts, bilingual IVR, international audiences
Accent	Neutral English
Speed Sweet Spot	0.9 – 1.3x

Voice Selection Guide

Scenario	Recommended Voice	Why
Banking IVR (Arabic)	Yara	Clear, professional, excellent for structured prompts
Customer service bot	Nouf	Warm tone feels conversational and approachable
Bilingual system (AR/EN)	Yara + Yara_en	Same voice family, consistent brand experience
Educational content	Nouf	Expressive delivery keeps listener engaged
English-only service	Yara_en	Professional neutral English
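
The selection guide and the per-voice speed ranges can be captured in a small lookup, sketched below. The scenario keys are hypothetical labels for the rows of the guide, and clamping a requested speed to the documented sweet spot is one possible policy rather than engine behavior.

```python
# Speed sweet spots from the Voice Details tables.
SWEET_SPOT = {
    "Yara": (0.9, 1.2),
    "Nouf": (0.8, 1.1),
    "Yara_en": (0.9, 1.3),
}

# Scenario keys are hypothetical labels for the rows of the selection guide.
SCENARIO_VOICES = {
    "banking_ivr_ar": ["Yara"],
    "customer_service_bot": ["Nouf"],
    "bilingual_ar_en": ["Yara", "Yara_en"],
    "educational_content": ["Nouf"],
    "english_only": ["Yara_en"],
}


def clamp_to_sweet_spot(voice: str, speed: float) -> float:
    """Pull a requested speed into the range where the voice sounds most natural."""
    low, high = SWEET_SPOT[voice]
    return min(max(speed, low), high)
```

For example, `clamp_to_sweet_spot("Nouf", 1.5)` returns `1.1`, the upper end of Nouf's range. Speeds outside the sweet spot are still valid as long as they fall within 0.25x to 4.0x.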

Speech Recognition Engine

SM-STT-V1

Neural ASR engine providing high-accuracy Arabic transcription with real-time streaming support.

Property	Value
Engine	SM-STT-V1
Architecture	Neural ASR (Proprietary)
Primary Language	Arabic
Secondary Language	English
Accuracy (Arabic WER)	~8-12% on clean audio
Accuracy (English WER)	~10-15% on clean audio
Max Audio Duration	300 seconds (5 min) per REST request
Max File Size	25 MB
Supported Formats	FLAC, MP3, WAV, OGG, WebM
Optimal Sample Rate	16 kHz mono
REST Port	8088
gRPC Port	50052
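
The REST limits above lend themselves to a pre-flight check before upload. This sketch uses only the Python standard library, so it can read duration and channel layout for PCM WAV files only; other formats are checked by extension and size. The function name `preflight` and the decimal reading of "25 MB" are assumptions.

```python
import wave
from pathlib import Path

MAX_BYTES = 25_000_000  # conservative (decimal) reading of the 25 MB limit
MAX_SECONDS = 300       # per REST request
SUPPORTED = {".flac", ".mp3", ".wav", ".ogg", ".webm"}


def preflight(path: str) -> list[str]:
    """Return a list of problems found; an empty list means the file looks acceptable."""
    p = Path(path)
    suffix = p.suffix.lower()
    problems: list[str] = []
    if suffix not in SUPPORTED:
        problems.append(f"unsupported extension {p.suffix!r}")
    if p.stat().st_size > MAX_BYTES:
        problems.append("file exceeds 25 MB")
    if suffix == ".wav":
        # The wave module reads the header without decoding the audio.
        with wave.open(str(p), "rb") as w:
            seconds = w.getnframes() / w.getframerate()
            if seconds > MAX_SECONDS:
                problems.append(f"duration {seconds:.0f}s exceeds {MAX_SECONDS}s")
            if w.getframerate() != 16000 or w.getnchannels() != 1:
                problems.append("not 16 kHz mono (the documented optimum)")
    return problems
```

The last check is advisory: 16 kHz mono is listed as optimal, not required.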

ASR Capabilities

Feature	Supported	Notes
Arabic transcription	✅	MSA + Gulf dialect
English transcription	✅	Optimized as secondary language
Mixed Arabic/English	✅	Auto-detects language switches
Streaming (real-time)	✅	gRPC bidirectional streaming
Word timestamps	✅	Via gRPC response fields
Confidence scores	✅	Per-utterance confidence (0.0 – 1.0)
Speaker diarization	✅	Available via `/api/transcribe` endpoint
Punctuation	✅	Automatic punctuation insertion
Number formatting	✅	Spoken numbers converted to digits

ASR Accuracy Tips

Factor	Impact	Recommendation
Audio quality	High	Use a sample rate of 16 kHz or higher; minimize background noise
Speaker distance	High	Keep the microphone within 30 cm of the speaker
Audio format	Medium	Prefer FLAC or WAV over lossy formats
Segment length	Medium	Send 5-30 second segments for best accuracy
Speaking pace	Low	Speak at a normal pace (120-160 words/min for Arabic)
Diacritics context	Low	No action needed; the model infers diacritics from context
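
The segment-length recommendation can be applied by planning cut points before transcription. The sketch below divides a recording into equal windows of at most 30 seconds. The function name `plan_segments` is hypothetical, and fixed windows are a simplification: cutting at pauses in speech would avoid splitting words.

```python
import math


def plan_segments(total_seconds: float, max_len: float = 30.0) -> list[tuple[float, float]]:
    """Split [0, total_seconds] into equal windows no longer than max_len seconds."""
    if total_seconds <= 0:
        return []
    count = max(1, math.ceil(total_seconds / max_len))
    length = total_seconds / count
    return [(i * length, min((i + 1) * length, total_seconds)) for i in range(count)]
```

A 95-second recording yields four windows of 23.75 seconds each. Using equal windows rather than three 30-second windows plus a 5-second remainder keeps every segment comfortably inside the recommended range.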

Language Support

Language Support Matrix

Text-to-Speech

Language	Voice(s)	Quality	Dialect
Arabic	Yara, Nouf	⭐⭐⭐⭐⭐	MSA + Gulf
English	Yara_en	⭐⭐⭐⭐	Neutral

Speech Recognition

Language	Quality	WER (Clean Audio)	Notes
Arabic (MSA)	⭐⭐⭐⭐⭐	~8-12%	Primary optimization
Arabic (Gulf)	⭐⭐⭐⭐	~10-15%	Saudi, Emirati, Kuwaiti dialects
Arabic (Egyptian)	⭐⭐⭐	~15-20%	Reasonable accuracy
Arabic (Levantine)	⭐⭐⭐	~15-20%	Syrian, Lebanese, Jordanian
English	⭐⭐⭐⭐	~10-15%	Optimized as secondary language
Mixed AR/EN	⭐⭐⭐⭐	~12-18%	Code-switching handled

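WER = Word Error Rate: word substitutions, deletions, and insertions divided by the number of words in the reference transcript. Lower is better. Measured on clean audio at 16 kHz.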
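
To reproduce WER figures on your own recordings, the metric can be computed as the word-level edit distance between a reference transcript and the engine output, divided by the reference length. This is a minimal sketch of the standard definition. It splits on whitespace only, so any text normalization (punctuation, diacritics, digit style) is left to the caller and will affect the result.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference transcript is empty")
    # prev[j] holds the edit distance between the processed reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)
```

For a five-word reference with one word missing from the output, `wer` returns `0.2`, that is, 20%.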

Arabic Dialect Details

TTS Dialect Behavior

The Yara and Nouf voices produce Modern Standard Arabic (MSA) pronunciation with Gulf Arabic characteristics:

Feature	Behavior
ق (Qaf)	Pronounced as /q/ (MSA standard)
ج (Jeem)	Pronounced as /dʒ/ (standard)
ث (Tha)	Pronounced as /θ/ (standard)
ذ (Dhal)	Pronounced as /ð/ (standard)
Numbers	Units before tens, as in standard Arabic (e.g., 25 is read خمسة وعشرون, not عشرون وخمسة)
Date format	Day-Month-Year convention
Currency	Saudi Riyal (ريال) recognized by default

ASR Dialect Handling

The ASR engine is trained on multi-dialect Arabic data and adapts to the speaker's dialect automatically, with no explicit configuration. Transcription output, however, is normalized to MSA spelling conventions.

Dialect and Spoken Phrase	Transliteration	Transcribed As (MSA)
Gulf: "شلونك"	shlōnak	كيف حالك
Egyptian: "إزيك"	izzayyak	كيف حالك
Levantine: "كيفك"	kīfak	كيف حالك

Note: Dialect-specific transcription (preserving dialectal spelling) is planned for a future release.

Mixed Language (Code-Switching)

SM-AI-MODELS handles Arabic/English code-switching — common in Gulf business communication.

TTS Code-Switching


{
  "input": "يرجى إرسال الـ report إلى قسم الـ HR قبل نهاية اليوم",
  "voice": "Yara"
}

The engine automatically:

Detects English words within Arabic text
Switches pronunciation model for English segments
Maintains natural prosody across language boundaries

Tips for best code-switching results:

Pattern	Quality	Example
Arabic sentence with English terms	⭐⭐⭐⭐⭐	"أرسل الـ email الآن"
Full sentence switch	⭐⭐⭐⭐	"شكراً. Thank you for calling."
Word-level alternation	⭐⭐⭐	Complex mixing may reduce naturalness
English sentence with Arabic name	⭐⭐⭐⭐	Use Yara_en for primarily English content
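
The last row suggests choosing the voice by the dominant language of the text. One way to approximate that on the client is to compare the share of Arabic-script letters, as sketched below. The function names and the 0.5 threshold are assumptions, and this heuristic is independent of the engine's own language detection.

```python
def arabic_ratio(text: str) -> float:
    """Fraction of alphabetic characters that fall in the Arabic Unicode block."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    arabic = sum(1 for ch in letters if "\u0600" <= ch <= "\u06FF")
    return arabic / len(letters)


def pick_voice(text: str, arabic_voice: str = "Yara", threshold: float = 0.5) -> str:
    """Use an Arabic voice for mostly-Arabic text, Yara_en otherwise."""
    return arabic_voice if arabic_ratio(text) >= threshold else "Yara_en"
```

With this heuristic, "أرسل الـ email الآن" is routed to Yara and "Please call أحمد tomorrow" to Yara_en, matching the first and last rows of the table.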

ASR Code-Switching

The ASR engine detects language switches at the word level:


// Input audio: "أبغى أحجز appointment يوم الخميس"
// Output:
{
  "text": "أبغى أحجز appointment يوم الخميس",
  "language": "ar"
}

Character Sets

Supported Arabic Characters

Range	Characters	Support
Arabic letters	ا ب ت ث ج ح خ د ذ ر ز س ش ص ض ط ظ ع غ ف ق ك ل م ن ه و ي ة ى	✅
Hamza variants	أ إ آ ء ئ ؤ	✅
Diacritics	َ ِ ُ ّ ْ ً ٍ ٌ	✅
Punctuation	، ؛ ؟ ! .	✅
Arabic-Indic digits	٠ ١ ٢ ٣ ٤ ٥ ٦ ٧ ٨ ٩	✅
Western (ASCII) digits	0 1 2 3 4 5 6 7 8 9	✅
Tatweel	ـ (kashida)	✅ Ignored

Unsupported Characters

Characters not in the supported set are silently skipped. Emoji and special Unicode symbols are ignored.
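
Since unsupported characters are dropped without an error, it can help to see in advance what would be skipped. The sketch below approximates the supported set as the Arabic Unicode block (U+0600 to U+06FF) plus printable ASCII, which is an assumption: the engine's exact set may be narrower or wider. Tatweel is removed up front because it is listed as ignored.

```python
import string

TATWEEL = "\u0640"
_ASCII_OK = set(string.ascii_letters + string.digits + string.punctuation + " \n\t")


def is_supported(ch: str) -> bool:
    """Approximate check: printable ASCII or any character in the Arabic block."""
    return ch in _ASCII_OK or "\u0600" <= ch <= "\u06FF"


def sanitize(text: str) -> tuple[str, list[str]]:
    """Return the text as the engine would likely see it, plus the skipped characters."""
    kept: list[str] = []
    skipped: list[str] = []
    for ch in text:
        if ch == TATWEEL:
            continue  # documented as ignored, so not reported as a problem
        if is_supported(ch):
            kept.append(ch)
        else:
            skipped.append(ch)
    return "".join(kept), skipped
```

Logging the `skipped` list during development makes it easy to spot emoji or symbols that users expect to hear but that produce no audio.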

Language Detection (ASR)

The ASR engine automatically detects the spoken language and returns it in the response:


{
  "text": "مرحباً بكم في يونيكود سولوشنز",
  "language": "ar"
}


{
  "text": "Welcome to Unicode Solutions",
  "language": "en"
}

For mixed-language audio, the `language` field reflects the dominant language.

Hardware Requirements

SM-AI-MODELS is self-hosted and requires GPU acceleration for optimal performance.

Minimum Requirements

Component	TTS (SM-TTS-V1)	ASR (SM-STT-V1)	Both Services
GPU	NVIDIA T4 (16GB)	NVIDIA T4 (16GB)	NVIDIA A10G (24GB)
CPU	4 cores	4 cores	8 cores
RAM	16 GB	16 GB	32 GB
Storage	20 GB	20 GB	40 GB
CUDA	11.8+	11.8+	11.8+

Recommended (Production)

Component	Specification
GPU	NVIDIA A100 (40GB) or H100 (80GB)
CPU	16+ cores
RAM	64 GB
Storage	100 GB SSD
CUDA	12.0+
Concurrent capacity	~20-50 simultaneous requests (A100)
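
The concurrent-capacity figure can be turned into a rough sizing estimate. The arithmetic below assumes that capacity scales linearly with the number of A100-class GPUs, which is a simplification: real throughput depends on request length, batching, and the TTS/ASR mix. The function name `gpus_needed` is hypothetical.

```python
import math


def gpus_needed(peak_concurrent: int, per_gpu: int = 20) -> int:
    """A100-class GPUs needed, using the conservative end of the ~20-50 range by default."""
    if peak_concurrent < 0 or per_gpu <= 0:
        raise ValueError("peak_concurrent must be >= 0 and per_gpu must be > 0")
    return math.ceil(peak_concurrent / per_gpu)
```

For a peak of 120 simultaneous requests this gives 6 GPUs at 20 requests per GPU and 3 GPUs at 50, so load testing with representative traffic is needed to narrow the range.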

Current Models

Model	Type	Status
SM-TTS-V1	Text-to-Speech	✅ Production
SM-STT-V1	Speech-to-Text	✅ Production

Last modified on February 7, 2026
