English prompts, bilingual IVR, international audiences
Accent: Neutral English
Speed Sweet Spot: 0.9 – 1.3x
Voice Selection Guide
| Scenario | Recommended Voice | Why |
| --- | --- | --- |
| Banking IVR (Arabic) | Yara | Clear, professional, excellent for structured prompts |
| Customer service bot | Nouf | Warm tone feels conversational and approachable |
| Bilingual system (AR/EN) | Yara + Yara_en | Same voice family, consistent brand experience |
| Educational content | Nouf | Expressive delivery keeps listener engaged |
| English-only service | Yara_en | Professional neutral English |
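The guide above can be captured as a small lookup on the client side. The following is a minimal sketch, not part of the product: the scenario keys and the helper names `bilingual_voice` and `tts_payload` are hypothetical, while the voice names and the `input`/`voice` payload keys are the ones used on this page.

```python
import json

# Scenario keys are illustrative; voice names come from the guide above.
VOICE_BY_SCENARIO = {
    "banking_ivr_ar": "Yara",
    "customer_service_bot": "Nouf",
    "educational_content": "Nouf",
    "english_only": "Yara_en",
}


def bilingual_voice(language: str) -> str:
    """Pick the voice for a bilingual (AR/EN) system from the same voice family."""
    return "Yara_en" if language == "en" else "Yara"


def tts_payload(text: str, voice: str) -> str:
    """Serialize a TTS request body; ensure_ascii=False keeps Arabic readable."""
    return json.dumps({"input": text, "voice": voice}, ensure_ascii=False)


if __name__ == "__main__":
    print(tts_payload("مرحباً بكم", VOICE_BY_SCENARIO["banking_ivr_ar"]))
    print(tts_payload("Welcome", bilingual_voice("en")))
```

Keeping the mapping in one place makes it easier to hold a consistent voice per channel, which is the point of the "same voice family" recommendation.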
Speech Recognition Engine
SM-STT-V1
Neural ASR engine providing high-accuracy Arabic transcription with real-time streaming support.
| Property | Value |
| --- | --- |
| Engine | SM-STT-V1 |
| Architecture | Neural ASR (Proprietary) |
| Primary Language | Arabic |
| Secondary Language | English |
| Accuracy (Arabic WER) | ~8-12% on clean audio |
| Accuracy (English WER) | ~10-15% on clean audio |
| Max Audio Duration | 300 seconds (5 min) per REST request |
| Max File Size | 25 MB |
| Supported Formats | FLAC, MP3, WAV, OGG, WebM |
| Optimal Sample Rate | 16 kHz mono |
| REST Port | 8088 |
| gRPC Port | 50052 |
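Checking the documented limits before uploading avoids a round trip that is likely to be rejected. The sketch below is a client-side pre-flight check; the function name `preflight_problems` is hypothetical, and reading "25 MB" as 25,000,000 bytes is an assumption (the page does not say whether MB is decimal or binary).

```python
from pathlib import Path

MAX_DURATION_S = 300            # 5 minutes per REST request
MAX_FILE_BYTES = 25_000_000     # assumption: "25 MB" read as decimal megabytes
SUPPORTED_EXTENSIONS = {".flac", ".mp3", ".wav", ".ogg", ".webm"}


def preflight_problems(filename: str, size_bytes: int, duration_s: float) -> list[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    ext = Path(filename).suffix.lower()
    if ext not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported format: {ext or '(none)'}")
    if size_bytes > MAX_FILE_BYTES:
        problems.append(f"file too large: {size_bytes} bytes")
    if duration_s > MAX_DURATION_S:
        problems.append(f"audio too long: {duration_s:.1f} s")
    return problems


if __name__ == "__main__":
    print(preflight_problems("call.wav", 1_000_000, 42.0))
    print(preflight_problems("call.aac", 30_000_000, 301.0))
```

Audio longer than the REST limit would need to be split or sent over the gRPC streaming interface instead.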
ASR Capabilities
| Feature | Supported | Notes |
| --- | --- | --- |
| Arabic transcription | ✅ | MSA + Gulf dialect |
| English transcription | ✅ | Optimized as secondary language |
| Mixed Arabic/English | ✅ | Auto-detects language switches |
| Streaming (real-time) | ✅ | gRPC bidirectional streaming |
| Word timestamps | ✅ | Via gRPC response fields |
| Confidence scores | ✅ | Per-utterance confidence (0.0 – 1.0) |
| Speaker diarization | ✅ | Available via /api/transcribe endpoint |
| Punctuation | ✅ | Automatic punctuation insertion |
| Number formatting | ✅ | Spoken numbers converted to digits |
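As a rough illustration of calling the REST interface, the sketch below builds (but does not send) an upload request. Only the port 8088 and the `/api/transcribe` path come from this page. The HTTP method, the multipart encoding, and the form field name `file` are assumptions, so check them against the API reference before relying on them.

```python
import uuid
import urllib.request

REST_PORT = 8088  # REST port from the engine properties


def build_transcribe_request(host: str, audio: bytes, filename: str,
                             content_type: str = "audio/wav") -> urllib.request.Request:
    """Build a multipart/form-data POST for /api/transcribe (request shape assumed)."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        # "file" is a hypothetical field name, not confirmed by this page
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        f"Content-Type: {content_type}\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return urllib.request.Request(
        url=f"http://{host}:{REST_PORT}/api/transcribe",
        data=head + audio + tail,
        method="POST",
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
    )


if __name__ == "__main__":
    req = build_transcribe_request("localhost", b"RIFF....", "sample.wav")
    print(req.get_method(), req.full_url)
    # To send it: urllib.request.urlopen(req, timeout=60)
```

Building the request separately from sending it keeps the example testable without a running server.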
ASR Accuracy Tips
| Factor | Impact | Recommendation |
| --- | --- | --- |
| Audio quality | High | Use a sample rate of 16 kHz or higher; minimize background noise |
| Speaker distance | High | Keep the microphone within 30 cm of the speaker |
| Audio format | Medium | FLAC or WAV preferred over lossy formats |
| Segment length | Medium | 5-30 seconds per segment for best accuracy |
| Speaking pace | Low | Normal speaking pace (120-160 words/min for Arabic) |
| Diacritics context | Low | Model infers diacritics from context |
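The segment-length recommendation can be applied by planning cut points before transcription. The following sketch divides a recording into equal-length segments of at most 30 seconds; `plan_segments` is a hypothetical helper, and cutting at fixed times is a simplification, since a production splitter would more likely cut at silences to avoid splitting words.

```python
import math


def plan_segments(total_s: float, max_len_s: float = 30.0) -> list[tuple[float, float]]:
    """Split [0, total_s] into equal segments no longer than max_len_s.

    Using the smallest segment count that respects max_len_s keeps every
    segment longer than max_len_s / 2 whenever more than one is needed.
    """
    if total_s <= 0:
        return []
    count = math.ceil(total_s / max_len_s)
    length = total_s / count
    bounds = [i * length for i in range(count)] + [total_s]
    return list(zip(bounds[:-1], bounds[1:]))


if __name__ == "__main__":
    for start, end in plan_segments(70.0):
        print(f"{start:6.2f} s -> {end:6.2f} s")
```

Equal-length segments avoid the very short trailing fragment that a fixed 30-second stride would leave, which matters because segments under 5 seconds fall outside the recommended range.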
Language Support
Language Support Matrix
Text-to-Speech
| Language | Voice(s) | Quality | Dialect |
| --- | --- | --- | --- |
| Arabic | Yara, Nouf | ⭐⭐⭐⭐⭐ | MSA + Gulf |
| English | Yara_en | ⭐⭐⭐⭐ | Neutral |
Speech Recognition
| Language | Quality | WER (Clean Audio) | Notes |
| --- | --- | --- | --- |
| Arabic (MSA) | ⭐⭐⭐⭐⭐ | ~8-12% | Primary optimization |
| Arabic (Gulf) | ⭐⭐⭐⭐ | ~10-15% | Saudi, Emirati, Kuwaiti dialects |
| Arabic (Egyptian) | ⭐⭐⭐ | ~15-20% | Reasonable accuracy |
| Arabic (Levantine) | ⭐⭐⭐ | ~15-20% | Syrian, Lebanese, Jordanian |
| English | ⭐⭐⭐⭐ | ~10-15% | Optimized as secondary language |
| Mixed AR/EN | ⭐⭐⭐⭐ | ~12-18% | Code-switching handled |
WER = Word Error Rate. Lower is better. Measured on clean audio at 16 kHz.
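WER is conventionally computed as (substitutions + deletions + insertions) divided by the number of words in the reference transcript, using a word-level edit distance. The sketch below shows that standard calculation so the figures above can be reproduced on your own recordings. It splits on whitespace only, so for Arabic you would probably want to normalize diacritics and punctuation first (that normalization step is not shown and is an assumption about your evaluation setup).

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the first i-1 reference words
    # and the first j hypothesis words
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1        # reference word missing from hypothesis
            insertion = curr[j - 1] + 1   # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    print(word_error_rate("أريد حجز موعد يوم الخميس", "أريد حجز موعد الخميس"))
```

Note that WER can exceed 100% when the hypothesis contains many inserted words.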
Arabic Dialect Details
TTS Dialect Behavior
The Yara and Nouf voices produce Modern Standard Arabic (MSA) pronunciation with Gulf Arabic characteristics:
| Feature | Behavior |
| --- | --- |
| ق (Qaf) | Pronounced as /q/ (MSA standard) |
| ج (Jeem) | Pronounced as /dʒ/ (standard) |
| ث (Tha) | Pronounced as /θ/ (standard) |
| ذ (Dhal) | Pronounced as /ð/ (standard) |
| Numbers | Arabic-style reading (e.g., خمسة وعشرون not عشرون وخمسة) |
| Date format | Day-Month-Year convention |
| Currency | Saudi Riyal (ريال) recognized by default |
ASR Dialect Handling
The ASR engine is trained on multi-dialect Arabic data. It automatically adapts to the speaker's dialect without explicit configuration. However, transcription output is normalized to MSA spelling conventions.
| Spoken Dialect | Example Spoken | Romanization | Transcribed As (MSA) |
| --- | --- | --- | --- |
| Gulf | شلونك | shlōnak | كيف حالك |
| Egyptian | إزيك | izzayyak | كيف حالك |
| Levantine | كيفك | kīfak | كيف حالك |
Note: Dialect-specific transcription (preserving dialectal spelling) is planned for a future release.
Mixed Language (Code-Switching)
SM-AI-MODELS handles Arabic/English code-switching, which is common in Gulf business communication.
TTS Code-Switching
    {
      "input": "يرجى إرسال الـ report إلى قسم الـ HR قبل نهاية اليوم",
      "voice": "Yara"
    }
The engine automatically:
Detects English words within Arabic text
Switches pronunciation model for English segments
Maintains natural prosody across language boundaries
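To preview which parts of a prompt are likely to be treated as English, the text can be split into script runs on the client. This is only an illustrative approximation and not the engine's actual detection logic: `split_by_script` is a hypothetical helper, it assumes English terms are written in ASCII letters, and it labels everything else (including spaces and punctuation) as Arabic.

```python
import re

# One or more ASCII words separated by whitespace form a single "en" run.
_LATIN_RUN = re.compile(r"[A-Za-z][A-Za-z0-9'-]*(?:\s+[A-Za-z][A-Za-z0-9'-]*)*")


def split_by_script(text: str) -> list[tuple[str, str]]:
    """Return (language, segment) pairs that concatenate back to the input."""
    segments = []
    pos = 0
    for match in _LATIN_RUN.finditer(text):
        if match.start() > pos:
            segments.append(("ar", text[pos:match.start()]))
        segments.append(("en", match.group()))
        pos = match.end()
    if pos < len(text):
        segments.append(("ar", text[pos:]))
    return segments


if __name__ == "__main__":
    prompt = "يرجى إرسال الـ report إلى قسم الـ HR قبل نهاية اليوم"
    for lang, segment in split_by_script(prompt):
        print(lang, repr(segment))
```

Counting the `en` runs gives a quick signal for the patterns in the tips that follow: a couple of isolated terms is the best-supported case, while many short alternating runs suggest rephrasing the prompt.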
Tips for best code-switching results:
| Pattern | Quality | Example / Note |
| --- | --- | --- |
| Arabic sentence with English terms | ⭐⭐⭐⭐⭐ | "أرسل الـ email الآن" |
| Full sentence switch | ⭐⭐⭐⭐ | "شكراً. Thank you for calling." |
| Word-level alternation | ⭐⭐⭐ | Complex mixing may reduce naturalness |
| English sentence with Arabic name | ⭐⭐⭐⭐ | Use Yara_en for primarily English content |
ASR Code-Switching
The ASR engine detects language switches at the word level:
    // Input audio: "أبغى أحجز appointment يوم الخميس"
    // Output:
    {
      "text": "أبغى أحجز appointment يوم الخميس",
      "language": "ar"
    }
Character Sets
Supported Arabic Characters
| Range | Characters | Support |
| --- | --- | --- |
| Arabic letters | ا ب ت ث ج ح خ د ذ ر ز س ش ص ض ط ظ ع غ ف ق ك ل م ن ه و ي | ✅ |
| Hamza variants | أ إ آ ء ئ ؤ | ✅ |
| Diacritics | َ ِ ُ ّ ْ ً ٍ ٌ | ✅ |
| Punctuation | ، ؛ ؟ ! . | ✅ |
| Arabic numerals | ٠ ١ ٢ ٣ ٤ ٥ ٦ ٧ ٨ ٩ | ✅ |
| Western numerals | 0 1 2 3 4 5 6 7 8 9 | ✅ |
| Tatweel | ـ (kashida) | ✅ (ignored) |
Unsupported Characters
Characters not in the supported set are silently skipped. Emoji and special Unicode symbols are ignored.
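Because skipped characters produce no error, it can help to flag them before sending text to the engine. The sketch below approximates the supported set with Unicode ranges. This is an assumption rather than the engine's exact rule: the ranges also admit ة and ى, which appear in this page's own examples, and ASCII punctuation outside the listed set is flagged conservatively. The helper names `classify` and `preview_skipped` are hypothetical.

```python
TATWEEL = "\u0640"
LISTED_PUNCTUATION = "،؛؟!."


def classify(ch: str) -> str:
    """Return 'ok', 'ignored' (tatweel) or 'skipped' for a single character."""
    cp = ord(ch)
    if ch == TATWEEL:
        return "ignored"
    if ch.isspace():
        return "ok"
    if 0x0621 <= cp <= 0x063A or 0x0641 <= cp <= 0x064A:
        return "ok"  # Arabic letters, including hamza variants
    if 0x064B <= cp <= 0x0652:
        return "ok"  # diacritics (tanwin, short vowels, shadda, sukun)
    if 0x0660 <= cp <= 0x0669:
        return "ok"  # Arabic-Indic digits
    if ch.isascii() and ch.isalnum():
        return "ok"  # Latin letters and Western digits
    if ch in LISTED_PUNCTUATION:
        return "ok"
    return "skipped"


def preview_skipped(text: str) -> list[str]:
    """List the characters that would probably be dropped without notice."""
    return [ch for ch in text if classify(ch) == "skipped"]


if __name__ == "__main__":
    print(preview_skipped("مرحباً \U0001F44B بكم ★"))
```

Logging the result of `preview_skipped` during content authoring makes silent drops visible before they reach callers.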
Language Detection (ASR)
The ASR engine automatically detects the spoken language and returns it in the response:
    { "text": "مرحباً بكم في يونيكود سولوشنز", "language": "ar" }

    { "text": "Welcome to Unicode Solutions", "language": "en" }
For mixed-language audio, the language field reflects the dominant language.
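The response examples on this page carry a single `language` value, so a client that needs to know which individual words were English has to derive that itself. One possible approach, sketched below under that assumption, tags each word by the script of its letters. `word_language` and `tag_words` are hypothetical helpers, and the majority-of-letters rule is a heuristic, not how the engine determines the dominant language.

```python
import json


def word_language(word: str):
    """Return 'en' if most letters are ASCII, 'ar' otherwise, None if no letters."""
    letters = [c for c in word if c.isalpha()]
    if not letters:
        return None
    latin = sum(1 for c in letters if c.isascii())
    return "en" if latin > len(letters) / 2 else "ar"


def tag_words(response_json: str):
    """Parse an ASR response and pair each transcribed word with a language tag."""
    result = json.loads(response_json)
    tags = [(word, word_language(word)) for word in result["text"].split()]
    return result["language"], tags


if __name__ == "__main__":
    response = '{"text": "أبغى أحجز appointment يوم الخميس", "language": "ar"}'
    dominant, tags = tag_words(response)
    print("dominant:", dominant)
    for word, lang in tags:
        print(lang, word)
```

Per-word tags like these can drive downstream routing, for example sending English terms to a separate entity extractor.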
Hardware Requirements
SM-AI-MODELS is self-hosted and requires GPU acceleration for optimal performance.