Models

Choose the right text-to-speech model for your use case

Available models

| Model | ID | Languages | Voice Cloning | Best for |
|---|---|---|---|---|
| Simba English | simba-english | English only | Zero-shot + fine-tuning | Production English TTS with highest quality |
| Simba Multilingual | simba-multilingual | 50+ languages | Zero-shot + fine-tuning | Multi-language or mixed-language content |

Pass the model ID as the model parameter in your API calls. If omitted, the API defaults to simba-english.

POST /v1/audio/speech

curl -X POST https://api.speechify.ai/v1/audio/speech \
  -H "Authorization: Bearer <token>" \
  -H "Content-Type: application/json" \
  -d '{
    "input": "Hello! This is the Speechify text-to-speech API.",
    "voice_id": "george",
    "audio_format": "mp3",
    "model": "simba-english"
  }'

Simba English

Optimized for English text-to-speech with the highest quality output.

  • Clear, natural-sounding speech
  • Consistent quality across outputs
  • Full support for SSML and emotion control
  • Zero-shot voice cloning from short audio samples
  • Fine-tuned voice cloning from hours of speaker audio (contact sales)
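To illustrate the SSML support listed above, here is a minimal sketch of a request body that carries SSML markup. It assumes that SSML is passed in the same input field as plain text and that the standard `<speak>` and `<break>` elements are accepted; neither detail is confirmed on this page. The function name build_ssml_payload is hypothetical, and the voice ID is reused from the earlier example.

```shell
# Print a JSON request body whose "input" contains SSML.
# Assumption: SSML goes in the "input" field, wrapped in <speak>.
# The quotes inside the SSML attribute are escaped so the JSON stays valid.
build_ssml_payload() {
  cat <<'EOF'
{
  "input": "<speak>Hello.<break time=\"500ms\"/>This sentence follows a short pause.</speak>",
  "voice_id": "george",
  "audio_format": "mp3",
  "model": "simba-english"
}
EOF
}

build_ssml_payload
```

The printed body can be passed to the -d option of the curl command shown earlier on this page.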

Simba Multilingual

Note: This model is currently experimental and is subject to change.

Supports multiple languages, including mixing languages within a single sentence.

  • 6 fully supported languages, 17 in beta, 25 coming soon
  • Automatic language detection when the language parameter is omitted
  • Zero-shot voice cloning works across all supported languages
  • Fine-tuned voice cloning available (contact sales)

See Language Support for the full list.
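As a rough sketch of the language behavior described above, the helper below builds a Simba Multilingual request body with or without the language parameter. The function name build_multilingual_payload is hypothetical, the "fr-FR" value assumes a locale-style code (check Language Support for the accepted values), and the voice ID is reused from the earlier example and may not be available for this model.

```shell
# Print a JSON request body for simba-multilingual.
# With no argument, the "language" field is left out, which per the text
# above lets the API detect the language automatically.
# With an argument, the field is set explicitly (value format is assumed).
# Assumes the text contains no characters that need JSON escaping.
build_multilingual_payload() {
  local text='Bonjour, welcome to the demo.'
  local language="${1:-}"
  local language_field=""
  if [ -n "$language" ]; then
    language_field=", \"language\": \"$language\""
  fi
  printf '{"input": "%s", "voice_id": "george", "model": "simba-multilingual"%s}\n' \
    "$text" "$language_field"
}

build_multilingual_payload          # language omitted: auto-detection
build_multilingual_payload "fr-FR"  # language set explicitly
```

Either body can be sent with the curl command shown earlier on this page by passing it to -d.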

Voice cloning

Both models support two tiers of voice cloning:

| Tier | Input | Quality | Availability |
|---|---|---|---|
| Zero-shot | 10-30 second audio sample | Good | Self-serve via API or Console |
| Fine-tuned | Hours of speaker audio | Best | Contact sales |

See Voice Cloning for implementation details.

FAQ

Which model should I use?

Use Simba English if your content is English-only; it produces the highest-quality output. Use Simba Multilingual if you need non-English languages or mixed-language content.

Can I switch between models?

Yes. Change only the model parameter. All other parameters (voice, format, SSML) work the same across models.
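Switching models can be sketched as follows: the two request bodies differ only in the model field. The helper name build_payload is hypothetical, and the voice ID is reused from the earlier example; since built-in voices may differ between models, a cloned voice ID may be the safer choice when the same voice is needed on both.

```shell
# Print a JSON request body for the given model ID.
# Everything except "model" is held constant.
# Assumes the values contain no characters that need JSON escaping.
build_payload() {
  local model="$1"
  printf '{"input": "%s", "voice_id": "%s", "audio_format": "mp3", "model": "%s"}\n' \
    "Hello from Speechify." "george" "$model"
}

build_payload "simba-english"
build_payload "simba-multilingual"
```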

Do voices work across both models?

Built-in system voices may differ between models. Cloned voices work with both models.