Create Speech | Speechify API

Synthesize speech audio from text or SSML. Returns the complete audio file plus billing and speech-mark metadata in a single response. For low-latency playback or long-form text, use POST /v1/audio/stream.

Authentication

Authorization: Bearer

Enter your API key with the Bearer prefix, e.g. ‘Bearer sk_…’.
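As a minimal sketch of the header described here, the helper below builds the Authorization value from a key. The function name and the SPEECHIFY_API_KEY environment variable are illustrative assumptions, not part of the API.

```python
import os


def build_auth_headers(api_key: str) -> dict:
    """Return HTTP headers carrying the API key with the Bearer prefix."""
    key = api_key.strip()
    # Tolerate keys stored with the prefix already attached, so the
    # header never ends up as "Bearer Bearer ...".
    if key.lower().startswith("bearer "):
        key = key[len("bearer "):].strip()
    if not key:
        raise ValueError("API key must not be empty")
    return {"Authorization": f"Bearer {key}"}


if __name__ == "__main__":
    # SPEECHIFY_API_KEY is a hypothetical variable name chosen for this example.
    headers = build_auth_headers(os.environ.get("SPEECHIFY_API_KEY", "sk_example"))
    print(list(headers))
```

Reading the key from the environment keeps it out of source control; any other secret store would work equally well.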

Request

This endpoint expects an object.

input string Required

Plain text or SSML to synthesize. See https://docs.speechify.ai/docs/api-limits for input size limits. Emotion, pitch, and speed rate are set through SSML in the input; see the SSML documentation for details: https://docs.speechify.ai/docs/ssml#prosody
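To illustrate passing SSML as the input, here is a small sketch that wraps plain text in standard SSML speak and prosody elements. The helper name is hypothetical, and which attribute values the service accepts is an assumption to check against the SSML documentation linked above. Emotion is left out because its markup is not described on this page.

```python
from typing import Optional
from xml.sax.saxutils import escape, quoteattr


def to_ssml(text: str, rate: Optional[str] = None, pitch: Optional[str] = None) -> str:
    """Wrap plain text in <speak>, optionally inside a <prosody> element."""
    # Escape &, < and > so user text cannot break the markup.
    body = escape(text)
    attrs = []
    if rate is not None:
        attrs.append("rate=" + quoteattr(rate))
    if pitch is not None:
        attrs.append("pitch=" + quoteattr(pitch))
    if attrs:
        body = "<prosody " + " ".join(attrs) + ">" + body + "</prosody>"
    return "<speak>" + body + "</speak>"


if __name__ == "__main__":
    print(to_ssml("Tom & Jerry", rate="slow"))
```

The resulting string is what would be sent as the input field of the request.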

voice_id string Required

ID of the voice to use for synthesis. See the /v1/voices endpoint for the available voices.

audio_format enum Optional (defaults to wav)

The format of the output audio. The current default is "wav", but it may change in the future, so we recommend always passing the format you expect explicitly.

Allowed values:

language string Optional

Language of the input. Follow the format of an ISO 639-1 language code and an ISO 3166-1 region code, separated by a hyphen, e.g. en-US. Please refer to the list of the supported languages and recommendations regarding this parameter: https://docs.speechify.ai/docs/language-support.
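The format described here (two-letter language code, hyphen, two-letter region code) can be checked on the client before sending a request. This is only a shape check under that assumption; it does not tell you whether the service supports a given language, and the helper names are illustrative.

```python
import re

# Two lowercase letters (ISO 639-1), a hyphen, two uppercase letters (ISO 3166-1).
_LANGUAGE_TAG = re.compile(r"[a-z]{2}-[A-Z]{2}")


def is_language_tag(value: str) -> bool:
    """Return True if value looks like 'en-US'."""
    return _LANGUAGE_TAG.fullmatch(value) is not None


def normalize_language(value: str) -> str:
    """Turn inputs such as 'en_us' or 'EN-us' into 'en-US', or raise ValueError."""
    parts = re.split(r"[-_]", value.strip())
    if len(parts) != 2:
        raise ValueError(f"expected language-region, got {value!r}")
    tag = parts[0].lower() + "-" + parts[1].upper()
    if not is_language_tag(tag):
        raise ValueError(f"not a language-region code: {value!r}")
    return tag


if __name__ == "__main__":
    print(normalize_language("en_us"))
```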

model enum Optional (defaults to simba-english)

Model used for audio synthesis. simba-english is optimized for English; simba-multilingual is for non-English or mixed-language input. simba-3.0 is the streaming-native model, with lower time to first byte (TTFB) and richer expressivity. simba-3.0 is currently English only, with multilingual support coming soon; until that ships, requests that use non-English voices return 400.

Allowed values:

options object Optional

GetSpeechOptionsRequest is the wrapper for request parameters to the client
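For readers not using an SDK, the request fields above can be assembled into a JSON body by hand. The sketch below uses only the Python standard library and builds the request without sending it. The host and the /v1/audio/speech path are assumptions, since this section does not state them; confirm both before use.

```python
import json
import urllib.request

# Assumed values: neither the host nor this endpoint's path is stated in this section.
BASE_URL = "https://api.speechify.ai"
SPEECH_PATH = "/v1/audio/speech"


def build_speech_request(api_key, text, voice_id, audio_format="mp3",
                         language=None, model=None):
    """Build (but do not send) a POST request for speech synthesis."""
    # audio_format is always included, following the advice not to rely on the default.
    payload = {"input": text, "voice_id": voice_id, "audio_format": audio_format}
    # Optional fields are omitted entirely when not provided.
    if language is not None:
        payload["language"] = language
    if model is not None:
        payload["model"] = model
    return urllib.request.Request(
        BASE_URL + SPEECH_PATH,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


if __name__ == "__main__":
    req = build_speech_request("sk_example", "Hello!", "george", model="simba-english")
    print(req.get_method(), req.full_url)
```

Passing the returned object to urllib.request.urlopen would send it; that step is left out so the example has no network side effects.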

Response

Synthesized speech audio for the requested input.

audio_data string (format: "byte")

Synthesized speech audio, Base64-encoded
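Because the audio arrives as Base64 text, it has to be decoded before it can be played or saved. A minimal sketch using the standard library follows; the function names and the default file stem are illustrative choices.

```python
import base64
from pathlib import Path


def decode_audio(audio_data: str) -> bytes:
    """Decode the Base64 audio_data field into raw audio bytes."""
    # validate=True rejects characters outside the Base64 alphabet
    # instead of silently skipping them.
    return base64.b64decode(audio_data, validate=True)


def save_audio(audio_data: str, audio_format: str, stem: str = "speech") -> Path:
    """Write the decoded audio to '<stem>.<audio_format>' and return the path."""
    path = Path(f"{stem}.{audio_format}")
    path.write_bytes(decode_audio(audio_data))
    return path


if __name__ == "__main__":
    # Sample value taken from the example response on this page.
    sample = "UklGRiQAAABXQVZFZm10IBAAAAABAAEAQB8AAIA+AAACABAAZGF0YQAAAAA="
    print(len(decode_audio(sample)))
```

Using the audio_format value from the response as the file extension keeps the saved file consistent with what the service actually returned.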

audio_format enum

The format of the audio data

Allowed values:

billable_characters_count long

The number of billable characters processed in the request.

speech_marks object

Annotates the audio data with metadata about the synthesis process, such as word timing or phoneme details.
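One common use of this metadata is highlighting the word being spoken at a given playback position. The sketch below assumes the layout shown in the example response on this page: a chunks list whose items carry start_time, end_time, type, and value. The time unit is not stated here, so the code only compares values as returned. The helper names are hypothetical.

```python
from typing import List, Optional


def chunk_at(speech_marks: dict, t: float) -> Optional[dict]:
    """Return the chunk whose [start_time, end_time) interval contains t, if any."""
    for chunk in speech_marks.get("chunks", []):
        if chunk["start_time"] <= t < chunk["end_time"]:
            return chunk
    return None


def spoken_words(speech_marks: dict) -> List[str]:
    """Return the values of all chunks typed as words, in order."""
    return [c["value"] for c in speech_marks.get("chunks", []) if c.get("type") == "word"]


if __name__ == "__main__":
    marks = {
        "chunks": [
            {"start_time": 0, "end_time": 0.75, "type": "word", "value": "Hello"},
            {"start_time": 0.75, "end_time": 0.85, "type": "punctuation", "value": "!"},
        ]
    }
    hit = chunk_at(marks, 0.5)
    print(hit["value"] if hit else None)
```

A linear scan is enough for a sentence-sized response; for long texts a binary search over start_time would be the natural refinement.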

Errors

400 Bad Request Error

401 Unauthorized Error

402 Payment Required Error

403 Forbidden Error

500 Internal Server Error
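Client code usually needs to decide which of these statuses are worth retrying. The sketch below maps the listed codes to their names and treats only server-side errors as retryable. That retry policy is an assumption made for illustration, not something this page specifies, and the exception class is hypothetical.

```python
# Status codes and names as listed in the Errors section.
ERROR_NAMES = {
    400: "Bad Request",
    401: "Unauthorized",
    402: "Payment Required",
    403: "Forbidden",
    500: "Internal Server Error",
}


class SpeechApiError(Exception):
    """Hypothetical exception type carrying the HTTP status of a failed call."""

    def __init__(self, status: int):
        self.status = status
        super().__init__(f"{status} {describe_error(status)}")


def describe_error(status: int) -> str:
    """Return the documented name for a status code, or a generic label."""
    return ERROR_NAMES.get(status, "Unexpected Error")


def is_retryable(status: int) -> bool:
    """Assumed policy: retry 5xx responses; 4xx responses need a changed request."""
    return 500 <= status <= 599


if __name__ == "__main__":
    err = SpeechApiError(402)
    print(err, is_retryable(err.status))
```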

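from speechify import Speechify

# Authenticate with your API key.
client = Speechify(
    token="YOUR_TOKEN_HERE",
)

# Request the complete audio in a single response. audio_format is passed
# explicitly because the default may change.
response = client.tts.audio.speech(
    input="Hello! This is the Speechify text-to-speech API.",
    voice_id="george",
    audio_format="mp3",
    model="simba-english",
)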

{
  "audio_data": "UklGRiQAAABXQVZFZm10IBAAAAABAAEAQB8AAIA+AAACABAAZGF0YQAAAAA=",
  "audio_format": "mp3",
  "billable_characters_count": 43,
  "speech_marks": {
    "chunks": [
      {
        "end": 5,
        "end_time": 0.75,
        "start": 0,
        "start_time": 0,
        "type": "word",
        "value": "Hello"
      },
      {
        "end": 6,
        "end_time": 0.85,
        "start": 5,
        "start_time": 0.75,
        "type": "punctuation",
        "value": "!"
      },
      {
        "end": 10,
        "end_time": 1.5,
        "start": 7,
        "start_time": 0.9,
        "type": "word",
        "value": "This"
      },
      {
        "end": 12,
        "end_time": 1.8,
        "start": 11,
        "start_time": 1.5,
        "type": "word",
        "value": "is"
      },
      {
        "end": 15,
        "end_time": 2.2,
        "start": 13,
        "start_time": 1.8,
        "type": "word",
        "value": "the"
      },
      {
        "end": 23,
        "end_time": 3,
        "start": 16,
        "start_time": 2.2,
        "type": "word",
        "value": "Speechify"
      },
      {
        "end": 28,
        "end_time": 3.5,
        "start": 24,
        "start_time": 3,
        "type": "word",
        "value": "text-to-speech"
      },
      {
        "end": 32,
        "end_time": 4,
        "start": 29,
        "start_time": 3.5,
        "type": "word",
        "value": "API"
      },
      {
        "end": 33,
        "end_time": 4.1,
        "start": 32,
        "start_time": 4,
        "type": "punctuation",
        "value": "."
      }
    ],
    "end": 33,
    "end_time": 4.1,
    "start": 0,
    "start_time": 0,
    "type": "sentence",
    "value": "Hello! This is the Speechify text-to-speech API."
  }
}