Stream Speech
Synthesize speech and stream the audio back as it is generated, for low-latency playback. The Accept header selects the audio container; the response is raw audio bytes (HTTP chunked). For Base64-encoded audio with speech-mark metadata in a single JSON response, use POST /v1/audio/speech.
Authentication
Send your API key in the Authorization header with the Bearer prefix, e.g. 'Bearer sk_…'.
Headers
The Accept header selects the audio container/codec for the streamed response. The response Content-Type echoes this value, except that audio/pcm returns audio/L16 with rate and channels parameters (raw 16-bit linear PCM, 24 kHz mono, little-endian).
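Because an audio/pcm response carries no container, most players need the sample format supplied separately. The sketch below wraps such a payload in a WAV header using only the format stated above (16-bit, 24 kHz, mono, little-endian). The function names are illustrative and not part of the API.

```python
import io
import wave

# Format stated in the Accept header description for audio/pcm responses.
SAMPLE_RATE_HZ = 24_000
CHANNELS = 1
SAMPLE_WIDTH_BYTES = 2  # 16-bit samples


def pcm_to_wav(pcm: bytes) -> bytes:
    """Wrap raw little-endian 16-bit PCM in a WAV container."""
    if len(pcm) % (SAMPLE_WIDTH_BYTES * CHANNELS) != 0:
        raise ValueError("PCM payload ends in the middle of a sample")
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(CHANNELS)
        wav.setsampwidth(SAMPLE_WIDTH_BYTES)
        wav.setframerate(SAMPLE_RATE_HZ)
        wav.writeframes(pcm)  # WAV PCM data is little-endian, like the payload
    return buf.getvalue()


def pcm_duration_seconds(pcm: bytes) -> float:
    """Playback length implied by the byte count at the stated format."""
    return len(pcm) / (SAMPLE_RATE_HZ * CHANNELS * SAMPLE_WIDTH_BYTES)
```

When playing audio as it arrives, a chunk boundary can fall in the middle of a 2-byte sample, so a client would typically buffer any odd trailing byte until the next chunk.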
Request
Plain text or SSML to synthesize. See https://docs.speechify.ai/docs/api-limits for input size limits. Emotion, pitch, and speed rate are configured in the SSML input; see the SSML documentation for details: https://docs.speechify.ai/docs/ssml#prosody
ID of the voice to use for synthesis. See the /v1/voices endpoint for available voices.
Language of the input, given as an ISO 639-1 language code and an ISO 3166-1 region code separated by a hyphen, e.g. en-US. See the list of supported languages and the recommendations for this parameter: https://docs.speechify.ai/docs/language-support
Model used for audio synthesis. simba-english is optimized for English; simba-multilingual is for non-English or mixed-language input. simba-3.0 is the streaming-native model, with lower time to first byte (TTFB) and richer expressivity. simba-3.0 is currently English only, with multilingual support coming soon; until it ships, non-English voices return 400 with this model.
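As a rough illustration of how the request fields fit together, the sketch below assembles a request with Python's standard library. Several names are assumptions, because this page does not state them: the base URL, the /v1/audio/stream path, the POST method for this endpoint, and the JSON field names input, voice_id, language and model. Check the endpoint reference for the real values.

```python
import json
import re
import urllib.request

# ASSUMPTIONS: placeholder base URL and path; neither is given on this page.
BASE_URL = "https://api.example.invalid"
STREAM_PATH = "/v1/audio/stream"

# Two-letter language code, hyphen, two-letter region code (e.g. en-US).
LANGUAGE_TAG = re.compile(r"^[a-z]{2}-[A-Z]{2}$")


def pick_model(language: str, prefer_streaming_native: bool = False) -> str:
    """Choose a model from the guidance in the model description.

    Simplification: the page ties the simba-3.0 restriction to the voice,
    while this helper only looks at the language tag.
    """
    if language.startswith("en-"):
        return "simba-3.0" if prefer_streaming_native else "simba-english"
    return "simba-multilingual"


def build_stream_request(api_key: str, text: str, voice_id: str,
                         language: str, accept: str = "audio/pcm",
                         prefer_streaming_native: bool = False
                         ) -> urllib.request.Request:
    """Build (but do not send) a streaming synthesis request."""
    if not LANGUAGE_TAG.match(language):
        raise ValueError(f"expected a tag like en-US, got {language!r}")
    body = {
        # ASSUMPTION: all four field names are hypothetical.
        "input": text,
        "voice_id": voice_id,
        "language": language,
        "model": pick_model(language, prefer_streaming_native),
    }
    return urllib.request.Request(
        url=BASE_URL + STREAM_PATH,
        data=json.dumps(body).encode("utf-8"),
        method="POST",  # ASSUMPTION: not stated for this endpoint
        headers={
            "Authorization": f"Bearer {api_key}",
            "Accept": accept,
            "Content-Type": "application/json",
        },
    )
```

A caller could pass the result to urllib.request.urlopen and read the response in fixed-size pieces (for example resp.read(4096) in a loop), handing each piece to the player as soon as it arrives.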
Response headers
Response
Streamed audio. The Content-Type matches the Accept header except
for audio/pcm, which returns audio/L16 with rate and channels
parameters (see the Accept header description).
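A client that accepts several formats may want to inspect the response Content-Type before decoding. The sketch below parses the media type and the rate and channels parameters with Python's standard library. The sample header value audio/L16;rate=24000;channels=1 is an assumed spelling based on the description above, not a value copied from a real response.

```python
from email.message import Message
from typing import Optional, Tuple


def parse_audio_content_type(
    header_value: str,
) -> Tuple[str, Optional[int], Optional[int]]:
    """Return (media_type, rate, channels) from a Content-Type value.

    The media type comes back lowercased; rate and channels are None
    when the header does not carry them.
    """
    msg = Message()
    msg["Content-Type"] = header_value
    media_type = msg.get_content_type()
    rate = msg.get_param("rate")
    channels = msg.get_param("channels")
    return (
        media_type,
        int(rate) if rate else None,
        int(channels) if channels else None,
    )
```

With the assumed value above, the function yields the media type audio/l16 with rate 24000 and channels 1; a header without those parameters yields None for both.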