Speech Synthesis Markup Language (SSML)

Control speech synthesis with markup language

SSML is an XML-based markup language for controlling pitch, rate, pauses, emphasis, and emotion in synthesized speech. Wrap your content in a <speak> tag:

<speak>Your content to be synthesized here</speak>

Escaping Characters

Transforming text into SSML requires escaping certain characters to ensure correct interpretation:

Character    Escaped Form
&            &amp;
>            &gt;
<            &lt;
"            &quot;
'            &apos;

<!-- Original: Some "text" with 5 < 6 & 4 > 8 in it -->
<speak>Some &quot;text&quot; with 5 &lt; 6 &amp; 4 &gt; 8 in it</speak>
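
The escaping table above can be applied programmatically. The following is a minimal sketch in Python using the standard library's `xml.sax.saxutils.escape`; the helper name `to_ssml` is hypothetical and is not part of any Speechify SDK.

```python
from xml.sax.saxutils import escape

# Extra entities for the two quote characters; escape() already
# handles &, < and > on its own.
QUOTE_ENTITIES = {'"': "&quot;", "'": "&apos;"}


def to_ssml(text: str) -> str:
    """Escape the five reserved characters and wrap the text in <speak>.

    Hypothetical helper for illustration only.
    """
    return f"<speak>{escape(text, QUOTE_ENTITIES)}</speak>"


print(to_ssml('Some "text" with 5 < 6 & 4 > 8 in it'))
```

Escape plain text only once, and before adding any SSML tags, since escaping a string that already contains tags would turn the tags themselves into literal text.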

Supported SSML Tags

The prosody tag controls the expressiveness of synthesized speech by manipulating pitch, rate, and volume.

<speak>
  This is a normal speech pattern.
  <prosody pitch="high" rate="fast" volume="+20%">
    I'm speaking with a higher pitch, faster than usual, and louder!
  </prosody>
  Back to normal speech pattern.
</speak>

Parameters

pitch
string

Adjusts the pitch of speech delivery.

Values:

  • x-low, low, medium (default), high, x-high
  • Percentage adjustments: -83% to +100% (e.g., +20%, -30%)
rate
string

Alters speech speed.

Values:

  • x-slow, slow, medium (default), fast, x-fast
  • Percentage adjustments: -50% to +9900% (e.g., +20%, -30%)
volume
string

Controls speech loudness.

Values:

  • silent, x-soft, medium (default), loud, x-loud
  • Decibel adjustments: Number with dB suffix (e.g., -6dB)
  • Percentage adjustments (e.g., +20%, -30%)
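
The pitch and rate ranges listed above can be checked on the client before a request is sent. This is a rough sketch in Python; the function names are hypothetical, and it assumes percentage values always carry an explicit sign (as in the examples `+20%` and `-30%`), which the documentation does not state outright.

```python
import re

# Named values taken from the parameter lists above.
PITCH_NAMES = {"x-low", "low", "medium", "high", "x-high"}
RATE_NAMES = {"x-slow", "slow", "medium", "fast", "x-fast"}

# Assumption: a percentage is a signed number followed by "%".
PERCENT = re.compile(r"^[+-]\d+(?:\.\d+)?%$")


def percent_in_range(value: str, low: float, high: float) -> bool:
    """Return True if value looks like '+20%' and lies within [low, high]."""
    if not PERCENT.match(value):
        return False
    return low <= float(value[:-1]) <= high


def is_valid_pitch(value: str) -> bool:
    # Documented range for pitch: -83% to +100%.
    return value in PITCH_NAMES or percent_in_range(value, -83, 100)


def is_valid_rate(value: str) -> bool:
    # Documented range for rate: -50% to +9900%.
    return value in RATE_NAMES or percent_in_range(value, -50, 9900)


print(is_valid_pitch("+20%"), is_valid_pitch("-90%"), is_valid_rate("x-fast"))
```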

The break tag controls pausing between words, following the W3C SSML specification.

<speak>
  Sometimes it can be useful to add a longer pause at the end of the sentence.
  <break strength="medium" />
  Or <break time="100ms" /> sometimes in the <break time="1s" /> middle.
</speak>

Parameters

strength
string

Specifies pause strength.

Values:

  • none: 0ms
  • x-weak: 250ms
  • weak: 500ms
  • medium: 750ms
  • strong: 1000ms
  • x-strong: 1250ms
time
string

Specifies pause duration (0-10 seconds).

Values:

  • Milliseconds: ms suffix (e.g., 100ms)
  • Seconds: s suffix (e.g., 1s)
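
To make the two parameters concrete, the sketch below maps each strength value to its documented duration and converts a time string to milliseconds, rejecting anything outside the 0-10 second range. It is written in Python, the names `STRENGTH_MS` and `parse_break_time` are hypothetical, and the service itself may validate differently.

```python
# Pause length in milliseconds for each strength value listed above.
STRENGTH_MS = {
    "none": 0,
    "x-weak": 250,
    "weak": 500,
    "medium": 750,
    "strong": 1000,
    "x-strong": 1250,
}


def parse_break_time(value: str) -> int:
    """Convert a time value such as '100ms' or '1s' to milliseconds.

    Raises ValueError if the value is malformed or outside 0-10 seconds.
    """
    if value.endswith("ms"):          # check "ms" first, since it also ends in "s"
        ms = float(value[:-2])
    elif value.endswith("s"):
        ms = float(value[:-1]) * 1000
    else:
        raise ValueError(f"missing ms or s suffix: {value!r}")
    if not 0 <= ms <= 10_000:
        raise ValueError(f"break time out of range: {value!r}")
    return int(ms)


print(parse_break_time("100ms"), parse_break_time("1s"), STRENGTH_MS["medium"])
```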

The emphasis tag adds or removes emphasis from text, modifying speech similarly to prosody but without setting individual attributes.

<speak>
  I already told you I <emphasis level="strong">really like</emphasis> that person.
</speak>

Parameters

level
string

Specifies emphasis level.

Values:

  • reduced
  • moderate
  • strong

The sub tag replaces pronunciation for contained text, following the W3C SSML specification.

<speak>
  For detailed information, please read the <sub alias="Frequently Asked Questions">FAQ</sub> section.
</speak>

Parameters

alias
string (required)

Specifies text to be spoken instead of enclosed text.
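
Because the alias is an attribute value, it needs the same escaping as element text. The following Python sketch builds a sub element with the standard library's `quoteattr` and `escape`; the helper name `sub` is hypothetical and used only for illustration.

```python
from xml.sax.saxutils import escape, quoteattr

# Quote entities so element text follows the full escaping table.
QUOTE_ENTITIES = {'"': "&quot;", "'": "&apos;"}


def sub(text: str, alias: str) -> str:
    """Build a <sub> element that speaks `alias` in place of `text`."""
    # quoteattr() escapes the value and adds the surrounding quotes.
    return f"<sub alias={quoteattr(alias)}>{escape(text, QUOTE_ENTITIES)}</sub>"


print(sub("FAQ", "Frequently Asked Questions"))
```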

The speechify:style tag controls emotion of the voice. See Emotion Control for the full list of 13 supported emotions and best practices.

<speak>
  <speechify:style emotion="cheerful">Great news! Your order shipped!</speechify:style>
</speak>

Parameters

emotion
string

Sets the voice emotion. Values: angry, cheerful, sad, terrified, relaxed, fearful, surprised, calm, assertive, energetic, warm, direct, bright.
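
One way to catch a misspelled emotion early is to check it against the list above before building the markup. This is a small Python sketch under that assumption; the function name `styled` is hypothetical, and the set simply mirrors the 13 values listed in this section.

```python
from xml.sax.saxutils import escape

# The 13 emotion values listed above.
EMOTIONS = {
    "angry", "cheerful", "sad", "terrified", "relaxed", "fearful",
    "surprised", "calm", "assertive", "energetic", "warm", "direct", "bright",
}


def styled(text: str, emotion: str) -> str:
    """Wrap escaped text in a speechify:style element inside <speak>."""
    if emotion not in EMOTIONS:
        raise ValueError(f"unsupported emotion: {emotion!r}")
    body = escape(text, {'"': "&quot;", "'": "&apos;"})
    return (
        f'<speak><speechify:style emotion="{emotion}">'
        f"{body}</speechify:style></speak>"
    )


print(styled("Great news! Your order shipped!", "cheerful"))
```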

Examples

<speak>Welcome to Speechify's Text-to-Speech service.</speak>