Speech marks

Learn how speech marks map text to audio timing for synchronization features.

Overview

Speech marks are returned with every synthesis request and provide a mapping between time and text. They inform the client when each word is spoken in the audio, enabling features like:

  • Text highlighting during playback
  • Precise audio seeking by text position
  • Usage tracking and analytics
  • Synchronization between text and audio

Data structure

Speech marks are described by the following TypeScript types:

type Chunk = {
  start_time: number // Time in milliseconds when this chunk starts in the audio
  end_time: number // Time in milliseconds when this chunk ends in the audio
  start: number // Character index where this chunk starts in the original text
  end: number // Character index where this chunk ends in the original text
  value: string // The text content of this chunk
}

type NestedChunk = Chunk & {
  chunks: Chunk[] // Array of word-level chunks within this sentence/paragraph
}

Important considerations

  • SSML escaping: Values are derived from the SSML input, so any escaping of &, < and > is reflected in the value, start and end fields. Consider using the string tracker library to help map these indices back to your original text.

  • Index gaps: The start and end values of adjacent words may have gaps between them. When looking for a word at a specific index, check for start being >= yourIndex rather than checking whether the index falls within both the start and end bounds (see the lookup sketch after this list).

  • Timing gaps: Similarly, the start_time and end_time of adjacent words may have gaps between them. Follow the same approach as with index gaps.

  • Initial silence: The start_time of the first word is not necessarily 0, even though the NestedChunk's start_time is. Silence at the beginning of the sentence can delay the point at which the first word starts.

  • Trailing silence: The end_time of the last word does not necessarily match the NestedChunk's end_time. Silence at the end of the audio can make the NestedChunk extend beyond the last word.
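
Putting the index-gap guidance into practice, the sketch below (hypothetical helpers, not part of the API) looks up a word by character index, preferring the word that contains the index and falling back to the first word whose start is >= the index when the index lands in a gap such as a space. It also shows how the result can drive seeking by text position:

// Find the word at a character index, tolerating gaps between words.
function wordAtIndex(chunk: NestedChunk, index: number): Chunk | undefined {
  return (
    // Word whose [start, end) range contains the index, if any...
    chunk.chunks.find((word) => index >= word.start && index < word.end) ??
    // ...otherwise the first word starting at or after the index (gap case).
    chunk.chunks.find((word) => word.start >= index)
  )
}

// Seek an audio element to the word at (or after) a character index.
function seekToIndex(audio: HTMLAudioElement, chunk: NestedChunk, index: number): void {
  const word = wordAtIndex(chunk, index)
  if (word) audio.currentTime = word.start_time / 1000 // currentTime is in seconds
}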

Example output

For the input "Hello, welcome to Speechify", the response includes:

const chunk: NestedChunk = {
  start: 0,
  end: 27,
  start_time: 0,
  end_time: 1850,
  value: 'Hello, welcome to Speechify',
  chunks: [
    { start: 0, end: 6, start_time: 125, end_time: 375, value: 'Hello,' },
    { start: 7, end: 14, start_time: 375, end_time: 750, value: 'welcome' },
    { start: 15, end: 17, start_time: 750, end_time: 875, value: 'to' },
    { start: 18, end: 27, start_time: 875, end_time: 1850, value: 'Speechify' },
  ],
}

Note how the start_time of the first word (125 ms) doesn't match the NestedChunk's start_time (0 ms): there is initial silence before speech begins.
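
For instance, with the hypothetical wordAtTime helper sketched earlier, looking up a few playback positions against this chunk gives:

wordAtTime(chunk, 800) // the 'to' chunk, since 750 <= 800 < 875
wordAtTime(chunk, 50)  // undefined: still in the initial silence before 'Hello,'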