
Text-to-Speech API

Version History

| Version | Date | Changes |
| --- | --- | --- |
| v1.2 | 2024-08-23 | 1. Added the API documentation for the MaaS ASpeech/OSpeech model |
| v1.1 | 2024-07-31 | 1. Modified the request path; 2. ElevenLabs Text-to-Speech now returns request-id in the response header |
| v1.0 | 2024-07-29 | Initial release |

MaaS ASpeech/OSpeech

Request Method:

POST

Request Path:

{endpoint}/audio/speech

Request Header

| Parameter | Description | Example |
| --- | --- | --- |
| Authorization | AccessKey, sent as Bearer ${AccessKey} | Bearer xxxxxx |

Request Body

| Parameter | Description | Example |
| --- | --- | --- |
| model | The model to use | tts-1 |
| input | The text to convert to audio | how are you |
| voice | The voice used for audio generation. Possible values: "alloy", "echo", "fable", "onyx", "nova", "shimmer" | alloy |
| response_format | The audio format. Possible values: "mp3", "opus", "aac", "flac", "wav", "pcm" | mp3 |
| speed | The audio speed, in the range 0.25 to 4.0 | 1.0 |

Response Body

File stream

Sample Request

MaaS ASpeech

curl --location '{endpoint}/audio/speech' \
--header 'Authorization: Bearer xxxx' \
--header 'Content-Type: application/json' \
--data '{
    "input":"<speak version='\''1.0'\'' xml:lang='\''en-US'\''><voice xml:lang='\''en-US'\'' xml:gender='\''Female'\'' name='\''en-US-AvaMultilingualNeural'\''>my voice is my passport verify me</voice></speak>",
    "response_format":"audio-16khz-128kbitrate-mono-mp3"
}'
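
The shell quoting in the ASpeech sample (`'\''`) is easy to get wrong, so it can help to build the SSML string in code. The following is a minimal Python sketch, not part of the official API: the helper names `build_ssml` and `build_aspeech_body` are hypothetical, and the element and attribute layout simply mirrors the sample request above.

```python
import json
from xml.sax.saxutils import escape


def _attr(value):
    # Escape a value for use inside a single-quoted XML attribute.
    return escape(value, {"'": "&apos;"})


def build_ssml(text, voice_name, lang="en-US", gender="Female"):
    """Build an SSML document shaped like the one in the sample request."""
    return (
        f"<speak version='1.0' xml:lang='{_attr(lang)}'>"
        f"<voice xml:lang='{_attr(lang)}' xml:gender='{_attr(gender)}' "
        f"name='{_attr(voice_name)}'>"
        f"{escape(text)}"
        "</voice></speak>"
    )


def build_aspeech_body(ssml, response_format="audio-16khz-128kbitrate-mono-mp3"):
    """Serialize the JSON request body; the default format is the one used in the sample."""
    return json.dumps({"input": ssml, "response_format": response_format})


ssml = build_ssml("my voice is my passport verify me", "en-US-AvaMultilingualNeural")
body = build_aspeech_body(ssml)
```

Escaping the text means characters such as `<` and `&` in user input do not break the SSML markup.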

MaaS OSpeech

curl --location '{endpoint}/audio/speech' \
--header 'Authorization: Bearer xxxx' \
--header 'Content-Type: application/json' \
--data '{
    "input":"hi,what is your name?",
    "voice":"alloy",
    "speed":1.0,
    "response_format":"mp3"
}'
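
For callers who prefer a script to curl, the OSpeech request can be sketched in Python as follows. This is an illustration under assumptions rather than an official client: the function names are hypothetical, the client-side checks simply mirror the value lists in the Request Body table, and `model` is sent only when supplied because the sample request omits it.

```python
import json
import urllib.request

VOICES = {"alloy", "echo", "fable", "onyx", "nova", "shimmer"}
FORMATS = {"mp3", "opus", "aac", "flac", "wav", "pcm"}


def build_ospeech_request(endpoint, access_key, text, voice="alloy",
                          speed=1.0, response_format="mp3", model=None):
    """Build (but do not send) a POST request for {endpoint}/audio/speech."""
    if voice not in VOICES:
        raise ValueError(f"unsupported voice: {voice}")
    if response_format not in FORMATS:
        raise ValueError(f"unsupported response_format: {response_format}")
    if not 0.25 <= speed <= 4.0:
        raise ValueError("speed must be between 0.25 and 4.0")
    body = {"input": text, "voice": voice, "speed": speed,
            "response_format": response_format}
    if model is not None:
        body["model"] = model
    return urllib.request.Request(
        url=f"{endpoint.rstrip('/')}/audio/speech",
        data=json.dumps(body).encode("utf-8"),
        headers={"Authorization": f"Bearer {access_key}",
                 "Content-Type": "application/json"},
        method="POST",
    )


def save_speech(request, out_path):
    """Send the request and write the returned file stream to out_path."""
    with urllib.request.urlopen(request) as resp, open(out_path, "wb") as f:
        f.write(resp.read())
```

A call such as `save_speech(build_ospeech_request(endpoint, key, "hi,what is your name?"), "out.mp3")` would then write the audio to disk, assuming `endpoint` and `key` hold real values.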

Get MaaS ASpeech Regional Language List

Request Method:

GET

Request Path:

{endpointPath}/cognitiveservices/voices/list

Request Header

| Parameter | Description | Example |
| --- | --- | --- |
| Authorization | AccessKey, sent as Bearer ${AccessKey} | Bearer xxxxxx |

Return Parameters

Object Array

Object Parameters:

| Parameter | Description | Example |
| --- | --- | --- |
| Name | Voice full name | Microsoft Server Speech Text to Speech Voice (af-ZA, AdriNeural) |
| DisplayName | Display name | Adri |
| LocalName | Local name | Adri |
| ShortName | Short name | af-ZA-AdriNeural |
| Gender | Voice gender | Female |
| Locale | Locale | af-ZA |
| LocaleName | Locale language name | Afrikaans (South Africa) |
| SampleRateHertz | Sampling rate | 48000 |
| VoiceType | Voice type | Neural |
| Status | Status | GA |
| WordsPerMinute | Words per minute | 147 |

Request Example

curl --location '{endpointPath}/cognitiveservices/voices/list' \
--header 'Authorization: Bearer xxxx'
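
Once the object array has been parsed, picking a voice for a given locale is a simple filter. The Python sketch below assumes the response body is JSON that has already been decoded into a list of dictionaries; the helper name `voices_for_locale` is hypothetical, and the second sample entry is a made-up placeholder that only demonstrates filtering.

```python
def voices_for_locale(voices, locale, gender=None):
    """Return the ShortName of every voice matching the locale (and gender, if given)."""
    return [
        v["ShortName"]
        for v in voices
        if v.get("Locale") == locale
        and (gender is None or v.get("Gender") == gender)
    ]


sample = [
    # Values taken from the Example column of the table above.
    {"ShortName": "af-ZA-AdriNeural", "Gender": "Female", "Locale": "af-ZA"},
    # Hypothetical placeholder entry, not a real voice.
    {"ShortName": "xx-XX-PlaceholderNeural", "Gender": "Male", "Locale": "xx-XX"},
]

print(voices_for_locale(sample, "af-ZA"))  # ['af-ZA-AdriNeural']
```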

MaaS-nar shortTextToSpeech

Request Method:

POST

Request Path:

/tts-n/text-to-speech/{responseFormat}

Path Variables

| Parameter | Description | Example |
| --- | --- | --- |
| responseFormat | Audio format: mp3/m4a | mp3 |

Query Parameters

| Parameter | Description | Example |
| --- | --- | --- |
| voice | Voice | Yifei |
| voice-speed | Speech speed. Possible values: fast, normal, slow, or a number between 0.3 and 2 | normal |
| voice-volume | Volume. Possible values: x-loud, loud, standard, soft, x-soft, normalized | loud |

Request Headers

| Parameter | Description | Example |
| --- | --- | --- |
| Accept | Fixed value: application/octet-stream | application/octet-stream |
| Authorization | AccessKey, sent as Bearer ${AccessKey} | Bearer RWYhq1NsLPAMmieux0Gd |
| Content-Type | Possible values: text/plain, application/x-www-form-urlencoded, text/vtt, application/x-subrip, text/srt | text/plain |

The request body format and the Content-Type in the header need to correspond.

Request body

Each example pairs a Content-Type with a matching request body.

  • Sending a UTF-8 string in the body:

    curl --location '${endpointPath}/tts-n/text-to-speech/mp3' \
    --header 'Accept: application/octet-stream' \
    --header 'Authorization: Bearer ${AccessKey}' \
    --header 'Content-Type: text/plain' \
    --data 'hello'

  • Sending a URL-encoded string in the body:

    curl --location '${endpointPath}/tts-n/text-to-speech/mp3' \
    --header 'Accept: application/octet-stream' \
    --header 'Authorization: Bearer ${AccessKey}' \
    --header 'Content-Type: application/x-www-form-urlencoded' \
    --data '%E4%BD%A0%E5%A5%BD%E5%95%8A'

  • Sending a UTF-8 text file in the body:

    curl --location '${endpointPath}/tts-n/text-to-speech/mp3' \
    --header 'Accept: application/octet-stream' \
    --header 'Authorization: Bearer ${AccessKey}' \
    --header 'Content-Type: text/plain' \
    --data-binary '@test.txt'

  • Sending a VTT file in the body:

    curl --location '${endpointPath}/tts-n/text-to-speech/mp3' \
    --header 'Accept: application/octet-stream' \
    --header 'Authorization: Bearer ${AccessKey}' \
    --header 'Content-Type: text/vtt' \
    --data-binary '@sing-song_2024-07-29_103928.vtt'

  • Sending an SRT file in the body (Method 1):

    curl --location '${endpointPath}/tts-n/text-to-speech/mp3' \
    --header 'Accept: application/octet-stream' \
    --header 'Authorization: Bearer ${AccessKey}' \
    --header 'Content-Type: application/x-subrip' \
    --data-binary '@sing-song_2024-07-29_103928.srt'

  • Sending an SRT file in the body (Method 2):

```
curl --location '${endpointPath}/tts-n/text-to-speech/mp3' \
--header 'Accept: application/octet-stream' \
--header 'Authorization: Bearer ${AccessKey}' \
--header 'Content-Type: text/srt' \
--data-binary '@sing-song_2024-07-29_103928.srt'
```
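
The curl examples above differ only in the Content-Type header and the body, so the pairing can be derived from the input file. The Python sketch below is one way to do that, under stated assumptions: the helper names and the extension-to-Content-Type mapping are illustrative choices, not part of the API, and the speed and volume checks mirror the Query Parameters table.

```python
import os
from urllib.parse import urlencode

# Illustrative mapping; text/srt is listed as an alternative for .srt files.
CONTENT_TYPES = {
    ".txt": "text/plain",
    ".vtt": "text/vtt",
    ".srt": "application/x-subrip",
}
NAMED_SPEEDS = {"fast", "normal", "slow"}
VOLUMES = {"x-loud", "loud", "standard", "soft", "x-soft", "normalized"}


def content_type_for(path):
    """Pick the Content-Type that matches a body file, based on its extension."""
    ext = os.path.splitext(path)[1].lower()
    if ext not in CONTENT_TYPES:
        raise ValueError(f"no Content-Type known for extension {ext!r}")
    return CONTENT_TYPES[ext]


def build_tts_url(endpoint_path, response_format, voice=None, speed=None, volume=None):
    """Assemble the request URL with the optional query parameters."""
    if speed is not None and str(speed) not in NAMED_SPEEDS:
        # float() raises ValueError for anything that is not a number.
        if not 0.3 <= float(speed) <= 2:
            raise ValueError("voice-speed must be fast, normal, slow, or 0.3 to 2")
    if volume is not None and volume not in VOLUMES:
        raise ValueError(f"unsupported voice-volume: {volume}")
    params = {}
    if voice is not None:
        params["voice"] = voice
    if speed is not None:
        params["voice-speed"] = speed
    if volume is not None:
        params["voice-volume"] = volume
    url = f"{endpoint_path.rstrip('/')}/tts-n/text-to-speech/{response_format}"
    query = urlencode(params)
    return f"{url}?{query}" if query else url
```

When sending a file, reading it in binary mode (`open(path, "rb")`) keeps the line breaks that VTT and SRT files depend on.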

Response headers

| Parameter | Description | Example |
| --- | --- | --- |
| x-duration-seconds | Duration of the audio in seconds | 3 |

Response Body

File stream

MaaS-nar longTextToSpeech

API Flow

  1. Call the longTextToSpeech API to obtain the statusUrl.
  2. Poll the statusUrl (recommended interval of 5-10s) to get the task result.
  3. If the task has completed successfully, download the audio file using the URL provided in the result field.

Request Method:

POST

Request Path:

/tts-n/text-to-speech/{responseFormat}

Path Variables

| Parameter | Description | Example |
| --- | --- | --- |
| responseFormat | Audio format: mp3/m4a/wav | wav |

Query Parameters

| Parameter | Description | Example |
| --- | --- | --- |
| voice | Voice option | Yifei |
| voice-speed | Speech speed. Available options: fast, normal, slow, or a number between 0.3 and 2 | normal |
| voice-volume | Volume level. Available options: x-loud, loud, standard, soft, x-soft, or normalized | loud |

Request Headers

| Parameter | Description | Example |
| --- | --- | --- |
| Authorization | AccessKey, sent as Bearer ${AccessKey} | Bearer RWYhq1NsLPAMmieux0Gd |
| Content-Type | Available options: text/plain, application/x-www-form-urlencoded, text/vtt, application/x-subrip, text/srt | text/plain |

Do not set the Accept header.

Request body format must match the Content-Type specified in the header.

Request Body

Each example pairs a Content-Type with a matching request body.

  • Sending a UTF-8 string in the body:

    curl --location '${endpointPath}/tts-n/text-to-speech/mp3' \
    --header 'Authorization: Bearer ${AccessKey}' \
    --header 'Content-Type: text/plain' \
    --data 'hello'

  • Sending a URL-encoded string in the body:

    curl --location '${endpointPath}/tts-n/text-to-speech/mp3' \
    --header 'Authorization: Bearer ${AccessKey}' \
    --header 'Content-Type: application/x-www-form-urlencoded' \
    --data '%E4%BD%A0%E5%A5%BD%E5%95%8A'

  • Sending a UTF-8 text file in the body:

    curl --location '${endpointPath}/tts-n/text-to-speech/mp3' \
    --header 'Authorization: Bearer ${AccessKey}' \
    --header 'Content-Type: text/plain' \
    --data-binary '@test.txt'

  • Sending a VTT file in the body:

    curl --location '${endpointPath}/tts-n/text-to-speech/mp3' \
    --header 'Authorization: Bearer ${AccessKey}' \
    --header 'Content-Type: text/vtt' \
    --data-binary '@sing-song_2024-07-29_103928.vtt'

  • Sending an SRT file in the body (Method 1):

    curl --location '${endpointPath}/tts-n/text-to-speech/mp3' \
    --header 'Authorization: Bearer ${AccessKey}' \
    --header 'Content-Type: application/x-subrip' \
    --data-binary '@sing-song_2024-07-29_103928.srt'

  • Sending an SRT file in the body (Method 2):

    curl --location '${endpointPath}/tts-n/text-to-speech/mp3' \
    --header 'Authorization: Bearer ${AccessKey}' \
    --header 'Content-Type: text/srt' \
    --data-binary '@sing-song_2024-07-29_103928.srt'

Response

| Parameter | Description | Example |
| --- | --- | --- |
| statusUrl | URL for retrieving the task result. It can be called with a GET request and needs no authorization information. Recommended polling interval: 5-10s. | |
| taskId | Task ID | 1 |

Getting the task result through statusUrl

| Parameter | Description | Type |
| --- | --- | --- |
| finished | Whether the task has finished. When true, stop polling. | Boolean |
| percent | Progress of audio generation, from 0 to 100. | Integer |
| succeeded | Whether the audio was generated successfully. | Boolean |
| result | URL for downloading the audio once it has been generated. The URL is valid for 10 minutes. | String |
| message | Reason the audio generation failed. | String |
| durationInSeconds | Duration of the audio in seconds, rounded to the nearest whole second. | Integer |
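
The polling flow for longTextToSpeech can be sketched in Python as follows. This is a minimal illustration under assumptions, not an official client: it assumes the statusUrl response is a JSON object with the fields listed above, and the names `fetch_status`, `wait_for_audio`, and `max_polls` are hypothetical. The fetch and sleep functions are parameters so the loop can be exercised without network access.

```python
import json
import time
import urllib.request


def fetch_status(status_url):
    """GET the statusUrl (no authorization needed) and decode the JSON body."""
    with urllib.request.urlopen(status_url) as resp:
        return json.load(resp)


def wait_for_audio(status_url, fetch=fetch_status, interval=5.0,
                   max_polls=120, sleep=time.sleep):
    """Poll until the task finishes, then return the download URL from `result`."""
    for _ in range(max_polls):
        status = fetch(status_url)
        if status.get("finished"):
            if status.get("succeeded"):
                return status["result"]
            # `message` carries the failure reason when generation fails.
            raise RuntimeError(status.get("message") or "audio generation failed")
        sleep(interval)  # recommended interval is 5-10 seconds
    raise TimeoutError("task did not finish within the polling budget")
```

Because the `result` URL is valid for only 10 minutes, the download should start soon after this function returns.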

MaaS-Ele TextToSpeech

Request Method:

POST

Request Path:

/tts-e/text-to-speech/{voice_id}

Path Variables

| Parameter | Description | Example |
| --- | --- | --- |
| voice_id | Voice ID, see Appendix for details | EXAVITQu4vr4xnSDxMaL |

Query Parameters

| Parameter | Description | Example |
| --- | --- | --- |
| enable_logging | Privacy mode. true (default): non-privacy mode; false: privacy mode | true |
| optimize_streaming_latency | Latency optimization (deprecated parameter). Available values: 1-4 | 1 |
| output_format | Output format. Available values: mp3_22050_32, mp3_44100_32, mp3_44100_64, mp3_44100_96, mp3_44100_128 (default), mp3_44100_192, pcm_16000, pcm_22050, pcm_24000, pcm_44100, ulaw_8000 | mp3_44100_128 |

Request Headers

| Parameter | Description | Example |
| --- | --- | --- |
| Authorization | AccessKey, sent as Bearer ${AccessKey} | Bearer RWYhq1NsLPAMmieux0Gd |

Request Body

| Parameter | Description | Example |
| --- | --- | --- |
| text | Text (required) | how are you |
| model_id | Model ID. Available values: eleven_monolingual_v1 (default), eleven_multilingual_v2, eleven_turbo_v2_5, eleven_turbo_v2, eleven_multilingual_v1 | eleven_monolingual_v1 |
| language_code | Language code; currently supported only by eleven_turbo_v2_5 | ISO 639-1 |
| voice_settings | Voice settings | {"stability":0,"similarity_boost":1.0} |
| voice_settings.stability | Stability | 0 |
| voice_settings.similarity_boost | Similarity boost | 1.0 |
| voice_settings.style | Voice style | 0 |
| voice_settings.use_speaker_boost | Use speaker boost: true (default) or false | true |
| pronunciation_dictionary_locators | object[]. List of pronunciation dictionary locators; up to 3 are supported | "pronunciation_dictionary_locators": [{"pronunciation_dictionary_id": "","version_id": ""}] |
| pronunciation_dictionary_locators.pronunciation_dictionary_id | Pronunciation dictionary ID | 123 |
| pronunciation_dictionary_locators.version_id | Version ID | 123 |
| seed | Seed for deterministic sampling | 123 |
| previous_text | Previous text content | hi |
| next_text | Next text content | how are you |
| previous_request_ids | string[]. List of previous sample request IDs | ["xx","xxx"] |
| next_request_ids | string[]. List of next sample request IDs | ["xx","xxx"] |

Response Headers

| Parameter | Description | Example |
| --- | --- | --- |
| character-cost | Character cost of the request | 333 |
| request-id | Request ID | 12342wqwqe |

Response Body

File stream

Request Example:

curl --location '${endpointPath}/tts-e/text-to-speech/{voice_id}' \
--header 'Authorization: Bearer ${AccessKey}' \
--header 'Content-Type: application/json' \
--data '{
  "text": "hi",
  "voice_settings": {
    "stability": 0,
    "similarity_boost": 1.0
  }
}'
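
The same request can be sketched in Python, including how the two response headers might be read. This is an illustration under assumptions: the function names are hypothetical, `VOICE_IDS` copies only two entries from the Appendix, and optional body fields are sent only when supplied.

```python
import json
import urllib.request
from urllib.parse import urlencode

# Subset of the Appendix table, for illustration.
VOICE_IDS = {"Sarah": "EXAVITQu4vr4xnSDxMaL", "Laura": "FGY2WhTYpPnrIDTdsKH5"}


def build_ele_request(endpoint_path, access_key, voice_id, text,
                      output_format=None, model_id=None, voice_settings=None):
    """Build (but do not send) a POST request for /tts-e/text-to-speech/{voice_id}."""
    body = {"text": text}
    if model_id is not None:
        body["model_id"] = model_id
    if voice_settings is not None:
        body["voice_settings"] = voice_settings
    url = f"{endpoint_path.rstrip('/')}/tts-e/text-to-speech/{voice_id}"
    if output_format is not None:
        url += "?" + urlencode({"output_format": output_format})
    return urllib.request.Request(
        url=url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Authorization": f"Bearer {access_key}",
                 "Content-Type": "application/json"},
        method="POST",
    )


def synthesize(request, out_path):
    """Send the request, save the file stream, and return the two response headers."""
    with urllib.request.urlopen(request) as resp, open(out_path, "wb") as f:
        f.write(resp.read())
        return resp.headers.get("character-cost"), resp.headers.get("request-id")
```

Keeping the returned request-id makes it possible to pass it later in `previous_request_ids` or `next_request_ids`.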

Appendix

  • Available voice_id

| Name | voice_id |
| --- | --- |
| Sarah | EXAVITQu4vr4xnSDxMaL |
| Laura | FGY2WhTYpPnrIDTdsKH5 |
| Charlie | IKne3meq5aSn9XLyUdCD |
| George | JBFqnCBsd6RMkjVDRZzb |
| Callum | N2lVS1w4EtoT3dr4eOWO |
| Liam | TX3LPaxmHKxFdv7VOQHJ |
| Charlotte | XB0fDUnXU5powFXDhCwa |
| Alice | Xb7hH8MSUJpSbSDYk0k2 |
| Matilda | XrExE9yKIg1WjnnlVkGX |
| Will | bIHbv24MWmeRgasZH58o |
| Jessica | cgSgspJ2msm6clMCkdW9 |
| Eric | cjVigY5qzO86Huf0OWal |
| Chris | iP95p4xoKVk53GoZ742B |
| Brian | nPczCjzI2devNBz1zQrb |
| Daniel | onwK4e9ZLuTAKqWW03F9 |
| Lily | pFZP5JQG7iQjIQuC4Bku |
| Bill | pqHfZKP75CvOlQylNhV4 |