MaaS_HL_Speech

T2A v2 (Synchronous Speech Generation)

Request URL

post

https://genaiapi-m2.cloudsway.net/v1/ai/{endpoint}/hailuo/tts/t2a_v2

content-type

application/json

Request Parameters

| Field Name | Type | Required | Description |
| --- | --- | --- | --- |
| text | string | Yes | Text to synthesize, fewer than 10,000 characters. Use line breaks to separate paragraphs. To insert a pause, place `<#x#>` between characters, where x is the pause length in seconds (0.01 to 99.99, at most two decimal places). A pause marker must sit between two pieces of pronounceable text, and consecutive pause markers are not allowed. |
| voice_setting | object | No | Voice parameters such as speed, volume, and pitch. |
| audio_setting | object | No | Sampling rate, bitrate, and audio format of the generated audio. |
| pronunciation_dict | object | No | Pronunciation replacement rules; see the pronunciation_dict parameter table. |
| timbre_weights | array | No | Controls timbre mixing. Either this field or voice_id is required. Each entry pairs a voice_id with a weight in [1, 100]; up to 4 timbres can be mixed. |
| stream | bool | No | Whether to enable streaming. Default: false (streaming disabled). |
| stream_options | object | No | Parameters that control the streaming process. |
| language_boost | string | No | Enhances recognition of the specified minority language or dialect, which improves speech quality in that scenario. Default: null. If the language is not known, choose "auto" and the model determines it automatically. Supported values: 'Chinese', 'Chinese,Yue', 'English', 'Arabic', 'Russian', 'Spanish', 'French', 'Portuguese', 'German', 'Turkish', 'Dutch', 'Ukrainian', 'Vietnamese', 'Indonesian', 'Japanese', 'Italian', 'Korean', 'Thai', 'Polish', 'Romanian', 'Greek', 'Czech', 'Finnish', 'Hindi', 'auto'. |
| subtitle_enable | bool | No | Whether to enable the subtitle service. Default: false. |
| subtitle_type | string | No | Subtitle granularity. Default: sentence. Values: sentence (sentence-level timestamps), word (word-level timestamps), word_streaming (streaming-optimized word-level timestamps, only valid when stream=true). |
| output_format | string | No | Format of the output result. Values: url, hex. Default: hex. Only takes effect for non-streaming requests; streaming requests always return hex. |
| aigc_watermark | bool | No | Whether to add an audio rhythm marker at the end of the synthesized audio. Default: false. Only takes effect for non-streaming synthesis, and only the domestic API interfaces support it. |
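
As a rough illustration of the request fields above, the following Python sketch assembles a non-streaming T2A v2 request. The endpoint name, the API key, and the `Authorization: Bearer` header are assumptions (this page does not describe authentication), and `pause` and `build_t2a_request` are hypothetical helpers, not part of any SDK.

```python
import json
import urllib.request

BASE_URL = "https://genaiapi-m2.cloudsway.net/v1/ai/{endpoint}/hailuo/tts/t2a_v2"


def pause(seconds):
    """Format a pause marker such as <#0.50#> (0.01 to 99.99 seconds)."""
    if not 0.01 <= seconds <= 99.99:
        raise ValueError("pause must be between 0.01 and 99.99 seconds")
    return f"<#{seconds:.2f}#>"


def build_t2a_request(endpoint, api_key, text, voice_id, **options):
    """Return a urllib Request for a synchronous, non-streaming call."""
    if not text or len(text) >= 10000:
        raise ValueError("text must be non-empty and under 10000 characters")
    body = {
        "text": text,
        "voice_setting": {"voice_id": voice_id},
        "stream": False,
    }
    body.update(options)  # e.g. language_boost="auto", output_format="hex"
    return urllib.request.Request(
        BASE_URL.format(endpoint=endpoint),
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Assumption: bearer-token authentication.
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


text = "Hello" + pause(0.5) + "world"
req = build_t2a_request("my-endpoint", "MY_API_KEY", text, "male-qn-qingse",
                        language_boost="auto")
# To send it: with urllib.request.urlopen(req) as resp: result = json.load(resp)
```

Building the request separately from sending it keeps the validation testable without network access.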

voice_setting parameter

| Field Name | Type | Required | Description |
| --- | --- | --- | --- |
| speed | float | No | Speaking rate, range [0.5, 2]. Default: 1.0. Larger values speak faster. |
| vol | float | No | Volume, range (0, 10]. Default: 1.0. Larger values are louder. |
| pitch | int | No | Pitch, range [-12, 12]. Default: 0, which outputs the original timbre. Must be an integer. |
| voice_id | string | Yes | The requested timbre ID. Either this field or timbre_weights is required. Two kinds are supported: system timbre IDs and replicated timbre IDs. The system timbre IDs are: Green Youth Timbre: male-qn-qingse; Elite Youth Timbre: male-qn-jingying; Overbearing Youth Timbre: male-qn-badao; Youth College Student Timbre: male-qn-daxuesheng; Maiden Timbre: female-shaonv; Yujie voice: female-yujie; Mature female voice: female-chengshu; Sweet female voice: female-tianmei; Male host: presenter_male; Female host: presenter_female; Male audiobook 1: audiobook_male_1; Male audiobook 2: audiobook_male_2; Female audiobook 1: audiobook_female_1; Female audiobook 2: audiobook_female_2; Green Youth Voice - beta: male-qn-qingse-jingpin; Elite Youth Voice - beta: male-qn-jingying-jingpin; Overbearing Youth Voice - beta: male-qn-badao-jingpin; Youthful College Student Voice - beta: male-qn-daxuesheng-jingpin; Teenage Girl Voice - beta: female-shaonv-jingpin; Elder Sister Voice - beta: female-yujie-jingpin; Mature Woman Voice - beta: female-chengshu-jingpin; Sweet Woman Voice - beta: female-tianmei-jingpin; Clever Boy: clever_boy; Cute Boy: cute_boy; Lovely Girl: lovely_girl; Cartoon Pig Xiaoqi: cartoon_pig; Bingjiao Didi: bingjiao_didi; Handsome Boyfriend: junlang_nanyou; Innocent Schoolmate: chunzhen_xuedi; Cold Senior: lengdan_xiongzhang; Overbearing Young Master: badao_shaoye; Sweetheart Xiaoling: tianxin_xiaoling; Playful Cute Girl: qiaopi_mengmei; Charming Cougar: wumei_yujie; Cute Freshman: diadia_xuemei; Elegant Senior: danya_xuejie; Santa Claus: Santa_Claus; Grinch: Grinch; Rudolph: Rudolph; Arnold: Arnold; Charming Santa: Charming_Santa; Charming Lady: Charming_Lady; Sweet Girl: Sweet_Girl; Cute Elf: Cute_Elf; Attractive Girl: Attractive_Girl; Serene Woman: Serene_Woman. |
| emotion | string | No | Emotion of the synthesized speech. Seven values are currently supported: "happy", "sad", "angry", "fearful", "disgusted", "surprised", "neutral". |
| latex_read | bool | No | Whether to read LaTeX formulas aloud. Default: false. Note: 1. Formulas in the request must be wrapped in `$$` at the beginning and end; 2. Any `\` inside a formula must be escaped as `\\`. Example: The basic formula for derivatives is `$$\\frac{d}{dx}(x^n) = nx^{n-1}$$` |
| ~~english_normalization~~ | ~~bool~~ | ~~No~~ | ~~Enables English text normalization, which improves how numbers are read aloud but slightly increases latency. Default: false.~~ |
| text_normalization | bool | No | Whether to enable Chinese and English text normalization. Enabling it improves how numbers are read aloud but slightly increases latency. Default: false. |
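
The ranges in this table can be checked on the client before a request is sent. The following is a minimal sketch; `make_voice_setting` is a hypothetical helper, and the server remains the authority on what it accepts.

```python
EMOTIONS = {"happy", "sad", "angry", "fearful", "disgusted", "surprised", "neutral"}


def make_voice_setting(voice_id, speed=1.0, vol=1.0, pitch=0, emotion=None):
    """Build a voice_setting object, enforcing the documented ranges."""
    if not 0.5 <= speed <= 2:
        raise ValueError("speed must be in [0.5, 2]")
    if not 0 < vol <= 10:
        raise ValueError("vol must be in (0, 10]")
    if not isinstance(pitch, int) or not -12 <= pitch <= 12:
        raise ValueError("pitch must be an integer in [-12, 12]")
    setting = {"voice_id": voice_id, "speed": speed, "vol": vol, "pitch": pitch}
    if emotion is not None:
        if emotion not in EMOTIONS:
            raise ValueError(f"unsupported emotion: {emotion}")
        setting["emotion"] = emotion
    return setting


voice_setting = make_voice_setting("female-shaonv", speed=1.2, emotion="happy")
```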

audio_setting parameter

| Field Name | Type | Required | Description |
| --- | --- | --- | --- |
| sample_rate | int | No | Sampling rate. Values: 8000, 16000, 22050, 24000, 32000, 44100. Default: 32000. |
| bitrate | int | No | Bitrate. Values: 32000, 64000, 128000, 256000. Default: 128000. Only valid for mp3 audio. |
| format | string | No | Format of the generated audio. Values: mp3, pcm, flac, wav. Default: mp3. |
| channel | int | No | Number of channels: 1 (mono) or 2 (stereo). Default: 2. |
| force_cbr | bool | No | Constant bitrate (CBR) control, true or false. When true, the audio is encoded at a constant bitrate. Note: this parameter only takes effect when the output is streamed and the audio format is mp3. |

pronunciation_dict parameter

| Field Name | Type | Required | Description |
| --- | --- | --- | --- |
| tone | list of strings | No | Text or symbols that need special handling, each paired with its replacement pronunciation (adjusting tones or substituting the pronunciation of other characters). Format: ["Yan Shaofei/(yan4)(shao3)(fei1)", "Dafei/(da2)(fei1)", "omg/oh my god"]. Tones are written as numbers: 1 for the first tone (flat), 2 for the second tone (rising), 3 for the third tone (falling-rising), 4 for the fourth tone (falling), and 5 for the neutral tone. |
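
To make the entry format concrete, here is a small sketch that builds `tone` entries of the form `text/replacement`. The helper names `pinyin` and `tone_entry` are hypothetical and exist only for illustration.

```python
def pinyin(*syllables):
    """Turn (syllable, tone) pairs into '(da2)(fei1)'; tone is 1 to 5."""
    parts = []
    for syllable, tone in syllables:
        if tone not in (1, 2, 3, 4, 5):
            raise ValueError("tone must be an integer from 1 to 5")
        parts.append(f"({syllable}{tone})")
    return "".join(parts)


def tone_entry(text, replacement):
    """Join the source text and its replacement pronunciation with '/'."""
    return f"{text}/{replacement}"


pronunciation_dict = {
    "tone": [
        tone_entry("Dafei", pinyin(("da", 2), ("fei", 1))),
        tone_entry("omg", "oh my god"),
    ]
}
```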

timbre_weights parameter

| Field Name | Type | Required | Description |
| --- | --- | --- | --- |
| voice_id | string | No | The requested timbre ID. Must be provided together with weight. |
| weight | int | No | Integer weight in the range [1, 100]. Must be provided together with voice_id. Up to 4 timbres can be mixed; the larger a timbre's share of the total weight, the more the synthesized timbre resembles it. |
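
A short sketch of how a timbre mix could be assembled under the limits above (at most 4 entries, integer weights in [1, 100]). `make_timbre_weights` is a hypothetical helper.

```python
def make_timbre_weights(mix):
    """mix maps voice_id to an integer weight; returns the timbre_weights array."""
    if not 1 <= len(mix) <= 4:
        raise ValueError("between 1 and 4 timbres can be mixed")
    weights = []
    for voice_id, weight in mix.items():
        if not isinstance(weight, int) or not 1 <= weight <= 100:
            raise ValueError(f"weight for {voice_id} must be an integer in [1, 100]")
        weights.append({"voice_id": voice_id, "weight": weight})
    return weights


timbre_weights = make_timbre_weights({"female-shaonv": 70, "female-tianmei": 30})
```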

stream_options parameter

| Field Name | Type | Required | Description |
| --- | --- | --- | --- |
| exclude_aggregated_audio | bool | No | Whether to exclude the concatenated audio hex data from the last chunk. Default: false, meaning the last chunk contains the complete concatenated audio hex data. |

Response Parameters

| Field Name | Type | Description |
| --- | --- | --- |
| data | object | The returned data object. It may be null, so check for null before use. |
| trace_id | string | ID of this session, used to help locate issues when asking for support or giving feedback. |
| extra_info | object | Additional information about the result. |
| base_resp | object | Error status code and details if the request fails. |

data parameter

| Field Name | Type | Description |
| --- | --- | --- |
| audio | string | The synthesized audio segment, hex-encoded, in the format requested in the input (mp3/pcm/flac). |
| subtitle_file | string | Download link for the subtitles of the synthesized audio. Timestamps are accurate to the sentence (no more than 50 characters per sentence), in milliseconds, in JSON format. |
| status | int | Current audio stream status: 1 means synthesizing, 2 means synthesis completed. |
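
Since `audio` is a hex string and `data` may be null, a client needs a small decoding step before it can write a playable file. The sketch below assumes a non-streaming response with `output_format` left at hex; `audio_bytes` is a hypothetical helper.

```python
def audio_bytes(response):
    """Decode the hex-encoded audio of a non-streaming response into bytes."""
    data = response.get("data")
    if not data or not data.get("audio"):
        raise ValueError("response contains no audio data")
    return bytes.fromhex(data["audio"])


# Hand-written sample response, not real API output.
sample = {"data": {"audio": "494433", "status": 2}}
raw = audio_bytes(sample)
# To save it: open("speech.mp3", "wb").write(raw)
```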

extra_info parameter

| Field Name | Type | Description |
| --- | --- | --- |
| audio_length | long | Audio duration, in milliseconds. |
| audio_sample_rate | long | Sampling rate. |
| audio_size | long | Audio size, in bytes. |
| bitrate | long | Bitrate. |
| audio_format | string | Format of the generated audio file. Valid values: mp3/pcm/flac. |
| audio_channel | long | Number of audio channels generated. 1: mono, 2: stereo. |
| invisible_character_ratio | double | Proportion of illegal characters. If it does not exceed 10% (inclusive), the audio is generated normally and the proportion is returned, so the maximum returned value is 0.1 (10%). If it exceeds 10%, an error is reported. |
| usage_characters | long | Number of billable characters in this speech synthesis. |
| word_count | long | Number of pronounced characters, including Chinese characters, numbers, and letters, and excluding punctuation marks. |

base_resp parameter

| Field Name | Type | Description |
| --- | --- | --- |
| status_code | int64 | Status code. 1000: Unknown error; 1001: Timeout; 1002: Rate limiting triggered; 1004: Authentication failed; 1039: TPM rate limiting triggered; 1042: Illegal characters exceed 10%; 2013: Input format information is abnormal. |
| status_msg | string | Status details. |
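
One way to surface these codes in client code is to convert a failing `base_resp` into an exception. This is a sketch under two assumptions that the table above does not state: that a `status_code` of 0 means success (as documented for the file upload interface), and that a missing `base_resp` can be treated as success. `T2AError` and `check_base_resp` are hypothetical names.

```python
STATUS_MESSAGES = {
    1000: "Unknown error",
    1001: "Timeout",
    1002: "Rate limiting triggered",
    1004: "Authentication failed",
    1039: "TPM rate limiting triggered",
    1042: "Illegal characters exceed 10%",
    2013: "Input format information is abnormal",
}


class T2AError(Exception):
    """Raised when base_resp reports a non-zero status code."""

    def __init__(self, code, message):
        super().__init__(f"[{code}] {message}")
        self.code = code
        self.message = message


def check_base_resp(response):
    """Return the response unchanged, or raise T2AError on failure."""
    base = response.get("base_resp") or {}
    code = base.get("status_code", 0)  # Assumption: 0 (or absent) means success.
    if code != 0:
        fallback = STATUS_MESSAGES.get(code, "Unrecognized status code")
        raise T2AError(code, base.get("status_msg") or fallback)
    return response
```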

T2A Large v2 (Asynchronous Ultra-long Text Speech Generation)

File Upload

Request URL

post

https://genaiapi-m2.cloudsway.net/v1/ai/{endpoint}/hailuo/file/upload

content-type

multipart/form-data

Request Parameters

| Field Name | Type | Required | Description |
| --- | --- | --- | --- |
| purpose | string | Yes | Purpose of the file. Value and supported formats: t2a_async_input, a file used when creating a speech generation task; txt and zip documents are supported. |
| file | file | Yes | The file to upload. |

Response Parameters

| Field Name | Type | Description |
| --- | --- | --- |
| file | object | File information object containing the details of the file. |
| base_resp | object | Basic response information object containing the status code and status message. |

file parameter

| Field Name | Type | Description |
| --- | --- | --- |
| file_id | int64 | Unique identifier of the file. |
| bytes | int | File size, in bytes. |
| created_at | int64 | Creation time of the file, as a Unix timestamp. |
| filename | string | File name. |
| purpose | string | Purpose of the file; currently t2a_async_input. |

base_resp parameter

| Field Name | Type | Description |
| --- | --- | --- |
| status_code | int64 | Status code. 0 indicates success. |
| status_msg | string | Status message. "success" indicates that the request succeeded. |
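
The upload request carries the `purpose` and `file` fields as form data. The sketch below builds such a body with only the standard library; `encode_multipart` is a hypothetical helper, and the use of `application/octet-stream` for the file part is an assumption.

```python
import uuid


def encode_multipart(fields, file_field, filename, file_bytes):
    """Return (body, content_type) for a multipart/form-data upload."""
    boundary = uuid.uuid4().hex
    lines = []
    for name, value in fields.items():
        lines += [
            f"--{boundary}".encode(),
            f'Content-Disposition: form-data; name="{name}"'.encode(),
            b"",
            value.encode("utf-8"),
        ]
    lines += [
        f"--{boundary}".encode(),
        f'Content-Disposition: form-data; name="{file_field}"; '
        f'filename="{filename}"'.encode(),
        b"Content-Type: application/octet-stream",  # assumption
        b"",
        file_bytes,
        f"--{boundary}--".encode(),
        b"",
    ]
    return b"\r\n".join(lines), f"multipart/form-data; boundary={boundary}"


body, content_type = encode_multipart(
    {"purpose": "t2a_async_input"}, "file", "book.txt", b"Hello world"
)
# POST `body` to the upload URL with the Content-Type header set to
# `content_type`; on success, the new ID is in response["file"]["file_id"].
```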

Create an asynchronous task

Request URL

post

https://genaiapi-m2.cloudsway.net/v1/ai/{endpoint}/hailuo/tts/t2a_async_v2

content-type

application/json

Request Parameters

| Field Name | Type | Required | Description |
| --- | --- | --- | --- |
| text | string | No | Text to synthesize, at most 50,000 characters. Either this field or text_file_id is required. |
| text_file_id | long | No | The file_id returned by the file upload interface, identifying the text file to synthesize. Either this field or text is required. Supported formats are txt and zip; the format is verified automatically once the ID is passed in. 1. txt: fewer than 100,000 characters. To insert a pause, place `<#x#>` between characters, where x is the pause length in seconds (0.01 to 99.99, at most two decimal places). A pause marker must sit between two pieces of pronounceable text, and consecutive pause markers are not allowed. 2. zip: the archive must contain only txt files or only json files (all files in one archive share the same format). A json file can have three fields, ["title", "content", "extra"], representing the title, body, and author respectively. Three sets of results are produced, one per field, for a total of 9 files in one folder. If a field is missing or empty, the corresponding files are not generated. |
| voice_setting | object | No | Voice parameters such as speed, volume, and pitch. |
| audio_setting | object | No | Sampling rate, bitrate, and audio format of the generated audio. |
| pronunciation_dict | object | No | Pronunciation replacement rules; see the pronunciation_dict parameter table. |
| language_boost | string | No | Enhances recognition of the specified minority language or dialect, which improves speech quality in that scenario. Default: null. If the language is not known, choose "auto" and the model determines it automatically. Supported values: 'Chinese', 'Chinese,Yue', 'English', 'Arabic', 'Russian', 'Spanish', 'French', 'Portuguese', 'German', 'Turkish', 'Dutch', 'Ukrainian', 'Vietnamese', 'Indonesian', 'Japanese', 'Italian', 'Korean', 'Thai', 'Polish', 'Romanian', 'Greek', 'Czech', 'Finnish', 'Hindi', 'auto'. |
| voice_modify | object | No | Voice effect settings; see the voice_modify parameter table. Supported audio formats: mp3, wav, flac. Other formats such as pcm, pcmu_raw, pcmu_wav, and opus are not supported, and passing them returns a parameter error. |
| aigc_watermark | bool | No | Whether to add an audio rhythm marker at the end of the synthesized audio. Default: false. Only takes effect for non-streaming synthesis, and only the domestic API interfaces support it. |
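
For the zip input described under text_file_id, the following sketch packs a set of json documents into an in-memory archive that can then be passed to the file upload interface. The file names inside the archive and the helper `build_zip` are assumptions made for illustration.

```python
import io
import json
import zipfile

FIELDS = ("title", "content", "extra")


def build_zip(chapters):
    """Pack each chapter dict as one json file; empty fields are left out."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for index, chapter in enumerate(chapters, start=1):
            doc = {key: chapter[key] for key in FIELDS if chapter.get(key)}
            # Hypothetical naming scheme; the source does not prescribe one.
            archive.writestr(f"chapter_{index}.json",
                             json.dumps(doc, ensure_ascii=False))
    return buffer.getvalue()


zip_bytes = build_zip([
    {"title": "Chapter One", "content": "It was a quiet morning.", "extra": "A. Writer"},
    {"title": "Chapter Two", "content": "The rain began at noon.", "extra": ""},
])
```

Keeping every file in the archive in the same format (here, json only) follows the constraint stated in the table.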

voice_setting parameter

| Field Name | Type | Required | Description |
| --- | --- | --- | --- |
| speed | float | No | Speaking rate, range [0.5, 2]. Default: 1.0. Larger values speak faster. |
| vol | float | No | Volume, range (0, 10]. Default: 1.0. Larger values are louder. |
| pitch | int | No | Pitch, range [-12, 12]. Default: 0, which outputs the original timbre. Must be an integer. |
| voice_id | string | Yes | The requested timbre ID. Two kinds are supported: system timbre IDs and replicated timbre IDs. The system timbre IDs are: Green Youth Timbre: male-qn-qingse; Elite Youth Timbre: male-qn-jingying; Overbearing Youth Timbre: male-qn-badao; Youth College Student Timbre: male-qn-daxuesheng; Maiden Timbre: female-shaonv; Yujie voice: female-yujie; Mature female voice: female-chengshu; Sweet female voice: female-tianmei; Male host: presenter_male; Female host: presenter_female; Male audiobook 1: audiobook_male_1; Male audiobook 2: audiobook_male_2; Female audiobook 1: audiobook_female_1; Female audiobook 2: audiobook_female_2; Green Youth Voice - beta: male-qn-qingse-jingpin; Elite Youth Voice - beta: male-qn-jingying-jingpin; Overbearing Youth Voice - beta: male-qn-badao-jingpin; Youthful College Student Voice - beta: male-qn-daxuesheng-jingpin; Teenage Girl Voice - beta: female-shaonv-jingpin; Elder Sister Voice - beta: female-yujie-jingpin; Mature Woman Voice - beta: female-chengshu-jingpin; Sweet Woman Voice - beta: female-tianmei-jingpin; Clever Boy: clever_boy; Cute Boy: cute_boy; Lovely Girl: lovely_girl; Cartoon Pig Xiaoqi: cartoon_pig. |
| emotion | string | No | Emotion of the synthesized speech. Seven values are currently supported: "happy", "sad", "angry", "fearful", "disgusted", "surprised", "neutral". |
| english_normalization | bool | No | Enables English text normalization, which improves how numbers are read aloud but slightly increases latency. Default: false. |

audio_setting parameter

| Field Name | Type | Required | Description |
| --- | --- | --- | --- |
| audio_sample_rate | int | No | Sampling rate. Values: 8000, 16000, 22050, 24000, 32000, 44100. Default: 32000. |
| bitrate | int | No | Bitrate. Values: 32000, 64000, 128000, 256000. Default: 128000. Only valid for mp3 audio. |
| format | string | No | Format of the generated audio. Values: mp3, pcm, flac, wav. Default: mp3. |
| channel | int | No | Number of channels: 1 (mono) or 2 (stereo). Default: 2. |

pronunciation_dict parameter

| Field Name | Type | Required | Description |
| --- | --- | --- | --- |
| tone | list of strings | No | Text or symbols that need special handling, each paired with its replacement pronunciation (adjusting tones or substituting the pronunciation of other characters). Format: ["Yan Shaofei/(yan4)(shao3)(fei1)", "Dafei/(da2)(fei1)", "omg/oh my god"]. Tones are written as numbers: 1 for the first tone (flat), 2 for the second tone (rising), 3 for the third tone (falling-rising), 4 for the fourth tone (falling), and 5 for the neutral tone. |

voice_modify parameter

| Field Name | Type | Required | Description |
| --- | --- | --- | --- |
| pitch | int | No | Pitch adjustment (deep/bright), range [-100, 100]. Values closer to -100 give a deeper pitch; values closer to 100 give a brighter pitch. |
| intensity | int | No | Intensity adjustment (powerful/soft), range [-100, 100]. Values closer to -100 give a more powerful sound; values closer to 100 give a softer sound. |
| timbre | int | No | Timbre adjustment (magnetic/crisp), range [-100, 100]. Values closer to -100 give a more mellow sound; values closer to 100 give a crisper sound. |
| sound_effects | string | No | Sound effect; only one can be selected at a time. Values: spacious_echo (spacious echo), auditorium_echo (auditorium broadcast), lofi_telephone (telephone distortion), robotic (electronic sound). |
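
A client-side check for these fields might look like the sketch below. `make_voice_modify` is a hypothetical helper; it omits any field that is not set, because this table gives no default values.

```python
SOUND_EFFECTS = {"spacious_echo", "auditorium_echo", "lofi_telephone", "robotic"}


def make_voice_modify(pitch=None, intensity=None, timbre=None, sound_effects=None):
    """Build a voice_modify object containing only the fields that were set."""
    modify = {}
    for name, value in (("pitch", pitch), ("intensity", intensity), ("timbre", timbre)):
        if value is None:
            continue
        if not isinstance(value, int) or not -100 <= value <= 100:
            raise ValueError(f"{name} must be an integer in [-100, 100]")
        modify[name] = value
    if sound_effects is not None:
        if sound_effects not in SOUND_EFFECTS:
            raise ValueError(f"unsupported sound effect: {sound_effects}")
        modify["sound_effects"] = sound_effects
    return modify


voice_modify = make_voice_modify(pitch=-20, sound_effects="robotic")
```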

Response Parameters

| Field Name | Type | Description |
| --- | --- | --- |
| taskId | long | Asynchronous task ID. |
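
The request body for this interface takes either inline text or an uploaded file ID. The sketch below reads "either this or the other" as exactly one of the two, which is an interpretation rather than something the tables spell out; `build_async_body` is a hypothetical helper.

```python
def build_async_body(voice_id, text=None, text_file_id=None, **options):
    """Build the JSON body for creating an asynchronous synthesis task."""
    # Interpretation: exactly one of text / text_file_id is supplied.
    if (text is None) == (text_file_id is None):
        raise ValueError("provide exactly one of text or text_file_id")
    if text is not None and len(text) > 50000:
        raise ValueError("text is limited to 50,000 characters")
    body = {"voice_setting": {"voice_id": voice_id}}
    if text is not None:
        body["text"] = text
    else:
        body["text_file_id"] = text_file_id
    body.update(options)  # e.g. audio_setting={"format": "mp3"}
    return body


body = build_async_body("audiobook_male_1", text_file_id=123456,
                        audio_setting={"format": "mp3"})
# After POSTing `body` as JSON, the task ID is read from response["taskId"].
```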

Get the status of an asynchronous task

Request URL

get

https://genaiapi-m2.cloudsway.net/v1/ai/{endpoint}/hailuo/tts/task/{taskId}

Request Parameters

| Field Name | Type | Required | Description |
| --- | --- | --- | --- |
| taskId | string | Yes | The taskId returned by the asynchronous task creation interface. |

Response Parameters

| Field Name | Type | Description |
| --- | --- | --- |
| taskId | string | Unique identifier of the task. |
| status | string | Status of the task, such as Processing, Success, Failed, Expired. |
| fileId | string | Unique identifier of the file. |
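
Because the task runs asynchronously, a client typically polls this interface until the status leaves Processing. The following sketch shows one such loop; the polling interval and timeout are arbitrary choices, and `fetch_status` stands for any caller-supplied function that performs the GET request and returns the parsed JSON.

```python
import time


def wait_for_task(fetch_status, task_id, interval=5.0, timeout=600.0,
                  sleep=time.sleep):
    """Poll until the task succeeds; return its fileId."""
    deadline = time.monotonic() + timeout
    while True:
        info = fetch_status(task_id)
        status = info.get("status")
        if status == "Success":
            return info["fileId"]
        if status in ("Failed", "Expired"):
            raise RuntimeError(f"task {task_id} ended with status {status}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"task {task_id} still {status} after {timeout}s")
        sleep(interval)


# Simulated status responses, used here in place of real HTTP calls.
fake_responses = iter([
    {"taskId": "1", "status": "Processing"},
    {"taskId": "1", "status": "Success", "fileId": "f9"},
])
file_id = wait_for_task(lambda task_id: next(fake_responses), "1",
                        sleep=lambda seconds: None)
```

Passing the fetch and sleep functions in as arguments keeps the loop independent of any particular HTTP client.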

File Download

Request URL

get

https://genaiapi-m2.cloudsway.net/v1/ai/{endpoint}/hailuo/files/retrieve?taskId=19153409980342...&fileId=261877976...

Request Parameters

| Field Name | Type | Required | Description |
| --- | --- | --- | --- |
| taskId | string | Yes | The taskId returned by the asynchronous task status interface. |
| fileId | string | Yes | The fileId returned by the asynchronous task status interface. |

Response Parameters

| Field Name | Type | Description |
| --- | --- | --- |
| fileId | string | Unique identifier of the file. |
| bytes | int | File size, in bytes. |
| createdAt | int64 | Creation time of the file, as a Unix timestamp. |
| filename | string | File name, including the extension. |
| purpose | string | Purpose of the file, such as t2a_async. |
| mediaUrl | string | Download address of the file: a complete URL containing the access permission signature. |
| expireTime | int64 | Expiration time of the file, as a Unix timestamp: the time at which the file's download link expires. |
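
A sketch of the last step: building the retrieve URL from the two IDs and checking whether the returned link is still usable. It assumes `expireTime` is expressed in seconds, which the table does not specify (the value could be in milliseconds); `retrieve_url` and `link_is_valid` are hypothetical helpers.

```python
import time
from urllib.parse import urlencode

RETRIEVE_URL = "https://genaiapi-m2.cloudsway.net/v1/ai/{endpoint}/hailuo/files/retrieve"


def retrieve_url(endpoint, task_id, file_id):
    """Build the file download URL with its two query parameters."""
    query = urlencode({"taskId": task_id, "fileId": file_id})
    return RETRIEVE_URL.format(endpoint=endpoint) + "?" + query


def link_is_valid(file_info, now=None):
    """True while mediaUrl has not expired (assumes expireTime in seconds)."""
    now = time.time() if now is None else now
    return now < file_info["expireTime"]


url = retrieve_url("my-endpoint", "123", "456")
```

If the link has expired, calling the retrieve interface again should return a fresh `mediaUrl`, although this page does not state that explicitly.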