MaaS_HL_Speech

T2A v2 (Synchronous Speech Generation)

Request URL

POST

https://genaiapi-m2.cloudsway.net/v1/ai/{endpoint}/hailuo/tts/t2a_v2

content-type

application/json

Request Parameters

Field Name	Type	Required or Not	Description
text	string	Yes	Text to synthesize; must be shorter than 10,000 characters. Use line breaks in place of paragraph breaks. To insert a pause, add <#x#> between characters, where x is the pause length in seconds (0.01 to 99.99, at most two decimal places). A pause marker must sit between two pieces of pronounceable text, and consecutive pause markers are not allowed.
voice_setting	object	No	Control parameters such as the speed, volume, and intonation of the generated voice.
audio_setting	object	No	Controls the sampling rate, bit rate, and audio format of the generated sound.
pronunciation_dict	object	No	Pronunciation replacement rules for specific text or symbols. See the pronunciation_dict parameter table.
timbre_weights	array	No	Timbre mixing settings. Either timbre_weights or voice_id must be provided. Each entry pairs a voice_id with a weight in the range [1, 100]; up to 4 timbres can be mixed.
stream	bool	No	Whether to enable streaming. Default is false, i.e., streaming is not enabled.
stream_options	object	No	Control the relevant parameters of the streaming process.
language_boost	string	No	Enhances recognition of the specified minority language or dialect, which improves speech quality in that scenario. Default is null. If the language is not known in advance, set "auto" and the model detects it automatically. Supported values: 'Chinese', 'Chinese,Yue', 'English', 'Arabic', 'Russian', 'Spanish', 'French', 'Portuguese', 'German', 'Turkish', 'Dutch', 'Ukrainian', 'Vietnamese', 'Indonesian', 'Japanese', 'Italian', 'Korean', 'Thai', 'Polish', 'Romanian', 'Greek', 'Czech', 'Finnish', 'Hindi', 'auto'
subtitle_enable	bool	No	A switch that controls whether to enable the subtitle service. The default value is false.
subtitle_type	string	No	Subtitle granularity. Default is sentence. Available values: sentence (sentence-level timestamps); word (word-level timestamps); word_streaming (streaming-optimized word-level timestamps, only valid when stream=true).
output_format	string	No	Format of the output result. Valid values are url and hex; the default is hex. This parameter only takes effect in non-streaming scenarios; streaming responses always return hex.
aigc_watermark	bool	No	Controls whether to add an audio rhythm marker at the end of the synthesized audio, with a default value of False. This parameter only takes effect for non-streaming synthesis. Only domestic API interfaces support this parameter.
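
The pause-marker rules for the text field can be sketched as a small helper. This is a minimal illustration, not part of the API: the function names pause, join_with_pauses, and has_consecutive_pauses are hypothetical, and the checks only mirror the limits stated in the table above.

```python
import re


def pause(seconds):
    """Return a <#x#> pause marker for a duration given in seconds."""
    if not 0.01 <= seconds <= 99.99:
        raise ValueError("pause must be within 0.01 to 99.99 seconds")
    # The table allows at most two decimal places.
    return f"<#{seconds:.2f}#>"


def join_with_pauses(parts, seconds):
    """Join non-empty text parts with exactly one pause marker between them."""
    spoken = [p for p in parts if p.strip()]
    return pause(seconds).join(spoken)


def has_consecutive_pauses(text):
    """Detect two markers with nothing but whitespace between them."""
    return re.search(r"#>\s*<#", text) is not None


print(join_with_pauses(["Hello", "world"], 1.5))
```

Dropping empty parts before joining keeps every marker between two pieces of pronounceable text, which is the constraint the table describes.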

voice_setting parameter

Field Name	Type	Required or Not	Description
speed	Float	No	The speaking rate range is [0.5, 2], with a default value of 1.0. The larger the value, the faster the speaking rate.
vol	Float	No	Volume range (0, 10], default value is 1.0, the larger the value, the higher the volume.
pitch	int	No	The pitch range is [-12, 12], with a default value of 0, where 0 indicates the original timbre output, and the value must be an integer.
voice_id	string	Yes	The requested timbre ID. Either voice_id or timbre_weights must be provided. Two kinds of ID are supported: system timbre IDs and replicated timbre IDs. The system timbre IDs are: Green Youth Timbre: male-qn-qingse; Elite Youth Timbre: male-qn-jingying; Overbearing Youth Timbre: male-qn-badao; Youth College Student Timbre: male-qn-daxuesheng; Maiden Timbre: female-shaonv; Yujie voice: female-yujie; Mature female voice: female-chengshu; Sweet female voice: female-tianmei; Male host: presenter_male; Female host: presenter_female; Male audiobook 1: audiobook_male_1; Male audiobook 2: audiobook_male_2; Female audiobook 1: audiobook_female_1; Female audiobook 2: audiobook_female_2; Green Youth Voice - beta: male-qn-qingse-jingpin; Elite Youth Voice - beta: male-qn-jingying-jingpin; Overbearing Youth Voice - beta: male-qn-badao-jingpin; Youthful College Student Voice - beta: male-qn-daxuesheng-jingpin; Teenage Girl Voice - beta: female-shaonv-jingpin; Elder Sister Voice - beta: female-yujie-jingpin; Mature Woman Voice - beta: female-chengshu-jingpin; Sweet Woman Voice - beta: female-tianmei-jingpin; Clever Boy: clever_boy; Cute Boy: cute_boy; Lovely Girl: lovely_girl; Cartoon Pig Xiaoqi: cartoon_pig; Bingjiao Didi: bingjiao_didi; Handsome Boyfriend: junlang_nanyou; Innocent Schoolmate: chunzhen_xuedi; Cold Senior: lengdan_xiongzhang; Overbearing Young Master: badao_shaoye; Sweetheart Xiaoling: tianxin_xiaoling; Playful Cute Girl: qiaopi_mengmei; Charming Cougar: wumei_yujie; Cute Freshman: diadia_xuemei; Elegant Senior: danya_xuejie; Santa Claus: Santa_Claus; Grinch: Grinch; Rudolph: Rudolph; Arnold: Arnold; Charming Santa: Charming_Santa; Charming Lady: Charming_Lady; Sweet Girl: Sweet_Girl; Cute Elf: Cute_Elf; Attractive Girl: Attractive_Girl; Serene Woman: Serene_Woman
emotion	string	No	Controls the emotion of the synthesized speech. Seven values are currently supported: "happy", "sad", "angry", "fearful", "disgusted", "surprised", "neutral".
latex_read	bool	No	Whether to read LaTeX formulas aloud. Default is false. Notes: 1. Each formula in the request must be wrapped in $$ at the beginning and end; 2. Any "\" in a formula must be escaped as "\\". Example: The basic formula for derivatives is $$\\frac{d}{dx}(x^n) = nx^{n-1}$$
~~english_normalization~~	~~bool~~	~~No ~~	~~This parameter supports English Text Normalization, which can improve the performance of digital reading scenarios but slightly increase latency. If not provided, the default value is false. ~~
text_normalization	bool	No	Whether to enable Chinese and English text normalization. Enabling it improves how numbers are read aloud but slightly increases latency. The default value is false.
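
The ranges in the voice_setting table can be checked on the client before a request is sent. The following is a minimal sketch under that assumption; build_voice_setting is a hypothetical helper, not an SDK function, and it only encodes the limits listed above.

```python
EMOTIONS = {"happy", "sad", "angry", "fearful", "disgusted", "surprised", "neutral"}


def build_voice_setting(voice_id, speed=1.0, vol=1.0, pitch=0, emotion=None):
    """Build a voice_setting object, rejecting values outside the documented ranges."""
    if not 0.5 <= speed <= 2:
        raise ValueError("speed must be in [0.5, 2]")
    if not 0 < vol <= 10:
        raise ValueError("vol must be in (0, 10]")
    if not isinstance(pitch, int) or not -12 <= pitch <= 12:
        raise ValueError("pitch must be an integer in [-12, 12]")
    if emotion is not None and emotion not in EMOTIONS:
        raise ValueError(f"unsupported emotion: {emotion}")

    setting = {"voice_id": voice_id, "speed": speed, "vol": vol, "pitch": pitch}
    if emotion is not None:
        # emotion is optional, so it is only sent when explicitly chosen.
        setting["emotion"] = emotion
    return setting


print(build_voice_setting("male-qn-qingse", speed=1.2, emotion="happy"))
```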

audio_setting parameter

Field Name	Type	Required or Not	Description
sample_rate	int	No	Sampling rate. Valid values: 8000, 16000, 22050, 24000, 32000, 44100. Default is 32000.
bitrate	int	No	Bitrate. Valid values: 32000, 64000, 128000, 256000. Default is 128000. Only applies to mp3 audio.
format	string	No	Format of the generated audio. Valid values: mp3, pcm, flac, wav. Default is mp3.
channel	int	No	Number of channels, default value is 2 (stereo), optional values are 1 (mono) or 2 (stereo).
force_cbr	bool	No	Constant bitrate (CBR) control; accepts true or false. When set to true, audio is encoded at a constant bitrate. Note: this parameter only takes effect for streaming output in mp3 format.

pronunciation_dict parameter

Field Name	Type	Required or Not	Description
tone	list	No	Replacement rules for text or symbols that need a specific pronunciation, either by adjusting tones or by substituting another pronunciation. Format: ["Yan Shaofei/(yan4)(shao3)(fei1)", "Dafei/(da2)(fei1)", "omg/oh my god"]. Tones are written as digits: 1 for the first tone (flat), 2 for the second tone (rising), 3 for the third tone (falling-rising), 4 for the fourth tone (falling), and 5 for the neutral tone.
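
Entries for the tone list follow a "text/replacement" pattern, which can be assembled as below. This is a sketch only; pinyin and tone_entry are hypothetical helper names, and the sample words are the ones used in the table above.

```python
def pinyin(*syllables):
    """Format (syllable, tone) pairs as "(yan4)(shao3)(fei1)"; tones run 1 to 5."""
    chunks = []
    for syllable, tone in syllables:
        if tone not in (1, 2, 3, 4, 5):
            raise ValueError("tone must be 1-4, or 5 for the neutral tone")
        chunks.append(f"({syllable}{tone})")
    return "".join(chunks)


def tone_entry(text, replacement):
    """Build one "text/replacement" entry for pronunciation_dict.tone."""
    return f"{text}/{replacement}"


pronunciation_dict = {
    "tone": [
        tone_entry("Dafei", pinyin(("da", 2), ("fei", 1))),
        tone_entry("omg", "oh my god"),
    ]
}
print(pronunciation_dict)
```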

timbre_weights parameter

Field Name	Type	Required or Not	Description
voice_id	string	No	The requested timbre ID. Must be provided together with weight.
weight	int	No	Integer weight in the range [1, 100]. Must be provided together with voice_id. Up to 4 timbres can be mixed; the higher a timbre's weight, the more the synthesized voice resembles it.
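
A timbre mix is an array of voice_id and weight pairs. The sketch below builds that array and enforces the limits from the table; build_timbre_weights is a hypothetical helper, and the voice IDs are taken from the system timbre list.

```python
def build_timbre_weights(mix):
    """Turn {voice_id: weight} into the timbre_weights array (1 to 4 entries)."""
    if not 1 <= len(mix) <= 4:
        raise ValueError("between 1 and 4 timbres can be mixed")
    entries = []
    for voice_id, weight in mix.items():
        if not isinstance(weight, int) or not 1 <= weight <= 100:
            raise ValueError("weight must be an integer in [1, 100]")
        entries.append({"voice_id": voice_id, "weight": weight})
    return entries


print(build_timbre_weights({"male-qn-qingse": 70, "presenter_male": 30}))
```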

stream_options parameter

Field Name	Type	Required or Not	Description
exclude_aggregated_audio	bool	No	Whether to omit the concatenated audio hex data from the last chunk. The default value is false, meaning the last chunk contains the complete concatenated audio hex data.
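
Putting the request tables together, a synchronous call might be prepared as follows. This is a sketch under stated assumptions: the Authorization header and its Bearer scheme are hypothetical because this document does not describe authentication, and build_t2a_request is an illustrative name. The request object is built but not sent.

```python
import json
import urllib.request

T2A_URL = "https://genaiapi-m2.cloudsway.net/v1/ai/{endpoint}/hailuo/tts/t2a_v2"


def build_t2a_request(endpoint, api_key, text, voice_id, stream=False):
    """Prepare a POST request for T2A v2 without sending it."""
    if not text or len(text) >= 10000:
        raise ValueError("text must be non-empty and shorter than 10000 characters")
    payload = {
        "text": text,
        "voice_setting": {"voice_id": voice_id},
        "audio_setting": {"format": "mp3", "sample_rate": 32000},
        "stream": stream,
    }
    headers = {
        "Content-Type": "application/json",
        # Assumption: the auth scheme is not specified in this document.
        "Authorization": f"Bearer {api_key}",
    }
    return urllib.request.Request(
        T2A_URL.format(endpoint=endpoint),
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )


req = build_t2a_request("my-endpoint", "MY_KEY", "Hello world", "male-qn-qingse")
print(req.get_method(), req.full_url)
```

Passing the prepared request to urllib.request.urlopen would perform the call; it is left out here so the sketch runs without network access.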

Response Parameters

Field Name	Parameter Type	Description
data	object	The returned data object. It may be null, so check for null before use.
trace_id	string	The ID of this session, used to help locate issues during consultation/feedback.
extra_info	object	Related additional information.
base_resp	object	If the request fails, the corresponding error status code and details.

data parameter

Field Name	Type	Description
audio	string	The synthesized audio segment is hex-encoded and generated according to the format defined in the input (mp3/pcm/flac).
subtitle_file	string	Download link for the subtitle file that matches the audio. Timestamps are per sentence (at most 50 characters each) and in milliseconds; the file is in JSON format.
status	int	Current audio stream status, 1 indicates synthesizing, 2 indicates synthesis completed.

extra_info parameter

Field Name	Type	Description
audio_length	long	Audio duration, accurate to milliseconds.
audio_sample_rate	long	Sampling rate.
audio_size	long	Audio size, in bytes.
bitrate	long	Bitrate.
audio_format	string	The format of the generated audio file, with valid values being mp3/pcm/flac.
audio_channel	long	Number of audio channels generated, 1: Mono, 2: Stereo.
invisible_character_ratio	double	Proportion of illegal characters. If the proportion of illegal characters does not exceed 10% (inclusive), the audio will be generated normally and the proportion of illegal characters will be returned; the maximum is 0.1 (10%), and an error will be reported if exceeded.
usage_characters	long	Billable character count, the number of billable characters generated in this speech synthesis.
word_count	long	Count of pronounced characters, including Chinese characters, numbers, and letters, excluding punctuation marks

base_resp parameter

Field Name	Type	Description
status_code	int64	Status code. 1000: Unknown error; 1001: Timeout; 1002: Triggered rate limiting; 1004: Authentication failed; 1039: Triggered TPM rate limiting; 1042: Illegal characters exceed 10%; 2013: Input format information is abnormal.
status_msg	string	Status Details.
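
Handling a non-streaming response involves three checks: the status code, a possibly null data object, and the hex-encoded audio. The sketch below assumes output_format is hex and that a status_code of 0 means success, as stated for the file upload interface; extract_audio is a hypothetical helper.

```python
def extract_audio(response):
    """Return decoded audio bytes from a parsed T2A v2 JSON response."""
    base = response.get("base_resp") or {}
    code = base.get("status_code", 0)
    if code != 0:
        # Codes such as 1002 (rate limiting) or 1004 (authentication) land here.
        raise RuntimeError(f"T2A request failed: {code} {base.get('status_msg')}")
    data = response.get("data")
    if data is None or not data.get("audio"):
        raise RuntimeError(f"no audio returned, trace_id={response.get('trace_id')}")
    return bytes.fromhex(data["audio"])


sample = {
    "data": {"audio": "494433", "status": 2},
    "trace_id": "example-trace",
    "base_resp": {"status_code": 0, "status_msg": "success"},
}
audio_bytes = extract_audio(sample)
print(len(audio_bytes))
```

Including trace_id in the error message makes it easier to report a failed request.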

T2A Large v2 (Asynchronous Ultra-long Text Speech Generation)

File Upload

Request URL

POST

https://genaiapi-m2.cloudsway.net/v1/ai/{endpoint}/hailuo/file/upload

content-type

multipart/form-data

Request Parameters

Field Name	Type	Required or Not	Description
purpose	string	Yes	Purpose of the file. The value and its supported formats are: t2a_async_input: file used when creating a speech generation task; supports txt and zip files.
file	file	Yes	The file to upload.

Response Parameters

Field Name	Type	Description
file	object	File information object, containing detailed information about the file.
base_resp	object	Basic response information object, containing status code and status message.

file parameter

Field Name	Type	Description
file_id	int64	Unique identifier of the file.
bytes	int	File size, in bytes.
created_at	int64	Creation time of the file, in Unix timestamp format.
filename	string	File name.
purpose	string	Purpose of the file; currently t2a_async_input.

base_resp parameter

Field Name	Type	Description
status_code	int64	Status code, 0 indicates success.
status_msg	string	Status message, success indicates that the request was successful.
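
The upload carries two form fields, purpose and file. The sketch below encodes them by hand with the standard library, assuming the endpoint accepts ordinary multipart/form-data; build_upload_body is a hypothetical helper and the per-file Content-Type of application/octet-stream is an assumption.

```python
import uuid


def build_upload_body(filename, content, purpose="t2a_async_input", boundary=None):
    """Return (body, content_type) for a multipart/form-data file upload."""
    boundary = boundary or uuid.uuid4().hex
    delimiter = f"--{boundary}".encode("utf-8")
    lines = [
        delimiter,
        b'Content-Disposition: form-data; name="purpose"',
        b"",
        purpose.encode("utf-8"),
        delimiter,
        f'Content-Disposition: form-data; name="file"; filename="{filename}"'.encode("utf-8"),
        b"Content-Type: application/octet-stream",
        b"",
        content,
        delimiter + b"--",  # closing delimiter
        b"",
    ]
    return b"\r\n".join(lines), f"multipart/form-data; boundary={boundary}"


body, content_type = build_upload_body("chapter1.txt", b"Hello world", boundary="XYZ")
print(content_type)
```

The returned content_type string goes into the Content-Type request header so the server can find the boundary.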

Create an asynchronous task

Request URL

POST

https://genaiapi-m2.cloudsway.net/v1/ai/{endpoint}/hailuo/tts/t2a_async_v2

content-type

application/json

Request Parameters

Field Name	Type	Required or Not	Description
text	string	No	Text to synthesize, up to 50,000 characters. Either text or text_file_id must be provided.
text_file_id	long	No	The file_id returned by the file upload interface, identifying the text file to synthesize. Either text_file_id or text must be provided. Supported formats are txt and zip; the format is verified automatically once the ID is passed in. 1. txt: must be shorter than 100,000 characters. To insert a pause, add <#x#> between characters, where x is the pause length in seconds (0.01 to 99.99, at most two decimal places). A pause marker must sit between two pieces of pronounceable text, and consecutive pause markers are not allowed. 2. zip: the archive must contain only txt files or only json files (all files in one archive share the same format). A json file can have three fields, ["title", "content", "extra"], representing the title, body, and author respectively. One set of results is produced for each of the three fields, for a total of 9 files in one folder. If a field is missing or empty, its files are not generated.
voice_setting	object	No	Control parameters such as the speed, volume, and intonation of the generated voice.
audio_setting	object	No	Controls the sampling rate, bit rate, and audio format of the generated sound.
pronunciation_dict	object	No	Pronunciation replacement rules for specific text or symbols. See the pronunciation_dict parameter table.
language_boost	string	No	Enhances recognition of the specified minority language or dialect, which improves speech quality in that scenario. Default is null. If the language is not known in advance, set "auto" and the model detects it automatically. Supported values: 'Chinese', 'Chinese,Yue', 'English', 'Arabic', 'Russian', 'Spanish', 'French', 'Portuguese', 'German', 'Turkish', 'Dutch', 'Ukrainian', 'Vietnamese', 'Indonesian', 'Japanese', 'Italian', 'Korean', 'Thai', 'Polish', 'Romanian', 'Greek', 'Czech', 'Finnish', 'Hindi', 'auto'
voice_modify	object	No	Voice effect settings (pitch, intensity, timbre, and sound effects). See the voice_modify parameter table. Supported audio formats: mp3, wav, flac. Other formats such as pcm, pcmu_raw, pcmu_wav, and opus are not supported, and passing them returns a parameter error.
aigc_watermark	boolean	No	Controls whether to add an audio rhythm marker at the end of the synthesized audio, with a default value of False. This parameter only takes effect for non-streaming synthesis. Only domestic API interfaces support this parameter.

voice_setting parameter

Field Name	Type	Required or Not	Description
speed	Float	No	The speaking rate range is [0.5, 2], with a default value of 1.0. The larger the value, the faster the speaking rate.
vol	Float	No	Volume range (0, 10], default value is 1.0, the larger the value, the higher the volume.
pitch	int	No	The pitch range is [-12, 12], with a default value of 0, where 0 indicates the original timbre output, and the value must be an integer.
voice_id	string	Yes	The requested timbre ID. Two kinds of ID are supported: system timbre IDs and replicated timbre IDs. The system timbre IDs are: Green Youth Timbre: male-qn-qingse; Elite Youth Timbre: male-qn-jingying; Overbearing Youth Timbre: male-qn-badao; Youth College Student Timbre: male-qn-daxuesheng; Maiden Timbre: female-shaonv; Mature Sister Timbre: female-yujie; Mature female voice: female-chengshu; Sweet female voice: female-tianmei; Male host: presenter_male; Female host: presenter_female; Male audiobook 1: audiobook_male_1; Male audiobook 2: audiobook_male_2; Female audiobook 1: audiobook_female_1; Female audiobook 2: audiobook_female_2; Green Youth Voice - beta: male-qn-qingse-jingpin; Elite Youth Voice - beta: male-qn-jingying-jingpin; Overbearing Youth Voice - beta: male-qn-badao-jingpin; Youthful College Student Voice - beta: male-qn-daxuesheng-jingpin; Teenage Girl Voice - beta: female-shaonv-jingpin; Elder Sister Voice - beta: female-yujie-jingpin; Mature Woman Voice - beta: female-chengshu-jingpin; Sweet Woman Voice - beta: female-tianmei-jingpin; Clever Boy: clever_boy; Cute Boy: cute_boy; Lovely Girl: lovely_girl; Cartoon Pig Xiaoqi: cartoon_pig
emotion	string	No	Controls the emotion of the synthesized speech. Seven values are currently supported: "happy", "sad", "angry", "fearful", "disgusted", "surprised", "neutral".
english_normalization	bool	No	Enables English text normalization, which improves how numbers are read aloud but slightly increases latency. If not provided, the default value is false.
emotion	string	No	Controls the emotion of synthetic speech; currently supports 7 emotions: happy, sad, angry, fearful, disgusted, surprised, neutral; parameter range ["happy", "sad", "angry", "fearful", "disgusted", "surprised", "neutral"]
english_normalization	bool	No	This parameter supports English Text Normalization, which can improve the performance of digital reading scenarios but slightly increase latency. If not provided, the default value is false.

audio_setting parameter

Field Name	Type	Required or Not	Description
audio_sample_rate	int	No	Sampling rate range [8000, 16000, 22050, 24000, 32000, 44100], default value is 32000.
bitrate	int	No	Bitrate range [32000, 64000, 128000, 256000], default value is 128000, only valid for mp3 format audio.
format	string	No	The generated audio format, with a default value of mp3, has an optional range of [mp3, pcm, flac, wav].
channel	int	No	Number of channels, default value is 2 (stereo), optional values are 1 (mono) or 2 (stereo).

pronunciation_dict parameter

Field Name	Type	Required or Not	Description
tone	list	No	Replacement rules for text or symbols that need a specific pronunciation, either by adjusting tones or by substituting another pronunciation. Format: ["Yan Shaofei/(yan4)(shao3)(fei1)", "Dafei/(da2)(fei1)", "omg/oh my god"]. Tones are written as digits: 1 for the first tone (flat), 2 for the second tone (rising), 3 for the third tone (falling-rising), 4 for the fourth tone (falling), and 5 for the neutral tone.

voice_modify parameter

Field Name	Type	Required or Not	Description
pitch	int	No	Pitch adjustment (low/bright), range [-100, 100]. Values closer to -100 give a lower pitch; values closer to 100 give a brighter pitch.
intensity	int	No	Intensity adjustment (powerful/soft), range [-100, 100]. Values closer to -100 give a more powerful sound; values closer to 100 give a softer sound.
timbre	int	No	Timbre adjustment (mellow/crisp), range [-100, 100]. Values closer to -100 give a more mellow sound; values closer to 100 give a crisper sound.
sound_effects	string	No	Sound effect setting; only one can be selected at a time. Available values: spacious_echo (spacious echo), auditorium_echo (auditorium broadcast), lofi_telephone (telephone distortion), robotic (electronic voice).
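
The voice_modify object combines three bounded integers with one optional effect name. A minimal sketch of a client-side builder follows; build_voice_modify is a hypothetical helper, and the default of 0 for each adjustment is an assumption, since the table does not state defaults.

```python
SOUND_EFFECTS = {"spacious_echo", "auditorium_echo", "lofi_telephone", "robotic"}


def build_voice_modify(pitch=0, intensity=0, timbre=0, sound_effects=None):
    """Build a voice_modify object with each adjustment limited to [-100, 100]."""
    modify = {}
    for name, value in (("pitch", pitch), ("intensity", intensity), ("timbre", timbre)):
        if not isinstance(value, int) or not -100 <= value <= 100:
            raise ValueError(f"{name} must be an integer in [-100, 100]")
        modify[name] = value
    if sound_effects is not None:
        # Only one effect can be selected per request.
        if sound_effects not in SOUND_EFFECTS:
            raise ValueError(f"unsupported sound effect: {sound_effects}")
        modify["sound_effects"] = sound_effects
    return modify


print(build_voice_modify(pitch=-20, timbre=35, sound_effects="robotic"))
```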

Response Parameters

Field Name	Parameter Type	Description
taskId	long	Asynchronous Task ID
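
The asynchronous request takes its input from either text or text_file_id. The sketch below reads that rule as "exactly one of the two", which is an assumption about how the service treats a request carrying both; build_async_payload is a hypothetical helper.

```python
def build_async_payload(voice_id, text=None, text_file_id=None, audio_format="mp3"):
    """Build the JSON body for an asynchronous T2A task."""
    if (text is None) == (text_file_id is None):
        raise ValueError("provide exactly one of text or text_file_id")
    payload = {
        "voice_setting": {"voice_id": voice_id},
        "audio_setting": {"format": audio_format},
    }
    if text is not None:
        if len(text) > 50000:
            raise ValueError("text is limited to 50,000 characters")
        payload["text"] = text
    else:
        # file_id comes from the file upload response.
        payload["text_file_id"] = int(text_file_id)
    return payload


print(build_async_payload("audiobook_male_1", text_file_id=123456))
```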

Get the status of an asynchronous task

Request URL

GET

https://genaiapi-m2.cloudsway.net/v1/ai/{endpoint}/hailuo/tts/task/{taskId}

Request Parameters

Field Name	Type	Required or Not	Description
taskId	string	Yes	The taskId returned by the asynchronous task creation interface.

Response Parameters

Field Name	Type	Description
taskId	string	Unique identifier of the task.
status	string	The status of the task, such as Processing, Success, Failed, Expired.
fileId	string	Unique identifier of the file.
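
Because the task runs asynchronously, a client typically polls the status interface until the task leaves Processing. The loop below is a sketch: fetch_status is a caller-supplied function that performs the GET request and returns the parsed JSON, and the interval and timeout values are arbitrary assumptions.

```python
import time


def wait_for_task(fetch_status, task_id, interval=5.0, timeout=600.0, sleep=time.sleep):
    """Poll until the task succeeds, then return its fileId."""
    deadline = time.monotonic() + timeout
    while True:
        result = fetch_status(task_id)
        status = result.get("status")
        if status == "Success":
            return result["fileId"]
        if status in ("Failed", "Expired"):
            raise RuntimeError(f"task {task_id} ended with status {status}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"task {task_id} still {status} after {timeout}s")
        sleep(interval)


# Usage with a stand-in for the real HTTP call:
fake_responses = iter([
    {"taskId": "t1", "status": "Processing"},
    {"taskId": "t1", "status": "Success", "fileId": "f1"},
])
file_id = wait_for_task(lambda task_id: next(fake_responses), "t1", sleep=lambda s: None)
print(file_id)
```

Injecting fetch_status and sleep keeps the loop independent of any HTTP library and makes it easy to exercise without network access.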

File Download

Request URL

GET

https://genaiapi-m2.cloudsway.net/v1/ai/{endpoint}/hailuo/files/retrieve?taskId=19153409980342...&fileId=261877976...

Request Parameters

Field Name	Type	Required or Not	Description
taskId	string	Yes	The taskId returned by the asynchronous task status interface.
fileId	string	Yes	The fileId returned by the asynchronous task status interface.

Response Parameters

Field Name	Type	Description
fileId	string	Unique identifier of the file.
bytes	int	File size, in bytes.
createdAt	int64	Creation time of the file, in Unix timestamp format.
filename	string	File name, including the extension.
purpose	string	The purpose of the file, such as t2a_async.
mediaUrl	string	The download address of the file, a complete URL containing the access permission signature.
expireTime	int64	The expiration time of the file, in Unix timestamp format, indicates the time when the download link of the file will expire.
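
The retrieve call takes taskId and fileId as query parameters and returns a signed mediaUrl that expires. The sketch below builds the query string and checks the expiry; retrieve_url and link_is_valid are hypothetical helpers, and treating expireTime as seconds since the epoch is an assumption, since the table does not state the unit.

```python
import time
from urllib.parse import urlencode

RETRIEVE_URL = "https://genaiapi-m2.cloudsway.net/v1/ai/{endpoint}/hailuo/files/retrieve"


def retrieve_url(endpoint, task_id, file_id):
    """Build the file retrieval URL with URL-encoded query parameters."""
    query = urlencode({"taskId": task_id, "fileId": file_id})
    return f"{RETRIEVE_URL.format(endpoint=endpoint)}?{query}"


def link_is_valid(file_info, now=None):
    """Return True while the signed mediaUrl has not yet expired."""
    now = time.time() if now is None else now
    # Assumption: expireTime is expressed in seconds, not milliseconds.
    return now < file_info["expireTime"]


print(retrieve_url("my-endpoint", "123", "456"))
```

If the link has expired, calling the retrieve interface again is the natural way to obtain a fresh mediaUrl.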