MaaS_HL_Speech
T2A v2 (Synchronous Speech Generation)
Request URL
post
https://genaiapi-m2.cloudsway.net/v1/ai/{endpoint}/hailuo/tts/t2a_v2
content-type
application/json
Request Parameters
| Field Name | Type | Required or Not | Description |
|---|---|---|---|
| text | string | Yes | Text to synthesize, shorter than 10,000 characters. Use line breaks in place of paragraph breaks. To insert a pause, add <#x#> between characters, where x is the pause length in seconds (0.01 to 99.99, at most two decimal places). A pause marker must sit between two pieces of text that can be pronounced, and consecutive pause markers are not allowed. |
| voice_setting | object | No | Control parameters such as the speed, volume, and intonation of the generated voice. |
| audio_setting | object | No | Controls the sampling rate, bit rate, and audio format of the generated sound. |
| pronunciation_dict | object | No | Custom pronunciation rules for specific text or symbols; see the pronunciation_dict parameter table. |
| timbre_weights | array | No | Timbre mixing settings. Either this or voice_id in voice_setting must be provided. Each entry pairs a voice_id with a weight in [1, 100]; up to 4 timbres can be mixed. |
| stream | bool | No | Whether to enable streaming. Default is false, i.e., streaming is not enabled. |
| stream_options | object | No | Control the relevant parameters of the streaming process. |
| language_boost | string | No | Enhances recognition of the specified minority language or dialect, which can improve speech quality in that scenario. The default is null. If the language is not known in advance, set "auto" and the model will determine it automatically. Supported values: 'Chinese', 'Chinese,Yue', 'English', 'Arabic', 'Russian', 'Spanish', 'French', 'Portuguese', 'German', 'Turkish', 'Dutch', 'Ukrainian', 'Vietnamese', 'Indonesian', 'Japanese', 'Italian', 'Korean', 'Thai', 'Polish', 'Romanian', 'Greek', 'Czech', 'Finnish', 'Hindi', 'auto' |
| subtitle_enable | bool | No | A switch that controls whether to enable the subtitle service. The default value is false. |
| subtitle_type | string | No | Subtitle granularity. Available values: sentence (sentence-level timestamps), word (word-level timestamps), word_streaming (streaming-optimized word-level timestamps, only valid when stream=true). The default value is sentence. |
| output_format | string | No | Format of the output result. Allowed values: url, hex. The default value is hex. This parameter only takes effect for non-streaming requests; streaming requests only return the hex format. |
| aigc_watermark | bool | No | Controls whether to add an audio rhythm marker at the end of the synthesized audio, with a default value of False. This parameter only takes effect for non-streaming synthesis. Only domestic API interfaces support this parameter. |
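
The pause syntax for the text field can be generated rather than typed by hand. The following is a minimal sketch, not part of the API: the helper names pause_marker and join_with_pauses are hypothetical, and the length check counts the markers as characters, which is a conservative assumption.

```python
from decimal import Decimal, ROUND_HALF_UP


def pause_marker(seconds: float) -> str:
    """Return a <#x#> pause marker, with x rounded to two decimal places."""
    value = Decimal(str(seconds)).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
    if not (Decimal("0.01") <= value <= Decimal("99.99")):
        raise ValueError("pause must be between 0.01 and 99.99 seconds")
    return f"<#{value}#>"


def join_with_pauses(segments: list[str], seconds: float) -> str:
    """Join pronounceable segments with exactly one marker between each pair."""
    # Dropping empty segments avoids producing two consecutive markers.
    cleaned = [s for s in segments if s.strip()]
    text = pause_marker(seconds).join(cleaned)
    # Assumption: markers count toward the documented limit of <10000 characters.
    if len(text) >= 10000:
        raise ValueError("text must be shorter than 10000 characters")
    return text


print(join_with_pauses(["Hello there.", "Welcome back."], 1.5))
```

Because markers are only ever placed between two non-empty segments, the output cannot start or end with a marker or contain two in a row.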
voice_setting parameter
| Field Name | Type | Required or Not | Description |
|---|---|---|---|
| speed | Float | No | The speaking rate range is [0.5, 2], with a default value of 1.0. The larger the value, the faster the speaking rate. |
| vol | Float | No | Volume range (0, 10], default value is 1.0, the larger the value, the higher the volume. |
| pitch | int | No | The pitch range is [-12, 12], with a default value of 0, where 0 indicates the original timbre output, and the value must be an integer. |
| voice_id | string | Yes | ID of the requested timbre. Either this or timbre_weights must be provided. Two types are supported: system timbres and replicated timbres. The system timbre IDs are: Green Youth Timbre (male-qn-qingse); Elite Youth Timbre (male-qn-jingying); Overbearing Youth Timbre (male-qn-badao); Youth College Student Timbre (male-qn-daxuesheng); Maiden Timbre (female-shaonv); Yujie voice (female-yujie); Mature female voice (female-chengshu); Sweet female voice (female-tianmei); Male host (presenter_male); Female host (presenter_female); Male audiobook 1 (audiobook_male_1); Male audiobook 2 (audiobook_male_2); Female audiobook 1 (audiobook_female_1); Female audiobook 2 (audiobook_female_2); Green Youth Voice - beta (male-qn-qingse-jingpin); Elite Youth Voice - beta (male-qn-jingying-jingpin); Overbearing Youth Voice - beta (male-qn-badao-jingpin); Youthful College Student Voice - beta (male-qn-daxuesheng-jingpin); Teenage Girl Voice - beta (female-shaonv-jingpin); Elder Sister Voice - beta (female-yujie-jingpin); Mature Woman Voice - beta (female-chengshu-jingpin); Sweet Woman Voice - beta (female-tianmei-jingpin); Clever Boy (clever_boy); Cute Boy (cute_boy); Lovely Girl (lovely_girl); Cartoon Pig Xiaoqi (cartoon_pig); Bingjiao Didi (bingjiao_didi); Handsome Boyfriend (junlang_nanyou); Innocent Schoolmate (chunzhen_xuedi); Cold Senior (lengdan_xiongzhang); Overbearing Young Master (badao_shaoye); Sweetheart Xiaoling (tianxin_xiaoling); Playful Cute Girl (qiaopi_mengmei); Charming Cougar (wumei_yujie); Cute Freshman (diadia_xuemei); Elegant Senior (danya_xuejie); Santa Claus (Santa_Claus); Grinch (Grinch); Rudolph (Rudolph); Arnold (Arnold); Charming Santa (Charming_Santa); Charming Lady (Charming_Lady); Sweet Girl (Sweet_Girl); Cute Elf (Cute_Elf); Attractive Girl (Attractive_Girl); Serene Woman (Serene_Woman). |
| emotion | string | No | Emotion of the synthesized speech. Seven values are currently supported: "happy", "sad", "angry", "fearful", "disgusted", "surprised", "neutral". |
| latex_read | bool | No | Whether to read LaTeX formulas aloud. The default value is false. Please note: 1. Formulas in the request must have $$ added at the beginning and end; 2. Any "\" in a formula must be escaped as "\\". Example: The basic formula for derivatives is $$\\frac{d}{dx}(x^n) = nx^{n-1}$$ |
| ~~english_normalization~~ | ~~bool~~ | ~~No~~ | ~~This parameter supports English Text Normalization, which can improve the performance of digital reading scenarios but slightly increases latency. If not provided, the default value is false.~~ |
| text_normalization | bool | No | Whether to enable Chinese and English text normalization. Enabling it can improve how numbers are read aloud, but slightly increases latency. The default value is false. |
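
The ranges in the voice_setting table can be checked on the client before a request is sent. The sketch below is only an illustration under that assumption; build_voice_setting is a hypothetical helper, and the service itself remains the authority on what it accepts.

```python
EMOTIONS = {"happy", "sad", "angry", "fearful", "disgusted", "surprised", "neutral"}


def build_voice_setting(voice_id, speed=1.0, vol=1.0, pitch=0, emotion=None):
    """Validate values against the documented ranges and return a voice_setting dict."""
    if not 0.5 <= speed <= 2:
        raise ValueError("speed must be in [0.5, 2]")
    if not 0 < vol <= 10:
        raise ValueError("vol must be in (0, 10]")
    if not isinstance(pitch, int) or not -12 <= pitch <= 12:
        raise ValueError("pitch must be an integer in [-12, 12]")
    setting = {"voice_id": voice_id, "speed": speed, "vol": vol, "pitch": pitch}
    if emotion is not None:
        if emotion not in EMOTIONS:
            raise ValueError(f"unsupported emotion: {emotion}")
        # emotion is optional, so it is only included when explicitly set.
        setting["emotion"] = emotion
    return setting


print(build_voice_setting("male-qn-qingse", speed=1.2, emotion="happy"))
```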
audio_setting parameter
| Field Name | Type | Required or Not | Description |
|---|---|---|---|
| sample_rate | int | No | Sampling rate range [8000, 16000, 22050, 24000, 32000, 44100], default value is 32000. |
| bitrate | int | No | Bitrate range [32000, 64000, 128000, 256000], default value is 128000, only valid for mp3 format audio. |
| format | string | No | Format of the generated audio. Allowed values: mp3, pcm, flac, wav. The default value is mp3. |
| channel | int | No | Number of channels, default value is 2 (stereo), optional values are 1 (mono) or 2 (stereo). |
| force_cbr | bool | No | Constant bitrate (CBR) control; allowed values are true and false. When set to true, the audio is encoded at a constant bitrate. Note: This parameter only takes effect when the audio is set to streaming output and the audio format is mp3. |
pronunciation_dict parameter
| Field Name | Type | Required or Not | Description |
|---|---|---|---|
| tone | list | No | Pronunciation overrides for text or symbols that need special marking, such as adjusting tones or substituting the pronunciation of other characters. Each entry has the form "text/replacement", for example: ["Yan Shaofei/(yan4)(shao3)(fei1)", "Dafei/(da2)(fei1)", "omg/oh my god"]. Tones are represented by numbers: 1 for the first tone (flat), 2 for the second tone (rising), 3 for the third tone (falling-rising), 4 for the fourth tone (falling), and 5 for the neutral tone. |
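
Entries for tone follow a simple "text/replacement" pattern, so they can be assembled from structured data. This is a sketch only; tone_entry and pinyin are hypothetical helper names, and the sample entries are taken from the format shown in the table.

```python
def tone_entry(text: str, replacement: str) -> str:
    """Format one pronunciation rule as "text/replacement"."""
    if not text or not replacement:
        raise ValueError("both the text and its replacement are required")
    return f"{text}/{replacement}"


def pinyin(*syllables: tuple[str, int]) -> str:
    """Render (syllable, tone) pairs as "(da2)(fei1)"; tone is 1-4, or 5 for neutral."""
    parts = []
    for syllable, tone in syllables:
        if tone not in range(1, 6):
            raise ValueError("tone must be 1-4, or 5 for the neutral tone")
        parts.append(f"({syllable}{tone})")
    return "".join(parts)


pronunciation_dict = {
    "tone": [
        tone_entry("Dafei", pinyin(("da", 2), ("fei", 1))),
        tone_entry("omg", "oh my god"),
    ]
}
print(pronunciation_dict)
```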
timbre_weights parameter
| Field Name | Type | Required or Not | Description |
|---|---|---|---|
| voice_id | string | No | The requested timbre ID. Must be filled in synchronously with the weight parameter. |
| weight | int | No | Integer weight in the range [1, 100]. Must be provided together with voice_id. Up to 4 timbres can be mixed; the higher a timbre's share of the total weight, the more the synthesized timbre resembles it. |
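
A timbre mix is an array of voice_id and weight pairs. The sketch below builds one from a plain mapping and enforces the limits stated in the table; build_timbre_weights is a hypothetical helper, not an API function.

```python
def build_timbre_weights(mix: dict[str, int]) -> list[dict]:
    """Turn {voice_id: weight} into the timbre_weights array."""
    if not 1 <= len(mix) <= 4:
        raise ValueError("between 1 and 4 timbres can be mixed")
    entries = []
    for voice_id, weight in mix.items():
        if not isinstance(weight, int) or not 1 <= weight <= 100:
            raise ValueError(f"weight for {voice_id} must be an integer in [1, 100]")
        entries.append({"voice_id": voice_id, "weight": weight})
    return entries


print(build_timbre_weights({"male-qn-qingse": 70, "presenter_male": 30}))
```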
stream_options parameter
| Field Name | Type | Required or Not | Description |
|---|---|---|---|
| exclude_aggregated_audio | bool | No | Sets whether the last chunk contains the concatenated voice hex data. The default value is False, meaning that the last chunk contains the complete concatenated voice hex data |
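
Putting the request parameters together, a non-streaming call might be prepared as in the sketch below. Several things here are assumptions rather than documented facts: the Authorization header with a bearer token (the authentication scheme is not described in this section), the endpoint value my-endpoint, and the helper name build_t2a_request. The request is only constructed, not sent.

```python
import json
import urllib.request

BASE_URL = "https://genaiapi-m2.cloudsway.net/v1/ai/{endpoint}/hailuo/tts/t2a_v2"


def build_t2a_request(endpoint: str, api_key: str, text: str, voice_id: str) -> urllib.request.Request:
    """Build (but do not send) a non-streaming T2A v2 request."""
    body = {
        "text": text,
        "stream": False,
        "voice_setting": {"voice_id": voice_id, "speed": 1.0, "vol": 1.0, "pitch": 0},
        "audio_setting": {"sample_rate": 32000, "bitrate": 128000, "format": "mp3", "channel": 2},
        "output_format": "hex",
    }
    return urllib.request.Request(
        BASE_URL.format(endpoint=endpoint),
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Assumption: bearer-token authentication; not specified in this section.
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


request = build_t2a_request("my-endpoint", "YOUR_API_KEY", "Hello, world.", "male-qn-qingse")
print(request.full_url)
# To send it: urllib.request.urlopen(request), then parse the JSON response body.
```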
Response Parameters
| Field Name | Parameter Type | Description |
|---|---|---|
| data | object | The returned data object may be null and requires a non-null check. |
| trace_id | string | The ID of this session, used to help locate issues during consultation/feedback. |
| extra_info | object | Related additional information. |
| base_resp | object | If the request fails, the corresponding error status code and details. |
data parameter
| Field Name | Type | Description |
|---|---|---|
| audio | string | The synthesized audio segment is hex-encoded and generated according to the format defined in the input (mp3/pcm/flac). |
| subtitle_file | string | Download link for the subtitle file that corresponds to the audio. Timestamps are accurate to the sentence (no more than 50 characters each), expressed in milliseconds, and the file is in JSON format. |
| status | int | Current audio stream status, 1 indicates synthesizing, 2 indicates synthesis completed. |
extra_info parameter
| Field Name | Type | Description |
|---|---|---|
| audio_length | long | Audio duration, accurate to milliseconds. |
| audio_sample_rate | long | Sampling rate. |
| audio_size | long | Audio size, in bytes. |
| bitrate | long | Bitrate. |
| audio_format | string | The format of the generated audio file, with valid values being mp3/pcm/flac. |
| audio_channel | long | Number of audio channels generated, 1: Mono, 2: Stereo. |
| invisible_character_ratio | double | Proportion of illegal characters. If the proportion of illegal characters does not exceed 10% (inclusive), the audio will be generated normally and the proportion of illegal characters will be returned; the maximum is 0.1 (10%), and an error will be reported if exceeded. |
| usage_characters | long | Billable character count, the number of billable characters generated in this speech synthesis. |
| word_count | long | Count of pronounced characters, including Chinese characters, numbers, and letters, excluding punctuation marks |
base_resp parameter
| Field Name | Type | Description |
|---|---|---|
| status_code | int64 | Status code. 1000: Unknown error; 1001: Timeout; 1002: Triggered rate limiting; 1004: Authentication failed; 1039: Triggered TPM rate limiting; 1042: Illegal characters exceed 10%; 2013: Input format information is abnormal. |
| status_msg | string | Status Details. |
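
With output_format set to hex, the audio field has to be decoded before it can be written to a file. The following sketch shows one way to do that together with the null check on data. It assumes that a status_code of 0 means success, which this section states only for the file upload interface; extract_audio is a hypothetical helper name and the sample response is made up for illustration.

```python
import json


def extract_audio(response_text: str) -> bytes:
    """Return decoded audio bytes from a non-streaming, hex-encoded response."""
    payload = json.loads(response_text)
    base = payload.get("base_resp") or {}
    code = base.get("status_code", 0)
    # Assumption: 0 means success; non-zero codes are listed in the base_resp table.
    if code != 0:
        raise RuntimeError(f"request failed: {code} {base.get('status_msg', '')}")
    data = payload.get("data")
    # data may be null, so check it before reading the audio field.
    if not data or not data.get("audio"):
        raise RuntimeError(f"no audio returned (trace_id={payload.get('trace_id')})")
    return bytes.fromhex(data["audio"])


sample = json.dumps({
    "data": {"audio": "494433", "status": 2},
    "trace_id": "example-trace",
    "base_resp": {"status_code": 0, "status_msg": "success"},
})
print(extract_audio(sample))
```

The returned bytes can be written directly to a file opened in binary mode, using the extension that matches the requested audio format.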
T2A Large v2 (Asynchronous Ultra-long Text Speech Generation)
File Upload
Request URL
post
https://genaiapi-m2.cloudsway.net/v1/ai/{endpoint}/hailuo/file/upload
content-type
multipart/form-data
Request Parameters
| Field Name | Type | Required or Not | Description |
|---|---|---|---|
| purpose | string | Yes | Purpose of the file. Supported value: t2a_async_input, a file used when creating a speech generation task; documents in txt and zip formats are supported. |
| file | file | Yes | The file to upload. |
Response Parameters
| Field Name | Type | Description |
|---|---|---|
| file | object | File information object, containing detailed information about the file. |
| base_resp | object | Basic response information object, containing status code and status message. |
file parameter
| Field Name | Type | Description |
|---|---|---|
| file_id | int64 | Unique identifier of the file. |
| bytes | int | File size, in bytes. |
| created_at | int64 | Creation time of the file, in Unix timestamp format. |
| filename | string | File name. |
| purpose | string | The purpose of the file is currently t2a_async_input. |
base_resp parameter
| Field Name | Type | Description |
|---|---|---|
| status_code | int64 | Status code, 0 indicates success. |
| status_msg | string | Status message, success indicates that the request was successful. |
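
The upload carries two form fields, purpose and file. The sketch below encodes them by hand with the standard library, assuming the interface accepts a standard multipart/form-data body; encode_multipart and the file name book.txt are hypothetical, and the part content type application/octet-stream is a neutral default rather than a documented requirement.

```python
import uuid


def encode_multipart(purpose: str, filename: str, content: bytes) -> tuple[str, bytes]:
    """Return (Content-Type header value, body) for the purpose and file fields."""
    boundary = uuid.uuid4().hex
    lines = [
        f"--{boundary}".encode(),
        b'Content-Disposition: form-data; name="purpose"',
        b"",
        purpose.encode(),
        f"--{boundary}".encode(),
        f'Content-Disposition: form-data; name="file"; filename="{filename}"'.encode(),
        b"Content-Type: application/octet-stream",
        b"",
        content,
        f"--{boundary}--".encode(),
        b"",
    ]
    # Multipart bodies separate lines with CRLF.
    return f"multipart/form-data; boundary={boundary}", b"\r\n".join(lines)


content_type, body = encode_multipart("t2a_async_input", "book.txt", b"Hello, world.")
print(content_type.split(";")[0], len(body))
```

The returned header value and body can then be passed to any HTTP client as the Content-Type header and request payload of the POST.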
Create an asynchronous task
Request URL
post
https://genaiapi-m2.cloudsway.net/v1/ai/{endpoint}/hailuo/tts/t2a_async_v2
content-type
application/json
Request Parameters
| Field Name | Type | Required or Not | Description |
|---|---|---|---|
| text | string | No | Text to be synthesized, with a maximum limit of 50,000 characters. Either this or "text_file_id" must be filled in. |
| text_file_id | long | No | The file_id returned by the file upload interface, identifying the text file to be synthesized. Either this or "text" must be provided. Supported formats are txt and zip; the format is verified automatically after the ID is passed in. 1. txt: Length limit <100,000 characters. To insert a pause, add <#x#> between characters, where x is the pause length in seconds (0.01 to 99.99, at most two decimal places). A pause marker must sit between two pieces of text that can be pronounced, and consecutive pause markers are not allowed. 2. zip: The archive must contain only txt files or only json files (all files in one archive share the same format). A json file can have three fields, ["title", "content", "extra"], representing the title, body, and author respectively. Three sets of results are produced, one per field, for a total of 9 files in one folder. If a field does not exist or its content is empty, the corresponding file is not generated. |
| voice_setting | object | No | Control parameters such as the speed, volume, and intonation of the generated voice. |
| audio_setting | object | No | Controls the sampling rate, bit rate, and audio format of the generated sound. |
| pronunciation_dict | object | No | Custom pronunciation rules for specific text or symbols; see the pronunciation_dict parameter table. |
| language_boost | string | No | Enhances recognition of the specified minority language or dialect, which can improve speech quality in that scenario. The default is null. If the language is not known in advance, set "auto" and the model will determine it automatically. Supported values: 'Chinese', 'Chinese,Yue', 'English', 'Arabic', 'Russian', 'Spanish', 'French', 'Portuguese', 'German', 'Turkish', 'Dutch', 'Ukrainian', 'Vietnamese', 'Indonesian', 'Japanese', 'Italian', 'Korean', 'Thai', 'Polish', 'Romanian', 'Greek', 'Czech', 'Finnish', 'Hindi', 'auto' |
| voice_modify | object | No | Voice effect settings; see the voice_modify parameter table. Supported audio formats: mp3, wav, flac. Other formats such as pcm, pcmu_raw, pcmu_wav, and opus are not supported, and passing them returns a parameter error. |
| aigc_watermark | boolean | No | Controls whether to add an audio rhythm marker at the end of the synthesized audio, with a default value of False. This parameter only takes effect for non-streaming synthesis. Only domestic API interfaces support this parameter. |
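
For the zip input described under text_file_id, the archive of json files can be produced in memory before it is uploaded. The sketch below is an illustration only: build_json_archive and the file name chapter1.json are hypothetical, and rejecting fields other than title, content, and extra is this sketch's own choice.

```python
import io
import json
import zipfile


def build_json_archive(documents: dict[str, dict]) -> bytes:
    """Pack {file name: {"title", "content", "extra"}} into an in-memory zip of JSON files."""
    allowed = {"title", "content", "extra"}
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for name, fields in documents.items():
            unknown = set(fields) - allowed
            if unknown:
                raise ValueError(f"unsupported fields in {name}: {sorted(unknown)}")
            # ensure_ascii=False keeps non-ASCII text readable inside the file.
            archive.writestr(name, json.dumps(fields, ensure_ascii=False))
    return buffer.getvalue()


archive_bytes = build_json_archive({
    "chapter1.json": {"title": "Chapter 1", "content": "Once upon a time.", "extra": "Author Name"},
})
print(len(archive_bytes) > 0)
```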
voice_setting parameter
| Field Name | Type | Required or Not | Description |
|---|---|---|---|
| speed | Float | No | The speaking rate range is [0.5, 2], with a default value of 1.0. The larger the value, the faster the speaking rate. |
| vol | Float | No | Volume range (0, 10], default value is 1.0, the larger the value, the higher the volume. |
| pitch | int | No | The pitch range is [-12, 12], with a default value of 0, where 0 indicates the original timbre output, and the value must be an integer. |
| voice_id | string | Yes | ID of the requested timbre. Two types are supported: system timbres and replicated timbres. The system timbre IDs are: Youthful Youth Timbre (male-qn-qingse); Elite Youth Timbre (male-qn-jingying); Overbearing Youth Timbre (male-qn-badao); Youth College Student Timbre (male-qn-daxuesheng); Maiden Timbre (female-shaonv); Mature Sister Timbre (female-yujie); Mature female voice (female-chengshu); Sweet female voice (female-tianmei); Male host (presenter_male); Female host (presenter_female); Male audiobook 1 (audiobook_male_1); Male audiobook 2 (audiobook_male_2); Female audiobook 1 (audiobook_female_1); Female audiobook 2 (audiobook_female_2); Green Youth Voice - beta (male-qn-qingse-jingpin); Elite Youth Voice - beta (male-qn-jingying-jingpin); Overbearing Youth Voice - beta (male-qn-badao-jingpin); Youthful College Student Voice - beta (male-qn-daxuesheng-jingpin); Teenage Girl Voice - beta (female-shaonv-jingpin); Elder Sister Voice - beta (female-yujie-jingpin); Mature Woman Voice - beta (female-chengshu-jingpin); Sweet Woman Voice - beta (female-tianmei-jingpin); Clever Boy (clever_boy); Cute Boy (cute_boy); Lovely Girl (lovely_girl); Cartoon Pig Xiaoqi (cartoon_pig). |
| emotion | string | No | Controls the emotion of synthetic speech; currently supports 7 emotions: happy, sad, angry, fearful, disgusted, surprised, neutral; parameter range ["happy", "sad", "angry", "fearful", "disgusted", "surprised", "neutral"] |
| english_normalization | bool | No | This parameter supports English Text Normalization, which can improve the performance of digital reading scenarios but slightly increase latency. If not provided, the default value is false. |
audio_setting parameter
| Field Name | Type | Required or Not | Description |
|---|---|---|---|
| audio_sample_rate | int | No | Sampling rate range [8000, 16000, 22050, 24000, 32000, 44100], default value is 32000. |
| bitrate | int | No | Bitrate range [32000, 64000, 128000, 256000], default value is 128000, only valid for mp3 format audio. |
| format | string | No | The generated audio format, with a default value of mp3, has an optional range of [mp3, pcm, flac, wav]. |
| channel | int | No | Number of channels, default value is 2 (stereo), optional values are 1 (mono) or 2 (stereo). |
pronunciation_dict parameter
| Field Name | Type | Required or Not | Description |
|---|---|---|---|
| tone | list | No | Pronunciation overrides for text or symbols that need special marking, such as adjusting tones or substituting the pronunciation of other characters. Each entry has the form "text/replacement", for example: ["Yan Shaofei/(yan4)(shao3)(fei1)", "Dafei/(da2)(fei1)", "omg/oh my god"]. Tones are represented by numbers: 1 for the first tone (flat), 2 for the second tone (rising), 3 for the third tone (falling-rising), 4 for the fourth tone (falling), and 5 for the neutral tone. |
voice_modify parameter
| Field Name | Type | Required or Not | Description |
|---|---|---|---|
| pitch | int | No | Pitch adjustment (low/bright), range [-100, 100], values closer to -100 result in a lower pitch; closer to 100, a brighter pitch |
| intensity | int | No | Intensity adjustment (strength/softness), range [-100, 100], values closer to -100 result in a more powerful sound; closer to 100, a softer sound |
| timbre | int | No | Timbre adjustment (magnetic/crisp), range [-100, 100], with values closer to -100 resulting in a more mellow sound; values closer to 100 resulting in a crisper sound |
| sound_effects | string | No | Sound effect setting; only one can be selected at a time. Available values: spacious_echo (spacious echo), auditorium_echo (auditorium broadcast), lofi_telephone (telephone distortion), robotic (electronic sound). |
Response Parameters
| Field Name | Parameter Type | Description |
|---|---|---|
| taskId | long | Asynchronous Task ID |
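
An asynchronous task takes its input from either text or text_file_id. The sketch below builds a request body around that choice. build_async_body is a hypothetical helper, and treating "both supplied" as an error is this sketch's interpretation of the either-or requirement, not a documented rule.

```python
def build_async_body(voice_id, text=None, text_file_id=None, audio_format="mp3"):
    """Build a t2a_async_v2 request body from one of text or text_file_id."""
    # Assumption: exactly one of the two inputs should be supplied.
    if (text is None) == (text_file_id is None):
        raise ValueError("provide exactly one of text or text_file_id")
    if text is not None and len(text) > 50000:
        raise ValueError("text is limited to 50,000 characters")
    body = {
        "voice_setting": {"voice_id": voice_id, "speed": 1.0, "vol": 1.0, "pitch": 0},
        "audio_setting": {
            "audio_sample_rate": 32000,
            "bitrate": 128000,
            "format": audio_format,
            "channel": 2,
        },
    }
    if text is not None:
        body["text"] = text
    else:
        body["text_file_id"] = text_file_id
    return body


print(build_async_body("male-qn-qingse", text_file_id=123456))
```

The resulting dict would be serialized as JSON and posted to the t2a_async_v2 URL; the taskId in the response is then used to query the task status.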
Get the status of an asynchronous task
Request URL
get
https://genaiapi-m2.cloudsway.net/v1/ai/{endpoint}/hailuo/tts/task/{taskId}
Request Parameters
| Field Name | Type | Required or Not | Description |
|---|---|---|---|
| taskId | string | Yes | The taskId returned by the asynchronous task creation interface. |
Response Parameters
| Field Name | Type | Description |
|---|---|---|
| taskId | string | Unique identifier of the task. |
| status | string | The status of the task, such as Processing, Success, Failed, Expired. |
| fileId | string | Unique identifier of the file. |
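
Since the task runs asynchronously, a client typically polls the status interface until the task finishes. The sketch below shows one such loop. wait_for_task and its fetch_status callback are hypothetical: fetch_status stands for whatever function performs the GET request and returns the parsed JSON as a dict. The status strings are compared exactly as listed in the table, which is an assumption about their casing, and the interval and timeout defaults are arbitrary.

```python
import time


def wait_for_task(fetch_status, task_id, interval=5.0, timeout=600.0, sleep=time.sleep):
    """Poll fetch_status(task_id) until the task leaves the Processing state."""
    deadline = time.monotonic() + timeout
    while True:
        result = fetch_status(task_id)  # parsed JSON response as a dict
        status = result.get("status")
        if status == "Success":
            return result["fileId"]
        if status in ("Failed", "Expired"):
            raise RuntimeError(f"task {task_id} ended with status {status}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"task {task_id} still {status} after {timeout} seconds")
        sleep(interval)


# Usage with a stand-in for the real HTTP call:
fake_responses = iter([
    {"taskId": "1", "status": "Processing"},
    {"taskId": "1", "status": "Success", "fileId": "42"},
])
print(wait_for_task(lambda _task_id: next(fake_responses), "1", sleep=lambda _s: None))
```

Passing the sleep function as a parameter keeps the loop easy to test without real delays.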
File Download
Request URL
get
https://genaiapi-m2.cloudsway.net/v1/ai/{endpoint}/hailuo/files/retrieve?taskId=19153409980342...&fileId=261877976...
Request Parameters
| Field Name | Type | Required or Not | Description |
|---|---|---|---|
| taskId | string | Yes | The taskId returned by the asynchronous task status interface. |
| fileId | string | Yes | The fileId returned by the asynchronous task status interface. |
Response Parameters
| Field Name | Type | Description |
|---|---|---|
| fileId | string | Unique identifier of the file. |
| bytes | int | File size, in bytes. |
| createdAt | int64 | Creation time of the file, in Unix timestamp format. |
| filename | string | File name, including the extension. |
| purpose | string | The purpose of the file, such as t2a_async. |
| mediaUrl | string | The download address of the file, a complete URL containing the access permission signature. |
| expireTime | int64 | The expiration time of the file, in Unix timestamp format, indicates the time when the download link of the file will expire. |
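
The retrieve URL takes taskId and fileId as query parameters, and the returned mediaUrl is only usable until expireTime. The sketch below covers both points. retrieve_url and link_is_valid are hypothetical helpers, my-endpoint is a placeholder, and expireTime is assumed to be in seconds, since the unit of the Unix timestamp is not stated.

```python
import time
from urllib.parse import urlencode

RETRIEVE_URL = "https://genaiapi-m2.cloudsway.net/v1/ai/{endpoint}/hailuo/files/retrieve"


def retrieve_url(endpoint: str, task_id: str, file_id: str) -> str:
    """Build the file download URL with URL-encoded query parameters."""
    query = urlencode({"taskId": task_id, "fileId": file_id})
    return f"{RETRIEVE_URL.format(endpoint=endpoint)}?{query}"


def link_is_valid(file_info: dict, now=None) -> bool:
    """Return True while mediaUrl has not yet reached expireTime."""
    # Assumption: expireTime is a Unix timestamp in seconds.
    current = time.time() if now is None else now
    return current < file_info["expireTime"]


print(retrieve_url("my-endpoint", "123", "456"))
```

If link_is_valid returns False, calling the retrieve interface again should yield a fresh mediaUrl, assuming the task itself has not expired.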