MaaS_Ele
MaaS_Ele_scribe_v2
Request Protocol
HTTP
Header
| Parameter Name | Type | Required | Description |
|---|---|---|---|
| Content-Type | string | Yes | Fixed to application/json |
| Authorization | string | Yes | Bearer {your_api_key} |
Request URL
POST https://genaiapi.cloudsway.net/v1/ai/{Your EndpointPath}/stt-e/speech-to-text
Query Parameter
| Attribute Name | Type | Required/Optional | Description |
|---|---|---|---|
| enable_logging | boolean | Optional, default is true | When enable_logging is set to false, the request will use zero retention mode. This means that the logging and transcription storage features for this request will be unavailable. Zero retention mode is only available to enterprise customers. |
Request Parameters
| Attribute Name | Type | Required/Optional | Description |
|---|---|---|---|
| file | file | Optional | File to be transcribed. Supports all major audio and video formats. One of the file or cloud_storage_url parameters must be provided. File size must be less than 3.0 GB. |
| language_code | string or null | Optional | ISO-639-1 or ISO-639-3 language code corresponding to the language of the audio file. If known in advance, it can sometimes improve transcription performance. The default is null, in which case the language will be automatically predicted. |
| tag_audio_events | boolean | Optional, default is true | Whether to mark audio events such as (laughter), (footsteps), etc. in the transcription. |
| num_speakers | integer or null | Optional, 1-32 | The maximum number of speakers in the uploaded file. Helps predict who is speaking when. The maximum number of predictable speakers is 32. The default value is null, in which case the number of speakers will be set to the maximum supported by the model. |
| timestamps_granularity | enum | Optional, default is word | Granularity of timestamps in the transcription. 'word' provides word-level timestamps, and 'character' provides character-level timestamps within each word. Allowed values: none, word, character |
| diarize | boolean | Optional, default is false | Whether to mark who the current speaker is in the uploaded file. |
| diarization_threshold | double or null | Optional, 0.1-0.4 | Separation threshold applied during speaker diarization. A higher value means a lower likelihood of one speaker being split into two different speakers, but a higher likelihood of two different speakers being merged into one (fewer total speakers predicted). A lower value means the opposite (more total speakers predicted). It can only be set when diarize=true and num_speakers=null. The default is null, in which case a threshold (usually 0.22) is selected based on the model_id. |
| additional_formats | list of objects | Optional | List of additional formats to which the transcript can be exported. |
| file_format | enum | Optional, default is other | Format of the input audio. Options are 'pcm_s16le_16' or 'other'. For pcm_s16le_16, the input audio must be 16-bit PCM with a sampling rate of 16 kHz, mono (single channel), and little-endian byte order. Latency will be lower compared to passing encoded waveforms. Allowed values: pcm_s16le_16, other |
| cloud_storage_url | string or null | Optional (officially deprecated) | HTTPS URL of the file to be transcribed. One of the file or cloud_storage_url parameters must be provided. The file must be accessible via HTTPS, and the file size must be less than 2 GB. Any valid HTTPS URL is accepted, including URLs from cloud storage providers (AWS S3, Google Cloud Storage, Cloudflare R2, etc.), CDNs, or any other HTTPS source. The URL can be pre-signed or include an authentication token in the query parameters. |
| temperature | double or null | Optional, 0-2 | Controls the randomness of transcription output. Accepts values between 0.0 and 2.0, where higher values result in more diverse and uncertain outcomes. If omitted, a temperature based on the selected model (typically 0) is used. |
| seed | integer or null | Optional, 0-2147483647 | If specified, the system will try its best to perform deterministic sampling, so that repeated requests with the same seed and parameters should return the same results. However, determinism is not guaranteed. Must be an integer between 0 and 2147483647. |
| use_multi_channel | boolean | Optional, default is false | Whether the audio file contains multiple channels, with each channel containing a single speaker. When enabled, each channel is transcribed independently and the results are merged. Each word in the response will include a "channel_index" field indicating which channel the word was spoken in. Up to 5 channels are supported. |
| no_verbatim | boolean | Optional, default is false | Only supported by the scribe_v2 model. If true, the transcription will omit filler words, false starts, and non-speech sounds. |
| entity_detection | string or list of strings | Optional | Only supported by the scribe_v2 model. Detects entities in the transcript. Can be "all" to detect all entities, a single entity type or category string, or a list of entity types/categories. Categories include "pii", "phi", "pci", "other", "offensive_language"; see the official website documentation for the specific values. When enabled, detected entities are returned in the entities field, along with their text, type, and character positions. Using this parameter incurs additional costs. |
| entity_redaction | string or list of strings | Optional | Only supported by the scribe_v2 model. Removes entities from the transcript text. Accepts the same format as entity_detection: "all", categories (such as "pii", "phi"), or specific entity types. Must be a subset of entity_detection. When enabled, the matched entity values are redacted in the returned text rather than returned verbatim. |
| entity_redaction_mode | string | Optional (redacted, entity_type, enumerated_entity_type), default is enumerated_entity_type | Only supported by the scribe_v2 model. How redacted entities are formatted: 'redacted' replaces each entity with {REDACT}, 'entity_type' with {ENTITY_TYPE}, and 'enumerated_entity_type' with {ENTITY_TYPE_N}, where N enumerates each occurrence. Only used when entity_redaction is set. |
| keyterms | list of strings | Optional | Only supported by the scribe_v2 model. A list of key terms used to bias transcription. Key terms are words or phrases that you want the model to recognize more accurately. At most 1000 key terms may be provided; each must be less than 50 characters long and may contain up to 5 words (after normalization). For example, ["hello", "world", "technical term"]. Using this parameter incurs additional costs. When more than 100 key terms are provided, the minimum billing duration for each request is 20 seconds. |
| source_url | string | Optional | URL of the audio or video file to transcribe. Supports hosted video or audio files, YouTube video URLs, TikTok video URLs, and other video hosting services. |
Notes on scribe_v2 entity detection (observed behavior of the upstream API):
- When the request parameters contain entity_detection, the entity detection mechanism is triggered and the character-cost value in the response header will incur additional costs.
- When the request contains entity_redaction but not entity_detection, entity detection is still triggered, but the character-cost in the response header will not incur additional costs.
additional_formats
| Attribute Name | Type | Required/Optional | Description |
|---|---|---|---|
| format | string | Required | Enum values: segmented_json, docx, txt, html, srt |
| max_characters_per_line | integer or null | Optional, default is 100 | Supported when the format is txt or srt. |
| include_speakers | boolean | Optional, default is true | |
| include_timestamps | boolean | Optional, default is true | |
| segment_on_silence_longer_than_s | double or null | Optional | |
| max_segment_duration_s | double or null | Optional | |
| max_segment_chars | integer or null | Optional | |
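As a sketch of how an additional_formats list might be assembled (field names are taken from the table above; the concrete values are illustrative, not API defaults), the objects can be JSON-encoded before being attached to the request:

```python
import json

# Hypothetical additional_formats list using the fields documented above.
additional_formats = [
    {"format": "srt", "max_characters_per_line": 42, "include_speakers": False},
    {"format": "txt", "include_timestamps": False},
]

# A list of objects like this is typically serialized to a JSON string
# before being sent as the additional_formats form field.
payload = json.dumps(additional_formats)
print(payload)
```

Whether the field must be sent as a JSON string or as structured form data depends on the client library; the round-trip above only demonstrates the expected shape of the list.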
Request Example
curl 'https://genaiapi.cloudsway.net/v1/ai/{Your EndpointPath}/stt-e/speech-to-text' \
-H 'Authorization: Bearer {Your AK}' \
-F 'file=@"postman-cloud:///1f0e2394-2f6d-4010-a1e8-ad821a2b9a3a"' \
-F 'tag_audio_events="true"'
Response Example
Single-channel response
{
"words": [
{
"start": 0.159,
"type": "word",
"logprob": 0.0,
"end": 0.359,
"text": "A"
},
{
"start": 0.359,
"type": "spacing",
"logprob": 0.0,
"end": 0.36,
"text": " "
},
{
"start": 0.36,
"type": "word",
"logprob": 0.0,
"end": 0.679,
"text": "shared"
},
{
"start": 0.679,
"type": "spacing",
"logprob": 0.0,
"end": 0.679,
"text": " "
},
{
"start": 0.679,
"type": "word",
"logprob": 0.0,
"end": 0.919,
"text": "goal"
},
{
"start": 0.919,
"type": "spacing",
"logprob": 0.0,
"end": 0.959,
"text": " "
},
{
"start": 0.959,
"type": "word",
"logprob": 0.0,
"end": 1.039,
"text": "is"
},
{
"start": 1.039,
"type": "spacing",
"logprob": 0.0,
"end": 1.059,
"text": " "
},
{
"start": 1.059,
"type": "word",
"logprob": 0.0,
"end": 1.159,
"text": "the"
},
{
"start": 1.159,
"type": "spacing",
"logprob": 0.0,
"end": 1.179,
"text": " "
},
{
"start": 1.179,
"type": "word",
"logprob": 0.0,
"end": 1.519,
"text": "heartbeat"
},
{
"start": 1.519,
"type": "spacing",
"logprob": 0.0,
"end": 1.539,
"text": " "
},
{
"start": 1.539,
"type": "word",
"logprob": 0.0,
"end": 1.659,
"text": "of"
},
{
"start": 1.659,
"type": "spacing",
"logprob": 0.0,
"end": 1.659,
"text": " "
},
{
"start": 1.659,
"type": "word",
"logprob": 0.0,
"end": 2.199,
"text": "teamwork."
}
],
"language_code": "eng",
"transcription_id": "DH4a9n1brE06R8BUQPEx",
"language_probability": 0.9907509684562683,
"text": "A shared goal is the heartbeat of teamwork."
}
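Because spacing tokens carry the inter-word whitespace, the plain transcript can be rebuilt client-side by concatenating the text of every element in words, word and spacing tokens alike. A minimal sketch using an excerpt of the response above:

```python
# Excerpt of the "words" array from the single-channel response above.
words = [
    {"start": 0.159, "end": 0.359, "type": "word", "text": "A"},
    {"start": 0.359, "end": 0.36, "type": "spacing", "text": " "},
    {"start": 0.36, "end": 0.679, "type": "word", "text": "shared"},
    {"start": 0.679, "end": 0.679, "type": "spacing", "text": " "},
    {"start": 0.679, "end": 0.919, "type": "word", "text": "goal"},
]

# Concatenating all tokens in order reproduces the raw transcript text.
text = "".join(w["text"] for w in words)
print(text)  # → A shared goal
```

The same join over the full array yields the top-level text field of the response.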
Multi-channel Response
{
"transcripts": [
{
"language_code": "en",
"language_probability": 0.98,
"text": "Hello from channel one.",
"words": [
{
"text": "Hello",
"start": 0,
"end": 0.5,
"type": "word",
"speaker_id": "speaker_0",
"logprob": -0.124
}
]
},
{
"language_code": "en",
"language_probability": 0.97,
"text": "Greetings from channel two.",
"words": [
{
"text": "Greetings",
"start": 0.1,
"end": 0.7,
"type": "word",
"speaker_id": "speaker_1",
"logprob": -0.156
}
]
}
]
}
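Per the request-parameter table, each word carries a channel_index field when use_multi_channel is enabled (the abbreviated example above omits it). A sketch of merging per-channel transcripts into one time-ordered stream, with channel numbering that is illustrative rather than guaranteed by the API:

```python
# Per-channel transcripts, shaped like the multi-channel response above.
transcripts = [
    {"words": [{"text": "Hello", "start": 0.0, "end": 0.5, "type": "word"}]},
    {"words": [{"text": "Greetings", "start": 0.1, "end": 0.7, "type": "word"}]},
]

# Tag each word with its channel and sort all words by start time to get
# a single interleaved transcript across channels.
merged = sorted(
    (dict(w, channel_index=i) for i, t in enumerate(transcripts) for w in t["words"]),
    key=lambda w: w["start"],
)
print([(w["text"], w["channel_index"]) for w in merged])
```

When the API populates channel_index itself, the tagging step can be dropped and the sort applied directly to the returned words.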
Example of v2 Entity Detection Request
curl 'https://genaiapi.cloudsway.net/v1/ai/{Your EndpointPath}/stt-e/speech-to-text' \
-H 'Authorization: Bearer {Your AK}' \
-F 'file=@"postman-cloud:///1f0e2394-2f6d-4010-a1e8-ad821a2b9a3a"' \
-F 'entity_detection="pii"' \
-F 'entity_redaction="name"'
v2 Entity Detection Response Example
{
"language_code": "eng",
"language_probability": 0.9869821071624756,
"text": "My name is {NAME_0}. My date of birth is the 12th of July 1987, and my credit card number is 4242-4242-4242-4242.",
"words": [
{ "text": "My", "start": 0.099, "end": 0.259, "type": "word", "logprob": 0.0 },
{ "text": " ", "start": 0.259, "end": 0.299, "type": "spacing", "logprob": 0.0 },
{ "text": "name", "start": 0.299, "end": 0.42, "type": "word", "logprob": 0.0 },
...
],
"transcription_id": "Y2ZX8AxHUzTPCIualYiE",
"entities": [
{
"text": "{NAME_0}",
"entity_type": "name",
"start_char": 11,
"end_char": 15
},
{
"text": "12th of July 1987",
"entity_type": "dob",
"start_char": 41,
"end_char": 58
},
{
"text": "4242-4242-4242-4242",
"entity_type": "credit_card",
"start_char": 89,
"end_char": 108
}
]
}
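Each entry in entities carries start_char/end_char offsets into the transcript. Note that in the sample above the offsets do not line up with the redacted text (the {NAME_0} span is 8 characters long, but its offsets span only 4), so they presumably refer to the pre-redaction text; the sketch below therefore uses a small self-consistent, hypothetical example to show how such offsets are meant to be consumed:

```python
# Hypothetical transcript with self-consistent entity offsets.
text = "Call me at 555-0100 tomorrow."
entities = [
    {"text": "555-0100", "entity_type": "phone", "start_char": 11, "end_char": 19},
]

# Each entity's character span should slice back out of the transcript text.
for e in entities:
    assert text[e["start_char"]:e["end_char"]] == e["text"]
print("all entity spans consistent")
```

A check like this is a cheap way to confirm which text (redacted or original) the offsets in a given response refer to.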
MaaS_Ele_voice_clones
Request URL
POST https://genaiapi.cloudsway.net/v1/ai/{Your EndpointPath}/elevenlabs/voices/add
Request Parameters
| Attribute Name | Type | Required/Optional | Description |
|---|---|---|---|
| name | string | Required |
Identifies the name of this voice. This name will be displayed in the dropdown menu of the website. |
| files | files | Required | List of audio recording file paths for voice cloning. |
| remove_background_noise | boolean | Optional, default is false | If enabled, our audio isolation model will be used to remove background noise from the uploaded samples. |
| description | string or null | Optional | Description of voice. |
| labels | string or null | Optional | Serialized label dictionary for speech. |
Request Example
curl 'https://genaiapi.cloudsway.net/v1/ai/{Your EndpointPath}/elevenlabs/voices/add' \
-H 'Authorization: Bearer {Your AK}' \
-F 'name="girl-voice-1"' \
-F 'files=@"postman-cloud:///1f0e238b-1f03-4200-bab6-8616a7296adb"' \
-F 'remove_background_noise="false"'
MaaS_Ele_tts_v3/MaaS_Ele_tts_v1
Request URL
POST https://genaiapi.cloudsway.net/v1/ai/{Your EndpointPath}/tts-e/text-to-speech/{voice_id}
Parameter
| Attribute Name | Type | Required/Optional | Description |
|---|---|---|---|
| voice_id | string | Required | ID of the voice to be used. |
Query Parameter
| Attribute Name | Type | Required/Optional | Description |
|---|---|---|---|
| enable_logging | boolean | Optional, default is true | When enable_logging is set to false, requests will use zero retention mode. This means that the history feature for this request will be unavailable, including request concatenation. Zero retention mode is only available to enterprise customers. |
| optimize_streaming_latency (Deprecated) | integer or null | Optional, default is null | You can enable latency optimization, but it will sacrifice quality to some extent. The optimal value of the final latency varies by model. Possible values: 0 - default mode (no latency optimization); 1 - normal latency optimization (about 50% of the possible latency improvement of option 3); 2 - strong latency optimization (about 75% of the possible latency improvement of option 3); 3 - maximum latency optimization; 4 - maximum latency optimization that also turns off the text normalizer to further save latency (optimal latency, but may misread numbers, dates, etc.). |
| output_format | enum | Optional, default is mp3_44100_128 | Output format of the generated audio, expressed as codec_samplerate_bitrate. Therefore, an MP3 with a sample rate of 22.05 kHz and a bit rate of 32 kbps is represented as mp3_22050_32. An MP3 with a bit rate of 192 kbps requires a Creator-level subscription or higher. PCM with a sample rate of 44.1 kHz requires a Professional-level subscription or higher. Note that the μ-law format (sometimes written as mu-law, often approximated as u-law) is commonly used for Twilio audio input. Supported values: mp3_22050_32, mp3_24000_48, mp3_44100_32, mp3_44100_64, mp3_44100_96, mp3_44100_128, mp3_44100_192, pcm_8000, pcm_16000, pcm_22050, pcm_24000, pcm_32000, pcm_44100, pcm_48000, ulaw_8000, alaw_8000, opus_48000_32, opus_48000_64, opus_48000_96, opus_48000_128 |
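The output_format values follow a codec_samplerate[_bitrate] pattern, with PCM, μ-law, and A-law formats omitting the bitrate component. A small helper (hypothetical, purely client-side, not part of the API) to decompose them:

```python
def parse_output_format(fmt: str):
    """Split an output_format value into (codec, sample_rate_hz, bitrate_kbps)."""
    parts = fmt.split("_")
    codec = parts[0]
    sample_rate = int(parts[1])
    # pcm_*, ulaw_* and alaw_* formats have no bitrate component.
    bitrate = int(parts[2]) if len(parts) > 2 else None
    return codec, sample_rate, bitrate

print(parse_output_format("mp3_22050_32"))  # → ('mp3', 22050, 32)
print(parse_output_format("pcm_16000"))     # → ('pcm', 16000, None)
```

Such a helper is handy for validating a requested format against the subscription-tier limits described above before sending the request.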
Request Parameters
| Attribute Name | Type | Required/Optional | Description |
|---|---|---|---|
| text | string | Required | Text to be converted to speech. |
| model_id | string | Optional | Model ID |
| language_code | string or null | Optional | Language codes (ISO 639-1) are used to force the model to use a certain language and perform Text Normalization. If the model does not support the provided language code, an error will be returned. |
| voice_settings | object or null | Optional | Voice settings will override the stored settings for the given voice. These settings are only applied in the given request. |
| pronunciation_dictionary_locators | list of objects or null | Optional | List of pronunciation dictionary locators (id, version_id) to be applied to the text. They will be applied in order. Each request may contain up to 3 locators. |
| seed | integer or null | Optional | If specified, the system will make every effort to perform deterministic sampling, so that repeated requests with the same seed and parameters should return the same results. However, determinism is not guaranteed. Must be an integer between 0 and 4294967295. |
| previous_text | string or null | Optional | The text before the current request text. It can be used to improve speech coherence when concatenating multiple generation results or to influence speech coherence in the current generation. |
| next_text | string or null | Optional | The text following the current request text. When concatenating multiple generation results, it can be used to improve speech coherence, or influence speech coherence during the current generation. |
| previous_request_ids |
list of strings or null | Optional | List of request IDs for samples generated before this generation. When splitting a large task into multiple requests, it can be used to improve speech coherence. It works best when using the same model in each generation. If both previous_text and previous_request_ids are sent, previous_text will be ignored. Up to 3 request IDs can be sent. |
| next_request_ids | list of strings or null | Optional | List of request IDs for samples after this generation. next_request_ids are particularly useful for maintaining speech coherence when regenerating samples with audio quality issues. For example, if you have already generated 3 speech segments and want to improve segment 2, passing the request ID of segment 3 as next_request_id (and the request ID of segment 1 as previous_request_id) will help maintain the natural fluency of the synthesized speech.When using the same model across generations, the best results are achieved. If both next_text and next_request_ids are sent simultaneously, next_text will be ignored. Up to 3 request IDs can be sent. |
| apply_text_normalization | enum | Optional, default is Auto | This parameter controls Text Normalization through three modes: "Auto", "On", and "Off". When set to "Auto", the system automatically decides whether to apply Text Normalization (e.g., spelling out numbers). When set to "On", Text Normalization is always applied; when set to "Off", it is skipped. Allowed values: Auto, On, Off |
voice_settings
| Attribute Name | Type | Required/Optional | Description |
|---|---|---|---|
| stability | double or null | Optional, 0-1, default is 0.5 | Determines the stability of the voice and the randomness between generations. A lower value introduces a wider range of emotions into the voice; a higher value may result in a monotonous voice with limited emotion. |
| use_speaker_boost | boolean or null | Optional, default is true | Enhances the similarity to the original speaker. Using this setting requires a slightly higher computational load, which in turn increases latency. |
| similarity_boost | double or null | Optional, 0-1, default is 0.75 | Determines how closely the AI should match the original voice when attempting to replicate it. |
| style | double or null | Optional, default is 0 | Determines the exaggeration level of the voice style. This setting attempts to amplify the style of the original speaker. Non-zero values consume additional computational resources and may increase latency. |
| speed | double or null | Optional, default is 1 | Adjusts the speed of the voice. A value of 1.0 is the default speed; values below 1.0 slow the speech down, and values above 1.0 speed it up. |
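Pulling the table together, a voice_settings object for the request body might look like the following (the concrete values are illustrative; the ranges mirror those documented above):

```python
# Illustrative voice_settings object; field names from the table above.
voice_settings = {
    "stability": 0.5,          # 0-1; lower = more emotional variation
    "similarity_boost": 0.75,  # 0-1; how closely to match the original speaker
    "style": 0.0,              # style exaggeration; non-zero adds latency
    "use_speaker_boost": True,
    "speed": 1.0,              # <1.0 slower, >1.0 faster
}

# Client-side sanity checks mirroring the documented ranges.
assert 0.0 <= voice_settings["stability"] <= 1.0
assert 0.0 <= voice_settings["similarity_boost"] <= 1.0
print("voice_settings within documented ranges")
```

In the text-to-speech request this object is nested under the voice_settings key of the JSON body alongside text and model_id.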
Request Example
curl 'https://genaiapi.cloudsway.net/v1/ai/{Your EndpointPath}/tts-e/text-to-speech/{voice_id}?enable_logging=false' \
-H 'Authorization: Bearer {Your AK}' \
-H 'Content-Type: application/json' \
-d '{
"text": "A shared goal is the heartbeat of teamwork."
}'
MaaS_Ele_eleven_multilingual_sts_v2
Request URL
POST https://genaiapi.cloudsway.net/v1/ai/{Your EndpointPath}/eleven-labs/voice-changer/{voice_id}
Parameter
| Parameter | Type | Required/Optional | Description |
|---|---|---|---|
| voice_id | string | Required | ID of the voice to be used. |
Query Parameter
| Parameter | Type | Required/Optional | Description |
|---|---|---|---|
| enable_logging | boolean | Optional, default is true | When enable_logging is set to false, requests will use zero retention mode. This means that the history feature for this request will be unavailable, including request concatenation. Zero retention mode is only available to enterprise customers. |
| optimize_streaming_latency (Deprecated) | integer or null | Optional, default is null | You can enable latency optimization, but it will sacrifice quality to some extent. The optimal value of the final latency varies by model. Possible values: 0 - default mode (no latency optimization); 1 - normal latency optimization (about 50% of the possible latency improvement of option 3); 2 - strong latency optimization (about 75% of the possible latency improvement of option 3); 3 - maximum latency optimization; 4 - maximum latency optimization that also turns off the text normalizer to further save latency (optimal latency, but may misread numbers, dates, etc.). |
| output_format | enum | Optional, default is mp3_44100_128 | Output format of the generated audio, expressed as codec_samplerate_bitrate. Therefore, an MP3 with a sample rate of 22.05 kHz and a bit rate of 32 kbps is represented as mp3_22050_32. An MP3 with a bit rate of 192 kbps requires a Creator-level subscription or higher. PCM with a sample rate of 44.1 kHz requires a Professional-level subscription or higher. Note that the μ-law format (sometimes written as mu-law, often approximated as u-law) is commonly used for Twilio audio input. Supported values: mp3_22050_32, mp3_24000_48, mp3_44100_32, mp3_44100_64, mp3_44100_96, mp3_44100_128, mp3_44100_192, pcm_8000, pcm_16000, pcm_22050, pcm_24000, pcm_32000, pcm_44100, pcm_48000, ulaw_8000, alaw_8000, opus_48000_32, opus_48000_64, opus_48000_96, opus_48000_128, opus_48000_192 |
Request Parameters
| Parameter | Type | Required/Optional | Description |
|---|---|---|---|
| audio | file | Required | The audio file that controls the content and emotion of the generated speech. |
| model_id | string | Optional, default is eleven_english_sts_v2 | The identifier of the model to be used, which you can query using GET /v1/models. The model must support the voice-to-voice function, which you can check via the can_do_voice_conversion attribute. |
| voice_settings | string | Optional | Voice settings overriding the stored settings for the given voice. They are applied only to the given request and must be sent as a JSON-encoded string. |
| seed | integer | Optional | If specified, the system will make every effort to perform deterministic sampling, so that repeated requests with the same seed and parameters should return the same results. However, determinism is not guaranteed. Must be an integer between 0 and 4294967295. |
| remove_background_noise | boolean | Optional, default is false | If enabled, our audio isolation model will be used to remove background noise from the audio input. Only applicable to voice changers. |
| file_format | enum | Optional | Format of the input audio. Options are 'pcm_s16le_16' or 'other'. For pcm_s16le_16, the input audio must be 16-bit PCM with a sampling rate of 16 kHz, mono (single channel), and little-endian byte order. Latency will be lower compared to passing encoded waveforms. Allowed values: pcm_s16le_16, other |
Response
Generated audio file
Request Example
curl 'https://genaiapi.cloudsway.net/v1/ai/{Your EndpointPath}/eleven-labs/voice-changer/{voice_id}?enable_logging=false' \
-H 'Authorization: Bearer {Your AK}' \
-H "Content-Type: multipart/form-data" \
-F "audio=@/path/to/input.mp3" \
-F "remove_background_noise=true" \
-F "seed=12345"
Appendix:
- Optional voice_id
| name | voice_id |
|---|---|
| Sarah | EXAVITQu4vr4xnSDxMaL |
| Laura | FGY2WhTYpPnrIDTdsKH5 |
| Charlie | IKne3meq5aSn9XLyUdCD |
| George | JBFqnCBsd6RMkjVDRZzb |
| Callum | N2lVS1w4EtoT3dr4eOWO |
| Liam | TX3LPaxmHKxFdv7VOQHJ |
| Charlotte | XB0fDUnXU5powFXDhCwa |
| Alice | Xb7hH8MSUJpSbSDYk0k2 |
| Matilda | XrExE9yKIg1WjnnlVkGX |
| Will | bIHbv24MWmeRgasZH58o |
| Jessica | cgSgspJ2msm6clMCkdW9 |
| Eric | cjVigY5qzO86Huf0OWal |
| Chris | iP95p4xoKVk53GoZ742B |
| Brian | nPczCjzI2devNBz1zQrb |
| Daniel | onwK4e9ZLuTAKqWW03F9 |
| Lily | pFZP5JQG7iQjIQuC4Bku |
| Bill | pqHfZKP75CvOlQylNhV4 |