MaaS_Ele
Request Protocol
HTTP

| Parameter Name | Value |
| --- | --- |
| x-api-key | your-api-key |
| content-type | application/json |
MaaS_Ele_scribe_v1
Request Method
POST
Request URL
https://genaiapi.cloudsway.net/v1/ai/{Your EndpointPath}/stt-e/speech-to-text
Query Parameters

| Attribute Name | Type | Required/Optional | Description |
| --- | --- | --- | --- |
| enable_logging | boolean | Optional. Defaults to true | When enable_logging is set to false, zero-retention mode is used for the request, so log and transcript storage features are unavailable for this request. Zero-retention mode may only be used by enterprise customers. |
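Because enable_logging travels in the query string rather than the request body, it is appended to the URL; a minimal sketch using Python's standard library (`{Your EndpointPath}` is the docs' placeholder, not a literal path):

```python
from urllib.parse import urlencode

# {Your EndpointPath} is a placeholder from the docs, not a literal path.
BASE_URL = "https://genaiapi.cloudsway.net/v1/ai/{Your EndpointPath}/stt-e/speech-to-text"

def build_stt_url(enable_logging: bool = True) -> str:
    # The server defaults enable_logging to true; only zero-retention
    # requests (enterprise customers only) need to pass false explicitly.
    return BASE_URL + "?" + urlencode({"enable_logging": str(enable_logging).lower()})
```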
Request Parameters

| Attribute Name | Type | Required/Optional | Description |
| --- | --- | --- | --- |
| name | string | Required | The name that identifies this voice. This will be displayed in the dropdown of the website. |
| file | file | Optional | The file to transcribe. All major audio and video formats are supported. Exactly one of the file or cloud_storage_url parameters must be provided. The file size must be less than 3.0 GB. |
| language_code | string or null | Optional | An ISO-639-1 or ISO-639-3 language code corresponding to the language of the audio file. Can sometimes improve transcription performance if known beforehand. Defaults to null, in which case the language is predicted automatically. |
| tag_audio_events | boolean | Optional. Defaults to true | Whether to tag audio events like (laughter), (footsteps), etc. in the transcription. |
| num_speakers | integer or null | Optional. 1-32 | The maximum number of speakers talking in the uploaded file. Can help with predicting who speaks when. The maximum number of speakers that can be predicted is 32. Defaults to null, in which case the number of speakers is set to the maximum value the model supports. |
| timestamps_granularity | enum | Optional. Defaults to word | The granularity of the timestamps in the transcription. 'word' provides word-level timestamps and 'character' provides character-level timestamps per word. Allowed values: none, word, character |
| diarize | boolean | Optional. Defaults to false | Whether to annotate which speaker is currently talking in the uploaded file. |
| diarization_threshold | double or null | Optional. 0.1-0.4 | Diarization threshold to apply during speaker diarization. A higher value lowers the chance of one speaker being diarized as two different speakers, but raises the chance of two different speakers being diarized as one (fewer total speakers predicted). A lower value does the opposite (more total speakers predicted). Can only be set when diarize=true and num_speakers is null. Defaults to null, in which case a threshold is chosen based on the model_id (usually 0.22). |
| additional_formats | list of objects | Optional | A list of additional formats to export the transcript to. See the object fields listed below. |
| file_format | enum | Optional. Defaults to other | The format of input audio. Options are 'pcm_s16le_16' or 'other'. For pcm_s16le_16, the input audio must be 16-bit PCM at a 16 kHz sample rate, single channel (mono), and little-endian byte order. Latency will be lower than with passing an encoded waveform. Allowed values: pcm_s16le_16, other |
| cloud_storage_url | string or null | Optional | The HTTPS URL of the file to transcribe. Exactly one of the file or cloud_storage_url parameters must be provided. The file must be accessible via HTTPS and the file size must be less than 2 GB. Any valid HTTPS URL is accepted, including URLs from cloud storage providers (AWS S3, Google Cloud Storage, Cloudflare R2, etc.), CDNs, or any other HTTPS source. URLs can be pre-signed or include authentication tokens in query parameters. |
| temperature | double or null | Optional. 0-2 | Controls the randomness of the transcription output. Accepts values between 0.0 and 2.0, where higher values result in more diverse and less deterministic results. If omitted, a temperature based on the selected model is used, usually 0. |
| seed | integer or null | Optional. 0-2147483647 | If specified, the system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed. Must be an integer between 0 and 2147483647. |
| use_multi_channel | boolean | Optional. Defaults to false | Whether the audio file contains multiple channels where each channel contains a single speaker. When enabled, each channel is transcribed independently and the results are combined. Each word in the response will include a 'channel_index' field indicating which channel it was spoken on. A maximum of 5 channels is supported. |
additional_formats

| Attribute Name | Type | Required/Optional | Description |
| --- | --- | --- | --- |
| format | string | Required | Allowed values: segmented_json, docx, pdf, txt, html, srt |
| max_characters_per_line | integer or null | Optional. Defaults to 100 | This parameter is only supported when format is txt or srt. |
| include_speakers | boolean | Optional. Defaults to true | |
| include_timestamps | boolean | Optional. Defaults to true | |
| segment_on_silence_longer_than_s | double or null | Optional | |
| max_segment_duration_s | double or null | Optional | |
| max_segment_chars | integer or null | Optional | |
Request Example
curl 'https://genaiapi.cloudsway.net/v1/ai/{Your EndpointPath}/stt-e/speech-to-text' \
  -H 'Authorization: Bearer {Your AK}' \
  -F 'file=@"/path/to/audio.mp3"' \
  -F 'tag_audio_events="true"'
Response Example
Single channel response
{
"words": [
{
"start": 0.159,
"type": "word",
"logprob": 0.0,
"end": 0.359,
"text": "A"
},
{
"start": 0.359,
"type": "spacing",
"logprob": 0.0,
"end": 0.36,
"text": " "
},
{
"start": 0.36,
"type": "word",
"logprob": 0.0,
"end": 0.679,
"text": "shared"
},
{
"start": 0.679,
"type": "spacing",
"logprob": 0.0,
"end": 0.679,
"text": " "
},
{
"start": 0.679,
"type": "word",
"logprob": 0.0,
"end": 0.919,
"text": "goal"
},
{
"start": 0.919,
"type": "spacing",
"logprob": 0.0,
"end": 0.959,
"text": " "
},
{
"start": 0.959,
"type": "word",
"logprob": 0.0,
"end": 1.039,
"text": "is"
},
{
"start": 1.039,
"type": "spacing",
"logprob": 0.0,
"end": 1.059,
"text": " "
},
{
"start": 1.059,
"type": "word",
"logprob": 0.0,
"end": 1.159,
"text": "the"
},
{
"start": 1.159,
"type": "spacing",
"logprob": 0.0,
"end": 1.179,
"text": " "
},
{
"start": 1.179,
"type": "word",
"logprob": 0.0,
"end": 1.519,
"text": "heartbeat"
},
{
"start": 1.519,
"type": "spacing",
"logprob": 0.0,
"end": 1.539,
"text": " "
},
{
"start": 1.539,
"type": "word",
"logprob": 0.0,
"end": 1.659,
"text": "of"
},
{
"start": 1.659,
"type": "spacing",
"logprob": 0.0,
"end": 1.659,
"text": " "
},
{
"start": 1.659,
"type": "word",
"logprob": 0.0,
"end": 2.199,
"text": "teamwork."
}
],
"language_code": "eng",
"transcription_id": "DH4a9n1brE06R8BUQPEx",
"language_probability": 0.9907509684562683,
"text": "A shared goal is the heartbeat of teamwork."
}
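As a sanity check on the schema, the "word" and "spacing" tokens in the words array concatenate back to the top-level text field; a small sketch over a trimmed version of the response above:

```python
# Trimmed version of the single-channel response shown above.
response = {
    "words": [
        {"start": 0.159, "type": "word", "end": 0.359, "text": "A"},
        {"start": 0.359, "type": "spacing", "end": 0.36, "text": " "},
        {"start": 0.36, "type": "word", "end": 0.679, "text": "shared"},
        {"start": 0.679, "type": "spacing", "end": 0.679, "text": " "},
        {"start": 0.679, "type": "word", "end": 0.919, "text": "goal."},
    ],
    "text": "A shared goal.",
}

# "word" and "spacing" tokens together reproduce the full transcript text.
reconstructed = "".join(token["text"] for token in response["words"])
```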
Multi channel response
{
"transcripts": [
{
"language_code": "en",
"language_probability": 0.98,
"text": "Hello from channel one.",
"words": [
{
"text": "Hello",
"start": 0,
"end": 0.5,
"type": "word",
"speaker_id": "speaker_0",
"logprob": -0.124
}
]
},
{
"language_code": "en",
"language_probability": 0.97,
"text": "Greetings from channel two.",
"words": [
{
"text": "Greetings",
"start": 0.1,
"end": 0.7,
"type": "word",
"speaker_id": "speaker_1",
"logprob": -0.156
}
]
}
]
}
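With use_multi_channel enabled, each element of transcripts corresponds to one channel; a sketch that collects the per-channel texts from the example above (assuming the list order matches channel order, which the response's channel_index fields make explicit per word):

```python
# Trimmed multi-channel response from the example above.
response = {
    "transcripts": [
        {"language_code": "en", "text": "Hello from channel one."},
        {"language_code": "en", "text": "Greetings from channel two."},
    ]
}

# Map each channel (at most 5 are supported) to its transcript text.
per_channel = {i: t["text"] for i, t in enumerate(response["transcripts"])}
```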
MaaS_Ele_voice_clones
Request Method
POST
Request URL
https://genaiapi.cloudsway.net/v1/ai/{Your EndpointPath}/elevenlabs/voices/add
Request Parameters

| Attribute Name | Type | Required/Optional | Description |
| --- | --- | --- | --- |
| name | string | Required | The name that identifies this voice. This will be displayed in the dropdown of the website. |
| files | files | Required | A list of file paths to audio recordings intended for voice cloning. |
| remove_background_noise | boolean | Optional. Defaults to false | Whether to remove background noise from the uploaded voice samples. |
| description | string or null | Optional | A description of the voice. |
| labels | string or null | Optional | Serialized labels dictionary for the voice. |
Request Example
curl 'https://genaiapi.cloudsway.net/v1/ai/{Your EndpointPath}/elevenlabs/voices/add' \
  -H 'Authorization: Bearer {Your AK}' \
  -F 'name="girl-voice-1"' \
  -F 'files=@"/path/to/sample.mp3"' \
  -F 'remove_background_noise="false"'
Response Example
{
"voice_id": "c38kUX8pkfYO2kHyqfFy",
"requires_verification": false
}
MaaS_Ele_tts_v3
Request Method
POST
Request URL
https://genaiapi.cloudsway.net/v1/ai/{Your EndpointPath}/tts-e/text-to-speech/{voice_id}
Path parameters

| Attribute Name | Type | Required/Optional | Description |
| --- | --- | --- | --- |
| voice_id | string | Required | ID of the voice to be used. |
Query parameters

| Attribute Name | Type | Required/Optional | Description |
| --- | --- | --- | --- |
| enable_logging | boolean | Optional. Defaults to true | When enable_logging is set to false, zero-retention mode is used for the request, so history features are unavailable for this request, including request stitching. Zero-retention mode may only be used by enterprise customers. |
| output_format | enum | Optional. Defaults to mp3_44100_128 | Output format of the generated audio, formatted as codec_sample_rate_bitrate. For example, an MP3 with a 22.05 kHz sample rate at 32 kbps is represented as mp3_22050_32. MP3 with a 192 kbps bitrate requires a Creator tier subscription or above; PCM with a 44.1 kHz sample rate requires a Pro tier subscription or above. Note that the μ-law format (sometimes written mu-law, often approximated as u-law) is commonly used for Twilio audio inputs. Allowed values: mp3_22050_32, mp3_24000_48, mp3_44100_32, mp3_44100_64, mp3_44100_96, mp3_44100_128, mp3_44100_192, pcm_8000, pcm_16000, pcm_22050, pcm_24000, pcm_32000, pcm_44100, pcm_48000, ulaw_8000, alaw_8000, opus_48000_32, opus_48000_64, opus_48000_96, opus_48000_128 |
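The codec_sample_rate_bitrate naming can be unpacked mechanically; a small helper sketch (not part of the API itself), noting that the PCM, μ-law, and a-law tokens omit the bitrate component:

```python
def parse_output_format(value: str) -> dict:
    """Split an output_format token such as 'mp3_44100_128' into parts.

    PCM/ulaw/alaw formats have no bitrate component, so it is None.
    """
    parts = value.split("_")
    return {
        "codec": parts[0],
        "sample_rate_hz": int(parts[1]),
        "bitrate_kbps": int(parts[2]) if len(parts) > 2 else None,
    }
```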
Request

| Attribute Name | Type | Required/Optional | Description |
| --- | --- | --- | --- |
| text | string | Required | The text that will get converted into speech. |
| language_code | string or null | Optional | Language code (ISO 639-1) used to enforce a language for the model and text normalization. If the model does not support the provided language code, an error will be returned. |
| voice_settings | object or null | Optional | Voice settings overriding the stored settings for the given voice. They are applied only to the given request. |
| pronunciation_dictionary_locators | list of objects or null | Optional | A list of pronunciation dictionary locators (id, version_id) to be applied to the text. They will be applied in order. You may have up to 3 locators per request. |
| seed | integer or null | Optional | If specified, the system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed. Must be an integer between 0 and 4294967295. |
| previous_text | string or null | Optional | The text that came before the text of the current request. Can be used to improve the speech's continuity when concatenating multiple generations, or to influence the speech's continuity in the current generation. |
| next_text | string or null | Optional | The text that comes after the text of the current request. Can be used to improve the speech's continuity when concatenating multiple generations, or to influence the speech's continuity in the current generation. |
| previous_request_ids | list of strings or null | Optional | A list of request_ids of the samples that were generated before this generation. Can be used to improve the speech's continuity when splitting up a large task into multiple requests. Results are best when the same model is used across the generations. If both previous_text and previous_request_ids are sent, previous_text will be ignored. A maximum of 3 request_ids can be sent. |
| next_request_ids | list of strings or null | Optional | A list of request_ids of the samples that come after this generation. next_request_ids is especially useful for maintaining the speech's continuity when regenerating a sample that had audio quality issues. For example, if you have generated 3 speech clips and want to improve clip 2, passing the request_id of clip 3 as a next_request_id (and that of clip 1 as a previous_request_id) helps maintain natural flow in the combined speech. Results are best when the same model is used across the generations. If both next_text and next_request_ids are sent, next_text will be ignored. A maximum of 3 request_ids can be sent. |
| apply_text_normalization | enum | Optional. Defaults to auto | Controls text normalization with three modes: 'auto', 'on', and 'off'. When set to 'auto', the system automatically decides whether to apply text normalization (e.g., spelling out numbers). With 'on', text normalization is always applied; with 'off', it is skipped. Allowed values: auto, on, off |
voice_settings

| Attribute Name | Type | Required/Optional | Description |
| --- | --- | --- | --- |
| stability | double or null | Optional. 0-1. Defaults to 0.5 | Determines how stable the voice is and the randomness between each generation. Lower values introduce a broader emotional range for the voice; higher values can result in a monotonous voice with limited emotion. |
| use_speaker_boost | boolean or null | Optional. Defaults to true | Boosts the similarity to the original speaker. Using this setting requires a slightly higher computational load, which in turn increases latency. |
| similarity_boost | double or null | Optional. 0-1. Defaults to 0.75 | Determines how closely the AI should adhere to the original voice when attempting to replicate it. |
| style | double or null | Optional. Defaults to 0 | Determines the style exaggeration of the voice. This setting attempts to amplify the style of the original speaker. It consumes additional computational resources and may increase latency if set to anything other than 0. |
| speed | double or null | Optional. Defaults to 1 | Adjusts the speed of the voice. A value of 1.0 is the default speed; values less than 1.0 slow down the speech and values greater than 1.0 speed it up. |
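The ranges in the table can be checked client-side before sending; a hedged sketch of a voice_settings builder (defaults taken from the table above, the helper name is illustrative):

```python
def make_voice_settings(stability=0.5, similarity_boost=0.75,
                        style=0.0, speed=1.0, use_speaker_boost=True):
    # stability and similarity_boost are documented as 0-1 ranges.
    for name, value in (("stability", stability),
                        ("similarity_boost", similarity_boost)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1")
    return {
        "stability": stability,
        "similarity_boost": similarity_boost,
        "style": style,  # values other than 0 may add latency
        "speed": speed,  # 1.0 is the default speed
        "use_speaker_boost": use_speaker_boost,
    }
```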
Request Example
curl 'https://genaiapi.cloudsway.net/v1/ai/{Your EndpointPath}/tts-e/text-to-speech/{voice_id}?enable_logging=false' \
-H 'Authorization: Bearer {Your AK}' \
-H 'Content-Type: application/json' \
-d '{
"text": "A shared goal is the heartbeat of teamwork."
}'
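For long texts split across several requests, the previous_text/next_text fields (or their request-id variants) keep prosody continuous across generations; a sketch that builds one request body per chunk (the chunking scheme is illustrative, not mandated by the API):

```python
def build_stitched_bodies(chunks):
    """Build one TTS request body per text chunk, wiring each body's
    previous_text/next_text to its neighbors for continuity."""
    bodies = []
    for i, chunk in enumerate(chunks):
        body = {"text": chunk}
        if i > 0:
            body["previous_text"] = chunks[i - 1]
        if i < len(chunks) - 1:
            body["next_text"] = chunks[i + 1]
        bodies.append(body)
    return bodies
```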