MaaS_GP_4o_transcribe
MaaS_GP_4o_mini_transcribe & MaaS_GP_4o_transcribe_diarize
Request Protocol
HTTPS
Request Header
| Parameter Name | Value |
|---|---|
| Authorization | Bearer ${your AK} |
| Content-Type | multipart/form-data |
Request URL
https://genaiapi-m2.cloudsway.net/v1/ai/${ENDPOINT_PATH}/audio/transcriptions
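As a minimal sketch of how the header and URL above fit together, the Python helper below fills in the endpoint path and access key. The function name `build_request_target` and the sample values are hypothetical. The `Content-Type: multipart/form-data` header is left out on the assumption that the HTTP client generates it together with the multipart boundary.

```python
BASE_URL = "https://genaiapi-m2.cloudsway.net"


def build_request_target(endpoint_path: str, access_key: str) -> tuple[str, dict[str, str]]:
    """Return the transcription URL and the Authorization header for one endpoint."""
    if not endpoint_path or not access_key:
        raise ValueError("endpoint_path and access_key must be non-empty")
    # Strip stray slashes so the path segment is inserted exactly once.
    url = f"{BASE_URL}/v1/ai/{endpoint_path.strip('/')}/audio/transcriptions"
    headers = {"Authorization": f"Bearer {access_key}"}
    return url, headers


# Hypothetical values, for illustration only.
url, headers = build_request_target("my-endpoint-path", "my-access-key")
print(url)
print(headers)
```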
Request Body
Request Parameters
| Attribute Name | Type | Required/Optional | Description |
|---|---|---|---|
| file | file | Required | Audio file object (not file name) to be transcribed, supporting one of the following formats: flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, or webm. |
| chunking_strategy | "auto" or object | Optional | Controls how the audio is segmented into chunks. When set to "auto", the server first normalizes the loudness and then uses Voice Activity Detection (VAD) to choose boundaries. Alternatively, pass a server-side VAD object to adjust the VAD detection parameters manually. If not set, the audio is transcribed as a single block. This parameter is required when using gpt-4o-transcribe-diarize with inputs longer than 30 seconds. |
| include | array | Optional | Additional information to include in the transcription response. logprobs returns the log probabilities of the tokens in the response, which indicate the model's confidence in the transcription. logprobs is only valid when response_format is set to json and only applies to the models MaaS_GP_4o_transcribe and MaaS_GP_4o_mini_transcribe. This field is not supported when using gpt-4o-transcribe-diarize. |
| known_speaker_names | array | Optional | A list of speaker names corresponding to the audio samples in known_speaker_references[]. Each entry should be a short identifier (e.g., customer or agent). Up to 4 speakers are supported. Not applicable to MaaS_GP_4o_transcribe and MaaS_GP_4o_mini_transcribe. |
| known_speaker_references | array | Optional | A list of audio samples (as data URLs) of the known speakers, matching the entries in known_speaker_names. Each sample must be between 2 and 10 seconds long and may use any of the input audio formats supported by file. Not applicable to MaaS_GP_4o_transcribe and MaaS_GP_4o_mini_transcribe. |
| language | string | Optional | The language of the input audio. Supplying the language in ISO-639-1 format (e.g., en) can improve accuracy and reduce latency. |
| prompt | string | Optional | Optional text to guide the model's style or to continue from a previous audio segment. The prompt should match the audio language. This field is not supported when using gpt-4o-transcribe-diarize. |
| response_format | string | Optional | The output format. Possible values: json, text, srt, verbose_json, vtt, and diarized_json. For gpt-4o-transcribe and gpt-4o-mini-transcribe, the only supported format is json. For gpt-4o-transcribe-diarize, the supported formats are json, text, and diarized_json; diarized_json is required to receive speaker annotations. |
| stream | boolean | Optional | Default is false. If set to true, the model response data is streamed to the client using SSE as it is generated. Note: the whisper-1 model does not support streaming, and the parameter is ignored for it. |
| temperature | number | Optional | Default is 0. The sampling temperature, between 0 and 1. Higher values, such as 0.8, make the output more random, while lower values, such as 0.2, make it more focused and deterministic. If set to 0, the model uses log probability to increase the temperature automatically until certain thresholds are reached. |
| timestamp_granularities | array | Optional | The default is segment. The timestamp granularities to populate for this transcription. Using this parameter requires setting response_format to verbose_json. Supports either or both of the following options: word or segment. Note: segment timestamps do not incur additional latency, but generating word timestamps does. This option is not applicable to gpt-4o-transcribe-diarize. |
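Several parameters in the table are tied to specific models or to each other. The sketch below checks a parameter set against those constraints before a request is sent. It is illustrative only: the function `check_params`, its `audio_seconds` argument, and the use of the page-title names as model identifiers are assumptions, as is treating an unset response_format as json. The server remains the authority on what is accepted.

```python
from typing import Optional

DIARIZE_MODEL = "MaaS_GP_4o_transcribe_diarize"  # assumed identifier, taken from the page title
ALLOWED_FORMATS = {
    "MaaS_GP_4o_transcribe": {"json"},
    "MaaS_GP_4o_mini_transcribe": {"json"},
    DIARIZE_MODEL: {"json", "text", "diarized_json"},
}


def check_params(model: str, params: dict, audio_seconds: Optional[float] = None) -> list:
    """Return a list of problems found in params; an empty list means none were found."""
    problems = []
    diarize = model == DIARIZE_MODEL
    # Assumption: an unset response_format behaves like "json".
    response_format = params.get("response_format", "json")

    # Unknown models are not checked for format support.
    if response_format not in ALLOWED_FORMATS.get(model, {response_format}):
        problems.append(f"response_format={response_format} is not supported by {model}")

    if "logprobs" in params.get("include", []):
        if diarize:
            problems.append("include=logprobs is not supported by the diarize model")
        elif response_format != "json":
            problems.append("include=logprobs requires response_format=json")

    if diarize and "prompt" in params:
        problems.append("prompt is not supported by the diarize model")
    if diarize and "timestamp_granularities" in params:
        problems.append("timestamp_granularities is not applicable to the diarize model")
    if not diarize and ("known_speaker_names" in params or "known_speaker_references" in params):
        problems.append("known speakers apply only to the diarize model")
    if len(params.get("known_speaker_names", [])) > 4:
        problems.append("at most 4 known speakers are supported")
    if not 0 <= params.get("temperature", 0) <= 1:
        problems.append("temperature must be between 0 and 1")
    if diarize and audio_seconds is not None and audio_seconds > 30 and "chunking_strategy" not in params:
        problems.append("chunking_strategy is required for diarize inputs longer than 30 seconds")
    return problems


print(check_params(DIARIZE_MODEL, {"prompt": "Meeting notes", "include": ["logprobs"]}))
```

Returning a list of messages rather than raising on the first problem lets a caller report every conflict at once.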
Request Example
Default
Request
curl 'https://genaiapi-m2.cloudsway.net/v1/ai/$ENDPOINT_PATH/audio/transcriptions' \
-H "Authorization: Bearer ${your AK}" \
-F file="@/path/to/file/audio.mp3"
Response
{
"text": "Imagine the wildest idea that you've ever had, and you're curious about how it might scale to something that's a 100, a 1,000 times bigger. This is a place where you can get to do that.",
"usage": {
"type": "tokens",
"input_tokens": 14,
"input_token_details": {
"text_tokens": 0,
"audio_tokens": 14
},
"output_tokens": 45,
"total_tokens": 59
}
}
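To show how a client might read this response, here is a small sketch that extracts the transcript and the token counts from the JSON body. The function name `summarize_usage` is hypothetical, and the sample body is a shortened copy of the response above. In that sample, total_tokens equals input_tokens plus output_tokens.

```python
import json


def summarize_usage(response_body: str) -> dict:
    """Extract the transcript and token counts from a JSON transcription response."""
    payload = json.loads(response_body)
    usage = payload.get("usage", {})
    details = usage.get("input_token_details", {})
    return {
        "text": payload["text"],
        "audio_tokens": details.get("audio_tokens", 0),
        "text_tokens": details.get("text_tokens", 0),
        "input_tokens": usage.get("input_tokens", 0),
        "output_tokens": usage.get("output_tokens", 0),
        "total_tokens": usage.get("total_tokens", 0),
    }


# Shortened copy of the sample response (transcript text truncated for brevity).
body = """
{
  "text": "Imagine the wildest idea that you've ever had",
  "usage": {
    "type": "tokens",
    "input_tokens": 14,
    "input_token_details": {"text_tokens": 0, "audio_tokens": 14},
    "output_tokens": 45,
    "total_tokens": 59
  }
}
"""

summary = summarize_usage(body)
print(summary["text"])
print(summary["input_tokens"], summary["output_tokens"], summary["total_tokens"])
```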
Streaming
Request
curl 'https://genaiapi-m2.cloudsway.net/v1/ai/$ENDPOINT_PATH/audio/transcriptions' \
-H "Authorization: Bearer ${your AK}" \
-F file="@/path/to/file/audio.mp3" \
-F stream=true
Response
data: {"type":"transcript.text.delta","delta":"I","logprobs":[{"token":"I","logprob":-0.00007588794,"bytes":[73]}]}
data: {"type":"transcript.text.delta","delta":" see","logprobs":[{"token":" see","logprob":-3.1281633e-7,"bytes":[32,115,101,101]}]}
data: {"type":"transcript.text.delta","delta":" skies","logprobs":[{"token":" skies","logprob":-2.3392786e-6,"bytes":[32,115,107,105,101,115]}]}
data: {"type":"transcript.text.delta","delta":" of","logprobs":[{"token":" of","logprob":-3.1281633e-7,"bytes":[32,111,102]}]}
data: {"type":"transcript.text.delta","delta":" blue","logprobs":[{"token":" blue","logprob":-1.0280384e-6,"bytes":[32,98,108,117,101]}]}
data: {"type":"transcript.text.delta","delta":" and","logprobs":[{"token":" and","logprob":-0.0005108566,"bytes":[32,97,110,100]}]}
data: {"type":"transcript.text.delta","delta":" clouds","logprobs":[{"token":" clouds","logprob":-1.9361265e-7,"bytes":[32,99,108,111,117,100,115]}]}
data: {"type":"transcript.text.delta","delta":" of","logprobs":[{"token":" of","logprob":-1.9361265e-7,"bytes":[32,111,102]}]}
data: {"type":"transcript.text.delta","delta":" white","logprobs":[{"token":" white","logprob":-7.89631e-7,"bytes":[32,119,104,105,116,101]}]}
data: {"type":"transcript.text.delta","delta":",","logprobs":[{"token":",","logprob":-0.0014890312,"bytes":[44]}]}
data: {"type":"transcript.text.delta","delta":" the","logprobs":[{"token":" the","logprob":-0.0110956915,"bytes":[32,116,104,101]}]}
data: {"type":"transcript.text.delta","delta":" bright","logprobs":[{"token":" bright","logprob":0.0,"bytes":[32,98,114,105,103,104,116]}]}
data: {"type":"transcript.text.delta","delta":" blessed","logprobs":[{"token":" blessed","logprob":-0.000045848617,"bytes":[32,98,108,101,115,115,101,100]}]}
data: {"type":"transcript.text.delta","delta":" days","logprobs":[{"token":" days","logprob":-0.000010802739,"bytes":[32,100,97,121,115]}]}
data: {"type":"transcript.text.delta","delta":",","logprobs":[{"token":",","logprob":-0.00001700133,"bytes":[44]}]}
data: {"type":"transcript.text.delta","delta":" the","logprobs":[{"token":" the","logprob":-0.0000118755715,"bytes":[32,116,104,101]}]}
data: {"type":"transcript.text.delta","delta":" dark","logprobs":[{"token":" dark","logprob":-5.5122365e-7,"bytes":[32,100,97,114,107]}]}
data: {"type":"transcript.text.delta","delta":" sacred","logprobs":[{"token":" sacred","logprob":-5.4385737e-6,"bytes":[32,115,97,99,114,101,100]}]}
data: {"type":"transcript.text.delta","delta":" nights","logprobs":[{"token":" nights","logprob":-4.00813e-6,"bytes":[32,110,105,103,104,116,115]}]}
data: {"type":"transcript.text.delta","delta":",","logprobs":[{"token":",","logprob":-0.0036910512,"bytes":[44]}]}
data: {"type":"transcript.text.delta","delta":" and","logprobs":[{"token":" and","logprob":-0.0031903093,"bytes":[32,97,110,100]}]}
data: {"type":"transcript.text.delta","delta":" I","logprobs":[{"token":" I","logprob":-1.504853e-6,"bytes":[32,73]}]}
data: {"type":"transcript.text.delta","delta":" think","logprobs":[{"token":" think","logprob":-4.3202e-7,"bytes":[32,116,104,105,110,107]}]}
data: {"type":"transcript.text.delta","delta":" to","logprobs":[{"token":" to","logprob":-1.9361265e-7,"bytes":[32,116,111]}]}
data: {"type":"transcript.text.delta","delta":" myself","logprobs":[{"token":" myself","logprob":-1.7432603e-6,"bytes":[32,109,121,115,101,108,102]}]}
data: {"type":"transcript.text.delta","delta":",","logprobs":[{"token":",","logprob":-0.29254505,"bytes":[44]}]}
data: {"type":"transcript.text.delta","delta":" what","logprobs":[{"token":" what","logprob":-0.016815351,"bytes":[32,119,104,97,116]}]}
data: {"type":"transcript.text.delta","delta":" a","logprobs":[{"token":" a","logprob":-3.1281633e-7,"bytes":[32,97]}]}
data: {"type":"transcript.text.delta","delta":" wonderful","logprobs":[{"token":" wonderful","logprob":-2.1008714e-6,"bytes":[32,119,111,110,100,101,114,102,117,108]}]}
data: {"type":"transcript.text.delta","delta":" world","logprobs":[{"token":" world","logprob":-8.180258e-6,"bytes":[32,119,111,114,108,100]}]}
data: {"type":"transcript.text.delta","delta":".","logprobs":[{"token":".","logprob":-0.014231676,"bytes":[46]}]}
data: {"type":"transcript.text.done","text":"I see skies of blue and clouds of white, the bright blessed days, the dark sacred nights, and I think to myself, what a wonderful world.","logprobs":[{"token":"I","logprob":-0.00007588794,"bytes":[73]},{"token":" see","logprob":-3.1281633e-7,"bytes":[32,115,101,101]},{"token":" skies","logprob":-2.3392786e-6,"bytes":[32,115,107,105,101,115]},{"token":" of","logprob":-3.1281633e-7,"bytes":[32,111,102]},{"token":" blue","logprob":-1.0280384e-6,"bytes":[32,98,108,117,101]},{"token":" and","logprob":-0.0005108566,"bytes":[32,97,110,100]},{"token":" clouds","logprob":-1.9361265e-7,"bytes":[32,99,108,111,117,100,115]},{"token":" of","logprob":-1.9361265e-7,"bytes":[32,111,102]},{"token":" white","logprob":-7.89631e-7,"bytes":[32,119,104,105,116,101]},{"token":",","logprob":-0.0014890312,"bytes":[44]},{"token":" the","logprob":-0.0110956915,"bytes":[32,116,104,101]},{"token":" bright","logprob":0.0,"bytes":[32,98,114,105,103,104,116]},{"token":" blessed","logprob":-0.000045848617,"bytes":[32,98,108,101,115,115,101,100]},{"token":" days","logprob":-0.000010802739,"bytes":[32,100,97,121,115]},{"token":",","logprob":-0.00001700133,"bytes":[44]},{"token":" the","logprob":-0.0000118755715,"bytes":[32,116,104,101]},{"token":" dark","logprob":-5.5122365e-7,"bytes":[32,100,97,114,107]},{"token":" sacred","logprob":-5.4385737e-6,"bytes":[32,115,97,99,114,101,100]},{"token":" nights","logprob":-4.00813e-6,"bytes":[32,110,105,103,104,116,115]},{"token":",","logprob":-0.0036910512,"bytes":[44]},{"token":" and","logprob":-0.0031903093,"bytes":[32,97,110,100]},{"token":" I","logprob":-1.504853e-6,"bytes":[32,73]},{"token":" think","logprob":-4.3202e-7,"bytes":[32,116,104,105,110,107]},{"token":" to","logprob":-1.9361265e-7,"bytes":[32,116,111]},{"token":" myself","logprob":-1.7432603e-6,"bytes":[32,109,121,115,101,108,102]},{"token":",","logprob":-0.29254505,"bytes":[44]},{"token":" what","logprob":-0.016815351,"bytes":[32,119,104,97,116]},{"token":" a","logprob":-3.1281633e-7,"bytes":[32,97]},{"token":" wonderful","logprob":-2.1008714e-6,"bytes":[32,119,111,110,100,101,114,102,117,108]},{"token":" world","logprob":-8.180258e-6,"bytes":[32,119,111,114,108,100]},{"token":".","logprob":-0.014231676,"bytes":[46]}],"usage":{"input_tokens":14,"input_token_details":{"text_tokens":0,"audio_tokens":14},"output_tokens":45,"total_tokens":59}}
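The streamed events above can be consumed line by line. The following sketch joins the `transcript.text.delta` fragments and picks up the final text from the `transcript.text.done` event. The function name `collect_transcript` is hypothetical, the sample lines are a shortened stand-in for the stream above, and the code assumes that every `data:` line carries exactly one complete JSON object.

```python
import json


def collect_transcript(lines):
    """Return (joined_deltas, final_text) from an iterable of SSE lines."""
    pieces = []
    final_text = None
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and other SSE fields
        event = json.loads(line[len("data:"):].strip())
        if event.get("type") == "transcript.text.delta":
            pieces.append(event["delta"])
        elif event.get("type") == "transcript.text.done":
            final_text = event["text"]
    return "".join(pieces), final_text


# Shortened stand-in for the stream shown above (logprobs omitted).
sample_stream = [
    'data: {"type":"transcript.text.delta","delta":"I"}',
    "",
    'data: {"type":"transcript.text.delta","delta":" see"}',
    'data: {"type":"transcript.text.delta","delta":" skies"}',
    'data: {"type":"transcript.text.done","text":"I see skies"}',
]

joined, final = collect_transcript(sample_stream)
print(joined)
print(final)
```

Keeping both values lets a client display text incrementally from the deltas and then verify it against the text in the done event.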
Logprobs
Request
curl 'https://genaiapi-m2.cloudsway.net/v1/ai/$ENDPOINT_PATH/audio/transcriptions' \
-H "Authorization: Bearer ${your AK}" \
-F file="@/path/to/file/audio.mp3" \
-F "include[]=logprobs" \
-F response_format="json"
Response
{
"text": "Hey, my knee is hurting and I want to see the doctor tomorrow ideally.",
"logprobs": [
{ "token": "Hey", "logprob": -1.0415299, "bytes": [72, 101, 121] },
{ "token": ",", "logprob": -9.805982e-5, "bytes": [44] },
{ "token": " my", "logprob": -0.00229799, "bytes": [32, 109, 121] },
{
"token": " knee",
"logprob": -4.7159858e-5,
"bytes": [32, 107, 110, 101, 101]
},
{ "token": " is", "logprob": -0.043909557, "bytes": [32, 105, 115] },
{
"token": " hurting",
"logprob": -1.1041146e-5,
"bytes": [32, 104, 117, 114, 116, 105, 110, 103]
},
{ "token": " and", "logprob": -0.011076359, "bytes": [32, 97, 110, 100] },
{ "token": " I", "logprob": -5.3193703e-6, "bytes": [32, 73] },
{
"token": " want",
"logprob": -0.0017156356,
"bytes": [32, 119, 97, 110, 116]
},
{ "token": " to", "logprob": -7.89631e-7, "bytes": [32, 116, 111] },
{ "token": " see", "logprob": -5.5122365e-7, "bytes": [32, 115, 101, 101] },
{ "token": " the", "logprob": -0.0040786397, "bytes": [32, 116, 104, 101] },
{
"token": " doctor",
"logprob": -2.3392786e-6,
"bytes": [32, 100, 111, 99, 116, 111, 114]
},
{
"token": " tomorrow",
"logprob": -7.89631e-7,
"bytes": [32, 116, 111, 109, 111, 114, 114, 111, 119]
},
{
"token": " ideally",
"logprob": -0.5800861,
"bytes": [32, 105, 100, 101, 97, 108, 108, 121]
},
{ "token": ".", "logprob": -0.00011093382, "bytes": [46] }
],
"usage": {
"type": "tokens",
"input_tokens": 14,
"input_token_details": {
"text_tokens": 0,
"audio_tokens": 14
},
"output_tokens": 45,
"total_tokens": 59
}
}
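Because each `logprob` is a natural-log probability, `exp(logprob)` gives the model's probability for that token. The sketch below uses this to flag tokens the model was less sure about. The function name `low_confidence_tokens` and the 0.9 threshold are arbitrary choices for illustration, and the sample entries are a subset of the response above.

```python
import math


def low_confidence_tokens(logprobs, threshold=0.9):
    """Return (token, probability) pairs whose probability is below threshold."""
    flagged = []
    for entry in logprobs:
        probability = math.exp(entry["logprob"])
        if probability < threshold:
            # Rebuild the token text from its UTF-8 bytes; a token may hold a
            # partial multi-byte sequence, so undecodable bytes are replaced.
            token = bytes(entry["bytes"]).decode("utf-8", errors="replace")
            flagged.append((token, probability))
    return flagged


# Subset of the logprobs array from the sample response.
sample_logprobs = [
    {"token": "Hey", "logprob": -1.0415299, "bytes": [72, 101, 121]},
    {"token": ",", "logprob": -9.805982e-5, "bytes": [44]},
    {"token": " is", "logprob": -0.043909557, "bytes": [32, 105, 115]},
    {"token": " ideally", "logprob": -0.5800861, "bytes": [32, 105, 100, 101, 97, 108, 108, 121]},
]

for token, probability in low_confidence_tokens(sample_logprobs):
    print(repr(token), round(probability, 3))
```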
Unified Domain Name Call
Note: the unified interface accepts the same parameters as the endpoint-specific interface described above, except that the model field is required and must be set to the product code of the model.
Request URL
https://genaiapi-m2.cloudsway.net/v1/audio/transcriptions
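As a rough sketch of the difference, the helper below prepares the non-file form fields for the unified URL and refuses to proceed without a model. The function name `unified_form_fields` is hypothetical, and using `MaaS_GP_4o_transcribe` as the product code is an assumption based on the page title. The audio file part would still be attached by the HTTP client.

```python
UNIFIED_URL = "https://genaiapi-m2.cloudsway.net/v1/audio/transcriptions"


def unified_form_fields(model: str, **params) -> dict:
    """Return text form fields for the unified endpoint, including the required model."""
    if not model:
        raise ValueError("model (product code) is required for the unified endpoint")
    fields = {"model": model}
    for name, value in params.items():
        # Form fields are sent as text, so booleans become "true"/"false".
        if isinstance(value, bool):
            fields[name] = "true" if value else "false"
        else:
            fields[name] = str(value)
    return fields


# Assumed product code, for illustration only.
fields = unified_form_fields("MaaS_GP_4o_transcribe", stream=True, response_format="json")
print(UNIFIED_URL)
print(fields)
```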