MaaS_GP_4o_transcribe

MaaS_GP_4o_mini_transcribe & MaaS_GP_4o_transcribe_diarize

Request Protocol

HTTPS

Request Header

Parameter Name Value
Authorization Bearer ${YOUR_AK}
Content-Type multipart/form-data

Request URL

https://genaiapi-m2.cloudsway.net/v1/ai/${ENDPOINT_PATH}/audio/transcriptions
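
As a minimal sketch of how the request URL and header above fit together, the Python helpers below fill in the endpoint path and build the Authorization header. The function names and the sample endpoint path are assumptions made for illustration; they are not part of the API.

```python
from urllib.parse import quote

BASE_URL = "https://genaiapi-m2.cloudsway.net"


def build_transcription_url(endpoint_path):
    """Return the transcription URL for one endpoint path."""
    # Percent-encode the path segment so unusual characters cannot break the URL.
    segment = quote(endpoint_path, safe="")
    return f"{BASE_URL}/v1/ai/{segment}/audio/transcriptions"


def build_auth_headers(access_key):
    """Return the Authorization header.

    Content-Type is left out on purpose: multipart/form-data needs a
    boundary parameter that is only known once the body is encoded.
    """
    return {"Authorization": f"Bearer {access_key}"}


if __name__ == "__main__":
    # "my-endpoint" is a hypothetical endpoint path.
    print(build_transcription_url("my-endpoint"))
```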

Request Body

Request Parameters

Attribute Name Type Required/Optional Description
file file Required Audio file object (not file name) to be transcribed, supporting one of the following formats: flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, or webm.
chunking_strategy "auto" or object Optional Controls how the audio is cut into chunks. When set to "auto", the server first normalizes loudness and then uses Voice Activity Detection (VAD) to choose chunk boundaries. Passing a server-side VAD object instead allows manual adjustment of the VAD detection parameters. If not set, the audio is transcribed as a single block. This parameter is required when using gpt-4o-transcribe-diarize with inputs longer than 30 seconds.
include array Optional Additional information to include in the transcription response. logprobs returns the log probabilities of the tokens in the response, which indicate the model's confidence in the transcription. logprobs is only valid when response_format is set to json, and only applies to MaaS_GP_4o_transcribe and MaaS_GP_4o_mini_transcribe. This field is not supported when using gpt-4o-transcribe-diarize.
known_speaker_names array Optional A list of speaker names corresponding to the audio samples in known_speaker_references[]. Each entry should be a short identifier (e.g., customer or agent). Up to 4 speakers are supported. Not applicable to MaaS_GP_4o_transcribe and MaaS_GP_4o_mini_transcribe.
known_speaker_references array Optional A list of audio samples (as data URLs) containing reference recordings of the known speakers, matching known_speaker_names[]. Each sample must be between 2 and 10 seconds long and can use any of the input audio formats supported by file. Not applicable to MaaS_GP_4o_transcribe and MaaS_GP_4o_mini_transcribe.
language string Optional The language of the input audio. Supplying it in ISO-639-1 format (e.g., en) can improve accuracy and reduce latency.
prompt string Optional Optional text to guide the model's style or to continue from a previous audio segment. The prompt should match the audio language. This field is not supported when using gpt-4o-transcribe-diarize.
response_format string Optional The output format, one of: json, text, srt, verbose_json, vtt, or diarized_json. For gpt-4o-transcribe and gpt-4o-mini-transcribe, the only supported format is json. For gpt-4o-transcribe-diarize, the supported formats are json, text, and diarized_json; diarized_json is required to receive speaker annotations.
stream boolean Optional Default is false. If set to true, the model response data is streamed to the client using SSE as it is generated. Note: The whisper-1 model does not support streaming; the parameter is ignored for that model.
temperature number Optional Default is 0. The sampling temperature, between 0 and 1. Higher values, such as 0.8, make the output more random, while lower values, such as 0.2, make it more focused and deterministic. If set to 0, the model uses log probability to automatically increase the temperature until certain thresholds are reached.
timestamp_granularities array Optional Default is segment. The timestamp granularities to populate for this transcription. Using this parameter requires setting response_format to verbose_json. Supports either or both of the following options: word or segment. Note: Segment timestamps do not incur additional latency, but generating word timestamps does. This option is not available for gpt-4o-transcribe-diarize.
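
The model-specific restrictions in the table above can be checked on the client before a request is sent. The following Python sketch encodes them as one function. The function name, the message strings, the fallback to json when response_format is omitted, and the requirement that the two speaker lists have equal length are all assumptions made for illustration.

```python
# Both spellings of the diarize model appear in this document.
DIARIZE_MODELS = {"gpt-4o-transcribe-diarize", "MaaS_GP_4o_transcribe_diarize"}


def check_params(model, params):
    """Return a list of conflicts between the model and the request parameters."""
    problems = []
    diarize = model in DIARIZE_MODELS
    # Assumption: response_format falls back to json when omitted.
    fmt = params.get("response_format", "json")

    if "logprobs" in params.get("include", []):
        if fmt != "json":
            problems.append("logprobs requires response_format=json")
        if diarize:
            problems.append("logprobs is not supported by the diarize model")
    if diarize and "prompt" in params:
        problems.append("prompt is not supported by the diarize model")
    if "timestamp_granularities" in params:
        if fmt != "verbose_json":
            problems.append("timestamp_granularities requires response_format=verbose_json")
        if diarize:
            problems.append("timestamp_granularities is not supported by the diarize model")
    if fmt == "diarized_json" and not diarize:
        problems.append("diarized_json is only available for the diarize model")

    names = params.get("known_speaker_names", [])
    refs = params.get("known_speaker_references", [])
    if (names or refs) and not diarize:
        problems.append("known speakers apply only to the diarize model")
    if len(names) > 4:
        problems.append("at most 4 known speakers are supported")
    # Assumption: each name corresponds to exactly one reference sample.
    if len(names) != len(refs):
        problems.append("known_speaker_names and known_speaker_references differ in length")
    return problems
```

An empty list means none of the restrictions listed in the table was violated; the server may still reject a request for other reasons.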

Request Example

Default

Request

curl "https://genaiapi-m2.cloudsway.net/v1/ai/${ENDPOINT_PATH}/audio/transcriptions" \
  -H "Authorization: Bearer ${YOUR_AK}" \
  -F file="@/path/to/file/audio.mp3"

Response

{
  "text": "Imagine the wildest idea that you've ever had, and you're curious about how it might scale to something that's a 100, a 1,000 times bigger. This is a place where you can get to do that.",
  "usage": {
    "type": "tokens",
    "input_tokens": 14,
    "input_token_details": {
      "text_tokens": 0,
      "audio_tokens": 14
    },
    "output_tokens": 45,
    "total_tokens": 59
  }
}
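
For clients that cannot shell out to curl, the same default request can be sketched with the Python standard library. The helper names below are hypothetical, the file part is sent as application/octet-stream for simplicity, and the sketch assumes the response is JSON as in the example above.

```python
import json
import os
import urllib.request
import uuid


def encode_multipart(fields, file_name, file_bytes):
    """Encode text fields and one 'file' part as multipart/form-data.

    Returns a (body, content_type) tuple.
    """
    boundary = uuid.uuid4().hex
    chunks = []
    for name, value in fields.items():
        chunks.append(
            f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"'
            f"\r\n\r\n{value}\r\n".encode()
        )
    chunks.append(
        f'--{boundary}\r\nContent-Disposition: form-data; name="file"; '
        f'filename="{file_name}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n".encode()
    )
    chunks.append(file_bytes)
    chunks.append(f"\r\n--{boundary}--\r\n".encode())
    return b"".join(chunks), f"multipart/form-data; boundary={boundary}"


def transcribe(url, access_key, audio_path, **fields):
    """POST the audio file and return the decoded JSON response."""
    with open(audio_path, "rb") as handle:
        body, content_type = encode_multipart(
            fields, os.path.basename(audio_path), handle.read()
        )
    request = urllib.request.Request(
        url,
        data=body,
        headers={
            "Authorization": f"Bearer {access_key}",
            "Content-Type": content_type,
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

With a response shaped like the one above, the transcript is available as result["text"] and the token counts under result["usage"].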

Streaming

Request

curl "https://genaiapi-m2.cloudsway.net/v1/ai/${ENDPOINT_PATH}/audio/transcriptions" \
  -H "Authorization: Bearer ${YOUR_AK}" \
  -F file="@/path/to/file/audio.mp3" \
  -F stream=true

Response

data: {"type":"transcript.text.delta","delta":"I","logprobs":[{"token":"I","logprob":-0.00007588794,"bytes":[73]}]}

data: {"type":"transcript.text.delta","delta":" see","logprobs":[{"token":" see","logprob":-3.1281633e-7,"bytes":[32,115,101,101]}]}

data: {"type":"transcript.text.delta","delta":" skies","logprobs":[{"token":" skies","logprob":-2.3392786e-6,"bytes":[32,115,107,105,101,115]}]}

data: {"type":"transcript.text.delta","delta":" of","logprobs":[{"token":" of","logprob":-3.1281633e-7,"bytes":[32,111,102]}]}

data: {"type":"transcript.text.delta","delta":" blue","logprobs":[{"token":" blue","logprob":-1.0280384e-6,"bytes":[32,98,108,117,101]}]}

data: {"type":"transcript.text.delta","delta":" and","logprobs":[{"token":" and","logprob":-0.0005108566,"bytes":[32,97,110,100]}]}

data: {"type":"transcript.text.delta","delta":" clouds","logprobs":[{"token":" clouds","logprob":-1.9361265e-7,"bytes":[32,99,108,111,117,100,115]}]}

data: {"type":"transcript.text.delta","delta":" of","logprobs":[{"token":" of","logprob":-1.9361265e-7,"bytes":[32,111,102]}]}

data: {"type":"transcript.text.delta","delta":" white","logprobs":[{"token":" white","logprob":-7.89631e-7,"bytes":[32,119,104,105,116,101]}]}

data: {"type":"transcript.text.delta","delta":",","logprobs":[{"token":",","logprob":-0.0014890312,"bytes":[44]}]}

data: {"type":"transcript.text.delta","delta":" the","logprobs":[{"token":" the","logprob":-0.0110956915,"bytes":[32,116,104,101]}]}

data: {"type":"transcript.text.delta","delta":" bright","logprobs":[{"token":" bright","logprob":0.0,"bytes":[32,98,114,105,103,104,116]}]}

data: {"type":"transcript.text.delta","delta":" blessed","logprobs":[{"token":" blessed","logprob":-0.000045848617,"bytes":[32,98,108,101,115,115,101,100]}]}

data: {"type":"transcript.text.delta","delta":" days","logprobs":[{"token":" days","logprob":-0.000010802739,"bytes":[32,100,97,121,115]}]}

data: {"type":"transcript.text.delta","delta":",","logprobs":[{"token":",","logprob":-0.00001700133,"bytes":[44]}]}

data: {"type":"transcript.text.delta","delta":" the","logprobs":[{"token":" the","logprob":-0.0000118755715,"bytes":[32,116,104,101]}]}

data: {"type":"transcript.text.delta","delta":" dark","logprobs":[{"token":" dark","logprob":-5.5122365e-7,"bytes":[32,100,97,114,107]}]}

data: {"type":"transcript.text.delta","delta":" sacred","logprobs":[{"token":" sacred","logprob":-5.4385737e-6,"bytes":[32,115,97,99,114,101,100]}]}

data: {"type":"transcript.text.delta","delta":" nights","logprobs":[{"token":" nights","logprob":-4.00813e-6,"bytes":[32,110,105,103,104,116,115]}]}

data: {"type":"transcript.text.delta","delta":",","logprobs":[{"token":",","logprob":-0.0036910512,"bytes":[44]}]}

data: {"type":"transcript.text.delta","delta":" and","logprobs":[{"token":" and","logprob":-0.0031903093,"bytes":[32,97,110,100]}]}

data: {"type":"transcript.text.delta","delta":" I","logprobs":[{"token":" I","logprob":-1.504853e-6,"bytes":[32,73]}]}

data: {"type":"transcript.text.delta","delta":" think","logprobs":[{"token":" think","logprob":-4.3202e-7,"bytes":[32,116,104,105,110,107]}]}

data: {"type":"transcript.text.delta","delta":" to","logprobs":[{"token":" to","logprob":-1.9361265e-7,"bytes":[32,116,111]}]}

data: {"type":"transcript.text.delta","delta":" myself","logprobs":[{"token":" myself","logprob":-1.7432603e-6,"bytes":[32,109,121,115,101,108,102]}]}

data: {"type":"transcript.text.delta","delta":",","logprobs":[{"token":",","logprob":-0.29254505,"bytes":[44]}]}

data: {"type":"transcript.text.delta","delta":" what","logprobs":[{"token":" what","logprob":-0.016815351,"bytes":[32,119,104,97,116]}]}

data: {"type":"transcript.text.delta","delta":" a","logprobs":[{"token":" a","logprob":-3.1281633e-7,"bytes":[32,97]}]}

data: {"type":"transcript.text.delta","delta":" wonderful","logprobs":[{"token":" wonderful","logprob":-2.1008714e-6,"bytes":[32,119,111,110,100,101,114,102,117,108]}]}

data: {"type":"transcript.text.delta","delta":" world","logprobs":[{"token":" world","logprob":-8.180258e-6,"bytes":[32,119,111,114,108,100]}]}

data: {"type":"transcript.text.delta","delta":".","logprobs":[{"token":".","logprob":-0.014231676,"bytes":[46]}]}

data: {"type":"transcript.text.done","text":"I see skies of blue and clouds of white, the bright blessed days, the dark sacred nights, and I think to myself, what a wonderful world.","logprobs":[{"token":"I","logprob":-0.00007588794,"bytes":[73]},{"token":" see","logprob":-3.1281633e-7,"bytes":[32,115,101,101]},{"token":" skies","logprob":-2.3392786e-6,"bytes":[32,115,107,105,101,115]},{"token":" of","logprob":-3.1281633e-7,"bytes":[32,111,102]},{"token":" blue","logprob":-1.0280384e-6,"bytes":[32,98,108,117,101]},{"token":" and","logprob":-0.0005108566,"bytes":[32,97,110,100]},{"token":" clouds","logprob":-1.9361265e-7,"bytes":[32,99,108,111,117,100,115]},{"token":" of","logprob":-1.9361265e-7,"bytes":[32,111,102]},{"token":" white","logprob":-7.89631e-7,"bytes":[32,119,104,105,116,101]},{"token":",","logprob":-0.0014890312,"bytes":[44]},{"token":" the","logprob":-0.0110956915,"bytes":[32,116,104,101]},{"token":" bright","logprob":0.0,"bytes":[32,98,114,105,103,104,116]},{"token":" blessed","logprob":-0.000045848617,"bytes":[32,98,108,101,115,115,101,100]},{"token":" days","logprob":-0.000010802739,"bytes":[32,100,97,121,115]},{"token":",","logprob":-0.00001700133,"bytes":[44]},{"token":" the","logprob":-0.0000118755715,"bytes":[32,116,104,101]},{"token":" dark","logprob":-5.5122365e-7,"bytes":[32,100,97,114,107]},{"token":" sacred","logprob":-5.4385737e-6,"bytes":[32,115,97,99,114,101,100]},{"token":" nights","logprob":-4.00813e-6,"bytes":[32,110,105,103,104,116,115]},{"token":",","logprob":-0.0036910512,"bytes":[44]},{"token":" and","logprob":-0.0031903093,"bytes":[32,97,110,100]},{"token":" I","logprob":-1.504853e-6,"bytes":[32,73]},{"token":" think","logprob":-4.3202e-7,"bytes":[32,116,104,105,110,107]},{"token":" to","logprob":-1.9361265e-7,"bytes":[32,116,111]},{"token":" myself","logprob":-1.7432603e-6,"bytes":[32,109,121,115,101,108,102]},{"token":",","logprob":-0.29254505,"bytes":[44]},{"token":" what","logprob":-0.016815351,"bytes":[32,119,104,97,116]},{"token":" a","logprob":-3.1281633e-7,"bytes":[32,97]},{"token":" wonderful","logprob":-2.1008714e-6,"bytes":[32,119,111,110,100,101,114,102,117,108]},{"token":" world","logprob":-8.180258e-6,"bytes":[32,119,111,114,108,100]},{"token":".","logprob":-0.014231676,"bytes":[46]}],"usage":{"input_tokens":14,"input_token_details":{"text_tokens":0,"audio_tokens":14},"output_tokens":45,"total_tokens":59}}
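
A streamed response like the one above arrives as SSE lines, each prefixed with "data: " and carrying one JSON event. The Python sketch below shows one way a client might consume them: it concatenates the transcript.text.delta fragments and prefers the full text from the transcript.text.done event when it arrives. The function names are hypothetical, and skipping a "[DONE]" sentinel is a defensive assumption, since the example stream does not show one.

```python
import json


def iter_events(lines):
    """Yield decoded JSON payloads from SSE 'data:' lines, skipping other lines."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue
        payload = line[len("data:"):].strip()
        # Assumption: tolerate an end-of-stream sentinel if the server sends one.
        if not payload or payload == "[DONE]":
            continue
        yield json.loads(payload)


def collect_transcript(lines):
    """Return the final transcript from a sequence of SSE lines."""
    parts = []
    for event in iter_events(lines):
        kind = event.get("type")
        if kind == "transcript.text.delta":
            parts.append(event["delta"])
        elif kind == "transcript.text.done":
            # The done event carries the complete text, so use it directly.
            return event["text"]
    # No done event seen: fall back to the concatenated deltas.
    return "".join(parts)
```

Leading spaces in each delta are part of the JSON string, so joining the fragments without a separator reproduces the original spacing.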

Logprobs

Request

curl "https://genaiapi-m2.cloudsway.net/v1/ai/${ENDPOINT_PATH}/audio/transcriptions" \
  -H "Authorization: Bearer ${YOUR_AK}" \
  -F file="@/path/to/file/audio.mp3" \
  -F "include[]=logprobs" \
  -F response_format="json"

Response

{
  "text": "Hey, my knee is hurting and I want to see the doctor tomorrow ideally.",
  "logprobs": [
    { "token": "Hey", "logprob": -1.0415299, "bytes": [72, 101, 121] },
    { "token": ",", "logprob": -9.805982e-5, "bytes": [44] },
    { "token": " my", "logprob": -0.00229799, "bytes": [32, 109, 121] },
    {
      "token": " knee",
      "logprob": -4.7159858e-5,
      "bytes": [32, 107, 110, 101, 101]
    },
    { "token": " is", "logprob": -0.043909557, "bytes": [32, 105, 115] },
    {
      "token": " hurting",
      "logprob": -1.1041146e-5,
      "bytes": [32, 104, 117, 114, 116, 105, 110, 103]
    },
    { "token": " and", "logprob": -0.011076359, "bytes": [32, 97, 110, 100] },
    { "token": " I", "logprob": -5.3193703e-6, "bytes": [32, 73] },
    {
      "token": " want",
      "logprob": -0.0017156356,
      "bytes": [32, 119, 97, 110, 116]
    },
    { "token": " to", "logprob": -7.89631e-7, "bytes": [32, 116, 111] },
    { "token": " see", "logprob": -5.5122365e-7, "bytes": [32, 115, 101, 101] },
    { "token": " the", "logprob": -0.0040786397, "bytes": [32, 116, 104, 101] },
    {
      "token": " doctor",
      "logprob": -2.3392786e-6,
      "bytes": [32, 100, 111, 99, 116, 111, 114]
    },
    {
      "token": " tomorrow",
      "logprob": -7.89631e-7,
      "bytes": [32, 116, 111, 109, 111, 114, 114, 111, 119]
    },
    {
      "token": " ideally",
      "logprob": -0.5800861,
      "bytes": [32, 105, 100, 101, 97, 108, 108, 121]
    },
    { "token": ".", "logprob": -0.00011093382, "bytes": [46] }
  ],
  "usage": {
    "type": "tokens",
    "input_tokens": 14,
    "input_token_details": {
      "text_tokens": 0,
      "audio_tokens": 14
    },
    "output_tokens": 45,
    "total_tokens": 59
  }
}
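
Each logprob value is a natural-log probability, so exp(logprob) gives the model's probability for that token. As a small illustration of how the logprobs array might be used, the Python sketch below flags tokens whose probability falls under a threshold. The function name and the 0.6 threshold are arbitrary choices for this example.

```python
import math


def low_confidence_tokens(logprobs, threshold=0.6):
    """Return (token, probability) pairs whose probability is below threshold."""
    flagged = []
    for entry in logprobs:
        # Convert the log probability back to a probability in [0, 1].
        probability = math.exp(entry["logprob"])
        if probability < threshold:
            flagged.append((entry["token"], probability))
    return flagged
```

Applied to the response above, this would flag "Hey" (probability about 0.35) and " ideally" (about 0.56), while tokens such as " knee" stay well above the threshold.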

Unified Domain Name Call

The unified-domain interface accepts the same parameters as the endpoint-specific interface described above, with one difference: the model field is required and must be set to the product code of the model.

Request URL

https://genaiapi-m2.cloudsway.net/v1/audio/transcriptions

Request Example

curl 'https://genaiapi-m2.cloudsway.net/v1/audio/transcriptions' \
-H 'Authorization: Bearer {Your_AK}' \
-F 'file=@"/path/to/file"' \
-F 'response_format="json"' \
-F 'model="MaaS_GP_4o_transcribe_diarize"'
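
When the diarize model is called with known speakers, the reference samples are passed as data URLs alongside the speaker names. The Python sketch below assembles the form fields for such a request. The helper names are hypothetical, the bracketed field names follow the include[] and known_speaker_references[] spellings used in this document, and the audio/wav MIME type is an assumption about the sample files.

```python
import base64


def to_data_url(audio_bytes, mime_type="audio/wav"):
    """Encode raw audio bytes as a base64 data URL."""
    encoded = base64.b64encode(audio_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def diarize_form_fields(model, speakers):
    """Build (name, value) form fields for a diarized request.

    speakers maps a short speaker name to the raw bytes of a
    2 to 10 second reference sample.
    """
    if len(speakers) > 4:
        raise ValueError("at most 4 known speakers are supported")
    fields = [
        ("model", model),
        # diarized_json is the format that returns speaker annotations.
        ("response_format", "diarized_json"),
        # Required by the diarize model for inputs longer than 30 seconds.
        ("chunking_strategy", "auto"),
    ]
    for name, audio in speakers.items():
        fields.append(("known_speaker_names[]", name))
        fields.append(("known_speaker_references[]", to_data_url(audio)))
    return fields
```

Names and references are appended in the same order, so the n-th name lines up with the n-th reference sample.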