MaaS-AFast-asr

Public Information

| Parameter | Description | Example |
| --- | --- | --- |
| basePath | Base path for MaaS API | https://genaiapi.cloudsway.net/ |
| endpointPath | Random path segment generated for MaaS API | LPUqHEAjfonOmohV |
| AccessKey | Access key for MaaS API | RWxxxxxxxx0Gd |

Based on the example above, the final request path for the Quick Transcription API is:

https://genaiapi.cloudsway.net/v1/ai/LPUqHEAjfonOmohV/speechtotext/transcriptions:transcribe?api-version=2024-11-15

Request Method

POST

Request Path

{basePath}/v1/ai/{endpointPath}/speechtotext/transcriptions:transcribe?api-version=2024-11-15
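
The path template can be filled in programmatically. The following is a minimal Python sketch, not an official client: the function name `build_transcribe_url` is hypothetical, and stripping the trailing slash from basePath is an assumption made so that the example basePath (which ends in `/`) does not produce a double slash.

```python
from urllib.parse import quote, urlencode

API_VERSION = "2024-11-15"  # fixed value of the api-version query parameter


def build_transcribe_url(base_path: str, endpoint_path: str) -> str:
    """Fill in the request path template (hypothetical helper)."""
    # Assumption: drop a trailing slash so base + "/v1/..." has no "//".
    base = base_path.rstrip("/")
    # Percent-encode the segment in case it contains reserved characters.
    segment = quote(endpoint_path, safe="")
    query = urlencode({"api-version": API_VERSION})
    return f"{base}/v1/ai/{segment}/speechtotext/transcriptions:transcribe?{query}"


print(build_transcribe_url("https://genaiapi.cloudsway.net/", "LPUqHEAjfonOmohV"))
```

With the example values from the Public Information section, this yields the final request path shown earlier.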

Request Header

| Parameter | Required | Description |
| --- | --- | --- |
| Authorization | Yes | AccessKey sent as a bearer token: Bearer ${AccessKey}, for example Bearer RWxxxxxxxx0Gd |

Query Parameters

| Parameter | Required | Description |
| --- | --- | --- |
| api-version | Yes | Fixed value: 2024-11-15 |

Request Form Data

| Parameter | Required | Type | Description |
| --- | --- | --- | --- |
| audio | Yes | Audio file | The audio file to transcribe |
| definition | No | JSON string | Configuration options for the transcription |

definition

| Parameter | Required | Description |
| --- | --- | --- |
| channels | No | List of zero-based indices of the channels to transcribe separately. Unless diarization is enabled, a maximum of two channels is supported. By default, the Quick Transcription API merges all input channels into a single channel before transcription; to avoid this, transcribe each channel independently. For stereo audio files, specify [0,1], [0], or [1] to transcribe channels separately. Otherwise, stereo audio is merged into mono and only a single channel is transcribed. If the audio is stereo and diarization is enabled, channels cannot be set to [0,1], because the speech service does not support diarization for multiple channels. For mono audio, channels is ignored and the audio is always transcribed as mono. |
| diarization | No | Diarization configuration. Diarization is the process of identifying and separating speakers in a single audio channel. For example, specify "diarization": {"maxSpeakers": 2, "enabled": true}. The transcription result then includes a speaker entry for each transcribed phrase (e.g., "speaker": 0 or "speaker": 1). |
| locales | No, but recommended if you know the expected language | List of languages that should match the expected languages of the audio to be transcribed. If you know the language of the audio file, specifying it can improve transcription accuracy and minimize latency. If a single language is specified, that language is used for transcription. If you are unsure of the language, you can specify multiple candidate languages; the more precise the candidate list, the more accurate language recognition is likely to be. If no language is specified, or the specified language is not present in the audio file, the speech service attempts to recognize the language and returns an error if it cannot. Supported values: de-DE, en-IN, en-US, es-ES, es-MX, fr-FR, hi-IN, it-IT, ja-JP, ko-KR, pt-BR, and zh-CN. |
| profanityFilterMode | No | Specifies how to handle profanity in the recognition results. Accepted values are None (disable profanity filtering), Masked (replace profanity with asterisks), Removed (remove all profanity from the results), and Tags (add profanity tags). The default value is Masked. |
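
The constraints described for the definition options can be checked on the client before a request is sent. The sketch below is illustrative only: `build_definition` and its argument names are hypothetical, and the checks reflect one reading of the rules above (at most two channels without diarization, a single channel with diarization, locales limited to the listed values). The service remains the authority on what it accepts.

```python
import json

SUPPORTED_LOCALES = {"de-DE", "en-IN", "en-US", "es-ES", "es-MX", "fr-FR",
                     "hi-IN", "it-IT", "ja-JP", "ko-KR", "pt-BR", "zh-CN"}
PROFANITY_MODES = {"None", "Masked", "Removed", "Tags"}


def build_definition(channels=None, locales=None, max_speakers=None,
                     profanity_filter_mode=None) -> str:
    """Return the JSON string for the definition form field (hypothetical helper)."""
    definition = {}
    diarize = max_speakers is not None  # diarization is on when a speaker cap is given

    if channels is not None:
        if len(set(channels)) != len(channels) or any(c < 0 for c in channels):
            raise ValueError("channels must be distinct zero-based indices")
        if diarize and len(channels) > 1:
            raise ValueError("diarization is not supported for multiple channels")
        if not diarize and len(channels) > 2:
            raise ValueError("at most two channels are supported")
        definition["channels"] = list(channels)

    if locales:
        unknown = sorted(set(locales) - SUPPORTED_LOCALES)
        if unknown:
            raise ValueError(f"unsupported locales: {unknown}")
        definition["locales"] = list(locales)

    if diarize:
        definition["diarization"] = {"maxSpeakers": max_speakers, "enabled": True}

    if profanity_filter_mode is not None:
        if profanity_filter_mode not in PROFANITY_MODES:
            raise ValueError(f"unknown profanityFilterMode: {profanity_filter_mode}")
        definition["profanityFilterMode"] = profanity_filter_mode

    return json.dumps(definition)


print(build_definition(channels=[0], locales=["zh-CN"], max_speakers=2,
                       profanity_filter_mode="Masked"))
```

Omitted arguments are left out of the JSON, so the service defaults (for example, Masked for profanityFilterMode) apply.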

Request Example

curl --request POST \
  --url 'https://genaiapi.cloudsway.net/v1/ai/qyBrSaFJYTUwsWcM/speechtotext/transcriptions:transcribe?api-version=2024-11-15' \
  --header "Authorization: Bearer ${AccessKey}" \
  --header 'Content-Type: multipart/form-data' \
  --form 'audio=@path/to/your/audio/file' \
  --form 'definition={
    "channels": [0],
    "locales": ["zh-CN"],
    "diarization": {
      "maxSpeakers": 2,
      "enabled": true
    },
    "profanityFilterMode": "Masked"
  }'
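
For callers who prefer Python over curl, the same multipart request can be assembled with the standard library alone. This is a rough sketch under stated assumptions: `build_transcription_request` is a hypothetical helper, and sending the audio part as application/octet-stream is an assumption rather than something this page specifies. The function only builds the request; nothing is sent until it is passed to `urllib.request.urlopen`.

```python
import json
import uuid
import urllib.request


def build_transcription_request(url, access_key, audio_bytes, filename,
                                definition=None):
    """Assemble (but do not send) the multipart/form-data POST request."""
    boundary = uuid.uuid4().hex
    body = bytearray()
    if definition is not None:
        # Form field "definition": the configuration options as a JSON string.
        body += (
            f"--{boundary}\r\n"
            'Content-Disposition: form-data; name="definition"\r\n\r\n'
            f"{json.dumps(definition)}\r\n"
        ).encode("utf-8")
    # Form field "audio": the raw file content.
    body += (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="audio"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"  # assumed part type
    ).encode("utf-8")
    body += audio_bytes + b"\r\n"
    body += f"--{boundary}--\r\n".encode("ascii")
    return urllib.request.Request(
        url,
        data=bytes(body),
        method="POST",
        headers={
            "Authorization": f"Bearer {access_key}",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )


# Demonstration with placeholder bytes; no network call is made here.
req = build_transcription_request(
    "https://genaiapi.cloudsway.net/v1/ai/qyBrSaFJYTUwsWcM/speechtotext/"
    "transcriptions:transcribe?api-version=2024-11-15",
    "RWxxxxxxxx0Gd",
    b"placeholder audio bytes",
    "sample.wav",
    {"locales": ["zh-CN"], "profanityFilterMode": "Masked"},
)
print(req.get_method(), len(req.data), "bytes")
```

To actually send it, something like `with urllib.request.urlopen(req) as resp: result = json.load(resp)` should work, with the real audio content read from disk in binary mode.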

Response

| Field Name | Type | Description |
| --- | --- | --- |
| durationMilliseconds | Integer | Total duration of the audio file in milliseconds. |
| combinedPhrases | Array | List of combined phrases. |
| phrases | Array | Detailed information of each phrase. |

combinedPhrases

| Field Name | Type | Description |
| --- | --- | --- |
| text | String | Combined phrase text. |

phrases

| Field Name | Type | Description |
| --- | --- | --- |
| speaker | String | Speaker identifier. May be null when no speaker is assigned to the phrase. |
| offsetMilliseconds | Integer | Offset of the phrase in the audio in milliseconds. |
| durationMilliseconds | Integer | Duration of the phrase in milliseconds. |
| text | String | Text of the phrase. |
| words | Array | Detailed information of each word in the phrase. |
| locale | String | Locale identifier of the phrase. |
| confidence | Float | Confidence score of the phrase recognition. |

words

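| Field Name | Type | Description |
| --- | --- | --- |
| text | String | Text of the word. |
| offsetMilliseconds | Integer | Offset of the word in the audio in milliseconds (in the response example, the first word has the same offset as its phrase). |
| durationMilliseconds | Integer | Duration of the word in milliseconds. |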

Response Example

{
    "durationMilliseconds": 1920,
    "combinedPhrases": [
        {
            "text": "Hello,我是谁啊?"
        }
    ],
    "phrases": [
        {
            "speaker": null,
            "offsetMilliseconds": 160,
            "durationMilliseconds": 1440,
            "text": "Hello,我是谁啊?",
            "words": [
                {
                    "text": "Hello,",
                    "offsetMilliseconds": 160,
                    "durationMilliseconds": 560
                },
                {
                    "text": "我",
                    "offsetMilliseconds": 720,
                    "durationMilliseconds": 240
                },
                {
                    "text": "是",
                    "offsetMilliseconds": 960,
                    "durationMilliseconds": 160
                },
                {
                    "text": "谁",
                    "offsetMilliseconds": 1120,
                    "durationMilliseconds": 240
                },
                {
                    "text": "啊?",
                    "offsetMilliseconds": 1360,
                    "durationMilliseconds": 240
                }
            ],
            "locale": "zh-CN",
            "confidence": 0.7978613
        }
    ]
}
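
A response with this shape can be unpacked with a few lines of Python. The sketch below is only an illustration: `full_text` and `word_timeline` are hypothetical helper names, and `SAMPLE` is a copy of the response example above reduced to the fields the helpers read.

```python
def full_text(result: dict) -> str:
    """Join the text of all combined phrases, one per line."""
    return "\n".join(p["text"] for p in result.get("combinedPhrases", []))


def word_timeline(result: dict):
    """Yield (text, start_ms, end_ms) for every word, in document order."""
    for phrase in result.get("phrases", []):
        for word in phrase.get("words", []):
            start = word["offsetMilliseconds"]
            yield word["text"], start, start + word["durationMilliseconds"]


# Copy of the response example, keeping only the fields used here.
SAMPLE = {
    "durationMilliseconds": 1920,
    "combinedPhrases": [{"text": "Hello,我是谁啊?"}],
    "phrases": [{
        "speaker": None,
        "offsetMilliseconds": 160,
        "durationMilliseconds": 1440,
        "text": "Hello,我是谁啊?",
        "words": [
            {"text": "Hello,", "offsetMilliseconds": 160, "durationMilliseconds": 560},
            {"text": "我", "offsetMilliseconds": 720, "durationMilliseconds": 240},
            {"text": "是", "offsetMilliseconds": 960, "durationMilliseconds": 160},
            {"text": "谁", "offsetMilliseconds": 1120, "durationMilliseconds": 240},
            {"text": "啊?", "offsetMilliseconds": 1360, "durationMilliseconds": 240},
        ],
        "locale": "zh-CN",
        "confidence": 0.7978613,
    }],
}

print(full_text(SAMPLE))
for text, start_ms, end_ms in word_timeline(SAMPLE):
    print(f"{start_ms:>5} to {end_ms:>5} ms  {text}")
```

In this example the last word ends at 1360 + 240 = 1600 ms, which matches the end of the phrase (160 + 1440 ms).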

Supported Audio Files

Maximum file size: 25 MB. Supported formats:

  • WAV
  • MP3
  • OPUS/OGG
  • FLAC
  • WMA
  • AAC
  • ALAW in WAV container
  • MULAW in WAV container
  • AMR
  • WebM
  • M4A
  • SPEEX
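
A quick local check before uploading can catch files that are obviously too large or of an unlisted type. The following is a minimal sketch with several assumptions: `check_audio` is a hypothetical helper, the extension-to-format mapping (for example `.spx` for SPEEX) is a guess, and the 25 MB limit is interpreted conservatively as 25,000,000 bytes. An extension says nothing about the actual codec, so the service may still reject a file that passes.

```python
from pathlib import PurePath

# Assumption: "25MB" is read as 25 * 1000 * 1000 bytes, the stricter interpretation.
MAX_BYTES = 25 * 1000 * 1000

# Assumed file extensions for the listed formats (ALAW/MULAW ship inside .wav).
ALLOWED_SUFFIXES = {".wav", ".mp3", ".opus", ".ogg", ".flac", ".wma",
                    ".aac", ".amr", ".webm", ".m4a", ".spx"}


def check_audio(filename: str, size_bytes: int) -> list:
    """Return a list of problems; an empty list means the local checks passed."""
    problems = []
    suffix = PurePath(filename).suffix.lower()
    if suffix not in ALLOWED_SUFFIXES:
        problems.append(f"unsupported extension: {suffix or '(none)'}")
    if size_bytes > MAX_BYTES:
        problems.append(f"file is {size_bytes} bytes, limit is {MAX_BYTES}")
    return problems


print(check_audio("meeting.WAV", 1_000_000))
print(check_audio("notes.txt", 30_000_000))
```

For a file on disk, the size can be obtained with `os.path.getsize(path)` before calling the helper.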