MaaS Fast ASR
Public Information
Parameter | Description | Example |
---|---|---|
basePath | Base path for MaaS API (no trailing slash) | https://genaiapi.cloudsway.net |
endpointPath | Random path segment generated for MaaS API | LPUqHEAjfonOmohV |
AccessKey | Access key for MaaS API | RWxxxxxxxx0Gd |
Based on the example above, the final request URL for the Quick Transcription
API is:
https://genaiapi.cloudsway.net/v1/ai/LPUqHEAjfonOmohV/speechtotext/transcriptions:transcribe?api-version=2024-11-15
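Composing the URL is plain string concatenation of the three values from the table above; a minimal Python sketch using the example values:

```python
base_path = "https://genaiapi.cloudsway.net"   # basePath, without trailing slash
endpoint_path = "LPUqHEAjfonOmohV"             # your generated endpointPath

url = (f"{base_path}/v1/ai/{endpoint_path}/speechtotext/"
       "transcriptions:transcribe?api-version=2024-11-15")
print(url)
# https://genaiapi.cloudsway.net/v1/ai/LPUqHEAjfonOmohV/speechtotext/transcriptions:transcribe?api-version=2024-11-15
```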
Request Method
POST
Request Path
{basePath}/v1/ai/{endpointPath}/speechtotext/transcriptions:transcribe?api-version=2024-11-15
Request Header
Parameter | Required | Description |
---|---|---|
Authorization | Yes | Bearer token built from the AccessKey: Bearer ${AccessKey}, e.g. Bearer RWxxxxxxxx0Gd |
Query Parameters
Parameter | Required | Description |
---|---|---|
api-version | Yes | Fixed value: 2024-11-15 |
Request Form Data
Parameter | Required | Type | Description |
---|---|---|---|
audio | Yes | File | The audio file to be transcribed (see Supported Audio Files below). |
definition | No | JSON string | Transcription configuration options; see the definition table below. |
definition
Parameter | Required | Description |
---|---|---|
channels | No | Zero-based indices of the channels to transcribe separately. Unless diarization is enabled, a maximum of two channels is supported. By default, the Quick Transcription API merges all input channels into a single channel before transcription; to transcribe each channel independently, set this property explicitly. For stereo audio files, specify [0,1], [0], or [1]; otherwise stereo audio is merged to mono and only a single channel is transcribed. If the audio is stereo and diarization is enabled, channels cannot be set to [0,1], because the speech service does not support diarization of multiple channels. For mono audio, the channels attribute is ignored and the audio is always transcribed as mono. |
diarization | No | Diarization configuration. Diarization is the process of identifying and separating speakers in a single audio channel. For example, specify "diarization": {"maxSpeakers": 2, "enabled": true}. The transcription result then includes a speaker entry for each transcribed phrase (e.g., "speaker": 0 or "speaker": 1). |
locales | No, but recommended if the expected language is known | List of locales, which should match the expected languages of the audio to be transcribed. If you know the locale of the audio file, specifying it improves transcription accuracy and minimizes latency. If a single locale is specified, it is used for transcription. If you are unsure which language is spoken, specify several candidates; the more precise the candidate list, the more accurate language identification is likely to be. If no locale is specified, or the specified locale is not present in the audio, the speech service attempts to identify the language and returns an error if it cannot. Supported locales: de-DE, en-IN, en-US, es-ES, es-MX, fr-FR, hi-IN, it-IT, ja-JP, ko-KR, pt-BR, and zh-CN. |
profanityFilterMode | No | Specifies how to handle profanity in recognition results. Accepted values: None (disable profanity filtering), Masked (replace profanity with asterisks), Removed (remove all profanity from the result), or Tags (add profanity tags). The default value is Masked. |
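To make the options concrete, here is a minimal sketch of assembling the definition form field in Python; the option values are illustrative, not required defaults:

```python
import json

# Merged-channel transcription with diarization: the service separates
# speakers within the single channel, so `channels` is left unset.
definition = {
    "locales": ["zh-CN"],                                # candidate locale(s)
    "diarization": {"enabled": True, "maxSpeakers": 2},
    "profanityFilterMode": "Masked",                     # None|Masked|Removed|Tags
}

# Alternative: per-channel transcription of a stereo file *without* diarization
# ([0,1] and diarization are mutually exclusive, per the table above).
# definition = {"locales": ["zh-CN"], "channels": [0, 1]}

definition_json = json.dumps(definition)  # sent as the `definition` form field
```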
Request Example
curl --request POST \
--url 'https://genaiapi.cloudsway.net/v1/ai/LPUqHEAjfonOmohV/speechtotext/transcriptions:transcribe?api-version=2024-11-15' \
--header 'Authorization: Bearer ${AccessKey}' \
--header 'Content-Type: multipart/form-data' \
--form 'audio=@path/to/your/audio/file' \
--form 'definition={
"channels": [0],
"locales": ["zh-CN"],
"diarization": {
"maxSpeakers": 2,
"enabled": true
},
"profanityFilterMode": "Masked"
}'
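For reference, a Python sketch of the same call using the requests library; the endpoint path, AccessKey, and file path are the placeholder values from the tables above. Note that requests sets the multipart Content-Type (including the boundary) automatically when the files argument is present:

```python
import json
import requests

url = ("https://genaiapi.cloudsway.net/v1/ai/LPUqHEAjfonOmohV/speechtotext/"
       "transcriptions:transcribe?api-version=2024-11-15")

definition = {
    "channels": [0],
    "locales": ["zh-CN"],
    "diarization": {"maxSpeakers": 2, "enabled": True},
    "profanityFilterMode": "Masked",
}

with open("path/to/your/audio/file.wav", "rb") as audio:
    resp = requests.post(
        url,
        headers={"Authorization": "Bearer RWxxxxxxxx0Gd"},  # Bearer ${AccessKey}
        files={"audio": audio},                      # multipart audio field
        data={"definition": json.dumps(definition)},  # JSON string form field
    )
resp.raise_for_status()
result = resp.json()
```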
Response
Field Name | Type | Description |
---|---|---|
durationMilliseconds | Integer | Total duration of the audio file in milliseconds. |
combinedPhrases | Array | List of combined phrases. |
phrases | Array | Detailed information of each phrase. |
combinedPhrases
Field Name | Type | Description |
---|---|---|
text | String | Combined phrase text. |
phrases
Field Name | Type | Description |
---|---|---|
speaker | Integer | Speaker identifier; null unless diarization is enabled. |
offsetMilliseconds | Integer | Offset of the phrase in the audio in milliseconds. |
durationMilliseconds | Integer | Duration of the phrase in milliseconds. |
text | String | Text of the phrase. |
words | Array | Detailed information of each word in the phrase. |
locale | String | Locale identifier of the phrase. |
confidence | Float | Confidence score of the phrase recognition. |
words
Field Name | Type | Description |
---|---|---|
text | String | Text of the word. |
offsetMilliseconds | Integer | Offset of the word in the audio in milliseconds. |
durationMilliseconds | Integer | Duration of the word in milliseconds. |
Response Example
{
"durationMilliseconds": 1920,
"combinedPhrases": [
{
"text": "Hello,我是谁啊?"
}
],
"phrases": [
{
"speaker": null,
"offsetMilliseconds": 160,
"durationMilliseconds": 1440,
"text": "Hello,我是谁啊?",
"words": [
{
"text": "Hello,",
"offsetMilliseconds": 160,
"durationMilliseconds": 560
},
{
"text": "我",
"offsetMilliseconds": 720,
"durationMilliseconds": 240
},
{
"text": "是",
"offsetMilliseconds": 960,
"durationMilliseconds": 160
},
{
"text": "谁",
"offsetMilliseconds": 1120,
"durationMilliseconds": 240
},
{
"text": "啊?",
"offsetMilliseconds": 1360,
"durationMilliseconds": 240
}
],
"locale": "zh-CN",
"confidence": 0.7978613
}
]
}
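A short sketch of walking the fields described above, assuming result holds the parsed response (e.g., resp.json() from the request sketch earlier):

```python
result = resp.json()  # parsed response from the request sketch above

print(f"audio duration: {result['durationMilliseconds']} ms")
print("combined text:", " ".join(p["text"] for p in result["combinedPhrases"]))

for phrase in result["phrases"]:
    start = phrase["offsetMilliseconds"]
    end = start + phrase["durationMilliseconds"]
    # `speaker` is null (None) unless diarization is enabled
    tag = f"speaker {phrase['speaker']}" if phrase["speaker"] is not None else "n/a"
    print(f"[{start}-{end} ms] {tag} ({phrase['locale']}, "
          f"confidence {phrase['confidence']:.2f}): {phrase['text']}")
    for word in phrase["words"]:
        print(f"  {word['offsetMilliseconds']:>6} ms  {word['text']}")
```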
Supported Audio Files
Maximum file size: 25 MB. Supported formats:
- WAV
- MP3
- OPUS/OGG
- FLAC
- WMA
- AAC
- ALAW in WAV container
- MULAW in WAV container
- AMR
- WebM
- M4A
- SPEEX
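Since the service only accepts files up to 25 MB, a pre-upload check can save a wasted round trip. A hedged sketch: the extension list below is an assumption mapping the formats above to common file extensions, and it checks extensions only, not actual container contents:

```python
import os

# Assumed extension mapping for the formats listed above (extension check only).
SUPPORTED_EXTENSIONS = {".wav", ".mp3", ".opus", ".ogg", ".flac", ".wma",
                        ".aac", ".amr", ".webm", ".m4a", ".spx"}
MAX_BYTES = 25 * 1024 * 1024  # 25 MB limit

def check_audio(path: str) -> None:
    """Raise ValueError if the file is obviously unsupported."""
    ext = os.path.splitext(path)[1].lower()
    if ext not in SUPPORTED_EXTENSIONS:
        raise ValueError(f"unsupported audio format: {ext or 'no extension'}")
    if os.path.getsize(path) > MAX_BYTES:
        raise ValueError("audio file exceeds the 25 MB limit")
```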