MaaS Fast ASR
Public Information
Parameter | Description | Example |
---|---|---|
basePath | Base path for MaaS API (no trailing slash) | https://genaiapi.cloudsway.net |
endpointPath | Random path segment generated for MaaS API | LPUqHEAjfonOmohV |
AccessKey | Access key for MaaS API | RWxxxxxxxx0Gd |
Based on the example above, the final request URL for the Quick Transcription
API is:
https://genaiapi.cloudsway.net/v1/ai/LPUqHEAjfonOmohV/speechtotext/transcriptions:transcribe?api-version=2024-11-15
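Composing the URL is plain string concatenation of the three values from the table above; a minimal Python sketch using the example values:

```python
base_path = "https://genaiapi.cloudsway.net"   # basePath, without trailing slash
endpoint_path = "LPUqHEAjfonOmohV"             # your generated endpointPath

url = (f"{base_path}/v1/ai/{endpoint_path}/speechtotext/"
       "transcriptions:transcribe?api-version=2024-11-15")
print(url)
# https://genaiapi.cloudsway.net/v1/ai/LPUqHEAjfonOmohV/speechtotext/transcriptions:transcribe?api-version=2024-11-15
```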
Request Method
POST
Request Path
{basePath}/v1/ai/{endpointPath}/speechtotext/transcriptions:transcribe?api-version=2024-11-15
Request Header
Parameter | Required | Description |
---|---|---|
Authorization | Yes | Bearer token built from the AccessKey: Bearer ${AccessKey}, e.g. Bearer RWxxxxxxxx0Gd |
Query Parameters
Parameter | Required | Description |
---|---|---|
api-version | Yes | Fixed value: 2024-11-15 |
Request Form Data
Parameter | Required | Type | Description |
---|---|---|---|
audio | Yes | File | The audio file to be transcribed (see Supported Audio Files below). |
definition | No | JSON string | Transcription configuration options; see the definition table below. |
definition
Parameter | Required | Description |
---|---|---|
channels | No | Zero-based indices of the channels to transcribe separately. Unless diarization is enabled, a maximum of two channels is supported. By default, the Quick Transcription API merges all input channels into a single channel before transcription; to transcribe each channel independently, set this property explicitly. For stereo audio files, specify [0,1], [0], or [1]; otherwise stereo audio is merged to mono and only a single channel is transcribed. If the audio is stereo and diarization is enabled, channels cannot be set to [0,1], because the speech service does not support diarization of multiple channels. For mono audio, the channels attribute is ignored and the audio is always transcribed as mono. |
diarization | No | Diarization configuration. Diarization is the process of identifying and separating speakers in a single audio channel. For example, specify "diarization": {"maxSpeakers": 2, "enabled": true}. The transcription result then includes a speaker entry for each transcribed phrase (e.g., "speaker": 0 or "speaker": 1). |
locales | No, but recommended if the expected language is known | List of locales, which should match the expected languages of the audio to be transcribed. If you know the locale of the audio file, specifying it improves transcription accuracy and minimizes latency. If a single locale is specified, it is used for transcription. If you are unsure which language is spoken, specify several candidates; the more precise the candidate list, the more accurate language identification is likely to be. If no locale is specified, or the specified locale is not present in the audio, the speech service attempts to identify the language and returns an error if it cannot. Supported locales: de-DE, en-IN, en-US, es-ES, es-MX, fr-FR, hi-IN, it-IT, ja-JP, ko-KR, pt-BR, and zh-CN. |
profanityFilterMode | No | Specifies how to handle profanity in recognition results. Accepted values: None (disable profanity filtering), Masked (replace profanity with asterisks), Removed (remove all profanity from the result), or Tags (add profanity tags). The default value is Masked. |
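To make the options concrete, here is a minimal sketch of assembling the definition form field in Python; the option values are illustrative, not required defaults:

```python
import json

# Merged-channel transcription with diarization: the service separates
# speakers within the single channel, so `channels` is left unset.
definition = {
    "locales": ["zh-CN"],                                # candidate locale(s)
    "diarization": {"enabled": True, "maxSpeakers": 2},
    "profanityFilterMode": "Masked",                     # None|Masked|Removed|Tags
}

# Alternative: per-channel transcription of a stereo file *without* diarization
# ([0,1] and diarization are mutually exclusive, per the table above).
# definition = {"locales": ["zh-CN"], "channels": [0, 1]}

definition_json = json.dumps(definition)  # sent as the `definition` form field
```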
Request Example
curl --request POST \
--url 'https://genaiapi.cloudsway.net/v1/ai/LPUqHEAjfonOmohV/speechtotext/transcriptions:transcribe?api-version=2024-11-15' \
--header 'Authorization: Bearer ${AccessKey}' \
--header 'Content-Type: multipart/form-data' \
--form 'audio=@path/to/your/audio/file' \
--form 'definition={
"channels": [0],
"locales": ["zh-CN"],
"diarization": {
"maxSpeakers": 2,
"enabled": true
},
"profanityFilterMode": "Masked"
}'
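For reference, a Python sketch of the same call using the requests library; the endpoint path, AccessKey, and file path are the placeholder values from the tables above. Note that requests sets the multipart Content-Type (including the boundary) automatically when the files argument is present:

```python
import json
import requests

url = ("https://genaiapi.cloudsway.net/v1/ai/LPUqHEAjfonOmohV/speechtotext/"
       "transcriptions:transcribe?api-version=2024-11-15")

definition = {
    "channels": [0],
    "locales": ["zh-CN"],
    "diarization": {"maxSpeakers": 2, "enabled": True},
    "profanityFilterMode": "Masked",
}

with open("path/to/your/audio/file.wav", "rb") as audio:
    resp = requests.post(
        url,
        headers={"Authorization": "Bearer RWxxxxxxxx0Gd"},  # Bearer ${AccessKey}
        files={"audio": audio},                      # multipart audio field
        data={"definition": json.dumps(definition)},  # JSON string form field
    )
resp.raise_for_status()
result = resp.json()
```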
Response
Field Name | Type | Description |
---|---|---|
durationMilliseconds | Integer | Total duration of the audio file in milliseconds. |
combinedPhrases | Array | List of combined phrases. |
phrases | Array | Detailed information of each phrase. |
combinedPhrases
Field Name | Type | Description |
---|---|---|
text | String | Combined phrase text. |
phrases
Field Name | Type | Description |
---|---|---|
speaker | Integer | Speaker identifier; null unless diarization is enabled. |
offsetMilliseconds | Integer | Offset of the phrase in the audio in milliseconds. |
durationMilliseconds | Integer | Duration of the phrase in milliseconds. |
text | String | Text of the phrase. |
words | Array | Detailed information of each word in the phrase. |
locale | String | Locale identifier of the phrase. |
confidence | Float | Confidence score of the phrase recognition. |
words
Field Name | Type | Description |
---|---|---|
text | String | Text of the word. |
offsetMilliseconds | Integer | Offset of the word in the audio in milliseconds. |
durationMilliseconds | Integer | Duration of the word in milliseconds. |
Response Example
{
"durationMilliseconds": 1920,
"combinedPhrases": [
{
"text": "Hello,我是谁啊?"
}
],
"phrases": [
{
"speaker": null,
"offsetMilliseconds": 160,
"durationMilliseconds": 1440,
"text": "Hello,我是谁啊?",
"words": [
{
"text": "Hello,",
"offsetMilliseconds": 160,
"durationMilliseconds": 560
},
{
"text": "我",
"offsetMilliseconds": 720,
"durationMilliseconds": 240
},
{
"text": "是",
"offsetMilliseconds": 960,
"durationMilliseconds": 160
},
{
"text": "谁",
"offsetMilliseconds": 1120,
"durationMilliseconds": 240
},
{
"text": "啊?",
"offsetMilliseconds": 1360,
"durationMilliseconds": 240
}
],
"locale": "zh-CN",
"confidence": 0.7978613
}
]
}
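A short sketch of walking the fields described above, assuming result holds the parsed response (e.g., resp.json() from the request sketch earlier):

```python
result = resp.json()  # parsed response from the request sketch above

print(f"audio duration: {result['durationMilliseconds']} ms")
print("combined text:", " ".join(p["text"] for p in result["combinedPhrases"]))

for phrase in result["phrases"]:
    start = phrase["offsetMilliseconds"]
    end = start + phrase["durationMilliseconds"]
    # `speaker` is null (None) unless diarization is enabled
    tag = f"speaker {phrase['speaker']}" if phrase["speaker"] is not None else "n/a"
    print(f"[{start}-{end} ms] {tag} ({phrase['locale']}, "
          f"confidence {phrase['confidence']:.2f}): {phrase['text']}")
    for word in phrase["words"]:
        print(f"  {word['offsetMilliseconds']:>6} ms  {word['text']}")
```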
Supported Audio Files
Maximum file size: 25 MB. Supported formats:
- WAV
- MP3
- OPUS/OGG
- FLAC
- WMA
- AAC
- ALAW in WAV container
- MULAW in WAV container
- AMR
- WebM
- M4A
- SPEEX
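Since the service only accepts files up to 25 MB, a pre-upload check can save a wasted round trip. A hedged sketch: the extension list below is an assumption mapping the formats above to common file extensions, and it checks extensions only, not actual container contents:

```python
import os

# Assumed extension mapping for the formats listed above (extension check only).
SUPPORTED_EXTENSIONS = {".wav", ".mp3", ".opus", ".ogg", ".flac", ".wma",
                        ".aac", ".amr", ".webm", ".m4a", ".spx"}
MAX_BYTES = 25 * 1024 * 1024  # 25 MB limit

def check_audio(path: str) -> None:
    """Raise ValueError if the file is obviously unsupported."""
    ext = os.path.splitext(path)[1].lower()
    if ext not in SUPPORTED_EXTENSIONS:
        raise ValueError(f"unsupported audio format: {ext or 'no extension'}")
    if os.path.getsize(path) > MAX_BYTES:
        raise ValueError("audio file exceeds the 25 MB limit")
```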