ASR API
Version History
Version | Date | Changes |
---|---|---|
v1.1 | 2024-12-03 | Added three product API documents: MaaS-AFast-asr, MaaS-Arealtime-asr, and MaaS-ASpeech-Translation |
v1.0 | 2024-08-29 | Initial release |
MaaS Whisper
Public Information
Parameter | Description | Example |
---|---|---|
basePath | Base path for invoking the MaaS API, including the fixed path /v1/ai | https://genaiapi.cloudsway.net/v1/ai |
endpointPath | Randomly generated path segment of the MaaS API | RkBOAlaWzKcubSji |
AccessKey | AccessKey for invoking the MaaS API | RWxxxxxxxx0Gd |
Based on the example above, the final request path for the Voice-to-Text interface is https://genaiapi.cloudsway.net/v1/ai/RkBOAlaWzKcubSji/audio/transcriptions
Request Method
POST
Request Path
{basePath}/{endpointPath}/audio/transcriptions
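As a minimal sketch, the request path can be assembled from the two values in the Public Information table as follows. The helper name `build_transcription_url` is illustrative only and is not part of the API.

```python
def build_transcription_url(base_path: str, endpoint_path: str) -> str:
    """Join basePath and endpointPath into the Whisper transcription URL."""
    # Strip stray slashes so the joined path never contains "//".
    base = base_path.rstrip("/")
    endpoint = endpoint_path.strip("/")
    return f"{base}/{endpoint}/audio/transcriptions"


url = build_transcription_url(
    "https://genaiapi.cloudsway.net/v1/ai", "RkBOAlaWzKcubSji"
)
print(url)
```

Stripping the slashes means the helper gives the same result whether or not the configured base path ends with `/`.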
Request Header
Parameter | Description | Example |
---|---|---|
Authorization | AccessKey, sent as Bearer ${AccessKey} | Bearer RWxxxxxxxx0Gd |
Request Body
Parameter | Type | Required | Description | Example |
---|---|---|---|---|
file | File | Yes | Audio file in one of these formats: mp3, mp4, mpeg, mpga, m4a, wav, webm. Maximum file size: 25 MB | |
prompt | String | No | Prompt | "Generate a video of a sunset over the ocean." |
response_format | String | No | Format in which the model returns the result | json, verbose_json |
temperature | String | No | Sampling temperature, a value between 0 and 1 | |
language | String | No | Language of the audio file | "en" (English), "zh" (Chinese), "es" (Spanish), etc. |
timestamp_granularities | String | No | The granularity of the timestamp | "none": no timestamp. "word": timestamp for each word. "sentence": timestamp for each sentence. |
Response
Parameter | Type | Description | Example |
---|---|---|---|
text | String | Transcribed text | |
Example
Request
curl --request POST \
--url https://genaiapi.cloudsway.net/v1/ai/RkBOAlaWzKcubSji/audio/transcriptions \
--header 'Accept: */*' \
--header 'Authorization: Bearer ${AccessKey}' \
--header 'Connection: keep-alive' \
--header 'content-type: multipart/form-data' \
--form 'prompt=A poetic description of early morning, including words like dawn, quiet, mist, and possibility' \
--form response_format=verbose_json \
--form temperature=0.1 \
--form language=en \
--form timestamp_granularities=none \
--form 'file=@xx.wav'
Response
{
"text": "In this ancient town, plum blossoms bloom silently. The white petals are like snow, falling on the branches, welcoming the cold winter. The fragrance of the flowers is elegant, and it touches the heart, as if it is the scent of time. In this ancient town, plum blossoms bloom silently. Every plum blossom is a small miracle, which blooms in the coldness of life. They are not afraid of the cold, they are firm, symbolizing hope and rebirth. The blooming of plum blossoms is like the praise of nature for life, warming everyone's heart. Each blossom is a small miracle, symbolizing hope and rebirth. Standing under the plum trees, it is as if you can hear the rain of years. Flowers bloom and fall, spring and autumn come. Plum blossoms witness the turning of time, and witness people's joy and sorrow. They are the guardians of memory, quietly preserving the story of this town. Standing under the plum tree, one can almost hear the whispers of time. Plum blossoms are not just a plant, but also a spiritual symbol. It teaches us to keep hope in adversity, to find warmth in the cold winter. Every year's blooming is a praise of life, a hope for the future. Plum blossoms teach us to keep hope alive in adversity. Let's cherish the beauty before us and embrace every moment of life bravely. Let's cherish the beauty before us and embrace every moment of life bravely."
}
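For callers who prefer Python to curl, the request above could be prepared along the following lines. This is a sketch using only the standard library, not an official client: the helpers `encode_multipart` and `build_whisper_request` are hypothetical names, and the audio bytes are a stand-in for a real file.

```python
import urllib.request
import uuid


def encode_multipart(fields: dict, file_field: str, filename: str, data: bytes):
    """Encode text fields plus one file as multipart/form-data.

    Returns (body, content_type). Hypothetical helper, standard library only.
    """
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"'
            f"\r\n\r\n{value}\r\n".encode("utf-8")
        )
    file_header = (
        f'--{boundary}\r\nContent-Disposition: form-data; name="{file_field}"; '
        f'filename="{filename}"\r\nContent-Type: application/octet-stream\r\n\r\n'
    )
    parts.append(file_header.encode("utf-8") + data + b"\r\n")
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def build_whisper_request(url, access_key, audio_bytes, filename, **fields):
    """Build (but do not send) the POST request for the transcription API."""
    body, content_type = encode_multipart(fields, "file", filename, audio_bytes)
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {access_key}",
            "Content-Type": content_type,
        },
    )


request = build_whisper_request(
    "https://genaiapi.cloudsway.net/v1/ai/RkBOAlaWzKcubSji/audio/transcriptions",
    "RWxxxxxxxx0Gd",  # placeholder AccessKey from the table above
    b"RIFF",          # stand-in bytes; read a real audio file in practice
    "xx.wav",
    response_format="verbose_json",
    temperature="0.1",
    language="en",
)
# Sending is left to the caller, for example:
#   with urllib.request.urlopen(request) as resp:
#       text = json.load(resp)["text"]
```

The request is only constructed here, so the sketch can be inspected without network access or a valid AccessKey.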
MaaS-AFast-asr
Public Information
Parameter | Description | Example |
---|---|---|
basePath | Base path for MaaS API | https://genaiapi.cloudsway.net |
endpointPath | Random path segment generated for MaaS API | LPUqHEAjfonOmohV |
AccessKey | Access key for MaaS API | RWxxxxxxxx0Gd |
Based on the example above, the final request path for the Quick Transcription API is:
https://genaiapi.cloudsway.net/v1/ai/LPUqHEAjfonOmohV/speechtotext/transcriptions:transcribe?api-version=2024-11-15
Request Method
POST
Request Path
{basePath}/v1/ai/{endpointPath}/speechtotext/transcriptions:transcribe?api-version=2024-11-15
Request Header
Parameter | Required | Description |
---|---|---|
Authorization | Yes | AccessKey, sent as Bearer ${AccessKey}; for example, Bearer RWxxxxxxxx0Gd |
Query Parameters
Parameter | Required | Description |
---|---|---|
api-version | Yes | Fixed value: 2024-11-15 |
Request Form Data
Parameter | Required | Type | Description |
---|---|---|---|
audio | Yes | Audio file | Audio file |
definition | No | JSON string | Configuration options |
definition
Parameter | Required | Description |
---|---|---|
channels | No | List of zero-based indices of the channels to transcribe separately. Unless diarization is enabled, at most two channels are supported. By default, the Quick Transcription API merges all input channels into a single channel before transcription. To transcribe the channels of a stereo file independently, specify [0,1], [0], or [1]; otherwise, the stereo audio is merged into mono and a single channel is transcribed. If the audio is stereo and diarization is enabled, channels cannot be set to [0,1], because the speech service does not support diarization across multiple channels. For mono audio, channels is ignored and the audio is always transcribed as mono. |
diarization | No | Diarization configuration. Diarization identifies and separates speakers within a single audio channel. For example, specify "diarization": {"maxSpeakers": 2, "enabled": true}. The transcription then includes a speaker entry for each transcribed phrase (e.g., "speaker": 0 or "speaker": 1). |
locales | No, but recommended if you know the expected language | List of languages expected in the audio to be transcribed. Specifying the language of the audio file can improve transcription accuracy and minimize latency. If a single language is specified, it is used for transcription. If you are unsure of the language, you can specify multiple candidates; the more precise the candidate list, the more accurate language recognition is likely to be. If no language is specified, or the specified language is not present in the audio, the speech service attempts to recognize the language and returns an error if it cannot. Supported locales: de-DE, en-IN, en-US, es-ES, es-MX, fr-FR, hi-IN, it-IT, ja-JP, ko-KR, pt-BR, and zh-CN. |
profanityFilterMode | No | Specifies how profanity is handled in the recognition results. Accepted values are None (disable profanity filtering), Masked (replace profanity with asterisks), Removed (remove all profanity from the results), and Tags (add profanity tags). The default value is Masked. |
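The options above are sent as one JSON string in the `definition` form field. The following sketch shows one way to assemble that string while checking two of the constraints from the table: the accepted profanityFilterMode values, and the rule that diarization cannot be combined with [0,1]. The function `build_definition` and its argument names are assumptions made for illustration.

```python
import json

ALLOWED_PROFANITY_MODES = {"None", "Masked", "Removed", "Tags"}


def build_definition(locales=None, channels=None, max_speakers=None,
                     profanity_filter_mode="Masked"):
    """Return the JSON string for the `definition` form field.

    Passing max_speakers turns diarization on (an assumption of this sketch).
    """
    if profanity_filter_mode not in ALLOWED_PROFANITY_MODES:
        raise ValueError(f"unsupported profanityFilterMode: {profanity_filter_mode}")
    diarization_enabled = max_speakers is not None
    if diarization_enabled and channels is not None and len(channels) > 1:
        # Per the table: diarization is not supported across multiple channels.
        raise ValueError("diarization cannot be combined with multiple channels")
    definition = {"profanityFilterMode": profanity_filter_mode}
    if locales:
        definition["locales"] = list(locales)
    if channels is not None:
        definition["channels"] = list(channels)
    if diarization_enabled:
        definition["diarization"] = {"maxSpeakers": max_speakers, "enabled": True}
    return json.dumps(definition)


print(build_definition(locales=["zh-CN"], channels=[0], max_speakers=2))
```

Validating on the client side is optional; it only surfaces an invalid combination before the audio is uploaded.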
Request Example
curl --request POST \
--url 'https://genaiapi.cloudsway.net/v1/ai/qyBrSaFJYTUwsWcM/speechtotext/transcriptions:transcribe?api-version=2024-11-15' \
--header 'Authorization: Bearer ${AccessKey}' \
--header 'Content-Type: multipart/form-data' \
--form 'audio=@path/to/your/audio/file' \
--form 'definition={
"channels": [0],
"locales": ["zh-CN"],
"diarization": {
"maxSpeakers": 2,
"enabled": true
},
"profanityFilterMode": "Masked"
}'
Response
Field Name | Type | Description |
---|---|---|
durationMilliseconds | Integer | Total duration of the audio file in milliseconds. |
combinedPhrases | Array | List of combined phrases. |
phrases | Array | Detailed information of each phrase. |
combinedPhrases
Field Name | Type | Description |
---|---|---|
text | String | Combined phrase text. |
phrases
Field Name | Type | Description |
---|---|---|
speaker | Integer | Speaker identifier (for example, 0 or 1); null when no speaker is assigned. |
offsetMilliseconds | Integer | Offset of the phrase in the audio in milliseconds. |
durationMilliseconds | Integer | Duration of the phrase in milliseconds. |
text | String | Text of the phrase. |
words | Array | Detailed information of each word in the phrase. |
locale | String | Locale identifier of the phrase. |
confidence | Float | Confidence score of the phrase recognition. |
words
Field Name | Type | Description |
---|---|---|
text | String | Text of the word. |
offsetMilliseconds | Integer | Offset of the word in the audio in milliseconds. |
durationMilliseconds | Integer | Duration of the word in milliseconds. |
Response Example
{
"durationMilliseconds": 1920,
"combinedPhrases": [
{
"text": "Hello,我是谁啊?"
}
],
"phrases": [
{
"speaker": null,
"offsetMilliseconds": 160,
"durationMilliseconds": 1440,
"text": "Hello,我是谁啊?",
"words": [
{
"text": "Hello,",
"offsetMilliseconds": 160,
"durationMilliseconds": 560
},
{
"text": "我",
"offsetMilliseconds": 720,
"durationMilliseconds": 240
},
{
"text": "是",
"offsetMilliseconds": 960,
"durationMilliseconds": 160
},
{
"text": "谁",
"offsetMilliseconds": 1120,
"durationMilliseconds": 240
},
{
"text": "啊?",
"offsetMilliseconds": 1360,
"durationMilliseconds": 240
}
],
"locale": "zh-CN",
"confidence": 0.7978613
}
]
}
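A response like the one above can be turned into readable, timestamped lines with a few lines of Python. This is a sketch only: `format_timestamp` and `phrase_lines` are illustrative names, and the sample dictionary is a trimmed copy of the response example.

```python
def format_timestamp(ms: int) -> str:
    """Render a millisecond offset as MM:SS.mmm."""
    minutes, rest = divmod(ms, 60_000)
    seconds, millis = divmod(rest, 1000)
    return f"{minutes:02d}:{seconds:02d}.{millis:03d}"


def phrase_lines(response: dict) -> list:
    """Build one '[start - end] speaker: text' line per phrase."""
    lines = []
    for phrase in response.get("phrases", []):
        start = phrase["offsetMilliseconds"]
        end = start + phrase["durationMilliseconds"]
        speaker = phrase.get("speaker")
        # speaker is null in the example, so fall back to a neutral label.
        label = "speaker ?" if speaker is None else f"speaker {speaker}"
        lines.append(
            f"[{format_timestamp(start)} - {format_timestamp(end)}] "
            f"{label}: {phrase['text']}"
        )
    return lines


sample = {
    "durationMilliseconds": 1920,
    "phrases": [
        {
            "speaker": None,
            "offsetMilliseconds": 160,
            "durationMilliseconds": 1440,
            "text": "Hello,我是谁啊?",
        }
    ],
}
for line in phrase_lines(sample):
    print(line)  # [00:00.160 - 00:01.600] speaker ?: Hello,我是谁啊?
```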
Supported Audio Files
Maximum file size: 25 MB
- WAV
- MP3
- OPUS/OGG
- FLAC
- WMA
- AAC
- ALAW in WAV container
- MULAW in WAV container
- AMR
- WebM
- M4A
- SPEEX
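A client may want to check a file against the list above before uploading it. The sketch below does so under two stated assumptions: the file extensions chosen for each format are typical ones and are not given by the source, and "25MB" is read as 25 * 1024 * 1024 bytes. The helper `check_audio` is hypothetical.

```python
import os

# Assumption: "25MB" is read as 25 * 1024 * 1024 bytes; the source does not define it.
MAX_BYTES = 25 * 1024 * 1024

# Assumption: typical file extensions for the formats listed above.
SUPPORTED_EXTENSIONS = {
    ".wav", ".mp3", ".opus", ".ogg", ".flac", ".wma",
    ".aac", ".amr", ".webm", ".m4a", ".spx",
}


def check_audio(filename: str, size_bytes: int) -> list:
    """Return a list of problems; an empty list means the file looks acceptable."""
    problems = []
    extension = os.path.splitext(filename)[1].lower()
    if extension not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported extension: {extension or '(none)'}")
    if size_bytes > MAX_BYTES:
        problems.append(f"file is {size_bytes} bytes, limit is {MAX_BYTES}")
    return problems


print(check_audio("meeting.WAV", 1_000_000))  # []
print(check_audio("notes.txt", 30 * 1024 * 1024))
```

An extension check cannot tell an ALAW or MULAW WAV file from any other WAV file, so the service remains the final authority on whether a file is accepted.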
MaaS-Arealtime-asr
Request Protocol
HTTP
Header
Parameter | Type | Description |
---|---|---|
Authorization | string | Authentication token |
Request Path
https://genaiapi.cloudsway.net/v1/ai/{endpointPath}/audio/recognize
Request FormData
Parameter | Type | Description |
---|---|---|
file | file | Audio file to be recognized, maximum duration of 30 seconds |
recognitionLanguages | string | Possible languages of the audio, separated by commas. Example: en-US,es-MX |
timeout | string | Speech interval timeout |
Request Example
curl --location 'https://genaiapi.cloudsway.net/v1/ai/QEnOdgDqcLVKmTCP/audio/recognize' \
--header 'Authorization: Bearer ${AccessKey}' \
--header 'Connection: keep-alive' \
--form 'file=@path/to/your/audio/file' \
--form 'recognitionLanguages="en-US,es-MX"' \
--form 'timeout="100"'
Response
Parameter | Type | Description |
---|---|---|
text | string | Recognized audio text |
language | string | Recognized audio language |
duration | int | Audio duration in 100-nanosecond units |
durationInSeconds | int | Audio duration in seconds |
Example
{
"text": "Cuando abrió los ojos por la mañana fue porque una joven empleada doméstica había entrado en su habitación para encender el fuego.",
"language": "es-MX",
"duration": 65200000,
"durationInSeconds": 7
}
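The sample values suggest that `duration` counts 100-nanosecond units (65200000 corresponds to 6.52 s) and that `durationInSeconds` is the same value rounded up to whole seconds. This is an inference from the example, not a documented rule; the sketch below shows the arithmetic under that assumption, with illustrative function names.

```python
TICKS_PER_SECOND = 10_000_000  # one tick = 100 nanoseconds


def ticks_to_seconds(duration: int) -> float:
    """Convert a duration in 100-nanosecond units to fractional seconds."""
    return duration / TICKS_PER_SECOND


def ticks_to_whole_seconds(duration: int) -> int:
    """Round up to whole seconds using integer arithmetic only."""
    return -(-duration // TICKS_PER_SECOND)


print(ticks_to_seconds(65_200_000))        # 6.52
print(ticks_to_whole_seconds(65_200_000))  # 7
```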
Additional Information
Supported Languages
Language | Locale (BCP-47) |
---|---|
Arabic | ar-AE, ar-BH, ar-DZ, ar-EG, ar-IQ, ar-JO, ar-KW, ar-LY, ar-MA, ar-OM, ar-QA, ar-SA, ar-SY, ar-YE |
Bengali | bn-IN |
Bulgarian | bg-BG |
Catalan | ca-ES |
Chinese | zh-CN, zh-HK, zh-TW |
Croatian | hr-HR |
Czech | cs-CZ |
Danish | da-DK |
Dutch | nl-NL |
English | en-AU |
Estonian | et-EE |
Finnish | fi-FI |
French | fr-CA, fr-FR |
German | de-DE |
Greek | el-GR |
Gujarati | gu-IN |
Hebrew | he-IL |
Hindi | hi-IN |
Hungarian | hu-HU |
Indonesian | id-ID |
Irish | ga-IE |
Italian | it-IT |
Japanese | ja-JP |
Kannada | kn-IN |
Korean | ko-KR |
Latvian | lv-LV |
Lithuanian | lt-LT |
Malayalam | ml-IN |
Maltese | mt-MT |
Marathi | mr-IN |
Norwegian | nb-NO |
Polish | pl-PL |
Portuguese | pt-BR, pt-PT |
Romanian | ro-RO |
Russian | ru-RU |
Slovak | sk-SK |
Slovenian | sl-SI |
Spanish | es-AR, es-BO, es-CL, es-CO, es-CR, es-CU, es-DO, es-EC, es-SV, es-GQ, es-GT, es-HN, es-MX, es-NI, es-PA, es-PY, es-PE, es-PR, es-ES, es-UY, es-US, es-VE |
Swedish | sv-SE |
Tamil | ta-IN |
Telugu | te-IN |
Thai | th-TH |
Turkish | tr-TR |
Ukrainian | uk-UA |
Vietnamese | vi-VN |
MaaS-ASpeech-Translation
Request Method
POST
Request Path
{basePath}/v1/ai/{endpointPath}/audio/realtime/translation
Request Header
Parameter | Description | Example |
---|---|---|
Authorization | AccessKey, sent as Bearer ${AccessKey} | Bearer RWxxxxxxxx0Gd |
Host | Host address of the service | genaiapi.cloudsway.net |
Request Parameters
Field Name | Type | Required | Description | Example Value |
---|---|---|---|---|
targetLanguages | String | Yes | Comma-separated list of target translation languages | "en-US,ja" |
file | File | Yes | Audio file, up to 30 seconds long. For longer files, only the first 30 seconds are transcribed and translated | "C:\Users\zhcn_continuous_mode_sample.wav" |
recognitionLanguages | String | Yes | Comma-separated list of recognition languages. When more than one is provided, hasRecognize must be enabled; otherwise, only the first is recognized | "zh-CN,en-US" |
hasRecognize | String | No | Whether recognition is required; defaults to false | "true" |
SegmentationSilenceTimeoutMs | String | No | Segmentation silence timeout setting (in milliseconds), defaults to 2000 | "1000" |
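The interaction between recognitionLanguages and hasRecognize is easy to get wrong, so a client might build the form fields with a small helper such as the one sketched here. The name `build_translation_fields` is hypothetical, and the comma-separated format follows the curl example in this section.

```python
def build_translation_fields(target_languages, recognition_languages,
                             silence_timeout_ms=None):
    """Build the text form fields for the speech translation request."""
    if not target_languages or not recognition_languages:
        raise ValueError("targetLanguages and recognitionLanguages are required")
    fields = {
        "targetLanguages": ",".join(target_languages),
        "recognitionLanguages": ",".join(recognition_languages),
    }
    # Several recognition languages need hasRecognize enabled;
    # otherwise only the first one is recognized.
    if len(recognition_languages) > 1:
        fields["hasRecognize"] = "true"
    if silence_timeout_ms is not None:
        fields["SegmentationSilenceTimeoutMs"] = str(silence_timeout_ms)
    return fields


print(build_translation_fields(["en-US", "ja"], ["zh-CN", "en-US"], 1000))
```

All values are strings because the parameter table lists every text field with type String; the audio file itself is attached separately as the `file` part.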
Return Values
Field Name | Type | Description | Example Value |
---|---|---|---|
text | String | Original text | "Good morning, Steve. Good morning, Katie. ..." |
translations | Object | Translation results, containing texts in different languages | {"ja": "おはようございます、スティーブ。おはようございます、ケイティ。..."} |
language | String | Original language of the audio | "en-US" |
duration | Integer | Audio duration, in hundred-nanosecond units | 286400000 |
resultId | String | Unique identifier for the task result | "5518458c7dec4003b9281662d9c763a7" |
durationInSeconds | Integer | Audio duration, in seconds | 29 |
Example
Request
curl --location --request POST 'https://genaiapi.cloudsway.net/v1/ai/YAzGCqDxSYQFlYie/audio/realtime/translation' \
--header 'Authorization: Bearer ${AccessKey}' \
--header 'Connection: keep-alive' \
--form 'targetLanguages="en-US,ja"' \
--form 'file=@path/to/your/audio/file' \
--form 'recognitionLanguages="zh-CN,en-US"' \
--form 'hasRecognize="true"' \
--form 'SegmentationSilenceTimeoutMs="1000"'
Return Values
{
"text": "秋天总是那么那么富有诗意,树叶渐渐变红街道旁的银杏树开始落叶,人们穿上厚重的外套,享受着凉爽的秋风。黄昏时分,夕阳洒在街道上,给忙碌的一天增添了一抹温暖。无论是散步还是小憩,这个季节总能带来宁静和满足。",
"translations": {
"en-US": "Autumn is always so poetic, the leaves are turning red, the ginkgo trees along the streets are starting to lose their leaves, and people are wearing heavy coats and enjoying the cool autumn breeze. At dusk, the setting sun shines on the streets, adding a touch of warmth to a busy day. Whether it's a walk or a nap, this season always brings tranquility and fulfillment.",
"ja": "秋はいつもとても詩的で、葉は赤く色づき、通り沿いのイチョウの木は葉を失い始め、人々は厚手のコートを着て涼しい秋の風を楽しんでいます。 夕暮れ時には、夕日が通りを照らし、忙しい一日に暖かさを加えます。 散歩でも昼寝でも、この季節はいつも静けさと充実感をもたらします。"
},
"language": "zh-CN",
"duration": 260400000,
"resultId": "ad03ee3a708e435dbe0ee808bb68f918",
"durationInSeconds": 27
}
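Reading a result is a matter of looking up the requested language in the `translations` object. The sketch below uses the abbreviated example values from the Return Values table; `get_translation` is an illustrative helper, not part of the API.

```python
def get_translation(response: dict, language: str):
    """Return the translation for `language`, or None if it is not present."""
    return response.get("translations", {}).get(language)


# Abbreviated values taken from the Return Values table.
sample = {
    "text": "Good morning, Steve. Good morning, Katie. ...",
    "translations": {
        "ja": "おはようございます、スティーブ。おはようございます、ケイティ。...",
    },
    "language": "en-US",
    "duration": 286400000,
    "resultId": "5518458c7dec4003b9281662d9c763a7",
    "durationInSeconds": 29,
}

print(sample["language"], "->", sample["text"])
for code, text in sample["translations"].items():
    print(code, "->", text)
print(get_translation(sample, "fr"))  # None, because "fr" was not requested
```

Returning None for a missing language lets the caller decide whether to fall back to the original `text` or to treat the gap as an error.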