ASR API

Version History

Version | Date | Changes
v1.1 | 2024-12-03 | Added three product API documents: MaaS-AFast-asr, MaaS-Arealtime-asr, and MaaS-ASpeech-Translation
v1.0 | 2024-08-29 | Initial release

MaaS Whisper

Public Information

Parameter | Description | Example
basePath | Base path for invoking the MaaS API, including the fixed path /v1/ai | https://genaiapi.cloudsway.net/v1/ai
endpointPath | Randomly generated path segment of the MaaS API | RkBOAlaWzKcubSji
AccessKey | AccessKey for invoking the MaaS API | RWxxxxxxxx0Gd

Based on the example above, the final request path for the Voice-to-Text interface is https://genaiapi.cloudsway.net/v1/ai/RkBOAlaWzKcubSji/audio/transcriptions

Request Method

POST

Request Path

{basePath}/{endpointPath}/audio/transcriptions
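
As a minimal sketch of how the path template above could be filled in, the Python helper below joins the documented basePath and endpointPath values. The function name transcription_url is hypothetical and is not part of the API.

```python
def transcription_url(base_path: str, endpoint_path: str) -> str:
    """Fill in {basePath}/{endpointPath}/audio/transcriptions."""
    # Strip stray slashes so each segment is joined by exactly one "/".
    return f"{base_path.rstrip('/')}/{endpoint_path.strip('/')}/audio/transcriptions"


# Example values taken from the Public Information table.
url = transcription_url("https://genaiapi.cloudsway.net/v1/ai", "RkBOAlaWzKcubSji")
print(url)
```

Stripping the slashes means a basePath copied with a trailing "/" still produces a valid URL.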

Request Header

Parameter | Description | Example
Authorization | AccessKey, sent as Bearer ${AccessKey} | Bearer RWxxxxxxxx0Gd

Request Body

Parameter | Type | Required | Description | Example
file | File | Yes | Audio file in a format such as mp3, mp4, mpeg, mpga, m4a, wav, or webm; maximum file size 25 MB |
prompt | String | No | Prompt text | "Generate a video of a sunset over the ocean."
response_format | String | No | Format in which the model returns the result | json, verbose_json
temperature | String | No | Temperature, a value between 0 and 1 |
language | String | No | Language of the audio file | "en" (English), "zh" (Chinese), "es" (Spanish), etc.
timestamp_granularities | String | No | Granularity of the timestamps | "none": no timestamps; "word": a timestamp for each word; "sentence": a timestamp for each sentence

Response

Parameter | Type | Description | Example
text | String | Transcribed text (the speech-to-text result) |

Example

Request

curl --request POST \
--url https://genaiapi.cloudsway.net/v1/ai/RkBOAlaWzKcubSji/audio/transcriptions \
--header 'Accept: */*' \
--header 'Authorization: Bearer ${AccessKey}' \
--header 'Connection: keep-alive' \
--header 'content-type: multipart/form-data' \
--form 'prompt=A poetic description of early morning, including words like dawn, quiet, mist, and possibility' \
--form response_format=verbose_json \
--form temperature=0.1 \
--form language=en \
--form timestamp_granularities=none \
--form 'file=@xx.wav'

Response

{
  "text": "In this ancient town, plum blossoms bloom silently. The white petals are like snow, falling on the branches, welcoming the cold winter. The fragrance of the flowers is elegant, and it touches the heart, as if it is the scent of time. In this ancient town, plum blossoms bloom silently. Every plum blossom is a small miracle, which blooms in the coldness of life. They are not afraid of the cold, they are firm, symbolizing hope and rebirth. The blooming of plum blossoms is like the praise of nature for life, warming everyone's heart. Each blossom is a small miracle, symbolizing hope and rebirth. Standing under the plum trees, it is as if you can hear the rain of years. Flowers bloom and fall, spring and autumn come. Plum blossoms witness the turning of time, and witness people's joy and sorrow. They are the guardians of memory, quietly preserving the story of this town. Standing under the plum tree, one can almost hear the whispers of time. Plum blossoms are not just a plant, but also a spiritual symbol. It teaches us to keep hope in adversity, to find warmth in the cold winter. Every year's blooming is a praise of life, a hope for the future. Plum blossoms teach us to keep hope alive in adversity. Let's cherish the beauty before us and embrace every moment of life bravely. Let's cherish the beauty before us and embrace every moment of life bravely."
}
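
The same request can be sketched in Python using only the standard library. This is an illustrative outline, not an official client: the helper names build_transcription_request and transcribe are hypothetical, the audio bytes are a placeholder, and the example only builds the request without sending it.

```python
import json
import urllib.request
import uuid


def build_transcription_request(url, access_key, filename, audio_bytes, fields):
    """Build (but do not send) a multipart/form-data POST request."""
    boundary = uuid.uuid4().hex
    parts = []
    # Plain form fields such as response_format, temperature, and language.
    for name, value in fields.items():
        parts.append(
            (f"--{boundary}\r\n"
             f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
             f"{value}\r\n").encode("utf-8")
        )
    # The audio itself goes in the "file" field as raw bytes.
    parts.append(
        (f"--{boundary}\r\n"
         f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
         "Content-Type: application/octet-stream\r\n\r\n").encode("utf-8")
        + audio_bytes + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))
    return urllib.request.Request(
        url,
        data=b"".join(parts),
        method="POST",
        headers={
            "Authorization": f"Bearer {access_key}",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )


def transcribe(request):
    """Send the request and return the "text" field of the JSON response."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)["text"]


request = build_transcription_request(
    "https://genaiapi.cloudsway.net/v1/ai/RkBOAlaWzKcubSji/audio/transcriptions",
    "RWxxxxxxxx0Gd",   # placeholder AccessKey from the table
    "xx.wav",
    b"RIFF....WAVE",   # placeholder bytes; read a real audio file in practice
    {"response_format": "verbose_json", "temperature": "0.1", "language": "en"},
)
```

Calling transcribe(request) with a real AccessKey and real audio bytes would send the request and return the transcribed text.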

MaaS-AFast-asr

Public Information

Parameter | Description | Example
basePath | Base path for the MaaS API | https://genaiapi.cloudsway.net
endpointPath | Randomly generated path segment for the MaaS API | LPUqHEAjfonOmohV
AccessKey | Access key for the MaaS API | RWxxxxxxxx0Gd

Based on the example above, the final request path for the Quick Transcription API is:

https://genaiapi.cloudsway.net/v1/ai/LPUqHEAjfonOmohV/speechtotext/transcriptions:transcribe?api-version=2024-11-15

Request Method

POST

Request Path

{basePath}/v1/ai/{endpointPath}/speechtotext/transcriptions:transcribe?api-version=2024-11-15

Request Header

Parameter | Required | Description
Authorization | Yes | AccessKey, sent as Bearer ${AccessKey} (for example, Bearer RWxxxxxxxx0Gd)

Query Parameters

Parameter | Required | Description
api-version | Yes | Fixed value: 2024-11-15

Request Form Data

Parameter | Required | Type | Description
audio | Yes | Audio file | Audio file to transcribe
definition | No | JSON string | Configuration options

definition

Parameter Required Description
channels No List of zero-based indices of channels to be transcribed separately. Unless diarization is enabled, a maximum of two channels is supported. By default, the Quick Transcription API merges all input channels into a single channel before transcription. If you don't want this, you can transcribe each channel independently. 

For stereo audio files, specify [0,1], [0], or [1] to transcribe the selected channels separately. Otherwise, stereo audio will be merged into mono and only a single channel will be transcribed. 

If the audio is stereo and diarization is enabled, the channels attribute cannot be set to [0,1]. The speech service does not support diarization for multiple channels. 

For mono audio, the channels attribute is ignored, and the audio is always transcribed as mono.
diarization No Diarization configuration. Diarization is the process of identifying and separating speakers in a single audio channel. For example, specify "diarization": {"maxSpeakers": 2, "enabled": true}. The transcription file will then include a speaker entry for each transcribed phrase (e.g., "speaker": 0 or "speaker": 1).
locales No, but recommended if you know the expected language The list of languages should match the expected languages of the audio data to be transcribed. 

If you know the language setting of the audio file, specifying it can improve transcription accuracy and minimize latency. If a single language is specified, that language will be used for transcription. 

However, if you are unsure of the language used, you can specify multiple languages. The more precise the candidate language list, the more accurate the language recognition might be. 

If no language is specified or the specified language is not present in the audio file, the speech service will attempt to recognize the language. If it cannot recognize the language, an error will be returned. 

Supported language settings include: de-DE, en-IN, en-US, es-ES, es-MX, fr-FR, hi-IN, it-IT, ja-JP, ko-KR, pt-BR, and zh-CN.
profanityFilterMode No Specifies how to handle profanity in the recognition results. Accepted values are None (disable profanity filtering), Masked (replace profanity with asterisks), Removed (remove all profanity from the results), or Tags (add profanity tags). The default value is Masked.

Request Example

curl --request POST \
  --url 'https://genaiapi.cloudsway.net/v1/ai/qyBrSaFJYTUwsWcM/speechtotext/transcriptions:transcribe?api-version=2024-11-15' \
  --header 'Authorization: Bearer ${AccessKey}' \
  --header 'Content-Type: multipart/form-data' \
  --form 'audio=@path/to/your/audio/file' \
  --form 'definition={
    "channels": [0],
    "locales": ["zh-CN"],
    "diarization": {
      "maxSpeakers": 2,
      "enabled": true
    },
    "profanityFilterMode": "Masked"
  }'
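
The definition field is a JSON string, so it can help to build it programmatically. The sketch below assembles it from the options described above and rejects the one combination the documentation rules out (channels [0,1] with diarization enabled). The function name build_definition and its argument names are hypothetical.

```python
import json


def build_definition(locales=None, channels=None, max_speakers=None,
                     profanity_filter_mode="Masked"):
    """Return the JSON string for the "definition" form field."""
    allowed_modes = {"None", "Masked", "Removed", "Tags"}
    if profanity_filter_mode not in allowed_modes:
        raise ValueError(f"profanityFilterMode must be one of {sorted(allowed_modes)}")

    # In this sketch, diarization is enabled whenever max_speakers is given.
    diarization_enabled = max_speakers is not None
    # The documentation states that [0,1] cannot be combined with diarization.
    if diarization_enabled and channels is not None and sorted(channels) == [0, 1]:
        raise ValueError("channels [0,1] cannot be used when diarization is enabled")

    definition = {"profanityFilterMode": profanity_filter_mode}
    if locales:
        definition["locales"] = list(locales)
    if channels is not None:
        definition["channels"] = list(channels)
    if diarization_enabled:
        definition["diarization"] = {"maxSpeakers": max_speakers, "enabled": True}
    return json.dumps(definition)


# Same options as the curl example.
definition = build_definition(locales=["zh-CN"], channels=[0], max_speakers=2)
print(definition)
```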

Response

Field Name | Type | Description
durationMilliseconds | Integer | Total duration of the audio file in milliseconds.
combinedPhrases | Array | List of combined phrases.
phrases | Array | Detailed information for each phrase.

combinedPhrases

Field Name | Type | Description
text | String | Combined phrase text.

phrases

Field Name | Type | Description
speaker | String | Speaker identifier.
offsetMilliseconds | Integer | Offset of the phrase in the audio, in milliseconds.
durationMilliseconds | Integer | Duration of the phrase in milliseconds.
text | String | Text of the phrase.
words | Array | Detailed information for each word in the phrase.
locale | String | Locale identifier of the phrase.
confidence | Float | Confidence score of the phrase recognition.

words

Field Name | Type | Description
text | String | Text of the word.
offsetMilliseconds | Integer | Offset of the word in the audio, in milliseconds.
durationMilliseconds | Integer | Duration of the word in milliseconds.

Response Example

{
    "durationMilliseconds": 1920,
    "combinedPhrases": [
        {
            "text": "Hello,我是谁啊?"
        }
    ],
    "phrases": [
        {
            "speaker": null,
            "offsetMilliseconds": 160,
            "durationMilliseconds": 1440,
            "text": "Hello,我是谁啊?",
            "words": [
                {
                    "text": "Hello,",
                    "offsetMilliseconds": 160,
                    "durationMilliseconds": 560
                },
                {
                    "text": "我",
                    "offsetMilliseconds": 720,
                    "durationMilliseconds": 240
                },
                {
                    "text": "是",
                    "offsetMilliseconds": 960,
                    "durationMilliseconds": 160
                },
                {
                    "text": "谁",
                    "offsetMilliseconds": 1120,
                    "durationMilliseconds": 240
                },
                {
                    "text": "啊?",
                    "offsetMilliseconds": 1360,
                    "durationMilliseconds": 240
                }
            ],
            "locale": "zh-CN",
            "confidence": 0.7978613
        }
    ]
}
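
To illustrate how the nested response fields relate, the sketch below walks phrases and words and derives an end time for each word from its offset and duration. The helper names word_timeline and full_text are hypothetical, and the sample is a shortened copy of the response above.

```python
def word_timeline(response):
    """Yield (text, start_ms, end_ms) for every word in every phrase."""
    for phrase in response.get("phrases", []):
        for word in phrase.get("words", []):
            start = word["offsetMilliseconds"]
            yield word["text"], start, start + word["durationMilliseconds"]


def full_text(response):
    """Join the text of all combined phrases."""
    return " ".join(p["text"] for p in response.get("combinedPhrases", []))


# Shortened copy of the response example (first two words only).
sample = {
    "durationMilliseconds": 1920,
    "combinedPhrases": [{"text": "Hello,我是谁啊?"}],
    "phrases": [
        {
            "speaker": None,
            "offsetMilliseconds": 160,
            "durationMilliseconds": 1440,
            "text": "Hello,我是谁啊?",
            "words": [
                {"text": "Hello,", "offsetMilliseconds": 160, "durationMilliseconds": 560},
                {"text": "我", "offsetMilliseconds": 720, "durationMilliseconds": 240},
            ],
            "locale": "zh-CN",
            "confidence": 0.7978613,
        }
    ],
}

for text, start_ms, end_ms in word_timeline(sample):
    print(f"{start_ms:>5} to {end_ms:>5} ms  {text}")
```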

Supported Audio Files

Maximum file size: 25 MB

  • WAV
  • MP3
  • OPUS/OGG
  • FLAC
  • WMA
  • AAC
  • ALAW in WAV container
  • MULAW in WAV container
  • AMR
  • WebM
  • M4A
  • SPEEX

MaaS-Arealtime-asr

Request Protocol

HTTP

Request Header

Parameter | Type | Description
Authorization | string | Authentication token, sent as Bearer ${AccessKey}

Request Path

https://genaiapi.cloudsway.net/v1/ai/{endpointPath}/audio/recognize

Request FormData

Parameter | Type | Description
file | file | Audio file to be recognized; maximum duration 30 seconds
recognitionLanguages | string | Candidate languages of the audio, separated by commas. Example: en-US,es-MX
timeout | string | Speech interval timeout

Request Example

curl --location 'https://genaiapi.cloudsway.net/v1/ai/QEnOdgDqcLVKmTCP/audio/recognize' \
--header 'Authorization: Bearer ${AccessKey}' \
--header 'Connection: keep-alive' \
--form 'file=@path/to/your/audio/file' \
--form 'recognitionLanguages="en-US,es-MX"' \
--form 'timeout="100"'

Response

Parameter | Type | Description
text | string | Recognized text
language | string | Recognized language of the audio
duration | int | Audio duration in 100-nanosecond units
durationInSeconds | int | Audio duration in seconds

Example

{
    "text": "Cuando abrió los ojos por la mañana fue porque una joven empleada doméstica había entrado en su habitación para encender el fuego.",
    "language": "es-MX",
    "duration": 65200000,
    "durationInSeconds": 7
}
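
The relationship between duration and durationInSeconds can be checked with a short calculation. This sketch assumes that duration is expressed in 100-nanosecond units, as documented for the MaaS-ASpeech-Translation response; rounding up is an inference from the example values, not a documented rule.

```python
import math

TICKS_PER_SECOND = 10_000_000  # assumption: one tick is 100 nanoseconds


def ticks_to_seconds(duration: int) -> float:
    """Convert a duration in 100-nanosecond units to seconds."""
    return duration / TICKS_PER_SECOND


seconds = ticks_to_seconds(65200000)   # "duration" from the example above
rounded_up = math.ceil(seconds)        # matches "durationInSeconds": 7
print(seconds, rounded_up)
```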

Additional Information

Supported Languages

Language Locale (BCP-47)
Arabic ar-AE, ar-BH, ar-DZ, ar-EG, ar-IQ, ar-JO, ar-KW, ar-LY, ar-MA, ar-OM, ar-QA, ar-SA, ar-SY, ar-YE
Danish da-DK
Dutch nl-NL
English en-AU
Estonian et-EE
Finnish fi-FI
French fr-CA, fr-FR
German de-DE
Greek el-GR
Gujarati gu-IN
Hebrew he-IL
Hindi hi-IN
Hungarian hu-HU
Indonesian id-ID
Bengali bn-IN
Bulgarian bg-BG
Catalan ca-ES
Chinese zh-CN, zh-HK, zh-TW
Croatian hr-HR
Czech cs-CZ
Irish ga-IE
Italian it-IT
Japanese ja-JP
Kannada kn-IN
Malayalam ml-IN
Korean ko-KR
Latvian lv-LV
Lithuanian lt-LT
Maltese mt-MT
Marathi mr-IN
Norwegian nb-NO
Polish pl-PL
Portuguese pt-BR, pt-PT
Romanian ro-RO
Russian ru-RU
Slovak sk-SK
Slovenian sl-SI
Spanish es-AR, es-BO, es-CL, es-CO, es-CR, es-CU, es-DO, es-EC, es-SV, es-GQ, es-GT, es-HN, es-MX, es-NI, es-PA, es-PY, es-PE, es-PR, es-ES, es-UY, es-US, es-VE
Swedish sv-SE
Tamil ta-IN
Telugu te-IN
Thai th-TH
Turkish tr-TR
Ukrainian uk-UA
Vietnamese vi-VN

MaaS-ASpeech-Translation

Request Method

POST

Request Path

{basePath}/v1/ai/{endpointPath}/audio/realtime/translation

Request Header

Parameter | Description | Example
Authorization | AccessKey, sent as Bearer ${AccessKey} | Bearer RWxxxxxxxx0Gd
Host | Host address of the service | genaiapi.cloudsway.net

Request Parameters

Field Name | Type | Required | Description | Example Value
targetLanguages | String | Yes | Comma-separated list of target translation languages | "en-US,ja"
file | File | Yes | Audio file of up to 30 seconds. For longer files, only the first 30 seconds are transcribed and translated | "C:\Users\zhcn_continuous_mode_sample.wav"
recognitionLanguages | String | Yes | Comma-separated list of recognition languages. When more than one is provided, hasRecognize must be enabled; otherwise, only the first is recognized | "zh-CN,en-US"
hasRecognize | String | No | Whether recognition is required; defaults to false | "true"
SegmentationSilenceTimeoutMs | String | No | Segmentation silence timeout in milliseconds; defaults to 2000 | "1000"

Return Values

Field Name | Type | Description | Example Value
text | String | Original text | "Good morning, Steve. Good morning, Katie. ..."
translations | Object | Translation results, keyed by target language | {"ja": "おはようございます、スティーブ。おはようございます、ケイティ。..."}
language | String | Original language of the audio | "en-US"
duration | Integer | Audio duration in 100-nanosecond units | 286400000
resultId | String | Unique identifier for the task result | "5518458c7dec4003b9281662d9c763a7"
durationInSeconds | Integer | Audio duration in seconds | 29

Example

Request

curl --location --request POST 'https://genaiapi.cloudsway.net/v1/ai/YAzGCqDxSYQFlYie/audio/realtime/translation' \
--header 'Authorization: Bearer ${AccessKey}' \
--header 'Connection: keep-alive' \
--form 'targetLanguages="en-US,ja"' \
--form 'file=@path/to/your/audio/file' \
--form 'recognitionLanguages="zh-CN,en-US"' \
--form 'hasRecognize="true"' \
--form 'SegmentationSilenceTimeoutMs="1000"'
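
The text form fields in this request follow a few rules from the parameter table: language lists are sent as comma-separated strings, and hasRecognize must be enabled when more than one recognition language is given. The sketch below encodes those rules. The function name translation_form_fields is hypothetical, and the audio file would still be attached separately as the file field.

```python
def translation_form_fields(target_languages, recognition_languages,
                            segmentation_silence_timeout_ms=None):
    """Return the text form fields for the translation request."""
    if not target_languages:
        raise ValueError("targetLanguages is required")
    if not recognition_languages:
        raise ValueError("recognitionLanguages is required")
    fields = {
        "targetLanguages": ",".join(target_languages),
        "recognitionLanguages": ",".join(recognition_languages),
        # With several candidate languages, hasRecognize must be enabled;
        # otherwise only the first language would be recognized.
        "hasRecognize": "true" if len(recognition_languages) > 1 else "false",
    }
    if segmentation_silence_timeout_ms is not None:
        fields["SegmentationSilenceTimeoutMs"] = str(segmentation_silence_timeout_ms)
    return fields


# Same values as the curl example.
fields = translation_form_fields(["en-US", "ja"], ["zh-CN", "en-US"], 1000)
print(fields)
```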

Return Values

{
  "text": "秋天总是那么那么富有诗意,树叶渐渐变红街道旁的银杏树开始落叶,人们穿上厚重的外套,享受着凉爽的秋风。黄昏时分,夕阳洒在街道上,给忙碌的一天增添了一抹温暖。无论是散步还是小憩,这个季节总能带来宁静和满足。",
  "translations": {
    "en-US": "Autumn is always so poetic, the leaves are turning red, the ginkgo trees along the streets are starting to lose their leaves, and people are wearing heavy coats and enjoying the cool autumn breeze. At dusk, the setting sun shines on the streets, adding a touch of warmth to a busy day. Whether it's a walk or a nap, this season always brings tranquility and fulfillment.",
    "ja": "秋はいつもとても詩的で、葉は赤く色づき、通り沿いのイチョウの木は葉を失い始め、人々は厚手のコートを着て涼しい秋の風を楽しんでいます。 夕暮れ時には、夕日が通りを照らし、忙しい一日に暖かさを加えます。 散歩でも昼寝でも、この季節はいつも静けさと充実感をもたらします。"
  },
  "language": "zh-CN",
  "duration": 260400000,
  "resultId": "ad03ee3a708e435dbe0ee808bb68f918",
  "durationInSeconds": 27
}
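
Because translations is an object keyed by target language, a caller typically looks up one language and handles the case where it is missing. The following is a minimal sketch with a hypothetical helper name, pick_translation, and a sample shortened from the response above.

```python
def pick_translation(response, target_language):
    """Return the translated text for one target language."""
    translations = response.get("translations") or {}
    if target_language not in translations:
        available = ", ".join(sorted(translations)) or "none"
        raise KeyError(f"no translation for {target_language}; available: {available}")
    return translations[target_language]


# Shortened sample: only the opening clause of each text is kept.
sample = {
    "text": "秋天总是那么那么富有诗意",
    "translations": {
        "en-US": "Autumn is always so poetic",
        "ja": "秋はいつもとても詩的で",
    },
    "language": "zh-CN",
    "duration": 260400000,
    "resultId": "ad03ee3a708e435dbe0ee808bb68f918",
    "durationInSeconds": 27,
}

print(pick_translation(sample, "en-US"))
```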