ASR API

Version History

Version	Date	Changes
v1.1	2024-12-03	Added three product API documents: MaaS-AFast-asr, MaaS-Arealtime-asr, and MaaS-ASpeech-Translation
v1.0	2024-08-29	Initial release

MaaS Whisper

Public Information

Parameter	Description	Example
basePath	The base path for invoking the MaaS API, including the fixed path /v1/ai	https://genaiapi.cloudsway.net/v1/ai
endpointPath	The randomly generated path segment of the MaaS API	RkBOAlaWzKcubSji
AccessKey	The AccessKey for invoking the MaaS API	RWxxxxxxxx0Gd

Based on the example above, the final path for requesting the Voice-to-Text interface is https://genaiapi.cloudsway.net/v1/ai/RkBOAlaWzKcubSji/audio/transcriptions

Request Method

POST

Request Path

{basePath}/{endpointPath}/audio/transcriptions
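As a minimal sketch of how the request path is assembled from the values in Public Information, the helper below joins basePath and endpointPath. The function name `build_transcription_url` is hypothetical and not part of the API.

```python
def build_transcription_url(base_path: str, endpoint_path: str) -> str:
    """Join basePath and endpointPath into the Whisper transcription URL."""
    # Strip stray slashes so the joined URL never contains "//" in its path.
    base = base_path.rstrip("/")
    endpoint = endpoint_path.strip("/")
    return f"{base}/{endpoint}/audio/transcriptions"


if __name__ == "__main__":
    # Values taken from the Public Information table; basePath already ends in /v1/ai.
    print(build_transcription_url("https://genaiapi.cloudsway.net/v1/ai", "RkBOAlaWzKcubSji"))
```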

Request Header

Parameter	Description	Example
Authorization	Access key, sent as Bearer ${AccessKey}	Bearer RWxxxxxxxx0Gd

Request Body

Parameter	Type	Required	Description	Example
file	File	Yes	Audio file in a format such as mp3, mp4, mpeg, mpga, m4a, wav, or webm; maximum file size 25 MB
prompt	String	No	Optional prompt text for the model	"Generate a video of a sunset over the ocean."
response_format	String	No	The format in which the model returns the result	json, verbose_json
temperature	String	No	Temperature, a value between 0 and 1
language	String	No	The language of the audio file	"en" (English), "zh" (Chinese), "es" (Spanish), etc.
timestamp_granularities	String	No	The granularity of the timestamp	"none": no timestamp. "word": timestamp for each word. "sentence": timestamp for each sentence.
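A client may want to check a file before uploading it. The sketch below is one way to do that under stated assumptions: the extension set contains only the formats named in the table (which the table presents as examples, so the service may accept others), the 25M limit is read as 25 x 1024 x 1024 bytes, and `check_audio_file` is a hypothetical helper name.

```python
import os

# Assumption: only the formats named in the table; the service may accept more.
KNOWN_EXTENSIONS = {"mp3", "mp4", "mpeg", "mpga", "m4a", "wav", "webm"}
# Assumption: the documented 25M limit means 25 MiB.
MAX_BYTES = 25 * 1024 * 1024


def check_audio_file(filename: str, size_bytes: int) -> list[str]:
    """Return a list of problems found; an empty list means the file looks acceptable."""
    problems = []
    extension = os.path.splitext(filename)[1].lstrip(".").lower()
    if extension not in KNOWN_EXTENSIONS:
        problems.append(f"unrecognized extension: {extension or '(none)'}")
    if size_bytes > MAX_BYTES:
        problems.append(f"file too large: {size_bytes} bytes")
    return problems


if __name__ == "__main__":
    print(check_audio_file("meeting.wav", 3_000_000))
    print(check_audio_file("notes.txt", 30 * 1024 * 1024))
```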

Response

Parameter	Type	Description	Example
text	String	Transcribed text of the audio

Example

Request

curl --request POST \
--url https://genaiapi.cloudsway.net/v1/ai/RkBOAlaWzKcubSji/audio/transcriptions \
--header 'Accept: */*' \
--header "Authorization: Bearer ${AccessKey}" \
--header 'Connection: keep-alive' \
--header 'content-type: multipart/form-data' \
--form 'prompt=A poetic description of early morning, including words like dawn, quiet, mist, and possibility' \
--form response_format=verbose_json \
--form temperature=0.1 \
--form language=en \
--form timestamp_granularities=none \
--form 'file=@xx.wav'
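For clients that cannot shell out to curl, the same multipart request can be built with the Python standard library alone. The following is a sketch, not a reference client: `encode_multipart` and `build_request` are hypothetical helper names, and the part content type `application/octet-stream` is an assumption.

```python
import urllib.request
import uuid


def encode_multipart(fields: dict[str, str], file_field: str,
                     filename: str, file_bytes: bytes) -> tuple[bytes, str]:
    """Encode text fields and one file as a multipart/form-data body."""
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"'
            f"\r\n\r\n{value}\r\n".encode("utf-8")
        )
    # File part; the generic content type is an assumption.
    header = (
        f'--{boundary}\r\nContent-Disposition: form-data; name="{file_field}"; '
        f'filename="{filename}"\r\nContent-Type: application/octet-stream\r\n\r\n'
    )
    parts.append(header.encode("utf-8") + file_bytes + b"\r\n")
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def build_request(url: str, access_key: str, fields: dict[str, str],
                  filename: str, file_bytes: bytes) -> urllib.request.Request:
    """Build (but do not send) the POST request for the transcription endpoint."""
    body, content_type = encode_multipart(fields, "file", filename, file_bytes)
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Authorization": f"Bearer {access_key}", "Content-Type": content_type},
    )


if __name__ == "__main__":
    request = build_request(
        "https://genaiapi.cloudsway.net/v1/ai/RkBOAlaWzKcubSji/audio/transcriptions",
        "RWxxxxxxxx0Gd",  # placeholder key from the Public Information table
        {"response_format": "verbose_json", "temperature": "0.1", "language": "en"},
        "xx.wav",
        b"RIFF....",  # stand-in bytes; read the real file in practice
    )
    print(request.get_method(), request.full_url)
```

Passing the returned object to `urllib.request.urlopen` sends it; the response body can then be decoded with `json.loads`.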

Response

{
  "text": "In this ancient town, plum blossoms bloom silently. The white petals are like snow, falling on the branches, welcoming the cold winter. The fragrance of the flowers is elegant, and it touches the heart, as if it is the scent of time. In this ancient town, plum blossoms bloom silently. Every plum blossom is a small miracle, which blooms in the coldness of life. They are not afraid of the cold, they are firm, symbolizing hope and rebirth. The blooming of plum blossoms is like the praise of nature for life, warming everyone's heart. Each blossom is a small miracle, symbolizing hope and rebirth. Standing under the plum trees, it is as if you can hear the rain of years. Flowers bloom and fall, spring and autumn come. Plum blossoms witness the turning of time, and witness people's joy and sorrow. They are the guardians of memory, quietly preserving the story of this town. Standing under the plum tree, one can almost hear the whispers of time. Plum blossoms are not just a plant, but also a spiritual symbol. It teaches us to keep hope in adversity, to find warmth in the cold winter. Every year's blooming is a praise of life, a hope for the future. Plum blossoms teach us to keep hope alive in adversity. Let's cherish the beauty before us and embrace every moment of life bravely. Let's cherish the beauty before us and embrace every moment of life bravely."
}

MaaS-AFast-asr

Public Information

Parameter	Description	Example
basePath	Base path for MaaS API	https://genaiapi.cloudsway.net
endpointPath	Random path segment generated for MaaS API	LPUqHEAjfonOmohV
AccessKey	Access key for MaaS API	RWxxxxxxxx0Gd

Based on the example above, the final request path for the Quick Transcription API is:

https://genaiapi.cloudsway.net/v1/ai/LPUqHEAjfonOmohV/speechtotext/transcriptions:transcribe?api-version=2024-11-15

Request Method

POST

Request Path

{basePath}/v1/ai/{endpointPath}/speechtotext/transcriptions:transcribe?api-version=2024-11-15

Request Header

Parameter	Required	Description
Authorization	Yes	Access key, sent as Bearer ${AccessKey}. Example: Bearer RWxxxxxxxx0Gd

Query Parameters

Parameter	Required	Description
api-version	Yes	Fixed value: 2024-11-15
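The request path and the fixed query parameter can be combined as sketched below. Unlike the Whisper section, basePath here does not include /v1/ai, so the helper adds it. The name `build_fast_asr_url` is hypothetical.

```python
from urllib.parse import urlencode

API_VERSION = "2024-11-15"  # fixed value from the Query Parameters table


def build_fast_asr_url(base_path: str, endpoint_path: str) -> str:
    """Build the Quick Transcription URL, including the api-version query string."""
    path = (
        f"{base_path.rstrip('/')}/v1/ai/{endpoint_path.strip('/')}"
        "/speechtotext/transcriptions:transcribe"
    )
    return f"{path}?{urlencode({'api-version': API_VERSION})}"


if __name__ == "__main__":
    print(build_fast_asr_url("https://genaiapi.cloudsway.net/", "LPUqHEAjfonOmohV"))
```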

Request Form Data

Parameter	Required	Type	Description
audio	Yes	Audio file	Audio file
definition	No	JSON string	Configuration options

definition

Parameter	Required	Description
channels	No	List of zero-based indices of the channels to transcribe separately. Unless diarization is enabled, a maximum of two channels is supported. By default, the Quick Transcription API merges all input channels into a single channel before transcription. To transcribe the channels of a stereo audio file independently, specify `[0,1]`, `[0]`, or `[1]`; otherwise the stereo audio is merged into mono and transcribed as a single channel. If the audio is stereo and diarization is enabled, `channels` cannot be set to `[0,1]`, because the speech service does not support diarization across multiple channels. For mono audio, `channels` is ignored and the audio is always transcribed as mono.
diarization	No	Diarization configuration. Diarization is the process of identifying and separating speakers in a single audio channel. For example, specify `"diarization": {"maxSpeakers": 2, "enabled": true}`. The transcription file will then include a `speaker` entry for each transcribed phrase (e.g., `"speaker": 0` or `"speaker": 1`).
locales	No, but recommended if you know the expected language	List of locales that should match the expected languages of the audio to be transcribed. If you know the locale of the audio file, specifying it can improve transcription accuracy and minimize latency. If a single locale is specified, it is used for transcription. If you are unsure of the language, you can specify multiple candidate locales; the more precise the candidate list, the more accurate language recognition is likely to be. If no locale is specified, or the specified locale is not present in the audio file, the speech service attempts to recognize the language and returns an error if it cannot. Supported locales: de-DE, en-IN, en-US, es-ES, es-MX, fr-FR, hi-IN, it-IT, ja-JP, ko-KR, pt-BR, and zh-CN.
profanityFilterMode	No	Specifies how to handle profanity in the recognition results. Accepted values are `None` (disable profanity filtering), `Masked` (replace profanity with asterisks), `Removed` (remove all profanity from the results), or `Tags` (add profanity tags). The default value is `Masked`.
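The constraints described in the table can be checked on the client before the `definition` form field is serialized. The sketch below is one reading of those rules, not the service's own validation. `build_definition` is a hypothetical helper, and it assumes diarization is requested whenever `max_speakers` is given.

```python
import json
from typing import Optional, Sequence

PROFANITY_MODES = {"None", "Masked", "Removed", "Tags"}


def build_definition(locales: Optional[Sequence[str]] = None,
                     channels: Optional[Sequence[int]] = None,
                     max_speakers: Optional[int] = None,
                     profanity_filter_mode: Optional[str] = None) -> str:
    """Return the JSON string for the `definition` form field."""
    definition: dict = {}
    diarization_enabled = max_speakers is not None
    if channels is not None:
        # The table rules out [0,1] together with diarization.
        if diarization_enabled and sorted(channels) == [0, 1]:
            raise ValueError("channels cannot be [0,1] when diarization is enabled")
        # Without diarization, at most two channels are supported.
        if not diarization_enabled and len(channels) > 2:
            raise ValueError("at most two channels are supported")
        definition["channels"] = list(channels)
    if locales:
        definition["locales"] = list(locales)
    if diarization_enabled:
        definition["diarization"] = {"maxSpeakers": max_speakers, "enabled": True}
    if profanity_filter_mode is not None:
        if profanity_filter_mode not in PROFANITY_MODES:
            raise ValueError(f"unknown profanityFilterMode: {profanity_filter_mode}")
        definition["profanityFilterMode"] = profanity_filter_mode
    return json.dumps(definition)


if __name__ == "__main__":
    print(build_definition(locales=["zh-CN"], channels=[0], max_speakers=2,
                           profanity_filter_mode="Masked"))
```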

Request Example

curl --request POST \
  --url 'https://genaiapi.cloudsway.net/v1/ai/qyBrSaFJYTUwsWcM/speechtotext/transcriptions:transcribe?api-version=2024-11-15' \
  --header "Authorization: Bearer ${AccessKey}" \
  --header 'Content-Type: multipart/form-data' \
  --form 'audio=@path/to/your/audio/file' \
  --form 'definition={
    "channels": [0],
    "locales": ["zh-CN"],
    "diarization": {
      "maxSpeakers": 2,
      "enabled": true
    },
    "profanityFilterMode": "Masked"
  }'

Response

Field Name	Type	Description
durationMilliseconds	Integer	Total duration of the audio file in milliseconds.
combinedPhrases	Array	List of combined phrases.
phrases	Array	Detailed information of each phrase.

combinedPhrases

Field Name	Type	Description
text	String	Combined phrase text.

phrases

Field Name	Type	Description
speaker	Integer	Speaker identifier (for example, 0 or 1) when diarization is enabled; can be null, as in the response example.
offsetMilliseconds	Integer	Offset of the phrase in the audio in milliseconds.
durationMilliseconds	Integer	Duration of the phrase in milliseconds.
text	String	Text of the phrase.
words	Array	Detailed information of each word in the phrase.
locale	String	Locale identifier of the phrase.
confidence	Float	Confidence score of the phrase recognition.

words

Field Name	Type	Description
text	String	Text of the word.
offsetMilliseconds	Integer	Offset of the word from the start of the audio, in milliseconds.
durationMilliseconds	Integer	Duration of the word in milliseconds.

Response Example

{
    "durationMilliseconds": 1920,
    "combinedPhrases": [
        {
            "text": "Hello，我是谁啊？"
        }
    ],
    "phrases": [
        {
            "speaker": null,
            "offsetMilliseconds": 160,
            "durationMilliseconds": 1440,
            "text": "Hello，我是谁啊？",
            "words": [
                {
                    "text": "Hello，",
                    "offsetMilliseconds": 160,
                    "durationMilliseconds": 560
                },
                {
                    "text": "我",
                    "offsetMilliseconds": 720,
                    "durationMilliseconds": 240
                },
                {
                    "text": "是",
                    "offsetMilliseconds": 960,
                    "durationMilliseconds": 160
                },
                {
                    "text": "谁",
                    "offsetMilliseconds": 1120,
                    "durationMilliseconds": 240
                },
                {
                    "text": "啊？",
                    "offsetMilliseconds": 1360,
                    "durationMilliseconds": 240
                }
            ],
            "locale": "zh-CN",
            "confidence": 0.7978613
        }
    ]
}
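A common use of this response is to turn the word entries into start and end times, for example for subtitles. The sketch below flattens `phrases[].words[]` into tuples. `word_timings` is a hypothetical helper, and the sample is a shortened copy of the response above.

```python
def word_timings(response: dict) -> list[tuple[str, int, int]]:
    """Return (text, start_ms, end_ms) for every word in every phrase."""
    timings = []
    for phrase in response.get("phrases", []):
        for word in phrase.get("words", []):
            start = word["offsetMilliseconds"]
            end = start + word["durationMilliseconds"]
            timings.append((word["text"], start, end))
    return timings


# Shortened copy of the response example (first two words only).
sample = {
    "durationMilliseconds": 1920,
    "phrases": [
        {
            "offsetMilliseconds": 160,
            "durationMilliseconds": 1440,
            "words": [
                {"text": "Hello，", "offsetMilliseconds": 160, "durationMilliseconds": 560},
                {"text": "我", "offsetMilliseconds": 720, "durationMilliseconds": 240},
            ],
        }
    ],
}

if __name__ == "__main__":
    for text, start_ms, end_ms in word_timings(sample):
        print(f"{start_ms:>5} ms to {end_ms:>5} ms  {text}")
```

In the example, the first word's offset (160) equals the phrase's offset, which suggests word offsets are measured from the start of the audio rather than from the start of the phrase.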

Supported Audio Files

Maximum file size: 25 MB

WAV
MP3
OPUS/OGG
FLAC
WMA
AAC
ALAW in WAV container
MULAW in WAV container
AMR
WebM
M4A
SPEEX

MaaS-Arealtime-asr

Request Protocol

HTTP

Request Header

Parameter	Type	Description
Authorization	string	Access key, sent as Bearer ${AccessKey}

Request Path

https://genaiapi.cloudsway.net/v1/ai/{endpointPath}/audio/recognize

Request FormData

Parameter	Type	Description
file	file	Audio file to be recognized, maximum duration of 30 seconds
recognitionLanguages	string	Possible languages of the audio, separated by commas. Example: en-US,es-MX
timeout	string	Speech interval timeout

Request Example

curl --location 'https://genaiapi.cloudsway.net/v1/ai/QEnOdgDqcLVKmTCP/audio/recognize' \
--header "Authorization: Bearer ${AccessKey}" \
--header 'Connection: keep-alive' \
--form 'file=@path/to/your/audio/file' \
--form 'recognitionLanguages="en-US,es-MX"' \
--form 'timeout="100"'

Response

Parameter	Type	Description
text	string	Recognized audio text
language	string	Recognized audio language
duration	int	Audio duration in 100-nanosecond units
durationInSeconds	int	Audio duration in seconds

Example

{
    "text": "Cuando abrió los ojos por la mañana fue porque una joven empleada doméstica había entrado en su habitación para encender el fuego.",
    "language": "es-MX",
    "duration": 65200000,
    "durationInSeconds": 7
}
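The example values are consistent with `duration` being counted in 100-nanosecond units, which is how the MaaS-ASpeech-Translation section describes its own `duration` field. Under that assumption, the sketch below converts the value to seconds. The rounding-up behavior of `durationInSeconds` is inferred from the examples only and is not stated by the API.

```python
TICKS_PER_SECOND = 10_000_000  # assumption: one unit of `duration` is 100 nanoseconds


def ticks_to_seconds(ticks: int) -> float:
    """Convert a `duration` value to fractional seconds."""
    return ticks / TICKS_PER_SECOND


def ticks_to_whole_seconds(ticks: int) -> int:
    """Round up to whole seconds using integer ceiling division."""
    return -(-ticks // TICKS_PER_SECOND)


if __name__ == "__main__":
    response = {"duration": 65200000, "durationInSeconds": 7}
    print(ticks_to_seconds(response["duration"]))
    # Compare the rounded-up value with the field returned by the service.
    print(ticks_to_whole_seconds(response["duration"]) == response["durationInSeconds"])
```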

Additional Information

Supported Languages

Language	Locale (BCP-47)
Arabic	ar-AE, ar-BH, ar-DZ, ar-EG, ar-IQ, ar-JO, ar-KW, ar-LY, ar-MA, ar-OM, ar-QA, ar-SA, ar-SY, ar-YE
Danish	da-DK
Dutch	nl-NL
English	en-AU
Estonian	et-EE
Finnish	fi-FI
French	fr-CA, fr-FR
German	de-DE
Greek	el-GR
Gujarati	gu-IN
Hebrew	he-IL
Hindi	hi-IN
Hungarian	hu-HU
Indonesian	id-ID
Bengali	bn-IN
Bulgarian	bg-BG
Catalan	ca-ES
Chinese	zh-CN, zh-HK, zh-TW
Croatian	hr-HR
Czech	cs-CZ
Irish	ga-IE
Italian	it-IT
Japanese	ja-JP
Kannada	kn-IN
Malayalam	ml-IN
Korean	ko-KR
Latvian	lv-LV
Lithuanian	lt-LT
Maltese	mt-MT
Marathi	mr-IN
Norwegian	nb-NO
Polish	pl-PL
Portuguese	pt-BR, pt-PT
Romanian	ro-RO
Russian	ru-RU
Slovak	sk-SK
Slovenian	sl-SI
Spanish	es-AR, es-BO, es-CL, es-CO, es-CR, es-CU, es-DO, es-EC, es-SV, es-GQ, es-GT, es-HN, es-MX, es-NI, es-PA, es-PY, es-PE, es-PR, es-ES, es-UY, es-US, es-VE
Swedish	sv-SE
Tamil	ta-IN
Telugu	te-IN
Thai	th-TH
Turkish	tr-TR
Ukrainian	uk-UA
Vietnamese	vi-VN

MaaS-ASpeech-Translation

Request Method

POST

Request Path

{basePath}/v1/ai/{endpointPath}/audio/realtime/translation

Request Header

Parameter	Description	Example
Authorization	Access key, sent as Bearer ${AccessKey}	Bearer RWxxxxxxxx0Gd
Host	Host address of the service	genaiapi.cloudsway.net

Request Parameters

Field Name	Type	Required	Description	Example Value
targetLanguages	String	Yes	Comma-separated list of target translation languages	"en-US,ja"
file	File	Yes	Audio file of up to 30 seconds. For longer files, only the first 30 seconds are transcribed and translated	"C:\Users\zhcn_continuous_mode_sample.wav"
recognitionLanguages	String	Yes	Comma-separated list of recognition languages. When more than one is provided, hasRecognize must be enabled; otherwise only the first is recognized	"zh-CN,en-US"
hasRecognize	String	No	Whether recognition is required; defaults to false	"true"
SegmentationSilenceTimeoutMs	String	No	Segmentation silence timeout setting (in milliseconds), defaults to 2000	"1000"
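The interaction between recognitionLanguages and hasRecognize is easy to get wrong, so a client can derive the flag from the language list. The sketch below builds the text form fields that way. `build_translation_fields` is a hypothetical helper, and the comma-separated encoding follows the request example in this section.

```python
from typing import Optional, Sequence


def build_translation_fields(target_languages: Sequence[str],
                             recognition_languages: Sequence[str],
                             segmentation_silence_timeout_ms: Optional[int] = None) -> dict[str, str]:
    """Build the text form fields for the speech translation request (the file is sent separately)."""
    if not target_languages:
        raise ValueError("targetLanguages is required")
    if not recognition_languages:
        raise ValueError("recognitionLanguages is required")
    fields = {
        "targetLanguages": ",".join(target_languages),
        "recognitionLanguages": ",".join(recognition_languages),
    }
    # With several candidate languages, hasRecognize must be enabled;
    # otherwise only the first language is recognized.
    if len(recognition_languages) > 1:
        fields["hasRecognize"] = "true"
    if segmentation_silence_timeout_ms is not None:
        fields["SegmentationSilenceTimeoutMs"] = str(segmentation_silence_timeout_ms)
    return fields


if __name__ == "__main__":
    print(build_translation_fields(["en-US", "ja"], ["zh-CN", "en-US"], 1000))
```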

Return Values

Field Name	Type	Description	Example Value
text	String	Original text	"Good morning, Steve. Good morning, Katie. ..."
translations	Object	Translation results, containing texts in different languages	{"ja": "おはようございます、スティーブ。おはようございます、ケイティ。..."}
language	String	Original language of the audio	"en-US"
duration	Integer	Audio duration, in hundred-nanosecond units	286400000
resultId	String	Unique identifier for the task result	"5518458c7dec4003b9281662d9c763a7"
durationInSeconds	Integer	Audio duration, in seconds	29

Example

Request

curl --location --request POST 'https://genaiapi.cloudsway.net/v1/ai/YAzGCqDxSYQFlYie/audio/realtime/translation' \
--header "Authorization: Bearer ${AccessKey}" \
--header 'Connection: keep-alive' \
--form 'targetLanguages="en-US,ja"' \
--form 'file=@path/to/your/audio/file' \
--form 'recognitionLanguages="zh-CN,en-US"' \
--form 'hasRecognize="true"' \
--form 'SegmentationSilenceTimeoutMs="1000"'

Response

{
  "text": "秋天总是那么那么富有诗意，树叶渐渐变红街道旁的银杏树开始落叶，人们穿上厚重的外套，享受着凉爽的秋风。黄昏时分，夕阳洒在街道上，给忙碌的一天增添了一抹温暖。无论是散步还是小憩，这个季节总能带来宁静和满足。",
  "translations": {
    "en-US": "Autumn is always so poetic, the leaves are turning red, the ginkgo trees along the streets are starting to lose their leaves, and people are wearing heavy coats and enjoying the cool autumn breeze. At dusk, the setting sun shines on the streets, adding a touch of warmth to a busy day. Whether it's a walk or a nap, this season always brings tranquility and fulfillment.",
    "ja": "秋はいつもとても詩的で、葉は赤く色づき、通り沿いのイチョウの木は葉を失い始め、人々は厚手のコートを着て涼しい秋の風を楽しんでいます。 夕暮れ時には、夕日が通りを照らし、忙しい一日に暖かさを加えます。 散歩でも昼寝でも、この季節はいつも静けさと充実感をもたらします。"
  },
  "language": "zh-CN",
  "duration": 260400000,
  "resultId": "ad03ee3a708e435dbe0ee808bb68f918",
  "durationInSeconds": 27
}
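Reading a result back is a matter of indexing `translations` by target language. The sketch below adds a fallback to the original text when a language is missing. That fallback is a client-side design choice, not API behavior, and `pick_translation` is a hypothetical helper. The sample strings are truncated copies of the response above.

```python
from typing import Optional


def pick_translation(response: dict, language: str,
                     fallback_to_source: bool = True) -> Optional[str]:
    """Return the translation for `language`, optionally falling back to the original text."""
    translations = response.get("translations") or {}
    if language in translations:
        return translations[language]
    return response.get("text") if fallback_to_source else None


# Truncated copy of the response example.
sample_response = {
    "text": "秋天总是那么那么富有诗意，",
    "translations": {
        "en-US": "Autumn is always so poetic,",
        "ja": "秋はいつもとても詩的で、",
    },
    "language": "zh-CN",
}

if __name__ == "__main__":
    print(pick_translation(sample_response, "en-US"))
    print(pick_translation(sample_response, "fr-FR", fallback_to_source=False))
```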