Text-to-Speech

Text-to-Speech (TTS) models convert written text into speech, enabling computers to "read" text aloud. They do this by parsing the input text into phonetic units and then generating natural-sounding speech from those units. TTS models are commonly used in applications such as navigation systems, virtual assistants, and audiobooks.

The following models are available for purchase:

MaaS_DB_Speech
MaaS-Ele
MaaS ASpeech
MaaS OSpeech
MaaS_T2A_V2_01HD
MaaS_T2A_V2_02HD
MaaS_T2A_V2_01Turbo

MaaS_DB_Speech

The MaaS_DB_Speech model builds on a new-generation large model to analyze context in depth, predict the emotion and intonation implied by the text, and generate highly natural, high-fidelity, personalized voices that meet the needs of different users. Compared with traditional speech synthesis technologies, it performs well in naturalness, sound quality, rhythm, breath control, emotion, and the expression of modal particles. The output closely resembles a human voice.

Exceptionally High Naturalness Resembling Real Human Voices

The model simulates the fine details of human vocalization. Its smooth transitions between sounds, varied speaking speed, and controlled rhythm come very close to real human speech, so listeners feel as if they are in a real-life conversation.

Abundant Timbre Catering to Diverse Needs

Depending on the user's scenario, such as formal reading, daily conversation, or engaging narration, the model matches and generates a suitable voice style. Whether the style is lively and playful or solemn and dignified, it is rendered accurately.

Wide Adaptability Compatible with All Kinds of Texts

Whatever the type of text, such as news articles, story scripts, or academic theses, the model adapts quickly and outputs high-quality speech that fits the context.

MaaS-Ele

MaaS-Ele is an AI-based text-to-speech and voice cloning model that offers a multitude of features and services.

High-Quality Voice Generation

The MaaS-Ele voice generator renders human intonation and inflection with high fidelity, adjusting the delivery of speech according to the context.

Multi-Language Support

Generates speech in 32 languages with over 100 voice options, making it suitable for voiceovers for games, videos, podcasts, and other content.

Voice Cloning

Provides voice cloning capabilities, allowing users to create unique voices and customize their settings.

Diverse Applications

Suitable for a range of applications including text-to-speech, voice-to-voice, dubbing, and sound effect generation.

Advanced Features

Offers a richer feature set than other text-to-speech services, including telephone format support and multilingual generation.

Project Support

For longer content, the project feature is recommended.

Generation Limit

Each generation can handle up to 5,000 characters.
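
Text longer than the per-generation limit has to be split before it is submitted. The following is a minimal sketch of one way to do that, assuming chunks should break at sentence boundaries where possible. The function name `split_text` and the splitting rule are illustrative assumptions, not part of the MaaS-Ele service.

```python
import re

MAX_CHARS = 5000  # per-generation limit stated above


def split_text(text: str, max_chars: int = MAX_CHARS) -> list[str]:
    """Split text into chunks of at most max_chars, preferring sentence ends."""
    # Break after sentence-ending punctuation that is followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        # A single sentence longer than the limit is cut at the limit.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each returned chunk can then be sent as a separate generation. The `max_chars` parameter can be raised for services with a higher limit.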

MaaS-Ele’s text-to-speech service supports the following audio output formats:

MP3
WAV

By default, the audio generated on the website is in MP3 format, but other options such as PCM and μ-law formats are also available.

These formats are suitable for a variety of applications, including the creation of videos, e-learning modules, and audiobooks.
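
Raw PCM output carries no header, so most players cannot open it directly. The sketch below wraps PCM bytes in a WAV container using Python's standard `wave` module. The sample rate, channel count, and sample width shown are assumptions for illustration; the actual values depend on the output settings chosen at generation time.

```python
import io
import wave


def pcm_to_wav(pcm: bytes, sample_rate: int = 16000,
               channels: int = 1, sample_width: int = 2) -> bytes:
    """Wrap headerless PCM samples in a WAV container and return the bytes."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(channels)
        wav_file.setsampwidth(sample_width)  # bytes per sample; 2 means 16-bit
        wav_file.setframerate(sample_rate)   # assumed rate, not a documented default
        wav_file.writeframes(pcm)
    return buffer.getvalue()
```

For example, `pcm_to_wav(b"\x00\x00" * 16000)` yields one second of 16-bit mono silence as a playable WAV file.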

MaaS ASpeech

MaaS ASpeech employs voice generation technology to produce high-quality, natural-sounding speech output. The model uses machine learning and deep learning techniques for speech synthesis and has been trained on extensive voice and text datasets to meet high standards of naturalness, clarity, and emotional expression.

Highly Natural Speech Output

Using deep learning, the generated speech closely mimics natural human speech, with smooth intonation and emotional expression.

Multi-Language and Multi-Dialect Support

It supports a wide range of languages and dialects, offering localized voice experiences for global users.

Rapid Response and Low Latency

Optimized algorithms and high-performance cloud computing resources provide fast voice generation and keep latency low even under large-scale usage.

High Availability and Scalability

Built on a cloud platform, it offers high reliability and scalability, and suits application scenarios ranging from small-scale apps to large enterprise-level deployments.

MaaS OSpeech

MaaS OSpeech processes the input text through deep learning and neural network technologies, generating high-quality, natural-sounding speech output. This model, trained on extensive voice datasets, is capable of comprehending and synthesizing speech with various intonations and emotions.

Natural and Realistic Speech Output

Utilizing state-of-the-art deep learning techniques, the generated speech surpasses traditional TTS systems in naturalness and fluency, capturing the emotional and tonal variations of human speech.

Multi-Language and Multi-Accent Support

It supports a wide range of languages and accents, catering to the needs of users from different regions and cultures, achieving localized speech synthesis.

Real-Time Response

Leveraging the powerful computational capabilities of the cloud platform, the MaaS OSpeech model can swiftly process and generate speech, fulfilling the demands of real-time interactive applications.

MaaS_T2A Series

The MaaS_T2A Series supports synchronous text-to-speech generation, with a maximum of 10,000 characters of text per request. The interface is stateless: in a single call, the model receives only the content passed to the interface, and no business logic is involved. The model does not store the data you pass in.

Supported Features:

Supports selection of over 100 system voices and custom cloned voices;
Supports adjustment of volume, intonation, speech rate, and output format;
Supports proportional audio mixing;
Supports fixed interval control;
Supports multiple audio specifications and formats, including mp3, pcm, flac, and wav. Note: wav is only supported for non-streaming output;
Supports streaming output.
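
The constraints described above (the 10,000-character limit, the four output formats, and wav being unavailable for streaming) can be checked on the client before a request is sent. The following is a minimal sketch of such a check. The `T2ARequest` class and its field names are hypothetical and do not reflect the actual request schema of the interface.

```python
from dataclasses import dataclass

MAX_TEXT_CHARS = 10_000  # per-request limit stated above
SUPPORTED_FORMATS = {"mp3", "pcm", "flac", "wav"}
MODELS = {"MaaS_T2A_V2_01HD", "MaaS_T2A_V2_02HD", "MaaS_T2A_V2_01Turbo"}


@dataclass(frozen=True)
class T2ARequest:
    """Hypothetical request shape; field names are illustrative only."""
    model: str
    text: str
    audio_format: str = "mp3"
    stream: bool = False


def validate(request: T2ARequest) -> list[str]:
    """Return a list of problems; an empty list means no problem was found."""
    problems = []
    if request.model not in MODELS:
        problems.append(f"unknown model: {request.model}")
    if not request.text:
        problems.append("text is empty")
    elif len(request.text) > MAX_TEXT_CHARS:
        problems.append(f"text exceeds {MAX_TEXT_CHARS} characters")
    fmt = request.audio_format.lower()
    if fmt not in SUPPORTED_FORMATS:
        problems.append(f"unsupported format: {request.audio_format}")
    elif fmt == "wav" and request.stream:
        problems.append("wav is only supported for non-streaming output")
    return problems
```

Because the interface is stateless, every request has to carry all of its own settings, so a check like this can run on each request independently.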

This series currently includes three models.

Model	Features
MaaS_T2A_V2_01HD	Very high voice-cloning similarity and outstanding sound quality
MaaS_T2A_V2_02HD	Better rhythm, stability, and voice-cloning similarity, with outstanding sound quality
MaaS_T2A_V2_01Turbo	Better rhythm and stability, enhanced support for less widely spoken languages, and outstanding performance