MaaS_GLM
Request Protocol
HTTPS
Request Header
| Parameter Name | Value |
|---|---|
| Authorization | Bearer ${YOUR_AK} |
| Content-Type | application/json |
Request URL
https://genaiapi-m2.cloudsway.net/v1/ai/{endpoint}/chat/completions
Request Body
Request Parameters
| Parameter | Type | Default Value | Description |
|---|---|---|---|
| do_sample | Boolean | true | Whether to sample the output to increase diversity. |
| temperature | Float | (model-dependent) | Controls the randomness of the output; higher values produce more random output. |
| top_p | Float | (model-dependent) | Controls diversity through nucleus sampling. Adjust either this or temperature, not both. |
| max_tokens | Integer | (model-dependent) | Maximum number of tokens generated per call. |
| stream | Boolean | false | Whether to return the response in streaming mode. |
| thinking | Object | {"type": "enabled"} | Whether to enable chain-of-thought deep thinking. Supported only by GLM-4.5 and above. |
| reasoning_effort | String | max | Controls the reasoning level of the model. Accepted values: max, xhigh, high, medium, low, minimal, none. Supported only by GLM-5.2 and above. |
Parameter Details
do_sample
do_sample is a boolean value (true or false) used to determine whether to sample the model's output.
- true (default): Randomly samples from the probability distribution over tokens to increase the diversity and creativity of the text. Suitable for scenarios such as content creation and conversation.
- false: Uses a greedy strategy, always selecting the next token with the highest probability. The output is highly deterministic, which suits scenarios that require precise, factual answers.
Best Practices:
- When reproducible, deterministic output is required, set it to false.
- When you want the model to generate more diverse and interesting content, set it to true and use it together with temperature or top_p.
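The two modes above might be expressed as request bodies along the following lines. This is a minimal sketch: the helper name `build_payload`, the prompts, and the chosen temperature of 0.8 are illustrative assumptions, while the field names come from the parameter table.

```python
import json


def build_payload(prompt: str, deterministic: bool) -> dict:
    """Build a chat-completions request body (hypothetical helper)."""
    payload = {
        "model": "glm-5.2",
        "messages": [{"role": "user", "content": prompt}],
        "do_sample": not deterministic,
    }
    if not deterministic:
        # Sampling is on, so pair it with a single sampling control.
        payload["temperature"] = 0.8
    return payload


factual = build_payload("What is the boiling point of water at sea level?", deterministic=True)
creative = build_payload("Write a two-line poem about autumn.", deterministic=False)
print(json.dumps(factual, indent=2))
print(json.dumps(creative, indent=2))
```

Keeping the sampling controls out of the deterministic payload makes the intent of each request easier to read in logs.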
temperature
The temperature parameter controls the randomness of the model output.
- Lower values (e.g., 0.2): The probability distribution is sharper, and the output is more deterministic and conservative.
- Higher values (e.g., 0.8): The probability distribution is flatter, and the output is more random and diverse.
Best Practices:
- In scenarios that require rigor and factual accuracy (such as knowledge Q&A), a lower temperature is recommended.
- In scenarios that require creativity (such as content creation), try a higher temperature.
- It is recommended to adjust only one of temperature and top_p.
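The "sharper" versus "flatter" effect can be seen with the textbook softmax-with-temperature formula. The sketch below illustrates the general technique; that the service applies exactly this formula is an assumption, and the logits are made-up scores.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert raw scores to probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
for t in (0.2, 0.8):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

At the lower temperature almost all probability mass lands on the top-scoring token, while the higher temperature spreads it across the alternatives.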
top_p
top_p (nucleus sampling) controls diversity by sampling from the smallest set of tokens whose cumulative probability exceeds a threshold.
- Lower values (e.g., 0.2): Narrow the sampling range and produce more deterministic output.
- Higher values (e.g., 0.9): Widen the sampling range and produce more diverse output.
Best Practices:
- If you want to ensure content quality while achieving a certain degree of diversity, top_p is a good choice (recommended value 0.8-0.95).
- It is generally not recommended to modify both temperature and top_p at the same time.
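The selection rule described above can be sketched in a few lines. This is an illustration of generic nucleus filtering over a made-up distribution, not the service's actual implementation; the function name `nucleus` is hypothetical.

```python
def nucleus(probs, top_p):
    """Return the indices kept by nucleus (top-p) filtering."""
    # Rank token indices from most to least probable.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break  # the smallest set reaching the threshold is complete
    return kept


probs = [0.5, 0.3, 0.15, 0.05]  # made-up next-token distribution
print(nucleus(probs, 0.2))
print(nucleus(probs, 0.9))
```

Sampling then happens only among the kept tokens, after renormalizing their probabilities, which is why a small top_p behaves almost like greedy decoding.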
max_tokens
max_tokens limits the maximum number of tokens generated by a single model call. GLM-4.6 supports a maximum output length of 128K, and GLM-4.5 supports 96K. A value of at least 1024 is recommended.
Tokens are the basic units of text; 1 token is typically about 0.75 English words or 1.5 Chinese characters. Setting an appropriate max_tokens controls response length and cost and avoids overly long outputs. If the model completes its response before reaching the max_tokens limit, it ends naturally; if the limit is reached, the output may be truncated.
- Function: Prevent the generation of overly long text and control API call costs.
- Note: max_tokens limits the length of the generated content only; the input is not counted.
Best Practices:
- Set max_tokens appropriately for the application scenario. If a short answer is needed, it can be set to a smaller value (e.g., 50).
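For budgeting, the rough ratio of 0.75 English words per token can be turned into a quick estimate. The sketch below is only a heuristic: the function names and the 1.5x headroom factor are assumptions, and real counts depend on the model's tokenizer.

```python
import math

WORDS_PER_TOKEN = 0.75  # rough ratio quoted above for English text


def estimate_tokens(text: str) -> int:
    """Roughly estimate the token count of English text from its word count."""
    words = len(text.split())
    return math.ceil(words / WORDS_PER_TOKEN)


def suggest_max_tokens(expected_words: int, headroom: float = 1.5) -> int:
    """Suggest a max_tokens budget with some headroom (hypothetical heuristic)."""
    return math.ceil(expected_words / WORDS_PER_TOKEN * headroom)


print(estimate_tokens("Setting an appropriate limit controls response length and cost."))
print(suggest_max_tokens(300))
```

When thinking is enabled, the reasoning text also consumes output tokens, so a budget sized only for the visible answer may truncate the reply.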
Default max_tokens and supported maximum max_tokens for each model:
stream
stream is a boolean value used to control how the API responds.
- false (default): Returns the complete response all at once. This is simple to implement but involves a longer wait.
- true: Returns content in streaming (SSE) mode, which significantly improves the experience of real-time interactive applications.
Best Practices:
- For applications such as chatbots and real-time code generation, setting it to true is highly recommended.
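A client consuming the SSE stream might assemble the text as sketched below. The chunk layout (`choices[].delta.content`) and the `[DONE]` sentinel are assumptions borrowed from the common OpenAI-style streaming format, since this page does not show a streamed response; the sample events are hand-written.

```python
import json


def iter_sse_content(lines):
    """Yield text fragments from SSE lines (assumes OpenAI-style chunk layout)."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and non-data fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":  # assumed end-of-stream sentinel
            break
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            piece = choice.get("delta", {}).get("content")
            if piece:
                yield piece


# Hand-written sample events, not captured from the real service.
sample = [
    'data: {"choices": [{"index": 0, "delta": {"content": "Hel"}}]}',
    "",
    'data: {"choices": [{"index": 0, "delta": {"content": "lo"}}]}',
    "",
    "data: [DONE]",
]
print("".join(iter_sse_content(sample)))
```

In a real client, `lines` would be the decoded lines of the HTTP response body, read incrementally so each fragment can be displayed as it arrives.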
thinking
The thinking parameter controls whether the model enables "Chain of Thought" to perform deeper thinking and reasoning.
- Type: Object
- Supported models: GLM-4.5 and above
Attributes:
- type (string):
  - enabled (default): Enable Chain of Thought. GLM-5.2, GLM-5.1, GLM-5, GLM-5-Turbo, GLM-5v-Turbo, GLM-4.6, GLM-4.6V, and GLM-4.5 let the model decide automatically whether to think; GLM-4.7 and GLM-4.5V always think.
  - disabled: Disable Chain of Thought.
Best Practices:
- Enabling it is recommended when the model needs to perform complex reasoning and planning.
- For simple tasks, it can be disabled to obtain a faster response.
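Toggling the switch and reading the result might look like the sketch below. The helper names are hypothetical; the `reasoning_content` field follows the Return Example on this page, and the sample response is a shortened, hand-written stand-in. That the field is simply absent when thinking is disabled is an assumption.

```python
def with_thinking(payload: dict, enabled: bool) -> dict:
    """Return a copy of the payload with the thinking switch set."""
    updated = dict(payload)
    updated["thinking"] = {"type": "enabled" if enabled else "disabled"}
    return updated


def split_reply(response: dict) -> tuple:
    """Return (reasoning, answer) from the first choice of a response dict."""
    message = response["choices"][0]["message"]
    # reasoning_content may be absent when thinking is disabled (assumption).
    return message.get("reasoning_content", ""), message.get("content", "")


base = {"model": "glm-5.2", "messages": [{"role": "user", "content": "Hi"}]}
print(with_thinking(base, enabled=False))

# Shortened, hand-written stand-in for a real response.
sample_response = {
    "choices": [
        {
            "message": {
                "role": "assistant",
                "content": "Hello!",
                "reasoning_content": "A greeting; reply briefly.",
            }
        }
    ]
}
reasoning, answer = split_reply(sample_response)
print(reasoning)
print(answer)
```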
reasoning_effort
The reasoning_effort parameter controls the reasoning level of the model when "Chain of Thought" is enabled.
- Type: String
- Supported models: GLM-5.2 and above
- Supported values: max, xhigh, high, medium, low, minimal, none
  - high: Enhanced reasoning
  - max: Deep reasoning (default)
Note:
- To maintain compatibility with other protocols, passing none or minimal makes the model skip thinking; low and medium are mapped to high; xhigh is mapped to max.
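The compatibility mapping in the note can be written out as a lookup table. This is a client-side illustration of the documented behavior; the names `EFFORT_MAP` and `effective_effort` are hypothetical, and `None` is used here to mean that thinking is skipped.

```python
from typing import Optional

# Requested value -> level actually applied, per the note above.
EFFORT_MAP = {
    "none": None,
    "minimal": None,
    "low": "high",
    "medium": "high",
    "high": "high",
    "xhigh": "max",
    "max": "max",
}


def effective_effort(value: str = "max") -> Optional[str]:
    """Return the applied reasoning level, or None when thinking is skipped."""
    if value not in EFFORT_MAP:
        raise ValueError(f"unsupported reasoning_effort: {value!r}")
    return EFFORT_MAP[value]


for requested in ("minimal", "medium", "xhigh"):
    print(requested, "->", effective_effort(requested))
```

In practice only three distinct behaviors exist (no thinking, high, and max), so sending any other accepted value is a matter of protocol compatibility rather than finer control.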
Request Example
curl --location \
  --request POST 'https://genaiapi-m2.cloudsway.net/v1/ai/{endpoint}/chat/completions' \
  --header 'Authorization: Bearer ${YOUR_AK}' \
  --header 'Content-Type: application/json' \
  -d '{
    "model": "glm-5.2",
    "messages": [
      {
        "role": "user",
        "content": "As a marketing expert, please create a catchy slogan for my product."
      }
    ],
    "thinking": {
      "type": "enabled"
    },
    "max_tokens": 65536,
    "temperature": 1.0
  }'
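The same request could be prepared from Python with only the standard library, as sketched below. The endpoint name `my-endpoint` and the key `YOUR_AK` are placeholders, and the helper `build_request` is hypothetical; the code builds the request without sending it.

```python
import json
import urllib.request

BASE_URL = "https://genaiapi-m2.cloudsway.net/v1/ai/{endpoint}/chat/completions"


def build_request(endpoint: str, access_key: str, payload: dict) -> urllib.request.Request:
    """Prepare (but do not send) the POST request shown in the curl example."""
    return urllib.request.Request(
        BASE_URL.format(endpoint=endpoint),
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {access_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


payload = {
    "model": "glm-5.2",
    "messages": [
        {
            "role": "user",
            "content": "As a marketing expert, please create a catchy slogan for my product.",
        }
    ],
    "thinking": {"type": "enabled"},
    "max_tokens": 65536,
    "temperature": 1.0,
}
request = build_request("my-endpoint", "YOUR_AK", payload)  # placeholder values
print(request.get_method(), request.full_url)
```

Passing the prepared object to `urllib.request.urlopen` and decoding the body with `json.load` would perform the call.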
Return Example
{
  "id": "chatcmpl-LucZfRaIraogqFUo0ieR6KhB",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Hello! I'm GLM, trained by Z.ai. How can I assist you today? Whether you have questions or just want to chat, I'm happy to help.",
        "reasoning_content": "Let me consider how to respond to this greeting thoughtfully.\n\nThe user has sent a simple \"Hi\" - this is likely the beginning of a conversation. I should respond in a way that's both welcoming and open-ended to encourage further interaction.\n\nI need to introduce myself and indicate my readiness to help. A warm, professional greeting would be appropriate here. I should also invite them to share what's on their mind or what they need assistance with.\n\nSince this is an initial greeting, I'll keep my response concise but friendly, making it clear that I'm here to help with whatever they might need."
      },
      "finish_reason": "stop",
      "native_finish_reason": "stop"
    }
  ],
  "created": 1782453513,
  "model": "MaaS_GLM_5.2_20260617",
  "object": "chat.completion",
  "usage": {
    "prompt_tokens": 13,
    "completion_tokens": 158,
    "total_tokens": 171,
    "completion_tokens_details": {
      "accepted_prediction_tokens": 0,
      "audio_tokens": 0,
      "image_tokens": 0,
      "reasoning_tokens": 121,
      "rejected_prediction_tokens": 0
    },
    "prompt_tokens_details": {
      "audio_tokens": 0,
      "cached_tokens": 0,
      "image_tokens": 0
    }
  }
}
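The usage block in the response can be broken down to see how much of the output went to reasoning. The sketch below uses the numbers from the example; it assumes that `reasoning_tokens` is counted inside `completion_tokens`, which this page does not state explicitly, and `summarize_usage` is a hypothetical helper.

```python
def summarize_usage(usage: dict) -> dict:
    """Split completion tokens into reasoning and visible-answer parts."""
    details = usage.get("completion_tokens_details", {})
    reasoning = details.get("reasoning_tokens", 0)
    completion = usage["completion_tokens"]
    return {
        "prompt": usage["prompt_tokens"],
        "reasoning": reasoning,
        # Assumes reasoning tokens are counted inside completion_tokens.
        "answer": completion - reasoning,
        "total": usage["total_tokens"],
    }


# Values copied from the usage block of the example response.
usage = {
    "prompt_tokens": 13,
    "completion_tokens": 158,
    "total_tokens": 171,
    "completion_tokens_details": {"reasoning_tokens": 121},
}
print(summarize_usage(usage))
```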