Realtime Voice API Documentation

Version History

Version	Date	Changes
v1.1	2024-11-11	Adjusted the MaaS 4o realtime preview connection path; improved formatting
v1.0	2024-11-09	Initial version

MaaS 4o realtime preview

Request method

WebSocket

Request path:

wss://{domain}/v1/realtime?model={modelName}

Request path parameters:

Parameter	Description	Example
model	Model name
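
As a small illustration of the request path, the sketch below fills in the two placeholders. The domain and model name used here are made-up values, and the helper name `realtime_url` is hypothetical; only the path template comes from this document.

```python
from urllib.parse import urlencode


def realtime_url(domain: str, model_name: str) -> str:
    """Build wss://{domain}/v1/realtime?model={modelName}."""
    # urlencode escapes any characters that are unsafe in a query string.
    return f"wss://{domain}/v1/realtime?{urlencode({'model': model_name})}"


# Hypothetical domain and model name, for illustration only.
url = realtime_url("maas.example.com", "example-realtime-model")
```

The resulting string can be handed to any WebSocket client library.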

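Request body:

After connecting to the MaaS 4o realtime preview server, the client can send the following events.

1. session.update

Send this event to update the session's default configuration. The client may send it at any time, and any field can be updated at any time except "voice". The server responds with a session.updated event showing the full effective configuration. Only the fields that are present are updated, so the correct way to clear a field such as "instructions" is to pass an empty string.

Parameter	Type	Description
event_id	string	Optional client-generated ID used to identify this event.
type	string	The event type, must be "session.update".
session	object	The session configuration.

1.1 session object properties

Property	Type	Description
modalities	array	The set of modalities the model can respond with. To disable audio, set this to ["text"].
instructions	string	The default system instructions (i.e. system message), prepended to model calls. This field lets the client guide the model toward desired responses. The model can be instructed on response content and format (e.g. "be extremely succinct", "act friendly", "here are examples of good responses") and on audio behavior (e.g. "talk quickly", "inject emotion into your voice", "laugh frequently"). The model is not guaranteed to follow these instructions, but they give it guidance on the desired behavior. Note that the server sets default instructions, which are used if this field is not set and are visible in the session.created event at the start of the session.
voice	string	The voice the model uses to respond. Supported voices are alloy, ash, ballad, coral, echo, sage, shimmer, and verse. Cannot be changed once the model has responded with audio at least once.
input_audio_format	string	The format of input audio. Options are pcm16, g711_ulaw, or g711_alaw.
output_audio_format	string	The format of output audio. Options are pcm16, g711_ulaw, or g711_alaw.
input_audio_transcription	object	Configuration for input audio transcription. Off by default; set to null to turn it off again once enabled. Input audio transcription is not native to the model, since the model consumes audio directly. Transcription runs asynchronously through Whisper and should be treated as rough guidance rather than a representation of what the model understood.
turn_detection	object	Configuration for turn detection. Set to null to turn it off. Server VAD means the model detects the start and end of speech based on audio volume and responds when the user's speech ends.
tools	array	Tools (functions) available to the model.
tool_choice	string	How the model chooses tools. Options are auto, none, required, or a specific function.
temperature	number	Sampling temperature for the model, in the range [0.6, 1.2]. Defaults to 0.8.
max_response_output_tokens	int or "inf"	Maximum number of output tokens for a single assistant response, including tool calls. Provide an integer between 1 and 4096 to limit output tokens, or inf for the maximum available tokens for the given model. Defaults to inf.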
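
A minimal sketch of building a session.update payload, checking the two documented ranges on the client before sending. The helper name `session_update` is hypothetical, and whether the server applies exactly these checks is an assumption; the ranges themselves are taken from the table above.

```python
import json
from typing import Optional


def session_update(session: dict, event_id: Optional[str] = None) -> str:
    """Serialize a session.update event after basic range checks."""
    temperature = session.get("temperature")
    if temperature is not None and not 0.6 <= temperature <= 1.2:
        raise ValueError("temperature must be within [0.6, 1.2]")
    limit = session.get("max_response_output_tokens")
    if limit is not None and limit != "inf":
        if not isinstance(limit, int) or not 1 <= limit <= 4096:
            raise ValueError('max_response_output_tokens must be 1..4096 or "inf"')
    event = {"type": "session.update", "session": session}
    if event_id is not None:
        event["event_id"] = event_id
    return json.dumps(event)


# Only the fields present are updated; an empty string clears "instructions".
payload = session_update({"instructions": "", "temperature": 0.8}, event_id="event_123")
```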

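2. input_audio_buffer.append

Send this event to append audio bytes to the input audio buffer. The audio buffer is temporary storage that can be written to and committed later. In server VAD mode, the audio buffer is used to detect speech and the server decides when to commit. When server VAD is disabled, the client must commit the audio buffer manually. The client chooses how much audio to put in each event, up to a maximum of 15 MiB; streaming smaller chunks from the client may, for example, make the VAD more responsive. Unlike other client events, the server does not send a confirmation response to this event.

Parameters

Parameter	Type	Description
event_id	string	Optional client-generated ID used to identify this event.
type	string	The event type, must be "input_audio_buffer.append".
audio	string	Base64-encoded audio bytes. Must be in the format specified by the `input_audio_format` field in the session configuration.

Example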

{
    "event_id": "event_456",
    "type": "input_audio_buffer.append",
    "audio": "Base64EncodedAudioData"
}
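
A minimal sketch of splitting raw audio into append events. The 4800-byte chunk size and the helper name `append_events` are illustrative assumptions; only the event shape and the 15 MiB cap come from the description above.

```python
import base64
import json

# Per-event cap stated above. Whether it applies before or after Base64
# encoding is not specified; this sketch checks the raw size.
MAX_EVENT_AUDIO_BYTES = 15 * 1024 * 1024


def append_events(audio: bytes, chunk_bytes: int = 4800):
    """Yield input_audio_buffer.append events (JSON strings) for raw audio."""
    if not 0 < chunk_bytes <= MAX_EVENT_AUDIO_BYTES:
        raise ValueError("chunk_bytes must be between 1 byte and 15 MiB")
    for start in range(0, len(audio), chunk_bytes):
        chunk = audio[start:start + chunk_bytes]
        yield json.dumps({
            "type": "input_audio_buffer.append",
            "audio": base64.b64encode(chunk).decode("ascii"),
        })
```

Each yielded string would be sent as one WebSocket text message; no acknowledgement should be awaited for these events.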

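3. input_audio_buffer.commit

Send this event to commit the user input audio buffer, which creates a new user message item in the conversation. This event produces an error if the input audio buffer is empty. In server VAD mode, the client does not need to send this event; the server commits the audio buffer automatically. Committing the input audio buffer triggers input audio transcription (if enabled in the session configuration), but it does not create a response from the model. The server responds with an input_audio_buffer.committed event.

Parameters

Parameter	Type	Description
event_id	string	Optional client-generated ID used to identify this event.
type	string	The event type, must be "input_audio_buffer.commit".

Example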

{
    "event_id": "event_789",
    "type": "input_audio_buffer.commit"
}
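
Because a commit does not by itself start inference, a client running without server VAD typically sends three kinds of events per user turn. The sketch below only assembles that sequence; the helper name `manual_turn` and the default `response` payload are assumptions, and `response.create` is the client event described later in this document.

```python
import json


def manual_turn(audio_chunks_b64, response=None):
    """Return the client events for one user turn when server VAD is off."""
    events = [
        {"type": "input_audio_buffer.append", "audio": chunk}
        for chunk in audio_chunks_b64
    ]
    # Commit turns the buffered audio into a user message item.
    events.append({"type": "input_audio_buffer.commit"})
    # Inference has to be requested explicitly after the commit.
    events.append({
        "type": "response.create",
        "response": response or {"modalities": ["text", "audio"]},
    })
    return [json.dumps(event) for event in events]
```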

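4. input_audio_buffer.clear

Send this event to clear the audio bytes in the buffer. The server responds with an input_audio_buffer.cleared event.

Parameters

Parameter	Type	Description
event_id	string	Optional client-generated ID used to identify this event.
type	string	The event type, must be "input_audio_buffer.clear".

Example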

{
    "event_id": "event_012",
    "type": "input_audio_buffer.clear"
}

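5. conversation.item.create

Adds a new item to the conversation's context, including messages, function calls, and function call responses. This event can be used both to populate a "history" of the conversation and to add new items while the conversation is in progress, with the current limitation that it cannot populate assistant audio messages. On success, the server responds with a conversation.item.created event; otherwise it sends an error event.

Parameters

Parameter	Type	Description
event_id	string	Optional client-generated ID used to identify this event.
type	string	The event type, must be `conversation.item.create`.
previous_item_id	string	The ID of the preceding item, after which the new item is inserted. If not set, the new item is appended to the end of the conversation. If set, it allows an item to be inserted in the middle of the conversation. If the ID cannot be found, an error is returned and the item is not added.
item	object	The item to add to the conversation.

5.1 Structure of the item object

Properties

Property	Type	Description
id	string	The unique ID of the item. It may be generated by the client to help manage server-side context, but is not required because the server generates one if it is not provided.
type	string	The type of the item (message, function_call, function_call_output).
status	string	The status of the item (completed, incomplete). These have no effect on the conversation, but are accepted for consistency with the `conversation.item.created` event.
role	string	The role of the message sender (user, assistant, system); applies only to message items.
content	array	The content of the message; applies to message items. Message items with the system role support only `input_text` content, those with the user role support `input_text` and `input_audio` content, and those with the assistant role support `text` content.
call_id	string	The ID of the function call (for `function_call` and `function_call_output` items). If passed on a `function_call_output` item, the server checks that a `function_call` item with the same ID exists in the conversation history.
name	string	The name of the function being called (for `function_call` items).
arguments	string	The arguments of the function call (for `function_call` items).
output	string	The output of the function call (for `function_call_output` items).

5.1.1 Structure of the content property

Property	Type	Description
type	string	The content type (input_text, input_audio, text).
text	string	The text content, used for `input_text` and `text` content types.
audio	string	Base64-encoded audio bytes, used for the `input_audio` content type.
transcript	string	The transcript of the audio, used for the `input_audio` content type.

Example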

{
    "event_id": "event_345",
    "type": "conversation.item.create",
    "previous_item_id": null,
    "item": {
        "id": "msg_001",
        "type": "message",
        "role": "user",
        "content": [
            {
                "type": "input_text",
                "text": "Hello, how are you?"
            }
        ]
    }
}
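
The content type a text message may carry depends on its role, which is easy to get wrong when seeding history. A minimal sketch, assuming a hypothetical helper `text_message_event`; the role-to-type mapping is the one given in the item table above.

```python
import json

# Text content type permitted for each role, per the item table above.
TEXT_CONTENT_TYPE = {"system": "input_text", "user": "input_text", "assistant": "text"}


def text_message_event(role, text, previous_item_id=None):
    """Build a conversation.item.create event for a text message item."""
    event = {
        "type": "conversation.item.create",
        "item": {
            "type": "message",
            "role": role,
            "content": [{"type": TEXT_CONTENT_TYPE[role], "text": text}],
        },
    }
    # Without previous_item_id the item is appended to the end.
    if previous_item_id is not None:
        event["previous_item_id"] = previous_item_id
    return json.dumps(event)


history = [
    text_message_event("user", "Hello, how are you?"),
    text_message_event("assistant", "Fine, thanks. How can I help?"),
]
```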

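6. conversation.item.truncate

Send this event to truncate the audio of a previous assistant message. The server generates audio faster than real time, so when the user interrupts, this event can be used to truncate audio that has already been sent to the client but not yet played. This synchronizes the server's understanding of the audio with the client's playback. Truncating the audio also deletes the server-side text transcript, to ensure that the context contains no text the user has not heard. On success, the server responds with a conversation.item.truncated event.

Parameters

Parameter	Type	Description
event_id	string	Optional client-generated ID used to identify this event.
type	string	The event type, must be `conversation.item.truncate`.
item_id	string	The ID of the assistant message item to truncate. Only assistant message items can be truncated.
content_index	int	The index of the content part to truncate. Set this to 0.
audio_end_ms	int	The duration up to which the audio is kept, in milliseconds. If `audio_end_ms` is greater than the actual audio duration, the server responds with an error.

Example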

{
    "event_id": "event_678",
    "type": "conversation.item.truncate",
    "item_id": "msg_002",
    "content_index": 0,
    "audio_end_ms": 1500
}
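
To fill in `audio_end_ms`, the client needs to know how much audio it has actually played. A minimal sketch for pcm16 output follows; the 24 kHz mono sample rate is an assumption (this document does not state the sample rate), and both helper names are hypothetical.

```python
import json


def played_ms(bytes_played, sample_rate_hz=24000, bytes_per_sample=2):
    """Convert a count of played PCM bytes into whole milliseconds.

    sample_rate_hz=24000 (mono) is an assumed value, not taken from this
    document; adjust it to the rate the service actually uses.
    """
    return bytes_played * 1000 // (sample_rate_hz * bytes_per_sample)


def truncate_event(item_id, bytes_played):
    """Build a conversation.item.truncate event for the audio played so far."""
    return json.dumps({
        "type": "conversation.item.truncate",
        "item_id": item_id,
        "content_index": 0,
        "audio_end_ms": played_ms(bytes_played),
    })
```

Rounding down keeps `audio_end_ms` from exceeding the audio actually received, which the server would reject.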

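7. conversation.item.delete

Send this event when you want to remove any item from the conversation history. The server responds with a conversation.item.deleted event, unless the item does not exist in the conversation history, in which case the server responds with an error.

Parameters

Parameter	Type	Description
event_id	string	Optional client-generated ID used to identify this event.
type	string	The event type, must be `conversation.item.delete`.
item_id	string	The ID of the item to delete.

Example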

{
    "event_id": "event_901",
    "type": "conversation.item.delete",
    "item_id": "msg_003"
}

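8. response.create

This event instructs the server to create a response, which means triggering model inference. In server VAD mode, the server creates responses automatically. A response contains at least one item and may contain two, in which case the second is a function call. These items are appended to the conversation history. The server responds with a response.created event, followed by events for the items and content created, and finally a response.done event to indicate that the response is complete. The response.create event includes inference configuration such as instructions and temperature. These fields override the session configuration for this response only.

Parameters

Parameter	Type	Description
event_id	string	Optional client-generated ID used to identify this event.
type	string	The event type, must be `response.create`.
response	object	The response resource.

8.1 response object properties:

Property	Type	Description
id	string	The unique ID of the response.
object	string	The object type, must be `realtime.response`.
status	string	The final status of the response (completed, cancelled, failed, incomplete).
status_details	object	Additional details about the status.
output	array	The list of output items generated by the response.
usage	object	Usage statistics for the response, which correspond to billing. A Realtime API session maintains the conversation context and appends new items to the conversation, so output from previous turns (text and audio tokens) becomes input for later turns.

8.1.1 status_details properties

Property	Type	Description
type	string	The type of error that caused the response to fail, corresponding to the status field (cancelled, incomplete, failed).
reason	string	The reason the response did not complete. For a cancelled response, one of `turn_detected` (server VAD detected the start of new speech) or `client_cancelled` (the client sent a cancel event). For an incomplete response, one of `max_output_tokens` or `content_filter` (the server-side safety filter activated and cut off the response).
error	object	A description of the error that caused the response to fail, populated when the status is failed.

8.1.1.1 error properties

Property	Type	Description
type	string	The error type.
code	string	The error code, if any.

8.1.2 usage properties

Property	Type	Description
total_tokens	int	The total number of tokens in the response, including input and output text and audio tokens.
input_tokens	int	The number of input tokens used in the response, including text and audio tokens.
output_tokens	int	The number of output tokens sent in the response, including text and audio tokens.
input_token_details	object	Details about the input tokens used in the response.
output_token_details	object	Details about the output tokens used in the response.

8.1.2.1 input_token_details properties

Property	Type	Description
cached_tokens	int	The number of cached tokens.
text_tokens	int	The number of text tokens.
audio_tokens	int	The number of audio tokens.

8.1.2.2 output_token_details properties

Property	Type	Description
text_tokens	int	The number of text tokens used in the response.
audio_tokens	int	The number of audio tokens used in the response.

Example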

{
    "event_id": "event_234",
    "type": "response.create",
    "response": {
        "modalities": ["text", "audio"],
        "instructions": "Please assist the user.",
        "voice": "alloy",
        "output_audio_format": "pcm16",
        "tools": [
            {
                "type": "function",
                "name": "calculate_sum",
                "description": "Calculates the sum of two numbers.",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "a": { "type": "number" },
                        "b": { "type": "number" }
                    },
                    "required": ["a", "b"]
                }
            }
        ],
        "tool_choice": "auto",
        "temperature": 0.7,
        "max_output_tokens": 150
    }
}

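9. response.cancel

Send this event to cancel an in-progress response. The server responds with a response.cancelled event, or with an error if there is no response to cancel.

Parameters

Parameter	Type	Description
event_id	string	Optional client-generated ID used to identify this event.
type	string	The event type, must be `response.cancel`.

Example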

{
    "event_id": "event_567",
    "type": "response.cancel"
}

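Server

The MaaS Realtime server can return the following events:

1. error

Returned when an error occurs, which may be a client problem or a server problem. Most errors are recoverable and the session stays open. Implementers are advised to monitor and log error messages by default.

Parameters

Parameter	Type	Description
event_id	string	The unique ID of the server event.
type	string	The event type, must be "error".
error	object	Details of the error.

error properties

Property	Type	Description
type	string	The error type (e.g. "invalid_request_error", "server_error").
code	string	The error code, if any.
message	string	A human-readable error message.
param	string	The parameter related to the error, if any.
event_id	string	The event_id of the client event that caused the error, if applicable.

Example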

{
    "event_id": "event_890",
    "type": "error",
    "error": {
        "type": "invalid_request_error",
        "code": "invalid_event",
        "message": "The 'type' field is missing.",
        "param": null,
        "event_id": "event_567"
    }
}
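
Since most errors leave the session open, a client can log them and keep reading. A minimal sketch of a dispatcher that does this; the function name `handle_server_event`, the logger name, and the handler-table shape are all hypothetical choices.

```python
import json
import logging

logger = logging.getLogger("realtime")  # hypothetical logger name


def handle_server_event(raw, handlers):
    """Route one server message; return False if it was an error event."""
    event = json.loads(raw)
    if event["type"] == "error":
        err = event["error"]
        # The nested event_id points back at the offending client event.
        logger.error(
            "%s (%s): %s [client event: %s]",
            err.get("type"), err.get("code"), err.get("message"), err.get("event_id"),
        )
        return False
    handler = handlers.get(event["type"])
    if handler is not None:
        handler(event)
    return True
```

Event types without a registered handler are ignored, so a client can subscribe only to the events it cares about.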

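2. session.created

Returned when a session is created. Emitted automatically as the first server event when a new connection is established. This event contains the default session configuration.

Parameters

Parameter	Type	Description
event_id	string	The unique ID of the server event.
type	string	The event type, must be `session.created`.
session	object	The realtime session object configuration.

2.1 session properties

Property	Type	Description
modalities	array	The set of modalities the model can respond with. To disable audio, set this to `["text"]`.
instructions	string	The default system instructions (i.e. system message), prepended to model calls. This field lets the client guide the model toward desired responses. The model can be instructed on response content and format (e.g. "be extremely succinct", "act friendly", "here are examples of good responses") and on audio behavior (e.g. "talk quickly", "inject emotion into your voice", "laugh frequently"). The model is not guaranteed to follow these instructions, but they give it guidance on the desired behavior. Note that if this field is not set, the server sets default instructions, which are visible in the `session.created` event at the start of the session.
voice	string	The voice the model uses to respond. Supported voices are alloy, ash, ballad, coral, echo, sage, shimmer, and verse. Cannot be changed once the model has responded with audio at least once.
input_audio_format	string	The format of input audio. Options are pcm16, g711_ulaw, or g711_alaw.
output_audio_format	string	The format of output audio. Options are pcm16, g711_ulaw, or g711_alaw.
input_audio_transcription	object	Configuration for input audio transcription. Off by default; set to null to turn it off again once enabled. Input audio transcription is not native to the model, since the model consumes audio directly. Transcription runs asynchronously through Whisper and should be treated as rough guidance rather than a representation of what the model understood.
turn_detection	object	Configuration for turn detection. Set to null to turn it off. Server VAD means the model detects the start and end of speech based on audio volume and responds when the user's speech ends.
tools	array	Tools (functions) available to the model.
tool_choice	string	How the model chooses tools. Options are auto, none, required, or a specific function.
temperature	number	Sampling temperature for the model, limited to [0.6, 1.2]. Defaults to 0.8.
max_response_output_tokens	int or "inf"	Maximum number of output tokens for a single assistant response, including tool calls. Provide an integer between 1 and 4096 to limit output tokens, or inf for the maximum available tokens for the given model. Defaults to inf.

2.1.1 input_audio_transcription properties

Property	Type	Description
model	string	The model used for transcription. whisper-1 is currently the only supported model.

2.1.2 turn_detection properties

Property	Type	Description
type	string	The type of turn detection. Only `server_vad` is currently supported.
threshold	number	The activation threshold for VAD (0.0 to 1.0). Defaults to 0.5. A higher threshold requires louder audio to activate the model, and so may perform better in noisy environments.
prefix_padding_ms	int	The amount of audio to include before the speech detected by VAD, in milliseconds. Defaults to 300 ms.
silence_duration_ms	int	The duration of silence used to detect the end of speech, in milliseconds. Defaults to 500 ms. Shorter values make the model respond sooner, but it may cut in during short pauses by the user.

2.1.3 tools properties

Property	Type	Description
type	string	The type of the tool, i.e. function.
name	string	The name of the function.
description	string	A description of the function, including guidance on when and how to call it, and what to tell the user when calling it (if anything).
parameters	object	The function's parameters, expressed as JSON Schema.

Example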

{
    "event_id": "event_1234",
    "type": "session.created",
    "session": {
        "id": "sess_001",
        "object": "realtime.session",
        "model": "gpt-4o-realtime-preview-2024-10-01",
        "modalities": ["text", "audio"],
        "instructions": "",
        "voice": "alloy",
        "input_audio_format": "pcm16",
        "output_audio_format": "pcm16",
        "input_audio_transcription": null,
        "turn_detection": {
            "type": "server_vad",
            "threshold": 0.5,
            "prefix_padding_ms": 300,
            "silence_duration_ms": 200
        },
        "tools": [],
        "tool_choice": "auto",
        "temperature": 0.8,
        "max_response_output_tokens": null
    }
}

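3. session.updated

Returned when the session is updated through a session.update event, unless an error occurs.

Parameters

Parameter	Type	Description
event_id	string	The unique ID of the server event.
type	string	The event type, must be `session.updated`.
session	object	The realtime session object configuration.

3.1 session properties

Property	Type	Description
modalities	array	The set of modalities the model can respond with. To disable audio, set this to `["text"]`.
instructions	string	The default system instructions (i.e. system message), prepended to model calls. This field lets the client guide the model toward desired responses. The model can be instructed on response content and format (e.g. "be extremely succinct", "act friendly", "here are examples of good responses") and on audio behavior (e.g. "talk quickly", "inject emotion into your voice", "laugh frequently"). The model is not guaranteed to follow these instructions, but they give it guidance on the desired behavior. Note that if this field is not set, the server sets default instructions, which are visible in the `session.created` event at the start of the session.
voice	string	The voice the model uses to respond. Supported voices are alloy, ash, ballad, coral, echo, sage, shimmer, and verse. Cannot be changed once the model has responded with audio at least once.
input_audio_format	string	The format of input audio. Options are pcm16, g711_ulaw, or g711_alaw.
output_audio_format	string	The format of output audio. Options are pcm16, g711_ulaw, or g711_alaw.
input_audio_transcription	object	Configuration for input audio transcription. Off by default; set to null to turn it off again once enabled. Input audio transcription is not native to the model, since the model consumes audio directly. Transcription runs asynchronously through Whisper and should be treated as rough guidance rather than a representation of what the model understood.
turn_detection	object	Configuration for turn detection. Set to null to turn it off. Server VAD means the model detects the start and end of speech based on audio volume and responds when the user's speech ends.
tools	array	Tools (functions) available to the model.
tool_choice	string	How the model chooses tools. Options are auto, none, required, or a specific function.
temperature	number	Sampling temperature for the model, limited to [0.6, 1.2]. Defaults to 0.8.
max_response_output_tokens	int or "inf"	Maximum number of output tokens for a single assistant response, including tool calls. Provide an integer between 1 and 4096 to limit output tokens, or inf for the maximum available tokens for the given model. Defaults to inf.

3.1.1 input_audio_transcription properties

Property	Type	Description
model	string	The model used for transcription. whisper-1 is currently the only supported model.

3.1.2 turn_detection properties

Property	Type	Description
type	string	The type of turn detection. Only `server_vad` is currently supported.
threshold	number	The activation threshold for VAD (0.0 to 1.0). Defaults to 0.5. A higher threshold requires louder audio to activate the model, and so may perform better in noisy environments.
prefix_padding_ms	int	The amount of audio to include before the speech detected by VAD, in milliseconds. Defaults to 300 ms.
silence_duration_ms	int	The duration of silence used to detect the end of speech, in milliseconds. Defaults to 500 ms. Shorter values make the model respond sooner, but it may cut in during short pauses by the user.

3.1.3 tools properties

Property	Type	Description
type	string	The type of the tool, i.e. function.
name	string	The name of the function.
description	string	A description of the function, including guidance on when and how to call it, and what to tell the user when calling it (if anything).
parameters	object	The function's parameters, expressed as JSON Schema.

Session object example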

{
    "event_id": "event_5678",
    "type": "session.updated",
    "session": {
        "id": "sess_001",
        "object": "realtime.session",
        "model": "gpt-4o-realtime-preview-2024-10-01",
        "modalities": ["text"],
        "instructions": "New instructions",
        "voice": "alloy",
        "input_audio_format": "pcm16",
        "output_audio_format": "pcm16",
        "input_audio_transcription": {
            "model": "whisper-1"
        },
        "turn_detection": null,
        "tools": [],
        "tool_choice": "none",
        "temperature": 0.7,
        "max_response_output_tokens": 200
    }
}

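4. conversation.created

Returned when a conversation is created. Emitted immediately after the session is created.

Parameters

Parameter	Type	Description
event_id	string	The unique ID of the server event.
type	string	The event type, must be `conversation.created`.
conversation	object	The conversation resource.

4.1 conversation properties

Property	Type	Description
id	string	The unique ID of the conversation.
object	string	The object type, must be "realtime.conversation".

Example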

{
    "event_id": "event_9101",
    "type": "conversation.created",
    "conversation": {
        "id": "conv_001",
        "object": "realtime.conversation"
    }
}

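5. conversation.item.created

Returned when a conversation item is created. Several scenarios produce this event:

The server is generating a response, which on success produces one or two items of type message (with the assistant role) or of type function_call.
The input audio buffer has been committed, either by the client or by the server (in server_vad mode). The server takes the contents of the input audio buffer and adds them to a new user message item.
The client has sent a conversation.item.create event to add a new item to the conversation.

Parameters

Parameter	Type	Description
event_id	string	The unique ID of the server event.
type	string	The event type, must be `conversation.item.created`.
previous_item_id	string	The ID of the preceding item in the conversation context, which lets the client understand the order of the conversation.
item	object	The item to add to the conversation.

5.1 item properties

Property	Type	Description
id	string	The unique ID of the item. It may be generated by the client to help manage server-side context, but is not required because the server generates one if it is not provided.
type	string	The type of the item (`message`, `function_call`, `function_call_output`).
status	string	The status of the item (`completed`, `incomplete`). These have no effect on the conversation, but are accepted for consistency with the `conversation.item.created` event.
role	string	The role of the message sender (`user`, `assistant`, `system`); applies only to message items.
content	array	The content of the message; applies to message items. Message items with the system role support only `input_text` content, those with the user role support `input_text` and `input_audio` content, and those with the assistant role support `text` content.
call_id	string	The ID of the function call (for `function_call` and `function_call_output` items). If passed on a `function_call_output` item, the server checks that a `function_call` item with the same ID exists in the conversation history.
name	string	The name of the function being called (for `function_call` items).
arguments	string	The arguments of the function call (for `function_call` items).
output	string	The output of the function call (for `function_call_output` items).

5.1.1 content properties

Property	Type	Description
type	string	The content type (`input_text`, `input_audio`, `text`).
text	string	The text content, used for `input_text` and `text` content types.
audio	string	Base64-encoded audio bytes, used for the `input_audio` content type.
transcript	string	The transcript of the audio, used for the `input_audio` content type.

Example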

{
    "event_id": "event_1920",
    "type": "conversation.item.created",
    "previous_item_id": "msg_002",
    "item": {
        "id": "msg_003",
        "object": "realtime.item",
        "type": "message",
        "status": "completed",
        "role": "user",
        "content": [
            {
                "type": "input_audio",
                "transcript": "hello how are you",
                "audio": "base64encodedaudio=="
            }
        ]
    }
}

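6. conversation.item.input_audio_transcription.completed

This event is the output of transcribing user audio that was written to the user audio buffer. Transcription begins when the input audio buffer is committed, either by the client or by the server (in server_vad mode). Transcription runs asynchronously with response creation, so this event may arrive before or after the response events.

Realtime API models accept audio natively, so input transcription is a separate process that runs on a separate ASR (automatic speech recognition) model, currently always whisper-1. The transcript may therefore differ somewhat from the model's interpretation and should be treated as a rough guide.

Parameters

Parameter	Type	Description
event_id	string	The unique ID of the server event.
type	string	The event type, must be `conversation.item.input_audio_transcription.completed`.
item_id	string	The ID of the user message item containing the audio.
content_index	int	The index of the content part containing the audio.
transcript	string	The transcribed text.

Example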

{
    "event_id": "event_2122",
    "type": "conversation.item.input_audio_transcription.completed",
    "item_id": "msg_003",
    "content_index": 0,
    "transcript": "Hello, how are you?"
}

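7. conversation.item.input_audio_transcription.failed

Returned when input audio transcription is configured and a transcription request for a user message fails. These events are kept separate from other error events so that the client can identify the related item.

Parameters

Parameter	Type	Description
event_id	string	The unique ID of the server event.
type	string	The event type, must be `conversation.item.input_audio_transcription.failed`.
item_id	string	The ID of the user message item.
content_index	int	The index of the content part containing the audio.
error	object	Details of the transcription error.

7.1 error properties

Property	Type	Description
type	string	The error type (e.g. `transcription_error`).
code	string	The error code (e.g. `audio_unintelligible`).
message	string	The error message, describing why transcription failed.
param	any	An additional parameter; may be `null`.

Example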

{
    "event_id": "event_2324",
    "type": "conversation.item.input_audio_transcription.failed",
    "item_id": "msg_003",
    "content_index": 0,
    "error": {
        "type": "transcription_error",
        "code": "audio_unintelligible",
        "message": "The audio could not be transcribed.",
        "param": null
    }
}

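8. conversation.item.truncated

Returned when the client truncates an earlier assistant audio message item with a conversation.item.truncate event. This event synchronizes the server's understanding of the audio with the client's playback state. The operation truncates the audio and removes the server-side text transcript, to ensure that the context contains no text the user has not heard.

Parameters

Parameter	Type	Description
event_id	string	The unique ID of the server event.
type	string	The event type, must be `conversation.item.truncated`.
item_id	string	The ID of the assistant message item that was truncated.
content_index	int	The index of the content part that was truncated.
audio_end_ms	int	The duration up to which the audio was truncated, in milliseconds.

Example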

{
    "event_id": "event_2526",
    "type": "conversation.item.truncated",
    "item_id": "msg_004",
    "content_index": 0,
    "audio_end_ms": 1500
}

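9. conversation.item.deleted

Returned when the client deletes an item from the conversation with a conversation.item.delete event. This event synchronizes the server's understanding of the conversation history with the client's view.

Parameters

Parameter	Type	Description
event_id	string	The unique ID of the server event.
type	string	The event type, must be `conversation.item.deleted`.
item_id	string	The ID of the item that was deleted.

Example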

{
    "event_id": "event_2728",
    "type": "conversation.item.deleted",
    "item_id": "msg_005"
}

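10. input_audio_buffer.committed

Returned when the input audio buffer is committed, either by the client or automatically in server VAD mode. The item_id property is the ID of the user message item that will be created, so a conversation.item.created event is also sent to the client.

Parameters

Parameter	Type	Description
event_id	string	The unique ID of the server event.
type	string	The event type, must be `input_audio_buffer.committed`.
previous_item_id	string	The ID of the preceding item after which the new item will be inserted.
item_id	string	The ID of the user message item that will be created.

Example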

{
    "event_id": "event_1121",
    "type": "input_audio_buffer.committed",
    "previous_item_id": "msg_001",
    "item_id": "msg_002"
}

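11. input_audio_buffer.cleared

Returned when the client clears the input audio buffer with an input_audio_buffer.clear event.

Parameters

Parameter	Type	Description
event_id	string	The unique ID of the server event.
type	string	The event type, must be `input_audio_buffer.cleared`.

Example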

{
    "event_id": "event_1314",
    "type": "input_audio_buffer.cleared"
}

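12. input_audio_buffer.speech_started

Sent by the server in server_vad mode when speech is detected in the audio buffer. This can happen any time audio is added to the buffer (unless speech has already been detected). The client may want to use this event to interrupt audio playback or to give the user visual feedback. The client should expect an input_audio_buffer.speech_stopped event when speech stops. The item_id property is the ID of the user message item that will be created when speech stops, and it is also included in the input_audio_buffer.speech_stopped event (unless the client manually commits the audio buffer during VAD activation).

Parameters

Parameter	Type	Description
event_id	string	The unique ID of the server event.
type	string	The event type, must be `input_audio_buffer.speech_started`.
audio_start_ms	int	Milliseconds from the start of all audio written to the buffer during the session to the point where speech was first detected. This corresponds to the start of the audio sent to the model, and so includes the `prefix_padding_ms` configured in the session.
item_id	string	The ID of the user message item that will be created when speech stops.

Example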

{
    "event_id": "event_1516",
    "type": "input_audio_buffer.speech_started",
    "audio_start_ms": 1000,
    "item_id": "msg_003"
}

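13. input_audio_buffer.speech_stopped

Returned in server_vad mode when the server detects the end of speech in the audio buffer. The server also sends a conversation.item.created event containing the user message item created from the audio buffer.

Parameters

Parameter	Type	Description
event_id	string	The unique ID of the server event.
type	string	The event type, must be `input_audio_buffer.speech_stopped`.
audio_end_ms	int	Milliseconds since the session started at which speech stopped. This corresponds to the end of the audio sent to the model, and so includes the `silence_duration_ms` configured in the session.
item_id	string	The ID of the user message item that will be created.

Example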

{
    "event_id": "event_1718",
    "type": "input_audio_buffer.speech_stopped",
    "audio_end_ms": 2000,
    "item_id": "msg_003"
}
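
Because speech_started and speech_stopped share an item_id, a client can pair them to measure each utterance. A minimal sketch, with the class name `SpeechTimer` being a hypothetical choice; note that the result includes the configured padding and silence windows described in the two tables above.

```python
class SpeechTimer:
    """Pair speech_started/speech_stopped events by item_id."""

    def __init__(self):
        self._started = {}  # item_id -> audio_start_ms

    def feed(self, event):
        """Return the utterance length in ms on speech_stopped, else None."""
        if event["type"] == "input_audio_buffer.speech_started":
            self._started[event["item_id"]] = event["audio_start_ms"]
        elif event["type"] == "input_audio_buffer.speech_stopped":
            start = self._started.pop(event["item_id"], None)
            if start is not None:
                return event["audio_end_ms"] - start
        return None
```

Fed the two example events from this document (audio_start_ms 1000, audio_end_ms 2000, both for msg_003), the timer reports 1000 ms.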

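14. response.created

Returned when a new response is created. This is the first event of response creation, with the response in an initial state of in_progress.

Parameters

Parameter	Type	Description
event_id	string	The unique ID of the server event.
type	string	The event type, must be `response.created`.
response	object	The response resource.

14.1 response object properties:

Property	Type	Description
id	string	The unique ID of the response.
object	string	The object type, must be `realtime.response`.
status	string	The status of the response: in_progress initially, then the final status (completed, cancelled, failed, incomplete).
status_details	object	Additional details about the status.
output	array	The list of output items generated by the response.
usage	object	Usage statistics for the response, which correspond to billing. A Realtime API session maintains the conversation context and appends new items to the conversation, so output from previous turns (text and audio tokens) becomes input for later turns.

14.1.1 status_details properties

Property	Type	Description
type	string	The type of error that caused the response to fail, corresponding to the status field (cancelled, incomplete, failed).
reason	string	The reason the response did not complete. For a cancelled response, one of `turn_detected` (server VAD detected the start of new speech) or `client_cancelled` (the client sent a cancel event). For an incomplete response, one of `max_output_tokens` or `content_filter` (the server-side safety filter activated and cut off the response).
error	object	A description of the error that caused the response to fail, populated when the status is failed.

14.1.1.1 error properties

Property	Type	Description
type	string	The error type.
code	string	The error code, if any.

14.1.2 usage properties

Property	Type	Description
total_tokens	int	The total number of tokens in the response, including input and output text and audio tokens.
input_tokens	int	The number of input tokens used in the response, including text and audio tokens.
output_tokens	int	The number of output tokens sent in the response, including text and audio tokens.
input_token_details	object	Details about the input tokens used in the response.
output_token_details	object	Details about the output tokens used in the response.

14.1.2.1 input_token_details properties

Property	Type	Description
cached_tokens	int	The number of cached tokens.
text_tokens	int	The number of text tokens.
audio_tokens	int	The number of audio tokens.

14.1.2.2 output_token_details properties

Property	Type	Description
text_tokens	int	The number of text tokens used in the response.
audio_tokens	int	The number of audio tokens used in the response.

Example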

{
    "event_id": "event_2930",
    "type": "response.created",
    "response": {
        "id": "resp_001",
        "object": "realtime.response",
        "status": "in_progress",
        "status_details": null,
        "output": [],
        "usage": null
    }
}

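15. response.done

Returned when a response has finished streaming. Always emitted, regardless of the final status. The response object included in the response.done event contains all output items in the response but omits the raw audio data.

Parameters

Parameter	Type	Description
event_id	string	The unique ID of the server event.
type	string	The event type, must be `response.done`.
response	object	The response resource.

15.1 response object properties:

Property	Type	Description
id	string	The unique ID of the response.
object	string	The object type, must be `realtime.response`.
status	string	The final status of the response (completed, cancelled, failed, incomplete).
status_details	object	Additional details about the status.
output	array	The list of output items generated by the response.
usage	object	Usage statistics for the response, which correspond to billing. A Realtime API session maintains the conversation context and appends new items to the conversation, so output from previous turns (text and audio tokens) becomes input for later turns.

15.1.1 status_details properties

Property	Type	Description
type	string	The type of error that caused the response to fail, corresponding to the status field (cancelled, incomplete, failed).
reason	string	The reason the response did not complete. For a cancelled response, one of `turn_detected` (server VAD detected the start of new speech) or `client_cancelled` (the client sent a cancel event). For an incomplete response, one of `max_output_tokens` or `content_filter` (the server-side safety filter activated and cut off the response).
error	object	A description of the error that caused the response to fail, populated when the status is failed.

15.1.1.1 error properties

Property	Type	Description
type	string	The error type.
code	string	The error code, if any.

15.1.2 usage properties

Property	Type	Description
total_tokens	int	The total number of tokens in the response, including input and output text and audio tokens.
input_tokens	int	The number of input tokens used in the response, including text and audio tokens.
output_tokens	int	The number of output tokens sent in the response, including text and audio tokens.
input_token_details	object	Details about the input tokens used in the response.
output_token_details	object	Details about the output tokens used in the response.

15.1.2.1 input_token_details properties

Property	Type	Description
cached_tokens	int	The number of cached tokens.
text_tokens	int	The number of text tokens.
audio_tokens	int	The number of audio tokens.

15.1.2.2 output_token_details properties

Property	Type	Description
text_tokens	int	The number of text tokens used in the response.
audio_tokens	int	The number of audio tokens used in the response.

Example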

{
    "event_id": "event_3132",
    "type": "response.done",
    "response": {
        "id": "resp_001",
        "object": "realtime.response",
        "status": "completed",
        "status_details": null,
        "output": [
            {
                "id": "msg_006",
                "object": "realtime.item",
                "type": "message",
                "status": "completed",
                "role": "assistant",
                "content": [
                    {
                        "type": "text",
                        "text": "Sure, how can I assist you today?"
                    }
                ]
            }
        ],
        "usage": {
            "total_tokens": 275,
            "input_tokens": 127,
            "output_tokens": 148,
            "input_token_details": {
                "cached_tokens": 384,
                "text_tokens": 119,
                "audio_tokens": 8,
                "cached_tokens_details": {
                    "text_tokens": 128,
                    "audio_tokens": 256
                }
            },
            "output_token_details": {
                "text_tokens": 36,
                "audio_tokens": 112
            }
        }
    }
}
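
Since usage corresponds to billing, a client may want to tally it per response. A minimal sketch that flattens the usage block of a response.done event; the function name `usage_summary` and the shape of the returned dict are hypothetical.

```python
def usage_summary(done_event):
    """Flatten the usage block of a response.done event into a small dict."""
    response = done_event["response"]
    usage = response.get("usage") or {}  # usage may be null
    details_in = usage.get("input_token_details") or {}
    details_out = usage.get("output_token_details") or {}
    return {
        "status": response["status"],
        "input_tokens": usage.get("input_tokens", 0),
        "output_tokens": usage.get("output_tokens", 0),
        "text_tokens": details_in.get("text_tokens", 0) + details_out.get("text_tokens", 0),
        "audio_tokens": details_in.get("audio_tokens", 0) + details_out.get("audio_tokens", 0),
    }
```

Missing or null fields are counted as zero, so the same function can be applied to cancelled or failed responses.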

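16. response.output_item.added

Returned when a new item is created during response generation.

Parameters

Parameter	Type	Description
event_id	string	The unique ID of the server event.
type	string	The event type, must be `response.output_item.added`.
response_id	string	The ID of the response the item belongs to.
output_index	int	The index of the output item in the response.
item	object	The item to add to the conversation.

16.1 item properties

Property	Type	Description
id	string	The unique ID of the item. It may be generated by the client to help manage server-side context, but is not required because the server generates one if it is not provided.
type	string	The type of the item (message, function_call, function_call_output).
status	string	The status of the item (completed, incomplete). These have no effect on the conversation, but are accepted for consistency with the `conversation.item.created` event.
role	string	The role of the message sender (user, assistant, system); applies only to message items.
content	array	The content of the message; applies to message items. Message items with the system role support only `input_text` content, those with the user role support `input_text` and `input_audio` content, and those with the assistant role support `text` content.
call_id	string	The ID of the function call (for `function_call` and `function_call_output` items). If passed on a `function_call_output` item, the server checks that a `function_call` item with the same ID exists in the conversation history.
name	string	The name of the function being called (for `function_call` items).
arguments	string	The arguments of the function call (for `function_call` items).
output	string	The output of the function call (for `function_call_output` items).

16.1.1 Structure of the content property

Property	Type	Description
type	string	The content type (input_text, input_audio, text).
text	string	The text content, used for `input_text` and `text` content types.
audio	string	Base64-encoded audio bytes, used for the `input_audio` content type.
transcript	string	The transcript of the audio, used for the `input_audio` content type.

Example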

{
    "event_id": "event_3334",
    "type": "response.output_item.added",
    "response_id": "resp_001",
    "output_index": 0,
    "item": {
        "id": "msg_007",
        "object": "realtime.item",
        "type": "message",
        "status": "in_progress",
        "role": "assistant",
        "content": []
    }
}

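17. response.output_item.done

Returned when an item has finished streaming. Also emitted when a response is interrupted, incomplete, or cancelled.

Parameters

Parameter	Type	Description
event_id	string	The unique ID of the server event.
type	string	The event type, must be `response.output_item.done`.
response_id	string	The ID of the response the item belongs to.
output_index	int	The index of the output item in the response.
item	object	The item to add to the conversation.

17.1 item properties

Property	Type	Description
id	string	The unique ID of the item. It may be generated by the client to help manage server-side context, but is not required because the server generates one if it is not provided.
type	string	The type of the item (message, function_call, function_call_output).
status	string	The status of the item (completed, incomplete). These have no effect on the conversation, but are accepted for consistency with the `conversation.item.created` event.
role	string	The role of the message sender (user, assistant, system); applies only to message items.
content	array	The content of the message; applies to message items. Message items with the system role support only `input_text` content, those with the user role support `input_text` and `input_audio` content, and those with the assistant role support `text` content.
call_id	string	The ID of the function call (for `function_call` and `function_call_output` items). If passed on a `function_call_output` item, the server checks that a `function_call` item with the same ID exists in the conversation history.
name	string	The name of the function being called (for `function_call` items).
arguments	string	The arguments of the function call (for `function_call` items).
output	string	The output of the function call (for `function_call_output` items).

17.1.1 Structure of the content property

Property	Type	Description
type	string	The content type (input_text, input_audio, text).
text	string	The text content, used for `input_text` and `text` content types.
audio	string	Base64-encoded audio bytes, used for the `input_audio` content type.
transcript	string	The transcript of the audio, used for the `input_audio` content type.

Example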

{
    "event_id": "event_3536",
    "type": "response.output_item.done",
    "response_id": "resp_001",
    "output_index": 0,
    "item": {
        "id": "msg_007",
        "object": "realtime.item",
        "type": "message",
        "status": "completed",
        "role": "assistant",
        "content": [
            {
                "type": "text",
                "text": "Sure, I can help with that."
            }
        ]
    }
}

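18. response.content_part.added

Returned when a new content part is added to an assistant message item during response generation.

Parameters

Parameter	Type	Description
event_id	string	The unique ID of the server event.
type	string	The event type, must be `response.content_part.added`.
response_id	string	The ID of the response.
item_id	string	The ID of the item the content part was added to.
output_index	int	The index of the output item in the response.
content_index	int	The index of the content part in the item's content array.
part	object	The content part that was added.

18.1 part properties

Property	Type	Description
type	string	The content type (text, audio).
text	string	The text content (if type=text).
audio	string	The Base64-encoded audio output (if type=audio).
transcript	string	The transcript of the audio (if type=audio).

Example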

{
    "event_id": "event_3738",
    "type": "response.content_part.added",
    "response_id": "resp_001",
    "item_id": "msg_007",
    "output_index": 0,
    "content_index": 0,
    "part": {
        "type": "text",
        "text": ""
    }
}

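19. response.content_part.done

Returned when a content part in an assistant message item has finished streaming. Also emitted when a response is interrupted, incomplete, or cancelled.

Parameters

Parameter	Type	Description
event_id	string	The unique ID of the server event.
type	string	The event type, must be `response.content_part.done`.
response_id	string	The ID of the response.
item_id	string	The ID of the item.
output_index	int	The index of the output item in the response.
content_index	int	The index of the content part in the item's content array.
part	object	The content part that is done.

19.1 part properties

Property	Type	Description
type	string	The content type (text, audio).
text	string	The text content (if type=text).
audio	string	The Base64-encoded audio output (if type=audio).
transcript	string	The transcript of the audio (if type=audio).

Example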

{
    "event_id": "event_3940",
    "type": "response.content_part.done",
    "response_id": "resp_001",
    "item_id": "msg_007",
    "output_index": 0,
    "content_index": 0,
    "part": {
        "type": "text",
        "text": "Sure, I can help with that."
    }
}

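20. response.text.delta

Returned when the text value of a "text" content part is updated.

Parameters

Parameter	Type	Description
event_id	string	The unique ID of the server event.
type	string	The event type, must be `response.text.delta`.
response_id	string	The ID of the response.
item_id	string	The ID of the item.
output_index	int	The index of the output item in the response.
content_index	int	The index of the content part in the item's content array.
delta	string	The text delta.

Example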

{
    "event_id": "event_4142",
    "type": "response.text.delta",
    "response_id": "resp_001",
    "item_id": "msg_007",
    "output_index": 0,
    "content_index": 0,
    "delta": "Sure, I can h"
}

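21. response.text.done

Returned when the text value of a "text" content part has finished streaming. Also emitted when a response is interrupted, incomplete, or cancelled.

Parameters

Parameter	Type	Description
event_id	string	The unique ID of the server event.
type	string	The event type, must be `response.text.done`.
response_id	string	The ID of the response.
item_id	string	The ID of the item.
output_index	int	The index of the output item in the response.
content_index	int	The index of the content part in the item's content array.
text	string	The final text content.

Example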

{
    "event_id": "event_4344",
    "type": "response.text.done",
    "response_id": "resp_001",
    "item_id": "msg_007",
    "output_index": 0,
    "content_index": 0,
    "text": "Sure, I can help with that."
}
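
Text arrives as a series of deltas followed by one done event carrying the full text. A minimal sketch of reassembling it per content part; the class name `TextAssembler` is hypothetical, and treating the done event's `text` as authoritative over the concatenated deltas is a design choice of this sketch rather than documented behavior.

```python
from collections import defaultdict


class TextAssembler:
    """Accumulate response.text.delta events per (item_id, content_index)."""

    def __init__(self):
        self.parts = defaultdict(str)

    def feed(self, event):
        """Return the final text on response.text.done, else None."""
        if event["type"] == "response.text.delta":
            key = (event["item_id"], event["content_index"])
            self.parts[key] += event["delta"]  # partial text, e.g. for live display
        elif event["type"] == "response.text.done":
            key = (event["item_id"], event["content_index"])
            self.parts[key] = event["text"]  # final text replaces the partial one
            return self.parts[key]
        return None
```

Keying on both `item_id` and `content_index` keeps parts separate when several items stream in one response.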

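22. response.audio_transcript.delta

Returned when the model-generated transcript of the audio output is updated.

Parameters

Parameter	Type	Description
event_id	string	The unique ID of the server event.
type	string	The event type, must be `response.audio_transcript.delta`.
response_id	string	The ID of the response.
item_id	string	The ID of the item.
output_index	int	The index of the output item in the response.
content_index	int	The index of the content part in the item's content array.
delta	string	The transcript delta.

Example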

{
    "event_id": "event_4546",
    "type": "response.audio_transcript.delta",
    "response_id": "resp_001",
    "item_id": "msg_008",
    "output_index": 0,
    "content_index": 0,
    "delta": "Hello, how can I a"
}

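23. response.audio_transcript.done

Returned when the model-generated transcript of the audio output has finished streaming. Also emitted when a response is interrupted, incomplete, or cancelled.

Parameters

Parameter	Type	Description
event_id	string	The unique ID of the server event.
type	string	The event type, must be `response.audio_transcript.done`.
response_id	string	The ID of the response.
item_id	string	The ID of the item.
output_index	int	The index of the output item in the response.
content_index	int	The index of the content part in the item's content array.
transcript	string	The final transcript of the audio.

Example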

{
    "event_id": "event_4748",
    "type": "response.audio_transcript.done",
    "response_id": "resp_001",
    "item_id": "msg_008",
    "output_index": 0,
    "content_index": 0,
    "transcript": "Hello, how can I assist you today?"
}

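24. response.audio.delta

Returned when the model-generated audio is updated.

Parameters

Parameter	Type	Description
event_id	string	The unique ID of the server event.
type	string	The event type, must be `response.audio.delta`.
response_id	string	The ID of the response.
item_id	string	The ID of the item.
output_index	int	The index of the output item in the response.
content_index	int	The index of the content part in the item's content array.
delta	string	The Base64-encoded audio data delta.

Example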

{
    "event_id": "event_4950",
    "type": "response.audio.delta",
    "response_id": "resp_001",
    "item_id": "msg_008",
    "output_index": 0,
    "content_index": 0,
    "delta": "Base64EncodedAudioDelta"
}

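25. response.audio.done

Returned when the model-generated audio is done. Also emitted when a response is interrupted, incomplete, or cancelled.

Properties

Property	Type	Description
event_id	string	The unique ID of the server event.
type	string	The event type, must be "response.audio.done".
response_id	string	The ID of the response.
item_id	string	The ID of the item.
output_index	int	The index of the output item in the response.
content_index	int	The index of the content part in the item's content array.

Example JSON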

{
    "event_id": "event_5152",
    "type": "response.audio.done",
    "response_id": "resp_001",
    "item_id": "msg_008",
    "output_index": 0,
    "content_index": 0
}
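
The done event carries no audio itself, so the client has to collect the deltas. A minimal sketch that decodes and concatenates them per content part; the class name `AudioCollector` is hypothetical, and a real player would more likely stream each decoded chunk to the audio device as it arrives.

```python
import base64


class AudioCollector:
    """Collect response.audio.delta payloads per (item_id, content_index)."""

    def __init__(self):
        self._buffers = {}

    def feed(self, event):
        """Return the full decoded audio on response.audio.done, else None."""
        if event["type"] == "response.audio.delta":
            key = (event["item_id"], event["content_index"])
            chunk = base64.b64decode(event["delta"])  # delta is Base64 text
            self._buffers.setdefault(key, bytearray()).extend(chunk)
        elif event["type"] == "response.audio.done":
            key = (event["item_id"], event["content_index"])
            return bytes(self._buffers.pop(key, b""))
        return None
```

The returned bytes are in the session's `output_audio_format`.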

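26. response.function_call_arguments.delta

Returned when the model-generated function call arguments are updated.

Properties

Property	Type	Description
event_id	string	The unique ID of the server event.
type	string	The event type, must be "response.function_call_arguments.delta".
response_id	string	The ID of the response.
item_id	string	The ID of the function call item.
output_index	int	The index of the output item in the response.
call_id	string	The ID of the function call.
delta	string	The arguments delta, as a JSON string.

Example JSON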

{
    "event_id": "event_5354",
    "type": "response.function_call_arguments.delta",
    "response_id": "resp_002",
    "item_id": "fc_001",
    "output_index": 0,
    "call_id": "call_001",
    "delta": "{\"location\": \"San\""
}

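27. response.function_call_arguments.done

Returned when the model-generated function call arguments have finished streaming. Also emitted when a response is interrupted, incomplete, or cancelled.

Properties

Property	Type	Description
event_id	string	The unique ID of the server event.
type	string	The event type, must be "response.function_call_arguments.done".
response_id	string	The ID of the response.
item_id	string	The ID of the function call item.
output_index	int	The index of the output item in the response.
call_id	string	The ID of the function call.
arguments	string	The final arguments, as a JSON string.

Example JSON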

{
    "event_id": "event_5556",
    "type": "response.function_call_arguments.done",
    "response_id": "resp_002",
    "item_id": "fc_001",
    "output_index": 0,
    "call_id": "call_001",
    "arguments": "{\"location\": \"San Francisco\"}"
}
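
Once the arguments are complete, a client would usually run the function locally and hand the result back as a `function_call_output` item. The sketch below works from a finished `function_call` item (which carries `name`, `call_id`, and `arguments`, per the item tables in this document) rather than from the done event alone, since the done event does not name the function. The helper name `run_function_call` and the registry dict are hypothetical, and JSON-encoding the result into `output` is an assumption of this sketch.

```python
import json


def run_function_call(item, functions):
    """Execute a function_call item and build the event returning its output."""
    args = json.loads(item["arguments"])  # arguments arrive as a JSON string
    result = functions[item["name"]](**args)
    return json.dumps({
        "type": "conversation.item.create",
        "item": {
            "type": "function_call_output",
            "call_id": item["call_id"],  # must match the function_call item
            "output": json.dumps(result),
        },
    })


# Local implementation of the calculate_sum tool declared in the
# response.create example.
functions = {"calculate_sum": lambda a, b: a + b}
```

After sending this event, the client would typically send response.create so the model can use the output.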

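28. rate_limits.updated

Emitted at the beginning of a response to indicate the updated rate limits. When a response is created, some tokens are "reserved" for the output tokens; the rate limits shown here reflect that reservation, which is adjusted accordingly once the response completes.

Properties

Property	Type	Description
event_id	string	The unique ID of the server event.
type	string	The event type, must be `rate_limits.updated`.
rate_limits	array	The list of rate limit information.

28.1 rate_limits properties

Property	Type	Description
name	string	The name of the rate limit (e.g. requests, tokens).
limit	int	The maximum allowed value for the rate limit.
remaining	int	The remaining value before the limit is reached.
reset_seconds	number	The number of seconds until the rate limit resets.

Example JSON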

{
    "event_id": "event_5758",
    "type": "rate_limits.updated",
    "rate_limits": [
        {
            "name": "requests",
            "limit": 1000,
            "remaining": 999,
            "reset_seconds": 60
        },
        {
            "name": "tokens",
            "limit": 50000,
            "remaining": 49950,
            "reset_seconds": 60
        }
    ]
}
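
A client can use this event to decide when to slow down. A minimal sketch that picks the limit closest to exhaustion; the function name `tightest_limit` and the idea of comparing by remaining fraction are choices of this sketch, not part of the interface.

```python
def tightest_limit(event):
    """Return the rate limit entry with the smallest remaining fraction."""
    limits = [entry for entry in event["rate_limits"] if entry["limit"] > 0]
    if not limits:
        return None
    # Compare remaining/limit so limits of different sizes are comparable.
    return min(limits, key=lambda entry: entry["remaining"] / entry["limit"])
```

The `reset_seconds` field of the returned entry indicates how long the client would need to wait for that limit to reset.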