Real-time Speaker Diarization

Pipe a microphone, phone, or conference feed in and get each utterance tagged with a speaker label as it is spoken. This is a WebSocket streaming endpoint: as soon as a complete utterance ends, the server pushes a partial_result. It suits live captions, call-center transcription, and live multi-speaker meeting diarization.

WS /diarize/stream · Stable · v1.2.4 · Updated 2026-04-23
01

Limits & Constraints

Exceeding any limit returns 413 Payload Too Large or 422 Validation Error; see §05 Error Codes.

AUDIO FORMAT: PCM s16le (16 kHz · mono · 16-bit little-endian)
SESSION LENGTH: 2 hours (the server-side buffer is a rolling 120 s window; older audio is trimmed)
RECOMMENDED CHUNK SIZE: 200 ms (3200 samples / 6400 bytes per frame)

Note: any other format combination is rejected and the connection is closed immediately (close code 1008). In the browser, you can create AudioContext({ sampleRate: 16000 }) and let the browser resample to 16 kHz.
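
The frame size follows directly from the format: 16000 samples/s × 0.2 s = 3200 samples, and 3200 samples × 2 bytes = 6400 bytes. The sketch below shows one way to cut a PCM buffer into frames of that size. The helper names (frame_bytes, split_frames) are illustrative and not part of the service.

```python
SAMPLE_RATE = 16_000   # Hz, the only rate the endpoint accepts
SAMPLE_WIDTH = 2       # bytes per sample (16-bit PCM)
CHANNELS = 1           # mono
FRAME_MS = 200         # recommended chunk duration


def frame_bytes(frame_ms: int = FRAME_MS) -> int:
    """Bytes in one frame of the given duration."""
    samples = SAMPLE_RATE * frame_ms // 1000
    return samples * SAMPLE_WIDTH * CHANNELS


def split_frames(pcm: bytes, frame_ms: int = FRAME_MS):
    """Yield consecutive frames; the last one may be shorter."""
    if len(pcm) % SAMPLE_WIDTH:
        # An odd byte count cannot be whole 16-bit samples.
        raise ValueError("PCM byte count must be even for 16-bit samples")
    size = frame_bytes(frame_ms)
    for offset in range(0, len(pcm), size):
        yield pcm[offset:offset + size]
```

Checking for an even byte count on the client catches the problem before the server has to reject the audio.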

02

Performance

Per-utterance latency is measured from the end of speech to the arrival of partial_result.

P50 (median): half of the requests are faster and half are slower. This is the typical experience.
P99 (tail): only 1% of requests are slower than this. Use it to judge whether the worst case is still acceptable.

Utterance latency · P50: 72 ms
Utterance latency · P99: 280 ms
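
To compare your own measurements with these figures, a nearest-rank percentile over recorded latencies is usually enough. This is a minimal sketch; the function name and the nearest-rank method are assumptions, and the service may compute its published numbers differently.

```python
def percentile(samples_ms, p: int):
    """Nearest-rank percentile of latency samples; p is an integer in 1..100."""
    if not samples_ms:
        raise ValueError("no samples")
    if not 1 <= p <= 100:
        raise ValueError("p must be between 1 and 100")
    ordered = sorted(samples_ms)
    # Integer ceiling of p * n / 100 avoids floating-point rounding surprises.
    rank = max(1, (p * len(ordered) + 99) // 100)
    return ordered[rank - 1]
```

Record the time between the end of each utterance and the matching partial_result, then call percentile(latencies, 50) and percentile(latencies, 99).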
03

Try It

Click the mic to start recording; each time an utterance ends, a new segment appears on the right in real time. Click the mic again or send close to trigger the final result. To upload a file instead, see Speaker Diarization · File.

04

Protocol & Response

Start with the message sequence and field definitions below, then go to the language tabs at the bottom for copy-paste-ready snippets. Don't need real-time? See Speaker Diarization · File mode.

WS /diarize/stream

WebSocket handshake → send a start JSON message → stream PCM binary frames (200 ms per frame recommended) → receive a partial_result each time VAD closes an utterance → send close to receive final_result and end the session.

Request fields

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| token | query | optional | Required only when the server has AUTH_ENABLED=True. Passed as a URL query parameter: ?token=<token>. The browser WebSocket API cannot set custom headers on the handshake, so the Authorization header is not used. |
| sample_rate | int | required | Field in the start message. Must be 16000; other values are rejected with close code 1008. |
| channels | int | required | Field in the start message. Must be 1 (mono). |
| sample_width | int | required | Field in the start message. Width in bytes; must be 2 (16-bit PCM). |
| format | string | required | Field in the start message. Must be "pcm_s16le" (16-bit little-endian PCM). |
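
The fields above can be assembled on the client as follows. This is a sketch under assumptions: the helper names are illustrative, and the base URL (scheme, host, port) depends on your deployment.

```python
import json
from urllib.parse import urlencode, urlsplit, urlunsplit

# The only accepted combination: 16 kHz, mono, 16-bit little-endian PCM.
START_MESSAGE = {
    "type": "start",
    "sample_rate": 16000,
    "channels": 1,
    "sample_width": 2,
    "format": "pcm_s16le",
}


def stream_url(base_url: str, token=None) -> str:
    """Build the /diarize/stream URL, adding ?token= only when a token is given."""
    parts = urlsplit(base_url)
    query = urlencode({"token": token}) if token else ""
    return urlunsplit((parts.scheme, parts.netloc, "/diarize/stream", query, ""))


def start_payload() -> str:
    """Serialize the start message as compact JSON."""
    return json.dumps(START_MESSAGE, separators=(",", ":"))
```

urlencode percent-escapes the token, which matters if it contains characters such as + or &.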

Message sequence

| Message | Direction | Type | Description |
| --- | --- | --- | --- |
| start | → server | JSON | {"type":"start","sample_rate":16000,"channels":1,"sample_width":2,"format":"pcm_s16le"}. Only 16 kHz · mono · s16le is accepted. |
| ready | ← server | JSON | {"type":"ready","session_id":"..."}. Start pushing audio once this arrives. |
| <binary> | → server | PCM | Raw PCM s16le binary frames. 200 ms per frame (3200 samples / 6400 bytes) is recommended. |
| partial_result | ← server | JSON | Pushed once after each completed utterance. segments is a full snapshot of the session so far, not a delta; clients should replace their list rather than append to it. |
| close | → server | JSON | {"type":"close"}. Triggers VAD finalization of any pending audio. |
| final_result | ← server | JSON | Same shape as partial_result; the session's terminal state. The server then closes the connection. |
| error | ← server | JSON | {"type":"error","detail":"..."}. The server then closes the connection with a close code (see §05). |
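
Because segments is a snapshot, a client-side handler should overwrite its state on every result message. A minimal sketch of that dispatch follows; the Transcript class is hypothetical and not part of any SDK.

```python
import json


class Transcript:
    """Holds the latest segment snapshot for one streaming session."""

    def __init__(self):
        self.session_id = None
        self.segments = []
        self.finished = False

    def handle(self, raw: str) -> None:
        msg = json.loads(raw)
        kind = msg.get("type")
        if kind == "ready":
            self.session_id = msg.get("session_id")
        elif kind in ("partial_result", "final_result"):
            # Replace, do not append: each message carries the full list.
            self.segments = list(msg["data"]["segments"])
            self.finished = kind == "final_result"
        elif kind == "error":
            raise RuntimeError(msg.get("detail", "unknown error"))
        # Any other message type is ignored in this sketch.
```

Appending instead of replacing would duplicate earlier segments every time a new utterance closes.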

Response sample

PARTIAL_RESULT · JSON FRAME · StreamingPartialResultEvent
// Pushed once per completed utterance; segments is a full snapshot of the current session, not a delta
{
  "type": "partial_result",
  "session_id": "6261ae2e072449438ee3d06f8e822ba1",
  "data": {
    "segments": [
      { "start_time": 0.0, "end_time": 4.81, "speaker_id": 0, "text": "" }
    ]
  },
  "stats": {
    "utterances": 1,
    "chunks": 6,
    "pipeline_ms": 29.5
  }
}
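
A frame with this shape can be read into typed records as sketched below. The Segment dataclass and parse_segments helper are illustrative names. Only the four segment fields shown in the sample are assumed to exist.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    start_time: float   # seconds from session start
    end_time: float
    speaker_id: int
    text: str

    @property
    def duration(self) -> float:
        return self.end_time - self.start_time


def parse_segments(raw: str) -> list:
    """Extract the segment snapshot from a partial_result or final_result frame."""
    event = json.loads(raw)
    return [
        Segment(
            start_time=float(s["start_time"]),
            end_time=float(s["end_time"]),
            speaker_id=int(s["speaker_id"]),
            text=s.get("text", ""),
        )
        for s in event["data"]["segments"]
    ]
```

Picking fields explicitly, rather than unpacking the whole object, keeps the parser working if the server later adds fields to a segment.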

Code snippets
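
The following is a minimal end-to-end sketch of the flow described in this section, not an official client. It assumes the third-party websockets package (pip install websockets) and a PCM s16le buffer already in memory. The diarize and iter_frames names and the URL are placeholders.

```python
import asyncio
import json

START = {"type": "start", "sample_rate": 16000, "channels": 1,
         "sample_width": 2, "format": "pcm_s16le"}
FRAME_BYTES = 6400  # 200 ms at 16 kHz, mono, 16-bit


def iter_frames(pcm: bytes, size: int = FRAME_BYTES):
    """Yield fixed-size frames; the last one may be shorter."""
    for i in range(0, len(pcm), size):
        yield pcm[i:i + size]


async def diarize(url: str, pcm: bytes, realtime: bool = True) -> list:
    """Stream pcm to the endpoint and return the final segment snapshot."""
    import websockets  # third-party dependency, imported lazily (assumption)

    segments = []
    async with websockets.connect(url) as ws:
        await ws.send(json.dumps(START))
        ready = json.loads(await ws.recv())
        if ready.get("type") != "ready":
            raise RuntimeError(f"expected ready, got {ready}")

        async def sender():
            for frame in iter_frames(pcm):
                await ws.send(frame)          # bytes go out as a binary frame
                if realtime:
                    await asyncio.sleep(0.2)  # pace like a live microphone
            await ws.send(json.dumps({"type": "close"}))

        send_task = asyncio.create_task(sender())
        try:
            async for raw in ws:
                msg = json.loads(raw)
                if msg["type"] in ("partial_result", "final_result"):
                    segments = msg["data"]["segments"]  # snapshot: replace
                    if msg["type"] == "final_result":
                        break
                elif msg["type"] == "error":
                    raise RuntimeError(msg.get("detail"))
        finally:
            send_task.cancel()
    return segments
```

A typical call would be asyncio.run(diarize("ws://localhost:8000/diarize/stream", pcm)), with the host and port replaced by your deployment's. Sending runs as a separate task so that partial_result messages can be read while audio is still being pushed.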




    
    
05

Error Codes

On failure the server first sends { "type": "error", "detail": "..." }, then closes the connection with one of the WebSocket close codes below. Clients should not auto-retry on codes in the 4xxx range.

| Code | Name | Description |
| --- | --- | --- |
| 1003 | INVALID_AUDIO | Invalid audio payload. The PCM byte count is odd, or the format does not match. |
| 1008 | INVALID_MESSAGE | Invalid streaming message. JSON parsing failed, or the fields do not match StreamingStartRequest. |
| 1011 | INTERNAL_ERROR | Model not loaded (before the session starts) or inference error. Retry with exponential back-off, at most 2 times. |
| 4401 | UNAUTHORIZED | Auth is enabled but ?token= is missing or does not match. Clear the token and sign in again. |
| 1000 | NORMAL_CLOSURE | The client sent close; the server returned final_result and closed normally. |
| 1006 | ABNORMAL_CLOSURE | Network drop or a firewall killing the connection (reported locally by the client; never sent in a close frame). The client should reconnect automatically and re-send start. |
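
The retry guidance in this section can be expressed as a small decision function. This is a sketch: the next_action name, the return shape, and the 0.5 s base delay are assumptions, since the documentation only specifies exponential back-off with at most 2 retries for 1011.

```python
MAX_INTERNAL_RETRIES = 2  # stated limit for close code 1011


def next_action(code: int, attempt: int, base_delay: float = 0.5):
    """Map a close code to ("retry", delay), ("reconnect", 0.0) or ("stop", None).

    attempt is the number of retries already made for this session.
    """
    if 4000 <= code <= 4999:
        return ("stop", None)            # e.g. 4401: fix the token, never auto-retry
    if code == 1006:
        return ("reconnect", 0.0)        # connection dropped: reconnect, re-send start
    if code == 1011 and attempt < MAX_INTERNAL_RETRIES:
        return ("retry", base_delay * 2 ** attempt)  # exponential back-off
    return ("stop", None)                # 1000, 1003, 1008, or retries exhausted
```

Codes 1003 and 1008 indicate a client-side bug in the audio or the start message, so retrying with the same input would fail the same way.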
06

AI Integration: Copy the Prompt

Paste the prompt below into Claude / Cursor / ChatGPT and let the AI write your integration code. The prompt covers the interface contract, auth, retries, and error handling, so it is hard to get wrong.

AI-READY PROMPT
contract · auth · retry · code skeleton

Integrate fast with AI

Paste it into the chat and add "implement in my stack". Edge cases already covered: large-file chunking advice, rate-limit back-off, missing auth, and model-loading state.