Real-time Speaker Diarization

Pipe in a microphone, phone, or conference feed and get each utterance tagged with a speaker label as it is spoken. This is a WebSocket streaming endpoint: as soon as a complete utterance ends, the server pushes a partial_result. It suits live captions, call-center transcription, and live multi-speaker meeting diarization.

WS /diarize/stream · Stable · v1.2.4 · Updated 2026-04-23

Limits & Constraints

Exceeding any limit returns 413 Payload Too Large or 422 Validation Error; see §05 Error Codes.

AUDIO FORMAT: PCM s16le (16 kHz · mono · 16-bit little-endian)

SESSION LENGTH: 2 hours (the server keeps a 120 s rolling buffer and trims older audio)

RECOMMENDED CHUNK SIZE: 200 ms (3200 samples / 6400 bytes per frame)
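The frame size above follows directly from the audio format. As a minimal sketch (the helper names `frame_bytes` and `split_frames` are illustrative, not part of the service), the arithmetic and the slicing of a PCM buffer into 200 ms frames could look like this:

```python
SAMPLE_RATE = 16_000  # Hz, the only rate the endpoint accepts
SAMPLE_WIDTH = 2      # bytes per sample (16-bit PCM)
CHANNELS = 1          # mono
FRAME_MS = 200        # recommended chunk duration


def frame_bytes(frame_ms: int = FRAME_MS) -> int:
    """Number of bytes in one frame of the given duration."""
    samples = SAMPLE_RATE * frame_ms // 1000  # 16000 * 200 / 1000 = 3200
    return samples * SAMPLE_WIDTH * CHANNELS  # 3200 * 2 * 1 = 6400


def split_frames(pcm: bytes, frame_ms: int = FRAME_MS):
    """Yield consecutive frames; the last frame may be shorter."""
    size = frame_bytes(frame_ms)
    for offset in range(0, len(pcm), size):
        yield pcm[offset:offset + size]


if __name__ == "__main__":
    one_second = bytes(SAMPLE_RATE * SAMPLE_WIDTH)  # 1 s of silence
    print(frame_bytes(), len(list(split_frames(one_second))))  # 6400 5
```

Keeping every frame an even number of bytes matters here, because a payload with an odd byte count cannot be a whole number of 16-bit samples.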

Note: any other format combination is rejected and the connection is closed immediately (close code 1008). In the browser you can create AudioContext({ sampleRate: 16000 }) and let the browser resample to 16 kHz.
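Web Audio hands samples to the page as 32-bit floats in the range [-1, 1], so a client that captures audio this way still has to convert them to s16le before sending. The conversion is sketched below in Python for consistency with the other examples; `floats_to_s16le` is a hypothetical helper name, and scaling by 32767 is one common convention, not something this API prescribes.

```python
import struct


def floats_to_s16le(samples):
    """Convert floats in [-1.0, 1.0] to little-endian 16-bit PCM bytes."""
    ints = []
    for x in samples:
        x = max(-1.0, min(1.0, x))          # clip anything out of range
        ints.append(int(round(x * 32767)))  # scale to the int16 range
    # "<" forces little-endian, "h" is a signed 16-bit integer
    return struct.pack(f"<{len(ints)}h", *ints)


if __name__ == "__main__":
    print(floats_to_s16le([0.0, 1.0, -1.0]).hex())  # 0000ff7f0180
```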

Performance

Per-utterance latency (end of speech → partial_result arrives)

P50 · median: half of the requests are faster and half are slower. This represents the typical experience.

P99 · tail: only 1% of requests are slower than this. Use it to judge whether the worst case is still acceptable.

Utterance latency · P50: 72 ms

Utterance latency · P99: 280 ms
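To compare these figures with what a client actually observes, P50 and P99 can be computed from locally measured latencies. The sketch below uses the nearest-rank definition of a percentile; how the published numbers were computed is not stated, so treat this as one reasonable choice. The sample values are invented for illustration.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


if __name__ == "__main__":
    latencies_ms = [60, 70, 72, 75, 280]  # made-up measurements
    print(percentile(latencies_ms, 50), percentile(latencies_ms, 99))  # 72 280
```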

Try It

Click the mic to start recording. Each time an utterance ends, a new segment appears on the right. Click the mic again, or send close, to trigger the final result. For file uploads, see Speaker Diarization · File.

Protocol & Response

Start with the message sequence and field definitions below, then go to the language tabs at the bottom for copy-paste-ready snippets. If you don't need real-time output, see Speaker Diarization · File mode.

WS /diarize/stream

WebSocket handshake → send a start JSON message → stream PCM binary frames (200 ms per frame recommended) → receive a partial_result each time VAD closes an utterance → send close to receive the final_result and end the session.

Request fields

| Field | Type | Required | Description |
|---|---|---|---|
| token | query | optional | Required only when the server has AUTH_ENABLED=True. Pass it as a URL query parameter: ?token=<token>. The browser WebSocket API cannot set custom headers on the handshake, so the Authorization header isn't available. |
| sample_rate | int | required | Field in the start message. Must be 16000; other values are rejected with close code 1008. |
| channels | int | required | Field in the start message. Must be 1 (mono). |
| sample_width | int | required | Field in the start message. Width in bytes; must be 2 (16-bit PCM). |
| format | string | required | Field in the start message. Must be "pcm_s16le" (16-bit little-endian PCM). |
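The request fields can be assembled as follows. This is a minimal sketch: the host `wss://example.invalid` is a placeholder, and `build_url` and `build_start_message` are illustrative names, not part of any SDK.

```python
import json
from typing import Optional
from urllib.parse import urlencode


def build_url(base: str, token: Optional[str] = None) -> str:
    """Append the endpoint path and, when auth is enabled, ?token=..."""
    url = f"{base}/diarize/stream"
    if token:
        url += "?" + urlencode({"token": token})  # percent-encodes the token
    return url


def build_start_message() -> str:
    """The start JSON; every value below is the only one the server accepts."""
    return json.dumps({
        "type": "start",
        "sample_rate": 16000,
        "channels": 1,
        "sample_width": 2,
        "format": "pcm_s16le",
    })


if __name__ == "__main__":
    print(build_url("wss://example.invalid", "secret"))
    print(build_start_message())
```

Because the token travels in the URL, it can end up in proxy and server access logs; short-lived tokens limit the exposure.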

Message sequence

| Message | Direction | Type | Description |
|---|---|---|---|
| start | → server | JSON | {"type":"start","sample_rate":16000,"channels":1,"sample_width":2,"format":"pcm_s16le"}. Only 16 kHz · mono · s16le is accepted. |
| ready | ← server | JSON | {"type":"ready","session_id":"..."}. Start pushing audio once this arrives. |
| (audio frame) | → server | PCM | Raw PCM s16le binary frames. 200 ms per frame (3200 samples / 6400 bytes) is recommended. |
| partial_result | ← server | JSON | Pushed once per completed utterance. segments is a full snapshot of everything so far, not a delta; clients should replace their list, not append to it. |
| close | → server | JSON | {"type":"close"}. Triggers VAD finalization on pending audio. |
| final_result | ← server | JSON | Same shape as partial_result; the session's terminal state. The server then closes the connection. |
| error | ← server | JSON | {"type":"error","detail":"..."}. The server then closes with a close code (see §05). |
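The snapshot semantics of partial_result are the easiest part of this sequence to get wrong. The sketch below keeps the client-side state in a small class with no network code, so the replace-not-append rule is explicit; `DiarizationSession` is a hypothetical name used only for illustration.

```python
import json


class DiarizationSession:
    """Tracks client-side state from the server's JSON messages."""

    def __init__(self):
        self.session_id = None
        self.segments = []
        self.done = False
        self.error = None

    def handle(self, raw: str) -> None:
        msg = json.loads(raw)
        kind = msg.get("type")
        if kind == "ready":
            self.session_id = msg.get("session_id")
        elif kind in ("partial_result", "final_result"):
            # Each result is a full snapshot: replace, never append.
            self.segments = list(msg["data"]["segments"])
            self.done = kind == "final_result"
        elif kind == "error":
            self.error = msg.get("detail")
            self.done = True  # the server closes right after an error


if __name__ == "__main__":
    s = DiarizationSession()
    s.handle('{"type":"ready","session_id":"abc"}')
    s.handle('{"type":"partial_result","data":{"segments":[{"speaker_id":0}]}}')
    s.handle('{"type":"partial_result","data":{"segments":[{"speaker_id":0},{"speaker_id":1}]}}')
    print(len(s.segments))  # 2
```

Appending instead of replacing would have produced three segments in this run, with the first one duplicated.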

Response sample

PARTIAL_RESULT · JSON FRAME · StreamingPartialResultEvent

// Pushed once per completed utterance; segments is a full snapshot of the current session, not a delta
{
  "type": "partial_result",
  "session_id": "6261ae2e072449438ee3d06f8e822ba1",
  "data": {
    "segments": [
      { "start_time": 0.0, "end_time": 4.81, "speaker_id": 0, "text": "" }
    ]
  },
  "stats": {
    "utterances": 1,
    "chunks": 6,
    "pipeline_ms": 29.5
  }
}
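A client will usually want typed access to the segments rather than raw dictionaries. The sketch below parses the sample above into a dataclass and sums talk time per speaker. `Segment`, `parse_segments`, and `talk_time_by_speaker` are illustrative names. The leading `//` comment in the sample is documentation only and is not valid JSON, so it is left out of the string being parsed.

```python
import json
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    start_time: float  # seconds from the start of the session
    end_time: float
    speaker_id: int
    text: str


def parse_segments(raw: str):
    """Extract the segment snapshot from a partial_result or final_result."""
    event = json.loads(raw)
    return [
        Segment(s["start_time"], s["end_time"], s["speaker_id"], s.get("text", ""))
        for s in event["data"]["segments"]
    ]


def talk_time_by_speaker(segments):
    """Total seconds of speech per speaker_id."""
    totals = defaultdict(float)
    for seg in segments:
        totals[seg.speaker_id] += seg.end_time - seg.start_time
    return dict(totals)


SAMPLE = """{
  "type": "partial_result",
  "session_id": "6261ae2e072449438ee3d06f8e822ba1",
  "data": {"segments": [
    {"start_time": 0.0, "end_time": 4.81, "speaker_id": 0, "text": ""}
  ]},
  "stats": {"utterances": 1, "chunks": 6, "pipeline_ms": 29.5}
}"""

if __name__ == "__main__":
    print(talk_time_by_speaker(parse_segments(SAMPLE)))  # {0: 4.81}
```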

Code snippets
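The following is a hedged end-to-end sketch of the message sequence in Python, not an official client. It assumes the third-party `websockets` package is installed, that the caller passes a full URL (including ?token= when auth is enabled), and that audio is paced at real time. The function names are illustrative.

```python
import asyncio
import json

FRAME_BYTES = 6400   # 200 ms of 16 kHz mono s16le
FRAME_SECONDS = 0.2
START = {"type": "start", "sample_rate": 16000, "channels": 1,
         "sample_width": 2, "format": "pcm_s16le"}


async def send_audio(ws, pcm: bytes) -> None:
    """Stream PCM as binary frames, then ask the server to finalize."""
    for offset in range(0, len(pcm), FRAME_BYTES):
        await ws.send(pcm[offset:offset + FRAME_BYTES])  # bytes -> binary frame
        await asyncio.sleep(FRAME_SECONDS)               # pace at real time
    await ws.send(json.dumps({"type": "close"}))


async def receive_results(ws):
    """Return the last segment snapshot; stop at final_result."""
    segments = []
    async for raw in ws:
        msg = json.loads(raw)
        kind = msg.get("type")
        if kind in ("partial_result", "final_result"):
            segments = msg["data"]["segments"]  # full snapshot: replace
        elif kind == "error":
            raise RuntimeError(msg.get("detail"))
        if kind == "final_result":
            break
    return segments


async def diarize(url: str, pcm: bytes):
    import websockets  # third-party dependency, assumed to be installed

    async with websockets.connect(url) as ws:
        await ws.send(json.dumps(START))
        ready = json.loads(await ws.recv())
        if ready.get("type") != "ready":
            raise RuntimeError(f"unexpected first message: {ready}")
        # Send and receive concurrently so results are read as they arrive.
        sender = asyncio.create_task(send_audio(ws, pcm))
        try:
            return await receive_results(ws)
        finally:
            sender.cancel()
```

A call might look like `asyncio.run(diarize("ws://localhost:8000/diarize/stream", pcm_bytes))`, where the host and port are placeholders. Sending and receiving run concurrently so that partial results are consumed while audio is still being streamed.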

Error Codes

On failure the server first sends { "type": "error", "detail": "..." }, then closes the connection with one of the WebSocket close codes below. Clients should not auto-retry on codes in the 4xxx range.

| Code | Name | Description |
|---|---|---|
| 1003 | INVALID_AUDIO | Invalid audio payload. The PCM byte count is odd, or the format does not match. |
| 1008 | INVALID_MESSAGE | Invalid streaming message. JSON parsing failed, or the fields do not match StreamingStartRequest. |
| 1011 | INTERNAL_ERROR | Model not loaded (before the session starts) or inference error. Retry with exponential back-off, at most 2 times. |
| 4401 | UNAUTHORIZED | Auth is enabled but ?token= is missing from the URL or does not match. Clear the token and sign in again. |
| 1000 | NORMAL_CLOSURE | The client sent close; the server returned final_result and closed normally. |
| 1006 | ABNORMAL_CLOSURE | Network drop or a firewall killing the connection. The client should reconnect automatically and re-send start. |
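The guidance above can be turned into a small decision function. This is a sketch under stated assumptions: the 0.5 s base delay is an arbitrary choice (only "exponential back-off, up to 2 retries" comes from the list above), and `next_action` is a hypothetical helper name.

```python
BASE_DELAY_S = 0.5  # assumed base delay; not specified by the API
MAX_RETRIES = 2     # from the 1011 guidance above


def next_action(close_code: int, attempt: int = 0):
    """Map a close code to (action, delay_seconds).

    attempt is the number of retries already made for this session.
    action is "stop", "retry", or "reconnect".
    """
    if close_code == 1000:
        return ("stop", 0.0)       # normal closure, nothing to do
    if 4000 <= close_code <= 4999:
        return ("stop", 0.0)       # e.g. 4401: never auto-retry
    if close_code == 1011:
        if attempt >= MAX_RETRIES:
            return ("stop", 0.0)
        return ("retry", BASE_DELAY_S * 2 ** attempt)  # 0.5 s, then 1.0 s
    if close_code == 1006:
        return ("reconnect", 0.0)  # reconnect and re-send start
    return ("stop", 0.0)           # 1003 / 1008: fix the client, don't retry


if __name__ == "__main__":
    for code, attempt in [(1011, 0), (1011, 1), (1011, 2), (4401, 0), (1006, 0)]:
        print(code, attempt, next_action(code, attempt))
```

Codes 1003 and 1008 indicate a malformed payload or message, so retrying the same request unchanged would fail again; the sketch treats them as terminal.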

AI Integration: Copy the Prompt

Paste the prompt below into Claude / Cursor / ChatGPT and let the AI write your integration code. The prompt encodes the interface contract, auth, retries, and error handling, which leaves little room for integration mistakes.

AI-READY PROMPT

Integrate fast with AI

Paste the prompt into the chat and add "implement in my stack". It already covers edge cases such as large-file chunking advice, rate-limit back-off, missing auth, and the model-still-loading state.