claudecodeui/docs/voice.md
newsbubbles d05585e1f4 feat(voice): add optional speech-to-text input and read-aloud TTS
Adds a push-to-talk mic button in the composer and a read-aloud button on
assistant messages. Both are opt-in and hidden unless a voice backend is
configured via VOICE_SIDECAR_URL.

The auth-gated /api/voice proxy forwards to a configurable backend exposing
/transcribe and /tts (provider-agnostic); the frontend probes /api/voice/health
and hides the controls when disabled. Adds i18n keys and docs/voice.md.

Includes a local, no-API-key reference backend in voice-sidecar/ (faster-whisper
for STT, Kokoro-82M for TTS, both CPU-capable).
2026-06-08 00:48:24 +01:00

# Voice (optional)
Adds two opt-in voice features to the chat:
- **Push-to-talk dictation**: a mic button in the composer records your voice, transcribes it
  (speech-to-text), and drops the text into the input.
- **Read-aloud**: a speaker button on each assistant message plays it back (text-to-speech).

Voice is **disabled by default**. The UI only appears when a voice backend is configured, so it has
zero impact on installs that don't use it.
## Enable it
Point the server at a voice backend by setting `VOICE_SIDECAR_URL`, then restart it:
```bash
VOICE_SIDECAR_URL=http://127.0.0.1:8765 npm run server
```
When set, `GET /api/voice/health` reports `{ "enabled": true }` and the mic + speaker controls appear.
All voice requests are proxied through the app's authenticated `/api/voice/*` routes, so the backend
itself only needs to listen on localhost and is never exposed directly.
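To check from a script whether voice is switched on, you can probe the health route yourself. The sketch below is illustrative only: `parse_health` and `voice_enabled` are hypothetical helper names, the `headers` argument stands in for whatever credentials your install's auth requires (the auth scheme is not described here), and treating any payload other than `{ "enabled": true }` as "disabled" is an assumption.

```python
import json
import urllib.request
from typing import Dict, Optional


def parse_health(body: bytes) -> bool:
    """Return True only if the payload is a JSON object with "enabled": true."""
    try:
        payload = json.loads(body)
    except ValueError:  # not JSON, or not decodable text
        return False
    return isinstance(payload, dict) and payload.get("enabled") is True


def voice_enabled(app_url: str, headers: Optional[Dict[str, str]] = None) -> bool:
    """Probe GET /api/voice/health on the app itself (not on the sidecar)."""
    req = urllib.request.Request(f"{app_url}/api/voice/health", headers=headers or {})
    try:
        with urllib.request.urlopen(req, timeout=5) as resp:
            return parse_health(resp.read())
    except OSError:  # connection refused, timeout, or an HTTP error status
        return False
```

Call `voice_enabled(app_url)` with the base URL your app server listens on; a `False` result means the controls will stay hidden.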
## Backend contract
`VOICE_SIDECAR_URL` can point at **any** service that implements two endpoints:

| Method & path | Request | Response |
|---|---|---|
| `POST /transcribe` | multipart, field `audio` (webm/mp4/wav/…) | `{ "text": "..." }` |
| `POST /tts` | form field `text` | audio bytes (`audio/*`, e.g. wav/mp3) |

This keeps the feature provider-agnostic: you can back it with the bundled local sidecar or a cloud
transcription + TTS gateway, as long as it speaks that contract.
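When writing or debugging a custom backend, it can help to see what requests matching this contract look like on the wire. The following is a minimal sketch using only the Python standard library. The helper names `build_transcribe_request` and `build_tts_request`, the `clip.webm` filename, and the choice of URL-encoded form data for `/tts` are assumptions for illustration; a given backend may expect multipart for `text` instead.

```python
import urllib.parse
import urllib.request
import uuid


def build_transcribe_request(base_url: str, audio: bytes,
                             filename: str = "clip.webm",
                             content_type: str = "audio/webm") -> urllib.request.Request:
    """Build POST /transcribe with the audio in multipart field `audio`."""
    boundary = uuid.uuid4().hex
    body = b"".join([
        f"--{boundary}\r\n".encode(),
        f'Content-Disposition: form-data; name="audio"; filename="{filename}"\r\n'.encode(),
        f"Content-Type: {content_type}\r\n\r\n".encode(),
        audio,
        f"\r\n--{boundary}--\r\n".encode(),
    ])
    return urllib.request.Request(
        f"{base_url}/transcribe",
        data=body,
        method="POST",
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
    )


def build_tts_request(base_url: str, text: str) -> urllib.request.Request:
    """Build POST /tts with form field `text` (URL-encoded here, by assumption)."""
    body = urllib.parse.urlencode({"text": text}).encode()
    return urllib.request.Request(
        f"{base_url}/tts",
        data=body,
        method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )
```

Passing either request to `urllib.request.urlopen` sends it: a conforming backend should answer `/transcribe` with JSON containing a `text` key and `/tts` with raw audio bytes.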
## Reference backend: `voice-sidecar/`
A local, no-API-key reference implementation using **faster-whisper** (STT) and **Kokoro-82M** (TTS),
both CPU-capable.
```bash
cd voice-sidecar
python -m venv .venv && . .venv/bin/activate # (Windows: .venv\Scripts\activate)
pip install -r requirements.txt
python -m uvicorn app:app --host 127.0.0.1 --port 8765
```
Then run the app with `VOICE_SIDECAR_URL=http://127.0.0.1:8765`.
Configuration is via environment variables, all optional (see `voice-sidecar/.env.example`):
`WHISPER_MODEL_SIZE`, `WHISPER_DEVICE` (`cpu`/`cuda`), `KOKORO_VOICE`, `VOICE_PORT`.
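For a custom backend that wants to honour the same variables, reading them might look roughly like this. It is a sketch, not the bundled sidecar's code: `SidecarConfig` and `load_config` are hypothetical names, and the fallback values `base` and `af_heart` are placeholders chosen for illustration (only `cpu` and port `8765` are taken from this page). Check `.env.example` for the real defaults.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class SidecarConfig:
    whisper_model_size: str
    whisper_device: str
    kokoro_voice: str
    port: int


def load_config(env: Optional[Mapping[str, str]] = None) -> SidecarConfig:
    """Read the optional settings, falling back to illustrative defaults."""
    env = os.environ if env is None else env
    device = env.get("WHISPER_DEVICE", "cpu")
    if device not in ("cpu", "cuda"):
        raise ValueError(f"WHISPER_DEVICE must be 'cpu' or 'cuda', got {device!r}")
    return SidecarConfig(
        whisper_model_size=env.get("WHISPER_MODEL_SIZE", "base"),  # placeholder default
        whisper_device=device,
        kokoro_voice=env.get("KOKORO_VOICE", "af_heart"),  # placeholder default
        port=int(env.get("VOICE_PORT", "8765")),
    )
```

Accepting an explicit mapping instead of always reading `os.environ` keeps the loader easy to test.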
## Notes
- The first read-aloud is slow (~10-20 s) while the model lazy-loads; it's near-instant and cached afterwards.
- Recording needs a secure context (HTTPS or localhost) for microphone access.
- On iOS, playback is tap-initiated (manual read-aloud) to satisfy Safari's autoplay policy.