claudecodeui/docs/voice.md
newsbubbles 711936d279 refactor(voice): provider-agnostic backend and in-app config
Switches the voice proxy to the OpenAI audio API (/v1/audio/transcriptions and
/v1/audio/speech) so it works with OpenAI, Groq, or a local server. Adds a
Settings -> Voice tab (base URL, API key, models, voice) plus a Quick Settings
toggle, and removes the bundled Python sidecar.

Review fixes: stop mic tracks on unmount, clear the global TTS stop handler and
revoke leaked blob URLs, add fetch timeouts in the proxy, surface mic errors in
the button, trim before appending transcripts, and drop the repo-wide wav ignore.
2026-06-09 10:05:06 +01:00


# Voice (optional)
Two opt-in voice features in the chat:
- **Push-to-talk dictation** — a mic button in the composer records, transcribes, and fills the input.
- **Read-aloud** — a speaker button on each assistant message plays it back.
Voice is **off by default**. Turn it on with the **Voice** toggle in Quick Settings or in
**Settings → Voice**. When off, the mic and speaker controls are hidden.
## Backend
Voice uses any **OpenAI-compatible audio backend**, configured in **Settings → Voice**:
| Field | Example | Notes |
|---|---|---|
| Base URL | `https://api.openai.com/v1` | OpenAI, Groq, or a local server |
| API key | `sk-…` | sent only to this app's backend, which proxies the request |
| Speech-to-text model | `whisper-1`, `gpt-4o-transcribe`, `whisper-large-v3-turbo` | |
| Text-to-speech model | `tts-1`, `gpt-4o-mini-tts`, `kokoro` | |
| Voice | `alloy`, `af_heart`, … | depends on the backend |
The backend must expose the standard endpoints:
```
POST {baseUrl}/audio/transcriptions (multipart 'file' + 'model') -> { "text": "..." }
POST {baseUrl}/audio/speech ({ model, voice, input }) -> audio bytes
```
That covers OpenAI and Groq, plus local servers like **LocalAI**, **Speaches**, **Kokoro-FastAPI**,
and **openedai-speech**. Requests are proxied through the app's authenticated `/api/voice/*` routes,
so a local backend only needs to listen on localhost.
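As a rough illustration of the request shapes listed above, the sketch below builds the read-aloud call as plain data. The helper names (`endpointUrl`, `buildSpeechRequest`) are hypothetical and not taken from this app's code. The `Authorization: Bearer` header follows the OpenAI API convention, which is an assumption here; some local servers may ignore it.

```typescript
// Hypothetical helpers for illustration; not this app's actual proxy code.

interface SpeechParams {
  model: string; // e.g. "tts-1"
  voice: string; // e.g. "alloy"
  input: string; // text to read aloud
}

interface PreparedRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

// Join the configured base URL and an endpoint path, tolerating stray slashes.
function endpointUrl(baseUrl: string, path: string): string {
  return `${baseUrl.replace(/\/+$/, "")}/${path.replace(/^\/+/, "")}`;
}

// Build the JSON request for POST {baseUrl}/audio/speech.
function buildSpeechRequest(
  baseUrl: string,
  apiKey: string,
  params: SpeechParams
): PreparedRequest {
  return {
    url: endpointUrl(baseUrl, "audio/speech"),
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      // Assumption: bearer-token auth, as used by the OpenAI API.
      Authorization: `Bearer ${apiKey}`,
    },
    body: JSON.stringify({
      model: params.model,
      voice: params.voice,
      input: params.input,
    }),
  };
}
```

The transcription call would target `endpointUrl(baseUrl, "audio/transcriptions")` in the same way, but with a multipart body carrying the `file` and `model` fields instead of JSON.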
### Server-side defaults (optional)
You can also set environment variables on the server, either in place of the Settings fields or as defaults behind them:
```
VOICE_API_BASE_URL=http://127.0.0.1:8765/v1
VOICE_API_KEY=...
VOICE_STT_MODEL=whisper-1
VOICE_TTS_MODEL=tts-1
VOICE_TTS_VOICE=alloy
```
Per-user Settings values override these. If neither is set, the voice routes return HTTP 503 (Service Unavailable).
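The precedence described above can be sketched as a small merge function. This is a minimal illustration, not the app's implementation: `VoiceConfig`, `firstSet`, and `resolveVoiceConfig` are hypothetical names, and treating a missing base URL as the "nothing configured" case is an assumption.

```typescript
// Hypothetical sketch of "per-user Settings override env defaults".

interface VoiceConfig {
  baseUrl: string;
  apiKey: string;
  sttModel: string;
  ttsModel: string;
  voice: string;
}

type Env = Record<string, string | undefined>;

// Return the first non-blank value (trimmed), or "" when none is set.
function firstSet(...values: Array<string | undefined>): string {
  for (const v of values) {
    if (v !== undefined && v.trim() !== "") return v.trim();
  }
  return "";
}

// Merge per-user settings over server env defaults.
// Returns null when no base URL is available from either source
// (assumption: this is the case in which the routes would answer 503).
function resolveVoiceConfig(
  user: Partial<VoiceConfig>,
  env: Env
): VoiceConfig | null {
  const baseUrl = firstSet(user.baseUrl, env["VOICE_API_BASE_URL"]);
  if (baseUrl === "") return null;
  return {
    baseUrl,
    apiKey: firstSet(user.apiKey, env["VOICE_API_KEY"]),
    sttModel: firstSet(user.sttModel, env["VOICE_STT_MODEL"]),
    ttsModel: firstSet(user.ttsModel, env["VOICE_TTS_MODEL"]),
    voice: firstSet(user.voice, env["VOICE_TTS_VOICE"]),
  };
}
```

On a Node server this might be called as `resolveVoiceConfig(userSettings, process.env)`, with the route handler responding 503 when the result is `null`.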
## Notes
- Recording needs a secure context (HTTPS or localhost) for microphone access.
- On iOS, read-aloud is tap-initiated to satisfy Safari's autoplay policy.
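The secure-context requirement can be checked before showing the mic button. The sketch below is one possible approach, not the app's code: `MicEnvironment` and `micUnavailableReason` are hypothetical names, and the function takes the relevant browser values as an argument so it can run outside a browser.

```typescript
// Hypothetical pre-flight check for microphone recording.

interface MicEnvironment {
  isSecureContext: boolean;
  mediaDevices?: { getUserMedia?: unknown };
}

// Returns a human-readable reason when recording cannot work, or null when it can.
function micUnavailableReason(env: MicEnvironment): string | null {
  if (!env.isSecureContext) {
    return "Microphone access needs HTTPS or localhost.";
  }
  if (typeof env.mediaDevices?.getUserMedia !== "function") {
    return "This browser does not expose getUserMedia.";
  }
  return null;
}

// In a browser this could be called as:
//   micUnavailableReason({
//     isSecureContext: window.isSecureContext,
//     mediaDevices: navigator.mediaDevices,
//   });
```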