refactor(voice): provider-agnostic backend and in-app config

Switches the voice proxy to the OpenAI audio API (/v1/audio/transcriptions and /v1/audio/speech) so it works with OpenAI, Groq, or a local server. Adds a Settings -> Voice tab (base URL, API key, models, voice) plus a Quick Settings toggle, and removes the bundled Python sidecar.

Review fixes:
- stop mic tracks on unmount
- clear the global TTS stop handler and revoke leaked blob URLs
- add fetch timeouts in the proxy
- surface mic errors in the button
- trim before appending transcripts
- drop the repo-wide wav ignore
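
For reviewers, the upstream read-aloud request the proxy now makes can be sketched roughly as below. This is a minimal illustration and not the code in server/voice-proxy.js: `buildSpeechRequest` and the `VoiceTarget` type are hypothetical names, the example base URL and model are placeholders, and the `mp3` default is assumed to mirror the proxy's env fallback.

```typescript
// Hypothetical helper (not part of this commit): builds the fetch arguments
// for an OpenAI-compatible POST {baseUrl}/audio/speech call.
type VoiceTarget = {
  baseUrl: string;
  apiKey: string;
  ttsModel: string;
  ttsVoice: string;
};

type SpeechRequest = {
  url: string;
  init: { method: string; headers: Record<string, string>; body: string };
};

function buildSpeechRequest(cfg: VoiceTarget, text: string, format = 'mp3'): SpeechRequest {
  const headers: Record<string, string> = { 'Content-Type': 'application/json' };
  // Local servers often need no key, so only send Authorization when one is set.
  if (cfg.apiKey) headers['Authorization'] = `Bearer ${cfg.apiKey}`;
  return {
    // Strip a single trailing slash so the path joins cleanly.
    url: `${cfg.baseUrl.replace(/\/$/, '')}/audio/speech`,
    init: {
      method: 'POST',
      headers,
      body: JSON.stringify({
        model: cfg.ttsModel,
        voice: cfg.ttsVoice,
        input: text,
        response_format: format,
      }),
    },
  };
}

// Example: a keyless local backend (placeholder values).
const example = buildSpeechRequest(
  { baseUrl: 'http://127.0.0.1:8765/v1/', apiKey: '', ttsModel: 'kokoro', ttsVoice: 'af_heart' },
  'Hello',
);
console.log(example.url);
```

Keeping the request construction pure like this makes the trailing-slash handling and the optional Authorization header easy to check without a live backend.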
2026-06-25 04:13:51 +08:00 · 2026-06-09 10:05:06 +01:00
parent d05585e1f4
commit 711936d279
21 changed files with 367 additions and 365 deletions
--- a/.gitignore
+++ b/.gitignore
@@ -142,10 +142,3 @@ tasks/

 # Git worktrees
 .worktrees/
-
-# Voice sidecar (Python) — generated, machine-specific, not committed
-voice-sidecar/.venv/
-voice-sidecar/voice_messages/
-voice-sidecar/**/__pycache__/
-*.pyc
-*.wav
--- a/docs/voice.md
+++ b/docs/voice.md
@@ -1,57 +1,51 @@
 # Voice (optional)

-Adds two opt-in voice features to the chat:
+Two opt-in voice features in the chat:

- **Push-to-talk dictation** — a mic button in the composer records your voice, transcribes it
-  (speech-to-text), and drops the text into the input.
- **Read-aloud** — a speaker button on each assistant message plays it back (text-to-speech).
+- **Push-to-talk dictation** — a mic button in the composer records, transcribes, and fills the input.
+- **Read-aloud** — a speaker button on each assistant message plays it back.

-Voice is **disabled by default**. The UI only appears when a voice backend is configured, so it has
-zero impact on installs that don't use it.
+Voice is **off by default**. Turn it on with the **Voice** toggle in Quick Settings or in
+**Settings → Voice**. When off, the mic and speaker controls are hidden.

-## Enable it
+## Backend

-Set `VOICE_SIDECAR_URL` for the server to point at a voice backend, then restart:
+Voice uses any **OpenAI-compatible audio backend**, configured in **Settings → Voice**:

-```bash
-VOICE_SIDECAR_URL=http://127.0.0.1:8765 npm run server
-```
-
-When set, `GET /api/voice/health` reports `{ "enabled": true }` and the mic + speaker controls appear.
-All voice requests are proxied through the app's authenticated `/api/voice/*` routes, so the backend
-itself only needs to listen on localhost and is never exposed directly.
-
-## Backend contract
-
-`VOICE_SIDECAR_URL` can point at **any** service that implements two endpoints:
-
-| Method & path | Request | Response |
+| Field | Example | Notes |
 |---|---|---|
-| `POST /transcribe` | multipart, field `audio` (webm/mp4/wav/…) | `{ "text": "..." }` |
-| `POST /tts` | form field `text` | audio bytes (`audio/*`, e.g. wav/mp3) |
+| Base URL | `https://api.openai.com/v1` | OpenAI, Groq, or a local server |
+| API key | `sk-…` | sent only to this app's backend, which proxies the request |
+| Speech-to-text model | `whisper-1`, `gpt-4o-transcribe`, `whisper-large-v3-turbo` | |
+| Text-to-speech model | `tts-1`, `gpt-4o-mini-tts`, `kokoro` | |
+| Voice | `alloy`, `af_heart`, … | depends on the backend |

-This keeps the feature provider-agnostic — you can back it with the bundled local sidecar, or a cloud
-transcription + TTS gateway, as long as it speaks that contract.
+The backend must expose the standard endpoints:

-## Reference backend: `voice-sidecar/`
-
-A local, no-API-key reference implementation using **faster-whisper** (STT) and **Kokoro-82M** (TTS),
-both CPU-capable.
-
-```bash
-cd voice-sidecar
-python -m venv .venv && . .venv/bin/activate    # (Windows: .venv\Scripts\activate)
-pip install -r requirements.txt
-python -m uvicorn app:app --host 127.0.0.1 --port 8765
+```
+POST {baseUrl}/audio/transcriptions   (multipart 'file' + 'model')   -> { "text": "..." }
+POST {baseUrl}/audio/speech           ({ model, voice, input })       -> audio bytes
 ```

-Then run the app with `VOICE_SIDECAR_URL=http://127.0.0.1:8765`.
+That covers OpenAI and Groq, plus local servers like **LocalAI**, **Speaches**, **Kokoro-FastAPI**,
+and **openedai-speech**. Requests are proxied through the app's authenticated `/api/voice/*` routes,
+so a local backend only needs to listen on localhost.

-Config (env, all optional) — see `voice-sidecar/.env.example`: `WHISPER_MODEL_SIZE`, `WHISPER_DEVICE`
-(`cpu`/`cuda`), `KOKORO_VOICE`, `VOICE_PORT`.
+### Server-side defaults (optional)
+
+Instead of (or as defaults behind) the Settings fields, you can set env vars on the server:
+
+```
+VOICE_API_BASE_URL=http://127.0.0.1:8765/v1
+VOICE_API_KEY=...
+VOICE_STT_MODEL=whisper-1
+VOICE_TTS_MODEL=tts-1
+VOICE_TTS_VOICE=alloy
+```
+
+Per-user Settings values override these. If neither is set, the voice routes return 503.

 ## Notes

- The first read-aloud is slow (~10–20s) while the model lazy-loads; it's near-instant and cached after.
 - Recording needs a secure context (HTTPS or localhost) for microphone access.
- On iOS, playback is tap-initiated (manual read-aloud) to satisfy Safari's autoplay policy.
+- On iOS, read-aloud is tap-initiated to satisfy Safari's autoplay policy.
--- a/server/voice-proxy.js
+++ b/server/voice-proxy.js
@@ -1,48 +1,71 @@
-// Optional voice proxy — forwards speech-to-text / text-to-speech to a configurable backend.
+// Optional voice proxy — forwards STT/TTS to an OpenAI-compatible audio backend.
 //
-// Opt-in: voice is DISABLED unless VOICE_SIDECAR_URL is set. When set, it must point at a
-// backend (any implementation) exposing:
-//     POST /transcribe   (multipart field 'audio')  -> { text }
-//     POST /tts          (form field 'text')        -> audio bytes (audio/*)
-// A reference backend (local faster-whisper + Kokoro) ships in /voice-sidecar, but any
-// service implementing the two endpoints works (e.g. a cloud transcription + TTS gateway).
+// The backend is whatever the user points at: OpenAI, Groq, or a local server
+// (LocalAI / Speaches / Kokoro-FastAPI / openedai-speech / etc.). It must expose the
+// standard OpenAI audio endpoints:
+//     POST {base}/audio/transcriptions   (multipart 'file' + 'model')      -> { text }
+//     POST {base}/audio/speech           ({ model, voice, input })         -> audio bytes
 //
-// Mounted at /api/voice behind authenticateToken, so it inherits the app's auth. The backend
-// should bind to localhost and is never exposed directly.
+// Config is resolved per-request from headers (set by the client's voice settings),
+// falling back to server env defaults. Mounted at /api/voice behind authenticateToken.
 import express from 'express';

-const VOICE_SIDECAR_URL = (process.env.VOICE_SIDECAR_URL || '').replace(/\/$/, '');
-const VOICE_ENABLED = Boolean(VOICE_SIDECAR_URL);
+const ENV = {
+  baseUrl: (process.env.VOICE_API_BASE_URL || '').replace(/\/$/, ''),
+  apiKey: process.env.VOICE_API_KEY || '',
+  sttModel: process.env.VOICE_STT_MODEL || 'whisper-1',
+  ttsModel: process.env.VOICE_TTS_MODEL || 'tts-1',
+  ttsVoice: process.env.VOICE_TTS_VOICE || 'alloy',
+  ttsFormat: process.env.VOICE_TTS_FORMAT || 'mp3',
+};
+
+// Per-request config: client headers (from the user's voice settings) override env defaults.
+function resolveConfig(req) {
+  const h = req.headers;
+  return {
+    baseUrl: (String(h['x-voice-base-url'] || '') || ENV.baseUrl).replace(/\/$/, ''),
+    apiKey: String(h['x-voice-api-key'] || '') || ENV.apiKey,
+    sttModel: String(h['x-voice-stt-model'] || '') || ENV.sttModel,
+    ttsModel: String(h['x-voice-tts-model'] || '') || ENV.ttsModel,
+    ttsVoice: String(h['x-voice-tts-voice'] || '') || ENV.ttsVoice,
+  };
+}

 const router = express.Router();

-// Lazy multer (memory storage) for the audio upload — matches index.js's pattern.
+const VOICE_TIMEOUT_MS = Number(process.env.VOICE_TIMEOUT_MS || 60000);
+async function fetchWithTimeout(url, options = {}) {
+  const controller = new AbortController();
+  const timer = setTimeout(() => controller.abort(), VOICE_TIMEOUT_MS);
+  try {
+    return await fetch(url, { ...options, signal: controller.signal });
+  } finally {
+    clearTimeout(timer);
+  }
+}
+
 let _upload = null;
 async function getUpload() {
  if (!_upload) {
    const multer = (await import('multer')).default;
-    _upload = multer({
-      storage: multer.memoryStorage(),
-      limits: { fileSize: 25 * 1024 * 1024 }, // 25MB — short dictation clips
-    });
+    _upload = multer({ storage: multer.memoryStorage(), limits: { fileSize: 25 * 1024 * 1024 } });
  }
  return _upload;
 }

-function ensureEnabled(res) {
-  if (!VOICE_ENABLED) {
-    res.status(503).json({ error: 'Voice is not configured. Set VOICE_SIDECAR_URL to enable it.' });
-    return false;
-  }
-  return true;
+function authHeader(apiKey) {
+  return apiKey ? { Authorization: `Bearer ${apiKey}` } : {};
 }

-// GET /api/voice/health -> { enabled }  (frontend hides the voice UI when disabled)
-router.get('/health', (_req, res) => res.json({ enabled: VOICE_ENABLED }));
+// GET /api/voice/health -> { configured } (true if a base URL is available)
+router.get('/health', (req, res) => {
+  res.json({ configured: Boolean(resolveConfig(req).baseUrl) });
+});

 // POST /api/voice/transcribe  (multipart 'audio') -> { text }
 router.post('/transcribe', async (req, res) => {
-  if (!ensureEnabled(res)) return;
+  const cfg = resolveConfig(req);
+  if (!cfg.baseUrl) return res.status(503).json({ error: 'No voice backend configured' });
  const upload = await getUpload();
  upload.single('audio')(req, res, async (err) => {
    if (err) return res.status(400).json({ error: err.message });
@@ -50,13 +73,21 @@ router.post('/transcribe', async (req, res) => {
    try {
      const fd = new FormData();
      fd.append(
-        'audio',
+        'file',
        new Blob([req.file.buffer], { type: req.file.mimetype || 'audio/webm' }),
        req.file.originalname || 'recording.webm',
      );
-      const r = await fetch(`${VOICE_SIDECAR_URL}/transcribe`, { method: 'POST', body: fd });
-      const data = await r.json().catch(() => ({ error: 'bad voice backend response' }));
-      res.status(r.status).json(data);
+      fd.append('model', cfg.sttModel);
+      const r = await fetchWithTimeout(`${cfg.baseUrl}/audio/transcriptions`, {
+        method: 'POST',
+        headers: authHeader(cfg.apiKey),
+        body: fd,
+      });
+      const text = await r.text();
+      if (!r.ok) return res.status(r.status).json({ error: text || 'transcription failed' });
+      let data;
+      try { data = JSON.parse(text); } catch { data = { text }; }
+      res.json({ text: data.text ?? '' });
    } catch (e) {
      res.status(502).json({ error: `voice backend unreachable: ${e.message}` });
    }
@@ -65,18 +96,26 @@ router.post('/transcribe', async (req, res) => {

 // POST /api/voice/tts  { text } -> audio bytes
 router.post('/tts', async (req, res) => {
-  if (!ensureEnabled(res)) return;
+  const cfg = resolveConfig(req);
+  if (!cfg.baseUrl) return res.status(503).json({ error: 'No voice backend configured' });
  const text = req.body?.text;
  if (!text || !text.trim()) return res.status(400).json({ error: 'text required' });
  try {
-    const fd = new FormData();
-    fd.append('text', text);
-    const r = await fetch(`${VOICE_SIDECAR_URL}/tts`, { method: 'POST', body: fd });
+    const r = await fetchWithTimeout(`${cfg.baseUrl}/audio/speech`, {
+      method: 'POST',
+      headers: { 'Content-Type': 'application/json', ...authHeader(cfg.apiKey) },
+      body: JSON.stringify({
+        model: cfg.ttsModel,
+        voice: cfg.ttsVoice,
+        input: text,
+        response_format: ENV.ttsFormat,
+      }),
+    });
    if (!r.ok) {
      const errText = await r.text().catch(() => 'tts failed');
      return res.status(r.status).json({ error: errText });
    }
-    res.setHeader('Content-Type', r.headers.get('content-type') || 'audio/wav');
+    res.setHeader('Content-Type', r.headers.get('content-type') || 'audio/mpeg');
    res.setHeader('Cache-Control', 'no-store');
    res.send(Buffer.from(await r.arrayBuffer()));
  } catch (e) {
--- a/src/components/chat/hooks/useTts.ts
+++ b/src/components/chat/hooks/useTts.ts
@@ -1,5 +1,6 @@
 import { useCallback, useEffect, useRef, useState } from 'react';
 import { authenticatedFetch } from '../../../utils/api';
+import { voiceConfigHeaders } from '../../../hooks/useVoiceConfig';

 // Only one message speaks at a time across the whole app.
 let stopActive: (() => void) | null = null;
@@ -36,8 +37,14 @@ export function useTts(getText: () => string) {
    if (stopActive) stopActive = null;
  }, [reset]);

-  // Cleanup on unmount.
-  useEffect(() => () => reset(), [reset]);
+  // Cleanup on unmount: drop the global stop handler if it points at us, then reset.
+  useEffect(
+    () => () => {
+      if (stopActive === stop) stopActive = null;
+      reset();
+    },
+    [reset, stop],
+  );

  const play = useCallback(async () => {
    if (stopActive) stopActive();
@@ -63,12 +70,16 @@ export function useTts(getText: () => string) {
      const res = await authenticatedFetch('/api/voice/tts', {
        method: 'POST',
        body: JSON.stringify({ text }),
+        headers: voiceConfigHeaders(),
      });
      if (!res.ok) throw new Error(`tts ${res.status}`);
      const blob = await res.blob();
      const url = URL.createObjectURL(blob);
+      if (audioRef.current !== audio) {
+        URL.revokeObjectURL(url); // stopped while loading; don't leak the blob URL
+        return;
+      }
      urlRef.current = url;
-      if (audioRef.current !== audio) return; // stopped while loading
      audio.src = url;
      audio.load();
      await audio.play();
--- a/src/components/chat/hooks/useVoiceAvailable.ts
+++ b/src/components/chat/hooks/useVoiceAvailable.ts
@@ -1,38 +1,37 @@
 import { useEffect, useState } from 'react';
-import { authenticatedFetch } from '../../../utils/api';

-// Whether the optional voice feature is configured on the server (VOICE_SIDECAR_URL set).
-// Probed once and cached app-wide so the mic/speak controls can hide themselves when off.
-let cached: boolean | null = null;
-let inflight: Promise<boolean> | null = null;
+// Voice UI is gated on the `voiceEnabled` UI preference (toggled in Quick Settings /
+// the Settings modal). This is a lightweight read-only view of that preference so the
+// mic/speak controls can hide themselves, kept in sync via the same events
+// useUiPreferences emits. No server probe.
+const STORAGE_KEY = 'uiPreferences';
+const SYNC_EVENT = 'ui-preferences:sync';

-function probe(): Promise<boolean> {
-  if (cached !== null) return Promise.resolve(cached);
-  if (!inflight) {
-    inflight = authenticatedFetch('/api/voice/health')
-      .then((r) => (r.ok ? r.json() : { enabled: false }))
-      .then((d) => {
-        cached = Boolean(d?.enabled);
-        return cached;
-      })
-      .catch(() => {
-        cached = false;
-        return false;
-      });
+function readVoiceEnabled(): boolean {
+  try {
+    const raw = localStorage.getItem(STORAGE_KEY);
+    if (!raw) return false;
+    const parsed = JSON.parse(raw);
+    return parsed?.voiceEnabled === true || parsed?.voiceEnabled === 'true';
+  } catch {
+    return false;
  }
-  return inflight;
 }

 export function useVoiceAvailable(): boolean {
-  const [available, setAvailable] = useState<boolean>(cached ?? false);
+  const [enabled, setEnabled] = useState<boolean>(() =>
+    typeof window === 'undefined' ? false : readVoiceEnabled(),
+  );
+
  useEffect(() => {
-    let mounted = true;
-    probe().then((v) => {
-      if (mounted) setAvailable(v);
-    });
+    const update = () => setEnabled(readVoiceEnabled());
+    window.addEventListener('storage', update);
+    window.addEventListener(SYNC_EVENT, update as EventListener);
    return () => {
-      mounted = false;
+      window.removeEventListener('storage', update);
+      window.removeEventListener(SYNC_EVENT, update as EventListener);
    };
  }, []);
-  return available;
+
+  return enabled;
 }
--- a/src/components/chat/hooks/useVoiceInput.ts
+++ b/src/components/chat/hooks/useVoiceInput.ts
@@ -1,5 +1,6 @@
-import { useCallback, useRef, useState } from 'react';
+import { useCallback, useEffect, useRef, useState } from 'react';
 import { authenticatedFetch } from '../../../utils/api';
+import { voiceConfigHeaders } from '../../../hooks/useVoiceConfig';

 // Mobile-safe recording: iOS Safari 18.4+ supports webm/opus; older iOS needs mp4.
 const MIME_CANDIDATES = [
@@ -39,6 +40,15 @@ export function useVoiceInput(onTranscript: (text: string) => void, onError?: (m
    streamRef.current = null;
  };

+  // Stop the mic if the component unmounts mid-recording.
+  useEffect(() => {
+    return () => {
+      streamRef.current?.getTracks().forEach((t) => t.stop());
+      streamRef.current = null;
+      recorderRef.current = null;
+    };
+  }, []);
+
  const start = useCallback(async () => {
    try {
      const stream = await navigator.mediaDevices.getUserMedia({
@@ -68,7 +78,11 @@ export function useVoiceInput(onTranscript: (text: string) => void, onError?: (m
          const ext = type.includes('mp4') ? 'm4a' : type.includes('ogg') ? 'ogg' : 'webm';
          const fd = new FormData();
          fd.append('audio', blob, `recording.${ext}`);
-          const res = await authenticatedFetch('/api/voice/transcribe', { method: 'POST', body: fd });
+          const res = await authenticatedFetch('/api/voice/transcribe', {
+            method: 'POST',
+            body: fd,
+            headers: voiceConfigHeaders(),
+          });
          if (!res.ok) throw new Error(`transcribe ${res.status}`);
          const data = await res.json();
          const text = String(data?.text || '').trim();
--- a/src/components/chat/view/ChatInterface.tsx
+++ b/src/components/chat/view/ChatInterface.tsx
@@ -404,7 +404,7 @@ function ChatInterface({
          renderInputWithMentions={renderInputWithMentions}
          textareaRef={textareaRef}
          input={input}
-          onVoiceTranscript={(text) => setInput(input ? `${input} ${text}` : text)}
+          onVoiceTranscript={(text) => setInput(input.trim() ? `${input.trim()} ${text}` : text)}
          onInputChange={handleInputChange}
          onTextareaClick={handleTextareaClick}
          onTextareaKeyDown={handleKeyDown}
--- a/src/components/chat/view/subcomponents/VoiceInputButton.tsx
+++ b/src/components/chat/view/subcomponents/VoiceInputButton.tsx
@@ -1,3 +1,4 @@
+import { useEffect, useRef, useState } from 'react';
 import { Mic, Square, Loader2 } from 'lucide-react';
 import { useTranslation } from 'react-i18next';
 import { useVoiceInput } from '../../hooks/useVoiceInput';
@@ -10,10 +11,25 @@ type Props = {
 };

 // Push-to-talk mic button. Renders nothing unless the optional voice feature is enabled.
+// Surfaces transcription errors itself (transiently) so they aren't silently swallowed.
 export default function VoiceInputButton({ onTranscript, onError }: Props) {
  const { t } = useTranslation('chat');
  const available = useVoiceAvailable();
-  const { state, toggle } = useVoiceInput(onTranscript, onError);
+  const [errorMsg, setErrorMsg] = useState<string | null>(null);
+  const errorTimer = useRef<ReturnType<typeof setTimeout> | null>(null);
+
+  const handleError = (msg: string) => {
+    onError?.(msg);
+    setErrorMsg(msg);
+    if (errorTimer.current) clearTimeout(errorTimer.current);
+    errorTimer.current = setTimeout(() => setErrorMsg(null), 4000);
+  };
+
+  const { state, toggle } = useVoiceInput(onTranscript, handleError);
+
+  useEffect(() => () => {
+    if (errorTimer.current) clearTimeout(errorTimer.current);
+  }, []);

  if (!available) return null;

@@ -27,14 +43,21 @@ export default function VoiceInputButton({ onTranscript, onError }: Props) {
    );

  return (
-    <PromptInputButton
-      tooltip={{ content: state === 'recording' ? t('voice.stopRecording') : t('voice.input') }}
-      onClick={(e: { preventDefault: () => void }) => {
-        e.preventDefault();
-        toggle();
-      }}
-    >
-      {icon}
-    </PromptInputButton>
+    <span className="relative inline-flex">
+      {errorMsg && (
+        <span className="absolute bottom-full left-1/2 mb-1 -translate-x-1/2 whitespace-nowrap rounded bg-red-600 px-2 py-1 text-xs text-white shadow-lg">
+          {errorMsg}
+        </span>
+      )}
+      <PromptInputButton
+        tooltip={{ content: state === 'recording' ? t('voice.stopRecording') : t('voice.input') }}
+        onClick={(e: { preventDefault: () => void }) => {
+          e.preventDefault();
+          toggle();
+        }}
+      >
+        {icon}
+      </PromptInputButton>
+    </span>
  );
 }
--- a/src/components/quick-settings-panel/constants.ts
+++ b/src/components/quick-settings-panel/constants.ts
@@ -4,6 +4,7 @@ import {
  Eye,
  Languages,
  Maximize2,
+  Mic,
 } from 'lucide-react';
 import type { PreferenceToggleItem } from './types';

@@ -54,4 +55,9 @@ export const INPUT_SETTING_TOGGLES: PreferenceToggleItem[] = [
    labelKey: 'quickSettings.sendByCtrlEnter',
    icon: Languages,
  },
+  {
+    key: 'voiceEnabled',
+    labelKey: 'quickSettings.voiceEnabled',
+    icon: Mic,
+  },
 ];
--- a/src/components/quick-settings-panel/types.ts
+++ b/src/components/quick-settings-panel/types.ts
@@ -6,7 +6,8 @@ export type PreferenceToggleKey =
  | 'showRawParameters'
  | 'showThinking'
  | 'autoScrollToBottom'
-  | 'sendByCtrlEnter';
+  | 'sendByCtrlEnter'
+  | 'voiceEnabled';

 export type QuickSettingsPreferences = Record<PreferenceToggleKey, boolean>;

--- a/src/components/settings/types/types.ts
+++ b/src/components/settings/types/types.ts
@@ -2,7 +2,7 @@ import type { Dispatch, SetStateAction } from 'react';
 import type { LLMProvider } from '../../../types/app';
 import type { ProviderAuthStatus } from '../../provider-auth/types';

-export type SettingsMainTab = 'agents' | 'appearance' | 'git' | 'api' | 'tasks' | 'notifications' | 'plugins' | 'about';
+export type SettingsMainTab = 'agents' | 'appearance' | 'git' | 'api' | 'voice' | 'tasks' | 'notifications' | 'plugins' | 'about';
 export type AgentProvider = LLMProvider;
 export type AgentCategory = 'account' | 'permissions' | 'mcp';
 export type ProjectSortOrder = 'name' | 'date';
--- a/src/components/settings/view/Settings.tsx
+++ b/src/components/settings/view/Settings.tsx
@@ -6,6 +6,7 @@ import SettingsSidebar from '../view/SettingsSidebar';
 import AgentsSettingsTab from '../view/tabs/agents-settings/AgentsSettingsTab';
 import AppearanceSettingsTab from '../view/tabs/AppearanceSettingsTab';
 import CredentialsSettingsTab from '../view/tabs/api-settings/CredentialsSettingsTab';
+import VoiceSettingsTab from '../view/tabs/VoiceSettingsTab';
 import GitSettingsTab from '../view/tabs/git-settings/GitSettingsTab';
 import NotificationsSettingsTab from '../view/tabs/NotificationsSettingsTab';
 import TasksSettingsTab from '../view/tabs/tasks-settings/TasksSettingsTab';
@@ -153,6 +154,8 @@ function Settings({ isOpen, onClose, projects = [], initialTab = 'agents' }: Set

              {activeTab === 'api' && <CredentialsSettingsTab />}

+              {activeTab === 'voice' && <VoiceSettingsTab />}
+
              {activeTab === 'plugins' && <PluginSettingsTab />}

              {activeTab === 'about' && <AboutTab />}
--- a/src/components/settings/view/SettingsSidebar.tsx
+++ b/src/components/settings/view/SettingsSidebar.tsx
@@ -1,4 +1,4 @@
-import { Bell, Bot, GitBranch, Info, Key, ListChecks, Palette, Puzzle } from 'lucide-react';
+import { Bell, Bot, GitBranch, Info, Key, ListChecks, Mic, Palette, Puzzle } from 'lucide-react';
 import { useTranslation } from 'react-i18next';
 import { cn } from '../../../lib/utils';
 import { PillBar, Pill } from '../../../shared/view/ui';
@@ -20,6 +20,7 @@ const NAV_ITEMS: NavItem[] = [
  { id: 'appearance', labelKey: 'mainTabs.appearance', icon: Palette },
  { id: 'git', labelKey: 'mainTabs.git', icon: GitBranch },
  { id: 'api', labelKey: 'mainTabs.apiTokens', icon: Key },
+  { id: 'voice', labelKey: 'mainTabs.voice', icon: Mic },
  { id: 'tasks', labelKey: 'mainTabs.tasks', icon: ListChecks },
  { id: 'plugins', labelKey: 'mainTabs.plugins', icon: Puzzle },
  { id: 'notifications', labelKey: 'mainTabs.notifications', icon: Bell },
--- a/src/components/settings/view/tabs/VoiceSettingsTab.tsx
+++ b/src/components/settings/view/tabs/VoiceSettingsTab.tsx
@@ -0,0 +1,82 @@
+import type { InputHTMLAttributes } from 'react';
+import { useTranslation } from 'react-i18next';
+import SettingsSection from '../SettingsSection';
+import SettingsToggle from '../SettingsToggle';
+import { useUiPreferences } from '../../../../hooks/useUiPreferences';
+import { useVoiceConfig } from '../../../../hooks/useVoiceConfig';
+
+const inputClass =
+  'w-full rounded-md border border-border bg-background px-3 py-2 text-sm text-foreground placeholder:text-muted-foreground focus:outline-none focus:ring-2 focus:ring-ring';
+
+function Field({ label, ...props }: { label: string } & InputHTMLAttributes<HTMLInputElement>) {
+  return (
+    <label className="block space-y-1">
+      <span className="text-sm font-medium text-foreground">{label}</span>
+      <input className={inputClass} {...props} />
+    </label>
+  );
+}
+
+export default function VoiceSettingsTab() {
+  const { t } = useTranslation('settings');
+  const { preferences, setPreference } = useUiPreferences();
+  const { config, update } = useVoiceConfig();
+
+  return (
+    <div className="space-y-8">
+      <SettingsSection title={t('voiceSettings.title')} description={t('voiceSettings.description')}>
+        <div className="flex items-center justify-between rounded-lg border border-border p-3">
+          <div className="pr-3">
+            <div className="text-sm font-medium text-foreground">{t('voiceSettings.enable')}</div>
+            <div className="text-xs text-muted-foreground">{t('voiceSettings.enableDescription')}</div>
+          </div>
+          <SettingsToggle
+            checked={preferences.voiceEnabled}
+            onChange={(v) => setPreference('voiceEnabled', v)}
+            ariaLabel={t('voiceSettings.enable')}
+          />
+        </div>
+      </SettingsSection>
+
+      <SettingsSection title={t('voiceSettings.backendTitle')} description={t('voiceSettings.backendDescription')}>
+        <div className="space-y-4">
+          <Field
+            label={t('voiceSettings.baseUrl')}
+            placeholder="https://api.openai.com/v1"
+            value={config.baseUrl}
+            onChange={(e) => update({ baseUrl: e.target.value })}
+          />
+          <Field
+            label={t('voiceSettings.apiKey')}
+            type="password"
+            autoComplete="off"
+            placeholder="sk-…"
+            value={config.apiKey}
+            onChange={(e) => update({ apiKey: e.target.value })}
+          />
+          <div className="grid grid-cols-1 gap-4 sm:grid-cols-3">
+            <Field
+              label={t('voiceSettings.sttModel')}
+              placeholder="whisper-1"
+              value={config.sttModel}
+              onChange={(e) => update({ sttModel: e.target.value })}
+            />
+            <Field
+              label={t('voiceSettings.ttsModel')}
+              placeholder="tts-1"
+              value={config.ttsModel}
+              onChange={(e) => update({ ttsModel: e.target.value })}
+            />
+            <Field
+              label={t('voiceSettings.voice')}
+              placeholder="alloy"
+              value={config.ttsVoice}
+              onChange={(e) => update({ ttsVoice: e.target.value })}
+            />
+          </div>
+          <p className="text-xs text-muted-foreground">{t('voiceSettings.note')}</p>
+        </div>
+      </SettingsSection>
+    </div>
+  );
+}
--- a/src/hooks/useUiPreferences.ts
+++ b/src/hooks/useUiPreferences.ts
@@ -7,6 +7,7 @@ type UiPreferences = {
  autoScrollToBottom: boolean;
  sendByCtrlEnter: boolean;
  sidebarVisible: boolean;
+  voiceEnabled: boolean;
 };

 type UiPreferenceKey = keyof UiPreferences;
@@ -39,6 +40,7 @@ const DEFAULTS: UiPreferences = {
  autoScrollToBottom: true,
  sendByCtrlEnter: false,
  sidebarVisible: true,
+  voiceEnabled: false,
 };

 const PREFERENCE_KEYS = Object.keys(DEFAULTS) as UiPreferenceKey[];
--- a/src/hooks/useVoiceConfig.ts
+++ b/src/hooks/useVoiceConfig.ts
@@ -0,0 +1,57 @@
+import { useState } from 'react';
+
+export type VoiceConfig = {
+  baseUrl: string;
+  apiKey: string;
+  sttModel: string;
+  ttsModel: string;
+  ttsVoice: string;
+};
+
+const STORAGE_KEY = 'voiceConfig';
+const DEFAULTS: VoiceConfig = { baseUrl: '', apiKey: '', sttModel: '', ttsModel: '', ttsVoice: '' };
+
+function read(): VoiceConfig {
+  try {
+    const raw = localStorage.getItem(STORAGE_KEY);
+    if (!raw) return { ...DEFAULTS };
+    const parsed = JSON.parse(raw);
+    return { ...DEFAULTS, ...(parsed && typeof parsed === 'object' ? parsed : {}) };
+  } catch {
+    return { ...DEFAULTS };
+  }
+}
+
+// Headers the voice proxy reads to target a per-user OpenAI-compatible backend.
+// Empty fields are omitted so the server's env defaults apply.
+export function voiceConfigHeaders(): Record<string, string> {
+  if (typeof window === 'undefined') return {};
+  const c = read();
+  const h: Record<string, string> = {};
+  if (c.baseUrl) h['x-voice-base-url'] = c.baseUrl;
+  if (c.apiKey) h['x-voice-api-key'] = c.apiKey;
+  if (c.sttModel) h['x-voice-stt-model'] = c.sttModel;
+  if (c.ttsModel) h['x-voice-tts-model'] = c.ttsModel;
+  if (c.ttsVoice) h['x-voice-tts-voice'] = c.ttsVoice;
+  return h;
+}
+
+export function useVoiceConfig() {
+  const [config, setConfig] = useState<VoiceConfig>(() =>
+    typeof window === 'undefined' ? { ...DEFAULTS } : read(),
+  );
+
+  const update = (patch: Partial<VoiceConfig>) => {
+    setConfig((prev) => {
+      const next = { ...prev, ...patch };
+      try {
+        localStorage.setItem(STORAGE_KEY, JSON.stringify(next));
+      } catch {
+        /* ignore persistence errors */
+      }
+      return next;
+    });
+  };
+
+  return { config, update };
+}
--- a/src/i18n/locales/en/settings.json
+++ b/src/i18n/locales/en/settings.json
@@ -49,6 +49,20 @@
    "resetToDefaults": "Reset to Defaults",
    "cancelChanges": "Cancel Changes"
  },
+  "voiceSettings": {
+    "title": "Voice",
+    "description": "Speech-to-text input and read-aloud, via an OpenAI-compatible audio backend.",
+    "enable": "Enable voice",
+    "enableDescription": "Show the mic button and the read-aloud button on messages.",
+    "backendTitle": "Backend",
+    "backendDescription": "Point at OpenAI, Groq, or a local server (LocalAI, Speaches, Kokoro-FastAPI). Leave blank to use the server default.",
+    "baseUrl": "Base URL",
+    "apiKey": "API key",
+    "sttModel": "Speech-to-text model",
+    "ttsModel": "Text-to-speech model",
+    "voice": "Voice",
+    "note": "The shown defaults work with OpenAI once you add a key. For other providers, set the base URL and model names to match."
+  },
  "quickSettings": {
    "title": "Quick Settings",
    "sections": {
@@ -63,6 +77,7 @@
    "showThinking": "Show thinking",
    "autoScrollToBottom": "Auto-scroll to bottom",
    "sendByCtrlEnter": "Send by Ctrl+Enter",
+    "voiceEnabled": "Voice (mic + read aloud)",
    "sendByCtrlEnterDescription": "When enabled, pressing Ctrl+Enter will send the message instead of just Enter. This is useful for IME users to avoid accidental sends.",
    "dragHandle": {
      "dragging": "Dragging handle",
@@ -93,6 +108,7 @@
    "appearance": "Appearance",
    "git": "Git",
    "apiTokens": "API & Tokens",
+    "voice": "Voice",
    "tasks": "Tasks",
    "notifications": "Notifications",
    "plugins": "Plugins",
--- a/voice-sidecar/.env.example
+++ b/voice-sidecar/.env.example
@@ -1,14 +0,0 @@
-# Voice sidecar config (all optional — these are the defaults).
-# The sidecar binds 127.0.0.1 only; CloudCLI's Express proxy reaches it.
-
-# Port the sidecar listens on (CloudCLI reaches it via VOICE_SIDECAR_URL).
-VOICE_PORT=8765
-
-# faster-whisper model size: tiny | base | small | medium | large-v3
-WHISPER_MODEL_SIZE=base
-# cpu (int8, default) or cuda (float16, needs a CUDA torch in the venv)
-WHISPER_DEVICE=cpu
-
-# Kokoro voice (see https://github.com/hexgrad/kokoro for the full list) and language code.
-KOKORO_VOICE=af_heart
-KOKORO_LANG=a
--- a/voice-sidecar/app.py
+++ b/voice-sidecar/app.py
@@ -1,187 +0,0 @@
-"""
-CloudCLI voice sidecar — local STT (faster-whisper) + local TTS (Kokoro-82M).
-
-Ported from the tooler voice endpoints (D:\\tooler\\backend\\server.py), swapping
-edge-tts -> Kokoro. Bound to 127.0.0.1 only; CloudCLI's Express server proxies to
-it behind JWT auth. Never exposed to the tailnet directly.
-
-Endpoints:
-  GET  /health           -> {status, whisper_loaded, kokoro_loaded}
-  POST /transcribe       (multipart 'audio')        -> {text, duration_ms}
-  POST /tts              (form 'text')              -> audio/wav bytes (cached)
-"""
-import asyncio
-import hashlib
-import logging
-import os
-import re
-import tempfile
-import time
-from pathlib import Path
-
-import numpy as np
-import soundfile as sf
-from fastapi import FastAPI, File, Form, HTTPException, UploadFile
-from fastapi.responses import Response
-
-logging.basicConfig(level=logging.INFO)
-logger = logging.getLogger("voice-sidecar")
-
-# ---- Config (env-overridable) -------------------------------------------------
-PORT = int(os.getenv("VOICE_PORT", "8765"))
-WHISPER_MODEL_SIZE = os.getenv("WHISPER_MODEL_SIZE", "base")
-WHISPER_DEVICE = os.getenv("WHISPER_DEVICE", "cpu").lower()      # "cpu" | "cuda"
-KOKORO_VOICE = os.getenv("KOKORO_VOICE", "af_heart")
-KOKORO_LANG = os.getenv("KOKORO_LANG", "a")                      # 'a' = American English
-KOKORO_SR = 24000
-
-VOICE_DIR = Path(__file__).parent / "voice_messages"
-VOICE_DIR.mkdir(exist_ok=True)
-
-# ---- Lazy model singletons ----------------------------------------------------
-_whisper = None
-_whisper_lock = asyncio.Lock()
-_kpipe = None
-_kpipe_lock = asyncio.Lock()
-
-
-async def get_whisper():
-    global _whisper
-    if _whisper is not None:
-        return _whisper
-    async with _whisper_lock:
-        if _whisper is not None:
-            return _whisper
-
-        def _load():
-            from faster_whisper import WhisperModel
-            if WHISPER_DEVICE == "cuda":
-                try:
-                    logger.info("[WHISPER] loading on CUDA (float16)...")
-                    return WhisperModel(WHISPER_MODEL_SIZE, device="cuda", compute_type="float16")
-                except Exception as e:  # noqa: BLE001
-                    logger.warning("[WHISPER] CUDA failed (%s), falling back to CPU", e)
-            logger.info("[WHISPER] loading '%s' on CPU (int8)", WHISPER_MODEL_SIZE)
-            return WhisperModel(WHISPER_MODEL_SIZE, device="cpu", compute_type="int8")
-
-        _whisper = await asyncio.get_event_loop().run_in_executor(None, _load)
-        logger.info("[WHISPER] ready")
-        return _whisper
-
-
-async def get_kokoro():
-    global _kpipe
-    if _kpipe is not None:
-        return _kpipe
-    async with _kpipe_lock:
-        if _kpipe is not None:
-            return _kpipe
-
-        def _load():
-            from kokoro import KPipeline
-            logger.info("[KOKORO] loading pipeline (lang=%s)...", KOKORO_LANG)
-            return KPipeline(lang_code=KOKORO_LANG)
-
-        _kpipe = await asyncio.get_event_loop().run_in_executor(None, _load)
-        logger.info("[KOKORO] ready")
-        return _kpipe
-
-
-# ---- Text cleaning (ported verbatim from tooler prepare_text_for_tts) ---------
-def prepare_text_for_tts(text: str) -> str:
-    """Strip/transform markdown for natural speech."""
-    text = re.sub(r"```[\s\S]*?```", " code block ", text)   # code fences -> spoken stub
-    text = re.sub(r"`([^`]+)`", r"\1", text)                  # unwrap inline code
-    text = re.sub(r"\*\*([^*]+)\*\*", r"\1", text)            # bold
-    text = re.sub(r"\*([^*]+)\*", r"\1", text)                # italic
-    text = re.sub(r"\[([^\]]+)\]\([^)]+\)", r"\1", text)      # links -> link text
-    text = re.sub(r"^#{1,6}\s+", "", text, flags=re.MULTILINE)  # headers
-    text = re.sub(r"\s+", " ", text).strip()
-    return text
-
-
-# ---- App ----------------------------------------------------------------------
-app = FastAPI(title="CloudCLI voice sidecar")
-
-
-@app.get("/health")
-async def health():
-    return {
-        "status": "ok",
-        "whisper_loaded": _whisper is not None,
-        "kokoro_loaded": _kpipe is not None,
-    }
-
-
-@app.post("/transcribe")
-async def transcribe(audio: UploadFile = File(...)):
-    start = time.time()
-    suffix = Path(audio.filename or "rec.webm").suffix or ".webm"
-    content = await audio.read()
-    logger.info("[STT] %d bytes (%s)", len(content), audio.content_type)
-
-    tmp_path = None
-    try:
-        with tempfile.NamedTemporaryFile(delete=False, suffix=suffix) as tmp:
-            tmp.write(content)
-            tmp_path = tmp.name
-
-        model = await get_whisper()
-
-        def _run():
-            segments, _info = model.transcribe(tmp_path, beam_size=5)
-            return "".join(seg.text for seg in segments).strip()
-
-        text = await asyncio.get_event_loop().run_in_executor(None, _run)
-        duration_ms = int((time.time() - start) * 1000)
-        logger.info("[STT] %dms: %s", duration_ms, text[:100])
-        return {"text": text, "duration_ms": duration_ms}
-    except Exception as e:  # noqa: BLE001
-        logger.error("[STT] failed: %s", e, exc_info=True)
-        raise HTTPException(status_code=500, detail=f"Transcription failed: {e}")
-    finally:
-        if tmp_path and os.path.exists(tmp_path):
-            try:
-                os.unlink(tmp_path)
-            except OSError:
-                pass
-
-
-@app.post("/tts")
-async def tts(text: str = Form(...)):
-    if not text.strip():
-        raise HTTPException(status_code=400, detail="Text cannot be empty")
-    if len(text) > 8000:
-        raise HTTPException(status_code=400, detail="Text too long (max 8000 chars)")
-
-    start = time.time()
-    clean = prepare_text_for_tts(text)
-    # Cache on the RAW text hash (matches tooler) so identical messages reuse audio.
-    key = hashlib.sha256(text.encode()).hexdigest()[:16]
-    out_path = VOICE_DIR / f"{key}.wav"
-
-    if not out_path.exists():
-        try:
-            pipeline = await get_kokoro()
-
-            def _synth():
-                chunks = [audio for _gs, _ps, audio in pipeline(clean, voice=KOKORO_VOICE)]
-                if not chunks:
-                    raise RuntimeError("Kokoro produced no audio")
-                full = np.concatenate([np.asarray(c, dtype=np.float32) for c in chunks])
-                sf.write(str(out_path), full, KOKORO_SR)
-
-            await asyncio.get_event_loop().run_in_executor(None, _synth)
-            logger.info("[TTS] generated %s in %dms", out_path.name, int((time.time() - start) * 1000))
-        except Exception as e:  # noqa: BLE001
-            logger.error("[TTS] failed: %s", e, exc_info=True)
-            raise HTTPException(status_code=500, detail=f"TTS failed: {e}")
-    else:
-        logger.info("[TTS] cache hit %s", out_path.name)
-
-    return Response(content=out_path.read_bytes(), media_type="audio/wav")
-
-
-if __name__ == "__main__":
-    import uvicorn
-    uvicorn.run(app, host="127.0.0.1", port=PORT, log_level="info")
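
Review note: deleting `app.py` also deletes `prepare_text_for_tts`, which stripped markdown before synthesis. If the replacement path does not already do this, the same cleanup could be applied before text is sent to the speech endpoint. The sketch below ports the removed regexes one-for-one to TypeScript. The name `prepareTextForTts` is hypothetical, and nothing here is taken from the new proxy code.

```typescript
// Port of the removed Python helper: strip or transform markdown for natural speech.
// The order of the replacements matches the original function.
function prepareTextForTts(text: string): string {
  return text
    .replace(/`{3}[\s\S]*?`{3}/g, " code block ") // code fences -> spoken stub
    .replace(/`([^`]+)`/g, "$1") // unwrap inline code
    .replace(/\*\*([^*]+)\*\*/g, "$1") // bold
    .replace(/\*([^*]+)\*/g, "$1") // italic
    .replace(/\[([^\]]+)\]\([^)]+\)/g, "$1") // links -> link text
    .replace(/^#{1,6}\s+/gm, "") // headers
    .replace(/\s+/g, " ") // collapse whitespace
    .trim();
}
```

For example, `prepareTextForTts("# Title\n**bold** and [docs](https://example.com)")` returns `"Title bold and docs"`.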
--- a/voice-sidecar/requirements.txt
+++ b/voice-sidecar/requirements.txt
@@ -1,9 +0,0 @@
-# CloudCLI voice sidecar — STT (faster-whisper) + TTS (Kokoro-82M)
-fastapi>=0.110.0
-uvicorn[standard]>=0.27.0
-python-multipart>=0.0.9
-faster-whisper>=1.0.0
-kokoro>=0.9.4
-misaki[en]>=0.9.4
-soundfile>=0.12.1
-numpy>=1.26.0
--- a/voice-sidecar/test_smoke.py
+++ b/voice-sidecar/test_smoke.py
@@ -1,29 +0,0 @@
-"""Smoke test: Kokoro TTS -> faster-whisper STT round-trip."""
-import time
-import numpy as np
-import soundfile as sf
-
-PHRASE = "Hello, this is a test of the CloudCLI voice sidecar."
-
-print("[1/3] Loading Kokoro pipeline...")
-t = time.time()
-from kokoro import KPipeline
-pipe = KPipeline(lang_code="a")
-print(f"      loaded in {time.time()-t:.1f}s")
-
-print("[2/3] Synthesizing...")
-t = time.time()
-chunks = [audio for _gs, _ps, audio in pipe(PHRASE, voice="af_heart")]
-full = np.concatenate([np.asarray(c, dtype=np.float32) for c in chunks])
-sf.write("test.wav", full, 24000)
-dur = len(full) / 24000
-print(f"      synth {time.time()-t:.1f}s -> test.wav ({dur:.1f}s audio, {len(full)} samples)")
-
-print("[3/3] Transcribing back with faster-whisper (base, cpu int8)...")
-t = time.time()
-from faster_whisper import WhisperModel
-model = WhisperModel("base", device="cpu", compute_type="int8")
-segments, _info = model.transcribe("test.wav", beam_size=5)
-text = "".join(s.text for s in segments).strip()
-print(f"      transcribe {time.time()-t:.1f}s -> {text!r}")
-print("\nROUND-TRIP OK" if text else "\nROUND-TRIP PRODUCED NO TEXT")