diff --git a/.gitignore b/.gitignore index 8b7815d7..e6b7985b 100755 --- a/.gitignore +++ b/.gitignore @@ -142,10 +142,3 @@ tasks/ # Git worktrees .worktrees/ - -# Voice sidecar (Python) — generated, machine-specific, not committed -voice-sidecar/.venv/ -voice-sidecar/voice_messages/ -voice-sidecar/**/__pycache__/ -*.pyc -*.wav diff --git a/docs/voice.md b/docs/voice.md index b8e5baec..71f81b84 100644 --- a/docs/voice.md +++ b/docs/voice.md @@ -1,57 +1,51 @@ # Voice (optional) -Adds two opt-in voice features to the chat: +Two opt-in voice features in the chat: -- **Push-to-talk dictation** — a mic button in the composer records your voice, transcribes it - (speech-to-text), and drops the text into the input. -- **Read-aloud** — a speaker button on each assistant message plays it back (text-to-speech). +- **Push-to-talk dictation** — a mic button in the composer records, transcribes, and fills the input. +- **Read-aloud** — a speaker button on each assistant message plays it back. -Voice is **disabled by default**. The UI only appears when a voice backend is configured, so it has -zero impact on installs that don't use it. +Voice is **off by default**. Turn it on with the **Voice** toggle in Quick Settings or in +**Settings → Voice**. When off, the mic and speaker controls are hidden. -## Enable it +## Backend -Set `VOICE_SIDECAR_URL` for the server to point at a voice backend, then restart: +Voice uses any **OpenAI-compatible audio backend**, configured in **Settings → Voice**: -```bash -VOICE_SIDECAR_URL=http://127.0.0.1:8765 npm run server -``` - -When set, `GET /api/voice/health` reports `{ "enabled": true }` and the mic + speaker controls appear. -All voice requests are proxied through the app's authenticated `/api/voice/*` routes, so the backend -itself only needs to listen on localhost and is never exposed directly. 
- -## Backend contract - -`VOICE_SIDECAR_URL` can point at **any** service that implements two endpoints: - -| Method & path | Request | Response | +| Field | Example | Notes | |---|---|---| -| `POST /transcribe` | multipart, field `audio` (webm/mp4/wav/…) | `{ "text": "..." }` | -| `POST /tts` | form field `text` | audio bytes (`audio/*`, e.g. wav/mp3) | +| Base URL | `https://api.openai.com/v1` | OpenAI, Groq, or a local server | +| API key | `sk-…` | sent only to this app's backend, which proxies the request | +| Speech-to-text model | `whisper-1`, `gpt-4o-transcribe`, `whisper-large-v3-turbo` | | +| Text-to-speech model | `tts-1`, `gpt-4o-mini-tts`, `kokoro` | | +| Voice | `alloy`, `af_heart`, … | depends on the backend | -This keeps the feature provider-agnostic — you can back it with the bundled local sidecar, or a cloud -transcription + TTS gateway, as long as it speaks that contract. +The backend must expose the standard endpoints: -## Reference backend: `voice-sidecar/` - -A local, no-API-key reference implementation using **faster-whisper** (STT) and **Kokoro-82M** (TTS), -both CPU-capable. - -```bash -cd voice-sidecar -python -m venv .venv && . .venv/bin/activate # (Windows: .venv\Scripts\activate) -pip install -r requirements.txt -python -m uvicorn app:app --host 127.0.0.1 --port 8765 +``` +POST {baseUrl}/audio/transcriptions (multipart 'file' + 'model') -> { "text": "..." } +POST {baseUrl}/audio/speech ({ model, voice, input }) -> audio bytes ``` -Then run the app with `VOICE_SIDECAR_URL=http://127.0.0.1:8765`. +That covers OpenAI and Groq, plus local servers like **LocalAI**, **Speaches**, **Kokoro-FastAPI**, +and **openedai-speech**. Requests are proxied through the app's authenticated `/api/voice/*` routes, +so a local backend only needs to listen on localhost. -Config (env, all optional) — see `voice-sidecar/.env.example`: `WHISPER_MODEL_SIZE`, `WHISPER_DEVICE` -(`cpu`/`cuda`), `KOKORO_VOICE`, `VOICE_PORT`. 
+### Server-side defaults (optional) + +Instead of (or as defaults behind) the Settings fields, you can set env vars on the server: + +``` +VOICE_API_BASE_URL=http://127.0.0.1:8765/v1 +VOICE_API_KEY=... +VOICE_STT_MODEL=whisper-1 +VOICE_TTS_MODEL=tts-1 +VOICE_TTS_VOICE=alloy +``` + +Per-user Settings values override these. If neither is set, the voice routes return 503. ## Notes -- The first read-aloud is slow (~10–20s) while the model lazy-loads; it's near-instant and cached after. - Recording needs a secure context (HTTPS or localhost) for microphone access. -- On iOS, playback is tap-initiated (manual read-aloud) to satisfy Safari's autoplay policy. +- On iOS, read-aloud is tap-initiated to satisfy Safari's autoplay policy. diff --git a/server/voice-proxy.js b/server/voice-proxy.js index 3bdb748a..770a91de 100644 --- a/server/voice-proxy.js +++ b/server/voice-proxy.js @@ -1,48 +1,71 @@ -// Optional voice proxy — forwards speech-to-text / text-to-speech to a configurable backend. +// Optional voice proxy — forwards STT/TTS to an OpenAI-compatible audio backend. // -// Opt-in: voice is DISABLED unless VOICE_SIDECAR_URL is set. When set, it must point at a -// backend (any implementation) exposing: -// POST /transcribe (multipart field 'audio') -> { text } -// POST /tts (form field 'text') -> audio bytes (audio/*) -// A reference backend (local faster-whisper + Kokoro) ships in /voice-sidecar, but any -// service implementing the two endpoints works (e.g. a cloud transcription + TTS gateway). +// The backend is whatever the user points at: OpenAI, Groq, or a local server +// (LocalAI / Speaches / Kokoro-FastAPI / openedai-speech / etc.). It must expose the +// standard OpenAI audio endpoints: +// POST {base}/audio/transcriptions (multipart 'file' + 'model') -> { text } +// POST {base}/audio/speech ({ model, voice, input }) -> audio bytes // -// Mounted at /api/voice behind authenticateToken, so it inherits the app's auth. 
The backend -// should bind to localhost and is never exposed directly. +// Config is resolved per-request from headers (set by the client's voice settings), +// falling back to server env defaults. Mounted at /api/voice behind authenticateToken. import express from 'express'; -const VOICE_SIDECAR_URL = (process.env.VOICE_SIDECAR_URL || '').replace(/\/$/, ''); -const VOICE_ENABLED = Boolean(VOICE_SIDECAR_URL); +const ENV = { + baseUrl: (process.env.VOICE_API_BASE_URL || '').replace(/\/$/, ''), + apiKey: process.env.VOICE_API_KEY || '', + sttModel: process.env.VOICE_STT_MODEL || 'whisper-1', + ttsModel: process.env.VOICE_TTS_MODEL || 'tts-1', + ttsVoice: process.env.VOICE_TTS_VOICE || 'alloy', + ttsFormat: process.env.VOICE_TTS_FORMAT || 'mp3', +}; + +// Per-request config: client headers (from the user's voice settings) override env defaults. +function resolveConfig(req) { + const h = req.headers; + return { + baseUrl: (String(h['x-voice-base-url'] || '') || ENV.baseUrl).replace(/\/$/, ''), + apiKey: String(h['x-voice-api-key'] || '') || ENV.apiKey, + sttModel: String(h['x-voice-stt-model'] || '') || ENV.sttModel, + ttsModel: String(h['x-voice-tts-model'] || '') || ENV.ttsModel, + ttsVoice: String(h['x-voice-tts-voice'] || '') || ENV.ttsVoice, + }; +} const router = express.Router(); -// Lazy multer (memory storage) for the audio upload — matches index.js's pattern. 
+const VOICE_TIMEOUT_MS = Number(process.env.VOICE_TIMEOUT_MS || 60000); +async function fetchWithTimeout(url, options = {}) { + const controller = new AbortController(); + const timer = setTimeout(() => controller.abort(), VOICE_TIMEOUT_MS); + try { + return await fetch(url, { ...options, signal: controller.signal }); + } finally { + clearTimeout(timer); + } +} + let _upload = null; async function getUpload() { if (!_upload) { const multer = (await import('multer')).default; - _upload = multer({ - storage: multer.memoryStorage(), - limits: { fileSize: 25 * 1024 * 1024 }, // 25MB — short dictation clips - }); + _upload = multer({ storage: multer.memoryStorage(), limits: { fileSize: 25 * 1024 * 1024 } }); } return _upload; } -function ensureEnabled(res) { - if (!VOICE_ENABLED) { - res.status(503).json({ error: 'Voice is not configured. Set VOICE_SIDECAR_URL to enable it.' }); - return false; - } - return true; +function authHeader(apiKey) { + return apiKey ? { Authorization: `Bearer ${apiKey}` } : {}; } -// GET /api/voice/health -> { enabled } (frontend hides the voice UI when disabled) -router.get('/health', (_req, res) => res.json({ enabled: VOICE_ENABLED })); +// GET /api/voice/health -> { configured } (true if a base URL is available) +router.get('/health', (req, res) => { + res.json({ configured: Boolean(resolveConfig(req).baseUrl) }); +}); // POST /api/voice/transcribe (multipart 'audio') -> { text } router.post('/transcribe', async (req, res) => { - if (!ensureEnabled(res)) return; + const cfg = resolveConfig(req); + if (!cfg.baseUrl) return res.status(503).json({ error: 'No voice backend configured' }); const upload = await getUpload(); upload.single('audio')(req, res, async (err) => { if (err) return res.status(400).json({ error: err.message }); @@ -50,13 +73,21 @@ router.post('/transcribe', async (req, res) => { try { const fd = new FormData(); fd.append( - 'audio', + 'file', new Blob([req.file.buffer], { type: req.file.mimetype || 'audio/webm' }), 
req.file.originalname || 'recording.webm', ); - const r = await fetch(`${VOICE_SIDECAR_URL}/transcribe`, { method: 'POST', body: fd }); - const data = await r.json().catch(() => ({ error: 'bad voice backend response' })); - res.status(r.status).json(data); + fd.append('model', cfg.sttModel); + const r = await fetchWithTimeout(`${cfg.baseUrl}/audio/transcriptions`, { + method: 'POST', + headers: authHeader(cfg.apiKey), + body: fd, + }); + const text = await r.text(); + if (!r.ok) return res.status(r.status).json({ error: text || 'transcription failed' }); + let data; + try { data = JSON.parse(text); } catch { data = { text }; } + res.json({ text: data.text ?? '' }); } catch (e) { res.status(502).json({ error: `voice backend unreachable: ${e.message}` }); } @@ -65,18 +96,26 @@ router.post('/transcribe', async (req, res) => { // POST /api/voice/tts { text } -> audio bytes router.post('/tts', async (req, res) => { - if (!ensureEnabled(res)) return; + const cfg = resolveConfig(req); + if (!cfg.baseUrl) return res.status(503).json({ error: 'No voice backend configured' }); const text = req.body?.text; if (!text || !text.trim()) return res.status(400).json({ error: 'text required' }); try { - const fd = new FormData(); - fd.append('text', text); - const r = await fetch(`${VOICE_SIDECAR_URL}/tts`, { method: 'POST', body: fd }); + const r = await fetchWithTimeout(`${cfg.baseUrl}/audio/speech`, { + method: 'POST', + headers: { 'Content-Type': 'application/json', ...authHeader(cfg.apiKey) }, + body: JSON.stringify({ + model: cfg.ttsModel, + voice: cfg.ttsVoice, + input: text, + response_format: ENV.ttsFormat, + }), + }); if (!r.ok) { const errText = await r.text().catch(() => 'tts failed'); return res.status(r.status).json({ error: errText }); } - res.setHeader('Content-Type', r.headers.get('content-type') || 'audio/wav'); + res.setHeader('Content-Type', r.headers.get('content-type') || 'audio/mpeg'); res.setHeader('Cache-Control', 'no-store'); res.send(Buffer.from(await 
r.arrayBuffer())); } catch (e) { diff --git a/src/components/chat/hooks/useTts.ts b/src/components/chat/hooks/useTts.ts index 46ab0f27..4ceb3887 100644 --- a/src/components/chat/hooks/useTts.ts +++ b/src/components/chat/hooks/useTts.ts @@ -1,5 +1,6 @@ import { useCallback, useEffect, useRef, useState } from 'react'; import { authenticatedFetch } from '../../../utils/api'; +import { voiceConfigHeaders } from '../../../hooks/useVoiceConfig'; // Only one message speaks at a time across the whole app. let stopActive: (() => void) | null = null; @@ -36,8 +37,14 @@ export function useTts(getText: () => string) { if (stopActive) stopActive = null; }, [reset]); - // Cleanup on unmount. - useEffect(() => () => reset(), [reset]); + // Cleanup on unmount: drop the global stop handler if it points at us, then reset. + useEffect( + () => () => { + if (stopActive === stop) stopActive = null; + reset(); + }, + [reset, stop], + ); const play = useCallback(async () => { if (stopActive) stopActive(); @@ -63,12 +70,16 @@ export function useTts(getText: () => string) { const res = await authenticatedFetch('/api/voice/tts', { method: 'POST', body: JSON.stringify({ text }), + headers: voiceConfigHeaders(), }); if (!res.ok) throw new Error(`tts ${res.status}`); const blob = await res.blob(); const url = URL.createObjectURL(blob); + if (audioRef.current !== audio) { + URL.revokeObjectURL(url); // stopped while loading; don't leak the blob URL + return; + } urlRef.current = url; - if (audioRef.current !== audio) return; // stopped while loading audio.src = url; audio.load(); await audio.play(); diff --git a/src/components/chat/hooks/useVoiceAvailable.ts b/src/components/chat/hooks/useVoiceAvailable.ts index 463e4ff3..388411f0 100644 --- a/src/components/chat/hooks/useVoiceAvailable.ts +++ b/src/components/chat/hooks/useVoiceAvailable.ts @@ -1,38 +1,37 @@ import { useEffect, useState } from 'react'; -import { authenticatedFetch } from '../../../utils/api'; -// Whether the optional voice 
feature is configured on the server (VOICE_SIDECAR_URL set). -// Probed once and cached app-wide so the mic/speak controls can hide themselves when off. -let cached: boolean | null = null; -let inflight: Promise<boolean> | null = null; +// Voice UI is gated on the `voiceEnabled` UI preference (toggled in Quick Settings / +// the Settings modal). This is a lightweight read-only view of that preference so the +// mic/speak controls can hide themselves, kept in sync via the same events +// useUiPreferences emits. No server probe. +const STORAGE_KEY = 'uiPreferences'; +const SYNC_EVENT = 'ui-preferences:sync'; -function probe(): Promise<boolean> { - if (cached !== null) return Promise.resolve(cached); - if (!inflight) { - inflight = authenticatedFetch('/api/voice/health') - .then((r) => (r.ok ? r.json() : { enabled: false })) - .then((d) => { - cached = Boolean(d?.enabled); - return cached; - }) - .catch(() => { - cached = false; - return false; - }); +function readVoiceEnabled(): boolean { + try { + const raw = localStorage.getItem(STORAGE_KEY); + if (!raw) return false; + const parsed = JSON.parse(raw); + return parsed?.voiceEnabled === true || parsed?.voiceEnabled === 'true'; + } catch { + return false; + } } - return inflight; } export function useVoiceAvailable(): boolean { - const [available, setAvailable] = useState(cached ?? false); + const [enabled, setEnabled] = useState(() => + typeof window === 'undefined' ? 
false : readVoiceEnabled(), + ); + useEffect(() => { - let mounted = true; - probe().then((v) => { - if (mounted) setAvailable(v); - }); + const update = () => setEnabled(readVoiceEnabled()); + window.addEventListener('storage', update); + window.addEventListener(SYNC_EVENT, update as EventListener); return () => { - mounted = false; + window.removeEventListener('storage', update); + window.removeEventListener(SYNC_EVENT, update as EventListener); }; }, []); - return available; + + return enabled; } diff --git a/src/components/chat/hooks/useVoiceInput.ts b/src/components/chat/hooks/useVoiceInput.ts index bc83a803..ccf0ed53 100644 --- a/src/components/chat/hooks/useVoiceInput.ts +++ b/src/components/chat/hooks/useVoiceInput.ts @@ -1,5 +1,6 @@ -import { useCallback, useRef, useState } from 'react'; +import { useCallback, useEffect, useRef, useState } from 'react'; import { authenticatedFetch } from '../../../utils/api'; +import { voiceConfigHeaders } from '../../../hooks/useVoiceConfig'; // Mobile-safe recording: iOS Safari 18.4+ supports webm/opus; older iOS needs mp4. const MIME_CANDIDATES = [ @@ -39,6 +40,15 @@ export function useVoiceInput(onTranscript: (text: string) => void, onError?: (m streamRef.current = null; }; + // Stop the mic if the component unmounts mid-recording. + useEffect(() => { + return () => { + streamRef.current?.getTracks().forEach((t) => t.stop()); + streamRef.current = null; + recorderRef.current = null; + }; + }, []); + const start = useCallback(async () => { try { const stream = await navigator.mediaDevices.getUserMedia({ @@ -68,7 +78,11 @@ export function useVoiceInput(onTranscript: (text: string) => void, onError?: (m const ext = type.includes('mp4') ? 'm4a' : type.includes('ogg') ? 
'ogg' : 'webm'; const fd = new FormData(); fd.append('audio', blob, `recording.${ext}`); - const res = await authenticatedFetch('/api/voice/transcribe', { method: 'POST', body: fd }); + const res = await authenticatedFetch('/api/voice/transcribe', { + method: 'POST', + body: fd, + headers: voiceConfigHeaders(), + }); if (!res.ok) throw new Error(`transcribe ${res.status}`); const data = await res.json(); const text = String(data?.text || '').trim(); diff --git a/src/components/chat/view/ChatInterface.tsx b/src/components/chat/view/ChatInterface.tsx index df2bcd88..18996b71 100644 --- a/src/components/chat/view/ChatInterface.tsx +++ b/src/components/chat/view/ChatInterface.tsx @@ -404,7 +404,7 @@ function ChatInterface({ renderInputWithMentions={renderInputWithMentions} textareaRef={textareaRef} input={input} - onVoiceTranscript={(text) => setInput(input ? `${input} ${text}` : text)} + onVoiceTranscript={(text) => setInput(input.trim() ? `${input.trim()} ${text}` : text)} onInputChange={handleInputChange} onTextareaClick={handleTextareaClick} onTextareaKeyDown={handleKeyDown} diff --git a/src/components/chat/view/subcomponents/VoiceInputButton.tsx b/src/components/chat/view/subcomponents/VoiceInputButton.tsx index aeb3585f..6a6304e1 100644 --- a/src/components/chat/view/subcomponents/VoiceInputButton.tsx +++ b/src/components/chat/view/subcomponents/VoiceInputButton.tsx @@ -1,3 +1,4 @@ +import { useEffect, useRef, useState } from 'react'; import { Mic, Square, Loader2 } from 'lucide-react'; import { useTranslation } from 'react-i18next'; import { useVoiceInput } from '../../hooks/useVoiceInput'; @@ -10,10 +11,25 @@ type Props = { }; // Push-to-talk mic button. Renders nothing unless the optional voice feature is enabled. +// Surfaces transcription errors itself (transiently) so they aren't silently swallowed. 
export default function VoiceInputButton({ onTranscript, onError }: Props) { const { t } = useTranslation('chat'); const available = useVoiceAvailable(); - const { state, toggle } = useVoiceInput(onTranscript, onError); + const [errorMsg, setErrorMsg] = useState(null); + const errorTimer = useRef | null>(null); + + const handleError = (msg: string) => { + onError?.(msg); + setErrorMsg(msg); + if (errorTimer.current) clearTimeout(errorTimer.current); + errorTimer.current = setTimeout(() => setErrorMsg(null), 4000); + }; + + const { state, toggle } = useVoiceInput(onTranscript, handleError); + + useEffect(() => () => { + if (errorTimer.current) clearTimeout(errorTimer.current); + }, []); if (!available) return null; @@ -27,14 +43,21 @@ export default function VoiceInputButton({ onTranscript, onError }: Props) { ); return ( - void }) => { - e.preventDefault(); - toggle(); - }} - > - {icon} - + + {errorMsg && ( + + {errorMsg} + + )} + void }) => { + e.preventDefault(); + toggle(); + }} + > + {icon} + + ); } diff --git a/src/components/quick-settings-panel/constants.ts b/src/components/quick-settings-panel/constants.ts index 15c15458..408a64c7 100644 --- a/src/components/quick-settings-panel/constants.ts +++ b/src/components/quick-settings-panel/constants.ts @@ -4,6 +4,7 @@ import { Eye, Languages, Maximize2, + Mic, } from 'lucide-react'; import type { PreferenceToggleItem } from './types'; @@ -54,4 +55,9 @@ export const INPUT_SETTING_TOGGLES: PreferenceToggleItem[] = [ labelKey: 'quickSettings.sendByCtrlEnter', icon: Languages, }, + { + key: 'voiceEnabled', + labelKey: 'quickSettings.voiceEnabled', + icon: Mic, + }, ]; diff --git a/src/components/quick-settings-panel/types.ts b/src/components/quick-settings-panel/types.ts index 16002694..8d4f0826 100644 --- a/src/components/quick-settings-panel/types.ts +++ b/src/components/quick-settings-panel/types.ts @@ -6,7 +6,8 @@ export type PreferenceToggleKey = | 'showRawParameters' | 'showThinking' | 'autoScrollToBottom' - | 
'sendByCtrlEnter'; + | 'sendByCtrlEnter' + | 'voiceEnabled'; export type QuickSettingsPreferences = Record; diff --git a/src/components/settings/types/types.ts b/src/components/settings/types/types.ts index 8fe3b7ff..bab8a430 100644 --- a/src/components/settings/types/types.ts +++ b/src/components/settings/types/types.ts @@ -2,7 +2,7 @@ import type { Dispatch, SetStateAction } from 'react'; import type { LLMProvider } from '../../../types/app'; import type { ProviderAuthStatus } from '../../provider-auth/types'; -export type SettingsMainTab = 'agents' | 'appearance' | 'git' | 'api' | 'tasks' | 'notifications' | 'plugins' | 'about'; +export type SettingsMainTab = 'agents' | 'appearance' | 'git' | 'api' | 'voice' | 'tasks' | 'notifications' | 'plugins' | 'about'; export type AgentProvider = LLMProvider; export type AgentCategory = 'account' | 'permissions' | 'mcp'; export type ProjectSortOrder = 'name' | 'date'; diff --git a/src/components/settings/view/Settings.tsx b/src/components/settings/view/Settings.tsx index 8340a547..0b591402 100644 --- a/src/components/settings/view/Settings.tsx +++ b/src/components/settings/view/Settings.tsx @@ -6,6 +6,7 @@ import SettingsSidebar from '../view/SettingsSidebar'; import AgentsSettingsTab from '../view/tabs/agents-settings/AgentsSettingsTab'; import AppearanceSettingsTab from '../view/tabs/AppearanceSettingsTab'; import CredentialsSettingsTab from '../view/tabs/api-settings/CredentialsSettingsTab'; +import VoiceSettingsTab from '../view/tabs/VoiceSettingsTab'; import GitSettingsTab from '../view/tabs/git-settings/GitSettingsTab'; import NotificationsSettingsTab from '../view/tabs/NotificationsSettingsTab'; import TasksSettingsTab from '../view/tabs/tasks-settings/TasksSettingsTab'; @@ -153,6 +154,8 @@ function Settings({ isOpen, onClose, projects = [], initialTab = 'agents' }: Set {activeTab === 'api' && } + {activeTab === 'voice' && } + {activeTab === 'plugins' && } {activeTab === 'about' && } diff --git 
a/src/components/settings/view/SettingsSidebar.tsx b/src/components/settings/view/SettingsSidebar.tsx index 149c1492..194ccc98 100644 --- a/src/components/settings/view/SettingsSidebar.tsx +++ b/src/components/settings/view/SettingsSidebar.tsx @@ -1,4 +1,4 @@ -import { Bell, Bot, GitBranch, Info, Key, ListChecks, Palette, Puzzle } from 'lucide-react'; +import { Bell, Bot, GitBranch, Info, Key, ListChecks, Mic, Palette, Puzzle } from 'lucide-react'; import { useTranslation } from 'react-i18next'; import { cn } from '../../../lib/utils'; import { PillBar, Pill } from '../../../shared/view/ui'; @@ -20,6 +20,7 @@ const NAV_ITEMS: NavItem[] = [ { id: 'appearance', labelKey: 'mainTabs.appearance', icon: Palette }, { id: 'git', labelKey: 'mainTabs.git', icon: GitBranch }, { id: 'api', labelKey: 'mainTabs.apiTokens', icon: Key }, + { id: 'voice', labelKey: 'mainTabs.voice', icon: Mic }, { id: 'tasks', labelKey: 'mainTabs.tasks', icon: ListChecks }, { id: 'plugins', labelKey: 'mainTabs.plugins', icon: Puzzle }, { id: 'notifications', labelKey: 'mainTabs.notifications', icon: Bell }, diff --git a/src/components/settings/view/tabs/VoiceSettingsTab.tsx b/src/components/settings/view/tabs/VoiceSettingsTab.tsx new file mode 100644 index 00000000..3de61fba --- /dev/null +++ b/src/components/settings/view/tabs/VoiceSettingsTab.tsx @@ -0,0 +1,82 @@ +import type { InputHTMLAttributes } from 'react'; +import { useTranslation } from 'react-i18next'; +import SettingsSection from '../SettingsSection'; +import SettingsToggle from '../SettingsToggle'; +import { useUiPreferences } from '../../../../hooks/useUiPreferences'; +import { useVoiceConfig } from '../../../../hooks/useVoiceConfig'; + +const inputClass = + 'w-full rounded-md border border-border bg-background px-3 py-2 text-sm text-foreground placeholder:text-muted-foreground focus:outline-none focus:ring-2 focus:ring-ring'; + +function Field({ label, ...props }: { label: string } & InputHTMLAttributes) { + return ( + + ); +} + 
+export default function VoiceSettingsTab() { + const { t } = useTranslation('settings'); + const { preferences, setPreference } = useUiPreferences(); + const { config, update } = useVoiceConfig(); + + return ( +
+ +
+
+
{t('voiceSettings.enable')}
+
{t('voiceSettings.enableDescription')}
+
+ setPreference('voiceEnabled', v)} + ariaLabel={t('voiceSettings.enable')} + /> +
+
+ + +
+ update({ baseUrl: e.target.value })} + /> + update({ apiKey: e.target.value })} + /> +
+ update({ sttModel: e.target.value })} + /> + update({ ttsModel: e.target.value })} + /> + update({ ttsVoice: e.target.value })} + /> +
+

{t('voiceSettings.note')}

+
+
+
+ ); +} diff --git a/src/hooks/useUiPreferences.ts b/src/hooks/useUiPreferences.ts index eb0b8339..342f1698 100644 --- a/src/hooks/useUiPreferences.ts +++ b/src/hooks/useUiPreferences.ts @@ -7,6 +7,7 @@ type UiPreferences = { autoScrollToBottom: boolean; sendByCtrlEnter: boolean; sidebarVisible: boolean; + voiceEnabled: boolean; }; type UiPreferenceKey = keyof UiPreferences; @@ -39,6 +40,7 @@ const DEFAULTS: UiPreferences = { autoScrollToBottom: true, sendByCtrlEnter: false, sidebarVisible: true, + voiceEnabled: false, }; const PREFERENCE_KEYS = Object.keys(DEFAULTS) as UiPreferenceKey[]; diff --git a/src/hooks/useVoiceConfig.ts b/src/hooks/useVoiceConfig.ts new file mode 100644 index 00000000..fa170bca --- /dev/null +++ b/src/hooks/useVoiceConfig.ts @@ -0,0 +1,57 @@ +import { useState } from 'react'; + +export type VoiceConfig = { + baseUrl: string; + apiKey: string; + sttModel: string; + ttsModel: string; + ttsVoice: string; +}; + +const STORAGE_KEY = 'voiceConfig'; +const DEFAULTS: VoiceConfig = { baseUrl: '', apiKey: '', sttModel: '', ttsModel: '', ttsVoice: '' }; + +function read(): VoiceConfig { + try { + const raw = localStorage.getItem(STORAGE_KEY); + if (!raw) return { ...DEFAULTS }; + const parsed = JSON.parse(raw); + return { ...DEFAULTS, ...(parsed && typeof parsed === 'object' ? parsed : {}) }; + } catch { + return { ...DEFAULTS }; + } +} + +// Headers the voice proxy reads to target a per-user OpenAI-compatible backend. +// Empty fields are omitted so the server's env defaults apply. 
+export function voiceConfigHeaders(): Record<string, string> { + if (typeof window === 'undefined') return {}; + const c = read(); + const h: Record<string, string> = {}; + if (c.baseUrl) h['x-voice-base-url'] = c.baseUrl; + if (c.apiKey) h['x-voice-api-key'] = c.apiKey; + if (c.sttModel) h['x-voice-stt-model'] = c.sttModel; + if (c.ttsModel) h['x-voice-tts-model'] = c.ttsModel; + if (c.ttsVoice) h['x-voice-tts-voice'] = c.ttsVoice; + return h; +} + +export function useVoiceConfig() { + const [config, setConfig] = useState(() => + typeof window === 'undefined' ? { ...DEFAULTS } : read(), + ); + + const update = (patch: Partial<VoiceConfig>) => { + setConfig((prev) => { + const next = { ...prev, ...patch }; + try { + localStorage.setItem(STORAGE_KEY, JSON.stringify(next)); + } catch { + /* ignore persistence errors */ + } + return next; + }); + }; + + return { config, update }; +} diff --git a/src/i18n/locales/en/settings.json b/src/i18n/locales/en/settings.json index b80d17d2..d95151df 100644 --- a/src/i18n/locales/en/settings.json +++ b/src/i18n/locales/en/settings.json @@ -49,6 +49,20 @@ "resetToDefaults": "Reset to Defaults", "cancelChanges": "Cancel Changes" }, + "voiceSettings": { + "title": "Voice", + "description": "Speech-to-text input and read-aloud, via an OpenAI-compatible audio backend.", + "enable": "Enable voice", + "enableDescription": "Show the mic button and the read-aloud button on messages.", + "backendTitle": "Backend", + "backendDescription": "Point at OpenAI, Groq, or a local server (LocalAI, Speaches, Kokoro-FastAPI). Leave blank to use the server default.", + "baseUrl": "Base URL", + "apiKey": "API key", + "sttModel": "Speech-to-text model", + "ttsModel": "Text-to-speech model", + "voice": "Voice", + "note": "The shown defaults work with OpenAI once you add a key. For other providers, set the base URL and model names to match." 
+ }, "quickSettings": { "title": "Quick Settings", "sections": { @@ -63,6 +77,7 @@ "showThinking": "Show thinking", "autoScrollToBottom": "Auto-scroll to bottom", "sendByCtrlEnter": "Send by Ctrl+Enter", + "voiceEnabled": "Voice (mic + read aloud)", "sendByCtrlEnterDescription": "When enabled, pressing Ctrl+Enter will send the message instead of just Enter. This is useful for IME users to avoid accidental sends.", "dragHandle": { "dragging": "Dragging handle", @@ -93,6 +108,7 @@ "appearance": "Appearance", "git": "Git", "apiTokens": "API & Tokens", + "voice": "Voice", "tasks": "Tasks", "notifications": "Notifications", "plugins": "Plugins", diff --git a/voice-sidecar/.env.example b/voice-sidecar/.env.example deleted file mode 100644 index 92842059..00000000 --- a/voice-sidecar/.env.example +++ /dev/null @@ -1,14 +0,0 @@ -# Voice sidecar config (all optional — these are the defaults). -# The sidecar binds 127.0.0.1 only; CloudCLI's Express proxy reaches it. - -# Port the sidecar listens on (CloudCLI reaches it via VOICE_SIDECAR_URL). -VOICE_PORT=8765 - -# faster-whisper model size: tiny | base | small | medium | large-v3 -WHISPER_MODEL_SIZE=base -# cpu (int8, default) or cuda (float16, needs a CUDA torch in the venv) -WHISPER_DEVICE=cpu - -# Kokoro voice (see https://github.com/hexgrad/kokoro for the full list) and language code. -KOKORO_VOICE=af_heart -KOKORO_LANG=a diff --git a/voice-sidecar/app.py b/voice-sidecar/app.py deleted file mode 100644 index 518f83bf..00000000 --- a/voice-sidecar/app.py +++ /dev/null @@ -1,187 +0,0 @@ -""" -CloudCLI voice sidecar — local STT (faster-whisper) + local TTS (Kokoro-82M). - -Ported from the tooler voice endpoints (D:\\tooler\\backend\\server.py), swapping -edge-tts -> Kokoro. Bound to 127.0.0.1 only; CloudCLI's Express server proxies to -it behind JWT auth. Never exposed to the tailnet directly. 
-
-Endpoints:
-  GET /health -> {status, whisper_loaded, kokoro_loaded}
-  POST /transcribe (multipart 'audio') -> {text, duration_ms}
-  POST /tts (form 'text') -> audio/wav bytes (cached)
-"""
-import asyncio
-import hashlib
-import logging
-import os
-import re
-import tempfile
-import time
-from pathlib import Path
-
-import numpy as np
-import soundfile as sf
-from fastapi import FastAPI, File, Form, HTTPException, UploadFile
-from fastapi.responses import Response
-
-logging.basicConfig(level=logging.INFO)
-logger = logging.getLogger("voice-sidecar")
-
-# ---- Config (env-overridable) -------------------------------------------------
-PORT = int(os.getenv("VOICE_PORT", "8765"))
-WHISPER_MODEL_SIZE = os.getenv("WHISPER_MODEL_SIZE", "base")
-WHISPER_DEVICE = os.getenv("WHISPER_DEVICE", "cpu").lower()  # "cpu" | "cuda"
-KOKORO_VOICE = os.getenv("KOKORO_VOICE", "af_heart")
-KOKORO_LANG = os.getenv("KOKORO_LANG", "a")  # 'a' = American English
-KOKORO_SR = 24000
-
-VOICE_DIR = Path(__file__).parent / "voice_messages"
-VOICE_DIR.mkdir(exist_ok=True)
-
-# ---- Lazy model singletons ----------------------------------------------------
-_whisper = None
-_whisper_lock = asyncio.Lock()
-_kpipe = None
-_kpipe_lock = asyncio.Lock()
-
-
-async def get_whisper():
-    global _whisper
-    if _whisper is not None:
-        return _whisper
-    async with _whisper_lock:
-        if _whisper is not None:
-            return _whisper
-
-        def _load():
-            from faster_whisper import WhisperModel
-            if WHISPER_DEVICE == "cuda":
-                try:
-                    logger.info("[WHISPER] loading on CUDA (float16)...")
-                    return WhisperModel(WHISPER_MODEL_SIZE, device="cuda", compute_type="float16")
-                except Exception as e:  # noqa: BLE001
-                    logger.warning("[WHISPER] CUDA failed (%s), falling back to CPU", e)
-            logger.info("[WHISPER] loading '%s' on CPU (int8)", WHISPER_MODEL_SIZE)
-            return WhisperModel(WHISPER_MODEL_SIZE, device="cpu", compute_type="int8")
-
-        _whisper = await asyncio.get_event_loop().run_in_executor(None, _load)
-        logger.info("[WHISPER] ready")
-        return _whisper
-
-
-async def get_kokoro():
-    global _kpipe
-    if _kpipe is not None:
-        return _kpipe
-    async with _kpipe_lock:
-        if _kpipe is not None:
-            return _kpipe
-
-        def _load():
-            from kokoro import KPipeline
-            logger.info("[KOKORO] loading pipeline (lang=%s)...", KOKORO_LANG)
-            return KPipeline(lang_code=KOKORO_LANG)
-
-        _kpipe = await asyncio.get_event_loop().run_in_executor(None, _load)
-        logger.info("[KOKORO] ready")
-        return _kpipe
-
-
-# ---- Text cleaning (ported verbatim from tooler prepare_text_for_tts) ---------
-def prepare_text_for_tts(text: str) -> str:
-    """Strip/transform markdown for natural speech."""
-    text = re.sub(r"```[\s\S]*?```", " code block ", text)  # code fences -> spoken stub
-    text = re.sub(r"`([^`]+)`", r"\1", text)  # unwrap inline code
-    text = re.sub(r"\*\*([^*]+)\*\*", r"\1", text)  # bold
-    text = re.sub(r"\*([^*]+)\*", r"\1", text)  # italic
-    text = re.sub(r"\[([^\]]+)\]\([^)]+\)", r"\1", text)  # links -> link text
-    text = re.sub(r"^#{1,6}\s+", "", text, flags=re.MULTILINE)  # headers
-    text = re.sub(r"\s+", " ", text).strip()
-    return text
-
-
-# ---- App ----------------------------------------------------------------------
-app = FastAPI(title="CloudCLI voice sidecar")
-
-
-@app.get("/health")
-async def health():
-    return {
-        "status": "ok",
-        "whisper_loaded": _whisper is not None,
-        "kokoro_loaded": _kpipe is not None,
-    }
-
-
-@app.post("/transcribe")
-async def transcribe(audio: UploadFile = File(...)):
-    start = time.time()
-    suffix = Path(audio.filename or "rec.webm").suffix or ".webm"
-    content = await audio.read()
-    logger.info("[STT] %d bytes (%s)", len(content), audio.content_type)
-
-    tmp_path = None
-    try:
-        with tempfile.NamedTemporaryFile(delete=False, suffix=suffix) as tmp:
-            tmp.write(content)
-            tmp_path = tmp.name
-
-        model = await get_whisper()
-
-        def _run():
-            segments, _info = model.transcribe(tmp_path, beam_size=5)
-            return "".join(seg.text for seg in segments).strip()
-
-        text = await asyncio.get_event_loop().run_in_executor(None, _run)
-        duration_ms = int((time.time() - start) * 1000)
-        logger.info("[STT] %dms: %s", duration_ms, text[:100])
-        return {"text": text, "duration_ms": duration_ms}
-    except Exception as e:  # noqa: BLE001
-        logger.error("[STT] failed: %s", e, exc_info=True)
-        raise HTTPException(status_code=500, detail=f"Transcription failed: {e}")
-    finally:
-        if tmp_path and os.path.exists(tmp_path):
-            try:
-                os.unlink(tmp_path)
-            except OSError:
-                pass
-
-
-@app.post("/tts")
-async def tts(text: str = Form(...)):
-    if not text.strip():
-        raise HTTPException(status_code=400, detail="Text cannot be empty")
-    if len(text) > 8000:
-        raise HTTPException(status_code=400, detail="Text too long (max 8000 chars)")
-
-    start = time.time()
-    clean = prepare_text_for_tts(text)
-    # Cache on the RAW text hash (matches tooler) so identical messages reuse audio.
-    key = hashlib.sha256(text.encode()).hexdigest()[:16]
-    out_path = VOICE_DIR / f"{key}.wav"
-
-    if not out_path.exists():
-        try:
-            pipeline = await get_kokoro()
-
-            def _synth():
-                chunks = [audio for _gs, _ps, audio in pipeline(clean, voice=KOKORO_VOICE)]
-                if not chunks:
-                    raise RuntimeError("Kokoro produced no audio")
-                full = np.concatenate([np.asarray(c, dtype=np.float32) for c in chunks])
-                sf.write(str(out_path), full, KOKORO_SR)
-
-            await asyncio.get_event_loop().run_in_executor(None, _synth)
-            logger.info("[TTS] generated %s in %dms", out_path.name, int((time.time() - start) * 1000))
-        except Exception as e:  # noqa: BLE001
-            logger.error("[TTS] failed: %s", e, exc_info=True)
-            raise HTTPException(status_code=500, detail=f"TTS failed: {e}")
-    else:
-        logger.info("[TTS] cache hit %s", out_path.name)
-
-    return Response(content=out_path.read_bytes(), media_type="audio/wav")
-
-
-if __name__ == "__main__":
-    import uvicorn
-    uvicorn.run(app, host="127.0.0.1", port=PORT, log_level="info")
diff --git a/voice-sidecar/requirements.txt b/voice-sidecar/requirements.txt
deleted file mode 100644
index c37d56e9..00000000
--- a/voice-sidecar/requirements.txt
+++ /dev/null
@@ -1,9 +0,0 @@
-# CloudCLI voice sidecar — STT (faster-whisper) + TTS (Kokoro-82M)
-fastapi>=0.110.0
-uvicorn[standard]>=0.27.0
-python-multipart>=0.0.9
-faster-whisper>=1.0.0
-kokoro>=0.9.4
-misaki[en]>=0.9.4
-soundfile>=0.12.1
-numpy>=1.26.0
diff --git a/voice-sidecar/test_smoke.py b/voice-sidecar/test_smoke.py
deleted file mode 100644
index 224729fe..00000000
--- a/voice-sidecar/test_smoke.py
+++ /dev/null
@@ -1,29 +0,0 @@
-"""Smoke test: Kokoro TTS -> faster-whisper STT round-trip."""
-import time
-import numpy as np
-import soundfile as sf
-
-PHRASE = "Hello, this is a test of the CloudCLI voice sidecar."
-
-print("[1/3] Loading Kokoro pipeline...")
-t = time.time()
-from kokoro import KPipeline
-pipe = KPipeline(lang_code="a")
-print(f" loaded in {time.time()-t:.1f}s")
-
-print("[2/3] Synthesizing...")
-t = time.time()
-chunks = [audio for _gs, _ps, audio in pipe(PHRASE, voice="af_heart")]
-full = np.concatenate([np.asarray(c, dtype=np.float32) for c in chunks])
-sf.write("test.wav", full, 24000)
-dur = len(full) / 24000
-print(f" synth {time.time()-t:.1f}s -> test.wav ({dur:.1f}s audio, {len(full)} samples)")
-
-print("[3/3] Transcribing back with faster-whisper (base, cpu int8)...")
-t = time.time()
-from faster_whisper import WhisperModel
-model = WhisperModel("base", device="cpu", compute_type="int8")
-segments, _info = model.transcribe("test.wav", beam_size=5)
-text = "".join(s.text for s in segments).strip()
-print(f" transcribe {time.time()-t:.1f}s -> {text!r}")
-print("\nROUND-TRIP OK" if text else "\nROUND-TRIP PRODUCED NO TEXT")