Roadmap
0.0.1 — First release (done)
Default command, pipe mode, GPU, daemon mode.
- Default command: `voicsh` alone starts mic recording (no subcommand)
- Pipe mode: `cat file.wav | voicsh` → transcribe → stdout
- Auto-resample WAV to 16kHz mono (linear interpolation)
- GPU feature gates:
--features cuda,vulkan,hipblas - Logging cleanup: all output respects -v/-vv levels consistently
- Honest README matching actual features
- Unix socket IPC: `voicsh start` / `voicsh stop` / `voicsh toggle`
- Model stays in memory (~300MB for base.en)
- Systemd user service: `voicsh install-service`
- `voicsh status` shows daemon health
- Shell completions (bash/zsh/fish)
- `voicsh init` — auto-tune: benchmark hardware, recommend model, download
- `voicsh follow` — stream daemon events (meter, state, transcriptions)
- `voicsh config get/set/list/dump` — manage config without a text editor
- Hallucination filtering (configurable, language-specific)
- Fan-out mode — run English + multilingual models in parallel, pick best
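The auto-resample step can be sketched as a single linear-interpolation pass over the mono samples. This is an illustrative Rust sketch, assuming a simple function shape; it is not voicsh's actual resampler.

```rust
/// Resample mono PCM from `src_rate` to 16 kHz by linear interpolation.
/// Sketch only: the name and signature are assumptions, not voicsh's API.
fn resample_to_16k(samples: &[f32], src_rate: u32) -> Vec<f32> {
    const DST_RATE: u32 = 16_000;
    if src_rate == DST_RATE || samples.is_empty() {
        return samples.to_vec();
    }
    let ratio = src_rate as f64 / DST_RATE as f64;
    let out_len = (samples.len() as f64 / ratio).floor() as usize;
    let mut out = Vec::with_capacity(out_len);
    for i in 0..out_len {
        let pos = i as f64 * ratio;
        let idx = pos as usize;
        let frac = (pos - idx as f64) as f32;
        let a = samples[idx];
        // Clamp at the last sample so the final output point stays in range.
        let b = *samples.get(idx + 1).unwrap_or(&a);
        // Linear interpolation between the two nearest source samples.
        out.push(a + (b - a) * frac);
    }
    out
}
```

A 48 kHz input shrinks by a factor of three, so 48 input samples yield 16 output samples.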
0.1.0 — Usable voice typing (done)
Multi-language support, GNOME Shell extension, spoken punctuation, per-token confidence, hallucination filtering, and quantized models.
- Punctuation: “period”, “comma”, “question mark”, “exclamation mark”, etc.
- Whitespace: “new line”, “new paragraph”, “space”, “tab”
- Caps toggle: “all caps” / “end caps”
- Symbols: slash, ampersand, at-sign, dollar, hash, percent, asterisk, etc.
- Brackets: open/close paren, brace, bracket, angle bracket
- Key combos: “delete word” (Ctrl+Backspace), “backspace”
- Configurable vocabulary in config.toml (user overrides merge with built-ins)
- Rule-based (no LLM needed)
- Multi-language: en, de, es, fr, pt, it, nl, pl, ru, ja, zh, ko
- GNOME Shell extension: install, toggle, status indicator (`voicsh install-gnome-extension`)
- Language allowlist: filter transcriptions by allowed languages + confidence threshold
- GNOME extension: “Open Debug Log” menu item (launches `voicsh follow` in a terminal)
- GNOME extension: follow mode with live audio levels, recording state, transcriptions
- GNOME extension: language picker + model switcher via IPC (SetLanguage/SetModel)
- GNOME extension: language indicator in panel (two-letter code next to icon, configurable)
- Unified output: daemon verbose mode and `voicsh follow` share one renderer (DRY)
- Per-token confidence coloring: real decoder probabilities (green/default/yellow/red)
- Hallucination filter: 76+ phrases, CJK punctuation normalization, punctuation-only skip
- Quantized model support (q5_0, q5_1, q8_0 variants)
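The rule-based spoken-punctuation vocabulary amounts to a phrase table where user entries from config.toml win over the built-ins. A minimal sketch, assuming a trailing-phrase replacement strategy; the function names and the phrase subset shown are illustrative, not voicsh's real table.

```rust
use std::collections::HashMap;

/// Build the spoken-punctuation table. Sketch only: a few built-ins,
/// merged with user overrides (which take precedence, as in config.toml).
fn punctuation_vocab(user_overrides: &[(&str, &str)]) -> HashMap<String, String> {
    let builtins = [
        ("period", "."),
        ("comma", ","),
        ("question mark", "?"),
        ("exclamation mark", "!"),
        ("new line", "\n"),
        ("open paren", "("),
        ("close paren", ")"),
    ];
    let mut map: HashMap<String, String> = builtins
        .iter()
        .map(|(k, v)| (k.to_string(), v.to_string()))
        .collect();
    // User overrides from config merge on top of the built-ins.
    for (k, v) in user_overrides {
        map.insert(k.to_string(), v.to_string());
    }
    map
}

/// Replace a trailing spoken phrase ("hello world period" -> "hello world.").
fn apply_trailing(text: &str, vocab: &HashMap<String, String>) -> String {
    for (phrase, symbol) in vocab {
        if let Some(stem) = text.strip_suffix(&format!(" {phrase}")) {
            return format!("{stem}{symbol}");
        }
    }
    text.to_string()
}
```

No LLM is involved: lookup and substitution are pure string rules, which keeps the pipeline fast and deterministic.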
0.2.0 — Cleaner output, leaner internals (done)
Output polish, code health, CLI ergonomics, GNOME extension cleanup.
- Per-token probability coloring in terminal output
- Dropped transcriptions styled with strikethrough
- Suppress noisy low-confidence hallucination filter drops
- Remove unused streaming module (-2,262 lines)
- Consolidate all unsafe operations into the `sys` module
- Extract hallucination filter phrases to a TOML data file
- Extract a `RecordConfig` struct from parameter soup
- CLI: `voicsh models use <name>` command
- CLI: `voicsh benchmark <model>` as a positional argument
- GNOME extension: rename to `[email protected]`, version display, migration cleanup
- GNOME extension: DaemonInfo on follow connect, binary path
- Stale portal session detection with user warning
- GPU preflight checks (Vulkan glslc, libclang-dev, unversioned clang)
- Voice command provenance tracking in pipeline
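The green/default/yellow/red probability coloring maps each decoder probability to an ANSI escape. A sketch, assuming illustrative thresholds; voicsh's actual cutoffs may differ.

```rust
/// ANSI color for a per-token decoder probability.
/// The 0.90 / 0.60 / 0.30 thresholds are illustrative assumptions.
fn confidence_color(probability: f32) -> &'static str {
    match probability {
        p if p >= 0.90 => "\x1b[32m", // green: high confidence
        p if p >= 0.60 => "\x1b[0m",  // default: acceptable
        p if p >= 0.30 => "\x1b[33m", // yellow: shaky
        _ => "\x1b[31m",              // red: likely wrong
    }
}

/// Render a token wrapped in its confidence color, resetting afterwards.
fn render_token(token: &str, probability: f32) -> String {
    format!("{}{}\x1b[0m", confidence_color(probability), token)
}
```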
0.3.0 — Wayland overlay and spoken punctuation
Wayland overlay for live feedback. Spoken punctuation that works reliably.
- Wayland layer-shell overlay: recording indicator + live transcription display
- Per-token confidence visualization in overlay (color-coded, same scale as terminal)
- Sentence collector: buffer dictated chunks in the overlay instead of injecting immediately
- Reliable spoken punctuation end-to-end (building on 0.1.0 foundation)
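The sentence collector buffers dictated chunks and only releases a complete sentence for injection. A minimal sketch, assuming a naive end-of-sentence heuristic; the struct and method names are hypothetical.

```rust
/// Buffer dictated chunks in the overlay; release one full sentence at a
/// time instead of injecting each chunk immediately. Sketch only.
#[derive(Default)]
struct SentenceCollector {
    buffer: String,
}

impl SentenceCollector {
    /// Append a chunk; return the finished sentence once one terminates.
    fn push(&mut self, chunk: &str) -> Option<String> {
        if !self.buffer.is_empty() {
            self.buffer.push(' ');
        }
        self.buffer.push_str(chunk.trim());
        // Naive heuristic: a sentence ends at terminal punctuation.
        if self.buffer.ends_with(|c: char| matches!(c, '.' | '!' | '?')) {
            Some(std::mem::take(&mut self.buffer))
        } else {
            None
        }
    }
}
```

In the overlay, `None` means "keep showing the live buffer"; `Some(sentence)` is the point where text injection fires.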
0.4.0 — GPU acceleration
GPU acceleration for Whisper inference.
- CUDA working on dev machine
- Vulkan working on dev machine
- Research task: Evaluate candle Whisper — if viable, migrate from whisper.cpp for unified GPU context
- N-best reality: whisper-rs/whisper.cpp BeamSearch returns single-best only (no n-best extraction). True n-best requires either candle Whisper or multi-pass with different temperatures.
- Alternative path: Keep whisper.cpp + BeamSearch for better single-best, feed per-token probabilities to corrector (already wired)
- GPU compilation gates: CUDA, Vulkan, hipBLAS in CI containers
- Vulkan runtime tests via lavapipe
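The multi-pass fallback for n-best could run several decodes at different temperatures and keep the hypothesis with the highest mean per-token probability. The `Pass` type and the scoring rule below are assumptions for illustration, not the project's actual types.

```rust
/// One decoding pass (e.g. at a given temperature). Hypothetical shape.
struct Pass {
    text: String,
    token_probs: Vec<f32>,
}

/// Mean per-token probability; empty passes score zero.
fn mean_prob(p: &Pass) -> f32 {
    if p.token_probs.is_empty() {
        return 0.0;
    }
    p.token_probs.iter().sum::<f32>() / p.token_probs.len() as f32
}

/// Keep the pass with the highest mean token probability.
fn pick_best(passes: &[Pass]) -> Option<&Pass> {
    passes
        .iter()
        .max_by(|a, b| mean_prob(a).total_cmp(&mean_prob(b)))
}
```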
0.5.0 — Voice commands
Full voice commands beyond punctuation — navigation, selection, editing, app control.
- Navigation: “go to line”, “scroll up/down”, “page up/down”
- Selection: “select word”, “select line”, “select all”
- Editing: “undo”, “redo”, “copy”, “paste”, “cut”
- Extensible command vocabulary via config
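The command vocabulary could map spoken phrases to actions roughly as below. The `Action` enum and phrase set are illustrative; in the real design, user entries from config would merge into this table.

```rust
/// Editor actions a voice command can trigger. Hypothetical names.
#[derive(Debug, Clone, PartialEq)]
enum Action {
    Undo,
    Redo,
    Copy,
    Paste,
    Cut,
    SelectWord,
    SelectLine,
    SelectAll,
}

/// Look up a spoken phrase in the built-in command table.
fn lookup(phrase: &str) -> Option<Action> {
    match phrase {
        "undo" => Some(Action::Undo),
        "redo" => Some(Action::Redo),
        "copy" => Some(Action::Copy),
        "paste" => Some(Action::Paste),
        "cut" => Some(Action::Cut),
        "select word" => Some(Action::SelectWord),
        "select line" => Some(Action::SelectLine),
        "select all" => Some(Action::SelectAll),
        _ => None,
    }
}
```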
0.6.0 — Post-ASR error correction
LLM-based error correction using per-token confidence as a guide.
Current state: an early prototype exists (T5 via candle + SymSpell hybrid corrector). It is CPU-only, slow, and its corrections are often wrong; SymSpell is lowercase-only, which breaks casing. Needs a fundamentally better approach before shipping.
- Per-token probability data already available (`TokenProbability { token, probability }`)
- Low-probability tokens are flagged as correction candidates for the LLM
- Option A — Local model: FlanEC (flan-t5-base) or instruction-tuned LLM via candle (~250M–1B params, F16)
- Option B — External LLM: Ollama / llama.cpp / cloud API with token-confidence prompt
- Lazy-loaded model, timeout + fallback to raw transcription
- Greedy reranking: bump `best_of: 1` → `best_of: 5` for better base accuracy before correction
- English-only guard initially (passthrough for other languages)
- `[error_correction]` config section
- LLM sentence stitcher: when a sentence is complete (detected by the LLM), refine and inject
- Receives token-level probabilities to focus corrections on uncertain tokens
- Context-aware: knows the previous sentence for coherent flow
- Uses local LLM (Ollama / llama.cpp) or cloud (Anthropic, OpenAI)
- Timeout + fallback to raw transcription if LLM is unavailable
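Selecting correction candidates from the existing per-token data might look like the sketch below, built on the `TokenProbability` shape the roadmap names. The 0.5 threshold and the function name are assumptions.

```rust
/// Per-token decoder output, as named in the roadmap.
struct TokenProbability {
    token: String,
    probability: f32,
}

/// Flag low-confidence tokens for the LLM correction pass.
/// The threshold is an illustrative assumption, not a shipped default.
fn correction_candidates(tokens: &[TokenProbability], threshold: f32) -> Vec<&str> {
    tokens
        .iter()
        .filter(|t| t.probability < threshold)
        .map(|t| t.token.as_str())
        .collect()
}
```

Only the flagged tokens need to reach the LLM prompt, which keeps corrections focused on genuinely uncertain spans.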
0.7.0 — LLM assistant
Voice-activated LLM: hold key + speak a question → LLM processes → answer injected as text.
- Local: Ollama, llama.cpp server (auto-detect running)
- Cloud providers optional (Anthropic, OpenAI)
- Timeout + fallback
Future
- Streaming token-by-token display (live partial results during recording)
- Push-to-talk (hold hotkey)
- X11 support (xdotool/xsel)
- Profiles (per-app settings)
- Daemon: listen for PipeWire/PulseAudio device changes, auto-recover or show helpful message
- German grammar correction: t5-small-grammar-correction-german (aiassociates) via candle
- Deepgram remote API integration (cloud ASR alternative)
- NVIDIA Canary / NeMo support via local docker container (nvcr.io/nvidia/nemo) — high-quality local ASR alternative (~20GB+)
- GNOME extension: portal per-recording (close RemoteDesktop session when idle, remove yellow privacy indicator)
Non-goals
- GUI settings app (config file is enough)
- Speaker identification
- Real-time translation
- Windows/macOS (Linux-first)