Roadmap

Roadmap

0.0.1 — First release (done)

Default command, pipe mode, GPU, daemon mode.

  • Default command: voicsh alone starts mic recording (no subcommand)
  • Pipe mode: cat file.wav | voicsh → transcribe → stdout
  • Auto-resample WAV to 16kHz mono (linear interpolation)
  • GPU feature gates: --features cuda, vulkan, hipblas
  • Logging cleanup: all output respects -v/-vv levels consistently
  • Honest README matching actual features
  • Unix socket IPC: voicsh start / voicsh stop / voicsh toggle
  • Model stays in memory (~300MB for base.en)
  • Systemd user service: voicsh install-service
  • voicsh status shows daemon health
  • Shell completions (bash/zsh/fish)
  • voicsh init — auto-tune: benchmark hardware, recommend model, download
  • voicsh follow — stream daemon events (meter, state, transcriptions)
  • voicsh config get/set/list/dump — manage config without a text editor
  • Hallucination filtering (configurable, language-specific)
  • Fan-out mode — run English + multilingual models in parallel, pick best

0.1.0 — Usable voice typing (done)

Multi-language support, GNOME Shell extension, spoken punctuation, per-token confidence, hallucination filtering, and quantized models.

  • Punctuation: “period”, “comma”, “question mark”, “exclamation mark”, etc.
  • Whitespace: “new line”, “new paragraph”, “space”, “tab”
  • Caps toggle: “all caps” / “end caps”
  • Symbols: slash, ampersand, at-sign, dollar, hash, percent, asterisk, etc.
  • Brackets: open/close paren, brace, bracket, angle bracket
  • Key combos: “delete word” (Ctrl+Backspace), “backspace”
  • Configurable vocabulary in config.toml (user overrides merge with built-ins)
  • Rule-based (no LLM needed)
  • Multi-language: en, de, es, fr, pt, it, nl, pl, ru, ja, zh, ko
  • GNOME Shell extension: install, toggle, status indicator (voicsh install-gnome-extension)
  • Language allowlist: filter transcriptions by allowed languages + confidence threshold
  • GNOME extension: “Open Debug Log” menu item (launches voicsh follow in terminal)
  • GNOME extension: follow mode with live audio levels, recording state, transcriptions
  • GNOME extension: language picker + model switcher via IPC (SetLanguage/SetModel)
  • GNOME extension: language indicator in panel (two-letter code next to icon, configurable)
  • Unified output: daemon verbose and voicsh follow share one renderer (DRY)
  • Per-token confidence coloring: real decoder probabilities (green/default/yellow/red)
  • Hallucination filter: 76+ phrases, CJK punctuation normalization, punctuation-only skip
  • Quantized model support (q5_0, q5_1, q8_0 variants)

0.2.0 — Cleaner output, leaner internals (done)

Output polish, code health, CLI ergonomics, GNOME extension cleanup.

  • Per-token probability coloring in terminal output
  • Dropped transcriptions styled with strikethrough
  • Suppress noisy low-confidence hallucination filter drops
  • Remove unused streaming module (-2,262 lines)
  • Consolidate all unsafe operations into sys module
  • Extract hallucination filter phrases to TOML data file
  • Extract RecordConfig struct from parameter soup
  • CLI: voicsh models use <name> command
  • CLI: voicsh benchmark <model> as positional argument
  • GNOME extension: rename to [email protected], version display, migration cleanup
  • GNOME extension: DaemonInfo on follow connect, binary path
  • Stale portal session detection with user warning
  • GPU preflight checks (Vulkan glslc, libclang-dev, unversioned clang)
  • Voice command provenance tracking in pipeline

0.3.0 — Wayland overlay and spoken punctuation

Wayland overlay for live feedback. Spoken punctuation that works reliably.

  • Wayland layer-shell overlay: recording indicator + live transcription display
  • Per-token confidence visualization in overlay (color-coded, same scale as terminal)
  • Sentence collector: buffer dictated chunks in the overlay instead of injecting immediately
  • Reliable spoken punctuation end-to-end (building on 0.1.0 foundation)

0.4.0 — GPU acceleration

GPU acceleration for Whisper inference.

  • CUDA working on dev machine
  • Vulkan working on dev machine
  • Research task: Evaluate candle Whisper — if viable, migrate from whisper.cpp for unified GPU context
  • N-best reality: whisper-rs/whisper.cpp BeamSearch returns single-best only (no n-best extraction). True n-best requires either candle Whisper or multi-pass with different temperatures.
  • Alternative path: Keep whisper.cpp + BeamSearch for better single-best, feed per-token probabilities to corrector (already wired)
  • GPU compilation gates: CUDA, Vulkan, hipBLAS in CI containers
  • Vulkan runtime tests via lavapipe

0.5.0 — Voice commands

Full voice commands beyond punctuation — navigation, selection, editing, app control.

  • Navigation: “go to line”, “scroll up/down”, “page up/down”
  • Selection: “select word”, “select line”, “select all”
  • Editing: “undo”, “redo”, “copy”, “paste”, “cut”
  • Extensible command vocabulary via config

0.6.0 — Post-ASR error correction

LLM-based error correction using per-token confidence as a guide.

Current state: Early prototype exists (T5 via candle + SymSpell hybrid corrector). CPU-only, slow, and corrections are often wrong. SymSpell is lowercase-only which breaks casing. Needs a fundamentally better approach before shipping.

  • Per-token probability data already available (TokenProbability { token, probability })
  • Low-probability tokens are flagged as correction candidates for the LLM
  • Option A — Local model: FlanEC (flan-t5-base) or instruction-tuned LLM via candle (~250M–1B params, F16)
  • Option B — External LLM: Ollama / llama.cpp / cloud API with token-confidence prompt
  • Lazy-loaded model, timeout + fallback to raw transcription
  • Greedy reranking: bump best_of: 1best_of: 5 for better base accuracy before correction
  • English-only guard initially (passthrough for other languages)
  • [error_correction] config section
  • LLM sentence stitcher: when a sentence is complete (detected by LLM), refine and inject
    • Receives token-level probabilities to focus corrections on uncertain tokens
    • Context-aware: knows the previous sentence for coherent flow
    • Uses local LLM (Ollama / llama.cpp) or cloud (Anthropic, OpenAI)
    • Timeout + fallback to raw transcription if LLM is unavailable

0.7.0 — LLM assistant

Voice-activated LLM: hold key + speak a question → LLM processes → answer injected as text.

  • Local: Ollama, llama.cpp server (auto-detect running)
  • Cloud providers optional (Anthropic, OpenAI)
  • Timeout + fallback

Future

  • Streaming token-by-token display (live partial results during recording)
  • Push-to-talk (hold hotkey)
  • X11 support (xdotool/xsel)
  • Profiles (per-app settings)
  • Daemon: listen for PipeWire/PulseAudio device changes, auto-recover or show helpful message
  • German grammar correction: t5-small-grammar-correction-german (aiassociates) via candle
  • Deepgram remote API integration (cloud ASR alternative)
  • NVIDIA Canary / NeMo support via local docker container (nvcr.io/nvidia/nemo) — high-quality local ASR alternative (~20GB+)
  • GNOME extension: portal per-recording (close RemoteDesktop session when idle, remove yellow privacy indicator)

Non-goals

  • GUI settings app (config file is enough)
  • Speaker identification
  • Real-time translation
  • Windows/macOS (Linux-first)