Architecture
Architecture
Wayland-only voice typing: audio capture → VAD → chunking → Whisper transcription → text injection.
Design Principles
- Offline-First — No network for core functionality
- Single Binary — No runtime interpreters
- Fail Fast — Clear errors, no silent failures
- Pluggable — Traits for audio source, transcriber, text sink
Pipeline
AudioSource → VadStation → ChunkerStation → TranscriberStation → SinkStation
│ │ │ │ │
└──────────────┴──────────────┴──────────────────┴─────────────────┘
All connected via crossbeam::channel::boundedEach station implements Station<Input, Output> and runs in its own thread via StationRunner.
Stations
| Station | Input | Output | Purpose |
|---|---|---|---|
| VadStation | AudioFrame | VadFrame | RMS-based speech detection, level meter (-v) |
| ChunkerStation | VadFrame | AudioChunk | Gap-shrinking adaptive chunker |
| TranscriberStation | AudioChunk | TranscribedText | whisper-rs inference |
| SinkStation | TranscribedText | () | Delegates to TextSink impl |
TextSink Implementations
- InjectorSink — Clipboard + paste key simulation (continuous mode)
- CollectorSink — Accumulates text, returns on finish (
--oncemode) - StdoutSink — Writes to stdout (pipe mode:
cat file.wav | voicsh)
Text Injection Fallback Chain
- Portal — xdg-desktop-portal RemoteDesktop key injection (GNOME 45+, KDE 6.1+)
- wtype — wlroots virtual keyboard
- ydotool — uinput-based, works everywhere but needs daemon
Paste key auto-detection: queries swaymsg → hyprctl → GNOME Shell Introspect → GNOME fallback → generic fallback.
Module Map
src/
├── app.rs # Orchestrates record command (feature-gated: cpal-audio+model-download+cli)
├── cli.rs # clap argument parsing (-v/-vv, --once, --fan-out, etc.)
├── config.rs # TOML config + env overrides
├── diagnostics.rs # `voicsh check` dependency validation
├── audio/
│ ├── capture.rs # cpal AudioSource impl (feature: cpal-audio)
│ ├── recorder.rs # AudioSource trait + MockAudioSource
│ ├── wav.rs # WavAudioSource: WAV file input with resampling (hound)
│ └── vad.rs # Voice activity detection (RMS threshold + state machine)
├── input/
│ ├── injector.rs # TextInjector with CommandExecutor trait (wl-copy, wtype, ydotool)
│ ├── portal.rs # ashpd PortalSession for key injection (feature: portal)
│ └── focused_window.rs # Paste key detection (sway/hyprland/GNOME)
├── pipeline/
│ ├── orchestrator.rs # Pipeline + PipelineConfig + PipelineHandle
│ ├── station.rs # Station trait + StationRunner
│ ├── sink.rs # TextSink trait, SinkStation, InjectorSink, CollectorSink, StdoutSink
│ ├── adaptive_chunker.rs # Gap-shrinking chunker algorithm
│ ├── vad_station.rs # VAD as Station
│ ├── chunker_station.rs # Chunker as Station
│ ├── transcriber_station.rs # Transcriber as Station
│ ├── types.rs # AudioFrame, VadFrame, AudioChunk, TranscribedText
│ └── error.rs # StationError + ErrorReporter trait
├── streaming/ # Experimental streaming pipeline (ring buffer, stitcher)
├── stt/
│ ├── transcriber.rs # Transcriber trait + MockTranscriber
│ ├── whisper.rs # WhisperTranscriber (feature: whisper)
│ └── fan_out.rs # Parallel model comparison (--fan-out)
├── models/
│ ├── catalog.rs # Model metadata, English/multilingual variants
│ └── download.rs # HuggingFace download with SHA-1 verification (feature: model-download)
└── ipc/
├── protocol.rs # JSON command/response types
└── server.rs # Unix socket IPC serverFeature Gates
default = ["full"]
full = ["whisper", "cpal-audio", "model-download", "cli", "portal"]
cuda = ["whisper-rs/cuda"] # NVIDIA GPU
vulkan = ["whisper-rs/vulkan"] # Cross-platform GPU
hipblas = ["whisper-rs/hipblas"] # AMD GPU
openblas = ["whisper-rs/openblas"] # CPU BLASUse --no-default-features for fast lib-only builds (skips whisper-rs compilation).
Verbosity
- No flag: text only (
"transcribed text") -v: volume meter + result lines ([ok 43ch] "text")-vv: full diagnostics (chunk timing, transcribing progress, paste detection steps)
Configuration
TOML at ~/.config/voicsh/config.toml. Sections: [audio], [stt], [input].
Environment overrides: VOICSH_MODEL, VOICSH_AUDIO_DEVICE, etc.