🎙️ W H I S P E R V O I C E
SOVEREIGN SPEECH RECOGNITION
"The master's tools will never dismantle the master's house."
Build your own tools. Run them locally. Free your mind.
📡 The Transmission
We are witnessing the enshittification of the digital world. What were once vibrant social commons are being walled off, strip-mined for data, and degraded into rent-seeking silos. Your voice is no longer your own; it is a training set for a corporate oracle that charges you for the privilege of listening.
Whisper Voice is a small act of sabotage against this trend.
It is built on the axiom of Technological Sovereignty. By moving state-of-the-art inference from the server farms to your own silicon, you reclaim the means of digital production. No telemetry. No subscriptions. No "cloud processing" that eavesdrops on your intent.
⚡ The Engine
Whisper Voice operates directly on the metal. It is not an API wrapper; it is an autonomous machine.
| Component | Technology | Benefit |
|---|---|---|
| Inference Core | Faster-Whisper | Whisper reimplemented on CTranslate2, an optimized C++ inference engine. Up to 4x faster than the reference PyTorch implementation at the same accuracy. |
| Compression | INT8 quantization | Enables Pro-grade models (Large-v3) to run on consumer-grade GPUs, democratizing elite AI. |
| Sensory Gate | Silero VAD | Enterprise-grade Voice Activity Detection filters out the noise, ensuring only pure intent is processed. |
| Interface | Qt 6 / QML | Hardware-accelerated, glassmorphic UI that is fluid, responsive, and sovereign. |
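As a rough illustration of how the first three rows combine, here is a minimal sketch assuming the `faster-whisper` Python package. `build_engine_config` and `transcribe_file` are hypothetical helper names, not Whisper Voice's actual code; the `WhisperModel` arguments and the `vad_filter` flag come from faster-whisper's public API.

```python
from typing import Any


def build_engine_config(model_size: str = "large-v3", use_gpu: bool = True) -> dict[str, Any]:
    """Return keyword arguments for faster_whisper.WhisperModel (hypothetical helper)."""
    return {
        "model_size_or_path": model_size,
        "device": "cuda" if use_gpu else "cpu",
        # INT8 weights shrink the memory footprint of large models.
        "compute_type": "int8",
    }


def transcribe_file(path: str, use_gpu: bool = True) -> str:
    """Transcribe one audio file with Silero VAD filtering enabled."""
    # Imported lazily so the config helper works without the package installed.
    from faster_whisper import WhisperModel

    model = WhisperModel(**build_engine_config(use_gpu=use_gpu))
    # vad_filter=True drops non-speech audio before decoding.
    segments, _info = model.transcribe(path, vad_filter=True)
    return " ".join(segment.text.strip() for segment in segments)
```

Calling `transcribe_file("note.wav", use_gpu=False)` would run the same pipeline on the CPU; the file name is only a placeholder.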
🛑 Compatibility Matrix (Windows)
The core engine (CTranslate2) is heavily optimized for Nvidia tensor cores.
| Manufacturer | Hardware | Status | Notes |
|---|---|---|---|
| Nvidia | GTX 900+ / RTX | ✅ Supported | Full heavy-metal acceleration. |
| AMD | Radeon RX | ⚠️ CPU Fallback | Runs on CPU. Valid for Small/Medium, slow for Large. |
| Intel | Arc / Iris | ⚠️ CPU Fallback | Runs on CPU. Valid for Small/Medium, slow for Large. |
| Apple | M1 / M2 / M3 | ❌ Unsupported | Release is strictly Windows x64. |
AMD Users: v1.0.3 auto-detects GPU failures and silently falls back to CPU.
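The fallback behaviour can be pictured roughly as follows. This is a sketch under assumptions, not the shipped implementation: `load` stands in for whatever function builds the model on a given device, and `load_with_cpu_fallback` is a hypothetical name.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def load_with_cpu_fallback(load: Callable[[str], T], preferred: str = "cuda") -> tuple[T, str]:
    """Try the preferred device first; on any failure retry on CPU.

    `load` is a hypothetical callable that takes a device name and returns a model.
    Returns the model together with the device that actually worked.
    """
    if preferred != "cpu":
        try:
            return load(preferred), preferred
        except Exception:
            # GPU initialisation failed (missing driver, unsupported vendor, ...).
            pass
    return load("cpu"), "cpu"
```

Returning the device name alongside the model lets the UI show which backend is actually in use, even though the fallback itself is silent.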
🖋️ Universal Transcription
At its core, Whisper Voice is the ultimate bridge between thought and text. It listens with superhuman precision, converting spoken word into written form across 99 languages.
- Punctuation Mastery: Automatically handles capitalization and complex punctuation formatting.
- Contextual Intelligence: Smarter than standard dictation; it uses the flow of the sentence to resolve homophones and technical jargon, and to format figures naturally (e.g. "$1.5k" rather than "fifteen hundred dollars").
- Total Privacy: Your private dictation, legal notes, or creative writing never leave your RAM.
Workflow: F9 (Default)
The primary channel for native-language transcription. It transcribes precisely what it hears in the language you speak (or the one you've locked in Settings).
🧠 Intelligent Correction (New in v1.1.0)
Whisper Voice now integrates a local Llama 3.2 1B LLM to act as a "Silent Consultant". It post-processes transcripts to fix grammar or polish style, and it never "chats" back.
It is strictly trained on a Forensic Protocol: it will never lecture you, never refuse to process explicit language, and never sanitize your words. Your profanity is yours to keep.
Correction Modes:
- Standard (Default): Fixes grammar, punctuation, and capitalization while keeping every word you said.
- Grammar Only: Strictly fixes objective errors (spelling/agreement). Touches nothing else.
- Rewrite: Polishes the flow and clarity of your sentences while explicitly preserving your original tone (Casual stays casual, Formal stays formal).
Supported Languages:
The correction engine is optimized for English, German, French, Italian, Portuguese, Spanish, Hindi, and Thai. It also performs well on Russian, Chinese, Japanese, and Romanian.
This approach incurs a ~2s latency penalty but uses zero extra VRAM when in Low VRAM mode.
🌎 Universal Translation
Whisper Voice v1.0.1 includes a Neural Translation Engine that allows you to bridge any linguistic gap instantly.
- Input: Speak in French, Japanese, Russian, or 96 other languages.
- Output: The engine instantly reconstructs the semantic meaning into fluent English.
- Task Protocol: Handled via the dedicated `F10` channel.
🔍 Why only English translation?
A common question arises: Why can't I translate from French to Japanese?
The architecture of the underlying Whisper model is a Many-to-English design. During its massive training phase (680,000 hours of audio), the translation task was specifically optimized to map the global linguistic commons onto a single bridge language: English. This allowed the model to reach incredible levels of semantic understanding without the exponential complexity of a "Many-to-Many" mapping.
By focusing its translation decoder solely on English, Whisper achieves "Zero-Shot" quality that rivals specialized translation engines while remaining lightweight enough to run on your local GPU.
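In faster-whisper terms, the difference between the two channels comes down to the `task` argument of `WhisperModel.transcribe`. The sketch below assumes that mapping; `transcribe_kwargs` is a hypothetical helper, and the hotkey names are the defaults described in this README.

```python
from typing import Optional


def transcribe_kwargs(hotkey: str, locked_language: Optional[str] = None) -> dict:
    """Map a hotkey to keyword arguments for WhisperModel.transcribe (hypothetical helper)."""
    if hotkey == "F9":
        # Native-language transcription; language=None lets the model auto-detect.
        return {"task": "transcribe", "language": locked_language}
    if hotkey == "F10":
        # Whisper's translate task always produces English output,
        # so there is no target-language parameter to set.
        return {"task": "translate", "language": locked_language}
    raise ValueError(f"unmapped hotkey: {hotkey}")
```

Note that `language` here is always the spoken (source) language. There is no way to ask for Japanese output, which is the Many-to-English limit described above.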
🕹️ Command & Control
Global Hotkeys
The agent runs silently in the background, waiting for your signal.
- Transcribe (F9): Opens the channel for standard speech-to-text.
- Translate (F10): Opens the channel for neural translation.
- Customization: Remap these keys in Settings. The recorder supports complex chords (e.g. `Ctrl + Alt + Space`) to fit your workflow.
Injection Protocols
- Clipboard Paste: Standard text injection. Instant, reliable.
- Simulate Typing: Mimics physical keystrokes at superhuman speed (6000 CPM). Bypasses anti-paste restrictions and "protected" windows.
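To put the typing speed in perspective, the arithmetic is simple: characters per minute translate directly into a delay between keystrokes. The helper names below are illustrative only.

```python
def keystroke_delay_ms(chars_per_minute: int) -> float:
    """Milliseconds between simulated keystrokes at the given speed."""
    return 60_000 / chars_per_minute


def typing_time_s(text: str, chars_per_minute: int) -> float:
    """Seconds needed to type `text` at the given speed."""
    return len(text) * 60 / chars_per_minute
```

At 6000 CPM each keystroke is 10 ms apart, so a 300-character paragraph takes about 3 seconds. At 1200 CPM the gap grows to 50 ms and the same paragraph takes about 15 seconds.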
📊 Intelligence Matrix
Select the model that aligns with your available resources.
| Model | VRAM (GPU) | RAM (CPU) | Designation | Capability |
|---|---|---|---|---|
| `Tiny` | ~500 MB | ~1 GB | ⚡ Supersonic | Command & Control, older hardware. |
| `Base` | ~600 MB | ~1 GB | 🚀 Very Fast | Daily driver for low-power laptops. |
| `Small` | ~1 GB | ~2 GB | ⏩ Fast | High accuracy English dictation. |
| `Medium` | ~2 GB | ~4 GB | ⚖️ Balanced | Complex vocabulary, foreign accents. |
| `Large-v3 Turbo` | ~4 GB | ~6 GB | ✨ Optimal | The Sweet Spot. Near-Large intelligence, Medium speed. |
| `Large-v3` | ~5 GB | ~8 GB | 🧠 Maximum | Professional grade. Uncompromised. |
Note: Acceleration requires you to manually select your Compute Device (CUDA GPU or CPU) in Settings.
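If you would rather pick a model programmatically, the table can be read as a simple lookup. This is only a sketch: the footprints are the approximate figures from the table above, and the labels and function name are hypothetical rather than identifiers used by the app.

```python
from typing import Optional

# (label, approx. GPU VRAM in GB, approx. CPU RAM in GB), smallest first.
# Figures are copied from the table above; the labels are illustrative.
MODEL_FOOTPRINTS = [
    ("tiny", 0.5, 1.0),
    ("base", 0.6, 1.0),
    ("small", 1.0, 2.0),
    ("medium", 2.0, 4.0),
    ("large-v3-turbo", 4.0, 6.0),
    ("large-v3", 5.0, 8.0),
]


def largest_model_that_fits(free_gb: float, device: str = "cuda") -> Optional[str]:
    """Return the largest model whose footprint fits in `free_gb`, or None."""
    column = 1 if device == "cuda" else 2
    best = None
    for row in MODEL_FOOTPRINTS:
        if row[column] <= free_gb:
            best = row[0]
    return best
```

Because the table values are estimates, it is safer to pass in the free memory minus some headroom than the card's nominal capacity.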
📉 Low VRAM Mode
For users with limited GPU memory (e.g., 4GB cards) or those running heavy games simultaneously, Whisper Voice offers a specialized Low VRAM Mode.
- Behavior: The AI model is aggressively unloaded from the GPU immediately after every transcription.
- Benefit: When idle, the app consumes near-zero VRAM (~0MB), leaving your GPU completely free for gaming or rendering.
- Trade-off: There is a "cold start" latency of 1-2 seconds for every voice command as the model reloads from the disk cache.
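The load, use, unload cycle described above can be sketched with a small context manager. This is an assumption about the shape of the logic, not the app's real code: `ModelSlot` and its `load` callable are hypothetical.

```python
from contextlib import contextmanager
from typing import Callable, Iterator, Optional


class ModelSlot:
    """Holds at most one loaded model; hypothetical sketch of Low VRAM Mode."""

    def __init__(self, load: Callable[[], object], low_vram: bool) -> None:
        self._load = load
        self._low_vram = low_vram
        self._model: Optional[object] = None

    @property
    def loaded(self) -> bool:
        return self._model is not None

    @contextmanager
    def use(self) -> Iterator[object]:
        if self._model is None:
            # Cold start: this reload is the 1-2 second penalty.
            self._model = self._load()
        try:
            yield self._model
        finally:
            if self._low_vram:
                # Drop the only reference so the backend can release GPU memory.
                self._model = None
```

With `low_vram=False` the model stays resident between commands, trading idle VRAM for instant response.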
♿ Accessibility (WCAG 2.2 AAA)
Whisper Voice is built to be usable by everyone. The entire interface has been engineered to meet WCAG 2.2 AAA — the highest tier of accessibility compliance. This is not a checkbox exercise; it is a structural commitment.
Color & Contrast
Every design token is calibrated for Enhanced Contrast (WCAG 1.4.6, 7:1 minimum):
| Token | Ratio | Purpose |
|---|---|---|
| `textPrimary #FAFAFA` | ~17:1 | Body text, headings |
| `textSecondary #ABABAB` | 8.1:1 | Descriptions, hints |
| `accentPurple #B794F6` | 7.2:1 | Interactive elements, focus rings |
| `borderSubtle` | 3:1 | Non-text contrast for borders and separators |
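Ratios like these can be checked with the WCAG relative-luminance formula. The sketch below implements that formula; the background colour is a parameter because this README does not state which surface colour the tokens were measured against.

```python
def _linear(channel: int) -> float:
    """Convert one 8-bit sRGB channel to linear light."""
    c = channel / 255
    return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(hex_color: str) -> float:
    """WCAG relative luminance of a colour given as '#RRGGBB'."""
    h = hex_color.lstrip("#")
    r, g, b = (int(h[i:i + 2], 16) for i in (0, 2, 4))
    return 0.2126 * _linear(r) + 0.7152 * _linear(g) + 0.0722 * _linear(b)


def contrast_ratio(fg: str, bg: str) -> float:
    """WCAG contrast ratio between two colours, always >= 1."""
    lighter, darker = sorted((relative_luminance(fg), relative_luminance(bg)), reverse=True)
    return (lighter + 0.05) / (darker + 0.05)
```

A token passes Enhanced Contrast (1.4.6) when `contrast_ratio(token, background) >= 7`, and a border passes non-text contrast when the result is at least 3.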
Keyboard Navigation
Full keyboard access — no mouse required:
- Tab / Shift+Tab: Navigate between all interactive controls (sliders, switches, buttons, dropdowns, text fields).
- Arrow Keys: Navigate the Settings sidebar tabs.
- Enter / Space: Activate any focused control.
- Focus Rings: Every interactive element shows a visible 2px accent-colored focus indicator.
Screen Reader Support
Every component is annotated with semantic roles and descriptive names:
- Buttons, sliders, checkboxes, combo boxes, text fields: all declare their `Accessible.role` and `Accessible.name`.
- Switches report "on" / "off" state in their accessible name.
- The loader status uses `AlertMessage` for live-region announcements.
- Settings tabs use `Tab` / `PageTab` roles matching WAI-ARIA patterns.
Non-Color State Indicators
Toggle switches display I/O marks inside the thumb (not just color changes), ensuring state is perceivable without color vision (WCAG 1.4.1).
Target Sizes
All interactive controls meet the 24px minimum target size (WCAG 2.5.8). Slider handles, buttons, switches, and nav items are all comfortably clickable.
Reduced Motion
A Reduce Motion toggle (Settings > Visuals) disables all decorative animations:
- Shader effects (gradient blobs, glow, CRT scanlines, rainbow waveform)
- Particle systems
- Pulsing animations (mic button, recording timer, border)
- Loader logo pulse and progress shimmer
The system also respects the Windows "Show animations" preference via SystemParametersInfo detection. Essential information (recording state, progress bars, timer text) remains fully functional.
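One way such a check could look is sketched below. It assumes the preference is read through the Win32 `SPI_GETCLIENTAREAANIMATION` query, which is a guess at the exact parameter since the README only names `SystemParametersInfo`. Both function names are hypothetical.

```python
import ctypes
import sys
from typing import Optional

# Win32 constant for querying client-area animations (assumed to be the one used).
SPI_GETCLIENTAREAANIMATION = 0x1042


def system_animations_enabled() -> Optional[bool]:
    """Return the Windows animation preference, or None if it cannot be read."""
    if sys.platform != "win32":
        return None
    enabled = ctypes.c_int(1)  # BOOL out-parameter filled in by the call
    ok = ctypes.windll.user32.SystemParametersInfoW(
        SPI_GETCLIENTAREAANIMATION, 0, ctypes.byref(enabled), 0
    )
    return bool(enabled.value) if ok else None


def should_reduce_motion(user_toggle: bool, system_enabled: Optional[bool]) -> bool:
    """Reduce motion if the user asked for it or Windows has animations turned off."""
    return user_toggle or system_enabled is False
```

An unknown system preference (`None`) is treated as "animations allowed", so only an explicit opt-out from either the app toggle or Windows disables the decorative effects.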
🛠️ Deployment
📥 Installation
- Acquire: Download `WhisperVoice.exe` from Releases.
- Deploy: Place it anywhere. It is portable.
- Bootstrap: Run it. The agent will self-provision an isolated Python runtime (~2GB) on first launch.
- Sync: Future updates are handled by the Smart Bootstrapper, which surgically updates only changed files, respecting your bandwidth and your settings.
🔧 Troubleshooting
- App crashes on start: Ensure you have Microsoft Visual C++ Redistributable 2015-2022 installed.
- "Simulate Typing" is slow: Some applications (remote desktops, legacy games) cannot handle the data stream. Lower the typing speed in Settings to ~1200 CPM.
- No Audio: The agent listens to the Default Communication Device. Verify your Windows Sound Control Panel.
🌐 Supported Languages
The engine understands the following 99 languages. You can lock the focus to a specific language in Settings to improve accuracy, or rely on Auto-Detect for fluid multilingual usage.
| Afrikaans 🇿🇦 | Albanian 🇦🇱 | Amharic 🇪🇹 | Arabic 🇸🇦 | Armenian 🇦🇲 | Assamese 🇮🇳 |
| Azerbaijani 🇦🇿 | Bashkir 🇷🇺 | Basque 🇪🇸 | Belarusian 🇧🇾 | Bengali 🇧🇩 | Bosnian 🇧🇦 |
| Breton 🇫🇷 | Bulgarian 🇧🇬 | Burmese 🇲🇲 | Castilian 🇪🇸 | Catalan 🇪🇸 | Chinese 🇨🇳 |
| Croatian 🇭🇷 | Czech 🇨🇿 | Danish 🇩🇰 | Dutch 🇳🇱 | English 🇺🇸 | Estonian 🇪🇪 |
| Faroese 🇫🇴 | Finnish 🇫🇮 | Flemish 🇧🇪 | French 🇫🇷 | Galician 🇪🇸 | Georgian 🇬🇪 |
| German 🇩🇪 | Greek 🇬🇷 | Gujarati 🇮🇳 | Haitian 🇭🇹 | Hausa 🇳🇬 | Hawaiian 🇺🇸 |
| Hebrew 🇮🇱 | Hindi 🇮🇳 | Hungarian 🇭🇺 | Icelandic 🇮🇸 | Indonesian 🇮🇩 | Italian 🇮🇹 |
| Japanese 🇯🇵 | Javanese 🇮🇩 | Kannada 🇮🇳 | Kazakh 🇰🇿 | Khmer 🇰🇭 | Korean 🇰🇷 |
| Lao 🇱🇦 | Latin 🇻🇦 | Latvian 🇱🇻 | Lingala 🇨🇩 | Lithuanian 🇱🇹 | Luxembourgish 🇱🇺 |
| Macedonian 🇲🇰 | Malagasy 🇲🇬 | Malay 🇲🇾 | Malayalam 🇮🇳 | Maltese 🇲🇹 | Maori 🇳🇿 |
| Marathi 🇮🇳 | Moldavian 🇲🇩 | Mongolian 🇲🇳 | Myanmar 🇲🇲 | Nepali 🇳🇵 | Norwegian 🇳🇴 |
| Occitan 🇫🇷 | Panjabi 🇮🇳 | Pashto 🇦🇫 | Persian 🇮🇷 | Polish 🇵🇱 | Portuguese 🇵🇹 |
| Punjabi 🇮🇳 | Romanian 🇷🇴 | Russian 🇷🇺 | Sanskrit 🇮🇳 | Serbian 🇷🇸 | Shona 🇿🇼 |
| Sindhi 🇵🇰 | Sinhala 🇱🇰 | Slovak 🇸🇰 | Slovenian 🇸🇮 | Somali 🇸🇴 | Spanish 🇪🇸 |
| Sundanese 🇮🇩 | Swahili 🇰🇪 | Swedish 🇸🇪 | Tagalog 🇵🇭 | Tajik 🇹🇯 | Tamil 🇮🇳 |
| Tatar 🇷🇺 | Telugu 🇮🇳 | Thai 🇹🇭 | Tibetan 🇨🇳 | Turkish 🇹🇷 | Turkmen 🇹🇲 |
| Ukrainian 🇺🇦 | Urdu 🇵🇰 | Uzbek 🇺🇿 | Vietnamese 🇻🇳 | Welsh 🏴 | Yiddish 🇮🇱 |
| Yoruba 🇳🇬 |
⚖️ PUBLIC DOMAIN (CC0 1.0)
No Rights Reserved. No Gods. No Masters. No Managers.
Credit to OpenAI (Whisper), Systran (Faster-Whisper), and Silero (VAD).