Files
whisper_voice/README.md
2026-02-18 22:30:48 +02:00

14 KiB

🎙️ W H I S P E R   V O I C E

SOVEREIGN SPEECH RECOGNITION


Status Download License


"The master's tools will never dismantle the master's house."
Build your own tools. Run them locally. Free your mind.

View SourceReport Issue



📡 The Transmission

We are witnessing the enshittification of the digital world. What were once vibrant social commons are being walled off, strip-mined for data, and degraded into rent-seeking silos. Your voice is no longer your own; it is a training set for a corporate oracle that charges you for the privilege of listening.

Whisper Voice is a small act of sabotage against this trend.

It is built on the axiom of Technological Sovereignty. By moving state-of-the-art inference from the server farms to your own silicon, you reclaim the means of digital production. No telemetry. No subscriptions. No "cloud processing" that eavesdrops on your intent.


The Engine

Whisper Voice operates directly on the metal. It is not an API wrapper; it is an autonomous machine.

Component Technology Benefit
Inference Core Faster-Whisper Hyper-optimized C++ implementation via CTranslate2. Delivers 4x velocity over standard PyTorch.
Compression INT8 quantization Enables Pro-grade models (Large-v3) to run on consumer-grade GPUs, democratizing elite AI.
Sensory Gate Silero VAD Enterprise-grade Voice Activity Detection filters out the noise, ensuring only pure intent is processed.
Interface Qt 6 / QML Hardware-accelerated, glassmorphic UI that is fluid, responsive, and sovereign.

🛑 Compatibility Matrix (Windows)

The core engine (CTranslate2) is heavily optimized for Nvidia tensor cores.

Manufacturer Hardware Status Notes
Nvidia GTX 900+ / RTX Supported Full heavy-metal acceleration.
AMD Radeon RX ⚠️ CPU Fallback Runs on CPU. Valid for Small/Medium, slow for Large.
Intel Arc / Iris ⚠️ CPU Fallback Runs on CPU. Valid for Small/Medium, slow for Large.
Apple M1 / M2 / M3 Unsupported Release is strictly Windows x64.

AMD Users: v1.0.3 auto-detects GPU failures and silently falls back to CPU.


🖋️ Universal Transcription

At its core, Whisper Voice is the ultimate bridge between thought and text. It listens with superhuman precision, converting spoken word into written form across 99 languages.

  • Punctuation Mastery: Automatically handles capitalization and complex punctuation formatting.
  • Contextual Intelligence: Smarter than standard dictation; it understands the flow of sentences to resolve homophones and technical jargon ($1.5k vs "fifteen hundred dollars").
  • Total Privacy: Your private dictation, legal notes, or creative writing never leave your RAM.

Workflow: F9 (Default)

The primary channel for native-language transcription. It transcribes precisely what it hears in the language you speak (or the one you've locked in Settings).

🧠 Intelligent Correction (New in v1.1.0)

Whisper Voice now integrates a local Llama 3.2 1B LLM to act as a "Silent Consultant". It post-processes transcripts to fix grammar or polish style without effectively "chatting" back.

It is strictly trained on a Forensic Protocol: it will never lecture you, never refuse to process explicit language, and never sanitize your words. Your profanity is yours to keep.

Correction Modes:

  • Standard (Default): Fixes grammar, punctuation, and capitalization while keeping every word you said.
  • Grammar Only: Strictly fixes objective errors (spelling/agreement). Touches nothing else.
  • Rewrite: Polishes the flow and clarity of your sentences while explicitly preserving your original tone (Casual stays casual, Formal stays formal).

Supported Languages:

The correction engine is optimized for English, German, French, Italian, Portuguese, Spanish, Hindi, and Thai. It also performs well on Russian, Chinese, Japanese, and Romanian.

This approach incurs a ~2s latency penalty but uses zero extra VRAM when in Low VRAM mode.


🌎 Universal Translation

Whisper Voice v1.0.1 includes a Neural Translation Engine that allows you to bridge any linguistic gap instantly.

  • Input: Speak in French, Japanese, Russian, or 96 other languages.
  • Output: The engine instantly reconstructs the semantic meaning into fluent English.
  • Task Protocol: Handled via the dedicated F10 channel.

🔍 Why only English translation?

A common question arises: Why can't I translate from French to Japanese?

The architecture of the underlying Whisper model is a Many-to-English design. During its massive training phase (680,000 hours of audio), the translation task was specifically optimized to map the global linguistic commons onto a single bridge language: English. This allowed the model to reach incredible levels of semantic understanding without the exponential complexity of a "Many-to-Many" mapping.

By focusing its translation decoder solely on English, Whisper achieves "Zero-Shot" quality that rivals specialized translation engines while remaining lightweight enough to run on your local GPU.


🕹️ Command & Control

Global Hotkeys

The agent runs silently in the background, waiting for your signal.

  • Transcribe (F9): Opens the channel for standard speech-to-text.
  • Translate (F10): Opens the channel for neural translation.
  • Customization: Remap these keys in Settings. The recorder supports complex chords (e.g. Ctrl + Alt + Space) to fit your workflow.

Injection Protocols

  • Clipboard Paste: Standard text injection. Instant, reliable.
  • Simulate Typing: Mimics physical keystrokes at superhuman speed (6000 CPM). Bypasses anti-paste restrictions and "protected" windows.

📊 Intelligence Matrix

Select the model that aligns with your available resources.

Model VRAM (GPU) RAM (CPU) Designation Capability
Tiny ~500 MB ~1 GB Supersonic Command & Control, older hardware.
Base ~600 MB ~1 GB 🚀 Very Fast Daily driver for low-power laptops.
Small ~1 GB ~2 GB Fast High accuracy English dictation.
Medium ~2 GB ~4 GB ⚖️ Balanced Complex vocabulary, foreign accents.
Large-v3 Turbo ~4 GB ~6 GB Optimal The Sweet Spot. Near-Large intelligence, Medium speed.
Large-v3 ~5 GB ~8 GB 🧠 Maximum Professional grade. Uncompromised.

Note: Acceleration requires you to manually select your Compute Device (CUDA GPU or CPU) in Settings.

📉 Low VRAM Mode

For users with limited GPU memory (e.g., 4GB cards) or those running heavy games simultaneously, Whisper Voice offers a specialized Low VRAM Mode.

  • Behavior: The AI model is aggressively unloaded from the GPU immediately after every transcription.
  • Benefit: When idle, the app consumes near-zero VRAM (~0MB), leaving your GPU completely free for gaming or rendering.
  • Trade-off: There is a "cold start" latency of 1-2 seconds for every voice command as the model reloads from the disk cache.

Accessibility (WCAG 2.2 AAA)

Whisper Voice is built to be usable by everyone. The entire interface has been engineered to meet WCAG 2.2 AAA — the highest tier of accessibility compliance. This is not a checkbox exercise; it is a structural commitment.

Color & Contrast

Every design token is calibrated for Enhanced Contrast (WCAG 1.4.6, 7:1 minimum):

Token Ratio Purpose
textPrimary #FAFAFA ~17:1 Body text, headings
textSecondary #ABABAB 8.1:1 Descriptions, hints
accentPurple #B794F6 7.2:1 Interactive elements, focus rings
borderSubtle 3:1 Non-text contrast for borders and separators

Keyboard Navigation

Full keyboard access — no mouse required:

  • Tab / Shift+Tab: Navigate between all interactive controls (sliders, switches, buttons, dropdowns, text fields).
  • Arrow Keys: Navigate the Settings sidebar tabs.
  • Enter / Space: Activate any focused control.
  • Focus Rings: Every interactive element shows a visible 2px accent-colored focus indicator.

Screen Reader Support

Every component is annotated with semantic roles and descriptive names:

  • Buttons, sliders, checkboxes, combo boxes, text fields — all declare their Accessible.role and Accessible.name.
  • Switches report "on" / "off" state in their accessible name.
  • The loader status uses AlertMessage for live-region announcements.
  • Settings tabs use Tab / PageTab roles matching WAI-ARIA patterns.

Non-Color State Indicators

Toggle switches display I/O marks inside the thumb (not just color changes), ensuring state is perceivable without color vision (WCAG 1.4.1).

Target Sizes

All interactive controls meet the 24px minimum target size (WCAG 2.5.8). Slider handles, buttons, switches, and nav items are all comfortably clickable.

Reduced Motion

A Reduce Motion toggle (Settings > Visuals) disables all decorative animations:

  • Shader effects (gradient blobs, glow, CRT scanlines, rainbow waveform)
  • Particle systems
  • Pulsing animations (mic button, recording timer, border)
  • Loader logo pulse and progress shimmer

The system also respects the Windows "Show animations" preference via SystemParametersInfo detection. Essential information (recording state, progress bars, timer text) remains fully functional.


🛠️ Deployment

📥 Installation

  1. Acquire: Download WhisperVoice.exe from Releases.
  2. Deploy: Place it anywhere. It is portable.
  3. Bootstrap: Run it. The agent will self-provision an isolated Python runtime (~2GB) on first launch.
  4. Sync: Future updates are handled by the Smart Bootstrapper, which surgically updates only changed files, respecting your bandwidth and your settings.

🔧 Troubleshooting

  • App crashes on start: Ensure you have Microsoft Visual C++ Redistributable 2015-2022 installed.
  • "Simulate Typing" is slow: Some applications (remote desktops, legacy games) cannot handle the data stream. Lower the typing speed in Settings to ~1200 CPM.
  • No Audio: The agent listens to the Default Communication Device. Verify your Windows Sound Control Panel.


🌐 Supported Languages

The engine understands the following 99 languages. You can lock the focus to a specific language in Settings to improve accuracy, or rely on Auto-Detect for fluid multilingual usage.

Afrikaans 🇿🇦 Albanian 🇦🇱 Amharic 🇪🇹 Arabic 🇸🇦 Armenian 🇦🇲 Assamese 🇮🇳
Azerbaijani 🇦🇿 Bashkir 🇷🇺 Basque 🇪🇸 Belarusian 🇧🇾 Bengali 🇧🇩 Bosnian 🇧🇦
Breton 🇫🇷 Bulgarian 🇧🇬 Burmese 🇲🇲 Castilian 🇪🇸 Catalan 🇪🇸 Chinese 🇨🇳
Croatian 🇭🇷 Czech 🇨🇿 Danish 🇩🇰 Dutch 🇳🇱 English 🇺🇸 Estonian 🇪🇪
Faroese 🇫🇴 Finnish 🇫🇮 Flemish 🇧🇪 French 🇫🇷 Galician 🇪🇸 Georgian 🇬🇪
German 🇩🇪 Greek 🇬🇷 Gujarati 🇮🇳 Haitian 🇭🇹 Hausa 🇳🇬 Hawaiian 🇺🇸
Hebrew 🇮🇱 Hindi 🇮🇳 Hungarian 🇭🇺 Icelandic 🇮🇸 Indonesian 🇮🇩 Italian 🇮🇹
Japanese 🇯🇵 Javanese 🇮 Indonesa Kannada 🇮🇳 Kazakh 🇰🇿 Khmer 🇰🇭 Korean 🇰🇷
Lao 🇱🇦 Latin 🇻🇦 Latvian 🇱🇻 Lingala 🇨🇩 Lithuanian 🇱🇹 Luxembourgish 🇱🇺
Macedonian 🇲🇰 Malagasy 🇲🇬 Malay 🇲🇾 Malayalam 🇮🇳 Maltese 🇲🇹 Maori 🇳🇿
Marathi 🇮🇳 Moldavian 🇲🇩 Mongolian 🇲🇳 Myanmar 🇲🇲 Nepali 🇳🇵 Norwegian 🇳🇴
Occitan 🇫🇷 Panjabi 🇮🇳 Pashto 🇦🇫 Persian 🇮🇷 Polish 🇵🇱 Portuguese 🇵🇹
Punjabi 🇮🇳 Romanian 🇷🇴 Russian 🇷🇺 Sanskrit 🇮🇳 Serbian 🇷🇸 Shona 🇿🇼
Sindhi 🇵🇰 Sinhala 🇱🇰 Slovak 🇸🇰 Slovenian 🇸🇮 Somali 🇸🇴 Spanish 🇪🇸
Sundanese 🇮🇩 Swahili 🇰🇪 Swedish 🇸🇪 Tagalog 🇵🇭 Tajik 🇹🇯 Tamil 🇮🇳
Tatar 🇷🇺 Telugu 🇮🇳 Thai 🇹🇭 Tibetan 🇨🇳 Turkish 🇹🇷 Turkmen 🇹🇲
Ukrainian 🇺🇦 Urdu 🇵🇰 Uzbek 🇺🇿 Vietnamese 🇻e Welsh 🏴󠁧󠁢󠁷󠁬󠁳󠁿 Yiddish 🇮🇱
Yoruba 🇳🇬


⚖️ PUBLIC DOMAIN (CC0 1.0)

No Rights Reserved. No Gods. No Masters. No Managers.

Credit to OpenAI (Whisper), Systran (Faster-Whisper), and Silero (VAD).