Your Name aa2b0acd86 Add WCAG 2.2 AAA accessibility section to README

2026-02-18 22:30:48 +02:00


🎙️ W H I S P E R V O I C E

SOVEREIGN SPEECH RECOGNITION

"The master's tools will never dismantle the master's house."
Build your own tools. Run them locally. Free your mind.

View Source • Report Issue

📡 The Transmission

We are witnessing the enshittification of the digital world. What were once vibrant social commons are being walled off, strip-mined for data, and degraded into rent-seeking silos. Your voice is no longer your own; it is a training set for a corporate oracle that charges you for the privilege of listening.

Whisper Voice is a small act of sabotage against this trend.

It is built on the axiom of Technological Sovereignty. By moving state-of-the-art inference from the server farms to your own silicon, you reclaim the means of digital production. No telemetry. No subscriptions. No "cloud processing" that eavesdrops on your intent.

⚡ The Engine

Whisper Voice operates directly on the metal. It is not an API wrapper; it is an autonomous machine.

Component	Technology	Benefit
Inference Core	Faster-Whisper	Whisper reimplemented on CTranslate2, an optimized C++ inference engine. Up to 4x faster than the reference PyTorch implementation.
Compression	INT8 quantization	Enables Pro-grade models (`Large-v3`) to run on consumer-grade GPUs, democratizing elite AI.
Sensory Gate	Silero VAD	Enterprise-grade Voice Activity Detection filters out the noise, ensuring only pure intent is processed.
Interface	Qt 6 / QML	Hardware-accelerated, glassmorphic UI that is fluid, responsive, and sovereign.
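
As a rough illustration of how the first three rows can fit together, the sketch below loads a Faster-Whisper model with INT8 weights and switches on the library's built-in Silero VAD filter. The helper names `engine_options` and `transcribe_file` are hypothetical and not taken from the Whisper Voice source; the app's real pipeline may differ.

```python
def engine_options(device: str = "cuda") -> dict:
    """Keyword arguments for faster_whisper.WhisperModel (illustrative defaults)."""
    if device not in ("cuda", "cpu"):
        raise ValueError(f"unsupported device: {device!r}")
    # "int8" asks CTranslate2 for 8-bit quantized weights on either device.
    return {"device": device, "compute_type": "int8"}


def transcribe_file(path: str, model_size: str = "large-v3", device: str = "cuda") -> str:
    """Transcribe one audio file and return the joined segment text."""
    # Imported lazily so engine_options() works without the package installed.
    from faster_whisper import WhisperModel

    model = WhisperModel(model_size, **engine_options(device))
    # vad_filter=True runs Silero VAD first and skips non-speech audio.
    segments, _info = model.transcribe(path, vad_filter=True)
    return " ".join(segment.text.strip() for segment in segments)
```

Calling `transcribe_file("note.wav", device="cpu")` would run the same path without a GPU; the file name is only a placeholder.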

🛑 Compatibility Matrix (Windows)

The core engine (CTranslate2) is heavily optimized for Nvidia tensor cores.

Manufacturer	Hardware	Status	Notes
Nvidia	GTX 900+ / RTX	✅ Supported	Full heavy-metal acceleration.
AMD	Radeon RX	⚠️ CPU Fallback	Runs on CPU. Valid for `Small/Medium`, slow for `Large`.
Intel	Arc / Iris	⚠️ CPU Fallback	Runs on CPU. Valid for `Small/Medium`, slow for `Large`.
Apple	M1 / M2 / M3	❌ Unsupported	Release is strictly Windows x64.

AMD Users: v1.0.3 auto-detects GPU failures and silently falls back to CPU.
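
One way such a fallback could work is to try each device in order and keep the first one that loads. This is a minimal sketch under that assumption; `load_with_fallback` and its `load` callback are hypothetical names, not the app's actual code.

```python
from typing import Callable, Sequence, Tuple, TypeVar

T = TypeVar("T")


def load_with_fallback(
    load: Callable[[str], T],
    devices: Sequence[str] = ("cuda", "cpu"),
) -> Tuple[T, str]:
    """Try each device in order; return the loaded model and the device used."""
    last_error = None
    for device in devices:
        try:
            return load(device), device
        except Exception as error:  # e.g. missing CUDA runtime or unsupported GPU
            last_error = error  # remember why this device failed, then move on
    raise RuntimeError("no usable compute device") from last_error
```

The caller passes whatever function actually builds the model for a given device name, so the same helper works with a real loader or with a stub in tests.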

🖋️ Universal Transcription

At its core, Whisper Voice is the ultimate bridge between thought and text. It listens with superhuman precision, converting spoken word into written form across 99 languages.

Punctuation Mastery: Automatically handles capitalization and complex punctuation formatting.
Contextual Intelligence: Smarter than standard dictation; it understands the flow of sentences to resolve homophones and technical jargon ($1.5k vs "fifteen hundred dollars").
Total Privacy: Your private dictation, legal notes, or creative writing never leave your RAM.

Workflow: `F9 (Default)`

The primary channel for native-language transcription. It transcribes precisely what it hears in the language you speak (or the one you've locked in Settings).

🧠 Intelligent Correction (New in v1.1.0)

Whisper Voice now integrates a local Llama 3.2 1B LLM to act as a "Silent Consultant". It post-processes transcripts to fix grammar or polish style without ever "chatting" back.

It is strictly trained on a Forensic Protocol: it will never lecture you, never refuse to process explicit language, and never sanitize your words. Your profanity is yours to keep.

Correction Modes:

Standard (Default): Fixes grammar, punctuation, and capitalization while keeping every word you said.
Grammar Only: Strictly fixes objective errors (spelling/agreement). Touches nothing else.
Rewrite: Polishes the flow and clarity of your sentences while explicitly preserving your original tone (Casual stays casual, Formal stays formal).

Supported Languages:

The correction engine is optimized for English, German, French, Italian, Portuguese, Spanish, Hindi, and Thai. It also performs well on Russian, Chinese, Japanese, and Romanian.

Correction adds roughly 2 seconds of latency per transcript, but uses zero extra VRAM when Low VRAM mode is enabled.

🌎 Universal Translation

Whisper Voice v1.0.1 includes a Neural Translation Engine that allows you to bridge any linguistic gap instantly.

Input: Speak in French, Japanese, Russian, or 96 other languages.
Output: The engine instantly reconstructs the semantic meaning into fluent English.
Task Protocol: Handled via the dedicated F10 channel.
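
In Faster-Whisper terms, the two channels correspond to the `task` argument of `WhisperModel.transcribe`, which accepts `"transcribe"` or `"translate"`. The sketch below assumes a simple key-to-task table; `HOTKEY_TASKS`, `task_for_hotkey`, and `run_channel` are hypothetical names used only for illustration.

```python
# Default bindings described in this README; the dict itself is illustrative.
HOTKEY_TASKS = {"F9": "transcribe", "F10": "translate"}


def task_for_hotkey(key: str) -> str:
    """Map a hotkey name to the Whisper task it triggers."""
    return HOTKEY_TASKS[key.upper()]


def run_channel(model, audio_path: str, key: str, language=None) -> str:
    """Run one recording through the channel selected by `key`.

    `model` is expected to behave like faster_whisper.WhisperModel.
    language=None leaves language detection to the model; a code such as
    "fr" locks it.
    """
    segments, _info = model.transcribe(
        audio_path, task=task_for_hotkey(key), language=language
    )
    return " ".join(segment.text.strip() for segment in segments)
```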

🔍 Why only English translation?

A common question arises: Why can't I translate from French to Japanese?

The underlying Whisper model is a Many-to-English design. During its massive training phase (680,000 hours of audio for the original release), the translation task was specifically optimized to map the global linguistic commons onto a single bridge language: English. This allowed the model to reach incredible levels of semantic understanding without the quadratic blow-up of a "Many-to-Many" mapping: 99 languages mean 98 directions into English, but 9,702 ordered language pairs.

By focusing its translation decoder solely on English, Whisper achieves "Zero-Shot" quality that rivals specialized translation engines while remaining lightweight enough to run on your local GPU.

🕹️ Command & Control

Global Hotkeys

The agent runs silently in the background, waiting for your signal.

Transcribe (F9): Opens the channel for standard speech-to-text.
Translate (F10): Opens the channel for neural translation.
Customization: Remap these keys in Settings. The recorder supports complex chords (e.g. Ctrl + Alt + Space) to fit your workflow.

Injection Protocols

Clipboard Paste: Standard text injection. Instant, reliable.
Simulate Typing: Mimics physical keystrokes at superhuman speed (6000 CPM). Bypasses anti-paste restrictions and "protected" windows.
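
The CPM figure translates directly into a per-keystroke delay: 60 seconds divided by characters per minute. A small sketch of that arithmetic follows; the function names are illustrative.

```python
def keystroke_delay(cpm: int) -> float:
    """Seconds to wait between keystrokes at a given characters-per-minute rate."""
    if cpm <= 0:
        raise ValueError("cpm must be positive")
    return 60.0 / cpm


def typing_time(text: str, cpm: int) -> float:
    """Approximate seconds needed to type `text` at `cpm`."""
    return len(text) * keystroke_delay(cpm)
```

At 6000 CPM the delay is 10 ms per character, so a 200-character sentence takes about 2 seconds; at 1200 CPM the delay is 50 ms and the same sentence takes about 10 seconds.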

📊 Intelligence Matrix

Select the model that aligns with your available resources.

Model	VRAM (GPU)	RAM (CPU)	Designation	Capability
`Tiny`	~500 MB	~1 GB	⚡ Supersonic	Command & Control, older hardware.
`Base`	~600 MB	~1 GB	🚀 Very Fast	Daily driver for low-power laptops.
`Small`	~1 GB	~2 GB	⏩ Fast	High accuracy English dictation.
`Medium`	~2 GB	~4 GB	⚖️ Balanced	Complex vocabulary, foreign accents.
`Large-v3 Turbo`	~4 GB	~6 GB	✨ Optimal	The Sweet Spot. Near-Large intelligence, Medium speed.
`Large-v3`	~5 GB	~8 GB	🧠 Maximum	Professional grade. Uncompromised.

Note: Acceleration requires you to manually select your Compute Device (CUDA GPU or CPU) in Settings.
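
If you would rather derive the choice from a number, the table can be read as a simple lookup. The sketch below copies the approximate GPU figures from the table (treating 1 GB as 1024 MB) and returns the largest model that fits; `largest_model_for` and `headroom_mb` are hypothetical, and the app itself does not necessarily pick a model this way.

```python
# (model name, approximate VRAM in MB), smallest to largest, from the table above.
MODEL_VRAM_MB = [
    ("Tiny", 500),
    ("Base", 600),
    ("Small", 1024),
    ("Medium", 2048),
    ("Large-v3 Turbo", 4096),
    ("Large-v3", 5120),
]


def largest_model_for(vram_mb: int, headroom_mb: int = 0):
    """Return the largest model whose VRAM estimate fits, or None if none does.

    headroom_mb reserves memory for other GPU work (games, rendering).
    """
    budget = vram_mb - headroom_mb
    best = None
    for name, needed in MODEL_VRAM_MB:
        if needed <= budget:
            best = name  # list is ordered, so later matches are larger models
    return best
```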

📉 Low VRAM Mode

For users with limited GPU memory (e.g., 4GB cards) or those running heavy games simultaneously, Whisper Voice offers a specialized Low VRAM Mode.

Behavior: The AI model is aggressively unloaded from the GPU immediately after every transcription.
Benefit: When idle, the app consumes near-zero VRAM (~0MB), leaving your GPU completely free for gaming or rendering.
Trade-off: There is a "cold start" latency of 1-2 seconds for every voice command as the model reloads from the disk cache.
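
The load-use-unload cycle described above can be sketched as a small wrapper around whatever function builds the model. `ModelHost` is a hypothetical name, and actual VRAM release depends on the backend freeing memory once the last reference is gone.

```python
import gc
from contextlib import contextmanager


class ModelHost:
    """Keeps a model resident, or reloads it per request in Low VRAM mode."""

    def __init__(self, load, low_vram: bool = False):
        self._load = load          # zero-argument callable that builds the model
        self._low_vram = low_vram
        self._model = None

    @property
    def loaded(self) -> bool:
        return self._model is not None

    @contextmanager
    def session(self):
        if self._model is None:
            self._model = self._load()  # the "cold start" cost is paid here
        try:
            yield self._model
        finally:
            if self._low_vram:
                # Drop the host's reference so the backend can free the memory
                # once the caller's own reference goes away too.
                self._model = None
                gc.collect()
```

With `low_vram=False` the first call pays the load cost and later calls reuse the resident model; with `low_vram=True` every `with host.session()` block reloads it.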

♿ Accessibility (WCAG 2.2 AAA)

Whisper Voice is built to be usable by everyone. The entire interface has been engineered to meet WCAG 2.2 AAA — the highest tier of accessibility compliance. This is not a checkbox exercise; it is a structural commitment.

Color & Contrast

Every design token is calibrated for Enhanced Contrast (WCAG 1.4.6, 7:1 minimum):

Token	Ratio	Purpose
`textPrimary` #FAFAFA	~17:1	Body text, headings
`textSecondary` #ABABAB	8.1:1	Descriptions, hints
`accentPurple` #B794F6	7.2:1	Interactive elements, focus rings
`borderSubtle`	3:1	Non-text contrast for borders and separators
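
Ratios like these come from the WCAG relative-luminance formula and always depend on the background a token is drawn on, which the table does not list. The sketch below implements the formula so the figures can be re-checked against whatever background the theme uses; the function names are illustrative.

```python
def _linear(channel_8bit: int) -> float:
    """Convert one 8-bit sRGB channel to linear light (WCAG definition)."""
    c = channel_8bit / 255.0
    # The 0.04045 vs 0.03928 threshold difference never affects 8-bit values.
    return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(hex_color: str) -> float:
    """Relative luminance of a '#RRGGBB' color, from 0.0 (black) to 1.0 (white)."""
    h = hex_color.lstrip("#")
    if len(h) != 6:
        raise ValueError(f"expected #RRGGBB, got {hex_color!r}")
    r, g, b = (int(h[i:i + 2], 16) for i in (0, 2, 4))
    return 0.2126 * _linear(r) + 0.7152 * _linear(g) + 0.0722 * _linear(b)


def contrast_ratio(foreground: str, background: str) -> float:
    """WCAG contrast ratio between two colors, from 1.0 up to 21.0."""
    l1 = relative_luminance(foreground)
    l2 = relative_luminance(background)
    lighter, darker = max(l1, l2), min(l1, l2)
    return (lighter + 0.05) / (darker + 0.05)


def meets_enhanced_contrast(foreground: str, background: str) -> bool:
    """True if the pair reaches the 7:1 threshold of WCAG 1.4.6."""
    return contrast_ratio(foreground, background) >= 7.0
```

For example, `meets_enhanced_contrast("#FAFAFA", bg)` checks the `textPrimary` token against a background color `bg` that you supply.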

Full keyboard access — no mouse required:

Tab / Shift+Tab: Navigate between all interactive controls (sliders, switches, buttons, dropdowns, text fields).
Arrow Keys: Navigate the Settings sidebar tabs.
Enter / Space: Activate any focused control.
Focus Rings: Every interactive element shows a visible 2px accent-colored focus indicator.

Every component is annotated with semantic roles and descriptive names:

Buttons, sliders, checkboxes, combo boxes, text fields — all declare their Accessible.role and Accessible.name.
Switches report "on" / "off" state in their accessible name.
The loader status uses AlertMessage for live-region announcements.
Settings tabs use Tab / PageTab roles matching WAI-ARIA patterns.

Non-Color State Indicators

Toggle switches display I/O marks inside the thumb (not just color changes), ensuring state is perceivable without color vision (WCAG 1.4.1).

Target Sizes

All interactive controls meet the 24 by 24 pixel minimum target size (WCAG 2.5.8). Slider handles, buttons, switches, and nav items are all comfortably clickable.

Reduced Motion

A Reduce Motion toggle (Settings > Visuals) disables all decorative animations:

Shader effects (gradient blobs, glow, CRT scanlines, rainbow waveform)
Particle systems
Pulsing animations (mic button, recording timer, border)
Loader logo pulse and progress shimmer

The system also respects the Windows "Show animations" preference via SystemParametersInfo detection. Essential information (recording state, progress bars, timer text) remains fully functional.
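
For reference, here is one way to query that preference from Python via `ctypes`. It assumes the relevant flag is `SPI_GETCLIENTAREAANIMATION` (0x1042), which is the Win32 parameter for client-area animations; the README does not say which flag the app reads, and both function names are hypothetical.

```python
import ctypes
import sys

SPI_GETCLIENTAREAANIMATION = 0x1042  # Win32 constant; assumed to be the one used


def windows_animations_enabled():
    """Return True/False on Windows, or None elsewhere or if the query fails."""
    if sys.platform != "win32":
        return None
    enabled = ctypes.c_int(0)  # receives a Win32 BOOL
    ok = ctypes.windll.user32.SystemParametersInfoW(
        SPI_GETCLIENTAREAANIMATION, 0, ctypes.byref(enabled), 0
    )
    return bool(enabled.value) if ok else None


def should_reduce_motion(user_toggle: bool, os_animations) -> bool:
    """Reduce motion if the user asked for it or the OS has animations off."""
    return user_toggle or os_animations is False
```

Treating an unknown OS state (`None`) as "animations allowed" keeps the in-app toggle as the deciding factor on systems where the query is unavailable.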

🛠️ Deployment

📥 Installation

Acquire: Download WhisperVoice.exe from Releases.
Deploy: Place it anywhere. It is portable.
Bootstrap: Run it. The agent will self-provision an isolated Python runtime (~2GB) on first launch.
Sync: Future updates are handled by the Smart Bootstrapper, which surgically updates only changed files, respecting your bandwidth and your settings.
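
Updating only changed files is commonly done by comparing local file hashes against a manifest. The sketch below shows that general technique under the assumption of a SHA-256 manifest keyed by relative path; the manifest format and the names `sha256_of` and `files_to_update` are hypothetical, not the Smart Bootstrapper's documented behavior.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hex SHA-256 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def files_to_update(install_dir: Path, manifest: dict) -> list:
    """Return relative paths that are missing locally or differ from the manifest.

    `manifest` maps relative path -> expected SHA-256 hex digest.
    """
    stale = []
    for relative_path, expected in manifest.items():
        local = install_dir / relative_path
        if not local.is_file() or sha256_of(local) != expected:
            stale.append(relative_path)
    return sorted(stale)
```

Files that are not named in the manifest, such as a user settings file, are never touched by this comparison.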

🔧 Troubleshooting

App crashes on start: Ensure you have Microsoft Visual C++ Redistributable 2015-2022 installed.
"Simulate Typing" is slow: Some applications (remote desktops, legacy games) cannot keep up with keystrokes arriving at full speed. Lower the typing speed in Settings to ~1200 CPM.
No Audio: The agent listens to the Default Communication Device. Check which input device is set as default in the Windows Sound Control Panel.

🌐 Supported Languages

The engine understands the following 99 languages. You can lock the focus to a specific language in Settings to improve accuracy, or rely on Auto-Detect for fluid multilingual usage.


Afrikaans 🇿🇦	Albanian 🇦🇱	Amharic 🇪🇹	Arabic 🇸🇦	Armenian 🇦🇲	Assamese 🇮🇳
Azerbaijani 🇦🇿	Bashkir 🇷🇺	Basque 🇪🇸	Belarusian 🇧🇾	Bengali 🇧🇩	Bosnian 🇧🇦
Breton 🇫🇷	Bulgarian 🇧🇬	Burmese 🇲🇲	Castilian 🇪🇸	Catalan 🇪🇸	Chinese 🇨🇳
Croatian 🇭🇷	Czech 🇨🇿	Danish 🇩🇰	Dutch 🇳🇱	English 🇺🇸	Estonian 🇪🇪
Faroese 🇫🇴	Finnish 🇫🇮	Flemish 🇧🇪	French 🇫🇷	Galician 🇪🇸	Georgian 🇬🇪
German 🇩🇪	Greek 🇬🇷	Gujarati 🇮🇳	Haitian 🇭🇹	Hausa 🇳🇬	Hawaiian 🇺🇸
Hebrew 🇮🇱	Hindi 🇮🇳	Hungarian 🇭🇺	Icelandic 🇮🇸	Indonesian 🇮🇩	Italian 🇮🇹
Japanese 🇯🇵	Javanese 🇮🇩	Kannada 🇮🇳	Kazakh 🇰🇿	Khmer 🇰🇭	Korean 🇰🇷
Lao 🇱🇦	Latin 🇻🇦	Latvian 🇱🇻	Lingala 🇨🇩	Lithuanian 🇱🇹	Luxembourgish 🇱🇺
Macedonian 🇲🇰	Malagasy 🇲🇬	Malay 🇲🇾	Malayalam 🇮🇳	Maltese 🇲🇹	Maori 🇳🇿
Marathi 🇮🇳	Moldavian 🇲🇩	Mongolian 🇲🇳	Myanmar 🇲🇲	Nepali 🇳🇵	Norwegian 🇳🇴
Occitan 🇫🇷	Panjabi 🇮🇳	Pashto 🇦🇫	Persian 🇮🇷	Polish 🇵🇱	Portuguese 🇵🇹
Punjabi 🇮🇳	Romanian 🇷🇴	Russian 🇷🇺	Sanskrit 🇮🇳	Serbian 🇷🇸	Shona 🇿🇼
Sindhi 🇵🇰	Sinhala 🇱🇰	Slovak 🇸🇰	Slovenian 🇸🇮	Somali 🇸🇴	Spanish 🇪🇸
Sundanese 🇮🇩	Swahili 🇰🇪	Swedish 🇸🇪	Tagalog 🇵🇭	Tajik 🇹🇯	Tamil 🇮🇳
Tatar 🇷🇺	Telugu 🇮🇳	Thai 🇹🇭	Tibetan 🇨🇳	Turkish 🇹🇷	Turkmen 🇹🇲
Ukrainian 🇺🇦	Urdu 🇵🇰	Uzbek 🇺🇿	Vietnamese 🇻🇳	Welsh 🏴󠁧󠁢󠁷󠁬󠁳󠁿	Yiddish 🇮🇱
Yoruba 🇳🇬

⚖️ PUBLIC DOMAIN (CC0 1.0)

No Rights Reserved. No Gods. No Masters. No Managers.

Credit to OpenAI (Whisper), Systran (Faster-Whisper), and Silero (VAD).
