whisper_voice/README.md

<div align="center">

# 🎙️ W H I S P E R &nbsp; V O I C E
### SOVEREIGN SPEECH RECOGNITION

<br>

![Status](https://img.shields.io/badge/STATUS-OPERATIONAL-success?style=for-the-badge&logo=server&color=2ecc71)
[![Download](https://img.shields.io/gitea/v/release/lashman/whisper_voice?gitea_url=https%3A%2F%2Fgit.lashman.live&label=Install&style=for-the-badge&logo=windows&logoColor=white&color=3b82f6)](https://git.lashman.live/lashman/whisper_voice/releases/latest)
[![License](https://img.shields.io/badge/LICENSE-PUBLIC_DOMAIN-lightgrey?style=for-the-badge&logo=creative-commons&logoColor=black)](https://creativecommons.org/publicdomain/zero/1.0/)

<br>

> *"The master's tools will never dismantle the master's house."*
> <br>
> **Build your own tools. Run them locally. Free your mind.**

[View Source](https://git.lashman.live/lashman/whisper_voice) • [Report Issue](https://git.lashman.live/lashman/whisper_voice/issues)

</div>

<br>
<br>

## 📡 The Transmission

We are witnessing the **enshittification** of the digital world. What were once vibrant social commons are being walled off, strip-mined for data, and degraded into rent-seeking silos. Your voice is no longer your own; it is a training set for a corporate oracle that charges you for the privilege of listening.

**Whisper Voice** is a small act of sabotage against this trend.

It is built on the axiom of **Technological Sovereignty**. By moving state-of-the-art inference from the server farms to your own silicon, you reclaim the means of digital production. No telemetry. No subscriptions. No "cloud processing" that eavesdrops on your intent.

---

## ⚡ The Engine

Whisper Voice operates directly on the metal. It is not an API wrapper; it is an autonomous machine.

| Component | Technology | Benefit |
| :--- | :--- | :--- |
| **Inference Core** | **Faster-Whisper** | Hyper-optimized C++ implementation via **CTranslate2**. Delivers **4x velocity** over standard PyTorch. |
| **Compression** | **INT8 quantization** | Enables Pro-grade models (`Large-v3`) to run on consumer-grade GPUs, democratizing elite AI. |
| **Sensory Gate** | **Silero VAD** | Enterprise-grade Voice Activity Detection filters out the noise, ensuring only pure intent is processed. |
| **Interface** | **Qt 6 / QML** | Hardware-accelerated, glassmorphic UI that is fluid, responsive, and sovereign. |

### 🛑 Compatibility Matrix (Windows)
The core engine (`CTranslate2`) is heavily optimized for Nvidia tensor cores.

| Manufacturer | Hardware | Status | Notes |
| :--- | :--- | :--- | :--- |
| **Nvidia** | GTX 900+ / RTX | ✅ **Supported** | Full heavy-metal acceleration. |
| **AMD** | Radeon RX | ⚠️ **CPU Fallback** | Runs on CPU. Valid for `Small/Medium`, slow for `Large`. |
| **Intel** | Arc / Iris | ⚠️ **CPU Fallback** | Runs on CPU. Valid for `Small/Medium`, slow for `Large`. |
| **Apple** | M1 / M2 / M3 | ❌ **Unsupported** | Release is strictly Windows x64. |

> **AMD Users**: v1.0.3 auto-detects GPU failures and silently falls back to CPU.

<br>

## 🖋️ Universal Transcription

At its core, Whisper Voice is the ultimate bridge between thought and text. It listens with superhuman precision, converting spoken word into written form across **99 languages**.

*   **Punctuation Mastery**: Automatically handles capitalization and complex punctuation formatting.
*   **Contextual Intelligence**: Smarter than standard dictation; it understands the flow of sentences to resolve homophones and technical jargon ($1.5k vs "fifteen hundred dollars").
*   **Total Privacy**: Your private dictation, legal notes, or creative writing never leave your RAM.

### Workflow: `F9 (Default)`
The primary channel for native-language transcription. It transcribes precisely what it hears in the language you speak (or the one you've locked in Settings).

### ✨ Style Prompting (New in v1.0.2)
Whisper Voice replaces traditional "grammar correction models" with a native **Style Prompting** engine. By injecting a specific "pre-prompt" into the model's context window, we can guide its internal style without external post-processing.

*   **Standard (Default)**: Forces the model to use full sentences, proper capitalization, and periods. Ideal for dictation.
*   **Casual**: Encourages a relaxed, lowercase style (e.g., "no way that's crazy lol").
*   **Custom**: Allows you to seed the model with your own context (e.g., "Here is a list of medical terms:").

This approach incurs **zero latency penalty** and **zero extra VRAM** usage.

<br>

## 🌎 Universal Translation

Whisper Voice v1.0.1 includes a **Neural Translation Engine** that allows you to bridge any linguistic gap instantly.

*   **Input**: Speak in French, Japanese, Russian, or **96 other languages**.
*   **Output**: The engine instantly reconstructs the semantic meaning into fluent **English**.
*   **Task Protocol**: Handled via the dedicated `F10` channel.

### 🔍 Why only English translation?
A common question arises: *Why can't I translate from French to Japanese?*

The architecture of the underlying Whisper model is a **Many-to-English** design. During its massive training phase (680,000 hours of audio), the translation task was specifically optimized to map the global linguistic commons onto a single bridge language: **English**. This allowed the model to reach incredible levels of semantic understanding without the exponential complexity of a "Many-to-Many" mapping.

By focusing its translation decoder solely on English, Whisper achieves "Zero-Shot" quality that rivals specialized translation engines while remaining lightweight enough to run on your local GPU.

---

## 🕹️ Command & Control

### Global Hotkeys
The agent runs silently in the background, waiting for your signal.

*   **Transcribe (F9)**: Opens the channel for standard speech-to-text.
*   **Translate (F10)**: Opens the channel for neural translation.
*   **Customization**: Remap these keys in Settings. The recorder supports complex chords (e.g. `Ctrl + Alt + Space`) to fit your workflow.

### Injection Protocols
*   **Clipboard Paste**: Standard text injection. Instant, reliable.
*   **Simulate Typing**: Mimics physical keystrokes at superhuman speed (6000 CPM). Bypasses anti-paste restrictions and "protected" windows.

<br>

## 📊 Intelligence Matrix

Select the model that aligns with your available resources.

| Model | VRAM (GPU) | RAM (CPU) | Designation | Capability |
| :--- | :--- | :--- | :--- | :--- |
| `Tiny` | **~500 MB** | ~1 GB | ⚡ **Supersonic** | Command & Control, older hardware. |
| `Base` | **~600 MB** | ~1 GB | 🚀 **Very Fast** | Daily driver for low-power laptops. |
| `Small` | **~1 GB** | ~2 GB | ⏩ **Fast** | High accuracy English dictation. |
| `Medium` | **~2 GB** | ~4 GB | ⚖️ **Balanced** | Complex vocabulary, foreign accents. |
| `Large-v3 Turbo` | **~4 GB** | ~6 GB | ✨ **Optimal** | **The Sweet Spot.** Near-Large intelligence, Medium speed. |
| `Large-v3` | **~5 GB** | ~8 GB | 🧠 **Maximum** | Professional grade. Uncompromised. |

> *Note: Acceleration requires you to manually select your Compute Device (CUDA GPU or CPU) in Settings.*

### 📉 Low VRAM Mode
For users with limited GPU memory (e.g., 4GB cards) or those running heavy games simultaneously, Whisper Voice offers a specialized **Low VRAM Mode**.

*   **Behavior**: The AI model is aggressively unloaded from the GPU immediately after every transcription.
*   **Benefit**: When idle, the app consumes near-zero VRAM (~0MB), leaving your GPU completely free for gaming or rendering.
*   **Trade-off**: There is a "cold start" latency of 1-2 seconds for every voice command as the model reloads from the disk cache.

---

## 🛠️ Deployment

### 📥 Installation
1.  **Acquire**: Download `WhisperVoice.exe` from [Releases](https://git.lashman.live/lashman/whisper_voice/releases).
2.  **Deploy**: Place it anywhere. It is portable.
3.  **Bootstrap**: Run it. The agent will self-provision an isolated Python runtime (~2GB) on first launch.
4.  **Sync**: Future updates are handled by the **Smart Bootstrapper**, which surgically updates only changed files, respecting your bandwidth and your settings.

### 🔧 Troubleshooting
*   **App crashes on start**: Ensure you have [Microsoft Visual C++ Redistributable 2015-2022](https://learn.microsoft.com/en-us/cpp/windows/latest-supported-vc-redist) installed.
*   **"Simulate Typing" is slow**: Some applications (remote desktops, legacy games) cannot handle the data stream. Lower the typing speed in Settings to ~1200 CPM.
*   **No Audio**: The agent listens to the **Default Communication Device**. Verify your Windows Sound Control Panel.

<br>

---

## 🌐 Supported Languages

The engine understands the following 99 languages. You can lock the focus to a specific language in Settings to improve accuracy, or rely on **Auto-Detect** for fluid multilingual usage.

| | | | | | |
| :--- | :--- | :--- | :--- | :--- | :--- |
| Afrikaans 🇿🇦 | Albanian 🇦🇱 | Amharic 🇪🇹 | Arabic 🇸🇦 | Armenian 🇦🇲 | Assamese 🇮🇳 |
| Azerbaijani 🇦🇿 | Bashkir 🇷🇺 | Basque 🇪🇸 | Belarusian 🇧🇾 | Bengali 🇧🇩 | Bosnian 🇧🇦 |
| Breton 🇫🇷 | Bulgarian 🇧🇬 | Burmese 🇲🇲 | Castilian 🇪🇸 | Catalan 🇪🇸 | Chinese 🇨🇳 |
| Croatian 🇭🇷 | Czech 🇨🇿 | Danish 🇩🇰 | Dutch 🇳🇱 | English 🇺🇸 | Estonian 🇪🇪 |
| Faroese 🇫🇴 | Finnish 🇫🇮 | Flemish 🇧🇪 | French 🇫🇷 | Galician 🇪🇸 | Georgian 🇬🇪 |
| German 🇩🇪 | Greek 🇬🇷 | Gujarati 🇮🇳 | Haitian 🇭🇹 | Hausa 🇳🇬 | Hawaiian 🇺🇸 |
| Hebrew 🇮🇱 | Hindi 🇮🇳 | Hungarian 🇭🇺 | Icelandic 🇮🇸 | Indonesian 🇮🇩 | Italian 🇮🇹 |
| Japanese 🇯🇵 | Javanese 🇮 Indonesa | Kannada 🇮🇳 | Kazakh 🇰🇿 | Khmer 🇰🇭 | Korean 🇰🇷 |
| Lao 🇱🇦 | Latin 🇻🇦 | Latvian 🇱🇻 | Lingala 🇨🇩 | Lithuanian 🇱🇹 | Luxembourgish 🇱🇺 |
| Macedonian 🇲🇰 | Malagasy 🇲🇬 | Malay 🇲🇾 | Malayalam 🇮🇳 | Maltese 🇲🇹 | Maori 🇳🇿 |
| Marathi 🇮🇳 | Moldavian 🇲🇩 | Mongolian 🇲🇳 | Myanmar 🇲🇲 | Nepali 🇳🇵 | Norwegian 🇳🇴 |
| Occitan 🇫🇷 | Panjabi 🇮🇳 | Pashto 🇦🇫 | Persian 🇮🇷 | Polish 🇵🇱 | Portuguese 🇵🇹 |
| Punjabi 🇮🇳 | Romanian 🇷🇴 | Russian 🇷🇺 | Sanskrit 🇮🇳 | Serbian 🇷🇸 | Shona 🇿🇼 |
| Sindhi 🇵🇰 | Sinhala 🇱🇰 | Slovak 🇸🇰 | Slovenian 🇸🇮 | Somali 🇸🇴 | Spanish 🇪🇸 |
| Sundanese 🇮🇩 | Swahili 🇰🇪 | Swedish 🇸🇪 | Tagalog 🇵🇭 | Tajik 🇹🇯 | Tamil 🇮🇳 |
| Tatar 🇷🇺 | Telugu 🇮🇳 | Thai 🇹🇭 | Tibetan 🇨🇳 | Turkish 🇹🇷 | Turkmen 🇹🇲 |
| Ukrainian 🇺🇦 | Urdu 🇵🇰 | Uzbek 🇺🇿 | Vietnamese 🇻e | Welsh 🏴󠁧󠁢󠁷󠁬󠁳󠁿 | Yiddish 🇮🇱 |
| Yoruba 🇳🇬 | | | | | |

<br>
<br>

<div align="center">

### ⚖️ PUBLIC DOMAIN (CC0 1.0)
*No Rights Reserved. No Gods. No Masters. No Managers.*

Credit to **OpenAI** (Whisper), **Systran** (Faster-Whisper), and **Silero** (VAD).

</div>