Final documentation polish
111
README.md
@@ -5,12 +5,20 @@
<br>

**Your Voice. Your Machine. Your Data.**
<br>
*A high-performance, locally-run dictation agent for the liberated desktop.*

[](https://git.lashman.live/lashman/whisper_voice/releases/latest)
[](https://creativecommons.org/publicdomain/zero/1.0/)
[](https://git.lashman.live/lashman/whisper_voice/releases/latest)
[](https://creativecommons.org/publicdomain/zero/1.0/)

<br>

<p align="center">
<img src="https://raw.githubusercontent.com/Tarikul-Islam-Anik/Animated-Fluent-Emojis/master/Emojis/Objects/Microphone.png" alt="Microphone" width="100" />
</p>

</div>

@@ -18,49 +26,84 @@

## ✊ The Manifesto

**We hold these truths to be self-evident: That user data is an extension of the self, and its exploitation by centralized clouds is a violation of digital autonomy.**
**We hold these truths to be self-evident:** That user data is an extension of the self, and its exploitation by centralized clouds is a violation of digital autonomy.

Whisper Voice is built on the principle of **technological sovereignty**. It provides state-of-the-art speech recognition without renting your cognitive output to corporate oligarchies. By running entirely on your own hardware, it reclaims the means of digital production, ensuring that your words remain exclusively yours.

> *"The master's tools will never dismantle the master's house."* — Audre Lorde
> <br>**Build your own tools. Run them locally.**

---

## ⚡ Technical Core

Under the hood, Whisper Voice exploits the raw power of **[Faster-Whisper](https://github.com/SYSTRAN/faster-whisper)**, a hyper-optimized implementation of OpenAI's Whisper model using CTranslate2.
Whisper Voice is not a wrapper for an API. It is a fully contained neural inference engine running on your metal.

* **Zero Latency Loop**: By eliminating network round-trips, transcription happens as fast as your hardware can think.
* **Privacy by Physics**: Data physically cannot leave your machine because the engine has no cloud uplink. The cable is cut.
* **Precision Engineering**: Leveraging 8-bit quantization (`int8`) to run professional-grade models on consumer hardware with minimal memory footprint.
### The Engine: Faster-Whisper
We utilize the **CTranslate2** backend, a high-performance inference engine for Transformer models. This allows us to run OpenAI's Whisper architectures with:
* **Up to 4x Speedup** over the reference PyTorch implementation of Whisper.
* **4x Memory Reduction** via 8-bit quantization (`int8`), enabling Pro-grade models on consumer GPUs.
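
The 4x figure is easiest to see as storage arithmetic. The sketch below assumes the baseline is 32-bit float weights (the README does not name the baseline) and uses the approximate parameter counts published in the OpenAI Whisper README. The helper names are hypothetical, and the estimate covers raw weights only, not activations or runtime buffers.

```python
# Approximate parameter counts, in millions, as listed in the OpenAI Whisper README.
PARAMS_MILLIONS = {"tiny": 39, "base": 74, "small": 244, "medium": 769, "large": 1550}

# Storage width of one weight for each numeric type.
BYTES_PER_WEIGHT = {"float32": 4, "float16": 2, "int8": 1}


def weight_size_mb(model: str, dtype: str) -> int:
    """Estimate raw weight storage in MB (10**6 bytes) for a model and dtype."""
    return PARAMS_MILLIONS[model] * BYTES_PER_WEIGHT[dtype]


def reduction_factor(baseline: str, quantized: str) -> float:
    """How many times smaller the quantized weights are than the baseline."""
    return BYTES_PER_WEIGHT[baseline] / BYTES_PER_WEIGHT[quantized]


if __name__ == "__main__":
    for name in PARAMS_MILLIONS:
        fp32 = weight_size_mb(name, "float32")
        int8 = weight_size_mb(name, "int8")
        print(f"{name:>6}: float32 ~{fp32} MB, int8 ~{int8} MB")
    print("float32 -> int8 reduction:", reduction_factor("float32", "int8"))
```

Against a 16-bit float baseline the same arithmetic gives 2x, so the headline number depends on what the unquantized model is compared to.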

### The Sense: Silero VAD
To distinguish human speech from background noise, we employ **Silero VAD** (Voice Activity Detection). This ensures that the agent only listens when you speak, conserving compute resources and preventing hallucinated text from silence.
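
To illustrate the gating idea only, here is a deliberately simpler stand-in: a frame-wise RMS energy gate. This is not Silero VAD, which is a trained neural model, and it is not the project's code. The frame length, threshold, and function names are assumptions chosen for the example.

```python
import math


def rms(frame):
    """Root-mean-square amplitude of one frame of float samples in [-1.0, 1.0]."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def speech_mask(samples, frame_len=512, threshold=0.02):
    """Return one boolean per full frame: True where RMS energy exceeds the threshold."""
    mask = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        mask.append(rms(samples[start:start + frame_len]) > threshold)
    return mask


if __name__ == "__main__":
    # Two frames of digital silence followed by two frames of a 220 Hz tone at 16 kHz.
    silence = [0.0] * 1024
    tone = [0.5 * math.sin(2 * math.pi * 220 * n / 16000) for n in range(1024)]
    print(speech_mask(silence + tone))
```

Only frames flagged `True` would be forwarded to the transcriber. A learned detector such as Silero VAD is preferable in practice because a plain energy threshold also fires on loud non-speech noise.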

### The Interface: Qt 6 (PySide6)
The UI is built with **Qt Quick/QML**, rendering a hardware-accelerated, glassmorphic overlay that feels native to modern desktop environments while remaining completely decoupled from OS spyware.

---

## 📊 Model Performance
## 📊 Model Intelligence

Choose the engine that matches your hardware capabilities.
Select the intelligence level that matches your hardware reality.

| Model | GPU VRAM (rec.) | CPU RAM (rec.) | Relative Speed | Capability |
| Model | GPU VRAM | CPU RAM | Speed | Best For |
| :--- | :--- | :--- | :--- | :--- |
| **Tiny** | ~500 MB | ~1 GB | Supersonic | Quick commands, simple dictation. |
| **Base** | ~600 MB | ~1 GB | Very Fast | Good balance for older hardware. |
| **Small** | ~1 GB | ~2 GB | Fast | Standard driver. High accuracy for English. |
| **Medium** | ~2 GB | ~4 GB | Moderate | High precision. Great for accents. |
| **Large-v3 Turbo** | ~4 GB | ~6 GB | Fast/Moderate | **Best Balance.** Near Large accuracy at much higher speeds. |
| **Large-v3** | ~5 GB | ~8 GB | Heavy | Professional grade. Near-perfect understanding. |
| **Tiny** | ~500 MB | ~1 GB | ⚡ Supersonic | Quick commands, older machinery. |
| **Base** | ~600 MB | ~1 GB | 🚀 Very Fast | Daily driving on low-power laptops. |
| **Small** | ~1 GB | ~2 GB | ⏩ Fast | High accuracy for English dictation. |
| **Medium** | ~2 GB | ~4 GB | ⚖️ Balanced | Complex vocabulary and accents. |
| **Large-v3 Turbo** | ~4 GB | ~6 GB | ✨ **Optimal** | The sweet spot. Large-level smarts, Medium-level speed. |
| **Large-v3** | ~5 GB | ~8 GB | 🧠 Maximum | Professional transcription. Uncompromised quality. |

*Note: CPU inference is significantly slower than GPU but fully supported via highly optimized vector instructions (AVX2).*
*Note: The agent automatically detects your hardware (CUDA GPU or CPU) and optimizes the runtime accordingly.*
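
As a reading aid for the table, the sketch below picks the heaviest model whose recommended GPU VRAM fits a given budget. It is a hypothetical helper, not the agent's actual detection logic. The thresholds are the table's approximate figures with 1 GB taken as 1000 MB, and the model labels are illustrative strings.

```python
# (label, recommended GPU VRAM in MB), ordered lightest to heaviest as in the table.
MODEL_VRAM_MB = [
    ("tiny", 500),
    ("base", 600),
    ("small", 1000),
    ("medium", 2000),
    ("large-v3-turbo", 4000),
    ("large-v3", 5000),
]


def pick_model(free_vram_mb):
    """Return the heaviest model that fits in free_vram_mb, or None if none fits."""
    best = None
    for name, needed_mb in MODEL_VRAM_MB:
        if needed_mb <= free_vram_mb:
            best = name  # later entries are heavier, so keep overwriting
    return best


if __name__ == "__main__":
    for budget in (300, 1000, 4500, 8000):
        print(budget, "MB ->", pick_model(budget))
```

A return value of `None` signals that no GPU model fits, in which case falling back to CPU inference with a smaller model is the natural choice.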

---

## 🛠️ Usage Guide
## 🛠️ Operational Guide

### Installation
1. **Acquire**: Download the latest portable executable from the [Releases](https://git.lashman.live/lashman/whisper_voice/releases) page.
2. **Deploy**: Place `WhisperVoice.exe` in a directory of your choosing.
3. **Initialize**: Run the executable. It will autonomously hydrate its runtime environment (approx. 2 GB) on the first launch.
### Deployment
1. **Download**: Grab the latest `WhisperVoice.exe` from [Releases](https://git.lashman.live/lashman/whisper_voice/releases).
2. **Install**: There is no installation. Place the executable in a directory you control (e.g., `C:\Tools\WhisperVoice`).
3. **Bootstrap**: Run it. The agent will self-provision its own isolated Python environment (~2 GB). This ensures your system PATH remains clean and unpolluted.

### Operation
1. **Configure**: Right-click the **System Tray Icon** to open Settings. Select your **Model Size** and **Compute Device**.
2. **Engage**: Press `F9` (or your custom hotkey) to open the channel.
3. **Dictate**: Speak clearly. The noise gate will isolate your voice.
4. **Execute**: Release the key. The machine interprets the signal and injects the text into your active window immediately.
### Usage
* **Hotkeys**: The default trigger is `F9`. You can rebind this in Settings to any combination (e.g., `Ctrl+Space`, `Alt+V`).
* **Injection Modes**:
    * *Clipboard Paste*: Standard, reliable text insertion.
    * *Simulate Typing*: A stealth mode that physically mimics keystrokes (up to 6000 CPM) to bypass applications that block pasting (e.g., games, remote terminals).
* **Tray Agent**: The app lives in your system tray. Right-click the icon to access Settings or terminate the process.
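
For a sense of what the typing-speed figures mean, characters per minute (CPM) convert directly into a delay between keystrokes. The functions below are a small illustrative calculation with hypothetical names, not part of the application.

```python
def delay_per_char_ms(cpm):
    """Delay between simulated keystrokes, in milliseconds, for a given CPM rate."""
    if cpm <= 0:
        raise ValueError("cpm must be positive")
    return 60_000 / cpm


def typing_time_s(text, cpm):
    """Seconds needed to type `text` at a constant CPM rate."""
    if cpm <= 0:
        raise ValueError("cpm must be positive")
    return len(text) * 60 / cpm


if __name__ == "__main__":
    for rate in (6000, 1200):
        print(f"{rate} CPM -> {delay_per_char_ms(rate)} ms per character")
    print("300 characters at 6000 CPM:", typing_time_s("x" * 300, 6000), "s")
```

At the 6000 CPM ceiling each keystroke is 10 ms apart. Dropping to 1200 CPM, the value suggested in Troubleshooting, widens that to 50 ms, which gives slower applications more time to process each key event.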

### Removal
* **Portable**: To uninstall, simply delete the folder. No registry keys, no hidden services, no trace left behind.

---

## 🔧 Troubleshooting

<details>
<summary><b>The app crashes immediately on start</b></summary>
Ensure you have the <b>Microsoft Visual C++ Redistributable (2015-2022)</b> installed, as the underlying CTranslate2 engine requires these standard libraries.
</details>

<details>
<summary><b>"Simulate Typing" is slow or misses characters</b></summary>
Adjust the <b>Typing Speed</b> slider in Settings. Some older applications cannot handle supersonic 6000 CPM input; try lowering it to 1200 CPM.
</details>

<details>
<summary><b>Microphone not picking up audio</b></summary>
The agent uses your <b>System Default Input Device</b>. Ensure your microphone is set as Default in Windows Sound Settings.
</details>

---

@@ -68,15 +111,15 @@ Choose the engine that matches your hardware capabilities.

**Public Domain (CC0 1.0)**

To the extent possible under law, the creators of this interface have waived all copyright and related or neighboring rights to this work. This tool belongs to the commons.
To the extent possible under law, the creators of this interface have waived all copyright and related or neighboring rights to this work. This tool belongs to the commons. It is a gift to the digital proletariat.

* **Fork it.**
* **Mod it.**
* **Sell it.**
* **Liberate it.**
* **Distribute it.**

### Acknowledgments
While this interface is CC0, it stands on the shoulders of giants:
* **OpenAI Whisper Models**: Released under the MIT License.
* **Faster-Whisper & CTranslate2**: Released under the MIT License.
### Credits
* **OpenAI**: For the Whisper weights (MIT).
* **Systran**: For Faster-Whisper (MIT).
* **Qt Company**: For the UI framework (LGPL).

*No gods, no cloud managers.*