Aesthetic overhaul of documentation
190
README.md
@@ -1,129 +1,120 @@
|
|||||||
<div align="center">
|
<div align="center">
|
||||||
|
|
||||||
# WHISPER VOICE
|
# 🎙️ W H I S P E R V O I C E
|
||||||
### SOVEREIGN SPEECH RECOGNITION
|
### SOVEREIGN SPEECH RECOGNITION
|
||||||
|
|
||||||
<br>
|
<br>
|
||||||
|
|
||||||

|

|
||||||
|
|
||||||
**Your Voice. Your Machine. Your Data.**
|
|
||||||
<br>
|
|
||||||
*A high-performance, locally-run dictation agent for the liberated desktop.*
|
|
||||||
|
|
||||||
[](https://git.lashman.live/lashman/whisper_voice/releases/latest)
|
[](https://git.lashman.live/lashman/whisper_voice/releases/latest)
|
||||||
[](https://creativecommons.org/publicdomain/zero/1.0/)
|
[](https://creativecommons.org/publicdomain/zero/1.0/)
|
||||||
|
|
||||||
<br>
|
<br>
|
||||||
|
|
||||||
<p align="center">
|
> *"The master's tools will never dismantle the master's house."* — Audre Lorde
|
||||||
<img src="https://raw.githubusercontent.com/Tarikul-Islam-Anik/Animated-Fluent-Emojis/master/Emojis/Objects/Microphone.png" alt="Microphone" width="100" />
|
> <br>
|
||||||
</p>
|
> **Build your own tools. Run them locally.**
|
||||||
|
|
||||||
|
[Report Issue](https://git.lashman.live/lashman/whisper_voice/issues) • [View Source](https://git.lashman.live/lashman/whisper_voice) • [Releases](https://git.lashman.live/lashman/whisper_voice/releases)
|
||||||
|
|
||||||
</div>
|
</div>
|
||||||
|
|
||||||
---
|
<br>
|
||||||
|
|
||||||
## ✊ The Manifesto
|
## ✊ The Manifesto
|
||||||
|
|
||||||
**We hold these truths to be self-evident:** That user data is an extension of the self, and its exploitation by centralized clouds is a violation of digital autonomy.
|
**We hold these truths to be self-evident:** That user data is an extension of the self, and its exploitation by centralized clouds is a violation of digital autonomy.
|
||||||
|
|
||||||
Whisper Voice is built on the principle of **technological sovereignty**. It provides state-of-the-art speech recognition without renting your cognitive output to corporate oligarchies. By running entirely on your own hardware, it reclaims the means of digital production, ensuring that your words remain exclusively yours.
|
**Whisper Voice** is built on the principle of **technological sovereignty**. It provides state-of-the-art speech recognition without renting your cognitive output to corporate oligarchies. By running entirely on your own hardware, it reclaims the means of digital production, ensuring that your words remain exclusively yours.
|
||||||
|
|
||||||
> *"The master's tools will never dismantle the master's house."* — Audre Lorde
|
|
||||||
> <br>**Build your own tools. Run them locally.**
|
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## ⚡ Technical Core
|
## ⚡ Technical Architecture
|
||||||
|
|
||||||
Whisper Voice is not a wrapper for an API. It is a fully contained neural inference engine running on your metal.
|
This operates on the metal. It is not a wrapper. It is an engine.
|
||||||
|
|
||||||
### The Engine: Faster-Whisper
|
| Component | Technology | Benefit |
|
||||||
We utilize the **CTranslate2** backend—a high-performance inference engine for Transformer models. This allows us to run OpenAI's Whisper architectures with:
|
| :--- | :--- | :--- |
|
||||||
* **4x Speedup** over standard PyTorch implementations.
|
| **Inference Core** | **Faster-Whisper** | Hyper-optimized implementation of OpenAI's Whisper using **CTranslate2**. Delivers **4x speedups** over PyTorch. |
|
||||||
* **4x Memory Reduction** via 8-bit quantization (`int8`), enabling Pro-grade models on consumer GPUs.
|
| **Quantization** | **INT8** | 8-bit quantization enables Pro-grade models (`Large-v3`) to run on consumer GPUs with minimal VRAM. |
|
||||||
|
| **Sensory Gate** | **Silero VAD** | Enterprise-grade Voice Activity Detection filters out silence and background noise, conserving compute. |
|
||||||
### The Sense: Silero VAD
|
| **Interface** | **Qt 6 / QML** | Hardware-accelerated, glassmorphic UI that feels native yet remains OS-independent. |
|
||||||
To distinguish human speech from background noise, we employ **Silero VAD** (Voice Activity Detection). This ensures that the agent only listens when you speak, conserving compute resources and preventing hallucinated text from silence.
|
|
||||||
|
|
||||||
### The Interface: Qt 6 (PySide6)
|
|
||||||
The UI is built with **Qt Quick/QML**, rendering a hardware-accelerated, glassmorphic overlay that feels native to modern desktop environments while remaining completely decoupled from OS spyware.
|
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## 📊 Model Intelligence
|
## 📊 Intelligence Matrix
|
||||||
|
|
||||||
Select the intelligence level that matches your hardware reality.
|
Select the model that aligns with your hardware capabilities.
|
||||||
|
|
||||||
| Model | GPU VRAM | CPU RAM | Speed | Best For |
|
| Model | VRAM (GPU) | RAM (CPU) | Velocity | Designation |
|
||||||
| :--- | :--- | :--- | :--- | :--- |
|
| :--- | :--- | :--- | :--- | :--- |
|
||||||
| **Tiny** | ~500 MB | ~1 GB | ⚡ Supersonic | Quick commands, older machinery. |
|
| `Tiny` | **~500 MB** | ~1 GB | ⚡ **Supersonic** | Command & Control, older hardware. |
|
||||||
| **Base** | ~600 MB | ~1 GB | 🚀 Very Fast | Daily driving on low-power laptops. |
|
| `Base` | **~600 MB** | ~1 GB | 🚀 **Very Fast** | Daily driver for low-power laptops. |
|
||||||
| **Small** | ~1 GB | ~2 GB | ⏩ Fast | High accuracy for English dictation. |
|
| `Small` | **~1 GB** | ~2 GB | ⏩ **Fast** | High accuracy English dictation. |
|
||||||
| **Medium** | ~2 GB | ~4 GB | ⚖️ Balanced | Complex vocabulary and accents. |
|
| `Medium` | **~2 GB** | ~4 GB | ⚖️ **Balanced** | Complex vocabulary, foreign accents. |
|
||||||
| **Large-v3 Turbo** | ~4 GB | ~6 GB | ✨ **Optimal** | The sweet spot. Large-level smarts, Medium-level speed. |
|
| `Large-v3 Turbo` | **~4 GB** | ~6 GB | ✨ **Optimal** | **Sweet Spot.** Near-Large smarts, Medium speed. |
|
||||||
| **Large-v3** | ~5 GB | ~8 GB | 🧠 Maximum | Professional transcription. Uncompromised quality. |
|
| `Large-v3` | **~5 GB** | ~8 GB | 🧠 **Maximum** | Professional transcription. Uncompromised. |
|
||||||
|
|
||||||
*Note: You must select your available Compute Device (CUDA GPU or CPU) in the Settings to enable acceleration.*
|
> *Note: Acceleration requires you to manually select your Compute Device (CUDA GPU or CPU) in Settings.*
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## 🛠️ Operational Guide
|
## 🛠️ Operations
|
||||||
|
|
||||||
### Deployment
|
### 📥 Deployment
|
||||||
1. **Download**: Grab the latest `WhisperVoice.exe` from [Releases](https://git.lashman.live/lashman/whisper_voice/releases).
|
1. **Download**: Grab `WhisperVoice.exe` from [Releases](https://git.lashman.live/lashman/whisper_voice/releases).
|
||||||
2. **Install**: There is no installation. Place the executable in a directory you control (e.g., `C:\Tools\WhisperVoice`).
|
2. **Deploy**: Place it anywhere. It is portable.
|
||||||
3. **Bootstrap**: Run it. The agent will self-provision its own isolated Python environment (~2GB). This ensures your system PATH remains clean and unpolluted.
|
3. **Bootstrap**: Run it. The agent will self-provision an isolated Python environment (~2GB) on first launch.
|
||||||
|
|
||||||
### Usage
|
### 🕹️ Controls
|
||||||
* **Hotkeys**: The default trigger is `F9`. You can rebind this in Settings to any combination (e.g., `Ctrl+Space`, `Alt+V`).
|
* **Global Hook**: `F9` (Default). Press to open the channel. Release to inject text.
|
||||||
* **Injection Modes**:
|
* **Tray Agent**: Retracts to the system tray. Right-click for **Settings** or **File Transcription**.
|
||||||
* *Clipboard Paste*: Standard, reliable text insertion.
|
|
||||||
* *Simulate Typing*: A stealth mode that physically mimics keystrokes (up to 6000 CPM) to bypass applications that block pasting (e.g., games, remote terminals).
|
|
||||||
* **Tray Agent**: The app lives in your system tray. Right-click the icon to access Settings or terminate the process.
|
|
||||||
|
|
||||||
### Advanced Features
|
### 📡 Input Modes
|
||||||
* **File Transcription**: Need to transcribe a pre-recorded audio file? Right-click the **System Tray Icon** and select **Transcribe File**. Supports `.wav`, `.mp3`, `.m4a`, and most common formats.
|
| Mode | Description | Speed |
|
||||||
|
| :--- | :--- | :--- |
|
||||||
|
| **Clipboard Paste** | Standard text injection via OS clipboard. | Instant |
|
||||||
|
| **Simulate Typing** | Mimics physical keystrokes. Bypasses anti-paste blocks. | Up to **6000** CPM |
|
||||||
|
|
||||||
### 🌐 Supported Languages
|
---
|
||||||
The model is trained on 680,000 hours of multilingual data and supports the following languages with high accuracy:
|
|
||||||
|
## 🌐 Universal Translation
|
||||||
|
|
||||||
|
The model listens in **99 languages** and translates them to English or transcribes them natively.
|
||||||
|
|
||||||
<details>
|
<details>
|
||||||
<summary><b>Click to expand full list (99 Languages)</b></summary>
|
<summary><b>Click to view supported languages</b></summary>
|
||||||
|
<br>
|
||||||
|
|
||||||
| | | | |
|
| | | | |
|
||||||
| :--- | :--- | :--- | :--- |
|
| :--- | :--- | :--- | :--- |
|
||||||
| Afrikaans | Albanian | Amharic | Arabic |
|
| Afrikaans 🇿🇦 | Albanian 🇦🇱 | Amharic 🇪🇹 | Arabic 🇸🇦 |
|
||||||
| Armenian | Assamese | Azerbaijani | Bashkir |
|
| Armenian 🇦🇲 | Assamese 🇮🇳 | Azerbaijani 🇦🇿 | Bashkir 🇷🇺 |
|
||||||
| Basque | Belarusian | Bengali | Bosnian |
|
| Basque 🇪🇸 | Belarusian 🇧🇾 | Bengali 🇧🇩 | Bosnian 🇧🇦 |
|
||||||
| Breton | Bulgarian | Burmese | Castilian |
|
| Breton 🇫🇷 | Bulgarian 🇧🇬 | Burmese 🇲🇲 | Castilian 🇪🇸 |
|
||||||
| Catalan | Chinese | Croatian | Czech |
|
| Catalan 🇪🇸 | Chinese 🇨🇳 | Croatian 🇭🇷 | Czech 🇨🇿 |
|
||||||
| Danish | Dutch | English | Estonian |
|
| Danish 🇩🇰 | Dutch 🇳🇱 | English 🇺🇸 | Estonian 🇪🇪 |
|
||||||
| Faroese | Finnish | Flemish | French |
|
| Faroese 🇫🇴 | Finnish 🇫🇮 | Flemish 🇧🇪 | French 🇫🇷 |
|
||||||
| Galician | Georgian | German | Greek |
|
| Galician 🇪🇸 | Georgian 🇬🇪 | German 🇩🇪 | Greek 🇬🇷 |
|
||||||
| Gujarati | Haitian | Haitian Creole | Hausa |
|
| Gujarati 🇮🇳 | Haitian 🇭🇹 | Hausa 🇳🇬 | Hawaiian 🇺🇸 |
|
||||||
| Hawaiian | Hebrew | Hindi | Hungarian |
|
| Hebrew 🇮🇱 | Hindi 🇮🇳 | Hungarian 🇭🇺 | Icelandic 🇮🇸 |
|
||||||
| Icelandic | Indonesian | Italian | Japanese |
|
| Indonesian 🇮🇩 | Italian 🇮🇹 | Japanese 🇯🇵 | Javanese 🇮🇩 |
|
||||||
| Javanese | Kannada | Kazakh | Khmer |
|
| Kannada 🇮🇳 | Kazakh 🇰🇿 | Khmer 🇰🇭 | Korean 🇰🇷 |
|
||||||
| Korean | Lao | Latin | Latvian |
|
| Lao 🇱🇦 | Latin 🇻🇦 | Latvian 🇱🇻 | Lingala 🇨🇩 |
|
||||||
| Letzeburgesch | Lingala | Lithuanian | Luxembourgish |
|
| Lithuanian 🇱🇹 | Luxembourgish 🇱🇺 | Macedonian 🇲🇰 | Malagasy 🇲🇬 |
|
||||||
| Macedonian | Malagasy | Malay | Malayalam |
|
| Malay 🇲🇾 | Malayalam 🇮🇳 | Maltese 🇲🇹 | Maori 🇳🇿 |
|
||||||
| Maltese | Maori | Marathi | Moldavian |
|
| Marathi 🇮🇳 | Moldavian 🇲🇩 | Mongolian 🇲🇳 | Myanmar 🇲🇲 |
|
||||||
| Mongolian | Myanmar | Nepali | Norwegian |
|
| Nepali 🇳🇵 | Norwegian 🇳🇴 | Occitan 🇫🇷 | Panjabi 🇮🇳 |
|
||||||
| Nynorsk | Occitan | Panjabi | Pashto |
|
| Pashto 🇦🇫 | Persian 🇮🇷 | Polish 🇵🇱 | Portuguese 🇵🇹 |
|
||||||
| Persian | Polish | Portuguese | Punjabi |
|
| Punjabi 🇮🇳 | Romanian 🇷🇴 | Russian 🇷🇺 | Sanskrit 🇮🇳 |
|
||||||
| Pushto | Romanian | Russian | Sanskrit |
|
| Serbian 🇷🇸 | Shona 🇿🇼 | Sindhi 🇵🇰 | Sinhala 🇱🇰 |
|
||||||
| Serbian | Shona | Sindhi | Sinhala |
|
| Slovak 🇸🇰 | Slovenian 🇸🇮 | Somali 🇸🇴 | Spanish 🇪🇸 |
|
||||||
| Sinhalese | Slovak | Slovenian | Somali |
|
| Sundanese 🇮🇩 | Swahili 🇰🇪 | Swedish 🇸🇪 | Tagalog 🇵🇭 |
|
||||||
| Spanish | Sundanese | Swahili | Swedish |
|
| Tajik 🇹🇯 | Tamil 🇮🇳 | Tatar 🇷🇺 | Telugu 🇮🇳 |
|
||||||
| Tagalog | Tajik | Tamil | Tatar |
|
| Thai 🇹🇭 | Tibetan 🇨🇳 | Turkish 🇹🇷 | Turkmen 🇹🇲 |
|
||||||
| Telugu | Thai | Tibetan | Turkish |
|
| Ukrainian 🇺🇦 | Urdu 🇵🇰 | Uzbek 🇺🇿 | Vietnamese 🇻🇳 |
|
||||||
| Turkmen | Ukrainian | Urdu | Uzbek |
|
| Welsh 🏴 | Yiddish 🇮🇱 | Yoruba 🇳🇬 | |
|
||||||
| Valencian | Vietnamese | Welsh | Yiddish |
|
|
||||||
| Yoruba | | | |
|
|
||||||
|
|
||||||
*Note: The model will automatically detect the language being spoken.*
|
|
||||||
</details>
|
</details>
|
||||||
|
|
||||||
---
|
---
|
||||||
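Note on the model table in the hunk above: the VRAM column can be read as a simple lookup. The sketch below is illustrative only and is not code from this repository. The helper name `largest_model_that_fits` and the rounding of "~1 GB" to 1000 MB are assumptions; the figures are the approximate GPU values quoted in the table.

```python
# Approximate GPU VRAM needs in MB, copied from the table in the diff above.
# "~1 GB" is treated as 1000 MB here, which is an assumption for illustration.
MODEL_VRAM_MB = [
    ("Tiny", 500),
    ("Base", 600),
    ("Small", 1000),
    ("Medium", 2000),
    ("Large-v3 Turbo", 4000),
    ("Large-v3", 5000),
]


def largest_model_that_fits(free_vram_mb):
    """Return the last table row whose approximate VRAM need fits, or None.

    Hypothetical helper: the application itself may choose models differently.
    """
    choice = None
    for name, needed_mb in MODEL_VRAM_MB:
        if needed_mb <= free_vram_mb:
            choice = name
    return choice


if __name__ == "__main__":
    # Example: a card with roughly 4 GB of free VRAM.
    print(largest_model_that_fits(4096))
```

The table itself recommends `Large-v3 Turbo` as the sweet spot, so a real selection policy might prefer it over `Large-v3` even when both fit.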
@@ -131,35 +122,34 @@ The model is trained on 680,000 hours of multilingual data and supports the foll
|
|||||||
## 🔧 Troubleshooting
|
## 🔧 Troubleshooting
|
||||||
|
|
||||||
<details>
|
<details>
|
||||||
<summary><b>The app crashes immediately on start</b></summary>
|
<summary><b>🔥 App crashes on start</b></summary>
|
||||||
Ensure you have the <b>Microsoft Visual C++ Redistributable (2015-2022)</b> installed, as the underlying CTranslate2 engine requires these standard libraries.
|
<blockquote>
|
||||||
|
The underlying engine requires standard C++ libraries. Install the <b>Microsoft Visual C++ Redistributable (2015-2022)</b>.
|
||||||
|
</blockquote>
|
||||||
</details>
|
</details>
|
||||||
|
|
||||||
<details>
|
<details>
|
||||||
<summary><b>"Simulate Typing" is slow or misses characters</b></summary>
|
<summary><b>🐌 "Simulate Typing" is slow</b></summary>
|
||||||
Adjust the <b>Typing Speed</b> slider in Settings. Some older applications cannot handle supersonic 6000 CPM input; try lowering it to 1200 CPM.
|
<blockquote>
|
||||||
|
Some apps (games, RDP) can't handle supersonic input. Go to <b>Settings</b> and lower the <b>Typing Speed</b> to ~1200 CPM.
|
||||||
|
</blockquote>
|
||||||
</details>
|
</details>
|
||||||
|
|
||||||
<details>
|
<details>
|
||||||
<summary><b>Microphone not picking up audio</b></summary>
|
<summary><b>🎤 No Audio / Silence</b></summary>
|
||||||
The agent uses your <b>System Default Input Device</b>. Ensure your microphone is set as Default in Windows Sound Settings.
|
<blockquote>
|
||||||
|
The agent listens to the <b>Default Communication Device</b>. Ensure your microphone is set correctly in Windows Sound Settings.
|
||||||
|
</blockquote>
|
||||||
</details>
|
</details>
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## ⚖️ License & Rights
|
<div align="center">
|
||||||
|
|
||||||
**Public Domain (CC0 1.0)**
|
### ⚖️ PUBLIC DOMAIN (CC0 1.0)
|
||||||
|
|
||||||
To the extent possible under law, the creators of this interface have waived all copyright and related or neighboring rights to this work. This tool belongs to the commons. It is a gift to the digital proletariat.
|
*No Rights Reserved. No Gods. No Managers.*
|
||||||
|
|
||||||
* **Fork it.**
|
Credit to **OpenAI** (Whisper), **Systran** (Faster-Whisper), and **Silero** (VAD).
|
||||||
* **Mod it.**
|
|
||||||
* **Distribute it.**
|
|
||||||
|
|
||||||
### Credits
|
</div>
|
||||||
* **OpenAI**: For the Whisper weights (MIT).
|
|
||||||
* **Systran**: For Faster-Whisper (MIT).
|
|
||||||
* **Qt Company**: For the UI framework (LGPL).
|
|
||||||
|
|
||||||
*No gods, no cloud managers.*
|
|
||||||
|
|||||||
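Note on the "Simulate Typing" troubleshooting entry above: the CPM (characters per minute) figures translate directly into a per-keystroke delay, which helps explain why some applications drop characters at the top speed. The function below is a minimal sketch of that arithmetic, assuming evenly spaced keystrokes; `delay_per_char_ms` is a hypothetical name and not part of the application.

```python
def delay_per_char_ms(cpm):
    """Convert a typing speed in characters per minute to milliseconds per character.

    Assumes keystrokes are evenly spaced; illustrative only.
    """
    if cpm <= 0:
        raise ValueError("cpm must be positive")
    return 60_000 / cpm


if __name__ == "__main__":
    for speed in (6000, 1200):
        print(speed, "CPM ->", delay_per_char_ms(speed), "ms per character")
```

At 6000 CPM each character gets 10 ms, while the suggested 1200 CPM leaves 50 ms per character.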