5 Commits

Author SHA1 Message Date
Your Name
84f10092e9 Release v1.0.2: Implemented Style Prompting & Removed Grammar Correction
- Removed M2M100 Grammar Correction model completely to reduce bloat/complexity.
- Implemented 'Style Prompting' in Settings -> AI Engine to handle punctuation natively via Whisper.
- Added Style Presets: Standard (Default), Casual, and Custom.
- Optimized Build: Bootstrapper no longer requires transformers/sentencepiece.
- Fixed 'torch' NameError in Low VRAM mode.
- Fixed Bootstrapper missing dependency detection.
- Updated UI to reflect removed features.
- Included compiled v1.0.2 Executable in dist/.
2026-01-25 13:42:06 +02:00
Your Name
03f46ee1e3 Docs: Final polish - Enshittification manifesto and structural refinement 2026-01-24 19:21:01 +02:00
Your Name
0f1bf5f1af Docs: Final polish - 6-col language table and refined manifesto 2026-01-24 19:12:08 +02:00
Your Name
0b2b5848e2 Fix: Translation Reliability, Click-Through, and Docs Sync
- Transcriber: Enforced 'beam_size=5' and prompt injection for robust translation.
- Transcriber: Removed conditioning on previous text to prevent language stickiness.
- Transcriber: Refactored kwargs to sanitize inputs.
- Overlay: Fixed click-through by toggling WS_EX_TRANSPARENT.
- UI: Added real download progress reporting.
- Docs: Refactored language list to table.
2026-01-24 19:05:43 +02:00
Your Name
f3bf7541cf Docs: Detailed expansion of README with Translation features and open layout 2026-01-24 18:33:22 +02:00
14 changed files with 478 additions and 142 deletions

195
README.md

@@ -5,150 +5,157 @@
<br>

![Status](https://img.shields.io/badge/STATUS-OPERATIONAL-success?style=for-the-badge&logo=server&color=2ecc71)
[![Download](https://img.shields.io/gitea/v/release/lashman/whisper_voice?gitea_url=https%3A%2F%2Fgit.lashman.live&label=Install&style=for-the-badge&logo=windows&logoColor=white&color=3b82f6)](https://git.lashman.live/lashman/whisper_voice/releases/latest)
[![License](https://img.shields.io/badge/LICENSE-PUBLIC_DOMAIN-lightgrey?style=for-the-badge&logo=creative-commons&logoColor=black)](https://creativecommons.org/publicdomain/zero/1.0/)
<br>
> *"The master's tools will never dismantle the master's house."* — Audre Lorde
> <br>
> **Build your own tools. Run them locally. Free your mind.**

[View Source](https://git.lashman.live/lashman/whisper_voice) • [Report Issue](https://git.lashman.live/lashman/whisper_voice/issues)
</div>
<br>
## 📡 The Transmission
We are witnessing the **enshittification** of the digital world. What were once vibrant social commons are being walled off, strip-mined for data, and degraded into rent-seeking silos. Your voice is no longer your own; it is a training set for a corporate oracle that charges you for the privilege of listening.

**Whisper Voice** is a small act of sabotage against this trend.

It is built on the axiom of **Technological Sovereignty**. By moving state-of-the-art inference from the server farms to your own silicon, you reclaim the means of digital production. No telemetry. No subscriptions. No "cloud processing" that eavesdrops on your intent.

---
## ⚡ The Engine
Whisper Voice operates directly on the metal. It is not an API wrapper; it is an autonomous machine.

| Component | Technology | Benefit |
| :--- | :--- | :--- |
| **Inference Core** | **Faster-Whisper** | Hyper-optimized C++ implementation via **CTranslate2**. Up to **4x faster** than standard PyTorch. |
| **Compression** | **INT8 quantization** | Enables Pro-grade models (`Large-v3`) to run on consumer-grade GPUs, democratizing elite AI. |
| **Sensory Gate** | **Silero VAD** | Enterprise-grade Voice Activity Detection filters out the noise, ensuring only pure intent is processed. |
| **Interface** | **Qt 6 / QML** | Hardware-accelerated, glassmorphic UI that is fluid, responsive, and sovereign. |
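The **INT8** row deserves a concrete illustration. Below is a minimal, pure-Python sketch of symmetric 8-bit quantization — the general idea behind the compression, not CTranslate2's actual kernels:

```python
def quantize_int8(weights):
    """Symmetric per-tensor INT8 quantization: map floats onto [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize_int8(q, scale):
    """Recover approximate float weights from the INT8 values."""
    return [v * scale for v in q]

weights = [0.02, -1.27, 0.635, 0.9]
q, scale = quantize_int8(weights)
approx = dequantize_int8(q, scale)
# Each recovered weight is within one quantization step of the original,
# which is why accuracy loss stays small while memory drops to a quarter of FP32.
assert all(abs(a - w) <= scale for a, w in zip(approx, weights))
```

Each weight now occupies one byte instead of four, which is the mechanism that lets `Large-v3` fit into ~5 GB of VRAM.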
<br>
## 🖋️ Universal Transcription
At its core, Whisper Voice is the ultimate bridge between thought and text. It listens with superhuman precision, converting spoken word into written form across **99 languages**.
* **Punctuation Mastery**: Automatically handles capitalization and complex punctuation formatting.
* **Contextual Intelligence**: Smarter than standard dictation; it understands the flow of sentences to resolve homophones and format technical jargon (e.g., rendering "fifteen hundred dollars" as $1,500).
* **Total Privacy**: Your private dictation, legal notes, or creative writing never leave your RAM.
### Workflow: `F9 (Default)`
The primary channel for native-language transcription. It transcribes precisely what it hears in the language you speak (or the one you've locked in Settings).
<br>
## 🌎 Universal Translation
Whisper Voice includes a **Neural Translation Engine** that lets you bridge any linguistic gap instantly.
* **Input**: Speak in French, Japanese, Russian, or **96 other languages**.
* **Output**: The engine instantly reconstructs the semantic meaning into fluent **English**.
* **Task Protocol**: Handled via the dedicated `F10` channel.
### 🔍 Why only English translation?
A common question arises: *Why can't I translate from French to Japanese?*
The architecture of the underlying Whisper model is a **Many-to-English** design. During its massive training phase (680,000 hours of audio), the translation task was specifically optimized to map the global linguistic commons onto a single bridge language: **English**. This allowed the model to reach incredible levels of semantic understanding without the exponential complexity of a "Many-to-Many" mapping.
By focusing its translation decoder solely on English, Whisper achieves "Zero-Shot" quality that rivals specialized translation engines while remaining lightweight enough to run on your local GPU.
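The task selection is driven by the decoder's control tokens. A hypothetical sketch of the prompt layout (token spellings follow the public Whisper tokenizer; treat the exact details as assumptions, not this app's code):

```python
def build_decoder_prompt(language: str, task: str) -> list:
    """Sketch of Whisper's special-token prompt. 'transcribe' keeps the source
    language; 'translate' always targets English — there is no target-language
    token, which is why Many-to-Many output is architecturally impossible."""
    if task not in ("transcribe", "translate"):
        raise ValueError(f"unknown task: {task}")
    return ["<|startoftranscript|>", f"<|{language}|>", f"<|{task}|>", "<|notimestamps|>"]

# French speech, English output: the target is implicit in the task token.
prompt = build_decoder_prompt("fr", "translate")
```

Because English is baked into the `<|translate|>` token itself, no prompt engineering can redirect the decoder to a third language.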
---
## 🕹️ Command & Control
### Global Hotkeys
The agent runs silently in the background, waiting for your signal.
* **Transcribe (F9)**: Opens the channel for standard speech-to-text.
* **Translate (F10)**: Opens the channel for neural translation.
* **Customization**: Remap these keys in Settings. The recorder supports complex chords (e.g. `Ctrl + Alt + Space`) to fit your workflow.
### Injection Protocols
* **Clipboard Paste**: Standard text injection. Instant, reliable.
* **Simulate Typing**: Mimics physical keystrokes at superhuman speed (6000 CPM). Bypasses anti-paste restrictions and "protected" windows.
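The CPM figures map directly to an inter-keystroke delay. A quick sketch of the arithmetic (the function name is illustrative, not the app's actual API):

```python
def keystroke_delay(cpm: int) -> float:
    """Seconds to sleep between simulated keystrokes for a given
    characters-per-minute rate."""
    if cpm <= 0:
        raise ValueError("cpm must be positive")
    return 60.0 / cpm

assert abs(keystroke_delay(6000) - 0.01) < 1e-12  # 6000 CPM -> 10 ms per character
assert abs(keystroke_delay(1200) - 0.05) < 1e-12  # throttled mode for fragile apps
```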
<br>
## 📊 Intelligence Matrix
Select the model that aligns with your available resources.

| Model | VRAM (GPU) | RAM (CPU) | Designation | Capability |
| :--- | :--- | :--- | :--- | :--- |
| `Tiny` | **~500 MB** | ~1 GB | ⚡ **Supersonic** | Command & Control, older hardware. |
| `Base` | **~600 MB** | ~1 GB | 🚀 **Very Fast** | Daily driver for low-power laptops. |
| `Small` | **~1 GB** | ~2 GB | ⏩ **Fast** | High accuracy English dictation. |
| `Medium` | **~2 GB** | ~4 GB | ⚖️ **Balanced** | Complex vocabulary, foreign accents. |
| `Large-v3 Turbo` | **~4 GB** | ~6 GB | ✨ **Optimal** | **The Sweet Spot.** Near-Large intelligence, Medium speed. |
| `Large-v3` | **~5 GB** | ~8 GB | 🧠 **Maximum** | Professional grade. Uncompromised. |

> *Note: Acceleration requires you to manually select your Compute Device (CUDA GPU or CPU) in Settings.*
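The matrix can be applied mechanically when picking a model. A sketch of choosing the largest model that fits in free VRAM, using the thresholds above (helper name and model labels are illustrative):

```python
# (model, approximate VRAM needed in MB), smallest to largest — from the matrix above.
MODEL_VRAM_MB = [
    ("tiny", 500), ("base", 600), ("small", 1000),
    ("medium", 2000), ("large-v3-turbo", 4000), ("large-v3", 5000),
]

def pick_model(free_vram_mb: int) -> str:
    """Return the largest model whose footprint fits in the available VRAM."""
    fitting = [name for name, need in MODEL_VRAM_MB if need <= free_vram_mb]
    return fitting[-1] if fitting else "tiny"  # fall back to the smallest model

assert pick_model(4096) == "large-v3-turbo"  # a 4 GB card lands on the sweet spot
```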
---
## 🛠️ Deployment
### 📥 Installation
1. **Acquire**: Download `WhisperVoice.exe` from [Releases](https://git.lashman.live/lashman/whisper_voice/releases).
2. **Deploy**: Place it anywhere. It is portable.
3. **Bootstrap**: Run it. The agent will self-provision an isolated Python runtime (~2GB) on first launch.
4. **Sync**: Future updates are handled by the **Smart Bootstrapper**, which surgically updates only changed files, respecting your bandwidth and your settings.

### 🔧 Troubleshooting
* **App crashes on start**: Ensure you have the [Microsoft Visual C++ Redistributable 2015-2022](https://learn.microsoft.com/en-us/cpp/windows/latest-supported-vc-redist) installed.
* **"Simulate Typing" is slow**: Some applications (remote desktops, legacy games) cannot handle the data stream. Lower the typing speed in Settings to ~1200 CPM.
* **No Audio**: The agent listens to the **Default Communication Device**. Verify your Windows Sound Control Panel.
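A "surgical" update reduces to comparing content hashes before writing. A minimal sketch of the idea (not the Smart Bootstrapper's actual code):

```python
import hashlib
from pathlib import Path

def needs_update(dest: Path, new_bytes: bytes) -> bool:
    """Only rewrite a file when its content hash differs from the incoming payload,
    so unchanged files cost zero writes and zero bandwidth."""
    if not dest.exists():
        return True
    old_digest = hashlib.sha256(dest.read_bytes()).hexdigest()
    new_digest = hashlib.sha256(new_bytes).hexdigest()
    return old_digest != new_digest
```

Files whose hashes match are skipped; only genuinely changed files are fetched and replaced.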
### 📡 Input Modes
| Mode | Description | Speed |
| :--- | :--- | :--- |
| **Clipboard Paste** | Standard text injection via OS clipboard. | Instant |
| **Simulate Typing** | Mimics physical keystrokes. Bypasses anti-paste blocks. | Up to **6000** CPM |
---
## 🌐 Universal Translation
The model listens in **99 languages** and translates them to English or transcribes them natively.
<details>
<summary><b>Click to view supported languages</b></summary>
<br> <br>
| | | | |
| :--- | :--- | :--- | :--- |
| Afrikaans 🇿🇦 | Albanian 🇦🇱 | Amharic 🇪🇹 | Arabic 🇸🇦 |
| Armenian 🇦🇲 | Assamese 🇮🇳 | Azerbaijani 🇦🇿 | Bashkir 🇷🇺 |
| Basque 🇪🇸 | Belarusian 🇧🇾 | Bengali 🇧🇩 | Bosnian 🇧🇦 |
| Breton 🇫🇷 | Bulgarian 🇧🇬 | Burmese 🇲🇲 | Castilian 🇪🇸 |
| Catalan 🇪🇸 | Chinese 🇨🇳 | Croatian 🇭🇷 | Czech 🇨🇿 |
| Danish 🇩🇰 | Dutch 🇳🇱 | English 🇺🇸 | Estonian 🇪🇪 |
| Faroese 🇫🇴 | Finnish 🇫🇮 | Flemish 🇧🇪 | French 🇫🇷 |
| Galician 🇪🇸 | Georgian 🇬🇪 | German 🇩🇪 | Greek 🇬🇷 |
| Gujarati 🇮🇳 | Haitian 🇭🇹 | Hausa 🇳🇬 | Hawaiian 🇺🇸 |
| Hebrew 🇮🇱 | Hindi 🇮🇳 | Hungarian 🇭🇺 | Icelandic 🇮🇸 |
| Indonesian 🇮🇩 | Italian 🇮🇹 | Japanese 🇯🇵 | Javanese 🇮🇩 |
| Kannada 🇮🇳 | Kazakh 🇰🇿 | Khmer 🇰🇭 | Korean 🇰🇷 |
| Lao 🇱🇦 | Latin 🇻🇦 | Latvian 🇱🇻 | Lingala 🇨🇩 |
| Lithuanian 🇱🇹 | Luxembourgish 🇱🇺 | Macedonian 🇲🇰 | Malagasy 🇲🇬 |
| Malay 🇲🇾 | Malayalam 🇮🇳 | Maltese 🇲🇹 | Maori 🇳🇿 |
| Marathi 🇮🇳 | Moldavian 🇲🇩 | Mongolian 🇲🇳 | Myanmar 🇲🇲 |
| Nepali 🇳🇵 | Norwegian 🇳🇴 | Occitan 🇫🇷 | Panjabi 🇮🇳 |
| Pashto 🇦🇫 | Persian 🇮🇷 | Polish 🇵🇱 | Portuguese 🇵🇹 |
| Punjabi 🇮🇳 | Romanian 🇷🇴 | Russian 🇷🇺 | Sanskrit 🇮🇳 |
| Serbian 🇷🇸 | Shona 🇿🇼 | Sindhi 🇵🇰 | Sinhala 🇱🇰 |
| Slovak 🇸🇰 | Slovenian 🇸🇮 | Somali 🇸🇴 | Spanish 🇪🇸 |
| Sundanese 🇮🇩 | Swahili 🇰🇪 | Swedish 🇸🇪 | Tagalog 🇵🇭 |
| Tajik 🇹🇯 | Tamil 🇮🇳 | Tatar 🇷🇺 | Telugu 🇮🇳 |
| Thai 🇹🇭 | Tibetan 🇨🇳 | Turkish 🇹🇷 | Turkmen 🇹🇲 |
| Ukrainian 🇺🇦 | Urdu 🇵🇰 | Uzbek 🇺🇿 | Vietnamese 🇻🇳 |
| Welsh 🏴󠁧󠁢󠁷󠁬󠁳󠁿 | Yiddish 🇮🇱 | Yoruba 🇳🇬 | |
</details>
---
## 🌐 Supported Languages
The engine understands the following 99 languages. You can lock the focus to a specific language in Settings to improve accuracy, or rely on **Auto-Detect** for fluid multilingual usage.

| | | | | | |
| :--- | :--- | :--- | :--- | :--- | :--- |
| Afrikaans 🇿🇦 | Albanian 🇦🇱 | Amharic 🇪🇹 | Arabic 🇸🇦 | Armenian 🇦🇲 | Assamese 🇮🇳 |
| Azerbaijani 🇦🇿 | Bashkir 🇷🇺 | Basque 🇪🇸 | Belarusian 🇧🇾 | Bengali 🇧🇩 | Bosnian 🇧🇦 |
| Breton 🇫🇷 | Bulgarian 🇧🇬 | Burmese 🇲🇲 | Castilian 🇪🇸 | Catalan 🇪🇸 | Chinese 🇨🇳 |
| Croatian 🇭🇷 | Czech 🇨🇿 | Danish 🇩🇰 | Dutch 🇳🇱 | English 🇺🇸 | Estonian 🇪🇪 |
| Faroese 🇫🇴 | Finnish 🇫🇮 | Flemish 🇧🇪 | French 🇫🇷 | Galician 🇪🇸 | Georgian 🇬🇪 |
| German 🇩🇪 | Greek 🇬🇷 | Gujarati 🇮🇳 | Haitian 🇭🇹 | Hausa 🇳🇬 | Hawaiian 🇺🇸 |
| Hebrew 🇮🇱 | Hindi 🇮🇳 | Hungarian 🇭🇺 | Icelandic 🇮🇸 | Indonesian 🇮🇩 | Italian 🇮🇹 |
| Japanese 🇯🇵 | Javanese 🇮🇩 | Kannada 🇮🇳 | Kazakh 🇰🇿 | Khmer 🇰🇭 | Korean 🇰🇷 |
| Lao 🇱🇦 | Latin 🇻🇦 | Latvian 🇱🇻 | Lingala 🇨🇩 | Lithuanian 🇱🇹 | Luxembourgish 🇱🇺 |
| Macedonian 🇲🇰 | Malagasy 🇲🇬 | Malay 🇲🇾 | Malayalam 🇮🇳 | Maltese 🇲🇹 | Maori 🇳🇿 |
| Marathi 🇮🇳 | Moldavian 🇲🇩 | Mongolian 🇲🇳 | Myanmar 🇲🇲 | Nepali 🇳🇵 | Norwegian 🇳🇴 |
| Occitan 🇫🇷 | Panjabi 🇮🇳 | Pashto 🇦🇫 | Persian 🇮🇷 | Polish 🇵🇱 | Portuguese 🇵🇹 |
| Punjabi 🇮🇳 | Romanian 🇷🇴 | Russian 🇷🇺 | Sanskrit 🇮🇳 | Serbian 🇷🇸 | Shona 🇿🇼 |
| Sindhi 🇵🇰 | Sinhala 🇱🇰 | Slovak 🇸🇰 | Slovenian 🇸🇮 | Somali 🇸🇴 | Spanish 🇪🇸 |
| Sundanese 🇮🇩 | Swahili 🇰🇪 | Swedish 🇸🇪 | Tagalog 🇵🇭 | Tajik 🇹🇯 | Tamil 🇮🇳 |
| Tatar 🇷🇺 | Telugu 🇮🇳 | Thai 🇹🇭 | Tibetan 🇨🇳 | Turkish 🇹🇷 | Turkmen 🇹🇲 |
| Ukrainian 🇺🇦 | Urdu 🇵🇰 | Uzbek 🇺🇿 | Vietnamese 🇻🇳 | Welsh 🏴󠁧󠁢󠁷󠁬󠁳󠁿 | Yiddish 🇮🇱 |
| Yoruba 🇳🇬 | | | | | |

<br>
<br>
---
<div align="center">

### ⚖️ PUBLIC DOMAIN (CC0 1.0)
*No Rights Reserved. No Gods. No Managers.*

Credit to **OpenAI** (Whisper), **Systran** (Faster-Whisper), and **Silero** (VAD).


@@ -347,11 +347,17 @@ class Bootstrapper:
        messagebox.showerror("WhisperVoice Error", f"Failed to launch app: {e}")
        return False

    def check_dependencies(self):
        """Quick check if critical dependencies are installed."""
        return True  # Deprecated logic placeholder

    def setup_and_run(self):
        """Full setup/update and run flow."""
        try:
            # 1. Ensure basics
            if not self.is_python_ready():
                self.download_python()
                self._fix_pth_file()  # Ensure pth is fixed immediately after download
            self.install_pip()
            self.install_packages()
@@ -362,7 +368,10 @@ class Bootstrapper:
            if self.run_app():
                if self.ui: self.ui.root.quit()
        except Exception as e:
            if self.ui:
                import tkinter.messagebox as mb
                mb.showerror("Setup Error", f"Installation failed: {e}")  # Improved error visibility
            log(f"Fatal error: {e}")
            import traceback
            traceback.print_exc()

BIN
dist/WhisperVoice.exe vendored Normal file

Binary file not shown.

65
main.py

@@ -87,7 +87,7 @@ def _silent_shutdown_hook(exc_type, exc_value, exc_tb):
sys.excepthook = _silent_shutdown_hook

class DownloadWorker(QThread):
    """Background worker for model downloads with REAL progress."""
    progress = Signal(int)
    finished = Signal()
    error = Signal(str)
@@ -98,20 +98,67 @@ class DownloadWorker(QThread):
    def run(self):
        try:
            import requests

            model_path = get_models_path()

            # Determine what to download
            dest_dir = model_path / f"faster-whisper-{self.model_name}"
            repo_id = f"Systran/faster-whisper-{self.model_name}"
            files = ["config.json", "model.bin", "tokenizer.json", "vocabulary.json"]
            base_url = f"https://huggingface.co/{repo_id}/resolve/main"

            dest_dir.mkdir(parents=True, exist_ok=True)
            logging.info(f"Downloading {self.model_name} to {dest_dir}...")

            # 1. Calculate total size via HEAD requests
            total_size = 0
            file_sizes = {}
            with requests.Session() as s:
                for fname in files:
                    url = f"{base_url}/{fname}"
                    head = s.head(url, allow_redirects=True)
                    if head.status_code == 200:
                        size = int(head.headers.get('content-length', 0))
                        file_sizes[fname] = size
                        total_size += size
                    # else: file absent in this repo (e.g. vocabulary variants); skip it

            # 2. Download loop (overwrite rather than resume, for reliability)
            downloaded_bytes = 0
            with requests.Session() as s:
                for fname in files:
                    if fname not in file_sizes:
                        continue
                    url = f"{base_url}/{fname}"
                    dest_file = dest_dir / fname
                    resp = s.get(url, stream=True)
                    resp.raise_for_status()
                    with open(dest_file, 'wb') as f:
                        for chunk in resp.iter_content(chunk_size=8192):
                            if chunk:
                                f.write(chunk)
                                downloaded_bytes += len(chunk)
                                # Emit progress
                                if total_size > 0:
                                    pct = int((downloaded_bytes / total_size) * 100)
                                    self.progress.emit(pct)

            self.finished.emit()
        except Exception as e:
            logging.error(f"Download failed: {e}")
            self.error.emit(str(e))
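The worker's progress math is easy to isolate and verify. A sketch of the percentage calculation it emits, including the `total_size == 0` guard (helper name hypothetical):

```python
def progress_percent(downloaded_bytes: int, total_bytes: int) -> int:
    """Integer percentage for the UI; a zero or unknown total reports 0 instead
    of raising ZeroDivisionError, matching the worker's 'if total_size > 0' guard."""
    if total_bytes <= 0:
        return 0
    return int((downloaded_bytes / total_bytes) * 100)

assert progress_percent(500, 1000) == 50
assert progress_percent(10, 0) == 0  # HEAD requests failed: stay at 0% rather than crash
```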


@@ -39,39 +39,36 @@ def build_portable():
    print("⏳ This may take 5-10 minutes...")

    PyInstaller.__main__.run([
        "bootstrapper.py",       # Entry point (Tiny Installer)
        "--name=WhisperVoice",   # EXE name
        "--onefile",             # Single EXE
        "--noconsole",           # No terminal window
        "--clean",               # Clean cache

        # Bundle the app source to be extracted by the bootstrapper.
        # The bootstrapper expects an 'app_source' folder in bundled resources.
        "--add-data", f"src{os.pathsep}app_source/src",
        "--add-data", f"main.py{os.pathsep}app_source",
        "--add-data", f"requirements.txt{os.pathsep}app_source",

        # Add assets
        "--add-data", f"src/ui/qml{os.pathsep}app_source/src/ui/qml",
        "--add-data", f"assets{os.pathsep}app_source/assets",

        # No heavy collections!
        # The bootstrapper uses internal pip to install everything.

        # Exclude heavy modules to ensure this exe stays tiny
        "--exclude-module", "faster_whisper",
        "--exclude-module", "torch",
        "--exclude-module", "PySide6",

        # Icon
        # "--icon=icon.ico",
    ])

    print("\n" + "="*60)
    print("✅ BUILD COMPLETE!")
    print("="*60)


@@ -5,6 +5,7 @@
faster-whisper>=1.0.0
torch>=2.0.0
# UI Framework
PySide6>=6.6.0


@@ -46,7 +46,13 @@ DEFAULT_SETTINGS = {
    "best_of": 5,
    "vad_filter": True,
    "no_repeat_ngram_size": 0,
    "condition_on_previous_text": True,
    "initial_prompt": "Mm-hmm. Okay, let's go. I speak in full sentences.",  # Default: forces punctuation

    # Low VRAM Mode
    "unload_models_after_use": False  # If True, models are unloaded immediately to free VRAM
}

class ConfigManager:
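New keys such as `initial_prompt` must survive merging with an older on-disk config. A sketch of defaults-first merging — the real `ConfigManager` internals are not shown in this diff, so treat the pattern as an assumption:

```python
DEFAULT_SETTINGS = {
    "beam_size": 5,
    "initial_prompt": "Mm-hmm. Okay, let's go. I speak in full sentences.",
    "unload_models_after_use": False,
}

def merge_settings(defaults: dict, user: dict) -> dict:
    """User values win; keys the user's file has never seen fall back to defaults,
    so adding a setting in a release never breaks existing installs."""
    merged = dict(defaults)
    merged.update(user)
    return merged

cfg = merge_settings(DEFAULT_SETTINGS, {"beam_size": 1})
assert cfg["beam_size"] == 1                      # user override preserved
assert cfg["unload_models_after_use"] is False    # new key filled from defaults
```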


@@ -15,6 +15,11 @@ import numpy as np
from src.core.config import ConfigManager
from src.core.paths import get_models_path

try:
    import torch
except ImportError:
    torch = None

# Import directly - valid since we are now running in the full environment
from faster_whisper import WhisperModel
@@ -94,27 +99,73 @@ class WhisperTranscriber:
            language = self.config.get("language")

            # Use task override if provided, otherwise config.
            # Ensure safe string and lowercase ("transcribe" vs "Transcribe")
            raw_task = task if task else self.config.get("task")
            final_task = str(raw_task).strip().lower() if raw_task else "transcribe"

            # Sanity check for valid Whisper tasks
            if final_task not in ["transcribe", "translate"]:
                logging.warning(f"Invalid task '{final_task}' detected. Defaulting to 'transcribe'.")
                final_task = "transcribe"

            # Language handling
            final_language = language if language != "auto" else None

            # Anti-Hallucination: force condition_on_previous_text=False for translation
            condition_prev = self.config.get("condition_on_previous_text")

            # Helper options for translation stability
            initial_prompt = self.config.get("initial_prompt")
            if final_task == "translate":
                condition_prev = False
                # Force beam search if the user has set it to greedy (1).
                # Translation requires more search breadth to find the English mapping.
                if beam_size < 5:
                    logging.info("Forcing beam_size=5 for Translation task.")
                    beam_size = 5
                # Inject a guidance prompt if none exists
                if not initial_prompt:
                    initial_prompt = "Translate this to English."

            logging.info(f"Model Dispatch: Task='{final_task}', Language='{final_language}', ConditionPrev={condition_prev}, Beam={beam_size}")

            # Build arguments dynamically to avoid passing None
            transcribe_opts = {
                "beam_size": beam_size,
                "best_of": best_of,
                "vad_filter": vad,
                "task": final_task,
                "vad_parameters": dict(min_silence_duration_ms=500),
                "condition_on_previous_text": condition_prev,
                "without_timestamps": True
            }
            if initial_prompt:
                transcribe_opts["initial_prompt"] = initial_prompt
            # Only add language if it is explicitly set (not None/Auto).
            # This avoids potentially confusing the model with an explicit None.
            if final_language:
                transcribe_opts["language"] = final_language

            # Transcribe
            segments, info = self.model.transcribe(audio_data, **transcribe_opts)

            # Aggregate text
            text_result = ""
            for segment in segments:
                text_result += segment.text + " "
            text_result = text_result.strip()

            # Low VRAM Mode: unload the Whisper model immediately
            if self.config.get("unload_models_after_use"):
                self.unload_model()

            logging.info(f"Final Transcription Output: '{text_result}'")
            return text_result
        except Exception as e:
            logging.error(f"Transcription failed: {e}")
@@ -133,3 +184,21 @@ class WhisperTranscriber:
            return True
        return False
    def unload_model(self):
        """
        Unloads model to free memory.
        """
        if self.model:
            del self.model
            self.model = None
            self.current_model_size = None
            # Force garbage collection
            import gc
            gc.collect()
            # Guard: torch is None when the optional import at module top failed
            if torch is not None and torch.cuda.is_available():
                torch.cuda.empty_cache()
            logging.info("Whisper Model unloaded (Low VRAM Mode).")
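The translate-mode dispatch rules in this file reduce to a pure function, which makes them easy to verify in isolation. A sketch mirroring the logic above (helper name hypothetical):

```python
def build_transcribe_opts(task, beam_size, condition_prev, initial_prompt):
    """Mirror of the dispatch rules: translation forces beam search, disables
    conditioning on previous text, and injects a guidance prompt if none is set."""
    task = str(task).strip().lower() if task else "transcribe"
    if task not in ("transcribe", "translate"):
        task = "transcribe"
    if task == "translate":
        condition_prev = False
        beam_size = max(beam_size, 5)
        if not initial_prompt:
            initial_prompt = "Translate this to English."
    opts = {
        "task": task,
        "beam_size": beam_size,
        "condition_on_previous_text": condition_prev,
        "without_timestamps": True,
    }
    if initial_prompt:
        opts["initial_prompt"] = initial_prompt
    return opts

# Sloppy input ("Translate ", greedy beam) gets sanitized and upgraded.
opts = build_transcribe_opts("Translate ", 1, True, None)
assert opts["task"] == "translate" and opts["beam_size"] == 5
```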


@@ -376,6 +376,9 @@ class UIBridge(QObject):
        try:
            from src.core.paths import get_models_path
            # Check new simple format used by DownloadWorker
            path_simple = get_models_path() / f"faster-whisper-{size}"
            if path_simple.exists() and any(path_simple.iterdir()):


@@ -587,6 +587,53 @@ Window {
Text { text: "Model configuration and performance"; color: SettingsStyle.textSecondary; font.family: mainFont; font.pixelSize: 14 }
}
ModernSettingsSection {
title: "Style & Prompting"
Layout.margins: 32
Layout.topMargin: 0
content: ColumnLayout {
width: parent.width
spacing: 0
ModernSettingsItem {
label: "Punctuation Style"
description: "Hint for how to format text"
control: ModernComboBox {
id: styleCombo
width: 180
model: ["Standard (Proper)", "Casual (Lowercase)", "Custom"]
// Logic to determine initial index based on config string
Component.onCompleted: {
let current = ui.getSetting("initial_prompt")
if (current === "Mm-hmm. Okay, let's go. I speak in full sentences.") currentIndex = 0
else if (current === "um, okay... i guess so.") currentIndex = 1
else currentIndex = 2
}
onActivated: {
if (index === 0) ui.setSetting("initial_prompt", "Mm-hmm. Okay, let's go. I speak in full sentences.")
else if (index === 1) ui.setSetting("initial_prompt", "um, okay... i guess so.")
// Custom: Don't change string immediately, let user type
}
}
}
ModernSettingsItem {
label: "Custom Prompt"
description: "Advanced: Define your own style hint"
visible: styleCombo.currentIndex === 2
control: ModernTextField {
Layout.preferredWidth: 280
placeholderText: "e.g. 'Hello, World.'"
text: ui.getSetting("initial_prompt") || ""
onEditingFinished: ui.setSetting("initial_prompt", text === "" ? null : text)
}
}
}
}
ModernSettingsSection {
title: "Model Config"
Layout.margins: 32
@@ -785,6 +832,16 @@ Window {
onActivated: ui.setSetting("compute_type", currentText)
}
}
ModernSettingsItem {
label: "Low VRAM Mode"
description: "Unload models immediately after use (Saves VRAM, Adds Delay)"
showSeparator: false
control: ModernSwitch {
checked: ui.getSetting("unload_models_after_use")
onToggled: ui.setSetting("unload_models_after_use", checked)
}
}
}
}
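The preset combo above is a straight mapping between labels and canonical prompt strings. The same lookup expressed in Python (the QML is authoritative; this is just a testable restatement):

```python
# Preset labels -> canonical initial_prompt strings, as used by the QML combo box.
STYLE_PRESETS = {
    "Standard (Proper)": "Mm-hmm. Okay, let's go. I speak in full sentences.",
    "Casual (Lowercase)": "um, okay... i guess so.",
}

def preset_for_prompt(prompt):
    """Reverse lookup used to pick the initial combo index; any string that is
    not a known preset is treated as a user-defined Custom prompt."""
    for name, text in STYLE_PRESETS.items():
        if text == prompt:
            return name
    return "Custom"

assert preset_for_prompt("um, okay... i guess so.") == "Casual (Lowercase)"
assert preset_for_prompt("Hello, World.") == "Custom"
```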


@@ -55,6 +55,10 @@ except AttributeError:
def LOWORD(l): return l & 0xffff
def HIWORD(l): return (l >> 16) & 0xffff
GWL_EXSTYLE = -20
WS_EX_TRANSPARENT = 0x00000020
WS_EX_LAYERED = 0x00080000
class WindowHook:
    def __init__(self, hwnd, width, height, initial_scale=1.0):
        self.hwnd = hwnd
@@ -68,8 +72,32 @@ class WindowHook:
        self.enabled = True  # New flag

    def set_enabled(self, enabled):
        """
        Enables or disables interaction.
        When disabled, we set WS_EX_TRANSPARENT so clicks pass through physically.
        """
        if self.enabled == enabled:
            return
        self.enabled = enabled

        # Get current styles
        style = user32.GetWindowLongW(self.hwnd, GWL_EXSTYLE)

        if not enabled:
            # Enable click-through (add TRANSPARENT).
            # We also ensure LAYERED is set (Qt usually sets it, but good to be sure).
            new_style = style | WS_EX_TRANSPARENT | WS_EX_LAYERED
        else:
            # Disable click-through (remove TRANSPARENT)
            new_style = style & ~WS_EX_TRANSPARENT

        if new_style != style:
            SetWindowLongPtr(self.hwnd, GWL_EXSTYLE, new_style)
            # Force a redraw/frame update just in case
            user32.SetWindowPos(self.hwnd, 0, 0, 0, 0, 0, 0x0027)  # SWP_NOMOVE | SWP_NOSIZE | SWP_NOZORDER | SWP_FRAMECHANGED
    def install(self):
        proc_address = ctypes.cast(self.new_wnd_proc, ctypes.c_void_p)
        self.old_wnd_proc = SetWindowLongPtr(self.hwnd, GWLP_WNDPROC, proc_address)
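The click-through fix is pure bit arithmetic on the extended window style. A self-contained sketch using the same constants (no Win32 calls, so the logic can be exercised anywhere):

```python
# Extended window style bits, as defined by the Win32 API (same values as above).
WS_EX_TRANSPARENT = 0x00000020
WS_EX_LAYERED = 0x00080000

def toggle_click_through(style: int, interactive: bool) -> int:
    """When not interactive, add TRANSPARENT (and ensure LAYERED) so mouse
    clicks physically pass through the overlay; otherwise strip TRANSPARENT."""
    if interactive:
        return style & ~WS_EX_TRANSPARENT
    return style | WS_EX_TRANSPARENT | WS_EX_LAYERED

style = 0x00080000  # layered-only window, as Qt typically creates it
ghost = toggle_click_through(style, interactive=False)
assert ghost & WS_EX_TRANSPARENT                               # clicks now pass through
assert not (toggle_click_through(ghost, True) & WS_EX_TRANSPARENT)  # and back again
```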

38
test_m2m.py Normal file

@@ -0,0 +1,38 @@
import sys
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

def test_m2m():
    model_name = "facebook/m2m100_418M"
    print(f"Loading {model_name}...")
    tokenizer = M2M100Tokenizer.from_pretrained(model_name)
    model = M2M100ForConditionalGeneration.from_pretrained(model_name)

    # Test cases: (Language Code, Input)
    test_cases = [
        ("en", "he go to school yesterday"),
        ("pl", "on iść do szkoła wczoraj"),  # Intentionally broken grammar in Polish
    ]

    print("\nStarting M2M Tests (Self-Translation):\n")
    for lang, input_text in test_cases:
        tokenizer.src_lang = lang
        encoded = tokenizer(input_text, return_tensors="pt")
        # Translate to the SAME language
        generated_tokens = model.generate(
            **encoded,
            forced_bos_token_id=tokenizer.get_lang_id(lang)
        )
        corrected = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)[0]
        print(f"[{lang}]")
        print(f"Input: {input_text}")
        print(f"Output: {corrected}")
        print("-" * 20)

if __name__ == "__main__":
    test_m2m()

40
test_mt0.py Normal file

@@ -0,0 +1,40 @@
import sys
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

def test_mt0():
    model_name = "bigscience/mt0-base"
    print(f"Loading {model_name}...")
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

    # Test cases: (Language, Prompt, Input)
    # MT0 is instruction tuned, so we should prompt it in the target language or English.
    # Cross-lingual prompting (English prompt -> Target tasks) is usually supported.
    test_cases = [
        ("English", "Correct grammar:", "he go to school yesterday"),
        ("Polish", "Popraw gramatykę:", "to jest testowe zdanie bez kropki"),
        ("Finnish", "Korjaa kielioppi:", "tämä on testilause ilman pistettä"),
        ("Russian", "Исправь грамматику:", "это тестовое предложение без точки"),
        ("Japanese", "文法を直してください:", "これは点のないテスト文です"),
        ("Spanish", "Corrige la gramática:", "esta es una oración de prueba sin punto"),
    ]

    print("\nStarting MT0 Tests:\n")
    for lang, prompt_text, input_text in test_cases:
        full_input = f"{prompt_text} {input_text}"
        inputs = tokenizer(full_input, return_tensors="pt")
        outputs = model.generate(inputs.input_ids, max_length=128)
        corrected = tokenizer.decode(outputs[0], skip_special_tokens=True)
        print(f"[{lang}]")
        print(f"Input: {full_input}")
        print(f"Output: {corrected}")
        print("-" * 20)

if __name__ == "__main__":
    test_mt0()

34
test_punctuation.py Normal file

@@ -0,0 +1,34 @@
import sys
import os

# Add src to path
sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))

from src.core.grammar_assistant import GrammarAssistant

def test_punctuation():
    assistant = GrammarAssistant()
    assistant.load_model()

    samples = [
        # User's example (verbatim)
        "If the voice recognition doesn't recognize that I like stopped Or something would that would it also correct that",
        # Generic run-on
        "hello how are you doing today i am doing fine thanks for asking",
        # Missing commas/periods
        "well i think its valid however we should probably check the logs first"
    ]

    print("\nStarting Punctuation Tests:\n")
    for sample in samples:
        print(f"Original: {sample}")
        corrected = assistant.correct(sample)
        print(f"Corrected: {corrected}")
        print("-" * 20)

if __name__ == "__main__":
    test_punctuation()