17 Commits

Author SHA1 Message Date
Your Name
3137770742 Release v1.0.4: The Compatibility Update
- Added robust CPU Fallback for AMD/Non-CUDA GPUs.
- Implemented Lazy Load for AI Engine to prevent startup crashes.
- Added explicit DLL injection for Cublas/Cudnn on Windows.
- Added Corrupt Model Auto-Repair logic.
- Includes pre-compiled v1.0.4 executable.
2026-01-25 20:28:01 +02:00
Your Name
aed489dd23 Docs: Detailed explanation of Low VRAM Mode and Style Prompting 2026-01-25 13:52:10 +02:00
Your Name
e23c492360 Docs: Add RELEASE_NOTES.md for v1.0.2 2026-01-25 13:46:48 +02:00
Your Name
84f10092e9 Release v1.0.2: Implemented Style Prompting & Removed Grammar Correction
- Removed M2M100 Grammar Correction model completely to reduce bloat/complexity.
- Implemented 'Style Prompting' in Settings -> AI Engine to handle punctuation natively via Whisper.
- Added Style Presets: Standard (Default), Casual, and Custom.
- Optimized Build: Bootstrapper no longer requires transformers/sentencepiece.
- Fixed 'torch' NameError in Low VRAM mode.
- Fixed Bootstrapper missing dependency detection.
- Updated UI to reflect removed features.
- Included compiled v1.0.2 Executable in dist/.
2026-01-25 13:42:06 +02:00
Your Name
03f46ee1e3 Docs: Final polish - Enshittification manifesto and structural refinement 2026-01-24 19:21:01 +02:00
Your Name
0f1bf5f1af Docs: Final polish - 6-col language table and refined manifesto 2026-01-24 19:12:08 +02:00
Your Name
0b2b5848e2 Fix: Translation Reliability, Click-Through, and Docs Sync
- Transcriber: Enforced 'beam_size=5' and prompt injection for robust translation.
- Transcriber: Removed conditioning on previous text to prevent language stickiness.
- Transcriber: Refactored kwargs to sanitize inputs.
- Overlay: Fixed click-through by toggling WS_EX_TRANSPARENT.
- UI: Added real download progress reporting.
- Docs: Refactored language list to table.
2026-01-24 19:05:43 +02:00
Your Name
f3bf7541cf Docs: Detailed expansion of README with Translation features and open layout 2026-01-24 18:33:22 +02:00
Your Name
4b84a27a67 v1.0.1 Feature Update and Polish
Full Changelog:

[New Features]
- Added Native Translation Mode:
  - Whisper model now fully supports Translating any language to English
  - Added 'task' and 'language' parameters to Transcriber core
- Dual Hotkey Support:
  - Added separate Global Hotkeys for Transcribe (default F8) and Translate (default F10)
  - Both hotkeys are fully customizable in Settings
  - Engine dynamically switches modes based on which key is pressed

[UI/UX Improvements]
- Settings Window:
  - Widened Hotkey Input fields (240px) to accommodate long combinations
  - Added Pretty-Printing for hotkey sequences (e.g. 'ctrl+f9' display as 'Ctrl + F9')
  - Replaced Country Code dropdown with Full Language Names (99+ languages)
  - Made Language Dropdown scrollable (max height 300px) to prevent screen overflow
  - Removed redundant 'Task' selector (replaced by dedicated hotkeys)
- System Tray:
  - Tooltip now displays both Transcribe and Translate hotkeys
  - Tooltip hotkeys are formatted readably

[Core & Performance]
- Bootstrapper:
  - Implemented Smart Incremental Sync
  - Now checks filesize and content hash before copying files
  - Drastically reduces startup time for subsequent runs
  - Preserves user settings.json during updates
- Backend:
  - Fixed HotkeyManager to support dynamic configuration keys
  - Fixed Language Lock: selecting a language now correctly forces the model to use it
  - Refactored bridge/main connection for language list handling
2026-01-24 18:29:10 +02:00
Your Name
f184eb0037 Fix: Invisible overlay blocking mouse clicks
Problem:
The overlay window, even when fully transparent or visually hidden (opacity 0), was still intercepting mouse events. This created a 'dead zone' on the screen where users could not click through to applications behind the overlay. This occurred because the low-level window hook was answering 'HTCAPTION' to hit tests regardless of the UI state.

Solution:
1. Modified 'WindowHook' to accept an 'enabled' state.
2. When disabled, 'WM_NCHITTEST' now returns 'HTTRANSPARENT', allowing the OS to pass the click to the window underneath.
3. Updated 'main.py' to toggle this hook state dynamically:
   - ENABLED when Recording or Processing (UI is visible/active).
   - DISABLED when Idling (UI is hidden/transparent).

Result:
The overlay is now completely non-intrusive when not in use.
2026-01-24 17:51:23 +02:00
Your Name
306bd075ed Aesthetic overhaul of documentation 2026-01-24 17:29:59 +02:00
Your Name
a1cc9c61b9 Add language list and file transcription info 2026-01-24 17:27:54 +02:00
Your Name
e627e1b8aa Correct hardware detection statement in docs 2026-01-24 17:24:56 +02:00
Your Name
eaa572b42f Fix release badge for Gitea 2026-01-24 17:22:14 +02:00
Your Name
e900201214 Final documentation polish 2026-01-24 17:20:22 +02:00
Your Name
0d426aea4b Update docs with license and model stats 2026-01-24 17:16:53 +02:00
Your Name
b15ce8076f Enhance documentation 2026-01-24 17:12:21 +02:00
21 changed files with 1117 additions and 188 deletions

205
README.md
View File

@@ -1,71 +1,190 @@
# Whisper Voice
<div align="center">
**Reclaim Your Voice from the Cloud.**
# 🎙️ W H I S P E R &nbsp; V O I C E
### SOVEREIGN SPEECH RECOGNITION
Whisper Voice is a high-performance, strictly local speech-to-text tool designed for the desktop. It provides instant, high-accuracy dictation anywhere on your system—no internet connection required, no corporate servers, and absolutely no data harvesting.
<br>
We believe that the tools of production—and communication—should belong to the individual, not rented from centralized tech giants.
![Status](https://img.shields.io/badge/STATUS-OPERATIONAL-success?style=for-the-badge&logo=server&color=2ecc71)
[![Download](https://img.shields.io/gitea/v/release/lashman/whisper_voice?gitea_url=https%3A%2F%2Fgit.lashman.live&label=Install&style=for-the-badge&logo=windows&logoColor=white&color=3b82f6)](https://git.lashman.live/lashman/whisper_voice/releases/latest)
[![License](https://img.shields.io/badge/LICENSE-PUBLIC_DOMAIN-lightgrey?style=for-the-badge&logo=creative-commons&logoColor=black)](https://creativecommons.org/publicdomain/zero/1.0/)
<br>
> *"The master's tools will never dismantle the master's house."*
> <br>
> **Build your own tools. Run them locally. Free your mind.**
[View Source](https://git.lashman.live/lashman/whisper_voice) • [Report Issue](https://git.lashman.live/lashman/whisper_voice/issues)
</div>
<br>
<br>
## 📡 The Transmission
We are witnessing the **enshittification** of the digital world. What were once vibrant social commons are being walled off, strip-mined for data, and degraded into rent-seeking silos. Your voice is no longer your own; it is a training set for a corporate oracle that charges you for the privilege of listening.
**Whisper Voice** is a small act of sabotage against this trend.
It is built on the axiom of **Technological Sovereignty**. By moving state-of-the-art inference from the server farms to your own silicon, you reclaim the means of digital production. No telemetry. No subscriptions. No "cloud processing" that eavesdrops on your intent.
---
## ✊ Core Principles
## ⚡ The Engine
### 1. Total Autonomy (Local-First)
Your voice data is yours alone. Unlike commercial alternatives that siphon your words to remote data centers for processing and profiling, Whisper Voice runs entirely on your hardware. **No masters, no servers.** You retain full sovereignty over your digital footprint.
Whisper Voice operates directly on the metal. It is not an API wrapper; it is an autonomous machine.
### 2. Decentralized Power
By leveraging optimized local processing, we strip away the need for reliance on massive, energy-hungry corporate infrastructure. This is technology scaled to the human level—powerful, efficient, and completely under your control.
| Component | Technology | Benefit |
| :--- | :--- | :--- |
| **Inference Core** | **Faster-Whisper** | Hyper-optimized C++ implementation via **CTranslate2**. Delivers **4x velocity** over standard PyTorch. |
| **Compression** | **INT8 quantization** | Enables Pro-grade models (`Large-v3`) to run on consumer-grade GPUs, democratizing elite AI. |
| **Sensory Gate** | **Silero VAD** | Enterprise-grade Voice Activity Detection filters out the noise, ensuring only pure intent is processed. |
| **Interface** | **Qt 6 / QML** | Hardware-accelerated, glassmorphic UI that is fluid, responsive, and sovereign. |
### 3. Accessible to All
High-quality speech recognition shouldn't be gated behind subscriptions or paywalls. This tool is free, open, and built to empower users to interact with their machines on their own terms.
### 🛑 Compatibility Matrix (Windows)
The core engine (`CTranslate2`) is heavily optimized for Nvidia tensor cores.
| Manufacturer | Hardware | Status | Notes |
| :--- | :--- | :--- | :--- |
| **Nvidia** | GTX 900+ / RTX | ✅ **Supported** | Full heavy-metal acceleration. |
| **AMD** | Radeon RX | ⚠️ **CPU Fallback** | Runs on CPU. Valid for `Small/Medium`, slow for `Large`. |
| **Intel** | Arc / Iris | ⚠️ **CPU Fallback** | Runs on CPU. Valid for `Small/Medium`, slow for `Large`. |
| **Apple** | M1 / M2 / M3 | ❌ **Unsupported** | Release is strictly Windows x64. |
> **AMD Users**: v1.0.3 auto-detects GPU failures and silently falls back to CPU.
<br>
## 🖋️ Universal Transcription
At its core, Whisper Voice is the ultimate bridge between thought and text. It listens with superhuman precision, converting spoken word into written form across **99 languages**.
* **Punctuation Mastery**: Automatically handles capitalization and complex punctuation formatting.
* **Contextual Intelligence**: Smarter than standard dictation; it understands the flow of sentences to resolve homophones and technical jargon ($1.5k vs "fifteen hundred dollars").
* **Total Privacy**: Your private dictation, legal notes, or creative writing never leave your RAM.
### Workflow: `F9 (Default)`
The primary channel for native-language transcription. It transcribes precisely what it hears in the language you speak (or the one you've locked in Settings).
### ✨ Style Prompting (New in v1.0.2)
Whisper Voice replaces traditional "grammar correction models" with a native **Style Prompting** engine. By injecting a specific "pre-prompt" into the model's context window, we can guide its internal style without external post-processing.
* **Standard (Default)**: Forces the model to use full sentences, proper capitalization, and periods. Ideal for dictation.
* **Casual**: Encourages a relaxed, lowercase style (e.g., "no way that's crazy lol").
* **Custom**: Allows you to seed the model with your own context (e.g., "Here is a list of medical terms:").
This approach incurs **zero latency penalty** and **zero extra VRAM** usage.
<br>
## 🌎 Universal Translation
Whisper Voice v1.0.1 includes a **Neural Translation Engine** that allows you to bridge any linguistic gap instantly.
* **Input**: Speak in French, Japanese, Russian, or **96 other languages**.
* **Output**: The engine instantly reconstructs the semantic meaning into fluent **English**.
* **Task Protocol**: Handled via the dedicated `F10` channel.
### 🔍 Why only English translation?
A common question arises: *Why can't I translate from French to Japanese?*
The architecture of the underlying Whisper model is a **Many-to-English** design. During its massive training phase (680,000 hours of audio), the translation task was specifically optimized to map the global linguistic commons onto a single bridge language: **English**. This allowed the model to reach incredible levels of semantic understanding without the exponential complexity of a "Many-to-Many" mapping.
By focusing its translation decoder solely on English, Whisper achieves "Zero-Shot" quality that rivals specialized translation engines while remaining lightweight enough to run on your local GPU.
---
## ✨ Features
## 🕹️ Command & Control
* **100% Offline Processing**: Once the recognition engine is downloaded, the cable can be cut. Nothing leaves your machine.
* **Universal Compatibility**: Works in any text field—editors, chat apps, terminals, or browsers. If you can type there, you can speak there.
* **Adaptive Input**:
* *Clipboard Mode*: Standard paste injection.
* *High-Speed Simulation*: Simulates keystrokes at supersonic speeds (up to 6000 CPM) for apps that block pasting.
* **System Integration**: Minimalist overlay and system tray presence. It exists when you need it and vanishes when you don't.
* **Resource Efficiency**: Optimized to run smoothly on consumer hardware without monopolizing your system resources.
### Global Hotkeys
The agent runs silently in the background, waiting for your signal.
* **Transcribe (F9)**: Opens the channel for standard speech-to-text.
* **Translate (F10)**: Opens the channel for neural translation.
* **Customization**: Remap these keys in Settings. The recorder supports complex chords (e.g. `Ctrl + Alt + Space`) to fit your workflow.
### Injection Protocols
* **Clipboard Paste**: Standard text injection. Instant, reliable.
* **Simulate Typing**: Mimics physical keystrokes at superhuman speed (6000 CPM). Bypasses anti-paste restrictions and "protected" windows.
<br>
## 📊 Intelligence Matrix
Select the model that aligns with your available resources.
| Model | VRAM (GPU) | RAM (CPU) | Designation | Capability |
| :--- | :--- | :--- | :--- | :--- |
| `Tiny` | **~500 MB** | ~1 GB | ⚡ **Supersonic** | Command & Control, older hardware. |
| `Base` | **~600 MB** | ~1 GB | 🚀 **Very Fast** | Daily driver for low-power laptops. |
| `Small` | **~1 GB** | ~2 GB | ⏩ **Fast** | High accuracy English dictation. |
| `Medium` | **~2 GB** | ~4 GB | ⚖️ **Balanced** | Complex vocabulary, foreign accents. |
| `Large-v3 Turbo` | **~4 GB** | ~6 GB | ✨ **Optimal** | **The Sweet Spot.** Near-Large intelligence, Medium speed. |
| `Large-v3` | **~5 GB** | ~8 GB | 🧠 **Maximum** | Professional grade. Uncompromised. |
> *Note: Acceleration requires you to manually select your Compute Device (CUDA GPU or CPU) in Settings.*
### 📉 Low VRAM Mode
For users with limited GPU memory (e.g., 4GB cards) or those running heavy games simultaneously, Whisper Voice offers a specialized **Low VRAM Mode**.
* **Behavior**: The AI model is aggressively unloaded from the GPU immediately after every transcription.
* **Benefit**: When idle, the app consumes near-zero VRAM (~0MB), leaving your GPU completely free for gaming or rendering.
* **Trade-off**: There is a "cold start" latency of 1-2 seconds for every voice command as the model reloads from the disk cache.
---
## 🚀 Getting Started
## 🛠️ Deployment
### Installation
1. Download the latest release.
2. Run `WhisperVoice.exe`.
3. On the first run, the bootstrapper will autonomously provision the necessary runtime environment. This ensures your system remains clean and dependencies are self-contained.
### 📥 Installation
1. **Acquire**: Download `WhisperVoice.exe` from [Releases](https://git.lashman.live/lashman/whisper_voice/releases).
2. **Deploy**: Place it anywhere. It is portable.
3. **Bootstrap**: Run it. The agent will self-provision an isolated Python runtime (~2GB) on first launch.
4. **Sync**: Future updates are handled by the **Smart Bootstrapper**, which surgically updates only changed files, respecting your bandwidth and your settings.
### Usage
1. **Set Your Trigger**: Configure a global hotkey (default: `F9`) in the settings.
2. **Speak Freely**: Hold the hotkey (or toggle it) and speak.
3. **Direct Action**: Your words are instantly transcribed and injected into your active window.
### 🔧 Troubleshooting
* **App crashes on start**: Ensure you have [Microsoft Visual C++ Redistributable 2015-2022](https://learn.microsoft.com/en-us/cpp/windows/latest-supported-vc-redist) installed.
* **"Simulate Typing" is slow**: Some applications (remote desktops, legacy games) cannot handle the data stream. Lower the typing speed in Settings to ~1200 CPM.
* **No Audio**: The agent listens to the **Default Communication Device**. Verify your Windows Sound Control Panel.
<br>
---
## ⚙️ Configuration
## 🌐 Supported Languages
The **Settings** panel puts the means of configuration in your hands:
The engine understands the following 99 languages. You can lock the focus to a specific language in Settings to improve accuracy, or rely on **Auto-Detect** for fluid multilingual usage.
* **Recognition Engine**: Choose the size of the model that fits your hardware capabilities (Tiny to Large). Larger models offer greater precision but require more computing power.
* **Input Method**: Switch between "Clipboard Paste" and "Simulate Typing" depending on target application restrictions.
* **Typing Speed**: Adjust the keystroke injection rate. Crank it up to 6000 CPM for instant text delivery.
* **Run on Startup**: Configure the agent to be ready the moment your session begins.
| | | | | | |
| :--- | :--- | :--- | :--- | :--- | :--- |
| Afrikaans 🇿🇦 | Albanian 🇦🇱 | Amharic 🇪🇹 | Arabic 🇸🇦 | Armenian 🇦🇲 | Assamese 🇮🇳 |
| Azerbaijani 🇦🇿 | Bashkir 🇷🇺 | Basque 🇪🇸 | Belarusian 🇧🇾 | Bengali 🇧🇩 | Bosnian 🇧🇦 |
| Breton 🇫🇷 | Bulgarian 🇧🇬 | Burmese 🇲🇲 | Castilian 🇪🇸 | Catalan 🇪🇸 | Chinese 🇨🇳 |
| Croatian 🇭🇷 | Czech 🇨🇿 | Danish 🇩🇰 | Dutch 🇳🇱 | English 🇺🇸 | Estonian 🇪🇪 |
| Faroese 🇫🇴 | Finnish 🇫🇮 | Flemish 🇧🇪 | French 🇫🇷 | Galician 🇪🇸 | Georgian 🇬🇪 |
| German 🇩🇪 | Greek 🇬🇷 | Gujarati 🇮🇳 | Haitian 🇭🇹 | Hausa 🇳🇬 | Hawaiian 🇺🇸 |
| Hebrew 🇮🇱 | Hindi 🇮🇳 | Hungarian 🇭🇺 | Icelandic 🇮🇸 | Indonesian 🇮🇩 | Italian 🇮🇹 |
| Japanese 🇯🇵 | Javanese 🇮 Indonesa | Kannada 🇮🇳 | Kazakh 🇰🇿 | Khmer 🇰🇭 | Korean 🇰🇷 |
| Lao 🇱🇦 | Latin 🇻🇦 | Latvian 🇱🇻 | Lingala 🇨🇩 | Lithuanian 🇱🇹 | Luxembourgish 🇱🇺 |
| Macedonian 🇲🇰 | Malagasy 🇲🇬 | Malay 🇲🇾 | Malayalam 🇮🇳 | Maltese 🇲🇹 | Maori 🇳🇿 |
| Marathi 🇮🇳 | Moldavian 🇲🇩 | Mongolian 🇲🇳 | Myanmar 🇲🇲 | Nepali 🇳🇵 | Norwegian 🇳🇴 |
| Occitan 🇫🇷 | Panjabi 🇮🇳 | Pashto 🇦🇫 | Persian 🇮🇷 | Polish 🇵🇱 | Portuguese 🇵🇹 |
| Punjabi 🇮🇳 | Romanian 🇷🇴 | Russian 🇷🇺 | Sanskrit 🇮🇳 | Serbian 🇷🇸 | Shona 🇿🇼 |
| Sindhi 🇵🇰 | Sinhala 🇱🇰 | Slovak 🇸🇰 | Slovenian 🇸🇮 | Somali 🇸🇴 | Spanish 🇪🇸 |
| Sundanese 🇮🇩 | Swahili 🇰🇪 | Swedish 🇸🇪 | Tagalog 🇵🇭 | Tajik 🇹🇯 | Tamil 🇮🇳 |
| Tatar 🇷🇺 | Telugu 🇮🇳 | Thai 🇹🇭 | Tibetan 🇨🇳 | Turkish 🇹🇷 | Turkmen 🇹🇲 |
| Ukrainian 🇺🇦 | Urdu 🇵🇰 | Uzbek 🇺🇿 | Vietnamese 🇻e | Welsh 🏴󠁧󠁢󠁷󠁬󠁳󠁿 | Yiddish 🇮🇱 |
| Yoruba 🇳🇬 | | | | | |
---
<br>
<br>
## 🤝 Mutual Aid
<div align="center">
This project thrives on community collaboration. If you have improvements, fixes, or ideas, you are encouraged to contribute. We build better systems when we build them together, horizontally and transparently.
### ⚖️ PUBLIC DOMAIN (CC0 1.0)
*No Rights Reserved. No Gods. No Masters. No Managers.*
* **Report Issues**: If something breaks, let us know.
* **Contribute Code**: The source is open. Fork it, improve it, share it.
Credit to **OpenAI** (Whisper), **Systran** (Faster-Whisper), and **Silero** (VAD).
---
*Built with local processing libraries and Qt.*
*No gods, no cloud managers.*
</div>

28
RELEASE_NOTES.md Normal file
View File

@@ -0,0 +1,28 @@
# Release v1.0.4
**"The Compatibility Update"**
This release focuses on maximum stability across different hardware configurations (AMD, Intel, Nvidia) and fixing startup crashes related to corrupted models or missing drivers.
## 🛠️ Critical Fixes
### 1. Robust CPU Fallback (AMD / Intel Support)
* **Problem**: Previously, if an AMD user tried to run the app, it would crash instantly because it tried to load Nvidia CUDA libraries by default.
* **Fix**: The app now **silently detects** if CUDA initialization fails (due to missing DLLs or incompatible hardware) and **automatically falls back to CPU mode**.
* **Result**: The app "just works" on any Windows machine, regardless of GPU.
### 2. Startup Crash Protection
* **Problem**: If `faster_whisper` was imported before checking for valid drivers, the app would crash on launch for some users.
* **Fix**: Implemented **Lazy Loading** for the AI engine. The app now starts the UI first, and only loads the heavy AI libraries inside a safety block that catches errors.
### 3. Corrupt Model Auto-Repair
* **Problem**: Interrupted downloads could leave a corrupted model folder, preventing the app from ever starting again.
* **Fix**: If the app detects a "vocabulary missing" or invalid config error, it will now **automatically delete the corrupt folder** and allow you to re-download it cleanly.
### 4. Windows DLL Injection
* **Fix**: Added explicit DLL path injection for `nvidia-cublas` and `nvidia-cudnn` to ensure Python 3.8+ can find the required CUDA libraries on Windows systems that don't have them in PATH.
## 📦 Installation
1. Download `WhisperVoice.exe` below.
2. Replace your existing `.exe`.
3. Run it.

View File

@@ -259,48 +259,72 @@ class Bootstrapper:
process.wait()
def refresh_app_source(self):
"""Refresh app source files. Skips if already exists to save time."""
# Optimization: If app/main.py exists, skip update to improve startup speed.
# The user can delete the 'runtime' folder to force an update.
if (self.app_path / "main.py").exists():
log("App already exists. Skipping update.")
return True
if self.ui: self.ui.set_status("Updating app files...")
"""
Smartly updates app source files by only copying changed files.
Preserves user settings and reduces disk I/O.
"""
if self.ui: self.ui.set_status("Checking for updates...")
try:
# Preserve settings.json if it exists
settings_path = self.app_path / "settings.json"
temp_settings = None
if settings_path.exists():
try:
temp_settings = settings_path.read_bytes()
except:
log("Failed to backup settings.json, it involves risk of data loss.")
# 1. Ensure destination exists
if not self.app_path.exists():
self.app_path.mkdir(parents=True, exist_ok=True)
if self.app_path.exists():
shutil.rmtree(self.app_path, ignore_errors=True)
# 2. Walk source and sync
# source_path is the temporary bundled folder
# app_path is the persistent runtime folder
shutil.copytree(
self.source_path,
self.app_path,
ignore=shutil.ignore_patterns(
'__pycache__', '*.pyc', '.git', 'venv',
'build', 'dist', '*.egg-info', 'runtime'
)
)
changes_made = 0
# Restore settings.json
if temp_settings:
try:
settings_path.write_bytes(temp_settings)
log("Restored settings.json")
except:
log("Failed to restore settings.json")
for src_dir, dirs, files in os.walk(self.source_path):
# Determine relative path from source root
rel_path = Path(src_dir).relative_to(self.source_path)
dst_dir = self.app_path / rel_path
# Ensure directory exists
if not dst_dir.exists():
dst_dir.mkdir(parents=True, exist_ok=True)
for file in files:
# Skip ignored files
if file in ['__pycache__', '.git', 'settings.json'] or file.endswith('.pyc'):
continue
src_file = Path(src_dir) / file
dst_file = dst_dir / file
# Check if update needed
should_copy = False
if not dst_file.exists():
should_copy = True
else:
# Compare size first (fast)
if src_file.stat().st_size != dst_file.stat().st_size:
should_copy = True
else:
# Compare content (slower but accurate)
# Only read if size matches to verify diff
if src_file.read_bytes() != dst_file.read_bytes():
should_copy = True
if should_copy:
shutil.copy2(src_file, dst_file)
changes_made += 1
if self.ui: self.ui.set_detail(f"Updated: {file}")
# 3. Cleanup logic (Optional: remove files in dest that are not in source)
# For now, we only add/update to prevent deleting generated user files (logs, etc)
if changes_made > 0:
log(f"Update complete. {changes_made} files changed.")
else:
log("App is up to date.")
return True
except Exception as e:
log(f"Error refreshing app source: {e}")
# Fallback to nuclear option if sync fails completely?
# No, 'smart_sync' failing might mean permissions, nuclear wouldn't help.
return False
def run_app(self):
@@ -323,11 +347,17 @@ class Bootstrapper:
messagebox.showerror("WhisperVoice Error", f"Failed to launch app: {e}")
return False
def check_dependencies(self):
"""Quick check if critical dependencies are installed."""
return True # Deprecated logic placeholder
def setup_and_run(self):
"""Full setup/update and run flow."""
try:
# 1. Ensure basics
if not self.is_python_ready():
self.download_python()
self._fix_pth_file() # Ensure pth is fixed immediately after download
self.install_pip()
self.install_packages()
@@ -338,7 +368,10 @@ class Bootstrapper:
if self.run_app():
if self.ui: self.ui.root.quit()
except Exception as e:
messagebox.showerror("Setup Error", f"Installation failed: {e}")
if self.ui:
import tkinter.messagebox as mb
mb.showerror("Setup Error", f"Installation failed: {e}") # Improved error visibility
log(f"Fatal error: {e}")
import traceback
traceback.print_exc()

BIN
dist/WhisperVoice.exe vendored Normal file

Binary file not shown.

206
main.py
View File

@@ -9,6 +9,31 @@ app_dir = os.path.dirname(os.path.abspath(__file__))
if app_dir not in sys.path:
sys.path.insert(0, app_dir)
# -----------------------------------------------------------------------------
# WINDOWS DLL FIX (CRITICAL for Portable CUDA)
# Python 3.8+ on Windows requires explicit DLL directory addition.
# -----------------------------------------------------------------------------
if os.name == 'nt' and hasattr(os, 'add_dll_directory'):
try:
from pathlib import Path
# Scan sys.path for site-packages
for p in sys.path:
path_obj = Path(p)
if path_obj.name == 'site-packages' and path_obj.exists():
nvidia_path = path_obj / "nvidia"
if nvidia_path.exists():
for subdir in nvidia_path.iterdir():
# Add 'bin' folder from each nvidia stub (cublas, cudnn, etc.)
bin_path = subdir / "bin"
if bin_path.exists():
os.add_dll_directory(str(bin_path))
# Also try adding site-packages itself just in case
# os.add_dll_directory(str(path_obj))
break
except Exception:
pass
# -----------------------------------------------------------------------------
from PySide6.QtWidgets import QApplication, QFileDialog, QMessageBox
from PySide6.QtCore import QObject, Slot, Signal, QThread, Qt, QUrl
from PySide6.QtQml import QQmlApplicationEngine
@@ -87,7 +112,7 @@ def _silent_shutdown_hook(exc_type, exc_value, exc_tb):
sys.excepthook = _silent_shutdown_hook
class DownloadWorker(QThread):
"""Background worker for model downloads."""
"""Background worker for model downloads with REAL progress."""
progress = Signal(int)
finished = Signal()
error = Signal(str)
@@ -98,33 +123,81 @@ class DownloadWorker(QThread):
def run(self):
try:
from faster_whisper import download_model
import requests
from tqdm import tqdm
model_path = get_models_path()
# Download to a specific subdirectory to keep things clean and predictable
# This matches the logic in transcriber.py which looks for this specific path
# Determine what to download
dest_dir = model_path / f"faster-whisper-{self.model_name}"
logging.info(f"Downloading Model '{self.model_name}' to {dest_dir}...")
repo_id = f"Systran/faster-whisper-{self.model_name}"
files = ["config.json", "model.bin", "tokenizer.json", "vocabulary.json"]
base_url = f"https://huggingface.co/{repo_id}/resolve/main"
# Ensure parent exists
model_path.mkdir(parents=True, exist_ok=True)
dest_dir.mkdir(parents=True, exist_ok=True)
logging.info(f"Downloading {self.model_name} to {dest_dir}...")
# output_dir in download_model specifies where the model files are saved
download_model(self.model_name, output_dir=str(dest_dir))
# 1. Calculate Total Size
total_size = 0
file_sizes = {}
with requests.Session() as s:
for fname in files:
url = f"{base_url}/{fname}"
head = s.head(url, allow_redirects=True)
if head.status_code == 200:
size = int(head.headers.get('content-length', 0))
file_sizes[fname] = size
total_size += size
else:
# Fallback for vocabulary.json vs vocabulary.txt
if fname == "vocabulary.json":
# Try .txt? Or just skip if not found?
# Faster-whisper usually has vocabulary.json
pass
# 2. Download loop
downloaded_bytes = 0
with requests.Session() as s:
for fname in files:
if fname not in file_sizes: continue
url = f"{base_url}/{fname}"
dest_file = dest_dir / fname
# Resume check?
# Simpler to just overwrite for reliability unless we want complex resume logic.
# We'll overwrite.
resp = s.get(url, stream=True)
resp.raise_for_status()
with open(dest_file, 'wb') as f:
for chunk in resp.iter_content(chunk_size=8192):
if chunk:
f.write(chunk)
downloaded_bytes += len(chunk)
# Emit Progress
if total_size > 0:
pct = int((downloaded_bytes / total_size) * 100)
self.progress.emit(pct)
self.finished.emit()
except Exception as e:
logging.error(f"Download failed: {e}")
self.error.emit(str(e))
class TranscriptionWorker(QThread):
finished = Signal(str)
def __init__(self, transcriber, audio_data, is_file=False, parent=None):
def __init__(self, transcriber, audio_data, is_file=False, parent=None, task_override=None):
super().__init__(parent)
self.transcriber = transcriber
self.audio_data = audio_data
self.is_file = is_file
self.task_override = task_override
def run(self):
text = self.transcriber.transcribe(self.audio_data, is_file=self.is_file)
text = self.transcriber.transcribe(self.audio_data, is_file=self.is_file, task=self.task_override)
self.finished.emit(text)
class WhisperApp(QObject):
@@ -166,13 +239,18 @@ class WhisperApp(QObject):
self.tray.transcribe_file_requested.connect(self.transcribe_file)
# Init Tooltip
hotkey = self.config.get("hotkey")
self.tray.setToolTip(f"Whisper Voice - Press {hotkey} to Record")
from src.utils.formatters import format_hotkey
self.format_hotkey = format_hotkey # Store ref
hk1 = self.format_hotkey(self.config.get("hotkey"))
hk2 = self.format_hotkey(self.config.get("hotkey_translate"))
self.tray.setToolTip(f"Whisper Voice\nTranscribe: {hk1}\nTranslate: {hk2}")
# 3. Logic Components Placeholders
self.audio_engine = None
self.transcriber = None
self.hotkey_manager = None
self.hk_transcribe = None
self.hk_translate = None
self.overlay_root = None
# 4. Start Loader
@@ -222,12 +300,23 @@ class WhisperApp(QObject):
self.settings_root.setVisible(False)
# Install Low-Level Window Hook for Transparent Hit Test
# We must keep a reference to 'self.hook' so it isn't GC'd
# scale = self.overlay_root.devicePixelRatio()
# self.hook = WindowHook(int(self.overlay_root.winId()), 500, 300, scale)
# self.hook.install()
try:
from src.utils.window_hook import WindowHook
hwnd = self.overlay_root.winId()
# Initial scale from config
scale = float(self.config.get("ui_scale"))
# NOTE: HitTest hook will be installed here later
# Current Overlay Dimensions
win_w = int(460 * scale)
win_h = int(180 * scale)
self.window_hook = WindowHook(hwnd, win_w, win_h, initial_scale=scale)
self.window_hook.install()
# Initial state: Disabled because we start inactive
self.window_hook.set_enabled(False)
except Exception as e:
logging.error(f"Failed to install WindowHook: {e}")
def center_overlay(self):
"""Calculates and sets the Overlay position above the taskbar."""
@@ -255,9 +344,16 @@ class WhisperApp(QObject):
self.audio_engine.set_visualizer_callback(self.bridge.update_amplitude)
self.audio_engine.set_silence_callback(self.on_silence_detected)
self.transcriber = WhisperTranscriber()
self.hotkey_manager = HotkeyManager()
self.hotkey_manager.triggered.connect(self.toggle_recording)
self.hotkey_manager.start()
# Dual Hotkey Managers
self.hk_transcribe = HotkeyManager(config_key="hotkey")
self.hk_transcribe.triggered.connect(lambda: self.toggle_recording(task_override="transcribe"))
self.hk_transcribe.start()
self.hk_translate = HotkeyManager(config_key="hotkey_translate")
self.hk_translate.triggered.connect(lambda: self.toggle_recording(task_override="translate"))
self.hk_translate.start()
self.bridge.update_status("Ready")
def run(self):
@@ -275,7 +371,8 @@ class WhisperApp(QObject):
except: pass
self.bridge.stats_worker.stop()
if self.hotkey_manager: self.hotkey_manager.stop()
if self.hk_transcribe: self.hk_transcribe.stop()
if self.hk_translate: self.hk_translate.stop()
# Close all QML windows to ensure bindings stop before Python objects die
if self.overlay_root:
@@ -350,10 +447,14 @@ class WhisperApp(QObject):
print(f"Setting Changed: {key} = {value}")
# 1. Hotkey Reload
if key == "hotkey":
if self.hotkey_manager: self.hotkey_manager.reload_hotkey()
if key in ["hotkey", "hotkey_translate"]:
if self.hk_transcribe: self.hk_transcribe.reload_hotkey()
if self.hk_translate: self.hk_translate.reload_hotkey()
if self.tray:
self.tray.setToolTip(f"Whisper Voice - Press {value} to Record")
hk1 = self.format_hotkey(self.config.get("hotkey"))
hk2 = self.format_hotkey(self.config.get("hotkey_translate"))
self.tray.setToolTip(f"Whisper Voice\nTranscribe: {hk1}\nTranslate: {hk2}")
# 2. AI Model Reload (Heavy)
if key in ["model_size", "compute_device", "compute_type"]:
@@ -456,6 +557,8 @@ class WhisperApp(QObject):
file_path, _ = QFileDialog.getOpenFileName(None, "Select Audio", "", "Audio (*.mp3 *.wav *.flac *.m4a *.ogg)")
if file_path:
self.bridge.update_status("Thinking...")
# Files use the default configured task usually, or we could ask?
# Default to config setting for files.
self.worker = TranscriptionWorker(self.transcriber, file_path, is_file=True, parent=self)
self.worker.finished.connect(self.on_transcription_done)
self.worker.start()
@@ -463,10 +566,13 @@ class WhisperApp(QObject):
@Slot()
def on_silence_detected(self):
from PySide6.QtCore import QMetaObject, Qt
# Silence detection always triggers the task that was active?
# Since silence stops recording, it just calls toggle_recording with no arg, using the stored current_task?
# Let's ensure toggle_recording handles no arg calls by stopping the CURRENT task.
QMetaObject.invokeMethod(self, "toggle_recording", Qt.QueuedConnection)
@Slot()
def toggle_recording(self):
@Slot() # Modified to allow lambda override
def toggle_recording(self, task_override=None):
if not self.audio_engine: return
# Prevent starting a new recording while we are still transcribing the last one
@@ -474,23 +580,36 @@ class WhisperApp(QObject):
logging.warning("Ignored toggle request: Transcription in progress.")
return
# Determine which task we are entering
if task_override:
intended_task = task_override
else:
intended_task = self.config.get("task")
if self.audio_engine.recording:
# STOP RECORDING
self.bridge.update_status("Thinking...")
self.bridge.isRecording = False
self.bridge.isProcessing = True # Start Processing
audio_data = self.audio_engine.stop_recording()
self.worker = TranscriptionWorker(self.transcriber, audio_data, parent=self)
# Use the task that started this session, or the override if provided (though usually override is for starting)
final_task = getattr(self, "current_recording_task", self.config.get("task"))
self.worker = TranscriptionWorker(self.transcriber, audio_data, parent=self, task_override=final_task)
self.worker.finished.connect(self.on_transcription_done)
self.worker.start()
else:
self.bridge.update_status("Recording")
# START RECORDING
self.current_recording_task = intended_task
self.bridge.update_status(f"Recording ({intended_task})...")
self.bridge.isRecording = True
self.audio_engine.start_recording()
@Slot(bool)
def on_ui_toggle_request(self, state):
if state != self.audio_engine.recording:
self.toggle_recording()
self.toggle_recording() # Default behavior for UI clicks
@Slot(str)
def on_transcription_done(self, text: str):
@@ -503,8 +622,8 @@ class WhisperApp(QObject):
@Slot(bool)
def on_hotkeys_enabled_toggle(self, state):
if self.hotkey_manager:
self.hotkey_manager.set_enabled(state)
if self.hk_transcribe: self.hk_transcribe.set_enabled(state)
if self.hk_translate: self.hk_translate.set_enabled(state)
@Slot(str)
def on_download_requested(self, size):
@@ -531,6 +650,25 @@ class WhisperApp(QObject):
self.bridge.update_status("Error")
logging.error(f"Download Error: {err}")
@Slot(bool)
def on_ui_toggle_request(self, is_recording):
"""Called when recording state changes."""
# Update Window Hook to allow clicking if active
is_active = is_recording or self.bridge.isProcessing
if hasattr(self, 'window_hook'):
self.window_hook.set_enabled(is_active)
@Slot(bool)
def on_processing_changed(self, is_processing):
is_active = self.bridge.isRecording or is_processing
if hasattr(self, 'window_hook'):
self.window_hook.set_enabled(is_active)
if __name__ == "__main__":
import sys
app = WhisperApp()
app.run()
# Connect extra signal for processing state
app.bridge.isProcessingChanged.connect(app.on_processing_changed)
sys.exit(app.run())

View File

@@ -39,39 +39,36 @@ def build_portable():
print("⏳ This may take 5-10 minutes...")
PyInstaller.__main__.run([
"main.py", # Entry point
"bootstrapper.py", # Entry point (Tiny Installer)
"--name=WhisperVoice", # EXE name
"--onefile", # Single EXE (slower startup but portable)
"--onefile", # Single EXE
"--noconsole", # No terminal window
"--clean", # Clean cache
*add_data_args, # Bundled assets
# Heavy libraries that need special collection
"--collect-all", "faster_whisper",
"--collect-all", "ctranslate2",
"--collect-all", "PySide6",
"--collect-all", "torch",
"--collect-all", "numpy",
# Bundle the app source to be extracted by bootstrapper
# The bootstrapper expects 'app_source' folder in bundled resources
"--add-data", f"src{os.pathsep}app_source/src",
"--add-data", f"main.py{os.pathsep}app_source",
"--add-data", f"requirements.txt{os.pathsep}app_source",
# Hidden imports (modules imported dynamically)
"--hidden-import", "keyboard",
"--hidden-import", "pyperclip",
"--hidden-import", "psutil",
"--hidden-import", "pynvml",
"--hidden-import", "sounddevice",
"--hidden-import", "scipy",
"--hidden-import", "scipy.signal",
"--hidden-import", "huggingface_hub",
"--hidden-import", "tokenizers",
# Add assets
"--add-data", f"src/ui/qml{os.pathsep}app_source/src/ui/qml",
"--add-data", f"assets{os.pathsep}app_source/assets",
# Qt plugins
"--hidden-import", "PySide6.QtQuickControls2",
"--hidden-import", "PySide6.QtQuick.Controls",
# No heavy collections!
# The bootstrapper uses internal pip to install everything.
# Icon (convert to .ico for Windows)
# "--icon=icon.ico", # Uncomment if you have a .ico file
# Exclude heavy modules to ensure this exe stays tiny
"--exclude-module", "faster_whisper",
"--exclude-module", "torch",
"--exclude-module", "PySide6",
# Icon
# "--icon=icon.ico",
])
print("\n" + "="*60)
print("✅ BUILD COMPLETE!")
print("="*60)

73
publish_release.py Normal file
View File

@@ -0,0 +1,73 @@
import os
import requests
import mimetypes
# Configuration
API_URL = "https://git.lashman.live/api/v1"
OWNER = "lashman"
REPO = "whisper_voice"
TAG = "v1.0.4"
TOKEN = "6153890332afff2d725aaf4729bc54b5030d5700" # Extracted from git config
EXE_PATH = r"dist\WhisperVoice.exe"
headers = {
"Authorization": f"token {TOKEN}",
"Accept": "application/json"
}
def create_release():
print(f"Creating release {TAG}...")
# Read Release Notes
with open("RELEASE_NOTES.md", "r", encoding="utf-8") as f:
notes = f.read()
# Create Release
payload = {
"tag_name": TAG,
"name": TAG,
"body": notes,
"draft": False,
"prerelease": False
}
url = f"{API_URL}/repos/{OWNER}/{REPO}/releases"
resp = requests.post(url, json=payload, headers=headers)
if resp.status_code == 201:
print("Release created successfully!")
return resp.json()
elif resp.status_code == 409:
print("Release already exists. Fetching it...")
# Get by tag
resp = requests.get(f"{API_URL}/repos/{OWNER}/{REPO}/releases/tags/{TAG}", headers=headers)
if resp.status_code == 200:
return resp.json()
print(f"Failed to create release: {resp.status_code} - {resp.text}")
return None
def upload_asset(release_id, file_path):
print(f"Uploading asset: {file_path}...")
filename = os.path.basename(file_path)
with open(file_path, "rb") as f:
data = f.read()
url = f"{API_URL}/repos/{OWNER}/{REPO}/releases/{release_id}/assets?name={filename}"
# Gitea API expects raw body
resp = requests.post(url, data=data, headers=headers)
if resp.status_code == 201:
print(f"Uploaded {filename} successfully!")
else:
print(f"Failed to upload asset: {resp.status_code} - {resp.text}")
def main():
release = create_release()
if release:
upload_asset(release["id"], EXE_PATH)
if __name__ == "__main__":
main()

View File

@@ -5,6 +5,7 @@
faster-whisper>=1.0.0
torch>=2.0.0
# UI Framework
PySide6>=6.6.0

View File

@@ -16,6 +16,7 @@ from src.core.paths import get_base_path
# Default Configuration
DEFAULT_SETTINGS = {
"hotkey": "f8",
"hotkey_translate": "f10",
"model_size": "small",
"input_device": None, # Device ID (int) or Name (str), None = Default
"save_recordings": False, # Save .wav files for debugging
@@ -38,13 +39,20 @@ DEFAULT_SETTINGS = {
# AI - Advanced
"language": "auto", # "auto" or ISO code
"task": "transcribe", # "transcribe" or "translate" (to English)
"compute_device": "auto", # "auto", "cuda", "cpu"
"compute_type": "int8", # "int8", "float16", "float32"
"beam_size": 5,
"best_of": 5,
"vad_filter": True,
"no_repeat_ngram_size": 0,
"condition_on_previous_text": True
"condition_on_previous_text": True,
"initial_prompt": "Mm-hmm. Okay, let's go. I speak in full sentences.", # Default: Forces punctuation
# Low VRAM Mode
"unload_models_after_use": False # If True, models are unloaded immediately to free VRAM
}
class ConfigManager:

View File

@@ -30,15 +30,16 @@ class HotkeyManager(QObject):
triggered = Signal()
def __init__(self, hotkey: str = "f8"):
def __init__(self, config_key: str = "hotkey"):
"""
Initialize the HotkeyManager.
Args:
hotkey (str): The global hotkey string description. Default: "f8".
config_key (str): The configuration key to look up (e.g. "hotkey").
"""
super().__init__()
self.hotkey = hotkey
self.config_key = config_key
self.hotkey = "f8" # Placeholder
self.is_listening = False
self._enabled = True
@@ -58,9 +59,9 @@ class HotkeyManager(QObject):
from src.core.config import ConfigManager
config = ConfigManager()
self.hotkey = config.get("hotkey")
self.hotkey = config.get(self.config_key)
logging.info(f"Registering global hotkey: {self.hotkey}")
logging.info(f"Registering global hotkey ({self.config_key}): {self.hotkey}")
try:
# We don't suppress=True here because we want the app to see keys during recording
# (Wait, actually if we are recording we WANT keyboard to see it,

120
src/core/languages.py Normal file
View File

@@ -0,0 +1,120 @@
"""
Supported Languages Module
==========================
Full list of languages supported by OpenAI Whisper.
Maps ISO codes to display names.
"""
LANGUAGES = {
"auto": "Auto Detect",
"af": "Afrikaans",
"sq": "Albanian",
"am": "Amharic",
"ar": "Arabic",
"hy": "Armenian",
"as": "Assamese",
"az": "Azerbaijani",
"ba": "Bashkir",
"eu": "Basque",
"be": "Belarusian",
"bn": "Bengali",
"bs": "Bosnian",
"br": "Breton",
"bg": "Bulgarian",
"my": "Burmese",
"ca": "Catalan",
"zh": "Chinese",
"hr": "Croatian",
"cs": "Czech",
"da": "Danish",
"nl": "Dutch",
"en": "English",
"et": "Estonian",
"fo": "Faroese",
"fi": "Finnish",
"fr": "French",
"gl": "Galician",
"ka": "Georgian",
"de": "German",
"el": "Greek",
"gu": "Gujarati",
"ht": "Haitian",
"ha": "Hausa",
"haw": "Hawaiian",
"he": "Hebrew",
"hi": "Hindi",
"hu": "Hungarian",
"is": "Icelandic",
"id": "Indonesian",
"it": "Italian",
"ja": "Japanese",
"jw": "Javanese",
"kn": "Kannada",
"kk": "Kazakh",
"km": "Khmer",
"ko": "Korean",
"lo": "Lao",
"la": "Latin",
"lv": "Latvian",
"ln": "Lingala",
"lt": "Lithuanian",
"lb": "Luxembourgish",
"mk": "Macedonian",
"mg": "Malagasy",
"ms": "Malay",
"ml": "Malayalam",
"mt": "Maltese",
"mi": "Maori",
"mr": "Marathi",
"mn": "Mongolian",
"ne": "Nepali",
"no": "Norwegian",
"oc": "Occitan",
"pa": "Punjabi",
"ps": "Pashto",
"fa": "Persian",
"pl": "Polish",
"pt": "Portuguese",
"ro": "Romanian",
"ru": "Russian",
"sa": "Sanskrit",
"sr": "Serbian",
"sn": "Shona",
"sd": "Sindhi",
"si": "Sinhala",
"sk": "Slovak",
"sl": "Slovenian",
"so": "Somali",
"es": "Spanish",
"su": "Sundanese",
"sw": "Swahili",
"sv": "Swedish",
"tl": "Tagalog",
"tg": "Tajik",
"ta": "Tamil",
"tt": "Tatar",
"te": "Telugu",
"th": "Thai",
"bo": "Tibetan",
"tr": "Turkish",
"tk": "Turkmen",
"uk": "Ukrainian",
"ur": "Urdu",
"uz": "Uzbek",
"vi": "Vietnamese",
"cy": "Welsh",
"yi": "Yiddish",
"yo": "Yoruba",
}
def get_language_names():
return list(LANGUAGES.values())
def get_code_by_name(name):
for code, lang in LANGUAGES.items():
if lang == name:
return code
return "auto"
def get_name_by_code(code):
return LANGUAGES.get(code, "Auto Detect")

View File

@@ -15,8 +15,13 @@ import numpy as np
from src.core.config import ConfigManager
from src.core.paths import get_models_path
try:
import torch
except ImportError:
torch = None
# Import directly - valid since we are now running in the full environment
from faster_whisper import WhisperModel
class WhisperTranscriber:
"""
@@ -57,6 +62,8 @@ class WhisperTranscriber:
# Force offline if path exists to avoid HF errors
local_only = new_path.exists()
try:
from faster_whisper import WhisperModel
self.model = WhisperModel(
model_input,
device=device,
@@ -64,6 +71,23 @@ class WhisperTranscriber:
download_root=str(get_models_path()),
local_files_only=local_only
)
except Exception as load_err:
# CRITICAL FALLBACK: If CUDA/cublas fails (AMD/Intel users), fallback to CPU
err_str = str(load_err).lower()
if "cublas" in err_str or "cudnn" in err_str or "library" in err_str or "device" in err_str:
logging.warning(f"CUDA Init Failed ({load_err}). Falling back to CPU...")
self.config.set("compute_device", "cpu") # Update config for persistence/UI
self.current_compute_device = "cpu"
self.model = WhisperModel(
model_input,
device="cpu",
compute_type="int8", # CPU usually handles int8 well with newer extensions, or standard
download_root=str(get_models_path()),
local_files_only=local_only
)
else:
raise load_err
self.current_model_size = size
self.current_compute_device = device
@@ -74,41 +98,119 @@ class WhisperTranscriber:
logging.error(f"Failed to load model: {e}")
self.model = None
def transcribe(self, audio_data, is_file: bool = False) -> str:
# Auto-Repair: Detect vocabulary/corrupt errors
err_str = str(e).lower()
if "vocabulary" in err_str or "tokenizer" in err_str or "config.json" in err_str:
# ... existing auto-repair logic ...
logging.warning("Corrupt model detected on load. Attempting to delete and reset...")
try:
import shutil
# Differentiate between simple path and HF path
new_path = get_models_path() / f"faster-whisper-{size}"
if new_path.exists():
shutil.rmtree(new_path)
logging.info(f"Deleted corrupt model at {new_path}")
else:
# Try legacy HF path
hf_path = get_models_path() / f"models--Systran--faster-whisper-{size}"
if hf_path.exists():
shutil.rmtree(hf_path)
logging.info(f"Deleted corrupt HF model at {hf_path}")
# Notify UI to refresh state (will show 'Download' button now)
# We can't reach bridge easily here without passing it in,
# but the UI polls or listens to logs.
# The user will simply see "Model Missing" in settings after this.
except Exception as del_err:
logging.error(f"Failed to delete corrupt model: {del_err}")
def transcribe(self, audio_data, is_file: bool = False, task: Optional[str] = None) -> str:
"""
Transcribe audio data.
"""
logging.info(f"Starting transcription... (is_file={is_file})")
logging.info(f"Starting transcription... (is_file={is_file}, task={task})")
# Ensure model is loaded
if not self.model:
self.load_model()
if not self.model:
return "Error: Model failed to load."
return "Error: Model failed to load. Please check Settings -> Model Info."
try:
# Config
beam_size = int(self.config.get("beam_size"))
best_of = int(self.config.get("best_of"))
vad = False if is_file else self.config.get("vad_filter")
language = self.config.get("language")
# Use task override if provided, otherwise config
# Ensure safe string and lowercase ("transcribe" vs "Transcribe")
raw_task = task if task else self.config.get("task")
final_task = str(raw_task).strip().lower() if raw_task else "transcribe"
# Sanity check for valid Whisper tasks
if final_task not in ["transcribe", "translate"]:
logging.warning(f"Invalid task '{final_task}' detected. Defaulting to 'transcribe'.")
final_task = "transcribe"
# Language handling
final_language = language if language != "auto" else None
# Anti-Hallucination: Force condition_on_previous_text=False for translation
condition_prev = self.config.get("condition_on_previous_text")
# Helper options for Translation Stability
initial_prompt = self.config.get("initial_prompt")
if final_task == "translate":
condition_prev = False
# Force beam search if user has set it to greedy (1)
# Translation requires more search breadth to find the English mapping
if beam_size < 5:
logging.info("Forcing beam_size=5 for Translation task.")
beam_size = 5
# Inject guidance prompt if none exists
if not initial_prompt:
initial_prompt = "Translate this to English."
logging.info(f"Model Dispatch: Task='{final_task}', Language='{final_language}', ConditionPrev={condition_prev}, Beam={beam_size}")
# Build arguments dynamically to avoid passing None if that's the issue
transcribe_opts = {
"beam_size": beam_size,
"best_of": best_of,
"vad_filter": vad,
"task": final_task,
"vad_parameters": dict(min_silence_duration_ms=500),
"condition_on_previous_text": condition_prev,
"without_timestamps": True
}
if initial_prompt:
transcribe_opts["initial_prompt"] = initial_prompt
# Only add language if it's explicitly set (not None/Auto)
# This avoids potentially confusing the model with explicit None
if final_language:
transcribe_opts["language"] = final_language
# Transcribe
segments, info = self.model.transcribe(
audio_data,
beam_size=beam_size,
best_of=best_of,
vad_filter=vad,
vad_parameters=dict(min_silence_duration_ms=500),
condition_on_previous_text=self.config.get("condition_on_previous_text"),
without_timestamps=True
)
segments, info = self.model.transcribe(audio_data, **transcribe_opts)
# Aggregate text
text_result = ""
for segment in segments:
text_result += segment.text + " "
return text_result.strip()
text_result = text_result.strip()
# Low VRAM Mode: Unload Whisper Model immediately
if self.config.get("unload_models_after_use"):
self.unload_model()
logging.info(f"Final Transcription Output: '{text_result}'")
return text_result
except Exception as e:
logging.error(f"Transcription failed: {e}")
@@ -117,7 +219,10 @@ class WhisperTranscriber:
def model_exists(self, size: str) -> bool:
"""Checks if a model size is already downloaded."""
new_path = get_models_path() / f"faster-whisper-{size}"
if (new_path / "config.json").exists():
if new_path.exists():
# Strict check
required = ["config.json", "model.bin", "vocabulary.json"]
if all((new_path / f).exists() for f in required):
return True
# Legacy HF cache check
@@ -127,3 +232,21 @@ class WhisperTranscriber:
return True
return False
def unload_model(self):
"""
Unloads model to free memory.
"""
if self.model:
del self.model
self.model = None
self.current_model_size = None
# Force garbage collection
import gc
gc.collect()
if torch.cuda.is_available():
torch.cuda.empty_cache()
logging.info("Whisper Model unloaded (Low VRAM Mode).")

View File

@@ -245,6 +245,26 @@ class UIBridge(QObject):
# --- Methods called from QML ---
@Slot(result=list)
def get_supported_languages(self):
from src.core.languages import get_language_names
return get_language_names()
@Slot(str)
def set_language_by_name(self, name):
from src.core.languages import get_code_by_name
from src.core.config import ConfigManager
code = get_code_by_name(name)
ConfigManager().set("language", code)
self.settingChanged.emit("language", code)
@Slot(result=str)
def get_current_language_name(self):
from src.core.languages import get_name_by_code
from src.core.config import ConfigManager
code = ConfigManager().get("language")
return get_name_by_code(code)
@Slot(str, result='QVariant')
def getSetting(self, key):
from src.core.config import ConfigManager
@@ -356,9 +376,15 @@ class UIBridge(QObject):
try:
from src.core.paths import get_models_path
# Check new simple format used by DownloadWorker
path_simple = get_models_path() / f"faster-whisper-{size}"
if path_simple.exists() and any(path_simple.iterdir()):
if path_simple.exists():
# Strict check: Ensure all critical files exist
required = ["config.json", "model.bin", "vocabulary.json"]
if all((path_simple / f).exists() for f in required):
return True
# Check HF Cache format (legacy/default)
@@ -366,16 +392,12 @@ class UIBridge(QObject):
path_hf = get_models_path() / folder_name
snapshots = path_hf / "snapshots"
if snapshots.exists() and any(snapshots.iterdir()):
return True
return True # Legacy cache structure is complex, assume valid if present
# Check direct folder (simple)
path_direct = get_models_path() / size
if (path_direct / "config.json").exists():
return True
return False
except Exception as e:
logging.error(f"Error checking model status: {e}")
return False
@Slot(str)

View File

@@ -100,7 +100,7 @@ ComboBox {
popup: Popup {
y: control.height - 1
width: control.width
implicitHeight: contentItem.implicitHeight
implicitHeight: Math.min(contentItem.implicitHeight, 300)
padding: 5
contentItem: ListView {

View File

@@ -25,7 +25,7 @@ Rectangle {
Text {
anchors.centerIn: parent
text: control.recording ? "Listening..." : (control.currentSequence || "None")
text: control.recording ? "Listening..." : (formatSequence(control.currentSequence) || "None")
color: control.recording ? SettingsStyle.accent : (control.currentSequence ? "#ffffff" : "#808080")
font.family: "JetBrains Mono"
font.pixelSize: 13
@@ -72,6 +72,23 @@ Rectangle {
if (!activeFocus) control.recording = false
}
function formatSequence(seq) {
if (!seq) return ""
var parts = seq.split("+")
for (var i = 0; i < parts.length; i++) {
var p = parts[i]
// Standardize modifiers
if (p === "ctrl") parts[i] = "Ctrl"
else if (p === "alt") parts[i] = "Alt"
else if (p === "shift") parts[i] = "Shift"
else if (p === "win") parts[i] = "Win"
else if (p === "esc") parts[i] = "Esc"
// Capitalize F-keys and others (e.g. f8 -> F8, space -> Space)
else parts[i] = p.charAt(0).toUpperCase() + p.slice(1)
}
return parts.join(" + ")
}
function getKeyName(key, text) {
// F-Keys
if (key >= Qt.Key_F1 && key <= Qt.Key_F35) return "f" + (key - Qt.Key_F1 + 1)

View File

@@ -314,15 +314,25 @@ Window {
spacing: 0
ModernSettingsItem {
label: "Global Hotkey"
description: "Press to record a new shortcut (e.g. Ctrl+Space)"
label: "Global Hotkey (Transcribe)"
description: "Press to record a new shortcut (e.g. F9)"
control: ModernKeySequenceRecorder {
Layout.preferredWidth: 200
implicitWidth: 240
currentSequence: ui.getSetting("hotkey")
onSequenceChanged: (seq) => ui.setSetting("hotkey", seq)
}
}
ModernSettingsItem {
label: "Global Hotkey (Translate)"
description: "Press to record a new shortcut (e.g. F10)"
control: ModernKeySequenceRecorder {
implicitWidth: 240
currentSequence: ui.getSetting("hotkey_translate")
onSequenceChanged: (seq) => ui.setSetting("hotkey_translate", seq)
}
}
ModernSettingsItem {
label: "Run on Startup"
description: "Automatically launch when you log in"
@@ -577,6 +587,53 @@ Window {
Text { text: "Model configuration and performance"; color: SettingsStyle.textSecondary; font.family: mainFont; font.pixelSize: 14 }
}
ModernSettingsSection {
title: "Style & Prompting"
Layout.margins: 32
Layout.topMargin: 0
content: ColumnLayout {
width: parent.width
spacing: 0
ModernSettingsItem {
label: "Punctuation Style"
description: "Hint for how to format text"
control: ModernComboBox {
id: styleCombo
width: 180
model: ["Standard (Proper)", "Casual (Lowercase)", "Custom"]
// Logic to determine initial index based on config string
Component.onCompleted: {
let current = ui.getSetting("initial_prompt")
if (current === "Mm-hmm. Okay, let's go. I speak in full sentences.") currentIndex = 0
else if (current === "um, okay... i guess so.") currentIndex = 1
else currentIndex = 2
}
onActivated: {
if (index === 0) ui.setSetting("initial_prompt", "Mm-hmm. Okay, let's go. I speak in full sentences.")
else if (index === 1) ui.setSetting("initial_prompt", "um, okay... i guess so.")
// Custom: Don't change string immediately, let user type
}
}
}
ModernSettingsItem {
label: "Custom Prompt"
description: "Advanced: Define your own style hint"
visible: styleCombo.currentIndex === 2
control: ModernTextField {
Layout.preferredWidth: 280
placeholderText: "e.g. 'Hello, World.'"
text: ui.getSetting("initial_prompt") || ""
onEditingFinished: ui.setSetting("initial_prompt", text === "" ? null : text)
}
}
}
}
ModernSettingsSection {
title: "Model Config"
Layout.margins: 32
@@ -742,15 +799,17 @@ Window {
ModernSettingsItem {
label: "Language"
description: "Force language or Auto-detect"
description: "Spoken language to transcribe"
control: ModernComboBox {
width: 140
model: ["auto", "en", "fr", "de", "es", "it", "ja", "zh", "ru"]
currentIndex: model.indexOf(ui.getSetting("language"))
onActivated: ui.setSetting("language", currentText)
Layout.preferredWidth: 200
model: ui.get_supported_languages()
currentIndex: model.indexOf(ui.get_current_language_name())
onActivated: (index) => ui.set_language_by_name(currentText)
}
}
// Task selector removed as per user request (Hotkeys handle this now)
ModernSettingsItem {
label: "Compute Device"
description: "Hardware acceleration (CUDA requires NVidia GPU)"
@@ -773,6 +832,16 @@ Window {
onActivated: ui.setSetting("compute_type", currentText)
}
}
ModernSettingsItem {
label: "Low VRAM Mode"
description: "Unload models immediately after use (Saves VRAM, Adds Delay)"
showSeparator: false
control: ModernSwitch {
checked: ui.getSetting("unload_models_after_use")
onToggled: ui.setSetting("unload_models_after_use", checked)
}
}
}
}

32
src/utils/formatters.py Normal file
View File

@@ -0,0 +1,32 @@
"""
Formatter Utilities
===================
Helper functions for text formatting.
"""
def format_hotkey(sequence: str) -> str:
"""
Formats a hotkey sequence string (e.g. 'ctrl+alt+f9')
into a pretty readable string (e.g. 'Ctrl + Alt + F9').
"""
if not sequence:
return "None"
parts = sequence.split('+')
formatted_parts = []
for p in parts:
p = p.strip().lower()
if p == 'ctrl': formatted_parts.append('Ctrl')
elif p == 'alt': formatted_parts.append('Alt')
elif p == 'shift': formatted_parts.append('Shift')
elif p == 'win': formatted_parts.append('Win')
elif p == 'esc': formatted_parts.append('Esc')
else:
# Capitalize first letter
if len(p) > 0:
formatted_parts.append(p[0].upper() + p[1:])
else:
formatted_parts.append(p)
return " + ".join(formatted_parts)

View File

@@ -55,6 +55,10 @@ except AttributeError:
def LOWORD(l): return l & 0xffff
def HIWORD(l): return (l >> 16) & 0xffff
GWL_EXSTYLE = -20
WS_EX_TRANSPARENT = 0x00000020
WS_EX_LAYERED = 0x00080000
class WindowHook:
def __init__(self, hwnd, width, height, initial_scale=1.0):
self.hwnd = hwnd
@@ -65,6 +69,34 @@ class WindowHook:
# (Window 420x140, Pill 380x100)
self.logical_rect = [20, 20, 20+380, 20+100]
self.current_scale = initial_scale
self.enabled = True # New flag
def set_enabled(self, enabled):
"""
Enables or disables interaction.
When disabled, we set WS_EX_TRANSPARENT so clicks pass through physically.
"""
if self.enabled == enabled:
return
self.enabled = enabled
# Get current styles
style = user32.GetWindowLongW(self.hwnd, GWL_EXSTYLE)
if not enabled:
# Enable Click-Through (Add Transparent)
# We also ensure Layered is set (Qt usually sets it, but good to be sure)
new_style = style | WS_EX_TRANSPARENT | WS_EX_LAYERED
else:
# Disable Click-Through (Remove Transparent)
new_style = style & ~WS_EX_TRANSPARENT
if new_style != style:
SetWindowLongPtr(self.hwnd, GWL_EXSTYLE, new_style)
# Force a redraw/frame update just in case
user32.SetWindowPos(self.hwnd, 0, 0, 0, 0, 0, 0x0027) # SWP_NOMOVE | SWP_NOSIZE | SWP_NOZORDER | SWP_FRAMECHANGED
def install(self):
proc_address = ctypes.cast(self.new_wnd_proc, ctypes.c_void_p)
@@ -73,6 +105,10 @@ class WindowHook:
def wnd_proc_callback(self, hwnd, msg, wParam, lParam):
try:
if msg == WM_NCHITTEST:
# If disabled (invisible/inactive), let clicks pass through (HTTRANSPARENT)
if not self.enabled:
return HTTRANSPARENT
res = self.on_nchittest(lParam)
if res != 0:
return res

38
test_m2m.py Normal file
View File

@@ -0,0 +1,38 @@
import sys
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer
def test_m2m():
model_name = "facebook/m2m100_418M"
print(f"Loading {model_name}...")
tokenizer = M2M100Tokenizer.from_pretrained(model_name)
model = M2M100ForConditionalGeneration.from_pretrained(model_name)
# Test cases: (Language Code, Input)
test_cases = [
("en", "he go to school yesterday"),
("pl", "on iść do szkoła wczoraj"), # Intentional broken grammar in Polish
]
print("\nStarting M2M Tests (Self-Translation):\n")
for lang, input_text in test_cases:
tokenizer.src_lang = lang
encoded = tokenizer(input_text, return_tensors="pt")
# Translate to SAME language
generated_tokens = model.generate(
**encoded,
forced_bos_token_id=tokenizer.get_lang_id(lang)
)
corrected = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)[0]
print(f"[{lang}]")
print(f"Input: {input_text}")
print(f"Output: {corrected}")
print("-" * 20)
if __name__ == "__main__":
test_m2m()

40
test_mt0.py Normal file
View File

@@ -0,0 +1,40 @@
import sys
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
def test_mt0():
model_name = "bigscience/mt0-base"
print(f"Loading {model_name}...")
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
# Test cases: (Language, Prompt, Input)
# MT0 is instruction tuned, so we should prompt it in the target language or English.
# Cross-lingual prompting (English prompt -> Target tasks) is usually supported.
test_cases = [
("English", "Correct grammar:", "he go to school yesterday"),
("Polish", "Popraw gramatykę:", "to jest testowe zdanie bez kropki"),
("Finnish", "Korjaa kielioppi:", "tämä on testilause ilman pistettä"),
("Russian", "Исправь грамматику:", "это тестовое предложение без точки"),
("Japanese", "文法を直してください:", "これは点のないテスト文です"),
("Spanish", "Corrige la gramática:", "esta es una oración de prueba sin punto"),
]
print("\nStarting MT0 Tests:\n")
for lang, prompt_text, input_text in test_cases:
full_input = f"{prompt_text} {input_text}"
inputs = tokenizer(full_input, return_tensors="pt")
outputs = model.generate(inputs.input_ids, max_length=128)
corrected = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f"[{lang}]")
print(f"Input: {full_input}")
print(f"Output: {corrected}")
print("-" * 20)
if __name__ == "__main__":
test_mt0()

34
test_punctuation.py Normal file
View File

@@ -0,0 +1,34 @@
import sys
import os
# Add src to path
sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))
from src.core.grammar_assistant import GrammarAssistant
def test_punctuation():
assistant = GrammarAssistant()
assistant.load_model()
samples = [
# User's example (verbatim)
"If the voice recognition doesn't recognize that I like stopped Or something would that would it also correct that",
# Generic run-on
"hello how are you doing today i am doing fine thanks for asking",
# Missing commas/periods
"well i think its valid however we should probably check the logs first"
]
print("\nStarting Punctuation Tests:\n")
for sample in samples:
print(f"Original: {sample}")
corrected = assistant.correct(sample)
print(f"Corrected: {corrected}")
print("-" * 20)
if __name__ == "__main__":
test_punctuation()