3 Seconds of Audio → Perfect Voice Clone: Free Desktop App, No Subscription
Think ElevenLabs but it runs on your laptop, costs nothing, and your voice data never touches the internet.
Someone built a local ElevenLabs. Record 3 seconds of anyone's voice, and Voicebox creates a clone that speaks any text you feed it: natural emotion, real cadence, not robotic TTS garbage.
ElevenLabs charges $22-99/month and keeps your voice data on their servers. A professional voice actor charges $250-500 per finished minute. Voicebox is one download and $0 forever.
🧠 How It Works: The 60-Second Version
Think of it like a voice photocopier. You feed it a short audio clip of someone talking; 3 to 10 seconds is enough. The AI model (Qwen3-TTS by Alibaba, same class as the paid services) learns the voice's fingerprint: tone, rhythm, accent, emotion. Then you type any text and it speaks it back in that voice.
The model downloads once (~2-4GB). After that, no internet needed. Everything runs on your hardware.
| What Happens | Where It Happens |
|---|---|
| Voice sample analyzed | Your machine |
| Voice profile created | Your machine |
| Speech generated | Your machine |
| Data sent to cloud | Nowhere. Ever. |
⚡ What You Get: Not Just a TTS Toy
| Feature | Details |
|---|---|
| Voice cloning | 3-10 seconds of audio → near-perfect clone |
| DAW-style timeline | Multi-track editor: drag clips, layer voices, mix conversations |
| Multi-voice projects | Build entire podcasts with different cloned voices |
| Transcription | Built-in Whisper auto-transcribes your audio |
| In-app recording | Record voice samples directly, no external tools needed |
| Model sizes | 1.7B (better quality) or 0.6B (faster, lighter) |
| Languages | English, Chinese, and more coming |
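The two model sizes in the table roughly line up with the ~2-4GB download mentioned earlier. Here is a back-of-the-envelope sketch, assuming the weights are stored at 16 bits (2 bytes) per parameter. That precision is an assumption, not something the project states, and the real download may bundle extra files, so treat the numbers as ballpark only.

```python
# Rough estimate of model weight size from parameter count.
# ASSUMPTION: 2 bytes per parameter (16-bit weights); the source
# does not say which precision the shipped model uses.

def weight_size_gb(params_billions, bytes_per_param=2):
    """Approximate weight size in GB (1 GB = 1e9 bytes)."""
    total_bytes = params_billions * 1e9 * bytes_per_param
    return total_bytes / 1e9


if __name__ == "__main__":
    for size in (1.7, 0.6):
        print(f"{size}B parameters: about {weight_size_gb(size):.1f} GB of weights")
```

Under that assumption the 1.7B model comes out around 3.4 GB and the 0.6B model around 1.2 GB of weights.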
This isn't a command-line script for nerds. It's a full production app with a proper UI.
💻 Download & Setup: Pick Your OS
Download: github.com/jamiepine/voicebox
| Platform | GPU Requirement | Speed |
|---|---|---|
| macOS (M1/M2/M3/M4) | None; native Metal acceleration via MLX | Near real-time, 4-5x faster |
| Windows | NVIDIA GPU (CUDA) | Fast with decent GPU |
| Linux | Coming soon | Blocked by build infra |
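Not sure which row you land in? A small sketch that maps your machine onto the table above. The function name `expected_backend` and its return strings are made up for illustration; this is not part of Voicebox. It only reads what Python's standard `platform` module reports, and it assumes Apple Silicon shows up as `arm64` (a Python running under Rosetta would report `x86_64` instead).

```python
import platform


def expected_backend(system=None, machine=None):
    """Map an OS/CPU pair onto the platform table (hypothetical helper)."""
    system = system or platform.system()
    machine = machine or platform.machine()
    if system == "Darwin" and machine == "arm64":
        return "MLX (Metal), no separate GPU needed"
    if system == "Windows":
        return "CUDA, needs an NVIDIA GPU"
    if system == "Linux":
        return "coming soon"
    return "not listed in the table"


if __name__ == "__main__":
    print(expected_backend())
```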
Step 1: Download the installer from the GitHub releases page.
Step 2: Launch. It auto-downloads the Qwen3-TTS model on first run.
Step 3: Record or upload a voice sample (3+ seconds).
Step 4: Type your text → hit generate → done.
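If you're uploading a clip for Step 3 rather than recording in-app, you can check its length first. A minimal sketch using only Python's standard `wave` module, so it handles uncompressed WAV files only. The helper names and status strings are invented for this example, and the 3 and 10 second thresholds come from the numbers quoted in this post, not from any Voicebox API.

```python
import sys
import wave

MIN_SECONDS = 3.0   # minimum sample length from Step 3
MAX_SECONDS = 10.0  # upper end of the "3 to 10 seconds is enough" range


def clip_duration_seconds(path):
    """Return the duration of an uncompressed WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def check_sample(path):
    """Classify a clip as 'too_short', 'ok', or 'longer_than_needed'."""
    duration = clip_duration_seconds(path)
    if duration < MIN_SECONDS:
        return "too_short"
    if duration > MAX_SECONDS:
        return "longer_than_needed"
    return "ok"


if __name__ == "__main__" and len(sys.argv) > 1:
    sample = sys.argv[1]
    print(f"{sample}: {clip_duration_seconds(sample):.1f}s, {check_sample(sample)}")
```

Run it as `python check_clip.py my_voice.wav` (both file names are placeholders). A clip over 10 seconds isn't wrong, just more than the post says you need.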
Mac users win here: Apple Silicon gets native Metal acceleration via MLX, so no separate GPU is needed. Generation is near real-time.
💰 What This Replaces
| Service | Cost | Your Data |
|---|---|---|
| ElevenLabs | $22-99/month | Stored on their servers |
| Professional voice actor | $250-500/finished minute | N/A |
| Play.ht / Murf | $29-99/month | Cloud-processed |
| Voicebox | $0 forever | Never leaves your machine |
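To see what those monthly prices add up to, here is a quick sketch that annualizes the ranges from the table. The prices are the ones quoted in this post. The 10-minute voice actor project is a made-up example, not a figure from any source.

```python
# Monthly price ranges in USD, taken from the table above.
MONTHLY_PRICES = {
    "ElevenLabs": (22, 99),
    "Play.ht / Murf": (29, 99),
    "Voicebox": (0, 0),
}

# Voice actor rate per finished minute in USD, also from the table.
ACTOR_RATE_PER_MINUTE = (250, 500)


def yearly_cost(monthly_range):
    """Return the (low, high) cost of 12 months at a monthly price range."""
    low, high = monthly_range
    return low * 12, high * 12


def actor_cost(minutes, rate_range=ACTOR_RATE_PER_MINUTE):
    """Return the (low, high) cost of `minutes` of finished voice actor audio."""
    low, high = rate_range
    return low * minutes, high * minutes


if __name__ == "__main__":
    for name, prices in MONTHLY_PRICES.items():
        low, high = yearly_cost(prices)
        print(f"{name}: ${low}-{high} per year")
    low, high = actor_cost(10)  # hypothetical 10-minute project
    print(f"Voice actor, 10 finished minutes: ${low}-{high}")
```

At the quoted prices, a year of ElevenLabs works out to $264-1188.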
Quick Hits
| Want | Do |
|---|---|
| Clone a voice | Upload a 3-10 sec audio clip → instant profile |
| Multi-voice conversation | Create multiple voice profiles → arrange on timeline |
| Keep voice data private | Already done: nothing ever leaves your laptop |
| Work offline | Model downloads once, works without internet forever |
Your laptop is now a voice studio. Nobody asked permission and nobody's charging rent.