πŸŽ™οΈ Clone Any Voice for Free β€” This Local App Replaces ElevenLabs

:studio_microphone: 3 Seconds of Audio → Near-Perfect Voice Clone: Free Desktop App, No Subscription

Think ElevenLabs but it runs on your laptop, costs nothing, and your voice data never touches the internet.

Someone built a local ElevenLabs. Record 3 seconds of anyone’s voice, and Voicebox creates a clone that speaks any text you feed it, with natural emotion and real cadence, not robotic TTS garbage.

ElevenLabs charges $22-99/month and keeps your voice data on their servers. A professional voice actor charges $250-500 per finished minute. Voicebox is one download and $0 forever.


🧠 How It Works: The 60-Second Version

Think of it like a voice photocopier. You feed it a short audio clip of someone talking; 3 to 10 seconds is enough. The AI model (Qwen3-TTS by Alibaba, same class as the paid services) learns the voice’s fingerprint: tone, rhythm, accent, emotion. Then you type any text and it speaks it back in that voice.

The model downloads once (~2-4 GB). After that, no internet is needed. Everything runs on your hardware.
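As a rough sanity check on that download size, weight storage scales with parameter count times bytes per parameter. The sketch below applies this to the two model sizes Voicebox offers (1.7B and 0.6B), assuming 16-bit weights. The precision is an assumption, not something the post states, and a real download also includes files beyond the raw weights.

```python
# Rough estimate of model weight size: parameters x bytes per parameter.
# Assumption: 16-bit (2-byte) weights; the post does not state the precision.

def weight_size_gb(params_billion: float, bytes_per_param: int = 2) -> float:
    """Approximate size of the weights in gigabytes (1 GB = 1e9 bytes)."""
    return params_billion * 1e9 * bytes_per_param / 1e9

for name, params in [("1.7B", 1.7), ("0.6B", 0.6)]:
    print(f"{name}: ~{weight_size_gb(params):.1f} GB of weights")
```

Under that assumption the larger model comes out at roughly 3.4 GB, consistent with the upper end of the quoted download size, while the smaller model's weights alone would be about 1.2 GB.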

| What Happens | Where It Happens |
|---|---|
| Voice sample analyzed | Your machine |
| Voice profile created | Your machine |
| Speech generated | Your machine |
| Data sent to cloud | Nowhere. Ever. |

⚡ What You Get: Not Just a TTS Toy

| Feature | Details |
|---|---|
| Voice cloning | 3-10 seconds of audio → near-perfect clone |
| DAW-style timeline | Multi-track editor: drag clips, layer voices, mix conversations |
| Multi-voice projects | Build entire podcasts with different cloned voices |
| Transcription | Built-in Whisper auto-transcribes your audio |
| In-app recording | Record voice samples directly, no external tools needed |
| Model sizes | 1.7B (better quality) or 0.6B (faster, lighter) |
| Languages | English, Chinese, and more coming |

This isn’t a command-line script for nerds. It’s a full production app with a proper UI.

💻 Download & Setup: Pick Your OS

:inbox_tray: Download: github.com/jamiepine/voicebox

| Platform | GPU Requirement | Speed |
|---|---|---|
| macOS (M1/M2/M3/M4) | None (native Metal acceleration via MLX) | Near real-time, 4-5x faster |
| Windows | NVIDIA GPU (CUDA) | Fast with decent GPU |
| Linux | Coming soon | Blocked by build infra |
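If you want to check a machine against the platform requirements listed above before downloading, something like the following sketch could work. It is not part of Voicebox: the function name and the returned labels are hypothetical, and looking for `nvidia-smi` on the PATH is only a heuristic for "has an NVIDIA driver installed".

```python
import platform
import shutil

def expected_backend(system: str, machine: str, has_nvidia: bool) -> str:
    """Map a platform to the acceleration path the post lists for it.

    The returned labels are illustrative, not identifiers used by Voicebox.
    """
    if system == "Darwin" and machine == "arm64":
        return "mlx-metal"    # Apple Silicon (M1 and later)
    if system == "Windows" and has_nvidia:
        return "cuda"         # NVIDIA GPU on Windows
    return "unsupported"      # Linux, Intel Macs, Windows without NVIDIA

if __name__ == "__main__":
    # Heuristic: the NVIDIA driver normally installs the nvidia-smi tool.
    has_nvidia = shutil.which("nvidia-smi") is not None
    print(expected_backend(platform.system(), platform.machine(), has_nvidia))
```

Note that a Python interpreter running under Rosetta on an Apple Silicon Mac reports `x86_64` rather than `arm64`, so this check would misclassify it.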

Step 1: Download the installer from the GitHub releases page.

Step 2: Launch → it auto-downloads the Qwen3-TTS model on first run.

Step 3: Record or upload a voice sample (3+ seconds).

Step 4: Type your text → hit generate → done.

:high_voltage: Mac users win here: Apple Silicon gets native Metal acceleration via MLX, so generation is near real-time.
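"Near real-time" can be made concrete with the real-time factor (RTF): generation time divided by the duration of the audio produced. An RTF around 1.0 means a clip takes about as long to generate as it does to play, and larger values mean slower generation. The sketch below shows the calculation; the timings in it are hypothetical placeholders, not measurements of Voicebox.

```python
def real_time_factor(generation_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent generating / duration of the generated audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return generation_seconds / audio_seconds

# Hypothetical timings for illustration only; measure your own machine.
samples = {"machine A": (11.0, 10.0), "machine B": (60.0, 10.0)}
for label, (gen, audio) in samples.items():
    print(f"{label}: RTF = {real_time_factor(gen, audio):.1f}")
```

Timing a single generation with a stopwatch and dividing by the clip length is enough to compare your hardware against the "near real-time" claim.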

💰 What This Replaces

| Service | Cost | Your Data |
|---|---|---|
| ElevenLabs | $22-99/month | Stored on their servers |
| Professional voice actor | $250-500/finished minute | N/A |
| Play.ht / Murf | $29-99/month | Cloud-processed |
| Voicebox | $0 forever | Never leaves your machine |
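To put those prices on a common footing, here is a small arithmetic sketch using only the figures quoted in this post. The 10-minute project length is an arbitrary example, and the prices are the post's numbers, not verified current pricing.

```python
# Prices as quoted in the post (USD, low and high end of each range).
ELEVENLABS_MONTHLY = (22, 99)
VOICE_ACTOR_PER_MINUTE = (250, 500)

def yearly_cost(monthly_range):
    """Turn a (low, high) monthly price range into a yearly range."""
    low, high = monthly_range
    return low * 12, high * 12

def voice_actor_cost(minutes, per_minute_range=VOICE_ACTOR_PER_MINUTE):
    """Cost range for a given number of finished minutes."""
    low, high = per_minute_range
    return low * minutes, high * minutes

print("ElevenLabs per year:", yearly_cost(ELEVENLABS_MONTHLY))
print("Voice actor, 10 finished minutes:", voice_actor_cost(10))
```

At the quoted rates that works out to $264-1,188 a year for the subscription and $2,500-5,000 for ten finished minutes of voice acting.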

:high_voltage: Quick Hits

| Want | Do |
|---|---|
| :studio_microphone: Clone a voice | Upload 3-10 sec audio clip → instant profile |
| :headphone: Build a podcast | Create multiple voice profiles → arrange on timeline |
| :locked: Keep voice data private | Already done: nothing ever leaves your laptop |
| :counterclockwise_arrows_button: Use offline | Model downloads once, works without internet forever |

Your laptop is now a voice studio. Nobody asked permission and nobody’s charging rent.


Very slow generation.

Only works with a GPU, possibly one with 10+ GB of VRAM!!


Yeah, very slow. It's taking too long to download the Qwen model while generating.

Only works reasonably on a 4090 or 5090. Luckily, I have both!!

I’ve been trying the recently released MioTTS. It’s better, faster, runs on much lower VRAM, and only needs 5 seconds of audio recording to clone someone’s voice.