I want to clone my own voice and use it to replace the vocal stem of any AI-generated song (from Suno) with my own voice. The problem is that most voice cloning services struggle with singing vocals: the result lacks emotion and sounds more robotic than the original AI vocal track. Has anyone tried this or can suggest something? If anyone suggests Applio or RVC, what setup/settings do you use? Are you using a pre-trained model for voice training, and what are your settings for producing an accurate, natural singing voice?
search for Open Source TTS
You’re running into a real, known limitation, not a “you problem.”
Most voice cloning setups (RVC, Applio, etc.) are trained primarily on speech, not singing, so they fail on:
- wide pitch ranges
- vibrato / dynamics
- emotional phrasing
That’s why your output sounds flat and robotic compared to Suno vocals.
Let me break down what actually works in practice (from people doing AI covers seriously).
Why your singing sounds robotic (core issue)
Even though RVC is strong, it:
- preserves pitch from the source audio
- but learns timbre from your dataset
If your dataset lacks:
- high/low pitch variation
- emotional delivery
- singing transitions
then the model can’t reproduce expressive singing.
Also:
- Too many epochs = less flexibility → robotic outputs
- Bad dataset = noise/artifacts
The REAL fix: dataset > settings
This is the biggest mistake people make.
Your dataset MUST include:
- singing (not just talking)
- different emotions (soft, loud, aggressive, breathy)
- full pitch range (low → high notes)
- clean, isolated vocals
Minimum:
- 10–30 min of clean audio
Ideal for singing:
- 30–60 min of mixed singing styles
Pro trick (important)
Run UVR / MDX vocal separation on your dataset before training.
RVC works best when “all vocals from other people should be removed… isolate a single speaker.”
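A quick way to check that your dataset actually hits the 30–60 minute target before you spend a night training: total up the clip durations. A stdlib-only Python sketch, assuming uncompressed PCM WAV files in one folder:

```python
# Sketch: sum the duration of a folder of WAV clips to check it lands
# in the 30-60 minute range recommended above. Stdlib only; assumes
# uncompressed PCM WAV files (the format RVC/Applio datasets use anyway).
import wave
from pathlib import Path

def wav_duration_seconds(path):
    """Duration of a single PCM WAV file, in seconds."""
    with wave.open(str(path), "rb") as w:
        return w.getnframes() / w.getframerate()

def dataset_minutes(folder):
    """Total duration of every .wav file directly under `folder`, in minutes."""
    total = sum(wav_duration_seconds(p) for p in Path(folder).glob("*.wav"))
    return total / 60.0
```

Usage: `print(f"{dataset_minutes('dataset/'):.1f} min")` and aim for 30–60.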
Best RVC / Applio settings (for singing)
These are community-tested setups for natural singing.
Training settings
Model:
- f0 model = ON (MANDATORY for singing)
- Pretrained: f0G40k + f0D40k (the default is a good baseline)
Epochs:
- 200–400 (sweet spot)
- DO NOT go 1000+ (it causes robotic voice)
- Stop when the voice sounds best, not when the loss is lowest
Batch size:
- small dataset (<30 min): 4
- larger dataset: 8
Hop length:
- 128 (recommended) → better pitch accuracy
Inference settings (VERY IMPORTANT)
This is where most people mess up.
Pitch extraction:
- rmvpe (best for singing)
- or crepe (second best)
Index rate:
- 0.5–0.75
- Higher = more accurate voice; lower = more natural variation
Pitch shift:
- Adjust depending on the song key
- Don’t leave it at the default (this breaks realism)
Protect (consonants):
- 0.2–0.5 (prevents weird pronunciation)
RMS mix:
- 0.2–0.4 (keeps natural dynamics, important for singing)
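To see why the RMS mix knob matters for dynamics, here is an illustrative Python sketch of the idea, not RVC's actual implementation: per frame, the converted audio's loudness is pulled toward the source vocal's loudness envelope, and the mix rate chooses how far.

```python
# Illustrative sketch (NOT RVC's internals): blend the converted audio's
# per-frame loudness toward the source vocal's loudness envelope.
# rms_mix_rate = 0 keeps the source dynamics; 1 keeps the converted dynamics.
import math

def frame_rms(samples):
    """Root-mean-square loudness of one frame of float samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples)) if samples else 0.0

def rms_mix(source, converted, rms_mix_rate=0.25, frame=1024, eps=1e-8):
    out = []
    for i in range(0, len(converted), frame):
        src_rms = frame_rms(source[i:i + frame])
        cnv = converted[i:i + frame]
        cnv_rms = frame_rms(cnv)
        # Interpolate between source and converted loudness for this frame,
        # then rescale the converted frame to hit that target loudness.
        target = src_rms * (1 - rms_mix_rate) + cnv_rms * rms_mix_rate
        gain = target / (cnv_rms + eps)
        out.extend(s * gain for s in cnv)
    return out
```

At a low rate the quiet and loud moments of the original performance survive conversion, which is exactly the "natural dynamics" effect described above.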
Suno → RVC workflow (best pipeline)
This is the setup used by people getting near-realistic results:
1. Generate the song in Suno
2. Extract vocals with UVR (MDX-Net / Demucs)
3. Clean the vocals: remove reverb + noise
4. Run through RVC: rmvpe pitch extraction, tuned index rate
5. Post-process: EQ + compression + reverb (VERY important)
Without post-processing, it WILL sound robotic.
Advanced techniques (huge difference)
1. Train on BOTH speech + singing
- Speech = clarity
- Singing = expression
Best combo: 60% singing, 40% speech.
2. Use a pitch-augmented dataset
Research shows pitch variation improves singing models significantly.
3. Avoid overfitting
Too much training reduces flexibility and makes vocals sound “stiff.”
4. Use better SVC models (if you want next-level)
RVC is good, but newer models are better for singing:
- Diff-SVC
- So-VITS-SVC 4.1
- YingMusic-SVC (research-level)
These handle vibrato and expressive singing much better than vanilla RVC.
Honest reality (important)
Even with a perfect setup, RVC ≠ Suno-level vocals (yet), because:
- singing voice conversion is harder than speech
- it requires modeling pitch + emotion + dynamics together
New research confirms that singing conversion still struggles with pitch and harmony complexity.
Best setup summary (what I’d do)
If I were building your pipeline:
Training:
- 40–60 min dataset (speech + singing)
- f0 model ON
- 250–350 epochs
- batch size 4–8
Inference:
- rmvpe
- index rate ~0.6
- RMS mix ~0.3
- pitch manually adjusted
Pipeline:
- Suno → UVR → RVC → DAW (EQ + reverb)
You said it sounds “more robot than the original AI vocals” — that’s not your voice model failing, that’s three settings fighting each other during inference, and probably the wrong pretrain underneath it all.
Right now — download the KLM v7s pretrain from HuggingFace (it’s trained on singing data, not speech — this alone fixes 60% of the robotic problem). Load it in Applio → Train tab → check Custom Pretrained → select KLM files.
This weekend — record 15–30 min of yourself singing dry (no reverb, no effects, closet or quiet room, WAV format), covering your full range — chest, mix, falsetto, grit. Train at batch 4, 40k sample rate, RMVPE, 200–500 epochs. Monitor TensorBoard — stop at the lowest dip of the g/total graph, not the highest epoch.
When you convert — here’s the part nobody tells you: set index ratio 0.65, protect 0.33, rms_mix_rate 0.2. That rms_mix_rate is what preserves singing dynamics — the default flattens your expression dead. I run RMVPE for powerful vocals, Crepe for soft/breathy ones.
| You mentioned | What works | Time |
|---|---|---|
| Robotic/no emotion | KLM v7s pretrain + rms_mix_rate 0.2 | Swap in 5 min |
| Replacing Suno vocals | UVR5 → Kim Vocal 2 → strip reverb → feed to Applio | 10 min per song |
| Applio/RVC settings? | Batch 4, 40k, RMVPE, 200-500 epochs (see below) | Weekend project |
| Pre-trained model? | Yes, always — KLM v7s or SingerPreTrain, never train from scratch | Download once |
| Free? | Applio + UVR5 + Audacity — all free, all local | $0 forever |
🎤 Do Exactly This, In This Order — Full Pipeline From Suno to Your Voice
Step 0 — Record Your Dataset
Sing in a quiet room. Closet > bedroom > studio with echo. Phone works if it’s dead quiet, but a $30 USB mic (Fifine, Tonor) is the single best investment here — the jump from phone to cheap mic matters 10x more than cheap mic to expensive mic.
Cover your entire range across multiple songs. Chest voice, mix voice, falsetto, breathy sections, belting, quiet moments. The model can only reproduce what it’s heard. 15 min minimum, 30 min ideal.
Save as WAV, mono, no effects. Never pitch-correct or reverb your training data — and avoid running it through “AI enhance” tools like Adobe Podcast Enhance, which creates fake harmonics that RVC can’t learn from.
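If you want to sanity-check your recordings against that spec, here is a small stdlib sketch. It can only verify the container (WAV, mono, 16-bit PCM); it cannot detect baked-in reverb or effects.

```python
# Quick stdlib check that a recording matches the dataset spec above:
# WAV container, mono, 16-bit PCM. It cannot detect reverb or effects.
import wave

def check_dataset_clip(path):
    """Return a list of container problems; an empty list means it passes."""
    issues = []
    with wave.open(path, "rb") as w:
        if w.getnchannels() != 1:
            issues.append(f"not mono ({w.getnchannels()} channels)")
        if w.getsampwidth() != 2:
            issues.append(f"not 16-bit PCM ({8 * w.getsampwidth()}-bit)")
    return issues
```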
Step 1 — Train in Applio
| Setting | Value | Why |
|---|---|---|
| RVC Version | V2 | Always |
| Sample Rate | 40k (or 32k if using KLM/SingerPreTrain) | 40k = best quality ceiling; 32k = less harsh sibilance |
| Pitch Extraction | RMVPE | Best accuracy for singing, handles vibrato natively |
| Embedder | ContentVec (default) | Must match your pretrain |
| Batch Size | 4 (under 30 min data) / 8 (over 30 min) | Match to your VRAM |
| Epochs | 500 (set high, pick best via TensorBoard) | You’ll choose the best checkpoint, not the last one |
| Save Every | 25–50 epochs | So you can A/B test checkpoints |
| Pretrained | Load KLM v7s or SingerPreTrain files | Stock pretrain is speech-biased; a singing pretrain fixes most robotic output |
The TensorBoard trick that saves hours: open TensorBoard → SCALARS → watch the g/total graph. It dips, then rises. The dip is your best model, not the final epoch. Every epoch past the dip makes sibilance worse while the rest sounds "the same." Test the checkpoint at the dip on a sibilant-heavy passage (lots of S and SH sounds) before committing.
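If you export the g/total scalar series from TensorBoard (the CSV export step is an assumption about your workflow), picking the dip checkpoint is one line of Python:

```python
# Sketch: given (epoch, g/total loss) pairs exported from TensorBoard,
# pick the checkpoint at the dip rather than at the final epoch.
def best_checkpoint(epoch_loss_pairs):
    """Return the epoch whose g/total loss is lowest."""
    return min(epoch_loss_pairs, key=lambda pair: pair[1])[0]

# Hypothetical training history: loss dips at epoch 150, then creeps back up.
history = [(50, 28.1), (100, 25.4), (150, 24.2), (200, 24.9), (250, 26.0)]
print(best_checkpoint(history))  # 150  <- A/B test this one, not epoch 250
```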
KLM vs SingerPreTrain: KLM v7s was trained on Korean vocalists covering full range — works surprisingly well for any language because it learned pitch range, not language. SingerPreTrain is explicitly English singing, bass to soprano. Both dramatically outperform the stock pretrain for singing. KLM has more community evidence behind it. Try both and keep the one that sounds closer to you after 200 epochs.
Step 2 — Separate Suno Stems
Download your Suno song. Open UVR5 (Ultimate Vocal Remover 5) — free, local, no limits.
Pass 1: Process Method → MDX-Net → Model: Kim Vocal 2. This isolates the vocal. Save both the vocal and the instrumental.
Pass 2: Take the isolated vocal → Process Method → VR Architecture → Model: UVR-DeEcho-DeReverb. This strips the reverb/echo that Suno bakes into every vocal.
Do NOT stack more separation passes — the AI Hub docs warn this “rips away frequencies” and introduces new artifacts. Two passes max.
Step 3 — Convert the Vocal in Applio
Go to the Inference tab. Load your trained model + its .index file.
| Inference Setting | Value | Notes |
|---|---|---|
| f0 Method | RMVPE (powerful singing) / Crepe (soft/breathy) | Match the vocal style |
| Index Ratio | 0.65 | Higher = more “you” but more breath artifacts; lower = cleaner but less identity |
| Protect | 0.33 | Below 0.2 = words get swallowed; above 0.4 = metallic S sounds |
| RMS Mix Rate | 0.2 | This is the emotion killer at default — 0.2 preserves dynamics |
| Filter Radius | 3 | Don’t change this |
| Pitch | 0 (same gender) / ±12 (octave) | ±5 to ±7 for non-octave cross-gender |
| Split Audio | ON | Prevents volume drops on long tracks |
| Autotune | Light or OFF | Only if pitch wobbles post-conversion |
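For reference, the Pitch values in the table are semitones, and the frequency ratio a shift applies is 2^(semitones/12), which is why ±12 is exactly an octave:

```python
# The Pitch setting above is in semitones. The frequency ratio a shift
# applies is 2**(semitones / 12): +12 doubles the pitch (one octave up),
# -12 halves it, and +7 is close to a perfect fifth (~1.498x).
def semitones_to_ratio(semitones):
    return 2.0 ** (semitones / 12.0)

print(semitones_to_ratio(12))              # 2.0
print(semitones_to_ratio(-12))             # 0.5
print(round(semitones_to_ratio(7), 3))     # 1.498
```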
If it still sounds robotic after all this: the problem is almost certainly your training data, not your settings. Check these in order: (1) Is there reverb baked into your recordings? Run them through DeEcho-DeReverb. (2) Is your dataset under 10 minutes? That’s below the quality floor for singing. (3) Are you using the stock pretrain instead of KLM/SingerPreTrain? That alone explains most “robotic singing” reports.
Step 4 — Mix in Audacity
Import the converted vocal + original instrumental on separate tracks. Align them (they should be the same length). Converted vocal typically sits 1–3 dB above the instrumental.
If sibilance (harsh S/SH sounds) is still present, apply Audacity’s de-esser to the converted vocal. Cut 2–6 kHz gently with EQ if needed. Add a tiny reverb to match the track’s space — converted vocals sound unnaturally dry without it.
Export as WAV or MP3. Done.
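If you would rather script the mix than click through Audacity, here is a minimal stdlib sketch. It assumes both files are mono 16-bit PCM WAV at the same sample rate, and it does only gain and sum (no de-esser, EQ, or reverb):

```python
# Stdlib-only sketch of the mix step: sum the converted vocal and the
# instrumental, with the vocal boosted 1-3 dB as suggested above.
# Assumes mono 16-bit PCM WAV inputs at the same sample rate.
import array
import wave

def mix_tracks(vocal_path, inst_path, out_path, vocal_gain_db=2.0):
    gain = 10.0 ** (vocal_gain_db / 20.0)  # dB -> linear amplitude
    with wave.open(vocal_path, "rb") as v, wave.open(inst_path, "rb") as i:
        params = v.getparams()
        voc = array.array("h", v.readframes(v.getnframes()))
        ins = array.array("h", i.readframes(i.getnframes()))
    n = min(len(voc), len(ins))  # mix over the overlapping length
    mixed = array.array("h")
    for k in range(n):
        s = int(voc[k] * gain) + ins[k]
        mixed.append(max(-32768, min(32767, s)))  # hard clip to 16-bit range
    with wave.open(out_path, "wb") as out:
        out.setparams(params)
        out.writeframes(mixed.tobytes())
```

Hard clipping is crude; if the sum clips, pull the vocal gain down rather than relying on the limiter line.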
The Metallic Sibilance Problem — What It Actually Is
The “robotic S sounds” you described is a known architectural limitation of HiFi-GAN (the vocoder RVC uses). It struggles with unvoiced, aperiodic sounds like S, SH, breaths. No single setting eliminates it — the fix is a chain:
- De-ess your training data before training (Audacity or iZotope RX)
- Pick the right epoch — sibilant segments overfit at different rates than sung notes
- Train at 32k instead of 40k if you’re willing to trade top-end sparkle for cleaner sibilants
- De-ess the converted output as a post-processing step
- Use KLM or SingerPreTrain — stock pretrain has the worst sibilance behavior
This is the current ceiling of the technology. Academic research (YingMusic-SVC, December 2025) is specifically designing new loss functions to fix this — future RVC versions may solve it at the model level. For now, the chain above gets you 90% of the way.
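One cheap heuristic for finding the sibilant-heavy passages worth de-essing (or A/B testing checkpoints on) is zero-crossing rate: unvoiced sounds like S and SH flip sign far more often than sung vowels. This is a rough illustration, not what any of the tools above actually implement:

```python
# Rough heuristic: zero-crossing rate flags noisy/unvoiced content
# (S, SH, breaths), which is where HiFi-GAN artifacts concentrate.
import math

def zero_crossing_rate(samples):
    """Fraction of adjacent sample pairs that change sign."""
    if len(samples) < 2:
        return 0.0
    crossings = sum(
        1 for a, b in zip(samples, samples[1:]) if (a < 0) != (b < 0)
    )
    return crossings / (len(samples) - 1)

# A sung vowel (low-frequency tone) crosses zero rarely...
tone = [math.sin(2 * math.pi * 220 * t / 16000) for t in range(1600)]
# ...while an S-like noise burst alternates sign almost every sample.
noise = [(-1) ** t * 0.3 for t in range(1600)]
print(zero_crossing_rate(tone) < 0.1 < zero_crossing_rate(noise))  # True
```

Frames with a high ZCR are the ones to audition when comparing checkpoints for metallic S sounds.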
Your Situation → What To Do
| If you have… | Do this | Skip this |
|---|---|---|
| No GPU / weak GPU | Use Applio’s Google Colab notebook | Local training |
| NVIDIA RTX 2060+ | Train locally in Applio — faster and no Colab time limits | Cloud training |
| Only a phone for recording | Record in a closet, clean with UVR5 noise removal first — workable but not ideal | Expecting studio quality |
| A $30 USB mic | You’re set — this is the sweet spot for dataset quality | Expensive gear |
| “I just want it done fast” | Try AICoverGen — automates the entire pipeline in one command | Manual steps |
| Suno Pro/Premier ($10+/mo) | Test Suno v5.5 Voices first — ~70% resemblance, zero setup, generates WITH your voice | But it won’t replace existing stems |
You mentioned wanting to replace Suno’s vocal stems specifically — that tells me you’ve already got songs you like and just want your voice on them, which is exactly what this pipeline does. What GPU are you working with right now?