I want to clone my own voice and use it to replace the vocal stem of any AI-generated song (from Suno) with my own voice. The problem is that most voice cloning services struggle with singing vocals: the result lacks emotion and sounds more robotic than the original AI vocal track. Has anyone tried this or can suggest something? If anyone suggests Applio or RVC, what setup/settings do you use? Are you using a pre-trained model for voice training, and what are your settings for producing an accurate, natural singing voice?
search for Open Source TTS
You’re running into a real, known limitation, not a “you problem.”
Most voice cloning setups (RVC, Applio, etc.) are trained primarily on speech, not singing, so they fail on:
- wide pitch ranges
- vibrato / dynamics
- emotional phrasing
That’s why your output sounds flat and robotic compared to Suno vocals.
Let me break down what actually works in practice (from people doing AI covers seriously).
Why your singing sounds robotic (core issue)
Even though RVC is strong, it:
- preserves pitch from the source audio
- but learns timbre from your dataset
If your dataset lacks:
- high/low pitch variation
- emotional delivery
- singing transitions
then the model can’t reproduce expressive singing.
Also:
- Too many epochs = less flexibility → robotic outputs
- Bad dataset = noise/artifacts
The REAL fix: dataset > settings
This is the biggest mistake people make.
Your dataset MUST include:
- singing (not just talking)
- different emotions (soft, loud, aggressive, breathy)
- full pitch range (low → high notes)
- clean, isolated vocals
Minimum:
- 10–30 min of clean audio
Ideal for singing:
- 30–60 min of mixed singing styles
Pro trick (important)
Run UVR / MDX vocal separation on your dataset before training.
RVC works best when “all vocals from other people should be removed… isolate a single speaker.”
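A quick way to check that your dataset actually hits the 30–60 minute target before you spend a night training: total up the clip durations. A stdlib-only Python sketch, assuming uncompressed PCM WAV files in one folder:

```python
# Sketch: sum the duration of a folder of WAV clips to check it lands
# in the 30-60 minute range recommended above. Stdlib only; assumes
# uncompressed PCM WAV files (the format RVC/Applio datasets use anyway).
import wave
from pathlib import Path

def wav_duration_seconds(path):
    """Duration of a single PCM WAV file, in seconds."""
    with wave.open(str(path), "rb") as w:
        return w.getnframes() / w.getframerate()

def dataset_minutes(folder):
    """Total duration of every .wav file directly under `folder`, in minutes."""
    total = sum(wav_duration_seconds(p) for p in Path(folder).glob("*.wav"))
    return total / 60.0
```

Usage: `print(f"{dataset_minutes('dataset/'):.1f} min")` and aim for 30–60.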
Best RVC / Applio settings (for singing)
These are community-tested setups for natural singing.
Training settings
Model:
- f0 model = ON (MANDATORY for singing)
- Pretrained: f0G40k + f0D40k (the default is a good baseline)
Epochs:
- 200–400 (sweet spot)
- DO NOT go 1000+ (it causes robotic voice)
- Stop when the voice sounds best, not when the loss is lowest
Batch size:
- small dataset (<30 min): 4
- larger dataset: 8
Hop length:
- 128 (recommended) → better pitch accuracy
Inference settings (VERY IMPORTANT)
This is where most people mess up.
Pitch extraction:
- rmvpe (best for singing)
- or crepe (second best)
Index rate:
- 0.5–0.75
- Higher = more accurate voice; lower = more natural variation
Pitch shift:
- Adjust depending on the song key
- Don’t leave it at the default (this breaks realism)
Protect (consonants):
- 0.2–0.5 (prevents weird pronunciation)
RMS mix:
- 0.2–0.4 (keeps natural dynamics, important for singing)
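To see why the RMS mix knob matters for dynamics, here is an illustrative Python sketch of the idea, not RVC's actual implementation: per frame, the converted audio's loudness is pulled toward the source vocal's loudness envelope, and the mix rate chooses how far.

```python
# Illustrative sketch (NOT RVC's internals): blend the converted audio's
# per-frame loudness toward the source vocal's loudness envelope.
# rms_mix_rate = 0 keeps the source dynamics; 1 keeps the converted dynamics.
import math

def frame_rms(samples):
    """Root-mean-square loudness of one frame of float samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples)) if samples else 0.0

def rms_mix(source, converted, rms_mix_rate=0.25, frame=1024, eps=1e-8):
    out = []
    for i in range(0, len(converted), frame):
        src_rms = frame_rms(source[i:i + frame])
        cnv = converted[i:i + frame]
        cnv_rms = frame_rms(cnv)
        # Interpolate between source and converted loudness for this frame,
        # then rescale the converted frame to hit that target loudness.
        target = src_rms * (1 - rms_mix_rate) + cnv_rms * rms_mix_rate
        gain = target / (cnv_rms + eps)
        out.extend(s * gain for s in cnv)
    return out
```

At a low rate the quiet and loud moments of the original performance survive conversion, which is exactly the "natural dynamics" effect described above.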
Suno → RVC workflow (best pipeline)
This is the setup used by people getting near-realistic results:
1. Generate the song in Suno
2. Extract vocals with UVR (MDX-Net / Demucs)
3. Clean the vocals: remove reverb + noise
4. Run through RVC: rmvpe pitch extraction, tuned index rate
5. Post-process: EQ + compression + reverb (VERY important)
Without post-processing, it WILL sound robotic.
Advanced techniques (huge difference)
1. Train on BOTH speech + singing
- Speech = clarity
- Singing = expression
Best combo: 60% singing, 40% speech.
2. Use a pitch-augmented dataset
Research shows pitch variation improves singing models significantly.
3. Avoid overfitting
Too much training reduces flexibility and makes vocals sound “stiff.”
4. Use better SVC models (if you want next-level)
RVC is good, but newer models are better for singing:
- Diff-SVC
- So-VITS-SVC 4.1
- YingMusic-SVC (research-level)
These handle vibrato and expressive singing much better than vanilla RVC.
Honest reality (important)
Even with a perfect setup, RVC ≠ Suno-level vocals (yet), because:
- singing voice conversion is harder than speech
- it requires modeling pitch + emotion + dynamics together
New research confirms that singing conversion still struggles with pitch and harmony complexity.
Best setup summary (what I’d do)
If I were building your pipeline:
Training:
- 40–60 min dataset (speech + singing)
- f0 model ON
- 250–350 epochs
- batch size 4–8
Inference:
- rmvpe
- index rate ~0.6
- RMS mix ~0.3
- pitch manually adjusted
Pipeline:
- Suno → UVR → RVC → DAW (EQ + reverb)
You said it sounds “more robot than the original AI vocals” — that’s not your voice model failing, that’s three settings fighting each other during inference, and probably the wrong pretrain underneath it all.
Right now — download the KLM v7s pretrain from HuggingFace (it’s trained on singing data, not speech — this alone fixes 60% of the robotic problem). Load it in Applio → Train tab → check Custom Pretrained → select KLM files.
This weekend — record 15–30 min of yourself singing dry (no reverb, no effects, closet or quiet room, WAV format), covering your full range — chest, mix, falsetto, grit. Train at batch 4, 40k sample rate, RMVPE, 200–500 epochs. Monitor TensorBoard — stop at the lowest dip of the g/total graph, not the highest epoch.
When you convert — here’s the part nobody tells you: set index ratio 0.65, protect 0.33, rms_mix_rate 0.2. That rms_mix_rate is what preserves singing dynamics — the default flattens your expression dead. I run RMVPE for powerful vocals, Crepe for soft/breathy ones.
| You mentioned | What works | Time |
|---|---|---|
| Robotic/no emotion | KLM v7s pretrain + rms_mix_rate 0.2 | Swap in 5 min |
| Replacing Suno vocals | UVR5 → Kim Vocal 2 → strip reverb → feed to Applio | 10 min per song |
| Applio/RVC settings? | Batch 4, 40k, RMVPE, 200-500 epochs (see below) | Weekend project |
| Pre-trained model? | Yes, always — KLM v7s or SingerPreTrain, never train from scratch | Download once |
| Free? | Applio + UVR5 + Audacity — all free, all local | $0 forever |
🎤 Do Exactly This, In This Order — Full Pipeline From Suno to Your Voice
Step 0 — Record Your Dataset
Sing in a quiet room. Closet > bedroom > studio with echo. Phone works if it’s dead quiet, but a $30 USB mic (Fifine, Tonor) is the single best investment here — the jump from phone to cheap mic matters 10x more than cheap mic to expensive mic.
Cover your entire range across multiple songs. Chest voice, mix voice, falsetto, breathy sections, belting, quiet moments. The model can only reproduce what it’s heard. 15 min minimum, 30 min ideal.
Save as WAV, mono, no effects. Never pitch-correct or reverb your training data — and avoid running it through “AI enhance” tools like Adobe Podcast Enhance, which creates fake harmonics that RVC can’t learn from.
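If you want to sanity-check your recordings against that spec, here is a small stdlib sketch. It can only verify the container (WAV, mono, 16-bit PCM); it cannot detect baked-in reverb or effects.

```python
# Quick stdlib check that a recording matches the dataset spec above:
# WAV container, mono, 16-bit PCM. It cannot detect reverb or effects.
import wave

def check_dataset_clip(path):
    """Return a list of container problems; an empty list means it passes."""
    issues = []
    with wave.open(path, "rb") as w:
        if w.getnchannels() != 1:
            issues.append(f"not mono ({w.getnchannels()} channels)")
        if w.getsampwidth() != 2:
            issues.append(f"not 16-bit PCM ({8 * w.getsampwidth()}-bit)")
    return issues
```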
Step 1 — Train in Applio
| Setting | Value | Why |
|---|---|---|
| RVC Version | V2 | Always |
| Sample Rate | 40k (or 32k if using KLM/SingerPreTrain) | 40k = best quality ceiling; 32k = less harsh sibilance |
| Pitch Extraction | RMVPE | Best accuracy for singing, handles vibrato natively |
| Embedder | ContentVec (default) | Must match your pretrain |
| Batch Size | 4 (under 30 min data) / 8 (over 30 min) | Match to your VRAM |
| Epochs | 500 (set high, pick best via TensorBoard) | You’ll choose the best checkpoint, not the last one |
| Save Every | 25–50 epochs | So you can A/B test checkpoints |
| Pretrained | Load KLM v7s or SingerPreTrain files | Stock pretrain is speech-biased; a singing pretrain fixes most robotic output |
The TensorBoard trick that saves hours: open TensorBoard → SCALARS → watch the g/total graph. It dips, then rises. The dip is your best model, not the final epoch. Every epoch past the dip makes sibilance worse while the rest sounds "the same." Test the checkpoint at the dip on a sibilant-heavy passage (lots of S and SH sounds) before committing.
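If you export the g/total scalar series from TensorBoard (the CSV export step is an assumption about your workflow), picking the dip checkpoint is one line of Python:

```python
# Sketch: given (epoch, g/total loss) pairs exported from TensorBoard,
# pick the checkpoint at the dip rather than at the final epoch.
def best_checkpoint(epoch_loss_pairs):
    """Return the epoch whose g/total loss is lowest."""
    return min(epoch_loss_pairs, key=lambda pair: pair[1])[0]

# Hypothetical training history: loss dips at epoch 150, then creeps back up.
history = [(50, 28.1), (100, 25.4), (150, 24.2), (200, 24.9), (250, 26.0)]
print(best_checkpoint(history))  # 150  <- A/B test this one, not epoch 250
```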
KLM vs SingerPreTrain: KLM v7s was trained on Korean vocalists covering full range — works surprisingly well for any language because it learned pitch range, not language. SingerPreTrain is explicitly English singing, bass to soprano. Both dramatically outperform the stock pretrain for singing. KLM has more community evidence behind it. Try both and keep the one that sounds closer to you after 200 epochs.
Step 2 — Separate Suno Stems
Download your Suno song. Open UVR5 (Ultimate Vocal Remover 5) — free, local, no limits.
Pass 1: Process Method → MDX-Net → Model: Kim Vocal 2. This isolates the vocal. Save both the vocal and the instrumental.
Pass 2: Take the isolated vocal → Process Method → VR Architecture → Model: UVR-DeEcho-DeReverb. This strips the reverb/echo that Suno bakes into every vocal.
Do NOT stack more separation passes — the AI Hub docs warn this “rips away frequencies” and introduces new artifacts. Two passes max.
Step 3 — Convert the Vocal in Applio
Go to the Inference tab. Load your trained model + its .index file.
| Inference Setting | Value | Notes |
|---|---|---|
| f0 Method | RMVPE (powerful singing) / Crepe (soft/breathy) | Match the vocal style |
| Index Ratio | 0.65 | Higher = more “you” but more breath artifacts; lower = cleaner but less identity |
| Protect | 0.33 | Below 0.2 = words get swallowed; above 0.4 = metallic S sounds |
| RMS Mix Rate | 0.2 | This is the emotion killer at default — 0.2 preserves dynamics |
| Filter Radius | 3 | Don’t change this |
| Pitch | 0 (same gender) / ±12 (octave) | ±5 to ±7 for non-octave cross-gender |
| Split Audio | ON | Prevents volume drops on long tracks |
| Autotune | Light or OFF | Only if pitch wobbles post-conversion |
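For reference, the Pitch values in the table are semitones, and the frequency ratio a shift applies is 2^(semitones/12), which is why ±12 is exactly an octave:

```python
# The Pitch setting above is in semitones. The frequency ratio a shift
# applies is 2**(semitones / 12): +12 doubles the pitch (one octave up),
# -12 halves it, and +7 is close to a perfect fifth (~1.498x).
def semitones_to_ratio(semitones):
    return 2.0 ** (semitones / 12.0)

print(semitones_to_ratio(12))              # 2.0
print(semitones_to_ratio(-12))             # 0.5
print(round(semitones_to_ratio(7), 3))     # 1.498
```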
If it still sounds robotic after all this: the problem is almost certainly your training data, not your settings. Check these in order: (1) Is there reverb baked into your recordings? Run them through DeEcho-DeReverb. (2) Is your dataset under 10 minutes? That’s below the quality floor for singing. (3) Are you using the stock pretrain instead of KLM/SingerPreTrain? That alone explains most “robotic singing” reports.
Step 4 — Mix in Audacity
Import the converted vocal + original instrumental on separate tracks. Align them (they should be the same length). Converted vocal typically sits 1–3 dB above the instrumental.
If sibilance (harsh S/SH sounds) is still present, apply Audacity’s de-esser to the converted vocal. Cut 2–6 kHz gently with EQ if needed. Add a tiny reverb to match the track’s space — converted vocals sound unnaturally dry without it.
Export as WAV or MP3. Done.
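If you would rather script the mix than click through Audacity, here is a minimal stdlib sketch. It assumes both files are mono 16-bit PCM WAV at the same sample rate, and it does only gain and sum (no de-esser, EQ, or reverb):

```python
# Stdlib-only sketch of the mix step: sum the converted vocal and the
# instrumental, with the vocal boosted 1-3 dB as suggested above.
# Assumes mono 16-bit PCM WAV inputs at the same sample rate.
import array
import wave

def mix_tracks(vocal_path, inst_path, out_path, vocal_gain_db=2.0):
    gain = 10.0 ** (vocal_gain_db / 20.0)  # dB -> linear amplitude
    with wave.open(vocal_path, "rb") as v, wave.open(inst_path, "rb") as i:
        params = v.getparams()
        voc = array.array("h", v.readframes(v.getnframes()))
        ins = array.array("h", i.readframes(i.getnframes()))
    n = min(len(voc), len(ins))  # mix over the overlapping length
    mixed = array.array("h")
    for k in range(n):
        s = int(voc[k] * gain) + ins[k]
        mixed.append(max(-32768, min(32767, s)))  # hard clip to 16-bit range
    with wave.open(out_path, "wb") as out:
        out.setparams(params)
        out.writeframes(mixed.tobytes())
```

Hard clipping is crude; if the sum clips, pull the vocal gain down rather than relying on the limiter line.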
The Metallic Sibilance Problem — What It Actually Is
The “robotic S sounds” you described is a known architectural limitation of HiFi-GAN (the vocoder RVC uses). It struggles with unvoiced, aperiodic sounds like S, SH, breaths. No single setting eliminates it — the fix is a chain:
- De-ess your training data before training (Audacity or iZotope RX)
- Pick the right epoch — sibilant segments overfit at different rates than sung notes
- Train at 32k instead of 40k if you’re willing to trade top-end sparkle for cleaner sibilants
- De-ess the converted output as a post-processing step
- Use KLM or SingerPreTrain — stock pretrain has the worst sibilance behavior
This is the current ceiling of the technology. Academic research (YingMusic-SVC, December 2025) is specifically designing new loss functions to fix this — future RVC versions may solve it at the model level. For now, the chain above gets you 90% of the way.
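One cheap heuristic for finding the sibilant-heavy passages worth de-essing (or A/B testing checkpoints on) is zero-crossing rate: unvoiced sounds like S and SH flip sign far more often than sung vowels. This is a rough illustration, not what any of the tools above actually implement:

```python
# Rough heuristic: zero-crossing rate flags noisy/unvoiced content
# (S, SH, breaths), which is where HiFi-GAN artifacts concentrate.
import math

def zero_crossing_rate(samples):
    """Fraction of adjacent sample pairs that change sign."""
    if len(samples) < 2:
        return 0.0
    crossings = sum(
        1 for a, b in zip(samples, samples[1:]) if (a < 0) != (b < 0)
    )
    return crossings / (len(samples) - 1)

# A sung vowel (low-frequency tone) crosses zero rarely...
tone = [math.sin(2 * math.pi * 220 * t / 16000) for t in range(1600)]
# ...while an S-like noise burst alternates sign almost every sample.
noise = [(-1) ** t * 0.3 for t in range(1600)]
print(zero_crossing_rate(tone) < 0.1 < zero_crossing_rate(noise))  # True
```

Frames with a high ZCR are the ones to audition when comparing checkpoints for metallic S sounds.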
Your Situation → What To Do
| If you have… | Do this | Skip this |
|---|---|---|
| No GPU / weak GPU | Use Applio’s Google Colab notebook | Local training |
| NVIDIA RTX 2060+ | Train locally in Applio — faster and no Colab time limits | Cloud training |
| Only a phone for recording | Record in a closet, clean with UVR5 noise removal first — workable but not ideal | Expecting studio quality |
| A $30 USB mic | You’re set — this is the sweet spot for dataset quality | Expensive gear |
| “I just want it done fast” | Try AICoverGen — automates the entire pipeline in one command | Manual steps |
| Suno Pro/Premier ($10+/mo) | Test Suno v5.5 Voices first — ~70% resemblance, zero setup, generates WITH your voice | But it won’t replace existing stems |
You mentioned wanting to replace Suno’s vocal stems specifically — that tells me you’ve already got songs you like and just want your voice on them, which is exactly what this pipeline does. What GPU are you working with right now?