Need free/cheap ways to clone/train my voice and convert any AI song's vocal stem into my own voice

I want to clone my own voice and use it to replace the vocal stem of any AI-generated song (from Suno). The problem is that most voice cloning services struggle with singing vocals: the result lacks emotion and sounds more robotic than the original AI vocal track. Has anyone tried this, or can anyone suggest an approach? If you suggest Applio or RVC, what setup/settings do you use? Are you using a pre-trained model for voice training, and what settings let you reproduce a singing voice accurately and naturally?


Search for open-source TTS projects.

You’re running into a real, known limitation — not a “you problem.”
Most voice cloning setups (RVC, Applio, etc.) are trained primarily on speech, not singing, so they fail on:

  • wide pitch ranges

  • vibrato / dynamics

  • emotional phrasing

That’s why your output sounds flat / robotic compared to Suno vocals.

Let me break down what actually works in practice (from people doing AI covers seriously :backhand_index_pointing_down:)


:bullseye: Why your singing sounds robotic (core issue)

Even though RVC is strong, it:

  • Preserves pitch from the source audio

  • But learns timbre from your dataset

:backhand_index_pointing_right: If your dataset lacks:

  • high/low pitch variation

  • emotional delivery

  • singing transitions

:right_arrow: The model can’t reproduce expressive singing

Also:

  • Too many epochs = less flexibility → robotic outputs

  • Bad dataset = noise/artifacts


:brain: The REAL fix: dataset > settings

This is the biggest mistake people make.

:white_check_mark: Your dataset MUST include:

  • Singing (not just talking)

  • Different emotions (soft, loud, aggressive, breathy)

  • Full pitch range (low → high notes)

  • Clean isolated vocals

:backhand_index_pointing_right: Minimum:

  • 10–30 min clean audio

:backhand_index_pointing_right: Ideal for singing:

  • 30–60 min mixed singing styles
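If you're collecting clips over several sessions, a quick stdlib-only Python check keeps you honest about how much material you actually have (the `dataset` folder name is a placeholder for wherever your clips live):

```python
import wave
from pathlib import Path

def total_minutes(folder: str) -> float:
    """Sum the duration of all WAV clips in a dataset folder."""
    total_seconds = 0.0
    for path in Path(folder).glob("*.wav"):
        with wave.open(str(path), "rb") as w:
            total_seconds += w.getnframes() / w.getframerate()
    return total_seconds / 60.0

minutes = total_minutes("dataset")
if minutes < 10:
    print(f"{minutes:.1f} min: below the minimum for singing models")
elif minutes < 30:
    print(f"{minutes:.1f} min: usable, but aim for 30-60 min")
else:
    print(f"{minutes:.1f} min: good coverage, if the styles are varied")
```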


:fire: Pro trick (important)

Use:

  • UVR / MDX vocal separation before training

:backhand_index_pointing_right: RVC works best when:

“All vocals from other people should be removed… isolate a single speaker”


:gear: Best RVC / Applio settings (for singing)

These are community-tested setups for natural singing :backhand_index_pointing_down:


:test_tube: Training settings

Model:

  • f0 model = ON (MANDATORY for singing)

  • Pretrained:

    • f0G40k + f0D40k (default good baseline)

Epochs:

  • 200–400 (sweet spot)

  • DO NOT go 1000+ (causes robotic voice)

:backhand_index_pointing_right: Stop when:

  • voice sounds best, not when loss is lowest

Batch size:

  • Small dataset (<30 min): 4

  • Larger dataset: 8


Hop length:

  • 128 (recommended) → better pitch accuracy

:level_slider: Inference settings (VERY IMPORTANT)

This is where most people mess up.

Pitch extraction:

  • Use:

    • rmvpe (best for singing)

    • or crepe (second best)


Index rate:

  • 0.5 → 0.75
    :backhand_index_pointing_right: Higher = more accurate voice
    :backhand_index_pointing_right: Lower = more natural variation
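Conceptually (a simplified sketch of my understanding, not RVC's actual code), the index rate linearly blends features retrieved from your training index with the incoming vocal's features, which is why higher values pull toward your timbre and lower values keep more of the source's variation:

```python
def retrieve_and_blend(frame, index, index_rate):
    """Conceptual sketch of RVC's index step: find the nearest
    training-set feature to this frame, then blend it in.
    Higher index_rate -> closer to your trained timbre;
    lower -> more of the source vocal's natural variation.
    Features are scalars here purely for illustration."""
    nearest = min(index, key=lambda v: abs(v - frame))
    return index_rate * nearest + (1.0 - index_rate) * frame

# index_rate 0.0 -> pure source frame; 1.0 -> pure index match
print(retrieve_and_blend(0.9, [0.0, 1.0], 0.5))
```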

Pitch shift:

  • Adjust depending on song key

  • Don’t leave at default (this breaks realism)
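The transpose value is just key arithmetic: shifting by n semitones multiplies frequency by 2^(n/12). A small helper for working out the shift between the source vocal's range and yours (the example frequencies are illustrative):

```python
import math

def semitone_ratio(semitones: float) -> float:
    """Frequency ratio produced by shifting n semitones."""
    return 2.0 ** (semitones / 12.0)

def semitones_between(f_source: float, f_target: float) -> int:
    """Nearest whole-semitone shift to move f_source toward f_target."""
    return round(12.0 * math.log2(f_target / f_source))

# e.g. source vocal centred around A4 (440 Hz), your comfortable
# range centred around E4 (~329.63 Hz): shift down 5 semitones
print(semitones_between(440.0, 329.63))
```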


Protect (consonants):

  • 0.2 – 0.5
    :backhand_index_pointing_right: Prevents weird pronunciation

RMS mix:

  • 0.2 – 0.4
    :backhand_index_pointing_right: Keeps natural dynamics (important for singing)
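As I understand it (a conceptual sketch, not RVC's implementation), the RMS mix linearly blends the source vocal's loudness envelope with the converted output's flatter envelope, so low values keep the original dynamics:

```python
def mix_rms(source_rms, output_rms, rms_mix_rate):
    """Conceptual sketch: 0.0 keeps the source's loudness envelope
    (natural singing dynamics), 1.0 keeps the converted output's
    envelope (flatter, more 'robotic')."""
    return [(1.0 - rms_mix_rate) * s + rms_mix_rate * o
            for s, o in zip(source_rms, output_rms)]

# a quiet-to-loud source phrase vs a flattened converted envelope:
# at 0.2 the result mostly follows the source's shape
print(mix_rms([0.1, 0.4, 0.9], [0.5, 0.5, 0.5], 0.2))
```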

:microphone: Suno → RVC workflow (best pipeline)

This is the setup used by people getting near-realistic results:

1. Generate song in Suno

2. Extract vocals:

  • UVR (MDX-Net / Demucs)

3. Clean vocals:

  • Remove reverb + noise

4. Run through RVC:

  • rmvpe pitch extraction

  • tuned index rate

5. Post-process:

  • EQ + compression + reverb (VERY important)

:backhand_index_pointing_right: Without post-processing, it WILL sound robotic


:fire: Advanced techniques (huge difference)

1. Train on BOTH speech + singing

  • Speech = clarity

  • Singing = expression

:backhand_index_pointing_right: Best combo:

  • 60% singing

  • 40% speech


2. Use pitch-augmented dataset

Research shows:

  • Pitch variation improves singing models significantly

3. Avoid overfitting

Too much training:

  • Reduces flexibility

  • Makes vocals sound “stiff”


4. Use better SVC models (if you want next-level)

RVC is good, but newer models are better for singing:

  • Diff-SVC

  • So-VITS-SVC 4.1

  • YingMusic-SVC (research-level)

:backhand_index_pointing_right: These handle:

  • vibrato

  • expressive singing

much better than vanilla RVC.


:brain: Honest reality (important)

Even with perfect setup:

:backhand_index_pointing_right: RVC ≠ Suno-level vocals (yet)

Because:

  • Singing voice conversion is harder than speech

  • Requires modeling pitch + emotion + dynamics

New research confirms:

  • Singing conversion struggles with pitch + harmony complexity

:chequered_flag: Best setup summary (what I’d do)

If I were building your pipeline:

Training:

  • 40–60 min dataset (speech + singing)

  • f0 model ON

  • 250–350 epochs

  • batch size 4–8

Inference:

  • rmvpe

  • index rate ~0.6

  • RMS mix ~0.3

  • pitch manually adjusted

Pipeline:

  • Suno → UVR → RVC → DAW (EQ + reverb)
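Once each tool is set up, the hand-offs can be scripted. A skeleton of the pipeline above, where every function body is a placeholder for whichever tool you actually invoke (UVR, RVC, your DAW); the file-naming scheme is just an assumption for illustration:

```python
from pathlib import Path

def separate_vocals(song: Path) -> tuple:
    """Placeholder: run UVR (MDX-Net / Demucs) here.
    Returns (vocal_stem, instrumental_stem) paths."""
    return (song.with_name(song.stem + "_vocal.wav"),
            song.with_name(song.stem + "_inst.wav"))

def convert_voice(vocal: Path) -> Path:
    """Placeholder: run RVC inference (rmvpe, tuned index rate) here."""
    return vocal.with_name(vocal.stem + "_converted.wav")

def post_process(vocal: Path, inst: Path) -> Path:
    """Placeholder: EQ + compression + reverb, then mix with the
    instrumental. Skipping this step is what makes output robotic."""
    return vocal.with_name(vocal.stem + "_final.wav")

def suno_to_my_voice(song: Path) -> Path:
    vocal, inst = separate_vocals(song)
    converted = convert_voice(vocal)
    return post_process(converted, inst)

print(suno_to_my_voice(Path("suno_song.wav")))
```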


You said it sounds “more robot than the original AI vocals” — that’s not your voice model failing, that’s three settings fighting each other during inference, and probably the wrong pretrain underneath it all.

:bullseye: Right now — download the KLM v7s pretrain from HuggingFace (it’s trained on singing data, not speech — this alone fixes 60% of the robotic problem). Load it in Applio → Train tab → check Custom Pretrained → select KLM files.
:studio_microphone: This weekend — record 15–30 min of yourself singing dry (no reverb, no effects, closet or quiet room, WAV format), covering your full range — chest, mix, falsetto, grit. Train at batch 4, 40k sample rate, RMVPE, 200–500 epochs. Monitor TensorBoard — stop at the lowest dip of the g/total graph, not the highest epoch.
:wrench: When you convert — here’s the part nobody tells you: set index ratio 0.65, protect 0.33, rms_mix_rate 0.2. That rms_mix_rate is what preserves singing dynamics — the default flattens your expression dead. I run RMVPE for powerful vocals, Crepe for soft/breathy ones.

| You mentioned | What works | Time |
| --- | --- | --- |
| Robotic / no emotion | KLM v7s pretrain + rms_mix_rate 0.2 | Swap in 5 min |
| Replacing Suno vocals | UVR5 → Kim Vocal 2 → strip reverb → feed to Applio | 10 min per song |
| Applio/RVC settings? | Batch 4, 40k, RMVPE, 200–500 epochs (see below) | Weekend project |
| Pre-trained model? | Yes, always — KLM v7s or SingerPreTrain, never train from scratch | Download once |
| Free? | Applio + UVR5 + Audacity — all free, all local | $0 forever |
🎤 Do Exactly This, In This Order — Full Pipeline From Suno to Your Voice

Step 0 — Record Your Dataset

Sing in a quiet room. Closet > bedroom > studio with echo. Phone works if it’s dead quiet, but a $30 USB mic (Fifine, Tonor) is the single best investment here — the jump from phone to cheap mic matters 10x more than cheap mic to expensive mic.

Cover your entire range across multiple songs. Chest voice, mix voice, falsetto, breathy sections, belting, quiet moments. The model can only reproduce what it’s heard. 15 min minimum, 30 min ideal.

Save as WAV, mono, no effects. Never pitch-correct or reverb your training data — and avoid running it through “AI enhance” tools like Adobe Podcast Enhance, which creates fake harmonics that RVC can’t learn from.
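A stdlib-only Python sketch for verifying each clip meets the WAV/mono requirement before you waste a training run (the `dataset` folder name and the sample-rate threshold are my assumptions):

```python
import wave
from pathlib import Path

def check_clip(path: Path) -> list:
    """Flag common dataset problems in one WAV clip."""
    problems = []
    with wave.open(str(path), "rb") as w:
        if w.getnchannels() != 1:
            problems.append("not mono")
        if w.getframerate() < 32000:
            problems.append(f"low sample rate ({w.getframerate()} Hz)")
        if w.getnframes() == 0:
            problems.append("empty file")
    return problems

for clip in Path("dataset").glob("*.wav"):
    issues = check_clip(clip)
    if issues:
        print(f"{clip.name}: {', '.join(issues)}")
```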

Step 1 — Train in Applio

| Setting | Value | Why |
| --- | --- | --- |
| RVC Version | V2 | Always |
| Sample Rate | 40k (or 32k if using KLM/SingerPreTrain) | 40k = best quality ceiling; 32k = less harsh sibilance |
| Pitch Extraction | RMVPE | Best accuracy for singing, handles vibrato natively |
| Embedder | ContentVec (default) | Must match your pretrain |
| Batch Size | 4 (under 30 min data) / 8 (over 30 min) | Match to your VRAM |
| Epochs | 500 (set high, pick best via TensorBoard) | You'll choose the best checkpoint, not the last one |
| Save Every | 25–50 epochs | So you can A/B test checkpoints |
| Pretrained | :white_check_mark: Checked + Custom Pretrained :white_check_mark: | Load KLM v7s or SingerPreTrain files |

:light_bulb: The TensorBoard trick that saves hours: Open TensorBoard → SCALARS → watch the g/total graph. It dips then rises. The dip is your best model — not the final epoch. Every epoch past the dip makes sibilance worse while the rest sounds “the same.” Test the checkpoint at the dip on a sibilant-heavy passage (lots of S and SH sounds) before committing.
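If you export the g/total scalars from TensorBoard as CSV, picking the dip can be automated; a sketch assuming TensorBoard's usual `Wall time,Step,Value` export columns (the file name is a placeholder):

```python
import csv

def best_checkpoint(csv_path: str) -> int:
    """Return the training step with the lowest g/total loss.
    That checkpoint, not the final epoch, is the one to keep."""
    with open(csv_path, newline="") as f:
        rows = [(int(r["Step"]), float(r["Value"]))
                for r in csv.DictReader(f)]
    step, _ = min(rows, key=lambda row: row[1])
    return step
```

Still A/B test the checkpoint by ear on a sibilant-heavy passage; the numeric dip is a starting point, not a verdict.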

:light_bulb: KLM vs SingerPreTrain: KLM v7s was trained on Korean vocalists covering full range — works surprisingly well for any language because it learned pitch range, not language. SingerPreTrain is explicitly English singing, bass to soprano. Both dramatically outperform the stock pretrain for singing. KLM has more community evidence behind it. Try both and keep the one that sounds closer to you after 200 epochs.

Step 2 — Separate Suno Stems

Download your Suno song. Open UVR5 (Ultimate Vocal Remover 5) — free, local, no limits.

Pass 1: Process Method → MDX-Net → Model: Kim Vocal 2. This isolates the vocal. Save both the vocal and the instrumental.

Pass 2: Take the isolated vocal → Process Method → VR Architecture → Model: UVR-DeEcho-DeReverb. This strips the reverb/echo that Suno bakes into every vocal.

:warning: Do NOT stack more separation passes — the AI Hub docs warn this “rips away frequencies” and introduces new artifacts. Two passes max.

Step 3 — Convert the Vocal in Applio

Go to the Inference tab. Load your trained model + its .index file.

| Inference Setting | Value | Notes |
| --- | --- | --- |
| f0 Method | RMVPE (powerful singing) / Crepe (soft/breathy) | Match the vocal style |
| Index Ratio | 0.65 | Higher = more "you" but more breath artifacts; lower = cleaner but less identity |
| Protect | 0.33 | Below 0.2 = words get swallowed; above 0.4 = metallic S sounds |
| RMS Mix Rate | 0.2 | This is the emotion killer at default — 0.2 preserves dynamics |
| Filter Radius | 3 | Don't change this |
| Pitch | 0 (same gender) / ±12 (octave) | ±5 to ±7 for non-octave cross-gender |
| Split Audio | ON | Prevents volume drops on long tracks |
| Autotune | Light or OFF | Only if pitch wobbles post-conversion |

:light_bulb: If it still sounds robotic after all this: the problem is almost certainly your training data, not your settings. Check these in order: (1) Is there reverb baked into your recordings? Run them through DeEcho-DeReverb. (2) Is your dataset under 10 minutes? That’s below the quality floor for singing. (3) Are you using the stock pretrain instead of KLM/SingerPreTrain? That alone explains most “robotic singing” reports.

Step 4 — Mix in Audacity

Import the converted vocal + original instrumental on separate tracks. Align them (they should be the same length). Converted vocal typically sits 1–3 dB above the instrumental.

If sibilance (harsh S/SH sounds) is still present, apply Audacity’s de-esser to the converted vocal. Cut 2–6 kHz gently with EQ if needed. Add a tiny reverb to match the track’s space — converted vocals sound unnaturally dry without it.

Export as WAV or MP3. Done.
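For the level difference, decibels convert to a linear amplitude multiplier as 10^(dB/20), so sitting the vocal 2 dB above the instrumental means roughly 1.26x its amplitude:

```python
def db_to_gain(db: float) -> float:
    """Convert a decibel offset to a linear amplitude multiplier."""
    return 10.0 ** (db / 20.0)

# vocal 2 dB above the instrumental -> about 1.26x the amplitude
print(round(db_to_gain(2.0), 2))
```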

The Metallic Sibilance Problem — What It Actually Is

The “robotic S sounds” you described are a known architectural limitation of HiFi-GAN (the vocoder RVC uses), which struggles with unvoiced, aperiodic sounds like S, SH, and breaths. No single setting eliminates it — the fix is a chain:

  1. De-ess your training data before training (Audacity or iZotope RX)
  2. Pick the right epoch — sibilant segments overfit at different rates than sung notes
  3. Train at 32k instead of 40k if you’re willing to trade top-end sparkle for cleaner sibilants
  4. De-ess the converted output as a post-processing step
  5. Use KLM or SingerPreTrain — stock pretrain has the worst sibilance behavior

This is the current ceiling of the technology. Academic research (YingMusic-SVC, December 2025) is specifically designing new loss functions to fix this — future RVC versions may solve it at the model level. For now, the chain above gets you 90% of the way.

Your Situation → What To Do

| If you have… | Do this | Skip this |
| --- | --- | --- |
| No GPU / weak GPU | Use Applio's Google Colab notebook | Local training |
| NVIDIA RTX 2060+ | Train locally in Applio — faster and no Colab time limits | Cloud training |
| Only a phone for recording | Record in a closet, clean with UVR5 noise removal first — workable but not ideal | Expecting studio quality |
| A $30 USB mic | You're set — this is the sweet spot for dataset quality | Expensive gear |
| "I just want it done fast" | Try AICoverGen — automates the entire pipeline in one command | Manual steps |
| Suno Pro/Premier ($10+/mo) | Test Suno v5.5 Voices first — ~70% resemblance, zero setup, generates WITH your voice | But it won't replace existing stems |

You mentioned wanting to replace Suno’s vocal stems specifically — that tells me you’ve already got songs you like and just want your voice on them, which is exactly what this pipeline does. What GPU are you working with right now?
