Ultimate RVC Maker ⚡

Convert Audio

Convert audio using a trained voice model

Model file

Index file

Select separate files

Drop audio here

Audio input

Index strength

Higher values give the index more influence on the result; lower values may reduce artifacts in the audio

0 1

Export format

Format in which to export the audio file

wav mp3 flac ogg opus m4a mp4 aac alac wma aiff webm ac3

Input audio path

Enter the path to the audio file

Audio output path

Enter the output path (leave it as .wav format; it will auto-correct during conversion)

Clean audio

Auto-tune

Use separated audio

Using memory-efficient training

Convert original voice

Convert backup voice

Do not merge backup voice

Merge instruments

Pitch

Recommendation: set to 12 to change a male voice to female, and to -12 for the reverse

-20 20
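
The recommendation of 12 makes sense if the pitch slider counts semitones, which is an assumption here rather than something the interface states: 12 semitones is one octave, so the pitch doubles. A minimal sketch of that arithmetic, with hypothetical helper names:

```python
def semitone_ratio(semitones: float) -> float:
    """Frequency multiplier for a shift in semitones (equal temperament)."""
    return 2.0 ** (semitones / 12.0)


def shift_f0(f0_hz, semitones):
    """Shift a pitch contour; 0 marks an unvoiced frame and is left untouched."""
    ratio = semitone_ratio(semitones)
    return [f * ratio if f > 0 else 0.0 for f in f0_hz]


contour = [110.0, 0.0, 220.0]      # hypothetical pitch values in Hz
print(shift_f0(contour, 12))       # one octave up
print(shift_f0(contour, -12))      # one octave down
```

How the application applies the shift internally is not described in the interface; this only illustrates the relationship between semitones and frequency.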

Audio cleaning strength

Strength of the audio cleaner for filtering vocals during export

0 1

Extracting pitch using the ONNX model can help improve speed

F0 ONNX Mode

Unlock all pitch extraction methods

Unlock all

Extraction method

Method used for pitch extraction

mangio-crepe-full crepe-full fcpe rmvpe harvest pyin hybrid

HYBRID extraction method

Combination of two or more extraction methods

Hop length

Step between successive analysis windows during pitch extraction. Smaller values capture more detail but require more computation

1 512
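
To see why a smaller hop length costs more computation, it helps to count analysis frames. The sketch below assumes one frame starts every hop_length samples and uses a hypothetical 16 kHz input; the application's real framing (padding, internal sample rate) may differ.

```python
import math


def frame_count(num_samples: int, hop_length: int) -> int:
    """Number of analysis frames when one frame starts every hop_length samples."""
    if hop_length < 1:
        raise ValueError("hop_length must be at least 1")
    return math.ceil(num_samples / hop_length)


one_second = 16000  # hypothetical: one second of audio at 16 kHz
for hop in (64, 160, 512):
    print(hop, frame_count(one_second, hop))
```

Each frame needs its own pitch estimate, so moving from a hop of 512 to 64 multiplies the work by roughly eight.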

Upload F0 file

F0 File

Embedders Mode

Extracting embeddings using different models

fairseq onnx transformers spin

Embedding model

Pre-trained model to assist embedding

contentvec_base hubert_base japanese_hubert_base korean_hubert_base chinese_hubert_base portuguese_hubert_base custom

Model name

If you have your own model, just upload it and input the name here

Preset file

Save cleanup

Save autotune

Save pitch

Save index impact

Save resampling

Save median filter

Save sound envelope

Save sound protection

Save sound split

Pitch and Formant Shift

File name to save

Upload preset file

Split audio

Pitch and Formant Shift

Auto-tune rate

Level of auto-tuning adjustment

0 1

Resample

Resample the output to the final sample rate in post-processing; 0 means no resampling. NOTE: SOME FORMATS DO NOT SUPPORT RATES ABOVE 48000

0 96000

Filter radius

If greater than three, median filtering is applied. The value represents the filter radius and can reduce breathiness or noise.

0 7
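
A median filter replaces each value with the median of its neighbours, which removes isolated spikes from a pitch contour. The sketch below is a plain-Python illustration; the function name is hypothetical, and how the slider value maps to a window size inside the application is an assumption.

```python
from statistics import median


def median_filter(values, window):
    """Replace each value by the median of an odd-sized window centred on it."""
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd number")
    if not values:
        return []
    half = window // 2
    # Repeat the edge values so the first and last frames have full windows.
    padded = [values[0]] * half + list(values) + [values[-1]] * half
    return [median(padded[i:i + window]) for i in range(len(values))]


f0 = [100.0, 101.0, 300.0, 102.0, 103.0]  # hypothetical contour with one spike
print(median_filter(f0, 3))
```

The 300 Hz outlier disappears while the surrounding values barely move, which is the effect the description attributes to the filter.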

Volume envelope

Use the input volume envelope to replace or mix with the output volume envelope. The closer to 1, the more the output envelope is used

0 1

Consonant protection

Protect distinct consonants and breathing sounds to prevent audio tearing and other artifacts. Increasing this value gives more comprehensive protection. Reducing it weakens the protection but may also lessen the indexing effect

0 1

Frequency for Formant Shift

0 16

Timbre for Formant Transformation

0 16

Converted audio

Convert main voice

Convert backup voice

Main voice + Backup voice

Convert original voice

Voice + Instruments

Convert Text to Speech

Convert text to speech and read it aloud using the trained voice model

Model file

Index file

Input data from a text file (.txt)

Convert text using Google

Text to read

Reading speed

Speed of the voice

-100 100

Pitch

Recommendation: set to 12 to change a male voice to female, and to -12 for the reverse

-20 20

Drop text file here

Voices by country

Pitch

Pitch adjustment for text-to-speech converter

-20 20

Index strength

Higher values give the index more influence on the result; lower values may reduce artifacts in the audio

0 1

Export format

Format in which to export the audio file

wav mp3 flac ogg opus m4a mp4 aac alac wma aiff webm ac3

Output speech path

Enter the output path

Converted speech output path

Enter the output path

Extracting pitch using the ONNX model can help improve speed

F0 ONNX Mode

Unlock all pitch extraction methods

Unlock all

Extraction method

Method used for pitch extraction

mangio-crepe-full crepe-full fcpe rmvpe harvest pyin hybrid

HYBRID extraction method

Combination of two or more extraction methods

Hop length

Step between successive analysis windows during pitch extraction. Smaller values capture more detail but require more computation

1 512

Upload F0 file

F0 File

Embedders Mode

Extracting embeddings using different models

fairseq onnx transformers spin

Embedding model

Pre-trained model to assist embedding

contentvec_base hubert_base japanese_hubert_base korean_hubert_base chinese_hubert_base portuguese_hubert_base custom

Model name

If you have your own model, just upload it and input the name here

Pitch and Formant Shift

Split audio

Clean audio

Auto-tune

Using memory-efficient training

Auto-tune rate

Level of auto-tuning adjustment

0 1

Audio cleaning strength

Strength of the audio cleaner for filtering vocals during export

0 1

Resample

Resample the output to the final sample rate in post-processing; 0 means no resampling. NOTE: SOME FORMATS DO NOT SUPPORT RATES ABOVE 48000

0 96000

Filter radius

If greater than three, median filtering is applied. The value represents the filter radius and can reduce breathiness or noise.

0 7

Volume envelope

Use the input volume envelope to replace or mix with the output volume envelope. The closer to 1, the more the output envelope is used

0 1

Consonant protection

0 1

Frequency for Formant Shift

0 16

Timbre for Formant Transformation

0 16

Unconverted and converted audio

Generated speech from text-to-speech conversion

Speech converted using the model

Convert Audio With Whisper

Convert audio using a trained speech model with a Whisper model for speech recognition

Whisper recognizes the different voices and cuts the audio into per-voice segments; the RVC model then converts those segments

The Whisper model may not work properly, which can cause strange output

Model file

Index file

Model file

Index file

Clean audio

Auto-tune

Using memory-efficient training

Pitch and Formant Shift

Number of voices

Number of voices in the audio

2 8

Pitch

Recommendation: set to 12 to change a male voice to female, and to -12 for the reverse

-20 20

Index strength

Higher values give the index more influence on the result; lower values may reduce artifacts in the audio

0 1

Export format

Format in which to export the audio file

wav mp3 flac ogg opus m4a mp4 aac alac wma aiff webm ac3

Input audio path

Enter the path to the audio file

Audio output path

Enter the output path (leave it as .wav format; it will auto-correct during conversion)

Drop audio here

Pitch

Recommendation: set to 12 to change a male voice to female, and to -12 for the reverse

-20 20

Index strength

Higher values give the index more influence on the result; lower values may reduce artifacts in the audio

0 1

Whisper model size

Large models can produce strange outputs

tiny tiny.en base base.en small small.en medium medium.en large-v1 large-v2 large-v3 large-v3-turbo

Extracting pitch using the ONNX model can help improve speed

F0 ONNX Mode

Unlock all pitch extraction methods

Unlock all

Extraction method

Method used for pitch extraction

mangio-crepe-full crepe-full fcpe rmvpe harvest pyin hybrid

HYBRID extraction method

Combination of two or more extraction methods

Hop length

Step between successive analysis windows during pitch extraction. Smaller values capture more detail but require more computation

1 512

Embedders Mode

Extracting embeddings using different models

fairseq onnx transformers spin

Embedding model

Pre-trained model to assist embedding

contentvec_base hubert_base japanese_hubert_base korean_hubert_base chinese_hubert_base portuguese_hubert_base custom

Model name

If you have your own model, just upload it and input the name here

Audio cleaning strength

Strength of the audio cleaner for filtering vocals during export

0 1

Auto-tune rate

Level of auto-tuning adjustment

0 1

Resample

Resample the output to the final sample rate in post-processing; 0 means no resampling. NOTE: SOME FORMATS DO NOT SUPPORT RATES ABOVE 48000

0 96000

Filter radius

If greater than three, median filtering is applied. The value represents the filter radius and can reduce breathiness or noise.

0 7

Volume envelope

Use the input volume envelope to replace or mix with the output volume envelope. The closer to 1, the more the output envelope is used

0 1

Consonant protection

0 1

Frequency for Formant Shift 1

Frequency for Formant Shift

0 16

Timbre for Formant Transformation 1

Timbre for Formant Transformation

0 16

Frequency for Formant Shift 2

Frequency for Formant Shift

0 16

Timbre for Formant Transformation 2

Timbre for Formant Transformation

0 16

Audio input, output

Audio input

Speech converted using the model

Download Model

Download voice models, pre-trained models, and embedding models

Choose a model download method

Download from the link Download from the CSV model repository Search models Upload

Link to the model

Model name

Model repository

Name to search

Choose a searched model (Click to select)

Drop model here

Choose a model download method

Download from the link Model list Upload

Pre-trained model link D

Supports only huggingface.co

Pre-trained model link G

Supports only huggingface.co

Choose pre-trained model

Choose a pre-trained model to download

Model sample rate

Drop pre-trained model G here

Drop pre-trained model D here

Create Training Dataset from YouTube

Process and create training datasets using YouTube links

Link audio

Link to audio (use commas for multiple links)

Dataset output

Output data after creation

Remove vocal reverb

Denoise

Voice separation version

The model version for separating vocals

Version-1 Version-2

Overlap

Overlap amount between prediction windows

0.25 0.5 0.75 0.99

Hop length

Step between successive analysis windows when transforming the audio. Smaller values capture more detail but require more computation

1 8192

Batch size

Number of samples processed at a time. Batch processing optimizes calculations. Large batches can cause memory overflow; small batches reduce resource efficiency

1 64

Segments Size

Higher values give better quality but use more resources

32 3072

Sample rate

NOTE: SOME FORMATS DO NOT SUPPORT RATES ABOVE 48000

8000 96000

Clean audio

Skip

Audio cleaning strength

Strength of the audio cleaner for filtering vocals during export

0 1

Skip beginning

Skip the initial seconds of the audio; use commas for multiple audios

Skip end

Skip the final seconds of the audio; use commas for multiple audios
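
When several links are given, the skip fields take one value per audio, separated by commas. The sketch below shows one way such input could be parsed and applied; the function names are hypothetical, and the assumption that the values are paired with the links in order is not confirmed by the interface.

```python
def parse_seconds(text: str):
    """Turn a comma-separated field such as '5, 10.5, 0' into a list of floats."""
    return [float(part) for part in text.split(",") if part.strip()]


def trim(samples, sample_rate, skip_start, skip_end):
    """Drop skip_start seconds from the beginning and skip_end seconds from the end."""
    start = int(skip_start * sample_rate)
    end = len(samples) - int(skip_end * sample_rate)
    return samples[start:max(start, end)]


starts = parse_seconds("5, 10.5, 0")
print(starts)
print(trim(list(range(10)), sample_rate=2, skip_start=1, skip_end=1.5))
```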

Dataset creation information

Train Model

Train and build a voice model with a set of voice data

Model name

Name of the model during training (avoid special characters or spaces)

Sample rate

Sample rate of the model

32k 40k 48k

Model version

Version of the model during training

v1 v2

Clean dataset

Split audio

Post processing

Using memory-efficient training

Pitch Guidance

Upload dataset

Audio cleaning strength

Strength of the audio cleaner for filtering vocals during export

0 1

Drop audio here

Preprocessing information

Extracting pitch using the ONNX model can help improve speed

F0 ONNX Mode

Unlock all pitch extraction methods

Unlock all

Extraction method

Method used for pitch extraction

mangio-crepe-full crepe-full fcpe rmvpe harvest pyin

Hop length

Step between successive analysis windows during pitch extraction. Smaller values capture more detail but require more computation

1 512

Embedders Mode

Extracting embeddings using different models

fairseq onnx transformers spin

Embedding model

Pre-trained model to assist embedding

contentvec_base hubert_base japanese_hubert_base korean_hubert_base chinese_hubert_base portuguese_hubert_base custom

Model name

If you have your own model, just upload it and input the name here

Data extraction information

Total epochs

Total training epochs

1 10000

Save frequency

Frequency of saving the model during training to allow retraining

1 10000

Index algorithm

Algorithm for creating the index

Auto Faiss KMeans

Custom dataset folder for training data

Custom dataset folder

Check for overtraining during model training

Overtraining detector

Only enable if you need to retrain the model from scratch.

Clean Up

Store the model in GPU cache memory

Cache in GPU

Folder containing dataset

Overtraining threshold

1 100

Number of GPUs used

Number of GPUs used during training

GPU information

Information about the GPU used during training

Number of CPU cores available

Number of CPU cores used during training

0 16

Batch size

Number of samples processed simultaneously in one training cycle. Higher values can cause memory overflow

1 64

Save only the latest D and G models

Save only the latest

Save all models after each epoch

Save all models

Do not use pre-trained models

Do not use pretraining

Customize pre-training settings

Custom pretraining

Vocoder

A vocoder analyzes and synthesizes human speech signals for voice transformation.

Default: This option is HiFi-GAN-NSF, compatible with all RVCs

MRF-HiFi-GAN: Higher fidelity.

RefineGAN: Superior sound quality.

Default MRF-HiFi-GAN RefineGAN

When enabled, highly deterministic algorithms are used, ensuring that each run of the same input data will yield the same results.

When disabled, more optimal algorithms may be selected but may not be fully deterministic, resulting in different training results between runs.

Deterministic algorithm

When enabled, it will test and select the most optimized algorithm for the specific hardware and size. This can help speed up training.

When disabled, it will not perform this algorithm optimization, which can reduce speed but ensures that each run uses the same algorithm, which is useful if you want to reproduce exactly.

Benchmark algorithm

Model creator name

To credit the model, enter your name here

Pre-trained model file D

Pre-trained model file G

Training information

Model file

Index file

Output file after compression

Fuse Two Models

Combine two voice models into a single model

Model name

Model file 1

Model file 2

Model path 1

Model path 2

Model ratio

Adjusting towards one side will make the model more like that side

0 1
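
One common way to combine two models of the same architecture is a weighted average of their parameters. The sketch below illustrates that idea on plain lists; it assumes the ratio is the weight given to model 1, which the interface does not specify, and it is not taken from the application's code.

```python
def fuse_weights(weights_a, weights_b, ratio):
    """Blend two parameter sets: ratio * A + (1 - ratio) * B, tensor by tensor."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    if weights_a.keys() != weights_b.keys():
        raise ValueError("both models must have the same parameter names")
    return {
        name: [ratio * a + (1.0 - ratio) * b
               for a, b in zip(weights_a[name], weights_b[name])]
        for name in weights_a
    }


# Hypothetical two-parameter "models" for illustration only.
model_1 = {"layer.weight": [1.0, 3.0]}
model_2 = {"layer.weight": [3.0, 5.0]}
print(fuse_weights(model_1, model_2, 0.5))
```

At a ratio of 1 the result equals model 1, at 0 it equals model 2, and values in between move gradually from one to the other.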

Model output path

Read Model Information

Retrieve recorded information within the model

Drop model here

Model path

Enter the path to the model file

Model Information

Convert PyTorch Model to ONNX Model

Convert an RVC model from PyTorch to ONNX to optimize audio conversion

Drop model here

Model path

Enter the path to the model file

Model output path

Music Separation

A simple music separation system that splits audio into 4 parts: Instruments, Vocals, Main vocals, Backup vocals

Clean audio

Separate backup vocals

Remove vocal reverb

Remove backup reverb

Denoise MDX separation

Music separation model

Backup separation model

Shift

Higher values give better quality but are slower and use more resources

1 20

Segments Size

Higher values give better quality but use more resources

32 3072

Batch size

Number of samples processed at a time. Batch processing optimizes calculations. Large batches can cause memory overflow; small batches reduce resource efficiency

1 64

Overlap

Overlap amount between prediction windows

0.25 0.5 0.75 0.99
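
Overlap controls how far each prediction window advances: with a higher overlap the windows move in smaller steps, so more of them are needed to cover the same audio. A rough sketch of that relationship, assuming fixed-size windows and hypothetical lengths (the separation model's actual tiling is not described in the interface):

```python
def window_starts(total_length: int, window: int, overlap: float) -> list:
    """Start positions of fixed-size windows that overlap by the given fraction."""
    if not 0.0 <= overlap < 1.0:
        raise ValueError("overlap must be in [0, 1)")
    step = max(1, int(window * (1.0 - overlap)))
    last_start = max(total_length - window, 0)
    # A trailing remainder shorter than one step is ignored in this sketch.
    return list(range(0, last_start + 1, step))


for overlap in (0.25, 0.5, 0.75):
    starts = window_starts(1000, 400, overlap)
    print(overlap, len(starts), starts)
```

More windows generally mean smoother transitions between predictions at the cost of more computation.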

Hop length

Step between successive analysis windows when transforming the audio. Smaller values capture more detail but require more computation

1 8192

Audio cleaning strength

Strength of the audio cleaner for filtering vocals during export

0 1

Sample rate

NOTE: SOME FORMATS DO NOT SUPPORT RATES ABOVE 48000

8000 96000

Drop audio here

Audio input

Link audio

Export format

Format in which to export the audio file

wav mp3 flac ogg opus m4a mp4 aac alac wma aiff webm ac3

Input audio path

Output audio folder path

Enter the folder path where the audio will be exported

Separated output

Instruments

Original vocal

Main vocal

Backup vocal

Editing Soundtrack Using AudioLDM2 Model

Editing the soundtrack with the AudioLDM2 model can help change the type of instrument in the soundtrack

Effective Edit

Prompt

Describe your desired edited output

Source Prompt

The extent to which the source influences the output. Higher values retain more characteristics from the source. Lower values give the system more freedom to transform.

0.5 25

Target Prompt

The extent to which the target influences the final result. Higher values force the result to follow the characteristics of the target. Lower values balance the source and target.

0.5 25

Drop audio here

Audio input

Export format

Format in which to export the audio file

wav mp3 flac ogg opus m4a mp4 aac alac wma aiff webm ac3

Input audio path

Enter the path to the audio file

Audio output path

Enter the output path (leave it as .wav format; it will auto-correct during conversion)

AudioLDM2 model

Choose the AudioLDM2 model to use

Loading the weights and running inference can take a long time, depending on your GPU

audioldm2 audioldm2-large audioldm2-music

Source Prompt

Optional: Describe the original audio input

Sample rate

NOTE: SOME FORMATS DO NOT SUPPORT RATES ABOVE 48000

8000 96000

Edit Level

Lower levels stay closer to the original sound; higher levels apply a stronger edit.

15 85

Number of diffusion steps

Higher values (e.g. 200) produce higher quality output.

10 300

Audio output

Add Additional Audio Effects

Add effects to audio

Reverb effect

Chorus effect

Delay effect

Phaser effect

Compressor effect

Additional options

Drop audio here

Audio input

Enter the path to the audio file

Audio output

Enter the output path

Merge instruments

Audio input

Enter the path to the audio file

Export format

Format in which to export the audio file

wav mp3 flac ogg opus m4a mp4 aac alac wma aiff webm ac3

Create a continuous echo effect when this mode is enabled

Freeze mode

Room size

Adjust the room space to create reverberation

0 1

Damping

Adjust the level of absorption to control the amount of reverberation

0 1

Reverb signal level

Adjust the level of the reverb signal effect

0 1

Original signal level

Adjust the level of the signal without effects

0 1

Audio width

Adjust the width of the audio space

0 1

Chorus depth

Adjust the intensity of the chorus to create a wider sound

0 1

Frequency

Adjust the oscillation speed of the chorus effect

0.1 10

Mix signals

Adjust the mix level between the original and the processed signal

0 1

Center delay (ms)

The delay time between stereo channels to create the chorus effect

0 50

Feedback

Adjust the amount of the effect signal fed back into the original signal

-1 1

Delay time

Adjust the delay time between the original and the processed signal

0 5

Delay feedback

Adjust the amount of feedback signal, creating a repeating effect

0 1

Delay signal mix

Adjust the mix level between the original and delayed signal

0 1
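
The three delay controls above can be read as the parameters of a simple feedback delay line. The sketch below is a plain-Python illustration under two assumptions: the delay time is in seconds, and the mix blends dry and delayed signals linearly. It is not the application's implementation.

```python
def feedback_delay(samples, sample_rate, delay_seconds, feedback, mix):
    """Echo effect: each repeat is the previous one scaled by `feedback`."""
    delay = max(1, int(delay_seconds * sample_rate))
    wet = [0.0] * len(samples)
    for n in range(delay, len(samples)):
        # Delayed input plus a scaled copy of the earlier echo.
        wet[n] = samples[n - delay] + feedback * wet[n - delay]
    return [(1.0 - mix) * dry + mix * w for dry, w in zip(samples, wet)]


impulse = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
print(feedback_delay(impulse, sample_rate=1, delay_seconds=2, feedback=0.5, mix=0.5))
```

With a single impulse as input, the output shows the original followed by echoes that halve in level each time, which is the repeating effect the feedback control describes.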

Fade effect

Bass and treble

Threshold limiter

Resample

Distortion effect

Audio gain

Bit reduction effect

Clipping effect

Fade-in effect (ms)

Time for the audio to gradually increase from 0 to normal level

0 10000

Fade-out effect (ms)

Time for the audio to gradually decrease from normal level to 0

0 10000
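
A fade is a gain ramp applied to the first or last stretch of the signal. The sketch below uses linear ramps and a hypothetical function name; the curve shape the application uses is not stated, so linear is an assumption.

```python
def apply_fades(samples, sample_rate, fade_in_ms, fade_out_ms):
    """Apply a linear fade-in at the start and a linear fade-out at the end."""
    out = list(samples)
    n_in = min(len(out), int(sample_rate * fade_in_ms / 1000))
    n_out = min(len(out), int(sample_rate * fade_out_ms / 1000))
    for i in range(n_in):
        out[i] *= i / n_in                  # gain rises from 0 towards 1
    for i in range(n_out):
        out[len(out) - 1 - i] *= i / n_out  # gain falls to 0 at the last sample
    return out


tone = [1.0] * 8  # hypothetical constant-level signal
print(apply_fades(tone, sample_rate=1000, fade_in_ms=4, fade_out_ms=2))
```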

Bass boost level (dB)

Amount of bass boost in the audio track

0 20

Low-pass filter cutoff frequency (Hz)

Frequencies above this cutoff are reduced. A lower cutoff makes the bass clearer

20 200

Treble boost level (dB)

Amount of treble boost in the audio track

0 20

High-pass filter cutoff frequency (Hz)

Frequencies below this cutoff are filtered out. The higher the cutoff, the higher the sounds that are retained.

1000 10000

Limiter threshold

Limit the maximum audio level to prevent it from exceeding the threshold

-60 0

Release time

Time for the audio to return to normal after being limited (milliseconds)

10 1000

Pitch

Recommendation: set to 12 to change a male voice to female, and to -12 for the reverse

-20 20

Resample

Resample the output to the final sample rate in post-processing; 0 means no resampling. NOTE: SOME FORMATS DO NOT SUPPORT RATES ABOVE 48000

0 96000

Distortion effect

Adjust the level of distortion to create a noisy effect

0 50

Audio gain

Adjust the volume level of the signal

-60 60

Clipping threshold

Trim signals exceeding the threshold, creating a distorted sound

-60 0
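
The range of -60 to 0 suggests the threshold is given in decibels relative to full scale, which is an assumption here. Under that reading, the sketch below converts the threshold to a linear amplitude and trims anything beyond it; the function names are hypothetical.

```python
def db_to_amplitude(db: float) -> float:
    """Convert a level in dB (relative to full scale) to a linear amplitude."""
    return 10.0 ** (db / 20.0)


def hard_clip(samples, threshold_db):
    """Trim every sample whose magnitude exceeds the threshold."""
    limit = db_to_amplitude(threshold_db)
    return [max(-limit, min(limit, s)) for s in samples]


print(db_to_amplitude(-20))                    # about 0.1
print(hard_clip([0.5, -0.9, 0.05], -20))       # peaks flattened near +/-0.1
```

Flattening the peaks is what adds the distorted character: the lower the threshold, the more of the waveform is cut off.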

Bit depth

Reduce audio quality by decreasing bit depth, creating a distorted effect

1 24
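
Reducing bit depth means snapping each sample to a coarser grid of allowed values. The sketch below assumes samples in the range -1 to 1 and signed integer quantization; it is an illustration of the idea, not the application's code.

```python
def reduce_bit_depth(samples, bits: int):
    """Quantize samples in [-1, 1] to the grid of a signed integer with `bits` bits."""
    if bits < 1:
        raise ValueError("bits must be at least 1")
    levels = 2 ** (bits - 1)
    quantized = []
    for s in samples:
        step = round(s * levels)
        step = max(-levels, min(levels - 1, step))  # stay inside the signed range
        quantized.append(step / levels)
    return quantized


print(reduce_bit_depth([0.3, -0.8, 0.1, 1.0], bits=2))
```

At 2 bits only four values remain, so the rounding error is large and audible; at 16 or 24 bits the grid is fine enough that the effect is negligible.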

Phaser depth

Adjust the depth of the effect, impacting its intensity

0 1

Frequency

Adjust the frequency of the phaser effect

0.1 10

Mix signal

Adjust the mix level between the original and processed signals

0 1

Center frequency

The center frequency of the phaser effect, affecting the adjusted frequencies

50 5000

Feedback

Adjust the feedback level of the effect, creating a stronger or lighter phaser feel

-1 1

Compressor threshold

The threshold level above which the audio will be compressed

-60 0

Compression ratio

Adjust the level of audio compression when exceeding the threshold

1 20

Attack time (ms)

Time for compression to start taking effect after the audio exceeds the threshold

0.1 100

Release time

Time for the audio to return to normal after being compressed

10 1000
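
The threshold and ratio settings above follow the usual definition of a compressor's static curve: below the threshold the level is unchanged, and above it every `ratio` dB of input produces only 1 dB of output. A minimal sketch of that curve, assuming levels in dB and ignoring attack and release smoothing:

```python
def compressed_level_db(input_db: float, threshold_db: float, ratio: float) -> float:
    """Output level of an idealized compressor for a steady input level."""
    if ratio < 1:
        raise ValueError("ratio must be at least 1")
    if input_db <= threshold_db:
        return input_db
    return threshold_db + (input_db - threshold_db) / ratio


for level in (-30.0, -20.0, -10.0, 0.0):
    print(level, compressed_level_db(level, threshold_db=-20.0, ratio=4.0))
```

Attack and release then determine how quickly the gain moves toward and away from this curve once the signal crosses the threshold.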

Audio output

Audio input

Audio output

Pitch Extraction

F0 pitch extraction is intended for use in audio conversion inference

Drop audio here

Audio input

Extracting pitch using the ONNX model can help improve speed

F0 ONNX Mode

Extraction method

Method used for pitch extraction

mangio-crepe-full crepe-full fcpe rmvpe harvest pyin

Input audio path

File

Image

Additional Settings

Customize additional features of the project

Language

The display language in the project (When changing the language, the system will automatically restart after 15 seconds to update)

Theme

Theme type displayed in the interface (When changing the theme, the system will automatically restart after 15 seconds to update)

Precision

Precision of inference and model training

Note: CPU does not support fp16

When converting with RefineGAN or MRF-HiFi-GAN, use fp32, as fp16 can cause them to produce strange output

fp16 fp32

Font

Interface font

Visit Google Fonts to choose your favorite font.

Model name

Name of the model during training (avoid special characters or spaces)

Please do not use the project for any unethical, illegal, or harmful purposes to individuals or organizations...

In cases where users do not comply with the terms or violate them, I will not be responsible for any claims, damages, or liabilities, whether in contract, negligence, or other causes arising from, outside of, or related to the software, its use, or other transactions associated with it.
