This is a free, fast, accurate voice cloning tool built around dots.tts, a 2B-parameter text-to-speech model for turning written text into generated speech from a reference voice.
Voice cloning usually splits between polished commercial APIs and lightweight open‑source models that trade speaker similarity for speed.
dots.tts delivers commercial‑grade cloning quality under an open‑source license. It uses no discrete audio tokens, pairing a 48 kHz AudioVAE with a language model backbone initialized from Qwen2.5‑1.5B‑Base.
Upload a reference audio clip and its transcript, type the text you want to synthesize, and the model generates speech that preserves the reference timbre across English, Chinese, and many other languages.
The self‑hosted solution supports continuation cloning, x‑vector‑only cloning, and a MeanFlow‑distilled fast mode that cuts inference steps without collapsing quality.
Features
- Generates 48 kHz mono audio from text with a reference audio sample as the voice source.
- Continuation cloning pairs reference audio with its exact transcript for the highest-fidelity voice match.
- X-vector-only cloning extracts timbre from a reference audio file without a transcript input.
- Three published checkpoints: base pretrain (dots.tts-base), self-corrective-aligned (dots.tts-soar), and MeanFlow distilled NFE=4 (dots.tts-mf).
- dots.tts-mf reaches first-packet audio latency of 85 milliseconds in streaming mode and 54 milliseconds in dual-streaming mode.
- Accepts a local model directory path or a Hugging Face repo ID as the model argument.
- 24 languages: Chinese, Cantonese, English, French, German, Spanish, Japanese, Korean, Portuguese, Russian, Italian, Polish, Dutch, Greek, Czech, Finnish, Romanian, Indonesian, Hindi, Arabic, Ukrainian, Turkish, Thai, and Vietnamese.
- Gradio web demo ships in the repository for local GPU server self-hosting.
- Python API through DotsTtsRuntime.from_pretrained() for pipeline integration.
- Fine-tuning from any released checkpoint via accelerate launch with a JSONL audio manifest.
- MeanFlow distillation entry point for training a faster student model from a flow-matching teacher.
- 1T1A interleaved sequence mode alternates one BPE text token with one audio step for low-latency duplex dialogue systems.
- Fully continuous pipeline with no discrete audio tokens; produces 48 kHz mono WAV output.
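The 1T1A mode listed above alternates one text token with one audio step. A toy sketch of that ordering follows. It is not code from the dots.tts repository: the function name, the tagged-tuple representation, and the handling of leftover items when the two streams differ in length are all assumptions made for illustration.

```python
def interleave_1t1a(text_tokens, audio_steps):
    """Alternate one text token with one audio step (toy illustration only).

    Assumption: when one stream is longer, its leftover items are appended
    in order. The real model may handle this differently.
    """
    sequence = []
    for i in range(max(len(text_tokens), len(audio_steps))):
        if i < len(text_tokens):
            sequence.append(("text", text_tokens[i]))
        if i < len(audio_steps):
            sequence.append(("audio", audio_steps[i]))
    return sequence


# Two text tokens and three audio steps: text and audio alternate,
# then the remaining audio step follows on its own.
print(interleave_1t1a(["he", "llo"], [0, 1, 2]))
```

Because audio for a token can be emitted right after that token arrives, an interleaved layout like this is what lets a duplex system start speaking before the full text is known.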
Use Cases
- Clone a consented speaker voice for narration drafts, product demos, or internal audio prototypes.
- Generate multilingual TTS samples from a short reference recording.
- Build a local voice cloning workflow with CLI commands and scripted Python generation.
How To Use It (web version)
1. Visit the dots.tts Voice Cloning Tool and upload a short reference audio clip from a voice you have permission to clone.
2. Add the transcript for the reference audio. The transcript should match the spoken words closely.
3. Enter the text you want the model to synthesize.
4. Generate the audio and listen for pronunciation, pacing, and speaker similarity.
5. Regenerate with a cleaner prompt clip if the output drifts, mispronounces words, or loses the target voice.
A clean reference clip matters. Use speech with minimal background noise, no music bed, and one speaker. The transcript matters just as much as the audio because continuation voice cloning pairs --prompt-audio with --prompt-text.
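A quick pre-flight check can catch an unusable clip before you spend GPU time on it. The sketch below uses only Python's standard wave module, so it reads uncompressed PCM WAV files only. The helper names and the 3 to 30 second duration window are illustrative assumptions, not limits documented by dots.tts.

```python
import wave


def describe_wav(path):
    """Return basic properties of a PCM WAV file using the standard library."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_rate": rate,
            "duration_s": wav.getnframes() / rate,
        }


def check_reference(path, transcript, min_s=3.0, max_s=30.0):
    """Collect warnings about a reference clip and its transcript.

    min_s and max_s are hypothetical thresholds chosen for this example.
    """
    info = describe_wav(path)
    warnings = []
    if not (min_s <= info["duration_s"] <= max_s):
        warnings.append(
            f"clip is {info['duration_s']:.1f}s, outside the {min_s}-{max_s}s range"
        )
    if not transcript.strip():
        warnings.append("transcript is empty; continuation cloning needs the spoken words")
    return warnings
```

An empty list means the clip passed these basic checks. Noise, music, and overlapping speakers still need a listen, since nothing here inspects the audio content itself.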
Self-Hosted Setup
Create a conda environment with Python 3.10, 3.11, or 3.12:
conda create -n dots_tts python=3.10 -y
conda activate dots_tts

Install the package from source:
python -m pip install --upgrade pip
python -m pip install -e . -c constraints/recommended.txt

Download a checkpoint from Hugging Face. Use dots.tts-soar as the default starting point for voice cloning quality, or dots.tts-mf for streaming latency:
huggingface-cli download rednote-hilab/dots.tts-soar \
    --local-dir pretrained_models/dots.tts-soar
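If you prefer to script the download, the huggingface_hub Python package provides snapshot_download, which accepts a repo ID and a local_dir. The sketch below mirrors the directory layout used in this guide. The two helper names are hypothetical, and the import is deferred so the path helper works even when huggingface_hub is not installed.

```python
from pathlib import Path


def local_dir_for(repo_id, root="pretrained_models"):
    """Map 'rednote-hilab/dots.tts-soar' to 'pretrained_models/dots.tts-soar'."""
    return Path(root) / repo_id.split("/")[-1]


def download_checkpoint(repo_id, root="pretrained_models"):
    """Download every file in the repo into the matching local directory."""
    # Deferred import: only needed when a download is actually requested.
    from huggingface_hub import snapshot_download

    target = local_dir_for(repo_id, root)
    snapshot_download(repo_id=repo_id, local_dir=str(target))
    return target
```

Calling download_checkpoint("rednote-hilab/dots.tts-soar") should leave the files in the same folder the CLI examples point to.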
CLI Inference
Continuation cloning (reference audio plus transcript, recommended):
dots.tts \
    --model-name-or-path pretrained_models/dots.tts-soar \
    --text "Your synthesized text here." \
    --prompt-audio reference.wav \
    --prompt-text "The exact transcript of the reference audio." \
    --output output.wav

X-vector-only cloning (reference audio, no transcript):
dots.tts \
    --model-name-or-path pretrained_models/dots.tts-soar \
    --text "Your synthesized text here." \
    --prompt-audio reference.wav \
    --output output.wav

Multilingual inference with an explicit language tag:
dots.tts \
    --model-name-or-path pretrained_models/dots.tts-soar \
    --text "Your text in English." \
    --prompt-audio reference.wav \
    --prompt-text "Transcript of reference." \
    --language EN \
    --output output.wav

Use --language auto_detect to infer the language tag from the text, or pass a code such as EN, ZH, or Cantonese to override.
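For scripted batches, it can help to build these commands in Python instead of concatenating shell strings. The sketch below assembles an argument list from the flags shown in this section. The function names are hypothetical, and rejecting a transcript without its audio is an assumption based on how continuation cloning pairs the two inputs.

```python
import subprocess


def build_dots_tts_command(model, text, output, prompt_audio=None,
                           prompt_text=None, language=None):
    """Assemble an argument list for the dots.tts CLI."""
    if prompt_text and not prompt_audio:
        # Assumption: a transcript has nothing to pair with if no audio is given.
        raise ValueError("prompt_text requires prompt_audio")
    cmd = ["dots.tts", "--model-name-or-path", model, "--text", text]
    if prompt_audio:
        cmd += ["--prompt-audio", prompt_audio]
    if prompt_text:
        cmd += ["--prompt-text", prompt_text]
    if language:
        cmd += ["--language", language]
    cmd += ["--output", output]
    return cmd


def run_dots_tts(cmd):
    """Run the command and raise CalledProcessError on a non-zero exit code."""
    return subprocess.run(cmd, check=True)
```

Passing a list to subprocess.run avoids shell quoting problems when the text or transcript contains quotes or other special characters. Leaving prompt_text unset produces the x-vector-only form of the command.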
Local Gradio Server
python apps/gradio/app.py \
    --model-name-or-path pretrained_models/dots.tts-soar \
    --optimize

The server binds to 0.0.0.0 on port 7860; open http://localhost:7860 in a browser on the same machine. The --optimize flag runs torch.compile warmup at startup for faster steady-state inference. The checkpoint, execution mode, precision, and max generation length are fixed at server start, so changing any of them requires a restart.
Python API
from dots_tts.runtime import DotsTtsRuntime
import soundfile as sf

# Load the checkpoint once; optimize=True triggers a torch.compile warmup.
runtime = DotsTtsRuntime.from_pretrained(
    "rednote-hilab/dots.tts-soar",
    precision="bfloat16",
    optimize=True,
)

# Continuation cloning: reference audio plus its exact transcript.
result = runtime.generate(
    text="Your synthesized text here.",
    prompt_audio_path="reference.wav",
    prompt_text="The exact transcript of the reference audio.",
    num_steps=10,
    guidance_scale=1.0,
)

# Move the waveform to the CPU as a float NumPy array, then write it as WAV.
audio = result["audio"].float().cpu().squeeze().numpy()
sf.write("output.wav", audio, result["sample_rate"])

Pass a Hugging Face repo ID or a local directory path to from_pretrained. The optimize=True flag enables torch.compile acceleration with a one-time warmup penalty on first load.
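Because the warmup cost is paid once per load, it usually makes sense to load the runtime a single time and reuse it for many lines of text. The sketch below wraps the generate call shown above in a loop. The function name synthesize_many is hypothetical, and skipping blank entries is a choice made for this example. It accepts any object with a compatible generate method.

```python
def synthesize_many(runtime, texts, prompt_audio_path, prompt_text,
                    num_steps=10, guidance_scale=1.0):
    """Call runtime.generate once per non-blank text, reusing one loaded runtime."""
    results = []
    for text in texts:
        text = text.strip()
        if not text:
            continue  # skip blank entries instead of sending empty text
        results.append(
            runtime.generate(
                text=text,
                prompt_audio_path=prompt_audio_path,
                prompt_text=prompt_text,
                num_steps=num_steps,
                guidance_scale=guidance_scale,
            )
        )
    return results
```

Each returned item can then be written to its own WAV file with the same sf.write call used for a single result.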
CLI Flags
| Flag | Description | Default |
|---|---|---|
| --model-name-or-path | Local model directory or Hugging Face repo ID | Required |
| --text | Text to synthesize | Required |
| --output | Output WAV file path | Required |
| --prompt-audio | Reference audio file for voice cloning | None (falls back to random voice) |
| --prompt-text | Exact transcript of the reference audio | None (uses x-vector only) |
| --num-steps | Flow-matching sampling steps; higher improves quality at the cost of speed | 10 |
| --guidance-scale | CFG scale for flow-matching; values above 2 amplify audio energy | 1.0 |
| --normalize-text | Apply WeTextProcessing text normalization before inference | Off |
| --language | Language tag: none, auto_detect, EN, ZH, Cantonese, or language names | none |
| --seed | RNG seed for deterministic output | 42 |
Checkpoint Summary
| Checkpoint | Hugging Face Repo | Characteristic |
|---|---|---|
| Base (Pretrain) | rednote-hilab/dots.tts-base | Best average WER on Seed-TTS-Eval (2.92); clean pretrained base for fine-tuning |
| SCA | rednote-hilab/dots.tts-soar | Highest average speaker similarity; leads 19/24 languages on MiniMax multilingual |
| MeanFlow (NFE=4) | rednote-hilab/dots.tts-mf | 85 ms first-packet latency in streaming mode; near-identical benchmark scores to SCA |
Fine-Tuning
Prepare a JSONL manifest with one JSON object per audio file. Each object needs at least three fields:
{"fid": "sample-0001", "audio": "/abs/path/to/audio.wav", "text": "hello world"}

Launch fine-tuning:
accelerate launch scripts/train_dots_tts.py --config configs/dots_tts.yaml

Edit the config to replace train.pretrained_model_path, train_data.sources, val_data.sources, train.output_dir, and train.max_train_steps with your own values.
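A manifest in this shape is easy to generate from a list of audio paths and transcripts. The sketch below writes one JSON object per line with the fid, audio, and text fields shown above. Deriving fid from the file name is an assumption made for this example; make sure your IDs are unique if two files share a name.

```python
import json
from pathlib import Path


def write_manifest(samples, manifest_path):
    """Write (audio_path, transcript) pairs as a JSONL manifest.

    Returns the number of records written.
    """
    count = 0
    with open(manifest_path, "w", encoding="utf-8") as handle:
        for audio_path, text in samples:
            audio = Path(audio_path).resolve()  # absolute path, as in the example
            record = {
                "fid": audio.stem,  # assumption: file name without extension as ID
                "audio": str(audio),
                "text": text.strip(),
            }
            # ensure_ascii=False keeps non-Latin transcripts readable in the file.
            handle.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count
```

For example, write_manifest([("clips/sample-0001.wav", "hello world")], "train.jsonl") produces a one-line file that you can list under train_data.sources in the config.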
MeanFlow Distillation Key Flags
| Flag | Description | Default |
|---|---|---|
| --teacher-model-path | Frozen flow-matching teacher model directory | train.pretrained_model_path |
| --teacher-steps | Teacher rollout steps for distillation target; higher is slower and produces stronger targets | 8 |
| --teacher-solver | ODE solver: euler, midpoint, or rk4 | euler |
| --cfg-distill-mode | fused distills guided teacher target into student; natural trains on conditional/unconditional masks without CFG fusion | fused |
| --distill-cfg-scale | CFG coefficient for fused distillation mode | 1.2 |
| --anchor-prob | Probability of zero-duration anchor sample in MeanFlow training | 0.5 |
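To build intuition for the --teacher-steps and --teacher-solver trade-off, the toy sketch below integrates a simple ODE with the three solver types named in the table and compares their errors. It is not the repository's distillation code: the test equation dx/dt = -x and all function names are stand-ins chosen for illustration.

```python
import math


def integrate(f, x0, steps, solver="euler"):
    """Integrate dx/dt = f(t, x) from t=0 to t=1 with a fixed step count."""
    h = 1.0 / steps
    t, x = 0.0, x0
    for _ in range(steps):
        if solver == "euler":        # 1 function evaluation per step
            x = x + h * f(t, x)
        elif solver == "midpoint":   # 2 function evaluations per step
            x = x + h * f(t + h / 2, x + h / 2 * f(t, x))
        elif solver == "rk4":        # 4 function evaluations per step
            k1 = f(t, x)
            k2 = f(t + h / 2, x + h / 2 * k1)
            k3 = f(t + h / 2, x + h / 2 * k2)
            k4 = f(t + h, x + h * k3)
            x = x + h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
        else:
            raise ValueError(f"unknown solver: {solver}")
        t += h
    return x


def toy_error(steps, solver):
    """Absolute error at t=1 for dx/dt = -x, x(0) = 1 (exact answer: exp(-1))."""
    approx = integrate(lambda t, x: -x, 1.0, steps, solver)
    return abs(approx - math.exp(-1.0))


for name in ("euler", "midpoint", "rk4"):
    print(name, toy_error(8, name))
```

On this toy problem, more steps or a higher-order solver gives a more accurate trajectory at the cost of more function evaluations. In distillation, each evaluation is a teacher forward pass, which is the same cost-versus-target-quality trade-off the flags expose.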
Alternatives and Related Resources
- 7 Best Free AI Voice Cloning Tools: A comparison of free voice cloning tools for narration, dubbing, and developer pipelines.
- MOSS-TTS-Nan: A 0.1B-parameter TTS model that runs on CPU; compare when GPU access is unavailable.
- Voicebox: A free, open-source desktop app for local voice cloning on macOS and Windows.
- dots.tts Technical Report (arXiv): Architecture, training methodology, and full benchmark tables.
- dots.tts Audio Demo Page: Sample audio comparisons across benchmark systems.
Pros
- Free web version with no signup.
- Free for commercial use.
- Top average speaker similarity on Seed-TTS-Eval among open-source models.
- Three checkpoints target quality, expressiveness, and streaming latency separately.
- 85 ms first-packet latency on the MeanFlow checkpoint.
- Fine-tuning and MeanFlow distillation scripts in the repo.
- Fully continuous 48 kHz pipeline with no discrete token bottleneck.
Cons
- GPU required for all local inference.
- Word error rates significantly higher on Arabic, Hindi, Turkish, and Vietnamese.
- No singing voice generation.
FAQs
Q: Is dots.tts free to use?
A: The code, all three model checkpoints, and the Hugging Face Space demo are free. The Apache-2.0 license permits commercial use. Local GPU compute is the only cost for self-hosted deployment.
Q: What hardware does dots.tts require for local inference?
A: A CUDA-capable GPU is required. The package does not run on CPU-only hardware. Using bfloat16 precision reduces VRAM requirements compared with the default float32.
Q: How many languages does dots.tts support?
A: dots.tts currently supports 24 languages: Chinese, Cantonese, English, French, German, Spanish, Japanese, Korean, Portuguese, Russian, Italian, Polish, Dutch, Greek, Czech, Finnish, Romanian, Indonesian, Hindi, Arabic, Ukrainian, Turkish, Thai, and Vietnamese.
Q: Which checkpoint should I use?
A: Use rednote-hilab/dots.tts-soar (SCA) as the default for voice cloning quality; it posts the highest average speaker similarity across benchmarks. Use rednote-hilab/dots.tts-mf (MeanFlow, NFE=4) for streaming pipelines where 85 ms first-packet latency matters more than maximum audio quality. Use rednote-hilab/dots.tts-base as a starting point for fine-tuning on custom voice data.
Q: Can I fine-tune dots.tts on my own voice samples?
A: Yes. The repository includes a fine-tuning entry point via accelerate launch scripts/train_dots_tts.py. You provide WAV audio files and a JSONL manifest linking each file to its transcript.










