dots.tts: Free Open-Source Voice Cloning With 24-Language Support

This is a free, fast, accurate voice cloning tool built around dots.tts, a 2B-parameter text-to-speech model for turning written text into generated speech from a reference voice.

Voice cloning usually splits between polished commercial APIs and lightweight open‑source models that trade speaker similarity for speed.

dots.tts targets that gap: commercial-grade cloning quality under an open-source license. It runs without discrete tokens, using a 48 kHz AudioVAE and a language model backbone initialized from Qwen2.5-1.5B-Base.

Upload a reference audio clip and its transcript, type the text you want to synthesize, and the model generates speech that preserves the reference timbre across English, Chinese, and many other languages.

The self‑hosted solution supports continuation cloning, x‑vector‑only cloning, and a MeanFlow‑distilled fast mode that cuts inference steps without collapsing quality.

Visit dots.tts Voice Clone Tool

dots.tts GitHub

Features

Generates 48 kHz mono audio from text with a reference audio sample as the voice source.
Continuation cloning pairs reference audio with its exact transcript for the highest-fidelity voice match.
X-vector-only cloning extracts timbre from a reference audio file without a transcript input.
Three published checkpoints: base pretrain (dots.tts-base), self-corrective-aligned (dots.tts-soar), and MeanFlow distilled NFE=4 (dots.tts-mf).
dots.tts-mf reaches first-packet audio latency of 85 milliseconds in streaming mode and 54 milliseconds in dual-streaming mode.
Accepts a local model directory path or a Hugging Face repo ID as the model argument.
24 languages: Chinese, Cantonese, English, French, German, Spanish, Japanese, Korean, Portuguese, Russian, Italian, Polish, Dutch, Greek, Czech, Finnish, Romanian, Indonesian, Hindi, Arabic, Ukrainian, Turkish, Thai, and Vietnamese.
Gradio web demo ships in the repository for local GPU server self-hosting.
Python API through DotsTtsRuntime.from_pretrained() for pipeline integration.
Fine-tuning from any released checkpoint via accelerate launch with a JSONL audio manifest.
MeanFlow distillation entry point for training a faster student model from a flow-matching teacher.
1T1A interleaved sequence mode alternates one BPE text token with one audio step for low-latency duplex dialogue systems.
Fully continuous pipeline with no discrete audio tokens; produces 48 kHz mono WAV output.
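
The 1T1A layout described above can be pictured with a toy example. This is only an illustration of the alternation pattern, not code from the dots.tts repository: the function name, the tuple tags, and the choice to append leftover audio steps once the text runs out are all assumptions.

```python
from itertools import zip_longest

_MISSING = object()  # sentinel so that real tokens are never mistaken for padding


def interleave_1t1a(text_tokens, audio_steps):
    """Alternate one text token with one audio step (toy 1T1A layout).

    Assumption: when one stream is longer, its remaining items are
    appended on their own. The real model may handle this differently.
    """
    sequence = []
    for token, step in zip_longest(text_tokens, audio_steps, fillvalue=_MISSING):
        if token is not _MISSING:
            sequence.append(("text", token))
        if step is not _MISSING:
            sequence.append(("audio", step))
    return sequence


tokens = ["Hel", "lo", "!"]             # stand-ins for BPE text tokens
frames = ["a0", "a1", "a2", "a3", "a4"]  # stand-ins for audio steps
layout = interleave_1t1a(tokens, frames)
print(layout)
```

Because each text token is followed immediately by an audio step, audio can begin before the full text is available, which is the property that matters for duplex dialogue.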

Example Output

Reference Audio

dots.tts is a 2B-parameter fully continuous, end-to-end autoregressive (AR) text-to-speech system. The backbone pairs a semantic encoder, an LLM, and an autoregressive flow-matching acoustic head over a 48 kHz AudioVAE, with no discrete tokens anywhere in the pipeline.

Use Cases

Clone a consented speaker voice for narration drafts, product demos, or internal audio prototypes.
Generate multilingual TTS samples from a short reference recording.
Build a local voice cloning workflow with CLI commands and scripted Python generation.

How To Use It (web version)

1. Visit the dots.tts Voice Cloning Tool and upload a short reference audio clip from a voice you have permission to clone.

2. Add the transcript for the reference audio. The transcript should match the spoken words closely.

3. Enter the text you want the model to synthesize.

4. Generate the audio and listen for pronunciation, pacing, and speaker similarity.

5. Regenerate with a cleaner prompt clip if the output drifts, mispronounces words, or loses the target voice.

A clean reference clip matters. Use speech with minimal background noise, no music bed, and one speaker. The transcript matters just as much as the audio because continuation voice cloning pairs --prompt-audio with --prompt-text.
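
Before uploading a clip, a quick sanity check of its basic properties can catch obvious problems. The sketch below uses only Python's standard `wave` module. The thresholds (3 to 30 seconds, 16 kHz minimum) and both function names are illustrative assumptions, not limits published for dots.tts, and the check cannot detect noise, music, or multiple speakers.

```python
import os
import tempfile
import wave


def describe_wav(path):
    """Return channel count, sample rate, and duration of a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        channels = wav.getnchannels()
        sample_rate = wav.getframerate()
        duration = wav.getnframes() / sample_rate
    return {"channels": channels, "sample_rate": sample_rate, "duration_s": duration}


def reference_warnings(info, min_seconds=3.0, max_seconds=30.0, min_rate=16000):
    """Collect warnings; every threshold here is a hypothetical default."""
    warnings = []
    if info["duration_s"] < min_seconds:
        warnings.append(f"clip is shorter than {min_seconds} s")
    if info["duration_s"] > max_seconds:
        warnings.append(f"clip is longer than {max_seconds} s")
    if info["sample_rate"] < min_rate:
        warnings.append(f"sample rate is below {min_rate} Hz")
    return warnings


# Demo on a generated one-second silent clip so the sketch runs anywhere.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.wav")
with wave.open(demo_path, "wb") as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)       # 16-bit PCM
    wav.setframerate(48000)
    wav.writeframes(b"\x00\x00" * 48000)

info = describe_wav(demo_path)
warnings_found = reference_warnings(info)
print(info, warnings_found)
```

Point `describe_wav` at your own reference file in place of the generated demo clip.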

Self-Hosted Setup

Create a conda environment with Python 3.10, 3.11, or 3.12:

conda create -n dots_tts python=3.10 -y
conda activate dots_tts

Install the package from source. Run these commands from the root of your local clone of the repository:

python -m pip install --upgrade pip
python -m pip install -e . -c constraints/recommended.txt

Download a checkpoint from Hugging Face. Use dots.tts-soar as the default starting point for voice cloning quality, or dots.tts-mf for streaming latency:

huggingface-cli download rednote-hilab/dots.tts-soar \
  --local-dir pretrained_models/dots.tts-soar

CLI Inference

Continuation cloning (reference audio plus transcript, recommended):

dots.tts \
  --model-name-or-path pretrained_models/dots.tts-soar \
  --text "Your synthesized text here." \
  --prompt-audio reference.wav \
  --prompt-text "The exact transcript of the reference audio." \
  --output output.wav

X-vector-only cloning (reference audio, no transcript):

dots.tts \
  --model-name-or-path pretrained_models/dots.tts-soar \
  --text "Your synthesized text here." \
  --prompt-audio reference.wav \
  --output output.wav

Multilingual inference with an explicit language tag:

dots.tts \
  --model-name-or-path pretrained_models/dots.tts-soar \
  --text "Your text in English." \
  --prompt-audio reference.wav \
  --prompt-text "Transcript of reference." \
  --language EN \
  --output output.wav

Use --language auto_detect to infer the language tag from the text, or pass an explicit tag such as EN, ZH, or Cantonese to override it.
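
For scripted generation, the three CLI variants above can be assembled from one helper. This is a minimal sketch that only builds the argument list from the flags shown in this article and does not run the model. The function name is hypothetical, and the rule that --prompt-text needs --prompt-audio is an assumption drawn from how continuation cloning is described here.

```python
import shlex


def build_dots_tts_command(model, text, output,
                           prompt_audio=None, prompt_text=None, language=None):
    """Build a dots.tts argument list using only flags documented above."""
    command = [
        "dots.tts",
        "--model-name-or-path", model,
        "--text", text,
        "--output", output,
    ]
    if prompt_text and not prompt_audio:
        # Assumption: a transcript is only meaningful alongside its audio.
        raise ValueError("prompt_text requires prompt_audio")
    if prompt_audio:
        command += ["--prompt-audio", prompt_audio]
    if prompt_text:
        command += ["--prompt-text", prompt_text]
    if language:
        command += ["--language", language]
    return command


command = build_dots_tts_command(
    model="pretrained_models/dots.tts-soar",
    text="Your synthesized text here.",
    output="output.wav",
    prompt_audio="reference.wav",
    prompt_text="The exact transcript of the reference audio.",
    language="auto_detect",
)
print(shlex.join(command))  # shell-quoted form, handy for logging
```

To execute the command, pass the list to `subprocess.run(command, check=True)` in an environment where the dots.tts CLI is installed.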

Local Gradio Server

python apps/gradio/app.py \
  --model-name-or-path pretrained_models/dots.tts-soar \
  --optimize

The server starts at http://0.0.0.0:7860, which means it listens on all network interfaces on port 7860; on the same machine, open http://localhost:7860 in a browser. The --optimize flag runs torch.compile warmup at startup for faster steady-state inference. The checkpoint, execution mode, precision, and max generation length are locked at server start, so changing any of them requires a restart.

Python API

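from dots_tts.runtime import DotsTtsRuntime
import soundfile as sf

# Load the checkpoint from a Hugging Face repo ID (a local directory path also works).
runtime = DotsTtsRuntime.from_pretrained(
    "rednote-hilab/dots.tts-soar",
    precision="bfloat16",
    optimize=True,
)

# Continuation cloning: reference audio plus its exact transcript.
result = runtime.generate(
    text="Your synthesized text here.",
    prompt_audio_path="reference.wav",
    prompt_text="The exact transcript of the reference audio.",
    num_steps=10,
    guidance_scale=1.0,
)

# Convert the generated tensor to float32 on the CPU and write it at the returned sample rate.
sf.write("output.wav", result["audio"].float().cpu().squeeze().numpy(), result["sample_rate"])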

Pass a Hugging Face repo ID or a local directory path to from_pretrained. The optimize=True flag enables torch.compile acceleration with a one-time warmup penalty on first load.
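
Since the warmup cost is paid once, it makes sense to load the runtime a single time and reuse it for many lines of text. The loop below is a sketch under the assumption that `generate` accepts the same keyword arguments as in the snippet above. `synthesize_lines` and `EchoRuntime` are hypothetical names; the stub exists only so the loop can be exercised without a GPU or checkpoint, and its `text` key is not part of the real result.

```python
def synthesize_lines(runtime, lines, prompt_audio_path, prompt_text,
                     num_steps=10, guidance_scale=1.0):
    """Call runtime.generate once per non-empty line, reusing one loaded runtime."""
    results = []
    for line in lines:
        text = line.strip()
        if not text:
            continue  # skip blank lines in a script or subtitle file
        results.append(
            runtime.generate(
                text=text,
                prompt_audio_path=prompt_audio_path,
                prompt_text=prompt_text,
                num_steps=num_steps,
                guidance_scale=guidance_scale,
            )
        )
    return results


class EchoRuntime:
    """Stand-in for DotsTtsRuntime so the loop can run without a model."""

    def generate(self, **kwargs):
        return {"audio": None, "sample_rate": 48000, "text": kwargs["text"]}


demo = synthesize_lines(
    EchoRuntime(),
    ["First line.", "", "Second line."],
    prompt_audio_path="reference.wav",
    prompt_text="The exact transcript of the reference audio.",
)
print(len(demo))
```

With a real runtime, each returned item can be written to disk with the same `sf.write` call used above.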

CLI Flags

Flag	Description	Default
`--model-name-or-path`	Local model directory or Hugging Face repo ID	Required
`--text`	Text to synthesize	Required
`--output`	Output WAV file path	Required
`--prompt-audio`	Reference audio file for voice cloning	None (falls back to random voice)
`--prompt-text`	Exact transcript of the reference audio	None (uses x-vector only)
`--num-steps`	Flow-matching sampling steps; higher improves quality at the cost of speed	`10`
`--guidance-scale`	CFG scale for flow-matching; values above 2 amplify audio energy	`1.0`
`--normalize-text`	Apply WeTextProcessing text normalization before inference	Off
`--language`	Language tag: `none`, `auto_detect`, `EN`, `ZH`, `Cantonese`, or language names	`none`
`--seed`	RNG seed for deterministic output	`42`

Checkpoint Summary

Checkpoint	Hugging Face Repo	Characteristic
Base (Pretrain)	`rednote-hilab/dots.tts-base`	Best average WER on Seed-TTS-Eval (2.92); clean pretrained base for fine-tuning
SCA	`rednote-hilab/dots.tts-soar`	Highest average speaker similarity; leads 19/24 languages on MiniMax multilingual
MeanFlow (NFE=4)	`rednote-hilab/dots.tts-mf`	85 ms first-packet latency in streaming mode; near-identical benchmark scores to SCA

Fine-Tuning

Prepare a JSONL manifest with one JSON object per audio file. Each object needs at least three fields:

{"fid": "sample-0001", "audio": "/abs/path/to/audio.wav", "text": "hello world"}

Launch fine-tuning:

accelerate launch scripts/train_dots_tts.py --config configs/dots_tts.yaml

Edit the config to replace train.pretrained_model_path, train_data.sources, val_data.sources, train.output_dir, and train.max_train_steps with your own values.

MeanFlow Distillation Key Flags

Flag	Description	Default
`--teacher-model-path`	Frozen flow-matching teacher model directory	`train.pretrained_model_path`
`--teacher-steps`	Teacher rollout steps for distillation target; higher is slower and produces stronger targets	`8`
`--teacher-solver`	ODE solver: `euler`, `midpoint`, or `rk4`	`euler`
`--cfg-distill-mode`	`fused` distills guided teacher target into student; `natural` trains on conditional/unconditional masks without CFG fusion	`fused`
`--distill-cfg-scale`	CFG coefficient for fused distillation mode	`1.2`
`--anchor-prob`	Probability of zero-duration anchor sample in MeanFlow training	`0.5`
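
The three --teacher-solver options name standard numerical ODE methods. The sketch below shows the textbook update rules on a toy problem (dx/dt = -x) to illustrate the accuracy and cost trade-off. It is generic numerical code, not the repository's implementation, and every name in it is illustrative.

```python
import math


def euler_step(f, t, x, h):
    """First-order step: one function evaluation."""
    return x + h * f(t, x)


def midpoint_step(f, t, x, h):
    """Second-order step: two function evaluations."""
    k1 = f(t, x)
    return x + h * f(t + h / 2, x + h / 2 * k1)


def rk4_step(f, t, x, h):
    """Classic fourth-order Runge-Kutta step: four function evaluations."""
    k1 = f(t, x)
    k2 = f(t + h / 2, x + h / 2 * k1)
    k3 = f(t + h / 2, x + h / 2 * k2)
    k4 = f(t + h, x + h * k3)
    return x + h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)


SOLVERS = {"euler": euler_step, "midpoint": midpoint_step, "rk4": rk4_step}


def integrate(f, x0, steps, solver, t0=0.0, t1=1.0):
    """Integrate dx/dt = f(t, x) from t0 to t1 with a fixed number of steps."""
    h = (t1 - t0) / steps
    t, x = t0, x0
    for _ in range(steps):
        x = SOLVERS[solver](f, t, x, h)
        t += h
    return x


def decay(t, x):
    return -x  # exact solution at t=1 is exp(-1)


errors = {
    name: abs(integrate(decay, 1.0, 8, name) - math.exp(-1.0))
    for name in SOLVERS
}
for name, error in errors.items():
    print(f"{name}: {error:.2e}")
```

At the same step count, the higher-order solvers are more accurate but call the function two or four times per step. In distillation, each such call would be a teacher forward pass, so the solver choice scales rollout cost alongside --teacher-steps.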

Alternatives and Related Resources

7 Best Free AI Voice Cloning Tools: A comparison of free voice cloning tools for narration, dubbing, and developer pipelines.
MOSS-TTS-Nano: A 0.1B-parameter TTS model that runs on CPU; worth comparing when GPU access is unavailable.
Voicebox: A free, open-source desktop app for local voice cloning on macOS and Windows.
dots.tts Technical Report (arXiv): Architecture, training methodology, and full benchmark tables.
dots.tts Audio Demo Page: Sample audio comparisons across benchmark systems.

Pros

Free web version with no signup.
Free for commercial use.
Top average speaker similarity on Seed-TTS-Eval among open-source models.
Three checkpoints separately target a clean fine-tuning base, speaker similarity, and streaming latency.
85 ms first-packet latency on the MeanFlow checkpoint.
Fine-tuning and MeanFlow distillation scripts in the repo.
Fully continuous 48 kHz pipeline with no discrete token bottleneck.

Cons

GPU required for all local inference.
Word error rates are significantly higher on Arabic, Hindi, Turkish, and Vietnamese than on the other supported languages.
No singing voice generation.

FAQs

Q: Is dots.tts free to use?
A: The code, all three model checkpoints, and the Hugging Face Space demo are free. The Apache-2.0 license permits commercial use. Local GPU compute is the only cost for self-hosted deployment.

Q: What hardware does dots.tts require for local inference?
A: A CUDA-capable GPU is required. The package does not run on CPU-only hardware. Using bfloat16 precision reduces VRAM requirements compared with the default float32.

Q: How many languages does dots.tts support?
A: dots.tts currently supports 24 languages: Chinese, Cantonese, English, French, German, Spanish, Japanese, Korean, Portuguese, Russian, Italian, Polish, Dutch, Greek, Czech, Finnish, Romanian, Indonesian, Hindi, Arabic, Ukrainian, Turkish, Thai, and Vietnamese.

Q: Which checkpoint should I use?
A: Use rednote-hilab/dots.tts-soar (SCA) as the default for voice cloning quality; it posts the highest average speaker similarity across benchmarks. Use rednote-hilab/dots.tts-mf (MeanFlow, NFE=4) for streaming pipelines where 85 ms first-packet latency matters more than maximum audio quality. Use rednote-hilab/dots.tts-base as a starting point for fine-tuning on custom voice data.

Q: Can I fine-tune dots.tts on my own voice samples?
A: Yes. The repository includes a fine-tuning entry point via accelerate launch scripts/train_dots_tts.py. You provide WAV audio files and a JSONL manifest linking each file to its transcript.