TTS Playground (Voice Cloning)

The TTS Playground is the dedicated testing interface for speech synthesis models (such as Qwen3-TTS) in Herdsman. This guide focuses on the Voice Clone feature: you provide a reference audio clip, and the model speaks new text in the cloned voice.

Launching a TTS Model

  1. Open the Models page.

  2. Locate a model tagged TTS or Speech Synthesis (e.g., Qwen3-TTS Voice Clone).

  3. Click Launch Now on the model card.

  4. Wait for the status to switch to Ready, then click the card to open the Playground.

    [Screenshot: Launch TTS model]

Interface Overview

The TTS Playground is divided into a parameter panel (left), a script workspace (center), and a runtime monitor (right).

[Screenshot: TTS Playground interface]

Parameters (left sidebar)

  • Basic settings:
    • Language — dropdown (e.g., Auto) selecting the language of the synthesis text.
    • Speed — slider (default 1). Drag right to speed up, left to slow down.
  • Advanced settings:
    • Output mode:
      • Non-streaming (default) — waits until the full audio is generated, then plays it. Best for short texts.
      • Streaming / Realtime — plays as audio is generated. Lower latency (subject to model support).
  • Voice (Clone core):
    • Reference Audio — the voice you want to imitate. Accepts a local file path, a URL, or Base64-encoded data.
    • Reference Text — the exact transcript of the reference audio. Providing an accurate transcript significantly improves cloning accuracy.
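
Of the three forms Reference Audio accepts, Base64 is the one that usually needs a preparation step. The sketch below shows one way to turn a local clip into Base64 text using only the Python standard library. The function name encode_reference_audio is hypothetical and not part of Herdsman, and whether the field expects raw Base64 or a prefixed form (such as a data URI) is not stated in this guide, so treat the output as a starting point.

```python
import base64
from pathlib import Path


def encode_reference_audio(path: str) -> str:
    """Return the contents of a local audio file as Base64 text.

    Hypothetical helper for illustration; it is not a Herdsman API.
    """
    raw_bytes = Path(path).read_bytes()
    # b64encode returns bytes; decode to str so it can be pasted into a text field.
    return base64.b64encode(raw_bytes).decode("ascii")
```

Decoding the result with base64.b64decode should give back the original file bytes, which is a quick way to confirm the clip was encoded intact.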

Script Workspace (center)

  • Text input:
    • Enter the text you want the model to speak in the top text box.
    • Example: "Hello, how is your day going? Welcome to the Herdsman voice playground."
    • Supports mixed Chinese-English text and punctuation-driven pauses.
  • Tags — the top-right corner provides Chinese, English, and Emotion tags to quickly switch input hints or adjust style.
  • Output area:
    • Caption / waveform — generated captions appear here.
    • Audio player — standard controls (play/pause, scrub, volume) at the bottom.
  • Action buttons:
    • Generate — start synthesis.
    • Stop — cancel the current job.

Runtime Monitor (right sidebar)

  • Runtime status:
    • TTFA / RTF — time-to-first-audio and real-time factor.
    • Latency / chunks — generation latency and chunk size.
    • Hardware usage — live CPU, GPU, and memory monitoring.
  • Run log:
    • Shows model load events (e.g., Selected qwen3-tts-voiceclone) and audio initialization details.
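
To make the TTFA and RTF readings easier to interpret, here is a minimal sketch of how these two metrics are commonly defined. It assumes the widespread convention that RTF is generation time divided by the duration of the audio produced, so values below 1 mean faster than real time. This guide does not say how the Playground computes its figures, and the function names are hypothetical.

```python
def real_time_factor(generation_seconds: float, audio_seconds: float) -> float:
    """RTF under the assumed convention: generation time / audio duration."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return generation_seconds / audio_seconds


def time_to_first_audio(request_time: float, first_audio_time: float) -> float:
    """TTFA: seconds between sending the request and receiving the first audio."""
    return first_audio_time - request_time


# Example: 2 seconds of compute for an 8-second clip.
rtf = real_time_factor(2.0, 8.0)  # 0.25, i.e. four times faster than real time
```

Under this convention, Streaming mode mainly improves TTFA, since playback can begin with the first chunk, while RTF reflects overall generation speed in either mode.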

Voice Cloning Workflow

  1. Prepare assets:
    • A clean reference audio clip (5–10 seconds is ideal).
    • The exact reference text spoken in the clip.
  2. Configure cloning:
    • Enter the audio path, URL, or Base64-encoded data in Reference Audio.
    • Enter the transcript in Reference Text.
  3. Write your script — type the text you want the model to read in the center workspace.
  4. Adjust parameters — change Speed if needed (e.g., 1.2 for slightly faster delivery).
  5. Generate and review:
    • Click Generate at the bottom left.
    • Use the audio player to listen. If the voice doesn't sound right, check that the reference audio is clean and verify that the reference text matches it exactly.
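
Before step 1, it can help to confirm that the reference clip falls in the suggested 5 to 10 second range. The sketch below does this for uncompressed PCM WAV files with Python's standard wave module. The helper names are hypothetical, and MP3 or other compressed formats would need a different library.

```python
import wave


def wav_duration_seconds(path: str) -> float:
    """Duration of a PCM WAV file: frame count divided by frames per second."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def is_ideal_reference_length(seconds: float,
                              low: float = 5.0,
                              high: float = 10.0) -> bool:
    """True when the clip length is within the range this guide suggests."""
    return low <= seconds <= high
```

A clip outside the range is not necessarily unusable; the range is the guide's recommendation, not a hard limit stated anywhere in this document.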

Notes:

  • Reference text matters. For voice cloning, the reference text must accurately match the reference audio — mismatches degrade cloning quality.
  • Audio format support. Reference audio is typically accepted in WAV and MP3; the exact formats depend on the model backend.
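
A file extension does not always match the file's real format, so a quick header check can catch a mislabeled clip before upload. The following is a heuristic sketch based on the standard WAV and MP3 file signatures, with a hypothetical function name. It does not reflect how any Herdsman backend validates audio, and the MP3 frame-sync test can also match other MPEG audio streams.

```python
def sniff_audio_format(header: bytes) -> str:
    """Guess 'wav', 'mp3', or 'unknown' from the first bytes of a file."""
    # WAV: RIFF container whose form type at bytes 8-11 is "WAVE".
    if len(header) >= 12 and header[:4] == b"RIFF" and header[8:12] == b"WAVE":
        return "wav"
    # MP3 with an ID3v2 tag at the start of the file.
    if header[:3] == b"ID3":
        return "mp3"
    # Bare MPEG audio frame: 11 sync bits set (0xFF, then top 3 bits of next byte).
    if len(header) >= 2 and header[0] == 0xFF and (header[1] & 0xE0) == 0xE0:
        return "mp3"
    return "unknown"
```

Passing the first 12 bytes of a file (for example, from open(path, "rb").read(12)) is enough for all three checks.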