Ultra-Fast Diffusion-Based Voice AI
Discover Echo-TTS—a breakthrough 2.4B parameter diffusion transformer developed at the University of Rochester. Generate 30-second audio in just 1.45 seconds on an A100 GPU with RTF < 0.05. Clone voices from up to 2 minutes of reference audio at pristine 44.1kHz quality.
Experience ultra-fast voice cloning with diffusion-based generation
Note: Enter your text, optionally upload a reference audio file (up to 2 minutes) to clone a specific voice, and generate high-quality speech at 44.1kHz. The model generates 30-second segments in approximately 1.45 seconds on A100 GPU. The demo runs on Hugging Face Spaces and may take a moment to initialize.
Echo-TTS represents a breakthrough in diffusion-based text-to-speech synthesis, combining ultra-fast generation with high-fidelity voice cloning capabilities.
Echo-TTS achieves unprecedented speed with a real-time factor (RTF) of less than 0.05 on A100 GPUs. This means it can generate 30 seconds of high-quality audio in just 1.45 seconds, making it significantly faster than comparable open-source models.
For comparison, Higgs Audio v2 takes ~12 seconds, and VibeVoice-7B requires ~55 seconds for the same task. This exceptional speed makes Echo-TTS ideal for real-time applications, live streaming, interactive voice assistants, and large-scale batch processing.
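The real-time factor is simply generation time divided by audio duration, so the figures above can be checked directly. A minimal sketch (the 30-second clip length and per-model times are the approximate values quoted above):

```python
# Real-time factor (RTF) = time to generate / duration of generated audio.
# Lower is better; RTF < 1 means faster than real time.
AUDIO_SECONDS = 30.0

models = {
    "Echo-TTS": 1.45,        # seconds to generate 30 s of audio (A100)
    "Higgs Audio v2": 12.0,  # approximate, from the comparison above
    "VibeVoice-7B": 55.0,    # approximate
}

for name, gen_time in models.items():
    rtf = gen_time / AUDIO_SECONDS
    print(f"{name}: RTF = {rtf:.3f}")
```

Echo-TTS works out to roughly 0.048, consistent with the RTF < 0.05 claim, while VibeVoice-7B lands well above real time at about 1.83.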
Echo-TTS supports sophisticated voice cloning by conditioning on up to 2 minutes of speaker reference audio. The model captures unique voice characteristics, timbre, prosody, and speaking style to generate highly authentic synthetic speech that closely matches the reference speaker.
The system outputs pristine 44.1kHz audio using the Fish Speech S1-DAC codec, ensuring professional-grade quality suitable for audiobook narration, content creation, accessibility tools, and personalized voice applications. Each generation can produce up to 30 seconds of continuous speech.
Discover what makes Echo-TTS a cutting-edge voice synthesis solution for 2025
Advanced 2.4B parameter Diffusion Transformer architecture with three main components: speaker reference transformer, text transformer, and diffusion decoder. Effective decoder compute equivalent to a ~1.4B-parameter transformer.
Generate 30 seconds of audio in just 1.45 seconds on A100 GPU with RTF < 0.05. Significantly outperforms comparable models like Higgs Audio v2 (~12s) and VibeVoice-7B (~55s).
Clone voices from up to 2 minutes of reference audio. The model captures unique voice characteristics, timbre, prosody, and speaking style for highly authentic synthetic speech.
Outputs pristine 44.1kHz audio using Fish Speech S1-DAC codec. Professional-grade quality suitable for audiobooks, content creation, accessibility tools, and commercial applications.
Supports multiple sampling methods including joint unconditional CFG, independent guidance (separate text/speaker scales), alternating guidance between modalities, and configurable steps (30-60).
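The three guidance modes can be illustrated with the standard classifier-free guidance formula. This is a generic sketch of how joint, independent, and alternating guidance typically combine model predictions; the function names and exact weighting are assumptions, not Echo-TTS's released formulation:

```python
def joint_cfg(eps_cond, eps_uncond, scale):
    """Joint CFG: a single guidance scale on the combined (text + speaker) condition."""
    return eps_uncond + scale * (eps_cond - eps_uncond)

def independent_cfg(eps_text, eps_spk, eps_uncond, text_scale, spk_scale):
    """Independent guidance: separate scales push toward the text-only
    and speaker-only conditioned predictions."""
    return (eps_uncond
            + text_scale * (eps_text - eps_uncond)
            + spk_scale * (eps_spk - eps_uncond))

def alternating_cfg(step, eps_text, eps_spk, eps_uncond, scale):
    """Alternating guidance: guide on one modality per sampling step."""
    eps_cond = eps_text if step % 2 == 0 else eps_spk
    return eps_uncond + scale * (eps_cond - eps_uncond)

# In a real sampler these combine latent-tensor predictions at each of the
# 30-60 denoising steps; scalars stand in for tensors here.
print(joint_cfg(2.0, 1.0, 3.0))  # 4.0: guidance extrapolates past the condition
```

With a scale of 1.0 each mode reduces to the plain conditioned prediction; larger scales trade diversity for stronger adherence to the text or the reference voice.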
Trained on ~160,000 hours of podcast-like audio using TPU v4-64 pod with JAX/Flax. Batch size of 768 across 800,000 training steps with Muon optimizer and BF16 computation.
Built on cutting-edge diffusion transformer technology optimized for speed and quality
Echo-TTS employs a sophisticated 2.4B parameter Diffusion Transformer (DiT) architecture with three main components: a speaker reference transformer, a text transformer, and a diffusion decoder. The effective decoder compute is equivalent to approximately 1.4B transformer parameters.
This architecture enables the model to efficiently process both text and speaker conditioning while maintaining ultra-fast generation speeds. The diffusion-based approach allows for high-quality audio synthesis with better control over prosody, pitch, and speaking style compared to traditional autoregressive models.
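The three-component wiring can be sketched at a purely conceptual level. Everything below is illustrative: the dimensions, the linear-map stand-ins for the transformer stacks, and the toy denoising update are assumptions, not the released configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions -- illustration only.
D = 512          # model width
T_TEXT = 64      # text tokens
T_REF = 256      # speaker-reference frames (up to 2 min of audio)
T_AUDIO = 128    # latent audio frames for one 30 s segment

def encode(x):
    """Stand-in for a transformer stack: here just a fixed linear map."""
    W = rng.standard_normal((x.shape[-1], D)) / np.sqrt(x.shape[-1])
    return x @ W

text_emb = encode(rng.standard_normal((T_TEXT, D)))  # text transformer
spk_emb = encode(rng.standard_normal((T_REF, D)))    # speaker reference transformer

# Diffusion decoder: iteratively denoise audio latents while attending to
# the text and speaker conditioning streams (toy update shown here).
latents = rng.standard_normal((T_AUDIO, D))
cond = np.concatenate([text_emb, spk_emb], axis=0)
for step in range(4):  # a real sampler would run 30-60 steps
    latents = latents - 0.1 * (latents - cond.mean(axis=0))
```

The point of the sketch is the data flow: both conditioning streams are encoded once, then reused across every denoising step of the decoder, which is where the bulk of the ~1.4B-equivalent compute is spent.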
Echo-TTS generates 44.1kHz audio using the advanced Fish Speech S1-DAC codec. This high sampling rate ensures professional-grade audio quality with rich harmonic content, clear articulation, and natural-sounding speech that rivals studio recordings.
The model can generate audio segments of up to 30 seconds per inference, conditioned on target text and up to 2 minutes of speaker reference audio. This allows for flexible voice cloning applications where users can provide their own reference audio to create personalized voice profiles.
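Because each inference is capped at 30 seconds, longer scripts need to be split before synthesis. A minimal sketch of sentence packing under a word budget; the ~150 words-per-minute speaking rate is an assumption, not a figure from the Echo-TTS release:

```python
def split_for_tts(text, max_seconds=30.0, words_per_minute=150.0):
    """Greedily pack sentences into chunks that should fit one 30 s generation.

    words_per_minute is a rough average speaking rate (an assumption).
    A single sentence longer than the budget becomes its own oversized chunk.
    """
    budget = int(max_seconds * words_per_minute / 60.0)  # ~75 words per chunk
    chunks, current = [], []
    for sentence in text.replace("!", ".").replace("?", ".").split("."):
        words = sentence.split()
        if not words:
            continue
        if current and len(current) + len(words) > budget:
            chunks.append(" ".join(current))
            current = []
        current.extend(words)
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Each chunk can then be generated independently (reusing the same reference audio for voice consistency) and the resulting segments concatenated.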
Echo-TTS was trained on an extensive dataset of ~160,000 hours of podcast-like audio using a TPU v4-64 pod with the JAX/Flax framework. This massive training corpus ensures robust generalization across diverse speakers, accents, recording conditions, and speaking styles.
The training employed a batch size of 768 across 800,000 training steps, utilizing the Muon optimizer with BF16 computation for efficient mixed-precision training. This rigorous training process, supported by the TPU Research Cloud program, enables Echo-TTS to achieve exceptional audio quality and speaker fidelity.
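The training figures imply a substantial number of passes over the corpus. A back-of-the-envelope check, assuming an average segment length of 30 seconds (an assumption matching the generation window, not a stated training detail):

```python
BATCH_SIZE = 768
TRAIN_STEPS = 800_000
CORPUS_HOURS = 160_000
SEGMENT_SECONDS = 30.0  # assumed average segment length

samples_seen = BATCH_SIZE * TRAIN_STEPS                   # 614,400,000 segments
corpus_segments = CORPUS_HOURS * 3600 / SEGMENT_SECONDS   # 19,200,000 segments
epochs = samples_seen / corpus_segments
print(f"~{epochs:.0f} passes over the corpus")
```

Under that assumption the model sees each segment on the order of 32 times, which is consistent with large-scale audio training regimes.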
Deploy Echo-TTS for professional voice synthesis across industries
Generate professional audiobook narration with consistent voice quality. Ultra-fast generation enables efficient large-scale production. Clone author voices for authentic audiobook experiences.
Create high-quality voiceovers for videos, films, and multimedia content. Fast generation speeds enable rapid turnaround for commercial projects and content creation workflows.
Generate dynamic NPC dialogue and character voices. Low latency enables real-time voice generation for interactive gaming experiences and immersive virtual environments.
Produce podcast intros, outros, and voiceovers with custom voices. Clone your own voice to maintain consistency across episodes or create entirely new character voices.
Build screen readers, assistive communication devices, and accessibility applications. High-quality, natural-sounding voices improve user experience for individuals with disabilities.
Create engaging educational content with clear pronunciation and natural pacing. Voice cloning enables instructors to scale their courses while maintaining personal connection.
Deploy automated customer support with natural-sounding voices. Fast generation enables real-time conversational AI for customer service applications.
Generate vocal samples, spoken-word elements, and audio effects. 44.1kHz output ensures professional quality suitable for music production and audio engineering.
Build responsive voice assistants with personalized voices. Fast inference enables real-time conversational experiences for smart devices and virtual assistants.
See how Echo-TTS outperforms other leading TTS models
| Model | Generation Time | RTF (A100) | Audio Quality |
|---|---|---|---|
| 🔊 Echo-TTS | 1.45s | <0.05 | 44.1kHz |
| Higgs Audio v2 | ~12s | ~0.4 | 24kHz |
| VibeVoice-7B | ~55s | ~1.8 | 24kHz |
* Generation time for 30 seconds of audio. RTF = Real-Time Factor (lower is better).
Built by researchers at the University of Rochester
Lead Researcher
Echo-TTS was developed by Jordan Darefsky at the University of Rochester with support from the TPU Research Cloud program. The project represents cutting-edge research in diffusion-based voice synthesis, pushing the boundaries of speed and quality in text-to-speech technology.
Model weights and code are planned for release. However, the speaker-reference transformer weights will not be released due to safety considerations regarding potential misuse for voice impersonation. Alternative voice cloning methods may be provided in future releases.
Try Echo-TTS now and discover the future of diffusion-based text-to-speech technology