Ultra-Fast Diffusion-Based Voice AI
Discover Echo-TTS—a breakthrough 2.4B parameter diffusion transformer developed at the University of Rochester. Generate 30-second audio in just 1.45 seconds on an A100 GPU with RTF < 0.05. Clone voices from up to 2 minutes of reference audio at pristine 44.1kHz quality.
Experience ultra-fast voice cloning with diffusion-based generation
Note: Enter your text, optionally upload a reference audio file (up to 2 minutes) to clone a specific voice, and generate high-quality speech at 44.1kHz. The model generates 30-second segments in approximately 1.45 seconds on A100 GPU. The demo runs on Hugging Face Spaces and may take a moment to initialize.
Echo-TTS represents a breakthrough in diffusion-based text-to-speech synthesis, combining ultra-fast generation with high-fidelity voice cloning capabilities.
Echo-TTS achieves unprecedented speed with a real-time factor (RTF) of less than 0.05 on A100 GPUs. This means it can generate 30 seconds of high-quality audio in just 1.45 seconds, making it significantly faster than comparable open-source models.
For comparison, Higgs Audio v2 takes ~12 seconds, and VibeVoice-7B requires ~55 seconds for the same task. This exceptional speed makes Echo-TTS ideal for real-time applications, live streaming, interactive voice assistants, and large-scale batch processing.
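The real-time factor is simply generation time divided by audio duration, so the figures above can be checked directly. A minimal sketch (the 30-second clip length and per-model times are the approximate values quoted above):

```python
# Real-time factor (RTF) = time to generate / duration of generated audio.
# Lower is better; RTF < 1 means faster than real time.
AUDIO_SECONDS = 30.0

models = {
    "Echo-TTS": 1.45,        # seconds to generate 30 s of audio (A100)
    "Higgs Audio v2": 12.0,  # approximate, from the comparison above
    "VibeVoice-7B": 55.0,    # approximate
}

for name, gen_time in models.items():
    rtf = gen_time / AUDIO_SECONDS
    print(f"{name}: RTF = {rtf:.3f}")
```

Echo-TTS works out to roughly 0.048, consistent with the RTF < 0.05 claim, while VibeVoice-7B lands well above real time at about 1.83.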
Echo-TTS supports sophisticated voice cloning by conditioning on up to 2 minutes of speaker reference audio. The model captures unique voice characteristics, timbre, prosody, and speaking style to generate highly authentic synthetic speech that closely matches the reference speaker.
The system outputs pristine 44.1kHz audio using the Fish Speech S1-DAC codec, ensuring professional-grade quality suitable for audiobook narration, content creation, accessibility tools, and personalized voice applications. Each generation can produce up to 30 seconds of continuous speech.
Discover what makes Echo-TTS a cutting-edge voice synthesis solution for 2025
Advanced 2.4B parameter Diffusion Transformer architecture with three main components: speaker reference transformer, text transformer, and diffusion decoder. Effective decoder compute equivalent to a ~1.4B-parameter transformer.
Generate 30 seconds of audio in just 1.45 seconds on A100 GPU with RTF < 0.05. Significantly outperforms comparable models like Higgs Audio v2 (~12s) and VibeVoice-7B (~55s).
Clone voices from up to 2 minutes of reference audio. The model captures unique voice characteristics, timbre, prosody, and speaking style for highly authentic synthetic speech.
Outputs pristine 44.1kHz audio using Fish Speech S1-DAC codec. Professional-grade quality suitable for audiobooks, content creation, accessibility tools, and commercial applications.
Supports multiple sampling methods including joint unconditional CFG, independent guidance (separate text/speaker scales), alternating guidance between modalities, and configurable steps (30-60).
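The three guidance modes can be illustrated with the standard classifier-free guidance formula. This is a generic sketch of how joint, independent, and alternating guidance typically combine model predictions; the function names and exact weighting are assumptions, not Echo-TTS's released formulation:

```python
def joint_cfg(eps_cond, eps_uncond, scale):
    """Joint CFG: a single guidance scale on the combined (text + speaker) condition."""
    return eps_uncond + scale * (eps_cond - eps_uncond)

def independent_cfg(eps_text, eps_spk, eps_uncond, text_scale, spk_scale):
    """Independent guidance: separate scales push toward the text-only
    and speaker-only conditioned predictions."""
    return (eps_uncond
            + text_scale * (eps_text - eps_uncond)
            + spk_scale * (eps_spk - eps_uncond))

def alternating_cfg(step, eps_text, eps_spk, eps_uncond, scale):
    """Alternating guidance: guide on one modality per sampling step."""
    eps_cond = eps_text if step % 2 == 0 else eps_spk
    return eps_uncond + scale * (eps_cond - eps_uncond)

# In a real sampler these combine latent-tensor predictions at each of the
# 30-60 denoising steps; scalars stand in for tensors here.
print(joint_cfg(2.0, 1.0, 3.0))  # 4.0: guidance extrapolates past the condition
```

With a scale of 1.0 each mode reduces to the plain conditioned prediction; larger scales trade diversity for stronger adherence to the text or the reference voice.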
Trained on ~160,000 hours of podcast-like audio using TPU v4-64 pod with JAX/Flax. Batch size of 768 across 800,000 training steps with Muon optimizer and BF16 computation.
Built on cutting-edge diffusion transformer technology optimized for speed and quality
Echo-TTS employs a sophisticated 2.4B parameter Diffusion Transformer (DiT) architecture with three main components: a speaker reference transformer, a text transformer, and a diffusion decoder. The effective decoder compute is equivalent to approximately 1.4B transformer parameters.
This architecture enables the model to efficiently process both text and speaker conditioning while maintaining ultra-fast generation speeds. The diffusion-based approach allows for high-quality audio synthesis with better control over prosody, pitch, and speaking style compared to traditional autoregressive models.
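The three-component wiring can be sketched at a purely conceptual level. Everything below is illustrative: the dimensions, the linear-map stand-ins for the transformer stacks, and the toy denoising update are assumptions, not the released configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions -- illustration only.
D = 512          # model width
T_TEXT = 64      # text tokens
T_REF = 256      # speaker-reference frames (up to 2 min of audio)
T_AUDIO = 128    # latent audio frames for one 30 s segment

def encode(x):
    """Stand-in for a transformer stack: here just a fixed linear map."""
    W = rng.standard_normal((x.shape[-1], D)) / np.sqrt(x.shape[-1])
    return x @ W

text_emb = encode(rng.standard_normal((T_TEXT, D)))  # text transformer
spk_emb = encode(rng.standard_normal((T_REF, D)))    # speaker reference transformer

# Diffusion decoder: iteratively denoise audio latents while attending to
# the text and speaker conditioning streams (toy update shown here).
latents = rng.standard_normal((T_AUDIO, D))
cond = np.concatenate([text_emb, spk_emb], axis=0)
for step in range(4):  # a real sampler would run 30-60 steps
    latents = latents - 0.1 * (latents - cond.mean(axis=0))
```

The point of the sketch is the data flow: both conditioning streams are encoded once, then reused across every denoising step of the decoder, which is where the bulk of the ~1.4B-equivalent compute is spent.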
Echo-TTS generates 44.1kHz audio using the advanced Fish Speech S1-DAC codec. This high sampling rate ensures professional-grade audio quality with rich harmonic content, clear articulation, and natural-sounding speech that rivals studio recordings.
The model can generate audio segments of up to 30 seconds per inference, conditioned on target text and up to 2 minutes of speaker reference audio. This allows for flexible voice cloning applications where users can provide their own reference audio to create personalized voice profiles.
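Because each inference is capped at 30 seconds, longer scripts need to be split before synthesis. A minimal sketch of sentence packing under a word budget; the ~150 words-per-minute speaking rate is an assumption, not a figure from the Echo-TTS release:

```python
def split_for_tts(text, max_seconds=30.0, words_per_minute=150.0):
    """Greedily pack sentences into chunks that should fit one 30 s generation.

    words_per_minute is a rough average speaking rate (an assumption).
    A single sentence longer than the budget becomes its own oversized chunk.
    """
    budget = int(max_seconds * words_per_minute / 60.0)  # ~75 words per chunk
    chunks, current = [], []
    for sentence in text.replace("!", ".").replace("?", ".").split("."):
        words = sentence.split()
        if not words:
            continue
        if current and len(current) + len(words) > budget:
            chunks.append(" ".join(current))
            current = []
        current.extend(words)
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Each chunk can then be generated independently (reusing the same reference audio for voice consistency) and the resulting segments concatenated.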
Echo-TTS was trained on an extensive dataset of ~160,000 hours of podcast-like audio using a TPU v4-64 pod with the JAX/Flax framework. This massive training corpus ensures robust generalization across diverse speakers, accents, recording conditions, and speaking styles.
The training employed a batch size of 768 across 800,000 training steps, utilizing the Muon optimizer with BF16 computation for efficient mixed-precision training. This rigorous training process, supported by the TPU Research Cloud program, enables Echo-TTS to achieve exceptional audio quality and speaker fidelity.
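The training figures imply a substantial number of passes over the corpus. A back-of-the-envelope check, assuming an average segment length of 30 seconds (an assumption matching the generation window, not a stated training detail):

```python
BATCH_SIZE = 768
TRAIN_STEPS = 800_000
CORPUS_HOURS = 160_000
SEGMENT_SECONDS = 30.0  # assumed average segment length

samples_seen = BATCH_SIZE * TRAIN_STEPS                   # 614,400,000 segments
corpus_segments = CORPUS_HOURS * 3600 / SEGMENT_SECONDS   # 19,200,000 segments
epochs = samples_seen / corpus_segments
print(f"~{epochs:.0f} passes over the corpus")
```

Under that assumption the model sees each segment on the order of 32 times, which is consistent with large-scale audio training regimes.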
Deploy Echo-TTS for professional voice synthesis across industries
Generate professional audiobook narration with consistent voice quality. Ultra-fast generation enables efficient large-scale production. Clone author voices for authentic audiobook experiences.
Create high-quality voiceovers for videos, films, and multimedia content. Fast generation speeds enable rapid turnaround for commercial projects and content creation workflows.
Generate dynamic NPC dialogue and character voices. Low latency enables real-time voice generation for interactive gaming experiences and immersive virtual environments.
Produce podcast intros, outros, and voiceovers with custom voices. Clone your own voice to maintain consistency across episodes or create entirely new character voices.
Build screen readers, assistive communication devices, and accessibility applications. High-quality, natural-sounding voices improve user experience for individuals with disabilities.
Create engaging educational content with clear pronunciation and natural pacing. Voice cloning enables instructors to scale their courses while maintaining personal connection.
Deploy automated customer support with natural-sounding voices. Fast generation enables real-time conversational AI for customer service applications.
Generate vocal samples, spoken-word elements, and audio effects. 44.1kHz output ensures professional quality suitable for music production and audio engineering.
Build responsive voice assistants with personalized voices. Fast inference enables real-time conversational experiences for smart devices and virtual assistants.
See how Echo-TTS outperforms other leading TTS models
| Model | Generation Time | RTF (A100) | Audio Quality |
|---|---|---|---|
| 🔊 Echo-TTS | 1.45s | <0.05 | 44.1kHz |
| Higgs Audio v2 | ~12s | ~0.4 | 24kHz |
| VibeVoice-7B | ~55s | ~1.8 | 24kHz |
* Generation time for 30 seconds of audio. RTF = Real-Time Factor (lower is better).
Built by researchers at the University of Rochester
Lead Researcher
Echo-TTS was developed by Jordan Darefsky at the University of Rochester with support from the TPU Research Cloud program. The project represents cutting-edge research in diffusion-based voice synthesis, pushing the boundaries of speed and quality in text-to-speech technology.
Model weights and code are planned for release. However, the speaker-reference transformer weights will not be released due to safety considerations regarding potential misuse for voice impersonation. Alternative voice cloning methods may be provided in future releases.
Try Echo-TTS now and discover the future of diffusion-based text-to-speech technology