🧬 World's First Cross-Modal FFN Transfer · 🤗 Model Card

Darwin-TTS-1.7B-Cross

Blend an LLM's language understanding into a TTS model, enhancing emotional expressiveness without any training. Just weight-space arithmetic.

🎯 84 FFN tensors blended
✅ 100% shape match
⚡ 0 training required
🕐 <2 min build time
★ Recommended blend ratio: 3% (emotional expressiveness appears). The demo slider ranges from 0% (original) to 5%.
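The blend itself is plain weight-space arithmetic: at a 3% ratio, each blended tensor is 97% TTS and 3% LLM. A minimal sketch in PyTorch, using illustrative stand-in tensors rather than real model weights:

```python
import torch

alpha = 0.03  # recommended blend ratio (3%)

# Illustrative stand-ins for one FFN weight tensor from each model
tts_weight = torch.full((2, 2), 1.0)  # pretend TTS talker projection
llm_weight = torch.full((2, 2), 2.0)  # pretend LLM projection

# In-place linear interpolation: w <- (1 - alpha) * w + alpha * llm
tts_weight.lerp_(llm_weight, alpha)

print(tts_weight)  # every entry is 0.97 * 1.0 + 0.03 * 2.0 = 1.03
```

Because `lerp_` mutates the tensor in place, the blended model needs no extra memory beyond the two loaded state dicts.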

πŸ”¬ How It Works

1. Load Qwen3-TTS-1.7B
   A 4-module TTS model: talker (LM backbone) + code_predictor + speech_tokenizer + decoder

2. Extract the Qwen3-1.7B LLM FFN
   84 FFN tensors (gate_proj, up_proj, down_proj × 28 layers), identical in shape to the TTS talker's

3. Blend with lerp (α = 3%)
   p.lerp_(llm_weight, 0.03): 97% TTS + 3% LLM language understanding

4. Generate speech
   The TTS model now "understands" text slightly better, producing more emotionally expressive speech
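The 84-tensor figure is just the talker's layer count times the three FFN projections per layer; a quick sanity check:

```python
layers = 28                                        # Qwen3 talker depth (28L)
ffn_projs = ("gate_proj", "up_proj", "down_proj")  # per-layer FFN weights

total = layers * len(ffn_projs)
print(total)  # 84, matching the tensor count above
```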

Qwen3-TTS-1.7B (4-module):
├── talker (28L Qwen3 LM) ← FFN blended α=3%
├── code_predictor (5L) ← untouched
├── speech_tokenizer ← untouched
└── encoder/decoder ← untouched

Key mapping: talker.model.layers.N ↔ model.layers.N (1:1)
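Putting the pieces together, the transfer reduces to a short loop over the two state dicts. A hedged sketch, assuming both models are already loaded as PyTorch state dicts and using the 1:1 key mapping above; the helper name `blend_ffn` and the toy dicts are illustrative, not the project's actual build script:

```python
import torch

ALPHA = 0.03  # 3% LLM, 97% TTS
FFN_PROJS = ("gate_proj", "up_proj", "down_proj")  # 3 projs x 28 layers = 84 tensors

def blend_ffn(tts_state: dict, llm_state: dict, alpha: float = ALPHA) -> int:
    """Blend LLM FFN weights into the TTS talker in place; return tensors touched."""
    blended = 0
    for key, p in tts_state.items():
        # Only the talker's FFN projections are touched; every other module
        # (code_predictor, speech_tokenizer, encoder/decoder) stays intact.
        if not key.startswith("talker.model.layers."):
            continue
        if not any(proj in key for proj in FFN_PROJS):
            continue
        # Key mapping: talker.model.layers.N.* <-> model.layers.N.* (1:1)
        llm_key = key.replace("talker.", "", 1)
        q = llm_state.get(llm_key)
        if q is None or q.shape != p.shape:
            continue  # skip on any shape mismatch (here, 100% of keys match)
        p.lerp_(q.to(p.dtype), alpha)  # w <- 0.97 * w + 0.03 * llm
        blended += 1
    return blended
```

Run against the real checkpoints, the loop should report 84 blended tensors; a non-FFN key such as a `self_attn` projection passes through untouched.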