🧬 World's First Cross-Modal FFN Transfer · 🤗 Model Card

Darwin-TTS-1.7B-Cross

Blend an LLM's language understanding into a TTS model, enhancing emotional expressiveness without any training. Just weight-space arithmetic.

🎯 84 FFN tensors blended
✅ 100% shape match
⚡ 0 training required
🕐 <2 min build time
★ Recommended blend ratio: 3% (emotional expressiveness appears). The demo slider ranges from 0% (original) to 5%.
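The blend itself is plain weight-space arithmetic: at a 3% ratio, each blended tensor is 97% TTS and 3% LLM. A minimal sketch in PyTorch, using illustrative stand-in tensors rather than real model weights:

```python
import torch

alpha = 0.03  # recommended blend ratio (3%)

# Illustrative stand-ins for one FFN weight tensor from each model
tts_weight = torch.full((2, 2), 1.0)  # pretend TTS talker projection
llm_weight = torch.full((2, 2), 2.0)  # pretend LLM projection

# In-place linear interpolation: w <- (1 - alpha) * w + alpha * llm
tts_weight.lerp_(llm_weight, alpha)

print(tts_weight)  # every entry is 0.97 * 1.0 + 0.03 * 2.0 = 1.03
```

Because `lerp_` mutates the tensor in place, the blended model needs no extra memory beyond the two loaded state dicts.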

πŸ”¬ How It Works

1. Load Qwen3-TTS-1.7B
   A 4-module TTS model: talker (LM backbone) + code_predictor + speech_tokenizer + decoder

2. Extract the Qwen3-1.7B LLM FFN
   84 FFN tensors (gate_proj, up_proj, down_proj × 28 layers), identical in shape to the TTS talker's

3. Blend with lerp (α = 3%)
   p.lerp_(llm_weight, 0.03): 97% TTS + 3% LLM language understanding

4. Generate speech
   The TTS model now "understands" text slightly better, producing more emotionally expressive speech
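The 84-tensor figure is just the talker's layer count times the three FFN projections per layer; a quick sanity check:

```python
layers = 28                                        # Qwen3 talker depth (28L)
ffn_projs = ("gate_proj", "up_proj", "down_proj")  # per-layer FFN weights

total = layers * len(ffn_projs)
print(total)  # 84, matching the tensor count above
```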

Qwen3-TTS-1.7B (4-module):
├── talker (28L Qwen3 LM) ← FFN blended α=3%
├── code_predictor (5L) ← untouched
├── speech_tokenizer ← untouched
└── encoder/decoder ← untouched

Key mapping: talker.model.layers.N ↔ model.layers.N (1:1)
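Putting the pieces together, the transfer reduces to a short loop over the two state dicts. A hedged sketch, assuming both models are already loaded as PyTorch state dicts and using the 1:1 key mapping above; the helper name `blend_ffn` and the toy dicts are illustrative, not the project's actual build script:

```python
import torch

ALPHA = 0.03  # 3% LLM, 97% TTS
FFN_PROJS = ("gate_proj", "up_proj", "down_proj")  # 3 projs x 28 layers = 84 tensors

def blend_ffn(tts_state: dict, llm_state: dict, alpha: float = ALPHA) -> int:
    """Blend LLM FFN weights into the TTS talker in place; return tensors touched."""
    blended = 0
    for key, p in tts_state.items():
        # Only the talker's FFN projections are touched; every other module
        # (code_predictor, speech_tokenizer, encoder/decoder) stays intact.
        if not key.startswith("talker.model.layers."):
            continue
        if not any(proj in key for proj in FFN_PROJS):
            continue
        # Key mapping: talker.model.layers.N.* <-> model.layers.N.* (1:1)
        llm_key = key.replace("talker.", "", 1)
        q = llm_state.get(llm_key)
        if q is None or q.shape != p.shape:
            continue  # skip on any shape mismatch (here, 100% of keys match)
        p.lerp_(q.to(p.dtype), alpha)  # w <- 0.97 * w + 0.03 * llm
        blended += 1
    return blended
```

Run against the real checkpoints, the loop should report 84 blended tensors; a non-FFN key such as a `self_attn` projection passes through untouched.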