Blend an LLM's language understanding into a TTS model, enhancing emotional expressiveness without any training. Just weight-space arithmetic.
A 4-module TTS model: talker (LM backbone) + code_predictor + speech_tokenizer + decoder
84 FFN tensors (gate_proj, up_proj, down_proj × 28 layers), identical in shape to the TTS talker's
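The 84-tensor count can be sketched by enumerating the shared FFN weight names. This is a minimal sketch; the key naming below is modeled on common Hugging Face Llama/Qwen-style checkpoints and is an assumption, since the source does not give the exact state-dict prefixes.

```python
# Hypothetical key naming (HF Llama/Qwen style); the real talker/LLM
# prefixes may differ.
FFN_NAMES = ("gate_proj", "up_proj", "down_proj")
N_LAYERS = 28

def ffn_keys(n_layers=N_LAYERS):
    """Enumerate the FFN weight names shared by the TTS talker and the LLM."""
    return [
        f"model.layers.{i}.mlp.{name}.weight"
        for i in range(n_layers)
        for name in FFN_NAMES
    ]

keys = ffn_keys()
# 3 projections x 28 layers = 84 tensors
```

Only these name-and-shape-matched tensors participate in the merge; everything else in the TTS model stays untouched.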
p.lerp_(llm_weight, 0.03): 97% TTS + 3% LLM language understanding
The TTS model now "understands" text slightly better, producing more emotionally expressive speech