TokenSynth: A Token-based Neural Synthesizer for Instrument Cloning and Text-to-Instrument

TokenSynth is a codec-language-model-based neural synthesizer that generates audio from MIDI, with timbre conditioned on reference audio or a text description, enabling zero-shot instrument cloning and text-to-instrument synthesis.

TokenSynth is a decoder-only Transformer that autoregressively generates neural codec tokens conditioned on MIDI tokens and a timbre embedding from a pretrained CLAP encoder. Because CLAP maps audio and text into a shared embedding space, TokenSynth can clone timbre from reference audio, synthesize instruments from text descriptions, and even interpolate between audio and text conditions without fine-tuning. It is trained on large-scale synthetic MIDI-audio pairs built from NSynth and Lakh MIDI, using DAC for audio tokenization. At inference time, it supports classifier-free guidance and introduces first-note guidance, which stabilizes timbre while avoiding guidance-induced noise in silent regions.
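Because audio and text conditions share the CLAP embedding space, blending them reduces to mixing two embedding vectors. The sketch below shows one plausible scheme, not necessarily the paper's exact method: the `interpolate_timbre` name and the renormalization step are assumptions, motivated by CLAP embeddings being compared via cosine similarity.

```python
import torch
import torch.nn.functional as F

def interpolate_timbre(audio_emb: torch.Tensor,
                       text_emb: torch.Tensor,
                       alpha: float) -> torch.Tensor:
    """Blend a CLAP audio embedding with a CLAP text embedding.

    alpha = 0.0 uses the audio condition alone; alpha = 1.0 uses the
    text condition alone. Embeddings are L2-normalized before and after
    mixing so the result stays on the unit hypersphere, matching how
    CLAP embeddings are compared (cosine similarity).
    """
    a = F.normalize(audio_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    mixed = (1.0 - alpha) * a + alpha * t
    return F.normalize(mixed, dim=-1)
```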
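First-note guidance can be pictured as an ordinary classifier-free-guidance sampling loop in which the guided logits are used only while the first note is being generated; afterwards the conditional logits are sampled directly, so guidance cannot amplify noise in silent regions. This is a minimal sketch under assumed names (`model`, `null_emb`, `first_note_end`), with DAC's multiple codebooks collapsed into a single token stream for brevity.

```python
import torch

@torch.no_grad()
def sample_with_first_note_guidance(model, midi_tokens, timbre_emb,
                                    null_emb, num_steps: int,
                                    cfg_scale: float, first_note_end: int):
    """Autoregressive codec-token sampling with first-note guidance.

    `model(midi, timbre, prefix)` is assumed to return next-token logits.
    Classifier-free guidance runs only for steps before `first_note_end`,
    the (assumed) index where the first note's tokens finish.
    """
    tokens: list[int] = []
    for step in range(num_steps):
        cond = model(midi_tokens, timbre_emb, tokens)       # timbre-conditioned
        if step < first_note_end and cfg_scale > 1.0:
            uncond = model(midi_tokens, null_emb, tokens)   # timbre dropped
            logits = uncond + cfg_scale * (cond - uncond)   # standard CFG mix
        else:
            logits = cond                                   # no guidance later
        probs = torch.softmax(logits, dim=-1)
        tokens.append(torch.multinomial(probs, num_samples=1).item())
    return tokens
```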

Paper | Code | Demo

TokenSynth: A Token-based Neural Synthesizer for Instrument Cloning and Text-to-Instrument. Kyungsu Kim, Junghyun Koo, Sungho Lee, Haesun Joung, Kyogu Lee. ICASSP 2025.