Sori: Real-time General Audio to MIDI Transformation with Accordance-Musicality Trade-off Controllability

Sori is a real-time system that transforms arbitrary audio into symbolic music, with a controllable trade-off between input accordance and musicality.

Sori (meaning “sound” in Korean) tackles General Audio to Symbolic Music Transformation (GASMT), a newly proposed task that aims to generate symbolic music that is both faithful to the input audio and musically coherent, even when the input is non-musical.

It is built on a streaming, causal encoder-decoder architecture, where the encoder is trained with domain adversarial learning to produce domain-invariant representations across musical and general audio. A causal Transformer decoder then autoregressively generates onset and offset events, while frame activity is reconstructed from past events to mitigate autoregressive degeneration.
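To make the idea of reconstructing frame activity from past events concrete, here is a minimal sketch. It is not taken from the Sori implementation: the event format `(frame_index, kind, pitch)`, the function name `frame_activity`, and the convention that a note is active from its onset frame up to (but not including) its offset frame are all assumptions made for illustration.

```python
from typing import Dict, Iterable, List, Set, Tuple

# Hypothetical event format (an assumption, not Sori's actual token scheme):
# (frame_index, kind, pitch), where kind is "on" or "off".
Event = Tuple[int, str, int]


def frame_activity(events: Iterable[Event], num_frames: int) -> List[Set[int]]:
    """Derive per-frame active pitches purely from onset/offset events."""
    by_frame: Dict[int, List[Tuple[str, int]]] = {}
    for frame, kind, pitch in events:
        by_frame.setdefault(frame, []).append((kind, pitch))

    active: Set[int] = set()
    activity: List[Set[int]] = []
    for t in range(num_frames):
        # Apply offsets before onsets, so a pitch released and re-struck
        # in the same frame remains active.
        ordered = sorted(by_frame.get(t, []), key=lambda e: e[0] != "off")
        for kind, pitch in ordered:
            if kind == "on":
                active.add(pitch)
            else:
                active.discard(pitch)
        activity.append(set(active))  # copy the state for this frame
    return activity


events = [(0, "on", 60), (2, "on", 64), (3, "off", 60)]
print(frame_activity(events, num_frames=5))
```

Because the frame state is computed deterministically from events already emitted, the decoder in such a scheme only has to predict sparse onset and offset events rather than a dense frame-level sequence.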

At inference time, classifier-free guidance provides a simple, real-time control knob for adjusting the trade-off between input accordance and musicality.
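As a rough illustration of how such a control knob can work, the sketch below applies the standard classifier-free guidance combination to next-event logits. The function names, the toy logits, and the interpretation of the guidance scale (0 follows the unconditional music prior, 1 follows the audio-conditioned prediction, larger values push further toward the input) are assumptions; the exact formulation used in Sori is not specified here.

```python
import math
from typing import List, Sequence


def guided_logits(cond: Sequence[float], uncond: Sequence[float],
                  scale: float) -> List[float]:
    """Standard classifier-free guidance: uncond + scale * (cond - uncond)."""
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy logits over three candidate events (illustrative values only).
cond = [2.0, 0.0, -1.0]    # decoder conditioned on the input audio
uncond = [0.5, 0.5, 0.0]   # decoder with the audio conditioning dropped

for scale in (0.0, 1.0, 2.0):
    probs = softmax(guided_logits(cond, uncond, scale))
    print(scale, [round(p, 3) for p in probs])
```

Since the scale only enters a per-step arithmetic combination of two logit vectors, it can be changed between decoding steps without retraining, which is what makes it usable as a real-time control.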

Paper | Demo

Sori: Real-time General Audio to MIDI Transformation with Accordance–Musicality Trade-off Controllability. Kyungsu Kim, Yejin Kim, Kyogu Lee. ISMIR 2025, Late Breaking Demo