Singing MIDI Transcription with Music Language Models

This is the accompanying webpage for the following paper.

Yu Sugimoto, Jun-You Wang, Li Su, Eita Nakamura
Singing MIDI transcription with music language models: Formulation and comparison
Proc. APSIPA ASC, to be presented, 2025.

Abstract of the research

This study investigates the use of music language models (LMs) in singing MIDI transcription, the task of estimating the pitch, onset time, and offset time of each note in the vocal part from a musical audio signal. While recent studies have investigated acoustic models that predict pitch frame by frame using deep neural networks (DNNs), transcription errors remain due to large pitch fluctuations and ambiguous note boundaries in singing. To address this issue, we formulate Markov- and DNN-based LMs that estimate pitch probabilities at the note level, and integrate them with a DNN-based acoustic model using two methods: generative modeling and the sequential transducer. Experimental results show that both integration methods significantly improve transcription accuracy over a baseline acoustic model. Moreover, different strengths and characteristics of the compared LMs and integration methods are discussed.
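To make the idea of combining a note-level LM with an acoustic model concrete, the following is a minimal sketch, not the formulation used in the paper. It assumes a first-order Markov LM over a small set of pitch candidates and per-note acoustic log-likelihoods, and finds the best pitch sequence by Viterbi decoding. The function name `viterbi_decode`, the `lm_weight` parameter, and all toy probabilities are hypothetical and chosen only for illustration.

```python
import math


def viterbi_decode(acoustic_loglik, lm_log_init, lm_log_trans, lm_weight=1.0):
    """Return the pitch-index sequence with the highest combined score.

    acoustic_loglik[n][p]: log-likelihood of pitch candidate p for note n
                           (stand-in for an acoustic model's output; assumed).
    lm_log_init[p]:        LM log-probability of pitch p for the first note.
    lm_log_trans[q][p]:    LM log-probability of pitch p given previous pitch q.
    lm_weight:             hypothetical scaling factor for the LM terms.
    """
    n_notes = len(acoustic_loglik)
    n_pitches = len(lm_log_init)

    # Best score of any path ending in each pitch at the first note.
    score = [lm_weight * lm_log_init[p] + acoustic_loglik[0][p]
             for p in range(n_pitches)]
    backptr = []

    for n in range(1, n_notes):
        new_score, ptr = [], []
        for p in range(n_pitches):
            # Best predecessor pitch q for reaching pitch p at note n.
            best_q = max(range(n_pitches),
                         key=lambda q: score[q] + lm_weight * lm_log_trans[q][p])
            new_score.append(score[best_q]
                             + lm_weight * lm_log_trans[best_q][p]
                             + acoustic_loglik[n][p])
            ptr.append(best_q)
        score = new_score
        backptr.append(ptr)

    # Trace back from the best final pitch.
    path = [max(range(n_pitches), key=lambda p: score[p])]
    for ptr in reversed(backptr):
        path.append(ptr[path[-1]])
    path.reverse()
    return path


# Toy data (made up): three notes, three pitch candidates.
log = math.log
acoustic = [[log(0.8), log(0.1), log(0.1)],
            [log(0.1), log(0.4), log(0.5)],   # ambiguous second note
            [log(0.1), log(0.8), log(0.1)]]
init = [log(1 / 3)] * 3
trans = [[log(0.50), log(0.40), log(0.10)],   # the LM prefers small pitch steps
         [log(0.25), log(0.50), log(0.25)],
         [log(0.10), log(0.40), log(0.50)]]

acoustic_only = viterbi_decode(acoustic, init, trans, lm_weight=0.0)  # [0, 2, 1]
with_lm = viterbi_decode(acoustic, init, trans, lm_weight=1.0)        # [0, 1, 1]
```

In this toy case the acoustic scores alone pick pitch candidate 2 for the ambiguous second note, while adding the LM terms changes it to candidate 1, because the LM assigns a low probability to the jump from candidate 0 to candidate 2.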

Example of automatically transcribed MIDI

Contact

For any inquiries, please contact Yu Sugimoto (sugimoto.yu.681[at]s.kyushu-u.ac.jp).