Bridging Piano Transcription and Rendering via Disentangled Score Content and Style

Anonymous Author(s)
Affiliation
JointPianist

Abstract

Expressive performance rendering (EPR) and automatic piano transcription (APT) are fundamental yet inverse tasks in music information retrieval: EPR generates expressive performances from symbolic scores, while APT recovers scores from performances. Despite their dual nature, prior work has addressed them independently. In this paper, we propose a unified framework that jointly models EPR and APT by disentangling note-level score content and global performance style representations from both paired and unpaired data. Our framework is built on a transformer-based sequence-to-sequence (Seq2Seq) architecture and is trained using only sequence-aligned data, without requiring fine-grained note-level alignment. To automate the rendering process while ensuring stylistic compatibility with the score, we introduce an independent diffusion-based performance style recommendation (PSR) module that generates style embeddings directly from score content. This modular component supports both style transfer and flexible rendering across a range of expressive styles. Experimental results from both objective and subjective evaluations demonstrate that our framework achieves competitive performance on EPR and APT tasks, while enabling effective content–style disentanglement, reliable style transfer, and stylistically appropriate rendering.
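For concreteness, below is a minimal sketch of the disentangled content–style Seq2Seq layout described in the abstract, written in PyTorch. All names (e.g. JointPianistSketch), sizes, and interfaces are our illustrative assumptions, not the released implementation; positional encodings and causal masking are omitted for brevity.

import torch.nn as nn

class JointPianistSketch(nn.Module):
    """Hypothetical layout: shared content encoder, global style encoder,
    and one decoder reused for both rendering (EPR) and transcription (APT)."""
    def __init__(self, vocab_size=512, d_model=256, n_layers=4, n_heads=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        # Content encoder: maps a score or a performance token sequence to
        # note-level score-content features shared by EPR and APT.
        self.content_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True), n_layers)
        # Style encoder: pools a performance into one global style vector.
        self.style_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True), n_layers)
        # Decoder: renders a performance (EPR) or recovers a score (APT)
        # from content features conditioned on the global style vector.
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True), n_layers)
        self.head = nn.Linear(d_model, vocab_size)

    def encode_content(self, tokens):              # (B, T) -> (B, T, D)
        return self.content_encoder(self.embed(tokens))

    def encode_style(self, perf_tokens):           # (B, T) -> (B, 1, D)
        h = self.style_encoder(self.embed(perf_tokens))
        return h.mean(dim=1, keepdim=True)         # sequence-level pooling

    def decode(self, tgt_tokens, content, style):  # -> (B, T, vocab_size)
        memory = content + style                   # broadcast style over time
        return self.head(self.decoder(self.embed(tgt_tokens), memory))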


Figure 1: Relationship between EPR and APT (top left) and an overview of the proposed framework. The model comprises a joint transformer-based architecture for EPR and APT, along with a diffusion-based performance style recommendation (PSR) module. Four tasks are trained jointly: masked score reconstruction, masked performance reconstruction, expressive performance rendering (EPR), and automatic piano transcription (APT). Score content features, extracted from the score and performance inputs respectively, are encouraged to align. A global style feature is learned as a disentangled factor to support style transfer. The PSR module is trained to generate a style representation from score content alone, emulating a pianist's ability to select an appropriate performance style.
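As one concrete reading of the caption, the sketch below composes the four objectives plus the content-alignment term, reusing the hypothetical JointPianistSketch interface above. The unit loss weights, the zero-style convention for scores, and the equal-length alignment assumption are ours, not the paper's.

import torch
import torch.nn.functional as F

def joint_loss(model, score, perf, masked_score, masked_perf, w_align=1.0):
    c_score = model.encode_content(score)  # content features from the score
    c_perf = model.encode_content(perf)    # content features from the performance
    style = model.encode_style(perf)       # global performance style
    no_style = torch.zeros_like(style)     # convention: scores carry no style

    def ce(logits, target):                # token-level cross-entropy
        return F.cross_entropy(logits.transpose(1, 2), target)

    l_score = ce(model.decode(masked_score, c_score, no_style), score)       # masked score reconstruction
    l_perf = ce(model.decode(masked_perf, c_perf, style), perf)              # masked performance reconstruction
    l_epr = ce(model.decode(perf[:, :-1], c_score, style), perf[:, 1:])      # EPR: score content -> performance
    l_apt = ce(model.decode(score[:, :-1], c_perf, no_style), score[:, 1:])  # APT: performance content -> score
    # Content features from both inputs are encouraged to align;
    # equal sequence lengths are assumed here for simplicity.
    l_align = F.mse_loss(c_score, c_perf)
    return l_score + l_perf + l_epr + l_apt + w_align * l_align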

Demo

EPR results with baselines

[Audio examples: samples 1–5, each presented under six conditions: Score, Human, DExter, VirtuosoNet, Ours (Target Style), and Ours (PSR).]
  • Score: MIDI exported directly from MuseScore using the MusicXML file.
  • Human: Human-performed rendition.
  • DExter: Performance generated by the DExter model.
  • VirtuosoNet: Performance generated by the VirtuosoNet model.
  • Ours (Target Style): Performance generated by our model using the style embedding extracted from the corresponding human performance.
  • Ours (PSR): Performance generated by our model using the style embedding generated by the PSR module. (A sketch of both "Ours" conditions follows this list.)
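A hedged sketch of how the two "Ours" conditions differ, reusing the hypothetical model above; psr.sample and generate (an autoregressive decoding loop) are placeholder interfaces we introduce for illustration, not the paper's API.

def render(model, psr, score_tokens, human_perf_tokens=None):
    content = model.encode_content(score_tokens)
    if human_perf_tokens is not None:
        # Ours (Target Style): style extracted from the human performance.
        style = model.encode_style(human_perf_tokens)
    else:
        # Ours (PSR): style sampled from score content by the diffusion model.
        style = psr.sample(content)
    return generate(model, content, style)  # decoding loop omitted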

Style transfer

[Audio examples: samples 1–3, each presented under four conditions: Original, Target, Transfer, and Mean.]
  • Original: Performance generated using the style extracted from the original (source) performance.
  • Target: A reference performance with different musical content but the desired style.
  • Transfer: Performance generated using the style transferred from the target reference to the original content.
  • Mean: Performance generated by averaging the style embeddings of the original and target performances, i.e., interpolating in the style space (see the sketch after this list).
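All four conditions reduce to choosing which style vector conditions the decoder. A sketch using the same hypothetical interface as above, where the "Mean" row is the midpoint between the two style embeddings:

s_orig = model.encode_style(original_perf)     # style of the source performance
s_target = model.encode_style(target_perf)     # style of the reference performance
s_mean = 0.5 * (s_orig + s_target)             # "Mean": midpoint in style space

content = model.encode_content(original_score)
original = generate(model, content, s_orig)    # "Original" row
transfer = generate(model, content, s_target)  # "Transfer" row
mean_out = generate(model, content, s_mean)    # "Mean" row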