Abstract
This paper introduces M2M-Gen, a multimodal framework for generating background music tailored to Japanese manga. The key challenge in this task is the lack of available data and of an established baseline. We propose M2M-Gen, an automated pipeline that produces background music for an input manga book. First, we use the dialogues in a manga to detect scene boundaries, then perform emotion classification and generate detailed captions for each page within a scene. GPT-4o transforms these detailed captions into high-level musical directives that guide a text-to-music model to produce music aligned with the manga's evolving narrative. The effectiveness of M2M-Gen is confirmed through extensive subjective evaluations, which show that it significantly enhances the manga reading experience by pairing scenes with music that complements them.
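The pipeline described above (scene boundary detection, emotion classification, captioning, directive generation, and text-to-music synthesis) can be sketched as a simple data flow. This is a minimal illustrative sketch: every function below is a hypothetical stub standing in for the paper's actual models, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Page:
    dialogues: List[str]

@dataclass
class Scene:
    pages: List[Page]
    emotion: str = ""
    caption: str = ""

def detect_scene_boundaries(pages: List[Page]) -> List[Scene]:
    """Stub boundary detector: here a page with no dialogue ends a scene,
    standing in for the paper's dialogue-based detection."""
    scenes, current = [], []
    for page in pages:
        current.append(page)
        if not page.dialogues:
            scenes.append(Scene(pages=current))
            current = []
    if current:
        scenes.append(Scene(pages=current))
    return scenes

def classify_emotion(scene: Scene) -> str:
    """Stub emotion classifier (the real system uses a trained model)."""
    text = " ".join(d for p in scene.pages for d in p.dialogues)
    return "tense" if "!" in text else "calm"

def caption_scene(scene: Scene) -> str:
    """Stub captioner producing a detailed per-scene description."""
    return f"A {scene.emotion} scene spanning {len(scene.pages)} page(s)."

def caption_to_directive(caption: str) -> str:
    """Stub for the GPT-4o step that maps captions to musical directives."""
    return f"Compose background music matching: {caption}"

def generate_music(directive: str) -> str:
    """Stub text-to-music call; returns a placeholder track name."""
    return f"track[{directive}]"

def m2m_gen(pages: List[Page]) -> List[str]:
    """One track per detected scene, following the pipeline's stage order."""
    tracks = []
    for scene in detect_scene_boundaries(pages):
        scene.emotion = classify_emotion(scene)
        scene.caption = caption_scene(scene)
        tracks.append(generate_music(caption_to_directive(scene.caption)))
    return tracks
```

The stubs only fix the interfaces between stages; in the actual system each would be replaced by the corresponding model call.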
The following sections provide examples of background music generated for manga scenes by M2M-Gen, a baseline model, and a random model.
Each manga excerpt below is paired with three audio tracks: one from M2M-Gen, one from the baseline, and one from the random model.

- Courtesy of Kato Masaki
- Courtesy of Tanaka Masato
- Courtesy of Aida Mayumi
- Courtesy of Shindou Uni
- Courtesy of Inohara Daisuke
- Courtesy of Taira Masamie