DGSNA: prompt-based Dynamic Generative Scene-based Noise Addition method
Zihao Chen, Zhentao Lin, Bi Zeng, Linyi Huang, Zhi Li, Jia Cai
Abstract
This paper addresses two challenges: accurately enumerating and describing scenes, and the labor-intensive process required to replicate acoustic environments with non-generative methods. We introduce the prompt-based Dynamic Generative Scene-based Noise Addition method (DGSNA), which combines the Dynamic Generation of Scene Information (DGSI) with Scene-based Noise Addition for Audio (SNAA). Using generative chat models structured within the Background-Examples-Task (BET) prompt framework, the DGSI component dynamically synthesizes tailored Scene Information (SI) for specific acoustic environments. The SNAA component then uses Room Impulse Response (RIR) filters and Text-To-Audio (TTA) systems to generate realistic, scene-based noise that can be adapted to both indoor and outdoor environments. Experiments demonstrate that DGSNA adapts across different generative chat models. The results, assessed through both objective and subjective evaluations, show that DGSNA performs robustly in dynamically generating precise SI and in scene-based noise addition, offering significant improvements over traditional methods of acoustic scene simulation.
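Airport in use [airport]:
An airport [airport] with length, width, and height (15,10,6); the microphone is at (7.5,5,2.4) and the human voice at (4,8,3); the noise sources are aircraft takeoff and landing [the sound of planes taking off and landing] at (5,9,2.4) and public announcements [the sound of the announcement system] at (7.5,10,2.4).
Noisy bus [bus]:
A bus [bus] with length, width, and height (6,4.5,2.5); the microphone is at (3,2,1.2) and the human voice at (2,2.5,1.6); the noise sources are passengers talking [the sound of passengers talking] at (1.5,2.5,1.2) and engine noise [the sound of engine] at (4,2.5,2.5).
Modern metro [metro]:
A modern metro [metro] with length, width, and height (20,4,6); the microphone is at (10,2,1.2) and the human voice at (8,3,1.4); the noise sources are passengers chatting [the sound of passengers talking] at (0.5,2.5,1.2) and wheel friction [the sound of train wheels scraping] at (8,1.5,1.6).
Bright metro station [metro_station]:
A metro station [metro_station] with length, width, and height (8,7,5); the microphone is at (3,4.5,1.2) and the human voice at (0.5,3.5,1.6); the noise sources are broadcasts [the sound of broadcast] at (3,6.5,1.2) and stop announcements [the sound of announcements] at (0.5,4.5,1.2).
Unused park [park]:
A park [park] with length, width, and height (10,8,4); the microphone is at (5,3,1.2) and the human voice at (3,1.5,1.6); the noise sources are birdsong [the sound of birds chirping] at (1,0.5,1.2) and rustling leaves [the sound of leaves rustling] at (9,2.5,1.2).
Well-lit public square [public_square]:
A public square [public_square] with length, width, and height (10,5,4); the microphone is at (2,1.5,1.2) and the human voice at (3,2,1.2); the noise sources are dancing [the sound of dancing] at (2.5,2.5,1.2) and singing [the sound of singing] at (3.5,2.5,1.2).
Retro shopping mall [shopping_mall]:
A retro shopping mall [shopping_mall] with length, width, and height (10,7,4); the microphone is at (5,4,1.2) and the human voice at (3,6,1.6); the noise sources are customers chatting [the sound of customer talking] at (3.5,4.5,1.2) and in-store background music [the sound of background music] at (0.5,3,1.2).
Dimly lit pedestrian street [street_pedestrian]:
A pedestrian street [street_pedestrian] with length, width, and height (10,5,4); the microphone is at (5,2.5,1.2) and the human voice at (7,1.5,1.6); the noise sources are footsteps [the sound of footsteps] at (0.5,0.5,1.2) and vehicles [the sound of vehicle] at (8,4.5,1.2).
Comfortable traffic street [street_traffic]:
A traffic street [street_traffic] with length, width, and height (10,5,3); the microphone is at (0.5,0.5,1.2) and the human voice at (5,2.5,1.6); the noise sources are traffic flow [the sound of traffic] at (0.5,3,1.2) and horns [the sound of horn] at (5,0.5,2).
Cluttered tram [tram]:
A tram [tram] with length, width, and height (6,3,4); the microphone is at (3.5,0.5,1.2) and the human voice at (2,1.5,1.6); the noise sources are passengers talking [the sound of passengers talking] at (0.5,2.5,1.2), vehicle noise [the sound of the tram] at (2,0.5,2), and broadcasts [the sound of broadcasting] at (4,0.5,1.2).
Quiet balcony [balcony]:
A balcony [balcony] with length, width, and height (6,2.5,4); the microphone is at (5,1.5,1.2) and the human voice at (6,1.5,2); the noise sources are a light breeze [the sound of breeze] at (4,0.5,2) and birdsong [the sound of birds chirping] at (6,1.5,2).
Luxurious bathroom [bathroom]:
A luxurious bathroom [bathroom] with length, width, and height (9,6,4); the microphone is at (2.5,2.5,1.2) and the human voice at (6,2,1.6); the noise sources are running water [the sound of running water] at (2.5,2.5,2), a shower [the sound of shower] at (6,2.5,2), steam [the sound of steam] at (2.5,2.5,2), and music [the sound of music] at (0.5,2.5,1.2).
Shabby car [car]:
A car [car] with length, width, and height (4,2,3); the microphone is at (1.5,0.5,1.2) and the human voice at (2.5,1.5,1.6); the noise sources are the engine [the sound of engine] at (0.5,0.5,2) and the tires [the sound of tires] at (2.5,0.5,1.2).
Cheerful kitchen [kitchen]:
A cheerful kitchen [kitchen] with length, width, and height (6,4,4); the microphone is at (2.5,1.5,1.2) and the human voice at (1,1.5,1.6); the noise sources are laughter [the sound of laughter] at (2,2.5,1.2) and cooking [the sound of cooking] at (2.5,2.5,2).
Tidy living room [living_room]:
A tidy living room [living room] with length, width, and height (8,5,4); the microphone is at (4,2.5,1.2) and the human voice at (1,0.75,1.2); the noise sources are speech [the sound of speaking] at (0.75,2.5,1.2) and a television [the sound of TV] at (4,4.5,1.2).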
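The SI entries above share a regular layout: a bracketed scene tag, a size triple, a microphone triple, a voice triple, and then one bracketed description plus one coordinate triple per noise source. The following is a minimal sketch of how such a string could be turned into structured data before it is handed to the RIR and TTA stages. It is not the authors' parser; the names `SceneInfo`, `NoiseSource`, and `parse_scene_info` are hypothetical, and the sketch assumes the ordering just described always holds. Because it relies only on the order of bracketed tags and parenthesized triples, the surrounding wording (and its language) does not matter; the sample string is an English paraphrase of the airport entry.

```python
import re
from dataclasses import dataclass

Vec3 = tuple[float, float, float]


@dataclass
class NoiseSource:
    description: str  # text inside the brackets, e.g. usable as a TTA prompt
    position: Vec3    # source coordinates inside the scene


@dataclass
class SceneInfo:
    scene: str        # scene tag, e.g. "airport"
    room_size: Vec3   # (length, width, height)
    mic: Vec3         # microphone coordinates
    voice: Vec3       # human-voice coordinates
    noises: list[NoiseSource]


# Assumed layout: "(x,y,z)" triples and "[tag]" labels, in a fixed order.
_TRIPLE = re.compile(r"\(\s*([\d.]+)\s*,\s*([\d.]+)\s*,\s*([\d.]+)\s*\)")
_TAG = re.compile(r"\[([^\]]+)\]")


def parse_scene_info(text: str) -> SceneInfo:
    """Parse one SI string, relying only on the order of tags and triples."""
    triples = [tuple(float(v) for v in m.groups()) for m in _TRIPLE.finditer(text)]
    tags = _TAG.findall(text)
    if len(triples) < 3 or not tags:
        raise ValueError("expected a scene tag and at least three coordinate triples")
    noise_tags, noise_positions = tags[1:], triples[3:]
    if len(noise_tags) != len(noise_positions):
        raise ValueError("each noise source needs one tag and one position")
    noises = [NoiseSource(d, p) for d, p in zip(noise_tags, noise_positions)]
    return SceneInfo(tags[0], triples[0], triples[1], triples[2], noises)


EXAMPLE = (
    "An airport [airport] with length, width, and height (15,10,6); "
    "microphone at (7.5,5,2.4), human voice at (4,8,3), "
    "[the sound of planes taking off and landing] at (5,9,2.4), "
    "[the sound of the announcement system] at (7.5,10,2.4)."
)
si = parse_scene_info(EXAMPLE)
print(si.scene, si.room_size, [n.description for n in si.noises])
```

A parser of this kind also gives a natural place to reject malformed model output, for example an SI string in which the number of noise descriptions and the number of source positions disagree.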
Example Analysis
By combining the BET prompt framework with generative chat models, TTA systems, and RIR filters, we can implement DGSNA. This section walks through the process by which scene-based audio is generated, using worked examples.
Speech Audio | | |
---|---|---|
Noise Audio 1 | Noise Audio 2 | Noise Audio 3 |
Scene-based Audio | | |
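The layout above (one speech signal, several noise signals, one scene-based output) can be illustrated with a short mixing sketch. This is an assumption-laden toy, not the paper's SNAA implementation: the RIR here is a synthetic exponentially decaying noise burst rather than one computed from the room geometry and source positions, random arrays stand in for TTA-generated noise, and `synthetic_rir`, `render_scene`, and the `noise_gain` parameter are hypothetical names introduced only for illustration. The source does not state the mixing levels it uses.

```python
import numpy as np


def synthetic_rir(rt60: float, sr: int, seed: int = 0) -> np.ndarray:
    """Toy RIR: a unit direct path followed by exponentially decaying noise."""
    if rt60 <= 0:
        raise ValueError("rt60 must be positive")
    rng = np.random.default_rng(seed)
    n = max(1, int(rt60 * sr))
    t = np.arange(n) / sr
    # Amplitude falls by 60 dB (a factor of 1e-3) after rt60 seconds.
    envelope = 10.0 ** (-3.0 * t / rt60)
    rir = 0.1 * rng.standard_normal(n) * envelope
    rir[0] = 1.0
    return rir


def render_scene(speech, noises, rir_speech, rir_noises, noise_gain=0.5):
    """Filter each signal with its own RIR, then sum into one scene-based signal."""
    if len(noises) != len(rir_noises):
        raise ValueError("need exactly one RIR per noise signal")
    parts = [np.convolve(speech, rir_speech)]
    parts += [noise_gain * np.convolve(x, h) for x, h in zip(noises, rir_noises)]
    out = np.zeros(max(len(p) for p in parts))
    for p in parts:
        out[: len(p)] += p  # shorter signals are implicitly zero-padded
    peak = np.max(np.abs(out))
    return out / peak if peak > 1.0 else out  # rescale only if it would clip


sr = 8000
t = np.arange(sr // 2) / sr
speech = 0.5 * np.sin(2 * np.pi * 220 * t)  # stand-in for a speech recording
rng = np.random.default_rng(1)
noises = [0.1 * rng.standard_normal(sr // 2) for _ in range(3)]  # stand-ins for TTA output
rirs = [synthetic_rir(0.2, sr, seed=k) for k in range(4)]
scene_audio = render_scene(speech, noises, rirs[0], rirs[1:])
print(scene_audio.shape)
```

Giving every source its own RIR is what lets the source positions in the SI matter: in a geometric simulation, each RIR would be derived from the room size, the microphone position, and that source's coordinates.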
Acknowledgment
This work was supported in part by the National Science Foundation of China under Grant U21A20478, in part by the 2022 Industrial Technology Basic Public Service Platform Project of China (No. 2022-228-219), and in part by the Key Laboratory of MIIT for Intelligent Products Testing and Reliability 2023 Key Laboratory Open Project Fund (No. CEPREI2023-01).