DGSNA: Dynamic Generative Scene-based Noise Addition method
Abstract
To ensure the reliable operation of speech systems across diverse environments, noise addition methods have emerged as the standard solution. However, existing methods offer limited coverage of real-world scenes and depend on pre-existing noise libraries and scene metadata. This paper presents prompt-based Dynamic Generative Scene-based Noise Addition (DGSNA), a novel approach driven by generative language models that integrates Dynamic Generation of Scene-based Information (DGSI) with Scene-based Noise Addition for Speech (SNAS). The DGSI module, with a BET (Background, Examples, Task) prompt framework, dynamically generates logic-compliant scene-based information, including scene dimensions, sound sources, and microphone positions, thereby addressing the challenges of scene enumeration and detailed description. Complementing this, the SNAS module employs a Time-Frequency Diffusion-based (TFD) Text-to-Audio model to synthesize scene-specific noise. By integrating this noise with clean speech via Room Impulse Response (RIR) filters, the module streamlines the traditionally labor-intensive process of replicating diverse acoustic environments. Experimental results show that DGSNA significantly enhances the robustness of speech recognition and keyword spotting models, achieving relative improvements of up to 11.32%. Furthermore, DGSNA is highly compatible with existing noise addition techniques.

Fig. 1: The overall framework of the proposed DGSNA method is illustrated below. Within this structure, the BET Prompt Framework for the DGSI module comprises three components: Background (B), Examples (E), and Task (T). Additionally, the TFD-based TTA component utilizes a Time-Frequency Collaborative Audio Encoder to generate scene-specific noise.

Fig. 2: Examples of scene dynamic generation. Initially, the *B* (Background) component involves the user providing a clear and detailed description of the specified scene's design and context (red font). Subsequently, the *E* (Examples) component requires the user to input few-shot prompts, including a text description of the target scene (blue font). Using these inputs, the generative LLM model generates scene-based information that aligns with the predefined task background (green font). Finally, the *T* (Task) component involves the user specifying requirements for the dynamic generation of the scene (purple font).
| Scene | Generated scene-based information |
|---|---|
| Modern metro | An area in the metro [metro] with (20, 4, 6). The microphone at (10, 2, 1.2), the voice at (8, 3, 1.4), the sound of passengers talking [the sound of passengers talking] at (0.5, 2.5, 1.2), the sound of train wheels scraping [the sound of train wheels scraping] at (8, 1.5, 1.6). |
| Bright metro station | An area in the metro station [metro_station] with (8, 7, 5). The microphone at (3, 4.5, 1.2), the voice at (0.5, 3.5, 1.6), the sound of broadcast [the sound of broadcast] at (3, 6.5, 1.2), the sound of announcements [the sound of announcements] at (0.5, 4.5, 1.2). |
| Unused park | An area in the park [park] with (10, 8, 4). The microphone at (5, 3, 1.2), the voice at (3, 1.5, 1.6), the sound of birds chirping [the sound of birds chirping] at (1, 0.5, 1.2), the sound of leaves rustling [the sound of leaves rustling] at (9, 2.5, 1.2). |
| Dark pedestrian street | An area in the pedestrian street [pedestrian_street] with (10, 5, 4). The microphone at (5, 2.5, 1.2), the voice at (7, 1.5, 1.6), the sound of footsteps [the sound of footsteps] at (0.5, 0.5, 1.2), the sound of vehicle [the sound of vehicle] at (8, 4.5, 1.2). |
| Comfortable traffic street | An area in the traffic street [traffic_street] with (10, 5, 3). The microphone at (0.5, 0.5, 1.2), the voice at (5, 2.5, 1.6), the sound of traffic [the sound of traffic] at (0.5, 3, 1.2), the sound of horn [the sound of horn] at (5, 0.5, 2). |
| Quiet balcony | An area in the balcony [balcony] with (6, 2.5, 4). The microphone at (5, 1.5, 1.2), the voice at (6, 1.5, 2), the sound of breeze [the sound of breeze] at (4, 0.5, 2), the sound of birds chirping [the sound of birds chirping] at (6, 1.5, 2). |
| Luxurious bathroom | An area in the bathroom [bathroom] with (9, 6, 4). The microphone at (2.5, 2.5, 1.2), the voice at (6, 2, 1.6), the sound of running water [the sound of running water] at (2.5, 2.5, 2), the sound of shower [the sound of shower] at (6, 2.5, 2), the sound of steam [the sound of steam] at (2.5, 2.5, 2), the sound of music [the sound of music] at (0.5, 2.5, 1.2). |
| Happy kitchen | An area in the kitchen [kitchen] with (6, 4, 4). The microphone at (2.5, 1.5, 1.2), the voice at (1, 1.5, 1.6), the sound of laughter [the sound of laughter] at (2, 2.5, 1.2), the sound of cooking [the sound of cooking] at (2.5, 2.5, 2). |
| Tidy living room | An area in the living room [living_room] with (8, 5, 4). The microphone at (4, 2.5, 1.2), the voice at (1, 0.75, 1.2), the sound of speaking [the sound of speaking] at (0.75, 2.5, 1.2), the sound of TV [the sound of TV] at (4, 4.5, 1.2). |
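Because the generated descriptions follow a fixed template, they can be converted into structured data for the downstream RIR stage with simple pattern matching. The sketch below is a hypothetical parser for that template; `parse_scene` and its output field names are illustrative conveniences, not part of the DGSNA implementation.

```python
import re

# Coordinate triple such as "(2.5, 1.5, 1.2)".
TRIPLE = r"\(\s*([\d.]+)\s*,\s*([\d.]+)\s*,\s*([\d.]+)\s*\)"

def _xyz(match):
    """Convert a regex match of a coordinate triple to a float tuple."""
    return tuple(float(g) for g in match.groups())

def parse_scene(text):
    """Extract scene tag, dimensions, mic/voice positions, and noise sources."""
    tag = re.search(r"\[([a-z_]+)\]", text).group(1)
    dims = _xyz(re.search(r"with\s*" + TRIPLE, text))
    mic = _xyz(re.search(r"microphone at\s*" + TRIPLE, text))
    voice = _xyz(re.search(r"the voice at\s*" + TRIPLE, text))
    noises = [
        (name.strip(), tuple(float(v) for v in xyz))
        for name, *xyz in re.findall(
            r"the sound of ([^\[]+)\[[^\]]*\]\s*at\s*" + TRIPLE, text
        )
    ]
    return {"scene": tag, "dimensions": dims,
            "microphone": mic, "voice": voice, "noises": noises}

example = ("An area in the kitchen [kitchen] with (6, 4, 4). "
           "The microphone at (2.5, 1.5, 1.2), the voice at (1, 1.5, 1.6), "
           "the sound of cooking [the sound of cooking] at (2.5, 2.5, 2).")
info = parse_scene(example)
```

The parsed dictionary then carries everything the RIR filter needs: room dimensions, microphone and voice coordinates, and a list of (noise description, position) pairs.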
Example Analysis
By leveraging the BET prompt framework together with generative LLMs, TTA models, and RIR filters, we can effectively implement DGSNA. This section elucidates the process through which scene-based speech is generated, illustrated by the analysis of a concrete example.

Fig. 3: Workflow for generating scene-based speech. Here are the detailed steps: (a) Original Prompt: The process begins with the input text "noisy pedestrian street." (b) BET Prompt: The BET prompt is adapted to conform to the input-output structure and historical-context management strategy of the Qwen-7B-Chat model. (c) Scene-based Information: The adapted BET prompt is fed into the Qwen-7B-Chat model, which dynamically generates scene-based information. (d) Filter Metrics: The generated scene-based information is checked against predefined filter metrics to confirm its suitability for subsequent processing. (e) TTA Model: The TTA model uses each noise type specified in the scene-based information as a text embedding. These embeddings undergo a denoising procedure to generate precise audio prior representations, which are then converted into actual audio samples by a decoder and a vocoder. (f) Scene-based Noise: Each audio sample is generated at five distinct volume levels, mimicking real-world variations in sound intensity. (g) Original Speech: The clean speech to be augmented is loaded. (h) Scene-based Speech: The RIR filter uses the scene dimensions and source coordinates provided in the scene-based information to construct the RIR. It then convolves both the generated scene-based noise and the original speech with the corresponding RIRs, yielding the scene-based speech.
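Step (h) can be sketched in a few lines: assuming the RIRs for the voice and each noise source have already been simulated from the scene geometry (e.g., by an image-source simulator), the noise and speech are convolved with their RIRs and mixed at a target signal-to-noise ratio. The `add_scene_noise` helper, the toy signals, and the synthetic two-tap RIR below are all illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def add_scene_noise(speech, noise, rir_speech, rir_noise, snr_db=10.0):
    """Convolve speech and noise with their RIRs, then mix at a target SNR."""
    # Apply the room impulse responses (full convolution, trimmed to length).
    s = np.convolve(speech, rir_speech)[: len(speech)]
    n = np.convolve(noise, rir_noise)[: len(speech)]
    if len(n) < len(s):  # loop the noise if it is shorter than the speech
        n = np.tile(n, int(np.ceil(len(s) / len(n))))[: len(s)]
    # Scale the noise so the mixture reaches the requested SNR.
    p_s = np.mean(s ** 2)
    p_n = np.mean(n ** 2) + 1e-12
    n = n * np.sqrt(p_s / (p_n * 10 ** (snr_db / 10.0)))
    return s + n

# Toy stand-ins for real speech, generated noise, and simulated RIRs.
rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)
noise = rng.standard_normal(16000)
rir = np.zeros(800)
rir[0], rir[400] = 1.0, 0.3  # direct path plus one delayed reflection
mix = add_scene_noise(speech, noise, rir, rir, snr_db=10.0)
```

In the actual pipeline each noise source would have its own RIR (one per source position), and the five volume levels from step (f) correspond to varying the SNR at this mixing stage.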

Fig. 4: Visualization of the scene and the spectrogram of the example audio. Squares, inverted triangles, and circles denote distinct types of sound sources and their mirror sources. The number of mirror sources (e.g., 18) is the product of the number of original sound sources (e.g., 3), the number of scene surfaces (e.g., 6), and the maximum specified reflection order (e.g., 1). In outdoor scenes, these mirror sources are treated as additional sound sources of the same type. Counting both original and mirror sources, the scene contains 21 sound sources in total. This inclusion yields a more detailed and accurate simulation, as it accounts for multiple reflections and their acoustic effects within the scene.
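The mirror-source count in the caption follows directly from its counting rule, which is exact for first-order reflections (each source is mirrored once in each surface). A small worked example, with the function name chosen here for illustration:

```python
def source_count(originals, surfaces, max_order):
    """Mirror sources = originals x surfaces x max reflection order
    (the caption's counting rule); total = originals + mirrors."""
    mirrors = originals * surfaces * max_order
    return mirrors, originals + mirrors

# Three original sources in a six-surface (shoebox) scene, first-order only.
mirrors, total = source_count(originals=3, surfaces=6, max_order=1)
print(mirrors, total)  # 18 21
```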
[Figure: spectrograms of the Original Speech, Scene-based Noise 1, 2, and 3, and the resulting Scene-based Speech]




