ThinkSound

Pioneering Audio Generation and Editing with Chain-of-Thought Reasoning

Abstract


Although recent models have made substantial progress in generating audio from video, producing high-fidelity sound that matches the nuances of the visual content remains difficult. Professional sound design requires reasoning about visual cues, acoustics, and timing, a skill that has proven hard to replicate in AI.

This paper introduces ThinkSound, a framework that teaches AI to "think" like a sound designer. Using Chain-of-Thought (CoT) reasoning, ThinkSound breaks the task of audio generation into logical, manageable steps. This supports not only generating sound from scratch but also interactive, object-focused editing and refinement through simple natural language commands. To train this reasoning process, we also present AudioCoT, a first-of-its-kind dataset built for that purpose. Our experiments show that ThinkSound achieves state-of-the-art results in both audio quality and relevance, and performs well even on complex, out-of-distribution movie scenes.

Synergy with Video Generation Models


ThinkSound seamlessly adds rich, synchronized soundscapes to videos created by leading generative models. The videos below were generated by their respective models; all audio was created by ThinkSound.

Veo + ThinkSound

Sora + ThinkSound

MovieGen + ThinkSound

V2A Comparisons on VGGSound (In-distribution)


Each example below compares four versions of the same clip: Ground Truth, ThinkSound, MMAudio, and See&Hear. The CoT text used for each clip is listed next to its title.

Playing Tennis: Generate sounds of tennis hitting a racket and the ball bouncing...
Printer Printing: Generate a continuous printer printing sound with periodic beeps...
Ripping Paper: Start with a subtle tearing sound of paper being ripped...
Using Sewing Machines: Generate ambient sewing room sounds with consistent sewing machine hum...
Playing Bongo: Generate a lively percussion track featuring only rhythmic drum beats...
Chopping Food: Generate rhythmic chopping sounds consistent with cutting meat or vegetables...
People Eating Crisps: Generate audio focusing on clear, rhythmic chewing sounds...

V2A Comparisons on MovieGen Audio (Out-of-Distribution)


See how ThinkSound performs on challenging, out-of-distribution movie clips.

Each example below compares three versions of the same clip: ThinkSound, Movie Gen Audio, and MMAudio. The CoT text used for each clip is listed next to its title.

Gentle Sucking Sounds: Soft, steady background of light pacifier suckling...
Harmonious Strings: Acoustic guitar strings humming and buzzing...
Old TV Humming: Ambient background noise with faint static and white noise...
Intense Thunder: A low wind hum and occasional crackles add to the stormy atmosphere...
High-Pitched Scraping: High-pitched, sustained scraping sound of a tool on a metal rod...
Clattering Metal Keys: Rhythmic sound of an old typewriter, focusing on the sharp metallic clatter...
Skateboard Grinding: Steady rolling on a hard surface, with sharp scraping and grinding sounds...

Interactive Step-by-Step Foley Creation


V2A Gen → Object-Focus → Audio Inpainting

Step 1, V2A generation: "Generate a cheerful ukulele melody with light strumming and harmonious vocals from two young girls singing together."
Output: generated audio, paired with the silent video.

Step 2, object focus: "Now, focus only on the singing and hand movements in the selected region."
Output: generated audio for the selected object.

Step 3, audio inpainting: "Repair the masked (noisy) segment in this audio clip."
Input: audio spectrogram with a masked region. Output: repaired audio, shown as a repaired spectrogram.

V2A Gen → Object-Focus → Audio Editing

Step 1, V2A generation: "Generate gentle wind sounds with consistent warbler chirping."
Output: generated audio, paired with the silent video.

Step 2, object focus: "Focus on the bird, reduce the wind noise, and make the chirping crisp and clear."
Output: generated audio for the selected object.

Step 3, audio editing: "Keep the warbler chirping and add an occasional robin call for contrast."
Input: original audio spectrogram. Output: edited audio, shown as an edited spectrogram.

Experiments


Main Results on VGGSound

ThinkSound outperforms all baselines across most objective metrics and all subjective metrics, achieving substantial improvements in audio quality and semantic alignment.

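Table 1: Comparison of our ThinkSound foundation model...

Objective metrics: FD, KL_PaSST, KL_PaNNs, DeSync, CLAP_cap, CLAP_CoT. Subjective metrics: MOS-Q, MOS-A. Efficiency: Params, Time (s).

| Method | FD ↓ | KL_PaSST ↓ | KL_PaNNs ↓ | DeSync ↓ | CLAP_cap ↑ | CLAP_CoT ↑ | MOS-Q ↑ | MOS-A ↑ | Params | Time (s) ↓ |
|---|---|---|---|---|---|---|---|---|---|---|
| GT | - | - | - | 0.55 | 0.28 | 0.45 | 4.37±0.21 | 4.56±0.19 | - | - |
| See&Hear | 118.95 | 2.26 | 2.30 | 1.20 | 0.32 | 0.35 | 2.75±1.08 | 2.87±0.99 | 415M | 19.42 |
| V-AURA† | 46.99 | 2.23 | 1.83 | 0.65 | 0.23 | 0.37 | 3.42±1.03 | 3.20±1.17 | 695M | 14.00 |
| FoleyCrafter | 39.15 | 2.06 | 1.89 | 1.21 | 0.41 | 0.34 | 3.08±1.21 | 2.63±0.88 | 1.20B | 3.84 |
| Frieren† | 74.96 | 2.55 | 2.64 | 1.00 | 0.37 | 0.34 | 3.27±1.11 | 2.95±1.09 | 159M | - |
| V2A-Mapper† | 48.10 | 2.50 | 2.34 | 1.23 | 0.38 | 0.32 | 3.31±1.02 | 3.16±1.04 | 229M | - |
| MMAudio | 43.26 | 1.65 | 1.40 | 0.44 | 0.31 | 0.40 | 3.84±0.89 | 3.97±0.82 | 1.03B | 3.01 |
| ThinkSound | 34.56 | 1.52 | 1.32 | 0.46 | 0.33 | 0.46 | 4.02±0.73 | 4.18±0.79 | 1.30B | 1.07 |
| w/o CoT Reasoning | 39.84 | 1.59 | 1.40 | 0.48 | 0.29 | 0.41 | 3.91±0.83 | 4.04±0.75 | 1.30B | 0.98 |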
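
To make the size of these gaps concrete, the short sketch below computes the relative reduction in the three distribution-distance metrics (all lower-is-better) for ThinkSound against MMAudio and against the variant without CoT reasoning. The numbers are copied from Table 1; the helper names `relative_reduction` and `compare` are our own and purely illustrative.

```python
# Values copied from Table 1; lower is better for all three metrics.
TABLE1 = {
    "MMAudio": {"FD": 43.26, "KL_PaSST": 1.65, "KL_PaNNs": 1.40},
    "ThinkSound": {"FD": 34.56, "KL_PaSST": 1.52, "KL_PaNNs": 1.32},
    "w/o CoT Reasoning": {"FD": 39.84, "KL_PaSST": 1.59, "KL_PaNNs": 1.40},
}


def relative_reduction(baseline: float, ours: float) -> float:
    """Percentage by which `ours` is lower than `baseline`."""
    return 100.0 * (baseline - ours) / baseline


def compare(baseline_name: str) -> dict:
    """Relative reduction (in percent) of ThinkSound versus one baseline row."""
    base = TABLE1[baseline_name]
    ours = TABLE1["ThinkSound"]
    return {m: round(relative_reduction(base[m], ours[m]), 1) for m in base}


if __name__ == "__main__":
    for name in ("MMAudio", "w/o CoT Reasoning"):
        print(name, compare(name))
```

On these figures, FD drops by roughly 20% relative to MMAudio and by roughly 13% relative to the same model without CoT reasoning, while the KL metrics improve by single-digit percentages.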

Ablation Studies

We investigated the contribution of each component to validate the effectiveness of our design choices, focusing on text encoding and multi-modal integration.

Text Encoding Strategies

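Table 2: Comparison of text encoder fusion strategies...

| Method | FD ↓ | KL_PaSST ↓ | KL_PaNNs ↓ | DeSync ↓ | CLAP ↑ |
|---|---|---|---|---|---|
| CLIP | 39.84 | 1.59 | 1.40 | 0.48 | 0.41 |
| T5 (CoT) | 37.65 | 1.54 | 1.35 | 0.46 | 0.44 |
| CLIP + T5 | 34.56 | 1.52 | 1.32 | 0.46 | 0.46 |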

Multi-Modal Integration

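Table 3: Comparison of multi-modal integration mechanisms

| Integration | FD ↓ | KL_PaSST ↓ | KL_PaNNs ↓ | DeSync ↓ | CLAP ↑ |
|---|---|---|---|---|---|
| audio only | 37.13 | 1.58 | 1.37 | 0.50 | 0.43 |
| linear video | 38.96 | 1.58 | 1.38 | 0.46 | 0.45 |
| gated video | 34.56 | 1.52 | 1.32 | 0.46 | 0.46 |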
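
This page does not spell out how the "linear video" and "gated video" variants are implemented. As a rough illustration of the general idea only, the sketch below contrasts an unconditional additive fusion with a generic sigmoid-gated fusion on toy feature vectors. The function names, the elementwise gate, and the hand-picked gate logits are all assumptions made for this example; in a trained model the gate would be predicted from the features rather than fixed by hand, and the authors' mechanism may differ.

```python
import math
from typing import List

Vector = List[float]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def linear_fusion(audio: Vector, video: Vector) -> Vector:
    """Add the (already projected) video feature to the audio feature unconditionally."""
    return [a + v for a, v in zip(audio, video)]


def gated_fusion(audio: Vector, video: Vector, gate_logits: Vector) -> Vector:
    """Scale each video channel by a gate in (0, 1) before adding it.

    The gate logits are plain inputs here (an assumption for illustration);
    a real model would compute them from the audio and/or video features.
    """
    return [a + sigmoid(g) * v for a, v, g in zip(audio, video, gate_logits)]


if __name__ == "__main__":
    audio = [1.0, 2.0]
    video = [0.5, -4.0]
    gates = [10.0, -10.0]  # first channel almost fully open, second almost closed
    print("linear:", linear_fusion(audio, video))
    print("gated: ", gated_fusion(audio, video, gates))
```

With the gate nearly closed on the second channel, the large video value there barely changes the audio feature, whereas linear fusion lets it through unchanged. This kind of selective injection is one plausible reason a gated variant could behave differently from a linear one.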

Impact of Model Size

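Table 4: Impact of model size.

| Size | FD ↓ | KL_PaSST ↓ | KL_PaNNs ↓ | DeSync ↓ | CLAP_CoT ↑ |
|---|---|---|---|---|---|
| Small | 40.80 | 1.64 | 1.38 | 0.46 | 0.41 |
| Medium | 36.80 | 1.56 | 1.34 | 0.46 | 0.44 |
| Large | 34.56 | 1.52 | 1.32 | 0.46 | 0.46 |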

Frequently Asked Questions


ThinkSound is an advanced AI framework designed to generate and edit audio for videos. Unlike traditional models, it uses a reasoning process called Chain-of-Thought (CoT) to understand the context of a video and create highly relevant, high-quality sound, much like a professional sound designer would.

Chain-of-Thought allows the model to break down a complex task (like "create a soundtrack for this video") into smaller, logical steps. For example, it might first identify the main objects and actions, then reason about the environment's acoustics, and finally decide on the appropriate sounds and their timing. This step-by-step process leads to more accurate and contextually aware audio generation.
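
The three steps described above can be pictured as a small structured plan that is filled in stage by stage and then flattened into text. The sketch below is only an illustration of that idea: the `SoundEvent` and `CoTPlan` classes, their fields, and the example timings are hypothetical and are not taken from ThinkSound or AudioCoT.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SoundEvent:
    source: str        # what produces the sound
    description: str   # what it should sound like
    start_s: float     # illustrative onset time in seconds
    end_s: float       # illustrative offset time in seconds


@dataclass
class CoTPlan:
    objects_and_actions: List[str]   # step 1: what is visible and happening
    acoustics: str                   # step 2: the environment's acoustics
    events: List[SoundEvent] = field(default_factory=list)  # step 3: sounds and timing

    def to_text(self) -> str:
        """Flatten the plan into a step-by-step description, ordered by onset time."""
        lines = [
            "Objects and actions: " + ", ".join(self.objects_and_actions),
            "Acoustics: " + self.acoustics,
        ]
        for e in sorted(self.events, key=lambda ev: ev.start_s):
            lines.append(f"{e.start_s:.1f}-{e.end_s:.1f}s: {e.description} ({e.source})")
        return "\n".join(lines)


if __name__ == "__main__":
    plan = CoTPlan(
        objects_and_actions=["player swinging a racket", "ball bouncing"],
        acoustics="outdoor hard court, little reverberation",
        events=[
            SoundEvent("ball", "bounce on the court", 1.2, 1.4),
            SoundEvent("racket", "sharp hit on the strings", 0.5, 0.7),
        ],
    )
    print(plan.to_text())
```

Keeping the stages separate makes each intermediate decision inspectable, which is the property the step-by-step description above emphasizes.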

Three things set ThinkSound apart: 1) Its use of CoT reasoning for more intelligent sound creation. 2) Its interactivity, which lets users edit audio, focus on specific objects, and refine the sound using natural language. 3) Its training data, AudioCoT, a dataset built specifically for this kind of reasoning-based audio generation.

You can try ThinkSound yourself: we provide an interactive demo on Hugging Face Spaces, linked at the top of this page, and the source code is available on GitHub if you want to run the model.