The Sound of Simulation: Learning Multimodal Sim-to-Real Robot Policies with Generative Audio

Renhao Wang; Haoran Geng; Tingle Li; Feishi Wang; Gopala Anumanchipalli; Trevor Darrell; Boyi Li; Pieter Abbeel; Jitendra Malik; Alexei A. Efros

doi:10.48550/arxiv.2507.02864

Verified authors • Institutional access • DOI aware

50,000+ researchers120,000+ datasets90% satisfaction

Preprint

2025

The Sound of Simulation: Learning Multimodal Sim-to-Real Robot Policies with Generative Audio

0 Datasets

0 Files

2025

DOI: 10.48550/arxiv.2507.02864 arxiv.org/abs/2507.02864

Get instant academic access to this publication’s datasets.

Create free account How it works

Frequently asked questions

Is access really free for academics and students?

Yes. After verification, you can browse and download datasets at no cost. Some premium assets may require author approval.

How is my data protected?

Files are stored on encrypted storage. Access is restricted to verified users and all downloads are logged.

Can I request additional materials?

Yes, message the author after sign-up to request supplementary files or replication code.

Advance your research today

Join 50,000+ researchers worldwide. Get instant access to peer-reviewed datasets, advanced analytics, and global collaboration tools.

Get free academic access Learn more

✓ Immediate verification • ✓ Free institutional access • ✓ Global collaboration

Robots must integrate multiple sensory modalities to act effectively in the real world. Yet, learning such multimodal policies at scale remains challenging. Simulation offers a viable solution, but while vision has benefited from high-fidelity simulators, other modalities (e.g. sound) can be notoriously difficult to simulate. As a result, sim-to-real transfer has succeeded primarily in vision-based tasks, with multimodal transfer still largely unrealized. In this work, we tackle these challenges by introducing MultiGen, a framework that integrates large-scale generative models into traditional physics simulators, enabling multisensory simulation. We showcase our framework on the dynamic task of robot pouring, which inherently relies on multimodal feedback. By synthesizing realistic audio conditioned on simulation video, our method enables training on rich audiovisual trajectories -- without any real robot data. We demonstrate effective zero-shot transfer to real-world pouring with novel containers and liquids, highlighting the potential of generative modeling to both simulate hard-to-model modalities and close the multimodal sim-to-real gap.

The Sound of Simulation: Learning Multimodal Sim-to-Real Robot Policies with Generative Audio

Frequently asked questions

Is access really free for academics and students?

How is my data protected?

Can I request additional materials?

Advance your research today

The Sound of Simulation: Learning Multimodal Sim-to-Real Robot Policies with Generative Audio

Frequently asked questions

Is access really free for academics and students?

How is my data protected?

Can I request additional materials?

Advance your research today

Access Research Data

This PDF is not available in different languages.

Jitendra Malik

Abstract

How to cite this publication

Related publications

Why join Raw Data Library?

Quality

Control

Free for Academia

Publication Details

Join Research Community