1st Workshop on Multimodalities for 3D Scenes

CVPR 2024 Workshop


Human sensory experiences such as vision, audio, touch, and smell are the natural interfaces through which we perceive the world and reason about our environments. Understanding the 3D environments around us is important for many applications, such as video processing, robotics, and augmented reality. While there have been many efforts to understand 3D scenes in recent years, most works and workshops focus mainly on vision. However, vision alone does not fully capture the properties of 3D scenes, e.g., the materials of objects and surfaces, the affordances of objects, and their acoustic properties. In addition, humans use language to describe 3D scenes, so understanding 3D scenes from language is also of vital importance. We believe the future lies in modeling and understanding 3D scenes and objects with rich multi-sensory inputs, including but not limited to vision, language, audio, and touch. The goal of this workshop is to unite researchers from these different sub-communities and move toward scene understanding with multiple modalities. We aim to share recent progress in multimodal scene understanding and to discuss which directions the field should investigate next.

Call For Papers

We invite non-archival papers of up to 8 pages (in CVPR format) on tasks at the intersection of multiple modalities and 3D scene and object understanding in real-world environments. Paper topics may include but are not limited to:

  • 3D Visual Grounding
  • 3D Dense Captioning
  • 3D Question Answering
  • Audio-visual 3D scene reconstruction and mapping
  • Modeling scene acoustics from visuals
  • Material prediction with visual/audio/tactile inputs
  • Implicit multimodal neural fields of 3D scenes and objects
  • Multimodal simulation of 3D scenes and objects

Submission: We encourage submissions of up to 8 pages, excluding references and acknowledgements, in CVPR format. Reviewing will be single-blind. Accepted papers will be made publicly available as non-archival reports, allowing future submission to archival conferences or journals. We also welcome already published papers that are within the scope of the workshop (without re-formatting), including papers from the main CVPR conference. Please submit your paper by the deadline to the following address: multimodalities3dscenes@gmail.com. Please mention in your email if your submission has already been accepted for publication, along with the name of the venue.

Important Dates

Paper submission deadline: April 15, 2024
Notification to authors: April 22, 2024
Camera-ready deadline: April 29, 2024
Workshop date: June 17, 2024


Schedule

Welcome 2:00pm - 2:05pm
Invited Talk 2:05pm - 2:30pm
Invited Talk 2:30pm - 2:55pm
Poster session / Coffee break 3:00pm - 3:25pm
Invited Talk 3:30pm - 4:00pm
Invited Talk 4:00pm - 4:25pm
Paper spotlights 4:30pm - 4:55pm
Invited Talk 5:00pm - 5:40pm
Concluding Remarks 5:40pm - 5:50pm

Invited Speakers

Xiaolong Wang is an Assistant Professor in the ECE Department at the University of California, San Diego. He is affiliated with the CSE Department, the Center for Visual Computing, the Contextual Robotics Institute, the Artificial Intelligence Group, and the TILOS NSF AI Institute. He received his Ph.D. in Robotics at Carnegie Mellon University and completed his postdoctoral training at the University of California, Berkeley. His research focuses on the intersection of computer vision and robotics. He is particularly interested in learning visual representations from videos in a self-supervised manner and in using these representations to guide robot learning. Xiaolong has served as an Area Chair for CVPR, AAAI, and ICCV.

Andrew Owens is an Assistant Professor at the University of Michigan in the Department of Electrical Engineering and Computer Science. Prior to that, he was a postdoctoral scholar at UC Berkeley. He received a Ph.D. in Electrical Engineering and Computer Science from MIT in 2016. He is a recipient of a Computer Vision and Pattern Recognition (CVPR) Best Paper Honorable Mention Award.

Katerina Fragkiadaki is an Assistant Professor in the Machine Learning Department at Carnegie Mellon University. Prior to joining MLD's faculty, she worked as a postdoctoral researcher, first at UC Berkeley with Jitendra Malik and then at Google Research in Mountain View with the video group. Katerina is interested in building machines that understand the stories that videos portray, and, inversely, in using videos to teach machines about the world. The penultimate goal is to build a machine that understands movie plots, and the ultimate goal is to build a machine that would want to watch Bergman over this.

Linda Smith is a cognitive scientist recognized for her work on early object name learning as a form of statistical learning. Smith co-discovered infants' few-shot learning of object names, showed that few-shot learning is itself learned, and documented the relevant experiences. Smith was born and grew up in Portsmouth, New Hampshire. She graduated from the University of Wisconsin (Madison) with a Bachelor of Science degree in experimental psychology and from the University of Pennsylvania with a Ph.D. in experimental psychology. She joined the faculty of Indiana University (Bloomington) in 1977 and is in the Department of Psychological and Brain Sciences and the Program in Cognitive Science. She won the David E. Rumelhart Prize for theoretical contributions to cognitive science and is a member of both the National Academy of Sciences and the American Academy of Arts and Sciences.


Organizers

Changan Chen
UT Austin
Angel X. Chang
Simon Fraser University
Alexander Richard
Meta Reality Labs Research
Kristen Grauman
UT Austin, FAIR


To contact the organizers please use language3dscenes@gmail.com


Thanks to visualdialog.org for the webpage format.