1st Workshop on Multimodalities for 3D Scenes
CVPR 2024 Workshop
Introduction
Human senses such as vision, hearing, touch, and smell are our natural interfaces for perceiving the world and reasoning about our environments. Understanding the 3D environments around us is important for many applications, such as video processing, robotics, and augmented reality. While there has been significant effort on understanding 3D scenes in recent years, most works and workshops focus mainly on using vision. However, vision alone does not fully capture the properties of 3D scenes, e.g., the materials of objects and surfaces, the affordances of objects, and the acoustic properties. In addition, humans use language to describe 3D scenes, so understanding 3D scenes from language is also of vital importance. We believe the future lies in modeling and understanding 3D scenes and objects with rich multi-sensory inputs, including but not limited to vision, language, audio, and touch. The goal of this workshop is to unite researchers from these different sub-communities and move toward scene understanding with multiple modalities. We want to share the recent progress in multimodal scene understanding and to discuss which directions the field should investigate next.
Call For Papers
Call for papers: We invite non-archival papers of up to 8 pages (in CVPR format) on tasks at the intersection of multimodalities and 3D object understanding in real-world scenes. Paper topics may include but are not limited to:
- 3D Visual Grounding
- 3D Dense Captioning
- 3D Question Answering
- Audio-visual 3D scene reconstruction and mapping
- Modeling scene acoustics from visuals
- Material prediction with visual/audio/tactile inputs
- Implicit multimodal neural field of 3D scenes and objects
- Multimodal simulation of 3D scenes and objects
Submission: We encourage submissions of up to 8 pages, excluding references and acknowledgements, in the CVPR format. Reviewing will be single-blind. Accepted papers will be made publicly available as non-archival reports, allowing future submission to archival conferences or journals. We also welcome already published papers that are within the scope of the workshop (without re-formatting), including papers from the main CVPR conference. Please submit your paper to multimodalities3dscenes@gmail.com by the deadline. Please mention in your email if your submission has already been accepted for publication, and if so, the name of the venue.
Accepted Papers
Presentation instructions: every presenter should prepare a 5-minute video presentation of their paper. Please send the slides to multimodalities3dscenes@gmail.com before June 17th. On the day of the workshop, the presenters will be called in the order below. Each presenter will have 5 minutes to present their work, followed by Q&A from the audience. In addition, each paper may also be presented in the poster session (#68 - 77).
#1. Posterior Distillation Sampling | Juil Koo, Chanho Park, Minhyuk Sung
#2. DL3DV-10K: A Large-Scale Scene Dataset for Deep Learning-based 3D Vision | Lu Ling, Yichen Sheng, Zhi Tu, Wentian Zhao, Cheng Xin, Kun Wan, Lantao Yu, Qianyu Guo, Zixun Yu, Yawen Lu, Xuanmao Li, Xingpeng Sun, Rohan Ashok, Aniruddha Mukherjee, Hao Kang, Xiangrui Kong, Gang Hua, Tianyi Zhang, Bedrich Benes, Aniket Bera
#3. Physical Property Understanding from Language-Embedded Feature Fields | Albert J. Zhai, Yuan Shen, Emily Y. Chen, Gloria X. Wang, Xinlei Wang, Sheng Wang, Kaiyu Guan, Shenlong Wang
#4. 3D-LLM: Injecting the 3D World into Large Language Models | Yining Hong, Haoyu Zhen, Peihao Chen, Shuhong Zheng, Yilun Du, Zhenfang Chen, Chuang Gan
#5. 3DifFusionDet: Diffusion Model for 3D Object Detection with Robust LiDAR-Camera Fusion | Xinhao Xiang, Simon Dräger, Jiawei Zhang
#6. Evaluating Zero-Shot GPT-4V Performance on 3D Visual Question Answering Benchmarks | Simranjit Singh, Georgios Pavlakos, Dimitrios Stamoulis
#7. Real Acoustic Fields: An Audio-Visual Room Acoustics Dataset and Benchmark | Ziyang Chen, Israel D. Gebru, Christian Richardt, Anurag Kumar, William Laney, Andrew Owens, Alexander Richard
Important Dates
- Paper submission deadline: April 15, 2024
- Notifications to accepted papers: April 30, 2024
- Paper camera ready: May 12, 2024
- Workshop date: June 17, 2024
Schedule
- 1:50pm - 2:00pm: Welcome
- 2:00pm - 2:30pm: Invited Talk, Linda Smith, "The Natural Statistics for Learning about 3D Shape"
- 2:30pm - 3:00pm: Invited Talk, Xiaolong Wang, "Generalizable 3D Spatial Perception and Control"
- 3:00pm - 3:30pm: Poster session / Coffee break
- 3:30pm - 4:00pm: Invited Talk, Katerina Fragkiadaki, "Unified 2D/3D Foundation Models across Images, Language and Actions"
- 4:00pm - 4:30pm: Invited Talk, Andrew Owens, "Tactile-augmented Radiance Fields"
- 4:30pm - 5:15pm: Paper Talks
- 5:15pm - 5:20pm: Concluding Remarks
Invited Speakers
Xiaolong Wang is an Assistant Professor in the ECE department at the University of California, San Diego. He received his Ph.D. in Robotics at Carnegie Mellon University. His postdoctoral training was at the University of California, Berkeley. His research focuses on the intersection between computer vision and robotics. His specific interest lies in learning 3D and dynamics representations from videos and physical robotic interaction data. These comprehensive representations are utilized to facilitate the learning of human-like robot skills, with the goal of generalizing the robot to interact effectively with a wide range of objects and environments in the real physical world. He is the recipient of the NSF CAREER Award, Intel Rising Star Faculty Award, and Research Awards from Sony, Amazon, Adobe, and Cisco.
Andrew Owens is an assistant professor at the University of Michigan in the Department of Electrical Engineering and Computer Science. Prior to that, he was a postdoctoral scholar at UC Berkeley. He received a Ph.D. in Electrical Engineering and Computer Science from MIT in 2016. He is a recipient of an NSF CAREER Award, a Computer Vision and Pattern Recognition (CVPR) Best Paper Honorable Mention Award, and a Microsoft Research Ph.D. Fellowship.
Katerina Fragkiadaki is the JPMorgan Chase Associate Professor in the Machine Learning Department at Carnegie Mellon University. She received her undergraduate diploma in Electrical and Computer Engineering from the National Technical University of Athens. She received her Ph.D. from the University of Pennsylvania and was subsequently a postdoctoral fellow at UC Berkeley and Google Research. Her work focuses on combining forms of common sense reasoning, such as spatial understanding and 3D scene understanding, with deep visuomotor learning. The goal of her work is to enable few-shot learning and continual learning for perception, action, and language grounding. Her group develops methods for computer vision for mobile agents, 2D and 3D visual parsing, 2D-to-3D perception, vision-language grounding, and learning of object dynamics, navigation, and manipulation policies. Pioneering innovations of her group's research include 2D-to-3D geometry-aware neural networks for 3D understanding from 2D video streams, analogy-forming networks for memory-augmented few-shot visual parsing, and language grounding in 2D and 3D scenes with bottom-up and top-down attention. Her work has been recognized with a best Ph.D. thesis award, an NSF CAREER award, an AFOSR Young Investigator award, a DARPA Young Investigator award, and Google, TRI, Amazon, UPMC, and Sony faculty research awards. She is a program chair for ICLR 2024.
Linda Smith is a cognitive scientist recognized for her work on early object name learning as a form of statistical learning. Smith co-discovered infants' few-shot learning of object names, showed that few-shot learning is itself learned, and documented the relevant experiences. Smith was born and grew up in Portsmouth, New Hampshire. She graduated from the University of Wisconsin (Madison) with a Bachelor of Science degree in experimental psychology and from the University of Pennsylvania with a Ph.D. in experimental psychology. She joined the faculty of Indiana University (Bloomington) in 1977 and is in the Department of Psychological and Brain Sciences and the Program in Cognitive Science. She won the David E. Rumelhart Prize for theoretical contributions to cognitive science and is a member of both the National Academy of Sciences and the American Academy of Arts and Sciences.
Organizers
![](static/img/people/changan.png)
UT Austin
![](static/img/people/angel.jpg)
Simon Fraser University
![](static/img/people/krishna.png)
MIT
![](static/img/people/alex.jpg)
Meta Reality Labs Research
![](static/img/people/kristen.jpg)
UT Austin, FAIR
Contact
To contact the organizers, please use language3dscenes@gmail.com.
Acknowledgments
Thanks to visualdialog.org for the webpage format.