The main contributions of this work are as follows:
We propose the first 3D multimodal large language model designed for embodied interaction;
We introduce 3D MM-Vet, a 3D question-answering benchmark covering diverse test scenarios such as single-view input and noise jitter;
For point cloud encoding, we propose the ReCon++ point cloud encoder architecture, which surpasses prior work on multiple representation-learning tasks.
Overall Architecture
The main goal of ShapeLLM is to enable interactive 3D understanding by using a large language model (LLM) as a universal interface. Its architecture consists of a pretrained 3D encoder for 3D representation learning and an LLM for 3D understanding. ShapeLLM adopts a new model, ReCon++, as its 3D encoder; it improves on the existing ReCon model [1] in several ways to meet 3D understanding's demand for precise spatial and multi-view detail, while LLaMA serves as the LLM component. The 3D object representations produced by ReCon++ are passed through a linear projection before being fed into the LLM, ensuring compatibility with the LLM's embedding space. To strengthen low-level geometric understanding in tasks such as 6-DoF pose estimation, the method also introduces an absolute position encoding (APE) obtained by linearly projecting the 3D coordinates.
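The data flow described above can be sketched in a few lines of numpy. This is a toy stand-in, not the real ReCon++ or LLaMA: the weight matrices, the dimensions, and the single ReLU "encoder" are all placeholders chosen only to show how the encoder output, the linear projector, and the coordinate-based APE combine into LLM prefix tokens.

```python
import numpy as np

rng = np.random.default_rng(0)

enc_dim, llm_dim = 512, 4096  # illustrative sizes, not the paper's
# Random stand-in weights for the 3D encoder, the projector, and the APE.
W_enc = rng.standard_normal((3, enc_dim)) * 0.02
W_proj = rng.standard_normal((enc_dim, llm_dim)) * 0.02
W_ape = rng.standard_normal((3, llm_dim)) * 0.02

def shape_llm_prefix(xyz):
    """xyz: (N, 3) point coordinates -> (N, llm_dim) tokens for the LLM."""
    tokens = np.maximum(xyz @ W_enc, 0.0)  # stand-in for ReCon++ features
    tokens = tokens @ W_proj               # linear projection into LLM space
    tokens = tokens + xyz @ W_ape          # APE: linear projection of 3D coords
    return tokens

pts = rng.standard_normal((1024, 3))
prefix = shape_llm_prefix(pts)
print(prefix.shape)  # (1024, 4096)
```

The key design choice visible even in this sketch is that the APE is added after projection, so raw coordinate information reaches the LLM directly rather than only through the encoder's learned features.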
ReCon++: Improving 3D Representation Learning
Figure 2: Data samples from the benchmark
Fine-tuning for 3D Object Recognition
After fine-tuning on two challenging 3D object datasets, ScanObjectNN [5] and ModelNet [6], ReCon++ demonstrates excellent representation transfer learning. With self-supervised pretraining followed by an intermediate fine-tuning strategy, ReCon++ achieves 95.25% accuracy on the PB_T50_RS split of ScanObjectNN, a 16.14% improvement over the Transformer baseline, and surpasses prior work across multiple benchmarks.
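The three-stage recipe above (self-supervised pretraining, intermediate fine-tuning on a related labeled set, then final fine-tuning on the target benchmark) amounts to warm-starting each stage from the previous stage's weights. The toy classifier below illustrates only that warm-starting mechanism; the model, data, and hyperparameters are placeholders with no relation to ReCon++.

```python
import numpy as np

rng = np.random.default_rng(1)

def train_softmax(X, y, W=None, lr=0.1, steps=200):
    """Tiny softmax classifier; passing a previous-stage `W` warm-starts
    training, which is the essence of intermediate fine-tuning here."""
    n_cls = int(y.max()) + 1
    if W is None:
        W = np.zeros((X.shape[1], n_cls))
    for _ in range(steps):
        logits = X @ W
        p = np.exp(logits - logits.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)
        p[np.arange(len(y)), y] -= 1          # softmax cross-entropy gradient
        W = W - lr * X.T @ p / len(y)
    return W

# Stage 1: stand-in for (self-supervised) pretraining on a source task.
Xa, ya = rng.standard_normal((256, 16)), rng.integers(0, 4, 256)
W = train_softmax(Xa, ya)
# Stage 2: intermediate fine-tuning on a related labeled set (e.g. ModelNet).
Xb, yb = rng.standard_normal((256, 16)), rng.integers(0, 4, 256)
W = train_softmax(Xb, yb, W=W)
# Stage 3: final fine-tuning on the target benchmark (e.g. ScanObjectNN).
Xc, yc = rng.standard_normal((256, 16)), rng.integers(0, 4, 256)
W = train_softmax(Xc, yc, W=W)
acc = (np.argmax(Xc @ W, axis=1) == yc).mean()
```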
Figure 3 shows ShapeLLM-13B handling single-view point cloud input, where it exhibits strong robustness to occlusion. This property is crucial for real-world applications, since single-view point clouds can be easily captured with an RGB-D camera.
Figure 3: Example of 3D multimodal dialogue with an occluded, single-view point cloud as input
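Obtaining such a single-view point cloud from an RGB-D frame is a standard pinhole back-projection. The sketch below shows the computation with numpy; the intrinsics (fx, fy, cx, cy) and the toy depth map are illustrative values, not tied to any particular camera.

```python
import numpy as np

def depth_to_pointcloud(depth, fx, fy, cx, cy):
    """Back-project a depth map (H, W), in meters, into an (M, 3) point cloud.

    fx, fy are the focal lengths and (cx, cy) the principal point of a
    pinhole camera. Pixels with zero depth (no sensor return) are dropped,
    which is exactly why a single view yields a partial, occluded cloud.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinates
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    pts = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return pts[pts[:, 2] > 0]

# Toy 4x4 depth map: a flat surface 1 m away with two missing readings.
depth = np.ones((4, 4))
depth[0, 0] = depth[3, 3] = 0.0
cloud = depth_to_pointcloud(depth, fx=500.0, fy=500.0, cx=2.0, cy=2.0)
print(cloud.shape)  # (14, 3)
```

Only the 14 pixels with valid depth survive, so the resulting cloud covers just the visible surface, the kind of partial input Figure 3 tests.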
Figure 4: Examples of part understanding on unseen objects
-- End--
[1] Zekun Qi, Runpei Dong, Guofan Fan, Zheng Ge, Xiangyu Zhang, Kaisheng Ma, Li Yi. Contrast with reconstruct: Contrastive 3D representation learning guided by generative pretraining. In International Conference on Machine Learning (ICML), pages 28223-28243, 2023.
[2] Yining Hong, Haoyu Zhen, Peihao Chen, Shuhong Zheng, Yilun Du, Zhenfang Chen, Chuang Gan. 3D-LLM: Injecting the 3D world into large language models. In Advances in Neural Information Processing Systems (NeurIPS), pages 20482-20494, 2023.
[3] Matt Deitke, Dustin Schwenk, Jordi Salvador, Luca Weihs, Oscar Michel, Eli VanderBilt, Ludwig Schmidt, Kiana Ehsani, Aniruddha Kembhavi, Ali Farhadi. Objaverse: A universe of annotated 3D objects. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 13142-13153, 2023.
[4] Haoran Geng, Helin Xu, Chengyang Zhao, Chao Xu, Li Yi, Siyuan Huang, He Wang. GAPartNet: Cross-category domain-generalizable object perception and manipulation via generalizable and actionable parts. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 7081-7091, 2023.
[5] Mikaela Angelina Uy, Quang-Hieu Pham, Binh-Son Hua, Duc Thanh Nguyen, Sai-Kit Yeung. Revisiting point cloud classification: A new benchmark dataset and classification model on real-world data. In International Conference on Computer Vision (ICCV), pages 1588-1597, 2019.
[6] Zhirong Wu, Shuran Song, Aditya Khosla, Fisher Yu, Linguang Zhang, Xiaoou Tang, Jianxiong Xiao. 3D ShapeNets: A deep representation for volumetric shapes. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 1912-1920, 2015.