Paper Sharing | Research Progress on Multimodal Large Models
From the 37 papers published between 2024-12-31 and 2025-01-04, we have selected 5 outstanding works to share with our readers.
1. Predicate Invention from Pixels via Pretrained Vision-Language Models
2. CultureVLM: Characterizing and Improving Cultural Understanding of Vision-Language Models for over 100 Countries
3. Decoding the Flow: CauseMotion for Emotional Causality Analysis in Long-form Conversations
4. Exploring the Implicit Semantic Ability of Multimodal Large Language Models: A Pilot Study on Entity Set Expansion
5. GePBench: Evaluating Fundamental Geometric Perception for Multimodal Large Language Models
1. Predicate Invention from Pixels via Pretrained Vision-Language Models
Authors: Ashay Athalye, Nishanth Kumar, Tom Silver, Yichao Liang, Tomás Lozano-Pérez, Leslie Pack Kaelbling
https://arxiv.org/abs/2501.00296
Abstract
Our aim is to learn to solve long-horizon decision-making problems in highly-variable, combinatorially-complex robotics domains given raw sensor input in the form of images. Previous work has shown that one way to achieve this aim is to learn a structured abstract transition model in the form of symbolic predicates and operators, and then plan within this model to solve novel tasks at test time. However, these learned models do not ground directly into pixels from just a handful of demonstrations. In this work, we propose to invent predicates that operate directly over input images by leveraging the capabilities of pretrained vision-language models (VLMs). Our key idea is that, given a set of demonstrations, a VLM can be used to propose a set of predicates that are potentially relevant for decision-making and then to determine the truth values of these predicates in both the given demonstrations and new image inputs. We build upon an existing framework for predicate invention, which generates feature-based predicates operating on object-centric states, to also generate visual predicates that operate on images. Experimentally, we show that our approach — pix2pred — is able to invent semantically meaningful predicates that enable generalization to novel, complex, and long-horizon tasks across two simulated robotic environments.
Brief Review
This paper proposes pix2pred, a method for inventing the predicates a robot needs for important decision-making tasks. The approach uses a pretrained vision-language model (VLM) to propose decision-relevant predicates and grounds them directly in raw image data so that planning can be carried out over the invented abstraction. It builds on prior work on predicate invention but extends it to visual predicates that operate on images, and demonstrates its applicability in simulated environments. At its core, the paper tackles a major challenge in robot decision-making: long-horizon planning from raw visual input. It further innovates by using a VLM both to generate and to evaluate decision-relevant predicates, and a comprehensive evaluation shows the effectiveness of the proposed algorithm. Overall, the paper offers a novel route for leveraging vision-language models to improve robot decision-making, validates its effectiveness, and provides a valuable reference framework for future research.
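To make the two roles the VLM plays here concrete, the following is a minimal sketch of a propose-then-ground loop of the kind described above. It is an illustrative reconstruction, not the authors' pix2pred code: `query_vlm`, the prompts, and the predicate names are assumptions, and the VLM call is mocked so the example runs end to end.

```python
# Sketch of VLM-based predicate invention: (1) ask a VLM to propose candidate
# predicates from demonstrations, (2) ask it to evaluate their truth in images.
from dataclasses import dataclass


def query_vlm(prompt: str, image: bytes | None = None) -> str:
    """Placeholder for a pretrained VLM call; swap in a real client here."""
    if prompt.startswith("List"):
        return "On(block1, block2)\nHolding(gripper, block1)"
    return "true"


@dataclass
class Predicate:
    name: str  # e.g. "On(block1, block2)"


def propose_predicates(demo_images: list[bytes], task_goal: str) -> list[Predicate]:
    # Step 1: ask the VLM for predicates that look relevant to the demonstrated task.
    prompt = f"List visual predicates relevant to achieving: {task_goal}"
    raw = query_vlm(prompt, demo_images[0])
    return [Predicate(line.strip()) for line in raw.splitlines() if line.strip()]


def evaluate_predicate(pred: Predicate, image: bytes) -> bool:
    # Step 2: ask the VLM whether the predicate holds in a given image.
    answer = query_vlm(f"In this image, is '{pred.name}' true? Answer true/false.", image)
    return answer.strip().lower().startswith("t")


if __name__ == "__main__":
    demos = [b"<image-0>", b"<image-1>"]  # raw demonstration frames
    candidates = propose_predicates(demos, "stack block1 on block2")
    # Ground each candidate predicate in every demonstration frame.
    truth_table = {p.name: [evaluate_predicate(p, img) for img in demos] for p in candidates}
    print(truth_table)
```

The resulting truth table is the kind of symbolic state representation over which a planner can then operate.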
2. CultureVLM: Characterizing and Improving Cultural Understanding of Vision-Language Models for over 100 Countries
Authors: Shudong Liu, Yiqiao Jin, Cheng Li, Derek F. Wong, Qingsong Wen, Lichao Sun, Haipeng Chen, Xing Xie, Jindong Wang
https://arxiv.org/abs/2501.01282
Abstract
Vision-language models (VLMs) have advanced human-AI interaction but struggle with cultural understanding, often misinterpreting symbols, gestures, and artifacts due to biases in predominantly Western-centric training data. In this paper, we construct CultureVerse, a large-scale multimodal benchmark covering 19,682 cultural concepts, 188 countries/regions, 15 cultural concepts, and 3 question types, with the aim of characterizing and improving VLMs’ multicultural understanding capabilities. Then, we propose CultureVLM, a series of VLMs fine-tuned on our dataset to achieve significant performance improvement in cultural understanding. Our evaluation of 16 models reveals significant disparities, with a stronger performance in Western concepts and weaker results in African and Asian contexts. Fine-tuning on our CultureVerse enhances cultural perception, demonstrating cross-cultural, cross-continent, and cross-dataset generalization without sacrificing performance on models’ general VLM benchmarks. We further present insights on cultural generalization and forgetting. We hope that this work could lay the foundation for more equitable and culturally aware multimodal AI systems.
Brief Review
CultureVerse is a large-scale multimodal benchmark designed to improve the cultural understanding of vision-language models (VLMs) across different countries and regions. It addresses the limitation that existing VLMs typically rely on Western-centric training data, and introduces the CultureVLM series of models fine-tuned on the benchmark to strengthen cultural perception. The evaluation highlights significant performance gaps between models depending on cultural context, underscoring the importance of diverse training data for enhancing cultural understanding.
The cross-cultural evaluation further shows that fine-tuning on CultureVerse yields cross-cultural, cross-continent, and cross-dataset generalization without sacrificing performance on general VLM benchmarks, and the authors additionally analyze cultural generalization and forgetting. These findings provide solid support for deeper research into culturally aware multimodal understanding and help promote cross-cultural communication.
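The regional disparities reported in the paper come from aggregating multiple-choice accuracy by culture or region. The sketch below shows one simple way such a per-region evaluation could be organized; it is not the authors' code, `ask_vlm` is a mocked placeholder for a real model call, and the sample items are invented for illustration.

```python
# Per-region accuracy over multiple-choice cultural questions (illustrative only).
from collections import defaultdict


def ask_vlm(question: str, options: list[str], image_path: str) -> str:
    """Placeholder for a real VLM call; here it always picks the first option."""
    return options[0]


def evaluate_by_region(items: list[dict]) -> dict[str, float]:
    correct, total = defaultdict(int), defaultdict(int)
    for item in items:  # each item: question, options, answer, region, image
        pred = ask_vlm(item["question"], item["options"], item["image"])
        total[item["region"]] += 1
        correct[item["region"]] += int(pred == item["answer"])
    return {region: correct[region] / total[region] for region in total}


if __name__ == "__main__":
    sample = [
        {"question": "Which festival is shown?", "options": ["Songkran", "Holi"],
         "answer": "Songkran", "region": "Asia", "image": "thai_festival.jpg"},
        {"question": "What does this gesture mean?", "options": ["Greeting", "Farewell"],
         "answer": "Greeting", "region": "Africa", "image": "gesture.jpg"},
    ]
    print(evaluate_by_region(sample))  # per-region accuracy for the two toy items
```

Grouping scores this way is what makes the gap between, say, Western and African or Asian concepts visible in the first place.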
3. Decoding the Flow: CauseMotion for Emotional Causality Analysis in Long-form Conversations
Authors: Yuxuan Zhang, Yulong Li, Zichen Yu, Feilong Tang, Zhixiang Lu, Chong Li, Kang Dang, Jionglong Su
https://arxiv.org/abs/2501.00778
Abstract
Long-sequence causal reasoning seeks to uncover causal relationships within extended time series data but is hindered by complex dependencies and the challenges of validating causal links. To address the limitations of large-scale language models (e.g., GPT-4) in capturing intricate emotional causality within extended dialogues, we propose CauseMotion, a long-sequence emotional causal reasoning framework grounded in Retrieval-Augmented Generation (RAG) and multimodal fusion. Unlike conventional methods relying only on textual information, CauseMotion enriches semantic representations by incorporating audio-derived features—vocal emotion, emotional intensity, and speech rate—into textual modalities. By integrating RAG with a sliding window mechanism, it effectively retrieves and leverages contextually relevant dialogue segments, thus enabling the inference of complex emotional causal chains spanning multiple conversational turns. To evaluate its effectiveness, we constructed the first benchmark dataset dedicated to long-sequence emotional causal reasoning, featuring dialogues with over 70 turns. Experimental results demonstrate that the proposed RAG-based multimodal integrated approach significantly enhances both the depth of emotional understanding and the causal inference capabilities of large-scale language models. A GLM-4 integrated with CauseMotion achieves an 8.7% improvement in causal accuracy over the original model and surpasses GPT-4 by 1.2%. Additionally, on the publicly available DiaASQ dataset, CauseMotion-GLM-4 achieves state-of-the-art results in accuracy, F1 score, and causal reasoning accuracy.
Brief Review
This paper presents CauseMotion, an emotional causality analysis framework aimed at inferring emotional cause-effect relationships in long conversations, a problem of real importance in affective analysis that remains far from solved. The framework enhances the interpretation of emotional causality by combining multimodal (audio-enriched) inputs with Retrieval-Augmented Generation (RAG), and its effectiveness is validated on a newly constructed benchmark dedicated to long-sequence emotional causal reasoning, which is itself a valuable resource for future research and model training. In sum, the paper tackles a long-standing research problem with a new solution and makes a meaningful contribution to progress in emotional causality analysis.
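Two ideas carry most of the weight here: enriching each dialogue turn with audio-derived features, and retrieving relevant context with a sliding window before querying the language model. The following is a minimal sketch of that pipeline under stated assumptions; the retrieval scoring is naive lexical matching, and `ask_llm` is a placeholder rather than the actual GLM-4 integration used in the paper.

```python
# Sliding-window retrieval over audio-enriched dialogue turns (illustrative only).
from dataclasses import dataclass


@dataclass
class Turn:
    speaker: str
    text: str
    emotion: str        # audio-derived vocal emotion
    intensity: float    # audio-derived emotional intensity
    speech_rate: float  # words per second


def enrich(turn: Turn) -> str:
    # Fold the audio features into the textual representation of the turn.
    return (f"{turn.speaker}: {turn.text} "
            f"[emotion={turn.emotion}, intensity={turn.intensity:.1f}, rate={turn.speech_rate:.1f}]")


def retrieve(turns: list[Turn], query: str, window: int = 4, top_k: int = 2) -> list[str]:
    # Slide a fixed-size window over the conversation and keep the windows whose
    # text overlaps most with the query (crude stand-in for real retrieval).
    windows = [turns[i:i + window] for i in range(0, max(len(turns) - window + 1, 1))]

    def score(w: list[Turn]) -> int:
        joined = " ".join(t.text for t in w).lower()
        return sum(tok in joined for tok in query.lower().split())

    best = sorted(windows, key=score, reverse=True)[:top_k]
    return ["\n".join(enrich(t) for t in w) for w in best]


def ask_llm(prompt: str) -> str:
    """Placeholder for a GLM-4/GPT-4-style call."""
    return "The criticism in turn 1 (anger, high intensity) leads to the sadness in turn 3."


if __name__ == "__main__":
    turns = [Turn("A", "You never listen to me.", "anger", 0.8, 3.1),
             Turn("B", "I was busy, sorry.", "neutral", 0.3, 2.4),
             Turn("A", "It always ends like this.", "sadness", 0.7, 1.9)]
    context = "\n---\n".join(retrieve(turns, "why is speaker A sad"))
    print(ask_llm(f"Given these segments:\n{context}\nTrace the emotional causal chain."))
```

The sliding window is what keeps retrieval tractable for dialogues with 70+ turns while still giving the model multi-turn context for causal chains.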
4. Exploring the Implicit Semantic Ability of Multimodal Large Language Models: A Pilot Study on Entity Set Expansion
Authors: Hebin Wang, Yangning Li, Yinghui Li, Hai-Tao Zheng, Wenhao Jiang, Hong-Gee Kim
https://arxiv.org/abs/2501.00330
Abstract
The rapid development of multimodal large language models (MLLMs) has brought significant improvements to a wide range of tasks in real-world applications. However, LLMs still exhibit certain limitations in extracting implicit semantic information. In this paper, we apply MLLMs to the Multi-modal Entity Set Expansion (MESE) task, which aims to expand a handful of seed entities with new entities belonging to the same semantic class, with multi-modal information provided for each entity. We explore the capabilities of MLLMs to understand implicit semantic information at the entity-level granularity through the MESE task, introducing a listwise ranking method LUSAR that maps local scores to global rankings. Our LUSAR demonstrates significant improvements in MLLM's performance on the MESE task, marking the first use of generative MLLM for ESE tasks and extending the applicability of listwise ranking.
Brief Review
In summary, this paper studies the implicit semantic ability of multimodal large language models (MLLMs) and probes it through the Multimodal Entity Set Expansion (MESE) task. The authors propose LUSAR, a listwise ranking method designed to improve MLLMs' ability to expand a set of seed entities when the shared semantics are only implicit. Experiments show that the method delivers significant performance gains on the MESE task. The paper thus offers a new perspective on the challenges MLLMs face when handling implicit semantic information and provides an effective solution.
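The core of a listwise scheme that "maps local scores to global rankings" can be illustrated with a short sketch: score candidates in small lists, then aggregate their list positions into a single global ordering. This is an illustrative reconstruction under stated assumptions, not the authors' LUSAR implementation; `rank_list_with_mllm` stands in for a real multimodal model call.

```python
# Listwise ranking sketch: local list rankings aggregated into a global ranking.
import random
from collections import defaultdict


def rank_list_with_mllm(seeds: list[str], candidates: list[str]) -> list[str]:
    """Placeholder: return the candidate list in the model's preferred order."""
    return sorted(candidates)  # stand-in for an actual MLLM ranking prompt


def global_ranking(seeds: list[str], candidates: list[str],
                   list_size: int = 3, rounds: int = 5, seed: int = 0) -> list[str]:
    rng = random.Random(seed)
    position_sum, counts = defaultdict(float), defaultdict(int)
    for _ in range(rounds):
        pool = candidates[:]
        rng.shuffle(pool)
        # Partition the shuffled pool into small lists and rank each one locally.
        for i in range(0, len(pool), list_size):
            ranked = rank_list_with_mllm(seeds, pool[i:i + list_size])
            for pos, cand in enumerate(ranked):
                position_sum[cand] += pos
                counts[cand] += 1
    # Lower average local position means higher global rank.
    return sorted(candidates, key=lambda c: position_sum[c] / counts[c])


if __name__ == "__main__":
    seeds = ["Paris", "Berlin", "Madrid"]  # seed entities sharing an implicit class
    candidates = ["Rome", "Tokyo", "banana", "Lisbon", "Oslo"]
    print(global_ranking(seeds, candidates))
```

Repeated shuffling means every candidate is compared against several different small lists, so the averaged positions approximate a full ranking without ever prompting the model with the entire candidate pool at once.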
5. GePBench: Evaluating Fundamental Geometric Perception for Multimodal Large Language Models
Authors: Shangyu Xing, Changhao Xiang, Yuteng Han, Yifan Yue, Zhen Wu, Xinyu Liu, Zhangtai Wu, Fei Zhao, Xinyu Dai
https://arxiv.org/abs/2412.21036
Abstract
Multimodal large language models (MLLMs) have achieved significant advancements in integrating visual and linguistic understanding. While existing benchmarks evaluate these models in context-rich, real-life scenarios, they often overlook fundamental perceptual skills essential for environments deviating from everyday realism. In particular, geometric perception, the ability to interpret spatial relationships and abstract visual patterns, remains underexplored. To address this limitation, we introduce GePBench, a novel benchmark designed to assess the geometric perception capabilities of MLLMs. Results from extensive evaluations reveal that current state-of-the-art MLLMs exhibit significant deficiencies in such tasks. Additionally, we demonstrate that models trained with data sourced from GePBench show notable improvements on a wide range of downstream tasks, underscoring the importance of geometric perception as a foundation for advanced multimodal applications. Our code and datasets will be publicly available.
Brief Review
GePBench is a novel benchmark for evaluating the geometric perception abilities of multimodal large language models (MLLMs), filling a gap in assessing these models on tasks that require spatial reasoning and shape understanding. The paper presents a dataset of 20K images and 250K multiple-choice questions covering various aspects of geometric perception. Extensive experiments show that training with GePBench data yields notable improvements on downstream tasks compared with existing models, and the key finding is that stronger geometric perception is essential for better performance in practical applications. Overall, this work offers valuable insights into how to enhance the geometric perception of MLLMs, making them more effective on complex geometric problems.
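Since the benchmark consists of multiple-choice questions over geometric images, its evaluation reduces to an exact-match accuracy loop. The sketch below shows such a loop; `answer_with_mllm` is a mocked placeholder and the sample items are invented for illustration, not drawn from GePBench itself.

```python
# Multiple-choice accuracy loop for geometric-perception questions (illustrative only).
def answer_with_mllm(image_path: str, question: str, options: dict[str, str]) -> str:
    """Placeholder for a real MLLM call; always answers 'A' here."""
    return "A"


def accuracy(items: list[dict]) -> float:
    correct = 0
    for item in items:
        pred = answer_with_mllm(item["image"], item["question"], item["options"])
        correct += int(pred == item["answer"])
    return correct / len(items)


if __name__ == "__main__":
    items = [
        {"image": "fig_001.png",
         "question": "How many sides does the highlighted polygon have?",
         "options": {"A": "5", "B": "6", "C": "7", "D": "8"}, "answer": "A"},
        {"image": "fig_002.png",
         "question": "Which line segment is longest?",
         "options": {"A": "AB", "B": "CD", "C": "EF", "D": "GH"}, "answer": "B"},
    ]
    print(f"accuracy = {accuracy(items):.2f}")  # 0.50 with the always-'A' placeholder
```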
We welcome your valuable suggestions in the comments section, including but not limited to:
Pointing out shortcomings of the paper reviews in this post.
Sharing recent papers you find more worth recommending, along with your reasons.
END