Paper Sharing | Research Progress on Multimodal Large Models
We selected 5 outstanding works from the 19 papers published between 2024-12-16 and 2024-12-23 to share with our readers.
1. GraphEQA: Using 3D Semantic Scene Graphs for Real-time Embodied Question Answering
2. HarmonicEval: Multi-modal, Multi-task, Multi-criteria Automatic Evaluation Using a Vision Language Model
3. EarthDial: Turning Multi-sensory Earth Observations to Interactive Dialogues
4. Towards Generalist Robot Policies: What Matters in Building Vision-Language-Action Models
5. FedPIA -- Permuting and Integrating Adapters Leveraging Wasserstein Barycenters for Finetuning Foundation Models in Multi-Modal Federated Learning
1. GraphEQA: Using 3D Semantic Scene Graphs for Real-time Embodied Question Answering
Authors: Saumya Saxena, Blake Buchanan, Chris Paxton, Bingqing Chen, Narunas Vaskevicius, Luigi Palmieri, Jonathan Francis, Oliver Kroemer
https://arxiv.org/abs/2412.14480
Abstract
In Embodied Question Answering (EQA), agents must explore and develop a semantic understanding of an unseen environment in order to answer a situated question with confidence. This remains a challenging problem in robotics due to the difficulties in obtaining useful semantic representations, updating these representations online, and leveraging prior world knowledge for efficient exploration and planning. Aiming to address these limitations, we propose GraphEQA, a novel approach that utilizes real-time 3D metric-semantic scene graphs (3DSGs) and task-relevant images as multi-modal memory for grounding Vision-Language Models (VLMs) to perform EQA tasks in unseen environments. We employ a hierarchical planning approach that exploits the hierarchical nature of 3DSGs for structured planning and semantic-guided exploration. Through experiments in simulation on the HM-EQA dataset and in the real world in home and office environments, we demonstrate that our method outperforms key baselines by completing EQA tasks with higher success rates and fewer planning steps. Videos and code are available at https://saumyasaxena.github.io/grapheqa/
Brief Review
GraphEQA is a novel approach that grounds Vision-Language Models (VLMs) with real-time 3D metric-semantic scene graphs and task-relevant images for exploration and planning in Embodied Question Answering. By combining global semantic information with local visual context, it substantially improves the success rate and efficiency of EQA tasks in unseen environments. Extensive experiments in both simulation and real-world settings demonstrate GraphEQA's effectiveness, achieving higher success rates with fewer planning steps. This innovative and practical work provides a valuable reference for future research.
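To make the idea of using a 3D scene graph as multi-modal memory for a VLM planner more concrete, here is a minimal Python sketch. The class names, prompt wording, and the `query_vlm` stub are hypothetical illustrations under our own assumptions, not the authors' implementation.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class SceneGraphNode:
    """One node of a hierarchical 3D metric-semantic scene graph."""
    name: str                       # e.g. "kitchen", "table", "mug"
    level: str                      # "building" | "room" | "object"
    position: tuple                 # metric (x, y, z) coordinate
    children: List["SceneGraphNode"] = field(default_factory=list)

def serialize_graph(node: SceneGraphNode, indent: int = 0) -> str:
    """Flatten the scene graph into text so it can be placed in a VLM prompt."""
    line = "  " * indent + f"{node.level}: {node.name} @ {node.position}"
    return "\n".join([line] + [serialize_graph(c, indent + 1) for c in node.children])

def query_vlm(prompt: str, images: List[bytes]) -> Dict:
    """Placeholder for a call to any multimodal chat model; plug in your own client."""
    raise NotImplementedError

def plan_next_step(question: str, graph: SceneGraphNode, images: List[bytes]) -> Dict:
    """One hierarchical planning step: pick a region to explore or answer with confidence."""
    prompt = (
        "You are an embodied agent answering a question about a partially explored scene.\n"
        f"Question: {question}\n"
        f"Current 3D scene graph:\n{serialize_graph(graph)}\n"
        "Either choose a room or frontier to explore next, or answer the question "
        "if you are confident."
    )
    return query_vlm(prompt, images)
```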
2. HarmonicEval: Multi-modal, Multi-task, Multi-criteria Automatic Evaluation Using a Vision Language Model
Authors: Masanari Ohi, Masahiro Kaneko, Naoaki Okazaki, Nakamasa Inoue
https://arxiv.org/abs/2412.14613
Abstract
Vision-language models (VLMs) have shown impressive abilities in text and image understanding. However, existing metrics for evaluating the text generated by VLMs focus exclusively on overall quality, leading to two limitations: 1) it is challenging to identify which aspects of the text need improvement from the overall score; 2) metrics may overlook specific evaluation criteria when predicting an overall score. To address these limitations, we propose HarmonicEval, a reference-free evaluation metric that aggregates criterion-wise scores to produce the overall score in a bottom-up manner. Furthermore, we construct the Multi-task Multi-criteria Human Evaluation (MMHE) dataset, which comprises 18,000 expert human judgments across four vision-language tasks. Our experiments demonstrate that HarmonicEval achieves higher correlations with human judgments than conventional metrics while providing numerical scores for each criterion.
Brief Review
This paper on evaluating generated text proposes an innovative method, HarmonicEval, for assessing the output of vision-language models (VLMs). HarmonicEval computes an overall score by aggregating criterion-wise scores in a bottom-up manner, and the authors introduce the Multi-task Multi-criteria Human Evaluation (MMHE) dataset of 18,000 expert human judgments across four vision-language tasks. Experiments show that HarmonicEval provides more fine-grained scoring information than conventional metrics while correlating better with human judgments. The work thus fills the gap left by metrics that report only an overall score, offers a more detailed benchmark for comparing automatic and human evaluation, and improves the transparency and interpretability of model assessment. Overall, HarmonicEval is a valuable contribution to automatic evaluation for vision-language tasks.
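A minimal sketch of the bottom-up, criterion-wise evaluation idea. The criterion names, prompt wording, and the harmonic-mean aggregation below are illustrative assumptions on our part; the paper defines its own criteria and aggregation scheme.

```python
from statistics import harmonic_mean
from typing import Callable, Dict

# Illustrative criteria only; the paper defines its own per-task criteria.
CRITERIA = ["fluency", "relevance", "descriptiveness", "accuracy"]

def score_criterion(vlm: Callable, image, text: str, criterion: str) -> float:
    """Ask a VLM to rate one criterion on a 1-5 scale (prompt wording is hypothetical)."""
    prompt = (
        f"Rate the following text for the given image on '{criterion}', "
        "from 1 (poor) to 5 (excellent). Reply with a single number.\n"
        f"Text: {text}"
    )
    return float(vlm(prompt, image))

def harmonic_eval(vlm: Callable, image, text: str) -> Dict[str, float]:
    """Bottom-up evaluation: per-criterion scores first, then an aggregated overall score."""
    scores = {c: score_criterion(vlm, image, text, c) for c in CRITERIA}
    # A plain harmonic mean is assumed here as the aggregation; it penalizes
    # any single weak criterion more than an arithmetic mean would.
    scores["overall"] = harmonic_mean(list(scores.values()))
    return scores
```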
3. EarthDial: Turning Multi-sensory Earth Observations to Interactive Dialogues
Authors: Sagar Soni, Akshay Dudhane, Hiyam Debary, Mustansar Fiaz, Muhammad Akhtar Munir, Muhammad Sohail Danish, Paolo Fraccaro, Campbell D Watson, Levente J Klein, Fahad Shahbaz Khan, Salman Khan
https://arxiv.org/abs/2412.15190
Abstract
Automated analysis of vast Earth observation data via interactive Vision-Language Models (VLMs) can unlock new opportunities for environmental monitoring, disaster response, and resource management. Existing generic VLMs do not perform well on Remote Sensing data, while the recent Geo-spatial VLMs remain restricted to a fixed resolution and few sensor modalities. In this paper, we introduce EarthDial, a conversational assistant specifically designed for Earth Observation (EO) data, transforming complex, multi-sensory Earth observations into interactive, natural language dialogues. EarthDial supports multi-spectral, multi-temporal, and multi-resolution imagery, enabling a wide range of remote sensing tasks, including classification, detection, captioning, question answering, visual reasoning, and visual grounding. To achieve this, we introduce an extensive instruction tuning dataset comprising over 11.11M instruction pairs covering RGB, Synthetic Aperture Radar (SAR), and multispectral modalities such as Near-Infrared (NIR) and infrared. Furthermore, EarthDial handles bi-temporal and multi-temporal sequence analysis for applications like change detection. Our extensive experimental results on 43 downstream applications demonstrate that EarthDial outperforms existing generic and domain-specific models, achieving better generalization across various EO tasks.
Brief Review
EarthDial is a vision-language model designed specifically for Earth Observation (EO) data, turning complex multi-sensory observations into interactive natural-language dialogues. The authors also build a large-scale instruction tuning dataset of over 11.11M instruction pairs and demonstrate strong performance on 43 downstream applications, outperforming existing generic and domain-specific models across a range of tasks. These gains should substantially strengthen the analysis of EO data and its downstream applications.
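To illustrate what a multi-sensor, multi-temporal instruction pair for such a dataset might look like, here is a small Python sketch. The field names and example values are hypothetical and not the paper's actual data schema.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class EOInstructionPair:
    """One instruction-tuning example over Earth observation imagery (hypothetical schema)."""
    image_paths: List[str]          # one path per timestamp for multi-temporal input
    modality: str                   # "RGB" | "SAR" | "NIR" | "multispectral"
    resolution_m: float             # ground sampling distance in meters
    task: str                       # "classification" | "detection" | "captioning" | ...
    instruction: str                # natural-language prompt
    response: str                   # target answer

example = EOInstructionPair(
    image_paths=["tile_2023.tif", "tile_2024.tif"],
    modality="SAR",
    resolution_m=10.0,
    task="change_detection",
    instruction="Compare the two acquisitions and describe any new built-up areas.",
    response="A new rectangular structure appears in the north-east quadrant of the tile.",
)
```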
4. Towards Generalist Robot Policies: What Matters in Building Vision-Language-Action Models
Authors: Xinghang Li, Peiyan Li, Minghuan Liu, Dong Wang, Jirong Liu, Bingyi Kang, Xiao Ma, Tao Kong, Hanbo Zhang, Huaping Liu
https://arxiv.org/abs/2412.14058
Abstract
Foundation Vision Language Models (VLMs) exhibit strong capabilities in multi-modal representation learning, comprehension, and reasoning. By injecting action components into the VLMs, Vision-Language-Action Models (VLAs) can be naturally formed and also show promising performance. Existing work has demonstrated the effectiveness and generalization of VLAs in multiple scenarios and tasks. Nevertheless, the transfer from VLMs to VLAs is not trivial since existing VLAs differ in their backbones, action-prediction formulations, data distributions, and training recipes. This leads to a missing piece for a systematic understanding of the design choices of VLAs. In this work, we disclose the key factors that significantly influence the performance of VLA and focus on answering three essential design choices: which backbone to select, how to formulate the VLA architectures, and when to add cross-embodiment data. The obtained results convince us firmly to explain why we need VLA and develop a new family of VLAs, RoboVLMs, which require very few manual designs and achieve a new state-of-the-art performance in three simulation tasks and real-world experiments. Through our extensive experiments, which include over 8 VLM backbones, 4 policy architectures, and over 600 distinct designed experiments, we provide a detailed guidebook for the future design of VLAs. In addition to the study, the highly flexible RoboVLMs framework, which supports easy integrations of new VLMs and free combinations of various design choices, is made public to facilitate future research. We open-source all details, including codes, models, datasets, and toolkits, along with detailed training and evaluation recipes at: robovlms.github.io.
Brief Review
This paper focuses on action modeling that couples vision and language for robot manipulation. It proposes RoboVLMs, a framework for building Vision-Language-Action models (VLAs) on top of vision-language models (VLMs), and studies the key factors behind VLA performance through extensive experiments: which backbone to choose, how to formulate the architecture, and when to add cross-embodiment data. The framework offers methodological support for systematically evaluating and integrating VLMs into VLAs, and experiments across multiple settings demonstrate its effectiveness over different architectures and datasets. Overall, this is an important study of how to exploit visual and linguistic information more effectively for robot manipulation, and a valuable reference for future research.
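A minimal sketch of the general VLM-to-VLA recipe the paper investigates: a VLM backbone whose output representation is mapped to a continuous action by a small policy head. The module interface, hidden size, and action parameterization below are illustrative assumptions, not the RoboVLMs implementation.

```python
import torch
import torch.nn as nn

class SimpleVLA(nn.Module):
    """Vision-Language-Action model: a VLM backbone plus a small action head (illustrative)."""

    def __init__(self, vlm_backbone: nn.Module, hidden_dim: int = 1024, action_dim: int = 7):
        super().__init__()
        self.backbone = vlm_backbone          # any VLM returning per-token hidden states
        self.action_head = nn.Sequential(     # maps the final hidden state to an action
            nn.Linear(hidden_dim, 256),
            nn.ReLU(),
            nn.Linear(256, action_dim),       # e.g. 6-DoF end-effector delta + gripper
        )

    def forward(self, images: torch.Tensor, instruction_tokens: torch.Tensor) -> torch.Tensor:
        hidden = self.backbone(images, instruction_tokens)    # (batch, seq, hidden_dim)
        return self.action_head(hidden[:, -1])                # action from the last token
```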
5. FedPIA -- Permuting and Integrating Adapters Leveraging Wasserstein Barycenters for Finetuning Foundation Models in Multi-Modal Federated Learning
Authors: Pramit Saha, Divyanshu Mishra, Felix Wagner, Konstantinos Kamnitsas, J. Alison Noble
https://arxiv.org/abs/2412.14424
Abstract
Large Vision-Language Models (VLMs), possessing millions or billions of parameters, typically require large text and image datasets for effective fine-tuning. However, collecting data from various sites, especially in healthcare, is challenging due to strict privacy regulations. An alternative is to fine-tune these foundation models on end-user devices, such as in medical clinics and hospitals, without sending data to a server. These local clients typically have limited computing power and small datasets, which are not enough for fully fine-tuning large VLMs on their own. A naive solution to these scenarios is to leverage parameter-efficient fine-tuning (PEFT) strategies such as adapters and apply federated learning (FL) algorithms to combine the learned adapter weights, thereby respecting the resource limitations and data privacy of the clients. However, this approach does not fully leverage the knowledge from multiple adapters trained on diverse data distributions and for diverse tasks. The adapters are adversely impacted by data heterogeneity and task heterogeneity across clients, resulting in sub-optimal convergence. To this end, we propose a novel framework called FedPIA that improves upon the naive combinations of FL and PEFT by introducing Permutation and Integration of the local Adapters in the server and global Adapters in the clients, exploiting Wasserstein barycenters for improved blending of client-specific and client-agnostic knowledge. This layer-wise permutation helps to bridge the gap in the parameter space of local and global adapters before integration. We conduct over 2000 client-level experiments utilizing 48 medical image datasets across five different medical vision-language FL task settings encompassing visual question answering as well as image and report-based multi-label disease detection. Our experiments involving diverse client settings, ten different modalities, and two VLM backbones demonstrate that FedPIA consistently outperforms the state-of-the-art PEFT-FL baselines. Code/FL setup: https://github.com/PramitSaha/Fed-PEFT
Brief Review
FedPIA is a novel method for fine-tuning vision-language models in a federated learning setting. It introduces a mechanism that permutes and integrates adapters using Wasserstein barycenters to address data and task heterogeneity across clients. The approach outperforms state-of-the-art baselines on a variety of medical datasets, and the extensive experiments provide strong empirical evidence of its effectiveness.
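A heavily simplified sketch of the core idea of permuting adapters so their hidden units are aligned before integration. Here a hard permutation from the Hungarian algorithm stands in for the paper's Wasserstein-barycenter formulation, and the adapter shapes, cost definition, and equal-weight averaging are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def align_and_merge(local_down: np.ndarray, global_down: np.ndarray,
                    local_up: np.ndarray, global_up: np.ndarray):
    """Permute the local adapter's hidden units to match the global adapter, then average.

    Each adapter is a bottleneck pair: down (d_model x r) and up (r x d_model).
    A hard permutation via the Hungarian algorithm is a simplified stand-in for
    the Wasserstein-barycenter blending described in the paper.
    """
    # Cost of matching local hidden unit i to global hidden unit j:
    # negative similarity of their down-projection columns.
    cost = -local_down.T @ global_down                 # shape (r, r)
    row_idx, col_idx = linear_sum_assignment(cost)
    perm = np.zeros_like(cost)
    perm[row_idx, col_idx] = 1.0                       # permutation matrix P

    # Apply P so the local adapter uses the same unit ordering as the global one.
    aligned_down = local_down @ perm                   # reorder hidden units (columns)
    aligned_up = perm.T @ local_up                     # reorder hidden units (rows)

    # Integrate: simple average of the aligned local and global adapters.
    merged_down = 0.5 * (aligned_down + global_down)
    merged_up = 0.5 * (aligned_up + global_up)
    return merged_down, merged_up
```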
We welcome your valuable suggestions in the comments section, including but not limited to:
Pointing out shortcomings of the paper reviews in this post! Sharing more recent papers worth recommending, together with your reasons!
END