Paper Digest | Research Progress on Multimodal Large Models
From the 20 papers collected between 2025-01-04 and 2025-01-08, we have selected 5 outstanding works to share with our readers.
1. AVTrustBench: Assessing and Enhancing Reliability and Robustness in Audio-Visual LLMs
2. Generalizing from SIMPLE to HARD Visual Reasoning: Can We Mitigate Modality Imbalance in VLMs?
3. Prepending or Cross-Attention for Speech-to-Text? An Empirical Comparison
4. MotionBench: Benchmarking and Improving Fine-grained Video Motion Understanding for Vision Language Models
5. FedRSClip: Federated Learning for Remote Sensing Scene Classification Using Vision-Language Models
1. AVTrustBench: Assessing and Enhancing Reliability and Robustness in Audio-Visual LLMs
Authors: Sanjoy Chowdhury, Sayan Nag, Subhrajyoti Dasgupta, Yaoting Wang, Mohamed Elhoseiny, Ruohan Gao, Dinesh Manocha
https://arxiv.org/abs/2501.02135
Abstract
With the rapid advancement of Multi-modal Large Language Models (MLLMs), several diagnostic benchmarks have recently been developed to assess these models' multi-modal reasoning proficiency. However, these benchmarks primarily focus on the visual aspect and do not examine holistic audio-visual (AV) understanding. Moreover, there are currently no benchmarks that investigate the capabilities of AVLLMs in calibrating their responses when presented with perturbed inputs. To this end, we introduce the Audio-Visual Trustworthiness assessment Benchmark (AVTrustBench), comprising 1,500 samples spanning over 20 meticulously crafted tasks, evaluating the capabilities of AVLLMs across three distinct dimensions: an adversarial suite, a compositional suite, and a missing-modality suite. Using our benchmark, we extensively evaluate 13 state-of-the-art AVLLMs. The findings reveal that the majority of existing models significantly fall short of achieving human-like comprehension, providing valuable insights for future research directions. To address the limitations in existing approaches, we propose a robust, model-agnostic calibrated audio-visual preference optimization-based training strategy called CAVPref, achieving gains of up to 30.19% across all 9 tasks. We will publicly release our code and benchmark to facilitate future research in this domain.
Brief Review
This paper presents AVTrustBench, a benchmark for assessing the reliability and robustness of audio-visual large language models along three dimensions: adversarial attacks, compositional reasoning, and modality dependence. On top of the benchmark, the authors propose a general, model-agnostic training strategy, CAVPref. Experiments show that applying the proposed improvement yields significant performance gains across tasks. The work provides a valuable reference framework for audio-visual understanding and should help advance research in this area.
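The abstract describes CAVPref only at a high level, as a calibrated, preference-optimization-based training strategy. As a rough illustration of what such an objective can look like, here is a minimal PyTorch sketch of a DPO-style preference loss with a per-sample calibration weight; the function name, the calibration_weight term, and all shapes are our own assumptions, not the paper's actual formulation.

```python
import torch
import torch.nn.functional as F

def calibrated_preference_loss(logp_chosen, logp_rejected,
                               ref_logp_chosen, ref_logp_rejected,
                               calibration_weight, beta=0.1):
    """DPO-style preference loss with a per-sample calibration weight.

    logp_* are summed log-probabilities of the chosen / rejected answers
    under the policy model; ref_logp_* are the same quantities under a
    frozen reference model. calibration_weight is a hypothetical term that
    could up-weight samples whose audio-visual evidence the model under-uses.
    """
    chosen_ratio = logp_chosen - ref_logp_chosen
    rejected_ratio = logp_rejected - ref_logp_rejected
    margin = beta * (chosen_ratio - rejected_ratio)
    return (-F.logsigmoid(margin) * calibration_weight).mean()

# Toy usage with random log-probabilities for a batch of 4 samples.
lp_c, lp_r = torch.randn(4), torch.randn(4)
ref_c, ref_r = torch.randn(4), torch.randn(4)
weights = torch.ones(4)  # uniform calibration weights, for illustration only
print(calibrated_preference_loss(lp_c, lp_r, ref_c, ref_r, weights))
```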
2. Generalizing from SIMPLE to HARD Visual Reasoning: Can We Mitigate Modality Imbalance in VLMs?
Authors: Simon Park, Abhishek Panigrahi, Yun Cheng, Dingli Yu, Anirudh Goyal, Sanjeev Arora
https://arxiv.org/abs/2501.02669
Abstract
While Vision Language Models (VLMs) are impressive in tasks such as visual question answering (VQA) and image captioning, their ability to apply multi-step reasoning to images has lagged, resulting in perceptions of modality imbalance or brittleness.
Towards a systematic study of these issues, we introduce a synthetic framework for assessing the ability of VLMs to perform algorithmic visual reasoning (AVR), comprising three tasks: Table Readout, Grid Navigation, and Visual Analogy. Each has two levels of difficulty, SIMPLE and HARD, with even the SIMPLE versions challenging for frontier VLMs. We seek strategies for training on the SIMPLE version of tasks that improve performance on the corresponding HARD tasks, i.e., SIMPLE-to-HARD generalization. This synthetic framework, where each task also has a text-only version, allows quantification of modality imbalance and how it is affected by training strategy. Ablation studies highlight the importance of explicit image-to-text conversion in promoting SIMPLE-to-HARD generalization when using auto-regressive training. We also report results from a mechanistic study of this phenomenon, including a measure of gradient alignment that seems to identify training strategies that promote better SIMPLE-to-HARD generalization (Code is available at https://github.com/princeton-pli/VLM_S2H/).
Brief Review
This paper tackles modality imbalance in vision-language models (VLMs) and introduces synthetic tasks that allow training strategies to be compared directly. The study finds that training with image-based reasoning plus explicit conversion to text substantially improves performance on the harder tasks and thereby mitigates the imbalance between modalities. The work examines this key problem in depth and offers an effective remedy, using synthetic tasks to simulate data distributions across modalities and improve model behavior. The experimental results provide valuable insights for understanding and optimizing different training strategies, especially on more complex, higher-level tasks, and are of practical significance for deploying VLMs.
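The abstract mentions a gradient-alignment measure that appears to identify training strategies with good SIMPLE-to-HARD transfer. Below is a minimal, hypothetical sketch of one way such a measure could be computed, as the cosine similarity between the gradients of the SIMPLE-task and HARD-task losses; the paper's exact metric and setup may differ.

```python
import torch

def gradient_alignment(model, loss_simple, loss_hard):
    """Cosine similarity between the gradients of two task losses.

    A simple proxy (our illustration, not necessarily the paper's metric)
    for whether training on SIMPLE data moves the parameters in a direction
    that also reduces the HARD-task loss.
    """
    params = [p for p in model.parameters() if p.requires_grad]
    g_simple = torch.autograd.grad(loss_simple, params, retain_graph=True,
                                   allow_unused=True)
    g_hard = torch.autograd.grad(loss_hard, params, allow_unused=True)
    flatten = lambda grads: torch.cat([g.reshape(-1) for g in grads if g is not None])
    return torch.nn.functional.cosine_similarity(
        flatten(g_simple), flatten(g_hard), dim=0)

# Toy usage: a linear classifier with two synthetic "tasks".
model = torch.nn.Linear(8, 2)
x = torch.randn(16, 8)
y_simple = torch.randint(0, 2, (16,))
y_hard = torch.randint(0, 2, (16,))
ce = torch.nn.CrossEntropyLoss()
print(float(gradient_alignment(model, ce(model(x), y_simple), ce(model(x), y_hard))))
```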
3. Prepending or Cross-Attention for Speech-to-Text? An Empirical Comparison
Authors: Tsz Kin Lam, Marco Gaido, Sara Papi, Luisa Bentivogli, Barry Haddow
https://arxiv.org/abs/2501.02370
Abstract
Following the remarkable success of Large Language Models (LLMs) in NLP tasks, there is increasing interest in extending their capabilities to speech—the most common form of communication. To integrate speech into LLMs, one promising approach is dense feature prepending (DFP), which prepends the projected speech representations to the textual representations, allowing end-to-end training with the speech encoder. However, DFP typically requires connecting a text decoder to a speech encoder, leading to questions about the importance of a sophisticated speech encoder for DFP and how its performance compares with a standard encoder-decoder (i.e., cross-attention) architecture. To perform a controlled architectural comparison, we train all models from scratch rather than using large pretrained models, and we use comparable data and parameter settings, testing speech-to-text recognition (ASR) and translation (ST) on MuST-C v1.0 and CoVoST2 datasets. We study the influence of a speech encoder in DFP. More importantly, we compare DFP and cross-attention under various configurations, such as CTC compression, sequence-level knowledge distillation, generation speed, and GPU memory footprint on monolingual, bilingual, and multilingual models. Despite the prevalence of DFP over cross-attention, our overall results do not indicate a clear advantage of DFP.
Brief Review
This paper studies how to combine large language models (LLMs) with speech processing by comparing two architectures: dense feature prepending (DFP) and cross-attention (CA). The authors run a controlled empirical study on the MuST-C and CoVoST2 datasets to evaluate different configurations and their effect on performance. The results indicate that cross-attention outperforms DFP in generation speed and memory consumption, while neither architecture shows a clear advantage in overall quality.
The paper is thorough, covering comparison experiments across many configurations and datasets, and offers a valuable reference for researchers. The experimental design is rigorous and the conditions are well controlled, which helps ensure the reliability of the results. Overall, this is a noteworthy contribution to the field.
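To make the architectural contrast concrete, here is a minimal PyTorch sketch of the two integration patterns compared in the paper: prepending projected speech features to the text embeddings of a decoder-only backbone (DFP) versus a text decoder that cross-attends to the speech encoder output. All dimensions are invented for illustration, and causal masking is omitted for brevity.

```python
import torch
import torch.nn as nn

d_model, speech_frames, text_len = 256, 50, 10
speech_feats = torch.randn(1, speech_frames, 512)   # output of a speech encoder
text_emb = torch.randn(1, text_len, d_model)        # embedded text tokens

# Dense feature prepending (DFP): project the speech features and place them
# in front of the text embeddings; a plain self-attention stack stands in for
# the decoder-only backbone (causal masking omitted here).
proj = nn.Linear(512, d_model)
dfp_input = torch.cat([proj(speech_feats), text_emb], dim=1)
backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=2)
dfp_out = backbone(dfp_input)           # (1, speech_frames + text_len, d_model)

# Cross-attention (encoder-decoder): the text decoder attends to the speech
# encoder output through dedicated cross-attention layers.
ca_decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True), num_layers=2)
ca_out = ca_decoder(tgt=text_emb, memory=proj(speech_feats))   # (1, text_len, d_model)

print(dfp_out.shape, ca_out.shape)
```

Note that in DFP the speech tokens lengthen the backbone's input sequence, which is one reason the paper measures generation speed and GPU memory footprint alongside quality.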
4. MotionBench: Benchmarking and Improving Fine-grained Video Motion Understanding for Vision Language Models
Authors: Wenyi Hong, Yean Cheng, Zhuoyi Yang, Weihan Wang, Lefan Wang, Xiaotao Gu, Shiyu Huang, Yuxiao Dong, Jie Tang
https://arxiv.org/abs/2501.02955
Abstract
In recent years, significant advancements have been made in leveraging multimodal large language models (LLMs) for video understanding. However, a crucial capability—fine-grained motion comprehension—remains underexplored in current benchmarks. To address this gap, we propose MotionBench, a comprehensive evaluation benchmark designed to assess the fine-grained motion understanding of video LLMs. MotionBench evaluates models' motion-level perception through six primary categories of motion-oriented question types and includes data collected from diverse sources, ensuring a broad representation of real-world video content. Experimental results reveal that existing VLMs perform poorly in understanding fine-grained motions. To enhance VLM's ability to perceive fine-grained motion within a limited sequence length of LLMs, we conduct extensive experiments reviewing VLM architectures optimized for video feature compression and propose a novel and efficient Through-Encoder (TE) Fusion method. Experiments show that higher frame rate inputs and TE Fusion yield improvements in motion understanding, yet there is still substantial room for enhancement. Our benchmark aims to guide and motivate the development of more capable video understanding models, emphasizing the importance of fine-grained motion comprehension. Project page: https://motion-bench.github.io.
Brief Review
MotionBench is a benchmark for evaluating fine-grained motion understanding in vision-language models (VLMs). It highlights how much existing models struggle with motion-level understanding and proposes a Through-Encoder Fusion method to improve video feature representation. The benchmark contains 8,052 questions drawn from diverse video sources and covering multiple categories, underscoring how critical fine-grained motion understanding is for real-world applications.
Importantly, MotionBench offers a new lens for evaluating fine-grained motion understanding in the video domain and helps drive model improvements. The diversity of its video sources and question types also strengthens its representativeness, so models can be tested against a wide range of realistic scenarios. Overall, MotionBench is an important milestone on the way to high-quality motion understanding in video models.
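The abstract only names the Through-Encoder (TE) Fusion idea: compressing video features by fusing frames inside the visual encoder rather than after it. The toy sketch below illustrates that general pattern, fusing groups of frames by mean pooling between two encoder stages; the class name, the pooling choice, and all dimensions are our assumptions rather than the paper's actual design.

```python
import torch
import torch.nn as nn

class ToyThroughEncoderFusion(nn.Module):
    """Illustrative frame-group fusion inside a tiny transformer encoder.

    Frames are encoded independently by the first blocks, then every `group`
    consecutive frames are merged by mean pooling before the remaining
    blocks, so downstream components see a temporally compressed token set.
    """
    def __init__(self, dim=64, group=4, depth_pre=2, depth_post=2):
        super().__init__()
        make_layer = lambda: nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.pre = nn.TransformerEncoder(make_layer(), depth_pre)
        self.post = nn.TransformerEncoder(make_layer(), depth_post)
        self.group = group

    def forward(self, frame_tokens):
        # frame_tokens: (num_frames, num_patches, dim); num_frames % group == 0
        f, p, d = frame_tokens.shape
        x = self.pre(frame_tokens)                      # per-frame encoding
        x = x.view(f // self.group, self.group, p, d)   # regroup consecutive frames
        x = x.mean(dim=1)                               # fuse each group into one token set
        return self.post(x)                             # keep encoding the compressed tokens

tokens = torch.randn(8, 16, 64)                    # 8 frames, 16 patch tokens each
print(ToyThroughEncoderFusion()(tokens).shape)     # torch.Size([2, 16, 64]): 4x fewer frames
```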
5. FedRSClip: Federated Learning for Remote Sensing Scene Classification Using Vision-Language Models
Authors: Hui Lin, Chao Zhang, Danfeng Hong, Kexin Dong, Congcong Wen
https://arxiv.org/abs/2501.02461
Abstract
Remote sensing image classification is essential for various applications, including agricultural monitoring, urban planning, and land use classification. However, remote sensing data is often distributed across multiple institutions, and due to privacy concerns and data-sharing restrictions, leveraging large-scale datasets in a centralized training framework is challenging. Federated learning offers a promising solution by enabling collaborative model training across distributed data sources without requiring data centralization. However, current Vision-Language Models (VLMs), which typically contain billions of parameters, pose significant communication challenges for traditional federated learning approaches based on model parameter updates, as they incur substantial communication costs. In this paper, we propose FedRSCLIP, the first federated learning framework designed for remote sensing image classification based on a VLM, specifically CLIP. FedRSCLIP addresses the challenges of data heterogeneity and large-scale model transmission in federated environments by introducing Prompt Learning, which optimizes only a small set of tunable parameters. The framework implements a dual-prompt mechanism comprising Shared Prompts for global knowledge sharing and Private Prompts for client-specific adaptation. To maintain semantic coherence between shared and private prompts, we propose the Dual Prompt Alignment Constraint to balance global consistency and local adaptability across diverse client distributions. Additionally, to enhance cross-modal representation learning, we introduce the Cross-Modal Feature Alignment Constraint to align multimodal features between text and image prompts. To validate the effectiveness of our proposed model, we construct a Fed-RSIC dataset based on three existing remote sensing image classification datasets, specifically designed to simulate various federated learning configurations. Experimental results on the Fed-RSIC dataset demonstrate the effectiveness and superiority of FedRSCLIP in addressing the challenges of federated remote sensing image classification.
Brief Review
This paper proposes FedRSCLIP, a federated learning framework that uses vision-language models (VLMs), specifically CLIP, for remote sensing image classification. The main challenges are communication cost and data heterogeneity, which the authors address with prompt learning and a dual-prompt mechanism. Experiments on the newly constructed Fed-RSIC dataset show impressive results and demonstrate the method's advantages. This contribution offers a fresh perspective on federated learning as well as new ideas for remote sensing image recognition. Overall, the paper presents a comprehensive solution and shows its potential in practical applications.
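As a rough illustration of the dual-prompt idea, the sketch below keeps a small shared prompt (the part exchanged in federated rounds) and a private prompt (kept on the client), and pairs them with two simple losses standing in for the Dual Prompt Alignment and Cross-Modal Feature Alignment constraints. The shapes, loss forms, and names are our assumptions; the paper's actual constraints may be defined differently.

```python
import torch
import torch.nn.functional as F

# Hypothetical prompt shapes: 8 learnable tokens of width 512 (CLIP text width).
shared_prompt = torch.randn(8, 512, requires_grad=True)    # communicated with the server
private_prompt = torch.randn(8, 512, requires_grad=True)   # stays on the client

def dual_prompt_alignment(shared, private):
    # Keep the client-specific prompt from drifting too far away from the
    # globally shared prompt (one plausible form of this constraint).
    return F.mse_loss(private, shared.detach())

def cross_modal_alignment(text_feat, image_feat):
    # Pull the text-prompt feature toward the image feature of the same
    # sample, using a cosine-similarity-based alignment term.
    return 1 - F.cosine_similarity(text_feat, image_feat, dim=-1).mean()

# Toy features standing in for CLIP text / image encoder outputs.
text_feat, image_feat = torch.randn(4, 512), torch.randn(4, 512)
total_constraint = (dual_prompt_alignment(shared_prompt, private_prompt)
                    + cross_modal_alignment(text_feat, image_feat))
print(float(total_constraint))
```

Because only these small prompt tensors, not the CLIP backbone, would be trained and transmitted, the per-round communication cost stays far below sending full model updates, which is the motivation the abstract gives for prompt learning.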
We welcome your valuable suggestions in the comment section, including but not limited to:
Pointing out shortcomings of the brief reviews in this post! Sharing recent papers that you find even more worth recommending, along with your reasons!
END