Paper Sharing | Research Progress on Multimodal Large Models
From the 34 papers collected between 2024-11-29 and 2024-12-03, we have selected 5 outstanding works to share with our readers.
1. LMSeg: Unleashing the Power of Large-Scale Models for Open-Vocabulary Semantic Segmentation
2. Multi-Label Contrastive Learning: A Comprehensive Study
3. SceneTAP: Scene-Coherent Typographic Adversarial Planner against Vision-Language Models in Real-World Environments
4. MOSABench: Multi-Object Sentiment Analysis Benchmark for Evaluating Multimodal Large Language Models Understanding of Complex Image
5. SOLAMI: Social Vision-Language-Action Modeling for Immersive Interaction with 3D Autonomous Characters
1. LMSeg: Unleashing the Power of Large-Scale Models for Open-Vocabulary Semantic Segmentation
Authors: Huadong Tang, Youpeng Zhao, Yan Huang, Min Xu, Jun Wang, Qiang Wu
https://arxiv.org/abs/2412.00364
Abstract
It is widely agreed that open-vocabulary-based approaches outperform classical closed-set training solutions for recognizing unseen objects in images for semantic segmentation. Existing open-vocabulary approaches leverage vision-language models, such as CLIP, to align visual features with rich semantic features acquired through pre-training on large-scale vision-language datasets. However, the text prompts employed in these methods are short phrases based on fixed templates, failing to capture comprehensive object attributes. Moreover, while the CLIP model excels at exploiting image-level features, it is less effective at pixel-level representation, which is crucial for semantic segmentation tasks. In this work, we propose to alleviate the above-mentioned issues by leveraging multiple large-scale models to enhance the alignment between fine-grained visual features and enriched linguistic features. Specifically, our method employs large language models (LLMs) to generate enriched language prompts with diverse visual attributes for each category, including color, shape/size, and texture/material. Additionally, for enhanced visual feature extraction, the SAM model is adopted as a supplement to the CLIP visual encoder through a proposed learnable weighted fusion strategy. Built upon these techniques, our method, termed LMSeg, achieves state-of-the-art performance across all major open-vocabulary segmentation benchmarks. The code will be made available soon.
Brief Review
This paper tackles open-vocabulary semantic segmentation (OVSS), where classical closed-set methods struggle with unseen categories. The proposed LMSeg framework uses large language models (LLMs) and the SAM model to strengthen the alignment between pixel-level visual features and text features, addressing two weaknesses of existing methods: text prompts built from fixed templates and poor pixel-level representation. Experiments show that LMSeg clearly improves over prior work and achieves better performance on multiple benchmark datasets. Its main novelty lies in using large-scale language models to enrich the text prompts, a simple yet effective idea that matters for progress in this area. Overall, the paper presents an effective method and demonstrates its potential in practical applications.
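The abstract highlights a learnable weighted fusion that supplements the CLIP visual encoder with SAM features. The official code is not yet released, so the following PyTorch snippet is only a minimal sketch of what such a fusion could look like; the 1x1 projections and per-channel sigmoid gate are our own assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class LearnableWeightedFusion(nn.Module):
    """Hypothetical sketch: fuse CLIP and SAM pixel features with a learned gate.

    Assumes both feature maps are resized to the same spatial resolution; the
    projection dims and sigmoid gating are illustrative, not LMSeg's actual design.
    """

    def __init__(self, clip_dim: int, sam_dim: int, out_dim: int):
        super().__init__()
        self.proj_clip = nn.Conv2d(clip_dim, out_dim, kernel_size=1)
        self.proj_sam = nn.Conv2d(sam_dim, out_dim, kernel_size=1)
        # One learnable logit per channel controls the CLIP/SAM mixing ratio.
        self.gate_logits = nn.Parameter(torch.zeros(out_dim))

    def forward(self, clip_feat: torch.Tensor, sam_feat: torch.Tensor) -> torch.Tensor:
        c = self.proj_clip(clip_feat)          # (B, out_dim, H, W)
        s = self.proj_sam(sam_feat)            # (B, out_dim, H, W)
        w = torch.sigmoid(self.gate_logits).view(1, -1, 1, 1)
        return w * c + (1.0 - w) * s           # per-channel convex combination


# Toy usage with random tensors standing in for encoder outputs.
fusion = LearnableWeightedFusion(clip_dim=768, sam_dim=256, out_dim=512)
clip_feat = torch.randn(2, 768, 32, 32)
sam_feat = torch.randn(2, 256, 32, 32)
fused = fusion(clip_feat, sam_feat)            # (2, 512, 32, 32)
```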
2. Multi-Label Contrastive Learning: A Comprehensive Study
Authors: Alexandre Audibert, Aurélien Gauffre, Massih-Reza Amini
https://arxiv.org/abs/2412.00101
Abstract
Multi-label classification, which involves assigning multiple labels to a single input, has emerged as a key area in both research and industry due to its wide-ranging applications. Designing effective loss functions is crucial for optimizing deep neural networks for this task, as they significantly influence model performance and efficiency. Traditional loss functions, which often maximize likelihood under the assumption of label independence, may struggle to capture complex label relationships. Recent research has turned to supervised contrastive learning, a method that aims to create a structured representation space by bringing similar instances closer together and pushing dissimilar ones apart. Although contrastive learning offers a promising approach, applying it to multi-label classification presents unique challenges, particularly in managing label interactions and data structure. In this paper, we conduct an in-depth study of contrastive learning loss for multi-label classification across diverse settings. These include datasets with both small and large numbers of labels, datasets with varying amounts of training data, and applications in both computer vision and natural language processing. Our empirical results indicate that the promising outcomes of contrastive learning are attributable not only to the consideration of label interactions but also to the robust optimization scheme of the contrastive loss. Furthermore, while the supervised contrastive loss function faces challenges with datasets containing a small number of labels and ranking-based metrics, it demonstrates excellent performance, particularly in terms of Macro-F1, on datasets with a large number of labels. Finally, through gradient analysis of standard contrastive loss in multi-label classification, along with insights from previous work, we develop a new competitive loss function that removes certain gradient components to prevent undesirable behavior and improve performance.
Brief Review
This paper studies the role of contrastive learning in multi-label classification and proposes a new loss function motivated by a gradient analysis. The goal is to improve multi-label classification performance, and the paper reports extensive empirical results across diverse datasets. Although the authors stress the importance of their contribution, the results do not offer strikingly novel or game-changing insights, so the work is unlikely to be seen as a breakthrough. Even so, the proposed loss function and its practical effectiveness are worth further exploration and validation. Overall, the paper offers a useful perspective and methodology for research on multi-label classification.
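For readers unfamiliar with supervised contrastive learning in the multi-label setting, the sketch below shows one common formulation in which positive pairs are weighted by the Jaccard overlap of their label sets. This is an illustrative baseline written by us, not the modified loss proposed in the paper (which additionally removes certain gradient components).

```python
import torch
import torch.nn.functional as F

def multilabel_supcon_loss(embeddings: torch.Tensor,
                           labels: torch.Tensor,
                           temperature: float = 0.07) -> torch.Tensor:
    """Sketch of a multi-label supervised contrastive loss.

    Positive pairs are weighted by the Jaccard overlap of their label sets.
    embeddings: (B, D) unnormalized features; labels: (B, C) multi-hot matrix.
    """
    z = F.normalize(embeddings, dim=1)
    sim = z @ z.t() / temperature                      # (B, B) scaled cosine similarities

    # Jaccard similarity between label sets defines soft positive weights.
    inter = labels.float() @ labels.float().t()        # |A ∩ B|
    union = labels.sum(1, keepdim=True) + labels.sum(1) - inter
    jaccard = inter / union.clamp(min=1e-8)            # (B, B)

    # Exclude self-comparisons from both positives and the softmax denominator.
    eye = torch.eye(len(z), dtype=torch.bool, device=z.device)
    jaccard = jaccard.masked_fill(eye, 0.0)
    logits = sim.masked_fill(eye, float('-inf'))

    log_prob = F.log_softmax(logits, dim=1).masked_fill(eye, 0.0)
    weights = jaccard / jaccard.sum(1, keepdim=True).clamp(min=1e-8)
    loss = -(weights * log_prob).sum(1)                # weighted log-prob of positives
    return loss.mean()


# Toy usage: 8 samples, 16-dim embeddings, 5 possible labels.
emb = torch.randn(8, 16, requires_grad=True)
lbl = (torch.rand(8, 5) > 0.6).long()
print(multilabel_supcon_loss(emb, lbl))
```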
3. SceneTAP: Scene-Coherent Typographic Adversarial Planner against Vision-Language Models in Real-World Environments
Authors: Yue Cao, Yun Xing, Jie Zhang, Di Lin, Tianwei Zhang, Ivor Tsang, Yang Liu, Qing Guo
https://arxiv.org/abs/2412.00114
Abstract
Large vision-language models (LVLMs) have shown remarkable capabilities in interpreting visual content. While existing works demonstrate these models’ vulnerability to deliberately placed adversarial texts, such texts are often easily identifiable as anomalous. In this paper, we present the first approach to generate scene-coherent typographic adversarial attacks that mislead advanced LVLMs while maintaining visual naturalness through the capability of the LLM-based agent. Our approach addresses three critical questions: what adversarial text to generate, where to place it within the scene, and how to integrate it seamlessly. We propose a training-free, multi-modal LLM-driven scene-coherent typographic adversarial planning (SceneTAP) that employs a three-stage process: scene understanding, adversarial planning, and seamless integration. The SceneTAP utilizes chain-of-thought reasoning to comprehend the scene, formulate effective adversarial text, strategically plan its placement, and provide detailed instructions for natural integration within the image. This is followed by a scene-coherent TextDiffuser that executes the attack using a local diffusion mechanism. We extend our method to real-world scenarios by printing and placing generated patches in physical environments, demonstrating its practical implications. Extensive experiments show that our scene-coherent adversarial text successfully misleads state-of-the-art LVLMs, including ChatGPT-4o, even after capturing new images of physical setups. Our evaluations demonstrate a significant increase in attack success rates while maintaining visual naturalness and contextual appropriateness. This work highlights vulnerabilities in current vision-language models to sophisticated, scene-coherent adversarial attacks and provides insights into potential defense mechanisms.
Brief Review
SceneTAP targets the vulnerability of vision-language models (VLMs) to typographic adversarial attacks. The paper presents a comprehensive framework that embeds adversarial text seamlessly into images, misleading VLMs while keeping the text coherent with the surrounding visual content.
What sets SceneTAP apart is its combination of large language models with adversarial planning. Using these strategies, the authors generate attacks that are both effective and visually natural. Extensive experiments across different datasets, including physical-world setups, demonstrate the method's robustness in a variety of scenes. Overall, the paper marks a significant advance in adversarial attacks on VLMs and provides a valuable tool for researchers and practitioners.
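To make the three-stage pipeline (scene understanding, adversarial planning, seamless integration) concrete, here is a heavily simplified Python sketch. The StubAgent and StubRenderer classes are hypothetical stand-ins for the paper's multimodal LLM agent and diffusion-based text renderer; only the control flow mirrors the description in the abstract.

```python
from dataclasses import dataclass
from typing import Any, Dict

@dataclass
class Placement:
    text: str        # adversarial string to render
    region: tuple    # (x, y, w, h) target location in the image
    style: str       # font/colour guidance so the text looks natural

class StubAgent:
    """Placeholder for the LLM-based planner (hypothetical interface)."""
    def describe(self, image: Any) -> str:
        return "a kitchen scene with a whiteboard above the counter"

    def plan(self, scene_report: str, goal: str) -> Dict[str, Any]:
        return {"text": goal, "region": (40, 20, 200, 60), "style": "black marker"}

class StubRenderer:
    """Placeholder for a local diffusion-based text renderer."""
    def inpaint(self, image: Any, region: tuple, text: str, style: str) -> Any:
        return {"image": image, "patch": (region, text, style)}

def scenetap_attack(image: Any, target_answer: str,
                    agent: StubAgent, renderer: StubRenderer) -> Any:
    # Stage 1: scene understanding -- what is in the image and where text could live.
    report = agent.describe(image)
    # Stage 2: adversarial planning -- what to write, where, and in which style.
    plan = agent.plan(report, goal=target_answer)
    placement = Placement(plan["text"], plan["region"], plan["style"])
    # Stage 3: seamless integration -- render the text into the chosen region.
    return renderer.inpaint(image, placement.region, placement.text, placement.style)

# Toy run with stubs standing in for the real agent and renderer.
print(scenetap_attack(image="photo.jpg", target_answer="EXIT",
                      agent=StubAgent(), renderer=StubRenderer()))
```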
4. MOSABench: Multi-Object Sentiment Analysis Benchmark for Evaluating Multimodal Large Language Models Understanding of Complex Image
Authors: Shezheng Song, Chengxiang He, Shasha Li, Shan Zhao, Chengyu Wang, Tianwei Yan, Xiaopeng Li, Qian Wan, Jun Ma, Jie Yu, Xiaoguang Mao
https://arxiv.org/abs/2412.00060
Abstract
Multimodal large language models (MLLMs) have shown remarkable progress in high-level semantic tasks such as visual question answering, image captioning, and emotion recognition. However, despite advancements, there remains a lack of standardized benchmarks for evaluating MLLMs' performance in multi-object sentiment analysis, a key task in semantic understanding. To address this gap, we introduce MOSABench, a novel evaluation dataset designed specifically for multi-object sentiment analysis. MOSABench includes approximately 1,000 images with multiple objects, requiring MLLMs to independently assess the sentiment of each object, thereby reflecting real-world complexities. Key innovations in MOSABench include distance-based target annotation, post-processing for evaluation to standardize outputs, and an improved scoring mechanism. Our experiments reveal notable limitations in current MLLMs: while some models, like mPLUG-owl and Qwen-VL2, demonstrate effective attention to sentiment-relevant features, others exhibit scattered focus and performance declines, especially as the spatial distance between objects increases. This research underscores the need for MLLMs to enhance accuracy in complex, multi-object sentiment analysis tasks and establishes MOSABench as a foundational tool for advancing sentiment analysis capabilities in MLLMs.
Brief Review
This paper evaluates the multi-object sentiment analysis capability of multimodal large language models (MLLMs). It provides roughly 1,000 annotated images on which MLLMs must judge the sentiment of each object individually. The claimed innovations are a distance-based target annotation scheme and a new scoring mechanism, although their originality and impact are open to question. The key contribution is addressing the lack of a standardized benchmark for multi-object sentiment analysis with MLLMs. The dataset covers a range of realistic, complex scenes and is a valuable resource for future research. The analysis of how the spatial distance between objects affects sentiment analysis performance is also forward-looking and may inform model design. Overall, MOSABench fills a gap in existing evaluation standards and, through its diverse samples, shows the practical value of current models, providing a useful reference for the field. That said, whether these innovations bring substantial improvements, and how the benchmark compares with existing ones, still needs further study.
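The benchmark relies on post-processing free-form model outputs into canonical sentiment labels before scoring each object. The snippet below is a hedged illustration of that idea; the label set, synonym table, and regular expression are our assumptions and not MOSABench's published protocol.

```python
import re
from typing import Dict

SENTIMENTS = ("positive", "neutral", "negative")

def normalize_answer(raw: str) -> Dict[str, str]:
    """Map a free-form response like 'dog: happy, man: angry' to canonical labels."""
    synonyms = {"happy": "positive", "joyful": "positive",
                "angry": "negative", "sad": "negative",
                "calm": "neutral", "indifferent": "neutral"}
    result = {}
    for obj, word in re.findall(r"(\w+)\s*[:\-]\s*(\w+)", raw.lower()):
        label = synonyms.get(word, word)
        if label in SENTIMENTS:
            result[obj] = label
    return result

def score_sample(prediction: str, gold: Dict[str, str]) -> float:
    """Fraction of annotated objects whose sentiment the model got right."""
    pred = normalize_answer(prediction)
    correct = sum(pred.get(obj) == label for obj, label in gold.items())
    return correct / len(gold)

# Toy usage with a hypothetical annotation.
gold = {"dog": "positive", "man": "negative"}
print(score_sample("dog: happy, man: calm", gold))   # 0.5
```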
5. SOLAMI: Social Vision-Language-Action Modeling for Immersive Interaction with 3D Autonomous Characters
Authors: Jianping Jiang, Weiye Xiao, Zhengyu Lin, Huaizhong Zhang, Tianxiang Ren, Yang Gao, Zhiqian Lin, Zhongang Cai, Lei Yang, Ziwei Liu
https://arxiv.org/abs/2412.00174
Abstract
Human beings are social animals. How to equip 3D autonomous characters with similar social intelligence that can perceive, understand and interact with humans remains an open yet fundamental problem. In this paper, we introduce SOLAMI, the first end-to-end Social vision-Language-Action (VLA) Modeling framework for Immersive interaction with 3D autonomous characters. Specifically, SOLAMI builds 3D autonomous characters from three aspects: 1) Social VLA Architecture: We propose a unified social VLA framework to generate multimodal responses (speech and motion) based on the user’s multimodal input to drive the character for social interaction. 2) Interactive Multimodal Data: We present SynMSI, a synthetic multimodal social interaction dataset generated by an automatic pipeline using only existing motion datasets to address the issue of data scarcity. 3) Immersive VR Interface: We develop a VR interface that enables users to immersively interact with these characters driven by various architectures. Extensive quantitative experiments and user study demonstrate that our framework leads to more precise and natural character responses (in both speech and motion) that align with user expectations with lower latency.
Brief Review
This paper presents SOLAMI, a social vision-language-action modeling framework for immersive interaction with 3D autonomous characters: from the user's multimodal input it generates both motion and speech responses to drive the character. The end-to-end model is supported by SynMSI, a synthetic dataset introduced to train it.
Notably, SynMSI is a significant contribution toward overcoming data scarcity in multimodal interaction. The proposed VR interface also demonstrates the practical prospects of the SOLAMI model. In summary, the SOLAMI framework not only addresses key problems in human-character interaction but also offers a new perspective and a practical foundation for future research.
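As a rough illustration of what a unified vision-language-action backbone with speech and motion heads might look like, here is a toy PyTorch module. The tokenizers, vocabulary sizes, pooling, and two-head design are assumptions made for this sketch; SOLAMI's actual architecture and training setup differ.

```python
import torch
import torch.nn as nn

class SocialVLASketch(nn.Module):
    """Toy sketch: map the user's speech and motion tokens to response logits."""

    def __init__(self, speech_vocab=1024, motion_vocab=512, d_model=256, n_layers=2):
        super().__init__()
        # User speech and motion are assumed discretized upstream (e.g. by VQ-style tokenizers).
        self.speech_emb = nn.Embedding(speech_vocab, d_model)
        self.motion_emb = nn.Embedding(motion_vocab, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=n_layers)
        # Two heads decode the character's response in both modalities.
        self.speech_head = nn.Linear(d_model, speech_vocab)
        self.motion_head = nn.Linear(d_model, motion_vocab)

    def forward(self, speech_tokens, motion_tokens):
        x = torch.cat([self.speech_emb(speech_tokens),
                       self.motion_emb(motion_tokens)], dim=1)   # (B, Ts+Tm, d)
        h = self.backbone(x)
        pooled = h.mean(dim=1)                                    # crude summary of the turn
        return self.speech_head(pooled), self.motion_head(pooled)

# Toy usage: one user turn of 20 speech tokens and 30 motion tokens.
model = SocialVLASketch()
speech = torch.randint(0, 1024, (1, 20))
motion = torch.randint(0, 512, (1, 30))
speech_logits, motion_logits = model(speech, motion)
print(speech_logits.shape, motion_logits.shape)   # (1, 1024) (1, 512)
```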
We welcome your valuable suggestions in the comments, including but not limited to:
Pointing out shortcomings of the paper reviews in this post! Sharing recent papers that deserve a recommendation, along with your reasons!
END