Paper Sharing | Recent Research Progress on Multimodal Large Models
LMSeg: Unleashing the Power of Large-Scale Models for Open-Vocabulary Semantic Segmentation
Multi-Label Contrastive Learning: A Comprehensive Study
SceneTAP: Scene-Coherent Typographic Adversarial Planner against Vision-Language Models in Real-World Environments
MOSABench: Multi-Object Sentiment Analysis Benchmark for Evaluating Multimodal Large Language Models Understanding of Complex Image
SOLAMI: Social Vision-Language-Action Modeling for Immersive Interaction with 3D Autonomous Characters
1.LMSeg: Unleashing the Power of Large-Scale Models for Open-Vocabulary Semantic Segmentation
Authors: Huadong Tang, Youpeng Zhao, Yan Huang, Min Xu, Jun Wang, Qiang Wu
It is widely agreed that open-vocabulary-based approaches outperform classical closed-set training solutions for recognizing unseen objects in images for semantic segmentation. Existing open-vocabulary approaches leverage vision-language models, such as CLIP, to align visual features with rich semantic features acquired through pre-training on large-scale vision-language datasets. However, the text prompts employed in these methods are short phrases based on fixed templates, failing to capture comprehensive object attributes. Moreover, while the CLIP model excels at exploiting image-level features, it is less effective at pixel-level representation, which is crucial for semantic segmentation tasks. In this work, we propose to alleviate the above-mentioned issues by leveraging multiple large-scale models to enhance the alignment between fine-grained visual features and enriched linguistic features. Specifically, our method employs large language models (LLMs) to generate enriched language prompts with diverse visual attributes for each category, including color, shape/size, and texture/material. Additionally, for enhanced visual feature extraction, the SAM model is adopted as a supplement to the CLIP visual encoder through a proposed learnable weighted fusion strategy. Built upon these techniques, our method, termed LMSeg, achieves state-of-the-art performance across all major open-vocabulary segmentation benchmarks. The code will be made available soon.
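This paper on open-vocabulary semantic segmentation (OVSS) addresses the challenges that existing methods face in the open-vocabulary setting. The proposed LMSeg framework uses large language models (LLMs) and the SAM model to strengthen the alignment between pixel-level visual features and text features, targeting two weaknesses of existing methods: text prompts built from fixed templates, and weak pixel-level representation. According to the reported experiments, LMSeg improves notably over prior work and achieves better performance on multiple benchmark datasets. Its main novelty lies in using large-scale language models to enrich the text prompts, which is a new and effective solution. These results are significant for advancing research in this area. Overall, the paper presents an effective method and demonstrates its potential for practical applications.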
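To make the "learnable weighted fusion" idea concrete, here is a minimal sketch. It is not the authors' implementation (the abstract says the code is not yet released). It assumes the fusion is a softmax-weighted sum of CLIP and SAM feature vectors of equal dimension, controlled by two learnable logits; the function names and the two-logit parameterization are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_fusion(clip_feat, sam_feat, fusion_logits):
    """Fuse two feature vectors with softmax-normalized weights.

    Assumption: `fusion_logits` stands in for two learnable parameters
    that a training loop would update; here they are plain floats.
    """
    if len(clip_feat) != len(sam_feat):
        raise ValueError("feature vectors must have the same dimension")
    w_clip, w_sam = softmax(fusion_logits)
    return [w_clip * c + w_sam * s for c, s in zip(clip_feat, sam_feat)]


# Equal logits give equal weights, so the result is the element-wise mean.
fused = weighted_fusion([1.0, 0.0], [0.0, 1.0], [0.0, 0.0])
```

Normalizing the logits with a softmax keeps the two weights positive and summing to one, so training can shift emphasis between the encoders without changing the overall feature scale.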
2.Multi-Label Contrastive Learning: A Comprehensive Study
Authors: Alexandre Audibert, Aurélien Gauffre, Massih-Reza Amini
Multi-label classification, which involves assigning multiple labels to a single input, has emerged as a key area in both research and industry due to its wide-ranging applications. Designing effective loss functions is crucial for optimizing deep neural networks for this task, as they significantly influence model performance and efficiency. Traditional loss functions, which often maximize likelihood under the assumption of label independence, may struggle to capture complex label relationships. Recent research has turned to supervised contrastive learning, a method that aims to create a structured representation space by bringing similar instances closer together and pushing dissimilar ones apart. Although contrastive learning offers a promising approach, applying it to multi-label classification presents unique challenges, particularly in managing label interactions and data structure. In this paper, we conduct an in-depth study of contrastive learning loss for multi-label classification across diverse settings. These include datasets with both small and large numbers of labels, datasets with varying amounts of training data, and applications in both computer vision and natural language processing. Our empirical results indicate that the promising outcomes of contrastive learning are attributable not only to the consideration of label interactions but also to the robust optimization scheme of the contrastive loss. Furthermore, while the supervised contrastive loss function faces challenges with datasets containing a small number of labels and ranking-based metrics, it demonstrates excellent performance, particularly in terms of Macro-F1, on datasets with a large number of labels. Finally, through gradient analysis of standard contrastive loss in multi-label classification, along with insights from previous work, we develop a new competitive loss function that removes certain gradient components to prevent undesirable behavior and improve performance.
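As a rough illustration of what a supervised contrastive loss looks like in the multi-label case, the sketch below weights each pair by the Jaccard overlap of its label sets. This weighting is an illustrative assumption; it is not the new loss the paper proposes, whose exact form the abstract does not give. All names are hypothetical, and embeddings are assumed to be non-zero vectors.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def jaccard(a, b):
    """Overlap between two label sets, in [0, 1]."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def multilabel_supcon_loss(embeddings, label_sets, temperature=0.1):
    """Contrastive loss where pair (i, j) is weighted by label overlap."""
    n = len(embeddings)
    if n < 2:
        return 0.0
    z = [normalize(e) for e in embeddings]
    per_anchor = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        logits = [dot(z[i], z[j]) / temperature for j in others]
        # log-sum-exp over all other samples (the softmax denominator)
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        weights = [jaccard(label_sets[i], label_sets[j]) for j in others]
        total_w = sum(weights)
        if total_w == 0:
            continue  # anchor shares no label with any other sample
        weighted = sum(w * (l - log_denom) for w, l in zip(weights, logits))
        per_anchor.append(-weighted / total_w)
    return sum(per_anchor) / len(per_anchor) if per_anchor else 0.0


labels = [{"cat"}, {"cat"}, {"dog"}, {"dog"}]
# Same-label samples are close together here...
loss_clustered = multilabel_supcon_loss(
    [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]], labels)
# ...and far apart here, so the loss should be higher.
loss_mixed = multilabel_supcon_loss(
    [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]], labels)
```

The loss is small when samples that share labels sit close together in the embedding space and large when they are scattered, which is the "structured representation space" the abstract describes.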
3.SceneTAP: Scene-Coherent Typographic Adversarial Planner against Vision-Language Models in Real-World Environments
Authors: Yue Cao, Yun Xing, Jie Zhang, Di Lin, Tianwei Zhang, Ivor Tsang, Yang Liu, Qing Guo
Large vision-language models (LVLMs) have shown remarkable capabilities in interpreting visual content. While existing works demonstrate these models’ vulnerability to deliberately placed adversarial texts, such texts are often easily identifiable as anomalous. In this paper, we present the first approach to generate scene-coherent typographic adversarial attacks that mislead advanced LVLMs while maintaining visual naturalness through the capability of the LLM-based agent. Our approach addresses three critical questions: what adversarial text to generate, where to place it within the scene, and how to integrate it seamlessly. We propose a training-free, multi-modal LLM-driven scene-coherent typographic adversarial planning (SceneTAP) that employs a three-stage process: scene understanding, adversarial planning, and seamless integration. The SceneTAP utilizes chain-of-thought reasoning to comprehend the scene, formulate effective adversarial text, strategically plan its placement, and provide detailed instructions for natural integration within the image. This is followed by a scene-coherent TextDiffuser that executes the attack using a local diffusion mechanism. We extend our method to real-world scenarios by printing and placing generated patches in physical environments, demonstrating its practical implications. Extensive experiments show that our scene-coherent adversarial text successfully misleads state-of-the-art LVLMs, including ChatGPT-4o, even after capturing new images of physical setups. Our evaluations demonstrate a significant increase in attack success rates while maintaining visual naturalness and contextual appropriateness. This work highlights vulnerabilities in current vision-language models to sophisticated, scene-coherent adversarial attacks and provides insights into potential defense mechanisms.
4.MOSABench: Multi-Object Sentiment Analysis Benchmark for Evaluating Multimodal Large Language Models Understanding of Complex Image
Authors: Shezheng Song, Chengxiang He, Shasha Li, Shan Zhao, Chengyu Wang, Tianwei Yan, Xiaopeng Li, Qian Wan, Jun Ma, Jie Yu, Xiaoguang Mao
Multimodal large language models (MLLMs) have shown remarkable progress in high-level semantic tasks such as visual question answering, image captioning, and emotion recognition. However, despite advancements, there remains a lack of standardized benchmarks for evaluating MLLM performance in multi-object sentiment analysis, a key task in semantic understanding. To address this gap, we introduce MOSABench, a novel evaluation dataset designed specifically for multi-object sentiment analysis. MOSABench includes approximately 1,000 images with multiple objects, requiring MLLMs to independently assess the sentiment of each object, thereby reflecting real-world complexities. Key innovations in MOSABench include distance-based target annotation, post-processing for evaluation to standardize outputs, and an improved scoring mechanism. Our experiments reveal notable limitations in current MLLMs: while some models, like mPLUG-owl and Qwen-VL2, demonstrate effective attention to sentiment-relevant features, others exhibit scattered focus and performance declines, especially as the spatial distance between objects increases. This research underscores the need for MLLMs to enhance accuracy in complex, multi-object sentiment analysis tasks and establishes MOSABench as a foundational tool for advancing sentiment analysis capabilities in MLLMs.
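The evaluation flow described above (standardize free-form model outputs, then score per object and compare across object distances) might look roughly like the sketch below. The record fields, the distance buckets "close" and "far", and the keyword-matching rule are all hypothetical; the abstract does not specify MOSABench's actual post-processing or scoring.

```python
SENTIMENTS = ("positive", "negative", "neutral")


def normalize_sentiment(raw_output):
    """Map a free-form model answer to one sentiment label, or None.

    Assumption: simple keyword matching. It would misread phrases such
    as "not positive", so a real benchmark needs stricter parsing.
    """
    text = raw_output.strip().lower()
    for label in SENTIMENTS:
        if label in text:
            return label
    return None


def accuracy_by_distance(records):
    """Per-object accuracy, grouped by the annotated distance bucket."""
    counts = {}  # bucket -> (correct, total)
    for rec in records:
        pred = normalize_sentiment(rec["model_output"])
        correct, total = counts.get(rec["distance"], (0, 0))
        counts[rec["distance"]] = (correct + int(pred == rec["gold"]), total + 1)
    return {bucket: c / t for bucket, (c, t) in counts.items()}


# Each record is one object in one image (hypothetical schema).
records = [
    {"distance": "close", "gold": "positive", "model_output": "Positive."},
    {"distance": "far", "gold": "negative", "model_output": "The sentiment is negative"},
    {"distance": "far", "gold": "neutral", "model_output": "I cannot tell"},
]
scores = accuracy_by_distance(records)
```

Grouping by distance is what would expose the trend the authors report, where some models lose accuracy as the objects in an image get farther apart.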
5.SOLAMI: Social Vision-Language-Action Modeling for Immersive Interaction with 3D Autonomous Characters
Authors: Jianping Jiang, Weiye Xiao, Zhengyu Lin, Huaizhong Zhang, Tianxiang Ren, Yang Gao, Zhiqian Lin, Zhongang Cai, Lei Yang, Ziwei Liu
Human beings are social animals. How to equip 3D autonomous characters with similar social intelligence that can perceive, understand and interact with humans remains an open yet fundamental problem. In this paper, we introduce SOLAMI, the first end-to-end Social vision-Language-Action (VLA) Modeling framework for Immersive interaction with 3D autonomous characters. Specifically, SOLAMI builds 3D autonomous characters from three aspects: 1) Social VLA Architecture: We propose a unified social VLA framework to generate multimodal response (speech and motion) based on the user’s multimodal input to drive the character for social interaction. 2) Interactive Multimodal Data: We present SynMSI, a synthetic multimodal social interaction dataset generated by an automatic pipeline using only existing motion datasets to address the issue of data scarcity. 3) Immersive VR Interface: We develop a VR interface that enables users to immersively interact with these characters driven by various architectures. Extensive quantitative experiments and a user study demonstrate that our framework leads to more precise and natural character responses (in both speech and motion) that align with user expectations with lower latency.
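Feel free to point out shortcomings in this post's paper commentary, or to share recent papers you find more worth recommending, along with your reasons!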