Paper Roundup | Recent Advances in Multimodal Large Models
From the 70 papers published between 2024-12-05 and 2024-12-16, we have selected 5 outstanding works to share with our readers.
1. Dynamic-VLM: Simple Dynamic Visual Token Compression for VideoLLM
2. Evaluating Model Perception of Color Illusions in Photorealistic Scenes
3. Synthetic Vision: Training Vision-Language Models to Understand Physics
4. Causal Graphical Models for Vision-Language Compositional Understanding
5. RSUniVLM: A Unified Vision Language Model for Remote Sensing via Granularity-oriented Mixture of Experts
1. Dynamic-VLM: Simple Dynamic Visual Token Compression for VideoLLM
Authors: Han Wang, Yuxiang Nie, Yongjie Ye, Deng GuanYu, Yanjie Wang, Shuai Li, Haiyang Yu, Jinghui Lu, Can Huang
https://arxiv.org/abs/2412.09530
Abstract
The application of Large Vision-Language Models (LVLMs) for analyzing images and videos is an exciting and rapidly evolving field. In recent years, we've seen significant growth in high-quality image-text datasets for fine-tuning image understanding, but there is still a lack of comparable datasets for videos. Additionally, many VideoLLMs are extensions of single-image VLMs, which may not efficiently handle the complexities of longer videos. In this study, we introduce a large-scale synthetic dataset created from proprietary models, using carefully designed prompts to tackle a wide range of questions. We also explore a dynamic visual token compression architecture that strikes a balance between computational efficiency and performance. Our proposed Dynamic-VLM achieves state-of-the-art results across various video tasks and shows impressive generalization, setting new baselines in multi-image understanding. Notably, Dynamic-VLM delivers an absolute improvement of 2.7% over LLaVA-OneVision on VideoMME and 10.7% on MuirBench.
Brief Review
This paper on Dynamic-VLM, a dynamic visual token compression mechanism, examines how to optimize the video processing pipeline for better computational efficiency and performance. It proposes a new dynamic visual token compression architecture aimed at the challenges of processing long videos. In addition, the authors build a synthetic video-text dataset, generated from proprietary models with carefully designed prompts, to strengthen training for video understanding tasks.
The main contribution is a novel visual token compression method whose effectiveness is demonstrated across multiple benchmarks. The compression mechanism substantially reduces the time cost of video processing, making large-scale video processing practical. At the same time, the large synthetic dataset lets the researchers probe the relationship between video and language more deeply, pushing the field forward.
In summary, the paper makes an innovative contribution at the conceptual level and also delivers strong results in practice. It highlights key techniques for efficient video processing, which matter for speeding up pipelines and improving the utilization of compute. The proposed synthetic dataset also offers a valuable data source for future work, helping to extend the boundaries of video understanding and semantic analysis. It is an important result in the current video understanding landscape and deserves close attention from both academia and industry.
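To make the compression idea concrete, here is a minimal sketch of one way a dynamic token budget could work: per-frame features are pooled more aggressively as the number of frames grows, so the total visual token count stays bounded. The fixed budget, the square pooling grid, and the use of adaptive average pooling are our own illustrative assumptions, not the paper's exact design.

```python
# Illustrative sketch only: pool per-frame visual tokens more aggressively as the
# number of frames grows, keeping the total token count within a fixed budget.
import math
import torch
import torch.nn.functional as F

def compress_visual_tokens(frame_tokens: torch.Tensor, token_budget: int = 2048) -> torch.Tensor:
    """frame_tokens: (num_frames, grid_h, grid_w, dim) patch features from a vision encoder."""
    num_frames, grid_h, grid_w, dim = frame_tokens.shape
    per_frame = max(1, token_budget // num_frames)        # tokens each frame may keep
    side = max(1, int(math.sqrt(per_frame)))              # target grid side after pooling
    x = frame_tokens.permute(0, 3, 1, 2)                  # (frames, dim, h, w) for pooling
    x = F.adaptive_avg_pool2d(x, output_size=(side, side))
    x = x.permute(0, 2, 3, 1).reshape(num_frames, side * side, dim)
    return x                                              # (frames, tokens_per_frame, dim)

# A short clip keeps a 16x16 grid per frame; a long clip is pooled down to 4x4 per frame.
short_clip = compress_visual_tokens(torch.randn(8, 24, 24, 1024))
long_clip = compress_visual_tokens(torch.randn(128, 24, 24, 1024))
print(short_clip.shape, long_clip.shape)  # (8, 256, 1024) and (128, 16, 1024)
```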
2. Evaluating Model Perception of Color Illusions in Photorealistic Scenes
Authors: Lingjun Mao, Zineng Tang, Alane Suhr
https://arxiv.org/abs/2412.06184
Abstract
We study the perception of color illusions by vision-language models. Color illusion, where a person's visual system perceives color differently from actual color, is well-studied in human vision. However, it remains underexplored whether vision-language models (VLMs), trained on large-scale human data, exhibit similar perceptual biases when confronted with such color illusions. We propose an automated framework for generating color illusion images, resulting in RCID (Realistic Color Illusion Dataset), a dataset of 19,000 realistic illusion images. Our experiments show that all studied VLMs exhibit perceptual biases similar to human vision. Finally, we train a model to distinguish both human perception and actual pixel differences.
Brief Review
The paper studies how vision-language models (VLMs) perceive color illusions and proposes an automated framework for generating illusion images, yielding the Realistic Color Illusion Dataset (RCID). The experiments find that VLMs respond to color illusions in ways similar to human vision, and the study offers detailed insight into how sensitive VLMs are to such illusions.
Its key contribution is a large-scale resource for studying color illusions, which is essential for understanding what vision models actually perceive. By contrasting illusion and non-illusion images, the researchers expose how model behavior shifts when an illusion is present. These findings provide valuable material for a deeper look at the perceptual abilities of vision models.
In summary, the paper enriches the study of color illusions at the theoretical level and also matters for practical applications: it gives future researchers a rich data source and practitioners a useful reference. It is a valuable piece of work that deserves broad discussion and citation.
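As one way to read these findings, here is a hedged sketch of a simple probing protocol: compare a model's answer against both the pixel-level ground truth and the human-perception label, so that an answer that is objectively wrong but human-like signals an illusion-style bias. RCID's actual evaluation may differ; `query_vlm`, the patch coordinates, and the label fields are hypothetical placeholders.

```python
# Illustrative probe, not RCID's released code: does the model answer like the pixels
# or like a human observer of the illusion?
from PIL import Image
import numpy as np

def mean_intensity(image: Image.Image, box: tuple) -> float:
    """Average pixel intensity over a crop box (left, upper, right, lower)."""
    return float(np.asarray(image.crop(box).convert("RGB"), dtype=np.float32).mean())

def probe_illusion(image_path: str, box_a: tuple, box_b: tuple,
                   human_label: str, query_vlm) -> dict:
    image = Image.open(image_path)
    # Ground truth from pixels: which patch is actually brighter on average.
    pixel_label = "A" if mean_intensity(image, box_a) > mean_intensity(image, box_b) else "B"
    # `query_vlm` stands in for any chat-style VLM call that returns "A" or "B".
    model_label = query_vlm(image, "Which patch is brighter, A or B? Answer with one letter.")
    return {
        "matches_pixels": model_label == pixel_label,   # objective correctness
        "matches_humans": model_label == human_label,   # human-like illusion bias
    }
```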
3. Synthetic Vision: Training Vision-Language Models to Understand Physics
Authors: Vahid Balazadeh, Mohammadmehdi Ataei, Hyunmin Cheong, Amir Hosein Khasahmadi, Rahul G. Krishnan
https://arxiv.org/abs/2412.08619
Abstract
Physical reasoning, which involves the interpretation, understanding, and prediction of object behavior in dynamic environments, remains a significant challenge for current Vision-Language Models (VLMs). In this work, we propose two methods to enhance VLMs' physical reasoning capabilities using simulated data. First, we fine-tune a pre-trained VLM using question-answer (QA) pairs generated from simulations relevant to physical reasoning tasks. Second, we introduce Physics Context Builders (PCBs), specialized VLMs fine-tuned to create scene descriptions enriched with physical properties and processes. During physical reasoning tasks, these PCBs can be leveraged as context to assist a Large Language Model (LLM) to improve its performance. We evaluate both of our approaches using multiple benchmarks, including a new stability detection QA dataset called Falling Tower, which includes both simulated and real-world scenes, and CLEVRER. We demonstrate that a small QA fine-tuned VLM can significantly outperform larger state-of-the-art foundational models. We also show that integrating PCBs boosts the performance of foundational LLMs on physical reasoning tasks. Using the real-world scenes from the Falling Tower dataset, we also validate the robustness of both approaches in Sim2Real transfer. Our results highlight the utility that simulated data can have in the creation of learning systems capable of advanced physical reasoning.
Brief Review
This paper targets the limited physical reasoning ability of current vision-language models (VLMs) and proposes two remedies. First, the authors fine-tune a VLM on question-answer pairs generated from simulations, improving its grasp of physical interactions; the strength of this approach is that it makes effective use of simulated data as a new route to stronger physical reasoning. Second, they introduce Physics Context Builders (PCBs), specialized VLMs fine-tuned to produce scene descriptions enriched with physical properties and processes; these descriptions are then used as context to help a large language model reason about the physical world.
The experiments show that both approaches bring clear gains on multiple benchmarks, notably on the new Falling Tower stability-detection dataset and on CLEVRER, and the real-world scenes in Falling Tower are further used to validate the robustness of both approaches under Sim2Real transfer.
Overall, the paper offers two innovative remedies for the physical reasoning gap in VLMs, with simulation-generated data and physics-aware scene descriptions as the main highlights. Their success confirms the value of the approach and suggests broad applicability, particularly in robotics and autonomous driving.
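A minimal sketch of the two-stage PCB idea as we read it follows: a fine-tuned VLM first writes a physics-aware scene description, which is then fed to an LLM as context for the reasoning question. `pcb_vlm` and `llm` are hypothetical callables, and the prompt wording is ours, not the paper's.

```python
# Illustrative two-stage pipeline: describe the physics, then reason over the description.
def answer_with_pcb(image, question: str, pcb_vlm, llm) -> str:
    # Stage 1: the Physics Context Builder produces a physics-rich scene description.
    scene = pcb_vlm(
        image,
        "Describe the scene: object positions, contacts, support relations, "
        "and whether the configuration looks stable.",
    )
    # Stage 2: a general LLM answers the question with that description as context.
    prompt = (
        f"Scene description:\n{scene}\n\n"
        f"Question: {question}\n"
        "Answer using the physical properties described above."
    )
    return llm(prompt)

# Usage: answer_with_pcb(img, "Will the tower of blocks fall over?", pcb_vlm, llm)
```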
4. Causal Graphical Models for Vision-Language Compositional Understanding
Authors: Fiorenzo Parascandolo, Nicholas Moratelli, Enver Sangineto, Lorenzo Baraldi, Rita Cucchiara
https://arxiv.org/abs/2412.09353
Abstract
Recent work has empirically shown that Vision-Language Models (VLMs) struggle to fully understand the compositional properties of human language, often modeling an image caption as a "bag of words". As a result, they perform poorly on compositional tasks that require a deeper understanding of the different entities of a sentence (subject, verb, etc.) and their mutual relationships. In this paper, we model the dependency relations among textual and visual tokens using a Causal Graphical Model (CGM), built using a dependency parser, and we train a decoder conditioned by the VLM visual encoder. Differently from standard autoregressive or parallel predictions, our decoder's generative process is partially ordered following the CGM structure. This structure encourages the decoder to learn only the main causal dependencies in a sentence while discarding spurious correlations. Using extensive experiments on five compositional benchmarks, we show that our method significantly outperforms all state-of-the-art compositional approaches by a large margin, and it also improves over methods trained on much larger datasets. Our code and our models are available at: https://github.com/aimagelab/COGT
Brief Review
This paper proposes using Causal Graphical Models (CGMs) to strengthen the compositional understanding of vision-language models (VLMs). The authors argue that ordering token prediction along the syntactic dependencies extracted by a dependency parser lets the decoder focus on the main causal relations in a sentence, which yields clear gains on multiple compositional benchmarks. The experiments show a substantial advantage over existing methods.
The main novelty lies in integrating CGMs into VLMs to tackle compositional understanding, with extensive experiments demonstrating the method's effectiveness. The researchers also single out a key limitation of current VLMs, namely their weak grasp of sentence structure, and concentrate on improving it. In summary, the paper opens a novel and challenging research direction and offers valuable ideas for future work.
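The sketch below shows our reading of the dependency-based partial order behind such a decoder: tokens are grouped by depth in the dependency tree, so each token would be predicted only after its syntactic head. This is an illustration rather than the released COGT code, and it assumes the spaCy model `en_core_web_sm` has been downloaded.

```python
# Illustrative partial order from a dependency parse (root first, then children level by level).
import spacy

nlp = spacy.load("en_core_web_sm")  # requires: python -m spacy download en_core_web_sm

def dependency_generation_order(caption: str) -> list:
    doc = nlp(caption)

    def depth(token):
        # Distance from the root of the dependency tree (spaCy marks the root as its own head).
        d = 0
        while token.head != token:
            token = token.head
            d += 1
        return d

    levels = {}
    for token in doc:
        levels.setdefault(depth(token), []).append(token.text)
    # Tokens at the same depth depend only on shallower tokens, so they could be
    # predicted together; spurious left-to-right correlations are not enforced.
    return [levels[d] for d in sorted(levels)]

print(dependency_generation_order("A brown dog chases a small white ball"))
# e.g. [['chases'], ['dog', 'ball'], ['A', 'brown', 'a', 'small', 'white']]
```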
5. RSUniVLM: A Unified Vision Language Model for Remote Sensing via Granularity-oriented Mixture of Experts
Authors: Xu Liu, Zhouhui Lian
https://arxiv.org/abs/2412.05679
Abstract
Remote Sensing Vision-Language Models (RS VLMs) have made significant progress in remote sensing (RS) image comprehension tasks. While performing well in multi-modal reasoning and multi-turn conversations, existing models lack pixel-level understanding and struggle with multi-image inputs. In this work, we propose RSUniVLM, a unified, end-to-end RS VLM designed for comprehensive vision understanding across multiple granularities, including image-level, region-level, and pixel-level tasks. RSUniVLM also performs effectively in multi-image analysis tasks, including change detection and change captioning. To enhance the model's ability to capture visual information across different levels without increasing model size, we design a novel architecture called Granularity-oriented Mixture of Experts. This approach constrains the model to about 1 billion parameters. We also construct a large-scale RS instruction-following dataset based on a variety of existing datasets in both the RS and general domains, encompassing tasks such as object localization, visual question answering, and semantic segmentation. Extensive experiments have been conducted to validate the superiority of the proposed RSUniVLM across various RS tasks. Code and model will be available at https://github.com/xuliu-cyber/RSUniVLM.
Brief Review
RSUniVLM is an important study on unified vision-language modeling for remote sensing. The paper proposes a new Granularity-oriented Mixture of Experts (G-MoE) architecture aimed at the challenges of the remote sensing domain, allowing a single framework to handle image-level, region-level, and pixel-level tasks. Its main strengths are a more efficient model with a small parameter budget (about 1 billion parameters) and a large-scale instruction-following dataset built for training. The authors also run extensive experiments demonstrating the model's effectiveness across a range of tasks. Overall, RSUniVLM makes a substantial contribution to the remote sensing field and offers a fresh perspective for future research.
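For intuition, here is a minimal sketch of what a granularity-oriented mixture of experts could look like: the task granularity (image, region, or pixel) selects the expert instead of a learned per-token router, so capacity is specialized without growing the model much. This is our illustrative reading, not the released RSUniVLM architecture.

```python
# Illustrative granularity-routed MoE layer: the task type picks the expert.
import torch
import torch.nn as nn

class GranularityMoE(nn.Module):
    def __init__(self, dim: int = 1024, hidden: int = 4096):
        super().__init__()
        self.experts = nn.ModuleDict({
            g: nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
            for g in ("image", "region", "pixel")
        })

    def forward(self, tokens: torch.Tensor, granularity: str) -> torch.Tensor:
        # `granularity` comes from the task (e.g. captioning / grounding / segmentation).
        return self.experts[granularity](tokens)

moe = GranularityMoE()
tokens = torch.randn(2, 196, 1024)
out = moe(tokens, granularity="pixel")  # route all tokens through the pixel-level expert
print(out.shape)                        # torch.Size([2, 196, 1024])
```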
We welcome your valuable suggestions in the comments, including but not limited to:
Pointing out weaknesses in the brief reviews in this post. Sharing recent papers you find more worth recommending, along with your reasons.
END