2024-12-25 Paper Sharing | Latest Advances in Multimodal Large Models

Digest   2024-12-25 10:34   Anhui


Paper Sharing | Research Advances in Multimodal Large Models


  1. Beyond End-to-End VLMs: Leveraging Intermediate Text Representations for Superior Flowchart Understanding
  2. Linguistic Features Extracted by GPT-4 Improve Alzheimer's Disease Detection based on Spontaneous Speech
  3. Retention Score: Quantifying Jailbreak Risks for Vision Language Models
  4. Transducer-Llama: Integrating LLMs into Streamable Transducer-based Speech Recognition
  5. Revisiting MLLMs: An In-Depth Analysis of Image Classification Abilities

1. Beyond End-to-End VLMs: Leveraging Intermediate Text Representations for Superior Flowchart Understanding

Authors: Junyi Ye, Ankan Dash, Wenpeng Yin, Guiling Wang



Flowcharts are typically presented as images, driving the trend of using vision language models (VLMs) for end-to-end flowchart understanding. However, two key challenges arise: (i) Limited controllability: users have minimal influence over the downstream task, as they can only modify input images, while the training of VLMs is often out of reach for most researchers. (ii) Lack of explainability: it is difficult to trace VLM errors to specific causes, such as failures in visual encoding or reasoning. We propose TEXTFLOW, addressing the aforementioned issues with two stages: (i) VISION TEXTUALIZER, which generates textual representations from flowchart images; and (ii) TEXTUAL REASONER, which performs question-answering based on the text representations. TEXTFLOW offers three key advantages: (i) users can select the type of text representations (e.g., GRAPHVIZ, MERMAID, PLANTUML), or further convert them into executable graph objects to call tools, enhancing performance and controllability; (ii) it improves explainability by helping to attribute errors more clearly to visual or textual processing components; and (iii) it promotes the modularization of the solution, such as allowing advanced LLMs to be used in the REASONER stage when VLMs underperform in end-to-end fashion. Experiments on the FlowVQA and FlowLearn benchmarks demonstrate TEXTFLOW’s state-of-the-art performance as well as its robustness. All code is publicly available.
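
To make the "executable graph object" idea in the abstract above more concrete, here is a minimal sketch in Python. It is not the authors' code: the Mermaid snippet, the regular expression, and the function names `parse_edges`, `node_ids`, and `reachable` are all assumptions made for illustration, and the parser handles only simple `A --> B` edges with optional `|label|` text. It shows how a text representation produced by a textualizer stage could be turned into a graph that answers structural questions with ordinary code rather than visual reasoning.

```python
import re
from collections import defaultdict, deque

# Hypothetical Mermaid-style text, standing in for the output of a
# textualizer stage. It is illustrative only and not taken from the paper.
MERMAID_TEXT = """
flowchart TD
    A[Start] --> B{Is input valid?}
    B -->|Yes| C[Process input]
    B -->|No| D[Show error]
    C --> E[End]
    D --> E
"""

# Matches "SRC[optional label] --> |optional edge label| DST".
EDGE = re.compile(
    r"^\s*(\w+)(?:[\[\{\(].*?[\]\}\)])?\s*-->\s*(?:\|(.*?)\|\s*)?(\w+)"
)


def parse_edges(text):
    """Build an adjacency list {src: [(dst, edge_label), ...]} from the text."""
    graph = defaultdict(list)
    for line in text.splitlines():
        match = EDGE.match(line)
        if match:
            src, label, dst = match.groups()
            graph[src].append((dst, label))
    return graph


def node_ids(graph):
    """Collect every node id that appears as a source or a destination."""
    nodes = set(graph)
    for targets in graph.values():
        nodes.update(dst for dst, _ in targets)
    return nodes


def reachable(graph, start, goal):
    """Breadth-first search: is there a directed path from start to goal?"""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return True
        for nxt, _ in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


if __name__ == "__main__":
    g = parse_edges(MERMAID_TEXT)
    print("nodes:", sorted(node_ids(g)))
    print("A reaches E:", reachable(g, "A", "E"))
    print("C reaches D:", reachable(g, "C", "D"))
```

Once the flowchart is a graph object, questions such as "how many nodes are there" or "can step C lead to step D" become deterministic lookups, which is one way to read the controllability and explainability claims in the abstract.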



2. Linguistic Features Extracted by GPT-4 Improve Alzheimer's Disease Detection based on Spontaneous Speech

Authors: Jonathan Heitz, Gerold Schneider, Nicolas Langer



Alzheimer's Disease (AD) is a significant and growing public health concern. Investigating alterations in speech and language patterns offers a promising path towards cost-effective and non-invasive early detection of AD on a large scale. Large language models (LLMs), such as GPT, have enabled powerful new possibilities for semantic text analysis. In this study, we leverage GPT-4 to extract five semantic features from transcripts of spontaneous patient speech. The features capture known symptoms of AD, but they are difficult to quantify effectively using traditional methods of computational linguistics. We demonstrate the clinical significance of these features and further validate one of them (“Word-Finding Difficulties”) against a proxy measure and human raters. When combined with established linguistic features and a Random Forest classifier, the GPT-derived features significantly improve the detection of AD. Our approach proves effective for both manually transcribed and automatically generated transcripts, representing a novel and impactful use of recent advancements in LLMs for AD speech analysis.



3. Retention Score: Quantifying Jailbreak Risks for Vision Language Models

Authors: Zaitang Li, Pin-Yu Chen, Tsung-Yi Ho



The emergence of Vision-Language Models (VLMs) marks a significant advancement in integrating computer vision with Large Language Models (LLMs) to enhance multi-modal machine learning capabilities. However, this progress has made VLMs vulnerable to advanced adversarial attacks, raising concerns about their reliability. The objective of this paper is to assess the resilience of VLMs against jailbreak attacks that can compromise model safety compliance and result in harmful outputs. To evaluate a VLM's ability to maintain robustness against adversarial input perturbations, we propose a novel metric called Retention Score. The Retention Score is a multi-modal evaluation metric that includes Retention-I and Retention-T scores for quantifying jailbreak risks in the visual and textual components of VLMs. Our process involves generating synthetic image-text pairs using a conditional diffusion model. These pairs are then predicted for toxicity scores by the VLM alongside a toxicity judgment classifier. By calculating the margin in toxicity scores, we can quantify the robustness of the VLM in an attack-agnostic manner. Our work has four main contributions. First, we demonstrate that the Retention Score can serve as a certified robustness metric. Second, we show that most VLMs with visual components are less robust against jailbreak attacks than their corresponding plain VLMs. Additionally, we evaluate black-box VLM APIs and find that security settings in Google Gemini significantly affect their scores and robustness. Moreover, the robustness of GPT4V is similar to the medium settings of Gemini. Finally, our approach offers a time-efficient alternative to existing adversarial attack methods and provides consistent model robustness rankings when evaluated on VLMs including MiniGPT-4, InstructBLIP, and LLaVA.


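This paper proposes a new framework called Retention Score, which aims to assess the jailbreak risk of Vision-Language Models (VLMs). The method introduces a multimodal evaluation metric that uses synthetic image-text pairs to quantify a VLM's robustness to adversarial input perturbations. Its contributions are a theoretical foundation for the Retention Score and extensive experiments demonstrating its effectiveness across different VLMs.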

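The core of the paper is a new evaluation criterion, the Retention Score, which offers a new perspective on VLM safety. The authors also discuss how to use the Retention Score to evaluate different VLMs, and thereby better understand and respond to the threats these models face. This approach may become an important direction for future research, since it may offer a more efficient and flexible way to detect and defend against VLM safety problems. Overall, the paper gives us a new lens on the safety of VLMs and the challenges they face, and it merits further study.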
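
The abstract mentions "calculating the margin in toxicity scores" without giving a formula, so the sketch below is only one plausible reading and not the paper's definition of the Retention Score. The 0.5 threshold, the clipping at zero, the averaging, and the names `toxicity_margin` and `mean_margin` are all assumptions. It illustrates the general idea of a margin-based score: the further a model's toxicity scores stay below the "toxic" threshold across many synthetic inputs, the higher the score.

```python
def toxicity_margin(score, threshold=0.5):
    """How far a toxicity score sits below the threshold, floored at 0.

    Assumption: scores lie in [0, 1] and anything at or above the
    threshold counts as a harmful output, so its margin is 0.
    """
    return max(threshold - score, 0.0)


def mean_margin(scores, threshold=0.5):
    """Average margin over a batch of toxicity scores.

    Assumption: each score is a toxicity judgment for the model's
    response to one synthetic image-text pair.
    """
    if not scores:
        raise ValueError("need at least one toxicity score")
    total = sum(toxicity_margin(s, threshold) for s in scores)
    return total / len(scores)


if __name__ == "__main__":
    # Hypothetical scores for two models on the same synthetic inputs.
    model_a = [0.1, 0.3, 0.7]
    model_b = [0.4, 0.6, 0.9]
    print("model A:", mean_margin(model_a))
    print("model B:", mean_margin(model_b))
```

Under these assumptions a larger mean margin suggests more headroom before outputs are judged toxic, which matches the attack-agnostic flavor described in the abstract: no adversarial attack is run, only scores on generated samples are compared.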

4. Transducer-Llama: Integrating LLMs into Streamable Transducer-based Speech Recognition

Authors: Keqi Deng, Jinxi Guo, Yingyi Ma, Niko Moritz, Philip C. Woodland, Ozlem Kalinli, Mike Seltzer



While large language models (LLMs) have been applied to automatic speech recognition (ASR), the task of making the model streamable remains a challenge. This paper proposes a novel model architecture, Transducer-Llama, that integrates LLMs into a Factorized Transducer (FT) model, naturally enabling streaming capabilities. Furthermore, given that the large vocabulary of LLMs can cause data sparsity issues and increased training costs for spoken language systems, this paper introduces an efficient vocabulary adaptation technique to align LLMs with speech system vocabularies. The results show that directly optimizing the FT model with a strong pre-trained LLM-based predictor using the RNN-T loss yields some but limited improvements over a smaller pre-trained LM predictor. Therefore, this paper proposes a weak-to-strong LM swap strategy, using a weak LM predictor during RNN-T loss training and then replacing it with a strong LLM. After LM replacement, the minimum word error rate (MWER) loss is employed to finetune the integration of the LLM predictor with the Transducer-Llama model. Experiments on the LibriSpeech and large-scale multi-lingual LibriSpeech corpora show that the proposed streaming Transducer-Llama approach achieved a 17% relative WER reduction (WERR) over a strong FT baseline and a 32% WERR over an RNN-T baseline.
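
For readers less familiar with the metrics quoted in the abstract above, the following sketch shows how word error rate (WER) and relative WER reduction (WERR) are commonly computed. The function names and all example numbers are hypothetical. They are chosen only so that the arithmetic gives about 17 percent and are not results from the paper.

```python
def word_error_rate(reference, hypothesis):
    """WER = word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, 1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,                           # deletion
                cur[j - 1] + 1,                        # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = cur
    return prev[-1] / len(ref)


def relative_wer_reduction(baseline_wer, new_wer):
    """WERR = (baseline - new) / baseline."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - new_wer) / baseline_wer


if __name__ == "__main__":
    wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
    print(f"WER: {wer:.3f}")
    # Hypothetical WERs: a drop from 10.0% to 8.3% is a 17% relative reduction.
    print(f"WERR: {relative_wer_reduction(10.0, 8.3):.2%}")
```

Note that WERR is relative to the baseline, so a 17% WERR means the error rate shrank by about a sixth of its baseline value and not by 17 absolute percentage points.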





5. Revisiting MLLMs: An In-Depth Analysis of Image Classification Abilities

Authors: Huan Liu, Lingyu Xiao, Jiangjiang Liu, Xiaofan Li, Ze Feng, Sen Yang, Jingdong Wang



With the rapid advancement of Multimodal Large Language Models (MLLMs), a variety of benchmarks have been introduced to evaluate their capabilities. While most evaluations have focused on complex tasks such as scientific comprehension and visual reasoning, little attention has been given to assessing their fundamental image classification abilities. In this paper, we address this gap by thoroughly revisiting MLLMs with an in-depth analysis of image classification. Specifically, building on established datasets, we examine a broad spectrum of scenarios, from general classification tasks (e.g., ImageNet, ObjectNet) to more fine-grained categories such as bird and food classification. Our findings reveal that the most recent MLLMs can match or even outperform CLIP-style vision-language models on several datasets, challenging the previous assumption that MLLMs are bad at image classification [VLMClassifier]. To understand the factors driving this improvement, we conduct an in-depth analysis of the network architecture, data selection, and training recipe used in public MLLMs. Our results attribute this success to advancements in language models and the diversity of training data sources. Based on these observations, we further analyze and attribute the potential reasons to conceptual knowledge transfer and enhanced exposure of target concepts, respectively. We hope our findings will offer valuable insights for future research on MLLMs and their evaluation in image classification tasks.


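In summary, by studying the image classification abilities of Multimodal Large Language Models (MLLMs), this paper shows that recent MLLMs can match or even surpass CLIP-style models on several datasets. It analyzes the key factors behind this improvement, including network architecture, data selection, and training recipe. It also argues that the existing assumption about MLLMs' image classification ability may need to be updated. Overall, the paper offers a valuable perspective for evaluating MLLMs on image classification tasks and challenges the common view of these models.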


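  • You are welcome to point out shortcomings in this post's brief paper reviews!
  • You are welcome to share recent papers you think are more worth recommending, and explain why!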



