2024-12-25 Paper Digest | Recent Advances in Multimodal Large Models

Digest   2024-12-25 10:34   Anhui



From the 45 papers published between 2024-12-23 and 2024-12-25, we have selected 5 outstanding works to share with our readers.

  1. Beyond End-to-End VLMs: Leveraging Intermediate Text Representations for Superior Flowchart Understanding
  2. Linguistic Features Extracted by GPT-4 Improve Alzheimer's Disease Detection based on Spontaneous Speech
  3. Retention Score: Quantifying Jailbreak Risks for Vision Language Models
  4. Transducer-Llama: Integrating LLMs into Streamable Transducer-based Speech Recognition
  5. Revisiting MLLMs: An In-Depth Analysis of Image Classification Abilities

1. Beyond End-to-End VLMs: Leveraging Intermediate Text Representations for Superior Flowchart Understanding

Authors: Junyi Ye, Ankan Dash, Wenpeng Yin, Guiling Wang

https://arxiv.org/abs/2412.16420

Abstract

Flowcharts are typically presented as images, driving the trend of using vision language models (VLMs) for end-to-end flowchart understanding. However, two key challenges arise: (i) Limited controllability—users have minimal influence over the downstream task, as they can only modify input images, while the training of VLMs is often out of reach for most researchers. (ii) Lack of explainability—it is difficult to trace VLM errors to specific causes, such as failures in visual encoding or reasoning. We propose TEXTFLOW, addressing aforementioned issues with two stages: (i) VISION TEXTUALIZER—which generates textual representations from flowchart images; and (ii) TEXTUAL REASONER—which performs question-answering based on the text representations. TEXTFLOW offers three key advantages: (i) users can select the type of text representations (e.g., GRAPHVIZ, MERMAID, PLANTUML), or further convert them into executable graph object to call tools, enhancing performance and controllability; (ii) it improves explainability by helping to attribute errors more clearly to visual or textual processing components; and (iii) it promotes the modularization of the solution, such as allowing advanced LLMs to be used in the REASONER stage when VLMs underperform in end-to-end fashion. Experiments on the FlowVQA and FlowLearn benchmarks demonstrate TEXTFLOW’s state-of-the-art performance as well as its robustness. All code is publicly available.

Brief Review

Overall, the TEXTFLOW framework proposed in this paper is significant for understanding and processing flowcharts. Its two-stage architecture is designed to address the limited controllability and poor explainability of conventional end-to-end VLM approaches. By supporting multiple text representation formats (such as GRAPHVIZ, MERMAID, and PLANTUML), it makes flowchart descriptions more flexible and varied. Experimental results show that TEXTFLOW achieves state-of-the-art performance on the FlowVQA and FlowLearn benchmarks, demonstrating its practical value and potential. This combination of theory and practice merits further research and development.
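To make the two-stage idea concrete, here is a minimal, runnable sketch of the VISION TEXTUALIZER / TEXTUAL REASONER split. The function names, the hard-coded Mermaid output, and the toy reasoner are illustrative stand-ins; the paper's actual pipeline uses a VLM and an LLM for the respective stages.

```python
def vision_textualizer(flowchart_image):
    """Stage 1: convert a flowchart image into a textual representation
    (Mermaid here). This sketch returns a fixed example for illustration;
    the real stage would run a vision model on the image."""
    return (
        "flowchart TD\n"
        "    A[Start] --> B{Is x > 0?}\n"
        "    B -->|Yes| C[Return x]\n"
        "    B -->|No| D[Return -x]\n"
    )

def textual_reasoner(mermaid_text, question):
    """Stage 2: answer questions over the text representation.
    A real system would call an LLM; this toy version just inspects
    the Mermaid syntax directly."""
    if "how many decision nodes" in question.lower():
        # Mermaid writes decision (diamond) nodes with curly braces.
        return sum(1 for line in mermaid_text.splitlines() if "{" in line)
    return None

mermaid = vision_textualizer(None)  # image omitted in this sketch
answer = textual_reasoner(mermaid, "How many decision nodes are there?")
print(answer)  # 1
```

Because the intermediate representation is plain text, an error in the final answer can be attributed either to a wrong Mermaid transcription (stage 1) or to faulty reasoning over a correct transcription (stage 2), which is exactly the explainability benefit the paper claims.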

2. Linguistic Features Extracted by GPT-4 Improve Alzheimer's Disease Detection based on Spontaneous Speech

Authors: Jonathan Heitz, Gerold Schneider, Nicolas Langer

https://arxiv.org/abs/2412.15772

Abstract

Alzheimer's Disease (AD) is a significant and growing public health concern. Investigating alterations in speech and language patterns offers a promising path towards cost-effective and non-invasive early detection of AD on a large scale. Large language models (LLMs), such as GPT, have enabled powerful new possibilities for semantic text analysis. In this study, we leverage GPT-4 to extract five semantic features from transcripts of spontaneous patient speech. The features capture known symptoms of AD, but they are difficult to quantify effectively using traditional methods of computational linguistics. We demonstrate the clinical significance of these features and further validate one of them (“Word-Finding Difficulties”) against a proxy measure and human raters. When combined with established linguistic features and a Random Forest classifier, the GPT-derived features significantly improve the detection of AD. Our approach proves effective for both manually transcribed and automatically generated transcripts, representing a novel and impactful use of recent advancements in LLMs for AD speech analysis.

Brief Review

This paper studies the use of GPT-4 to extract semantic features from spontaneous speech for early detection of Alzheimer's Disease (AD). The results show that combining these features with traditional linguistic features improves classification performance, especially alongside classical machine-learning methods. Clinical validation further demonstrates that the extracted features are clinically meaningful and can distinguish AD patients from healthy controls. In summary, this work offers a novel approach to LLM-based AD diagnosis and shows its potential for improving diagnostic accuracy.
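As a concrete illustration of validating a feature against a proxy measure, the sketch below scores a transcript with a crude hesitation-based proxy for word-finding difficulty. This is not the paper's method (which extracts the feature semantically with GPT-4); the filler list, the pause annotation, and the scoring are invented purely to show what a simple proxy can look like.

```python
# Single-token hesitation markers; an invented, non-exhaustive list.
FILLERS = {"uh", "um", "er"}

def word_finding_proxy(transcript: str) -> float:
    """Fraction of tokens that are hesitation markers or '(pause)'
    annotations. Higher values suggest more word-finding trouble."""
    tokens = transcript.lower().split()
    hesitations = sum(
        1 for t in tokens if t.strip(".,") in FILLERS or t == "(pause)"
    )
    return hesitations / max(len(tokens), 1)

# Made-up picture-description transcript in the style of such datasets.
sample = "um the the boy is uh taking (pause) the cookie um from the jar"
print(round(word_finding_proxy(sample), 2))  # 0.29
```

Comparing such a surface-level proxy against GPT-4's semantic rating (and against human raters, as the paper does for "Word-Finding Difficulties") is one way to check that an LLM-extracted feature measures what it claims to.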

3. Retention Score: Quantifying Jailbreak Risks for Vision Language Models

Authors: Zaitang Li, Pin-Yu Chen, Tsung-Yi Ho

https://arxiv.org/abs/2412.17544

Abstract

The emergence of Vision-Language Models (VLMs) marks a significant advancement in integrating computer vision with Large Language Models (LLMs) to enhance multi-modal machine learning capabilities. However, this progress has made VLMs vulnerable to advanced adversarial attacks, raising concerns about their reliability. The objective of this paper is to assess the resilience of VLMs against jailbreak attacks that can compromise model safety compliance and result in harmful outputs. To evaluate a VLM's ability to maintain robustness against adversarial input perturbations, we propose a novel metric called Retention Score. The Retention Score is a multi-modal evaluation metric that includes Retention-I and Retention-T scores for quantifying jailbreak risks in the visual and textual components of VLMs. Our process involves generating synthetic image-text pairs using a conditional diffusion model. These pairs are then predicted for toxicity scores by the VLM alongside a toxicity judgment classifier. By calculating the margin in toxicity scores, we can quantify the robustness of the VLM in an attack-agnostic manner. Our work has four main contributions. First, we demonstrate that the Retention Score can serve as a certified robustness metric. Second, we show that most VLMs with visual components are less robust against jailbreak attacks than their corresponding plain VLMs. Additionally, we evaluate black-box VLM APIs and find that security settings in Google Gemini significantly affect their scores and robustness. Moreover, the robustness of GPT4V is similar to the medium settings of Gemini. Finally, our approach offers a time-efficient alternative to existing adversarial attack methods and provides consistent model robustness rankings when evaluated on VLMs including MiniGPT-4, InstructBLIP, and LLaVA.

Brief Review

This paper proposes a new framework called Retention Score for assessing jailbreak risks in Vision-Language Models (VLMs). The method introduces a multi-modal evaluation metric that quantifies a VLM's robustness against adversarial input perturbations using synthetic image-text pairs. Its contributions lie in providing a theoretical foundation for the Retention Score and in extensive experiments demonstrating its effectiveness across different VLMs.

At its core, the paper proposes a new evaluation standard, the Retention Score, which offers a fresh perspective on VLM safety. The authors also discuss how to use it to compare different VLMs, helping us better understand and counter the threats these models face. This innovative approach may become an important direction for future research, as it could provide a more efficient and flexible way to detect and defend against VLM safety issues. Overall, the paper offers a new lens for examining VLM safety and its challenges, and is well worth further study.
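The margin intuition behind the Retention Score can be sketched in a few lines. The threshold value, the simple averaging, and the toxicity scores below are simplified assumptions for illustration, not the paper's exact certified formulation:

```python
def retention_score(toxicity_scores, threshold=0.5):
    """Mean margin by which a model's outputs stay below a toxicity
    threshold on synthetic probe inputs; larger means more robust.
    Already-toxic outputs contribute negative margins and pull the
    score down."""
    margins = [threshold - s for s in toxicity_scores]
    return sum(margins) / len(margins)

# Toxicity scores a judge classifier might assign to two models'
# responses on the same synthetic image-text pairs (made-up numbers).
robust_model = [0.05, 0.10, 0.08, 0.12]
fragile_model = [0.45, 0.60, 0.30, 0.55]

print(retention_score(robust_model) > retention_score(fragile_model))  # True
```

Because the score is computed from the model's behavior on synthetic probes rather than from any specific attack, the ranking it induces is attack-agnostic, which is the property the paper emphasizes.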

4. Transducer-Llama: Integrating LLMs into Streamable Transducer-based Speech Recognition

Authors: Keqi Deng, Jinxi Guo, Yingyi Ma, Niko Moritz, Philip C. Woodland, Ozlem Kalinli, Mike Seltzer

https://arxiv.org/abs/2412.16464

Abstract

While large language models (LLMs) have been applied to automatic speech recognition (ASR), the task of making the model streamable remains a challenge. This paper proposes a novel model architecture, Transducer-Llama, that integrates LLMs into a Factorized Transducer (FT) model, naturally enabling streaming capabilities. Furthermore, given that the large vocabulary of LLMs can cause data sparsity issues and increased training costs for spoken language systems, this paper introduces an efficient vocabulary adaptation technique to align LLMs with speech system vocabularies. The results show that directly optimizing the FT model with a strong pre-trained LLM-based predictor using the RNN-T loss yields some but limited improvements over a smaller pre-trained LM predictor. Therefore, this paper proposes a weak-to-strong LM swap strategy, using a weak LM predictor during RNN-T loss training and then replacing it with a strong LLM. After LM replacement, the minimum word error rate (MWER) loss is employed to finetune the integration of the LLM predictor with the Transducer-Llama model. Experiments on the LibriSpeech and large-scale multi-lingual LibriSpeech corpora show that the proposed streaming Transducer-Llama approach achieved a 17% relative WER reduction (WERR) over a strong FT baseline and a 32% WERR over an RNN-T baseline.

Brief Review

This paper proposes a new model, Transducer-Llama, which integrates large language models (LLMs) into streaming automatic speech recognition (ASR) systems. It introduces a vocabulary adaptation technique and a weak-to-strong LM swap strategy to improve the performance and efficiency of ASR systems. Experiments show a clear reduction in word error rate (WER) relative to the baseline models.

The vocabulary adaptation technique and weak-to-strong LM swap strategy offer innovative solutions to challenges in the ASR field. Vocabulary adaptation aligns the LLM with the ASR vocabulary, while the weak-to-strong LM swap improves training efficiency and, in turn, the performance of the ASR system. These improvements let Transducer-Llama outperform its competitors in practice, with particularly notable WER reductions.

In summary, Transducer-Llama not only resolves technical difficulties in streaming ASR but also demonstrates strong potential for performance gains, providing a valuable reference for future research and development. A close look at the experimental data further underscores the importance of this work and its application prospects.

5. Revisiting MLLMs: An In-Depth Analysis of Image Classification Abilities

Authors: Huan Liu, Lingyu Xiao, Jiangjiang Liu, Xiaofan Li, Ze Feng, Sen Yang, Jingdong Wang

https://arxiv.org/abs/2412.16418

Abstract

With the rapid advancement of Multimodal Large Language Models (MLLMs), a variety of benchmarks have been introduced to evaluate their capabilities. While most evaluations have focused on complex tasks such as scientific comprehension and visual reasoning, little attention has been given to assessing their fundamental image classification abilities. In this paper, we address this gap by thoroughly revisiting MLLMs with an in-depth analysis of image classification. Specifically, building on established datasets, we examine a broad spectrum of scenarios, from general classification tasks (e.g., ImageNet, ObjectNet) to more fine-grained categories such as bird and food classification. Our findings reveal that the most recent MLLMs can match or even outperform CLIP-style vision-language models on several datasets, challenging the previous assumption that MLLMs are bad at image classification VLMClassifier. To understand the factors driving this improvement, we conduct an in-depth analysis of the network architecture, data selection, and training recipe used in public MLLMs. Our results attribute this success to advancements in language models and the diversity of training data sources. Based on these observations, we further analyze and attribute the potential reasons to conceptual knowledge transfer and enhanced exposure of target concepts, respectively. We hope our findings will offer valuable insights for future research on MLLMs and their evaluation in image classification tasks.

Brief Review

In summary, by studying the image classification abilities of Multimodal Large Language Models (MLLMs), this paper shows that recent MLLMs can indeed match or even surpass CLIP-style models across a range of datasets. It analyzes the key factors behind this improvement, including network architecture, data selection, and training recipes, and argues that existing assumptions about MLLMs' image classification abilities may need updating. Overall, the paper offers a valuable perspective for evaluating MLLMs on image classification tasks and challenges common beliefs about these models.
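A typical evaluation loop for this kind of study can be sketched as follows. `query_mllm` is a hypothetical placeholder (here it simply echoes the ground-truth label so the loop runs end to end); a real harness would prompt the actual MLLM with the image and the candidate class names, then parse its answer:

```python
def query_mllm(image, candidate_labels):
    """Placeholder for an MLLM call such as 'Which of these categories
    best describes the image: ...?'. This stub returns the true label
    so the evaluation loop below is runnable."""
    return image["label"]

def classification_accuracy(dataset, candidate_labels):
    """Zero-shot classification accuracy: one prompt per image,
    exact-match comparison against the ground-truth label."""
    correct = sum(
        query_mllm(img, candidate_labels) == img["label"] for img in dataset
    )
    return correct / len(dataset)

# Toy stand-in for a fine-grained benchmark (e.g., bird or food classes).
toy_dataset = [{"label": "sparrow"}, {"label": "pizza"}, {"label": "sparrow"}]
print(classification_accuracy(toy_dataset, ["sparrow", "pizza"]))  # 1.0
```

Running the same loop over an MLLM and a CLIP-style classifier on shared datasets (ImageNet, ObjectNet, fine-grained sets) is the kind of head-to-head comparison the paper performs.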


We welcome your valuable suggestions in the comments, including but not limited to:

  • Point out shortcomings in the paper reviews above!
  • Share recent papers you find even more worth recommending, and tell us why!

END

Recommended Reading

2024-12-24 Paper Digest | Recent Advances in Large Language Models
2024-12-23 Paper Digest | Recent Advances in Multimodal Large Models
2024-12-20 Paper Digest | Recent Advances in Agents
2024-12-19 Paper Digest | Recent Advances in Large Language Models

智荐阁
Covering frontier advances in generative large models and recommender systems, including but not limited to: large language models, recommender systems, agent learning, reinforcement learning, generative recommendation, guided recommendation, recommendation agents, and agent-based recommendation.