2024-11-26 Paper Sharing | Latest Advances in Large Language Models



Paper Sharing | Research Advances in Large Language Models

From the 39 papers published between 2024-11-20 and 2024-11-26, we have selected 5 outstanding works to share with our readers.

  1. ChatBCI: A P300 Speller BCI Leveraging Large Language Models for Improved Sentence Composition in Realistic Scenarios
  2. ITACLIP: Boosting Training-Free Semantic Segmentation with Image, Text, and Architectural Enhancements
  3. Thinking Before Looking: Improving Multimodal LLM Reasoning via Mitigating Visual Hallucination
  4. VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection
  5. What Makes a Scene? Scene Graph-based Evaluation and Feedback for Controllable Generation

1.ChatBCI: A P300 Speller BCI Leveraging Large Language Models for Improved Sentence Composition in Realistic Scenarios

Authors: Jiazhen Hong, Weinan Wang, Laleh Najafizadeh

https://arxiv.org/abs/2411.15395

Abstract

P300 speller brain computer interfaces (BCIs) allow users to compose sentences by selecting target keys on a graphical user interface (GUI) through the detection of the P300 component in their electroencephalogram (EEG) signals following visual stimuli. Most existing P300 speller BCIs require users to spell all or the initial letters of the intended word letter by letter. Consequently, a large number of keystrokes are required to write a sentence, which can be time-consuming, increasing the user’s cognitive load and fatigue. Therefore, there is a need for more efficient and user-friendly methods for faster and practical sentence composition.

In this work, we introduce ChatBCI, a P300 speller BCI that leverages the zero-shot learning capabilities of large language models (LLMs) to suggest words from user-spelled initial letters or predict the subsequent word(s), reducing keystrokes and accelerating sentence composition. ChatBCI retrieves word suggestions through remote queries to the GPT-3.5 API. A new GUI that displays GPT-3.5 word suggestions as extra keys is designed. Stepwise linear discriminant analysis (SWLDA) is used for P300 classification.

Seven subjects completed two online spelling tasks: 1) copy-spelling a self-composed sentence using ChatBCI, and 2) improvising a sentence using ChatBCI’s word suggestions. Results demonstrate that in Task 1, on average, ChatBCI outperforms letter-by-letter BCI spellers, reducing both completion time and the number of keystrokes while increasing the information transfer rate. In Task 2, ChatBCI achieves substantial keystroke savings and a record typing speed in characters per minute.

Overall, ChatBCI, by employing remote LLM queries, enhances sentence composition in realistic scenarios, significantly outperforming traditional spellers without requiring local model training or storage. ChatBCI’s (multi-) word predictions, combined with its new GUI, pave the way for developing next-generation speller BCIs that are efficient and effective for real-time communication, especially for users with communication and motor disabilities.

Brief Review

ChatBCI is a P300 speller BCI system that leverages large language models (LLMs) to make sentence composition more efficient: by offering word suggestions and next-word predictions, it cuts down the number of keystrokes and speeds up text entry. Experimental results show a clear improvement over traditional letter-by-letter spellers, particularly in reduced typing time. The work also introduces new metrics such as the keystroke-savings gap ratio (KS-DR), which provide a useful reference point for analyzing system performance. Overall, ChatBCI performs well in terms of user experience, efficiency, and its newly introduced evaluation methods, and represents a meaningful innovation.
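
To make the interaction concrete, below is a minimal sketch of how a speller GUI might query a remote LLM for word suggestions from user-spelled initial letters, in the spirit of ChatBCI's GPT-3.5 queries. The prompt wording, model name, and the `suggest_words` helper are illustrative assumptions, not the authors' implementation.

```python
# Sketch: remote word-suggestion query for a P300 speller (assumed prompt/model).
# Requires the `openai` package and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

def suggest_words(sentence_so_far: str, initial_letters: str, n: int = 6) -> list[str]:
    """Ask the LLM for n candidate completions of the partially spelled word."""
    prompt = (
        f"The user is composing the sentence: '{sentence_so_far}'. "
        f"The next word starts with '{initial_letters}'. "
        f"Suggest {n} likely words, comma-separated, with no explanations."
    )
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",  # ChatBCI retrieves suggestions via remote GPT-3.5 queries
        messages=[{"role": "user", "content": prompt}],
        temperature=0.2,
    )
    # Parse the comma-separated reply into candidate keys for the speller GUI.
    return [w.strip() for w in resp.choices[0].message.content.split(",")][:n]

# Example: the returned words would be rendered as extra keys on the GUI, e.g.
# suggest_words("I would like a cup of", "co")  ->  ["coffee", "cocoa", ...]
```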

2.ITACLIP: Boosting Training-Free Semantic Segmentation with Image, Text, and Architectural Enhancements

Authors: M. Arda Aydın, Efe Mert Çırpar, Elvin Abdinli, Gozde Unal, Yusuf H. Sahin

https://arxiv.org/abs/2411.12044

Abstract

Recent advances in foundational Vision Language Models (VLMs) have reshaped the evaluation paradigm in computer vision tasks. These foundational models, especially CLIP, have accelerated research in open-vocabulary computer vision tasks, including Open-Vocabulary Semantic Segmentation (OVSS). Although the initial results are promising, the dense prediction capabilities of VLMs still require further improvement. In this study, we enhance the semantic segmentation performance of CLIP by introducing new modules and modifications: 1) architectural changes in the last layer of ViT and the incorporation of attention maps from the middle layers with the last layer, 2) Image Engineering: applying data augmentations to enrich input image representations, and 3) using Large Language Models (LLMs) to generate definitions and synonyms for each class name to leverage CLIP's open-vocabulary capabilities. Our training-free method, ITACLIP, outperforms current state-of-the-art approaches on segmentation benchmarks such as COCO-Stuff, COCO-Object, Pascal Context, and Pascal VOC. Our code is available at https://github.com/m-arda-aydn/ITACLIP.

Brief Review

ITACLIP is a training-free approach that improves semantic segmentation by leveraging large language models (LLMs). Compared with conventional methods it offers several key advantages, most notably its use of pre-trained CLIP (Contrastive Language-Image Pre-training) for open-vocabulary semantic segmentation. By introducing architectural changes and enriching the text side with LLM-generated definitions and synonyms for each class name, the authors achieve notable gains in the open-vocabulary setting. The results highlight the method's effectiveness in surpassing existing approaches on well-established segmentation benchmarks, adding substantial value to the field. Overall, ITACLIP stands out as a promising solution that combines the capabilities of LLMs with established image-processing techniques, providing a strong framework for future research in this area.
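
As an illustration of the text-side enhancement, here is a minimal sketch of building CLIP text prototypes from a class name plus LLM-generated definitions and synonyms, then scoring an image against them. The auxiliary texts are hard-coded stand-ins for LLM output and the checkpoint is an assumption; for brevity the sketch scores whole images rather than producing dense predictions, and it omits ITACLIP's ViT attention changes and image engineering.

```python
# Sketch: averaging CLIP text embeddings over a class name plus LLM-generated
# synonyms/definitions (hard-coded placeholders here) to form class prototypes.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch16")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch16")

# In ITACLIP these auxiliary texts come from an LLM; here they are placeholders.
class_texts = {
    "cat": ["cat", "kitten", "a small domesticated feline animal"],
    "dog": ["dog", "puppy", "a domesticated canine animal"],
}

@torch.no_grad()
def class_prototypes(texts_per_class: dict[str, list[str]]) -> torch.Tensor:
    protos = []
    for texts in texts_per_class.values():
        inputs = processor(text=texts, return_tensors="pt", padding=True)
        emb = model.get_text_features(**inputs)
        emb = emb / emb.norm(dim=-1, keepdim=True)
        protos.append(emb.mean(dim=0))          # average name + synonyms + definition
    return torch.stack(protos)

@torch.no_grad()
def classify(image: Image.Image) -> str:
    protos = class_prototypes(class_texts)
    inputs = processor(images=image, return_tensors="pt")
    img = model.get_image_features(**inputs)
    img = img / img.norm(dim=-1, keepdim=True)
    sims = (img @ protos.T).squeeze(0)           # similarity of the image to each class
    return list(class_texts)[int(sims.argmax())]
```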

3.Thinking Before Looking: Improving Multimodal LLM Reasoning via Mitigating Visual Hallucination

Authors: Haojie Zheng, Tianyang Xu, Hanchi Sun, Shu Pu, Ruoxi Chen, Lichao Sun

https://arxiv.org/abs/2411.12591

Abstract

Multimodal large language models (MLLMs) have advanced the integration of visual and linguistic modalities, establishing themselves as the dominant paradigm for visual-language tasks. Current approaches like chain of thought (CoT) reasoning have augmented the cognitive capabilities of large language models (LLMs), yet their adaptation to MLLMs is hindered by heightened risks of hallucination in cross-modality comprehension. In this paper, we find that the thinking while looking paradigm in current multimodal CoT approaches—where reasoning chains are generated alongside visual input—fails to mitigate hallucinations caused by misleading images. To address these limitations, we propose the Visual Inference Chain (VIC) framework, a novel approach that constructs reasoning chains using textual context alone before introducing visual input, effectively reducing cross-modal biases and enhancing multimodal reasoning accuracy. Comprehensive evaluations demonstrate that VIC significantly improves zero-shot performance across various vision-related tasks, mitigating hallucinations while refining the reasoning capabilities of MLLMs. Our anonymized code repository can be found at https://github.com/Terry-Xu-666/visual_inference_chain.

Brief Review

In this paper on the Visual Inference Chain (VIC) framework, the authors propose a new way to reduce hallucinations in multimodal large language models (MLLMs) by decoupling reasoning from visual input. In contrast to the conventional "thinking while looking" paradigm, the approach advocates a "thinking before looking" strategy in which the reasoning chain is constructed from the textual context alone before the image is introduced. The authors show that this not only improves reasoning accuracy but also markedly reduces hallucinations across multiple benchmarks, with the gains demonstrated in particular on HallusionBench and MMVP.

The key point of this paper is that it tackles a serious problem in current multimodal reasoning, namely hallucination, which is critical for ensuring the reliability of large language models. Through experiments on a range of benchmark datasets, the authors find that the method effectively improves reasoning accuracy and performs notably well at mitigating hallucinations. These results indicate that, although MLLMs still face challenges, the VIC framework provides an important tool for improving their performance. Overall, the paper makes a significant contribution to multimodal reasoning, especially through its innovative solution for reducing hallucinations.
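
A minimal sketch of the "thinking before looking" idea follows: first elicit a reasoning chain from the question text alone, then hand that chain plus the image to the multimodal model. The two-step prompts and the use of a GPT-4o-style vision endpoint are illustrative assumptions; the authors' actual VIC prompting details are in their repository.

```python
# Sketch: "thinking before looking" — derive reasoning steps from text only,
# then answer with the image attached in a second call (prompts/model assumed).
import base64
from openai import OpenAI

client = OpenAI()

def vic_answer(question: str, image_path: str) -> str:
    # Step 1: build the reasoning chain WITHOUT showing the image.
    think = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": f"Question about an unseen image: {question}\n"
                       "List the visual checks needed to answer it. Do not guess the answer.",
        }],
    ).choices[0].message.content

    # Step 2: answer by following the chain, now with visual input attached.
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    answer = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": f"Follow these reasoning steps to answer '{question}':\n{think}"},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    ).choices[0].message.content
    return answer
```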

4.VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection

Authors: Songhao Han, Wei Huang, Hairong Shi, Le Zhuo, Xiu Su, Shifeng Zhang, Xu Zhou, Xiaojuan Qi, Yue Liao, Si Liu

https://arxiv.org/abs/2411.14794

Abstract

The advancement of Large Vision Language Models (LVLMs) has significantly improved multimodal understanding, yet challenges remain in video reasoning tasks due to the scarcity of high-quality, large-scale datasets. Existing video question-answering (VideoQA) datasets often rely on costly manual annotations with insufficient granularity or automatic construction methods with redundant frame-by-frame analysis, limiting their scalability and effectiveness for complex reasoning. To address these challenges, we introduce VideoEspresso, a novel dataset that features VideoQA pairs preserving essential spatial details and temporal coherence, along with multimodal annotations of intermediate reasoning steps. Our construction pipeline employs a semantic-aware method to reduce redundancy, followed by generating QA pairs using GPT-4o. We further develop video Chain-of-Thought (CoT) annotations to enrich reasoning processes, guiding GPT-4o in extracting logical relationships from QA pairs and video content. To exploit the potential of high-quality VideoQA pairs, we propose a Hybrid LVLMs Collaboration framework, featuring a Frame Selector and a two-stage instruction fine-tuned reasoning LVLM. This framework adaptively selects core frames and performs CoT reasoning using multimodal evidence. Evaluated on our proposed benchmark with 14 tasks against 9 popular LVLMs, our method outperforms existing baselines on most tasks, demonstrating superior video reasoning capabilities. Our code and dataset will be released at: https://github.com/hshjerry/VideoEspresso

Brief Review

This paper introduces VideoEspresso, a large-scale dataset designed to improve fine-grained video reasoning. A novel automatic construction pipeline generates high-quality VideoQA pairs, with an emphasis on multimodal annotations and chain-of-thought reasoning. The paper also describes a comprehensive evaluation framework and demonstrates strong performance compared with existing models. It highlights the importance of a semantic-aware approach to frame selection and QA-pair generation, an innovation that may well set a precedent for future work in this area.

In summary, the main contribution is the new large-scale video reasoning dataset VideoEspresso and the forward-looking way it is built, through semantic-aware frame selection and QA-pair generation, to raise the quality of video reasoning data. In addition, the paper provides a detailed evaluation framework and comparisons with existing models, offering the video reasoning community an effective tool and benchmark.
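
To illustrate the semantic-aware redundancy reduction, here is a minimal sketch of core-frame selection that keeps a frame only when its embedding differs enough from the last kept frame. The CLIP-based embedding and the similarity threshold are assumptions for illustration; in the actual pipeline the selected frames are then fed to GPT-4o to generate QA pairs and CoT annotations.

```python
# Sketch: semantic-aware core-frame selection — keep a frame only if it is
# sufficiently dissimilar from the previously kept frame (threshold assumed).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch16")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch16")

@torch.no_grad()
def select_core_frames(frames: list[Image.Image], threshold: float = 0.9) -> list[int]:
    """Return indices of frames whose embedding differs from the last kept frame."""
    if not frames:
        return []
    inputs = processor(images=frames, return_tensors="pt")
    emb = model.get_image_features(**inputs)
    emb = emb / emb.norm(dim=-1, keepdim=True)

    kept = [0]                                   # always keep the first frame
    for i in range(1, len(frames)):
        sim = float(emb[i] @ emb[kept[-1]])      # cosine similarity to last kept frame
        if sim < threshold:                      # new content -> keep as a core frame
            kept.append(i)
    return kept

# The selected core frames would then be passed, with the video context, to
# GPT-4o to generate QA pairs and chain-of-thought annotations.
```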

5.What Makes a Scene? Scene Graph-based Evaluation and Feedback for Controllable Generation

Authors: Zuyao Chen, Jinlin Wu, Zhen Lei, Chang Wen Chen

https://arxiv.org/abs/2411.15435

Abstract

While text-to-image generation has been extensively studied, generating images from scene graphs remains relatively underexplored, primarily due to challenges in accurately modeling spatial relationships and object interactions. To fill this gap, we introduce Scene-Bench, a comprehensive benchmark designed to evaluate and enhance the factual consistency in generating natural scenes. Scene-Bench comprises MegaSG, a large-scale dataset of one million images annotated with scene graphs, facilitating the training and fair comparison of models across diverse and complex scenes. Additionally, we propose SGScore, a novel evaluation metric that leverages chain-of-thought reasoning capabilities of multimodal large language models (LLMs) to assess both object presence and relationship accuracy, offering a more effective measure of factual consistency than traditional metrics like FID and CLIPScore. Building upon this evaluation framework, we develop a scene graph feedback pipeline that iteratively refines generated images by identifying and correcting discrepancies between the scene graph and the image. Extensive experiments demonstrate that Scene-Bench provides a more comprehensive and effective evaluation framework compared to existing benchmarks, particularly for complex scene generation. Furthermore, our feedback strategy significantly enhances the factual consistency of image generation models, advancing the field of controllable image generation.

Brief Review

Scene-Bench is a fundamental tool for evaluating the consistency of images generated from scene graphs. The researchers introduce MegaSG, a comprehensive dataset of one million images annotated with scene graphs, providing a strong platform for testing various scene-synthesis methods. They also develop SGScore, a novel evaluation metric designed to measure whether the objects and relationships in the generated scene are present and accurate. This innovative approach not only assesses output quality but also, through a feedback loop, highlights where improvements are needed. The proposed method shows great potential for guiding the development of more accurate and consistent scene-generation techniques. With its extensive dataset and advanced evaluation methodology, Scene-Bench lays a foundation for advancing the field of scene synthesis.
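
As a rough illustration of an SGScore-style check, the sketch below asks a multimodal LLM whether each object and relationship from the scene graph appears in the generated image and averages the yes/no verdicts. The prompt wording, model choice, and scoring rule are assumptions; the paper's actual metric uses chain-of-thought prompting and a feedback pipeline that iteratively refines the image.

```python
# Sketch: SGScore-style factual-consistency check — query an MLLM about each
# object and relation in the scene graph, then average the verdicts (all assumed).
import base64
from openai import OpenAI

client = OpenAI()

def _ask(image_b64: str, question: str) -> bool:
    reply = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": question + " Answer only yes or no."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    ).choices[0].message.content
    return reply.strip().lower().startswith("yes")

def sg_consistency(image_path: str, objects: list[str],
                   relations: list[tuple[str, str, str]]) -> float:
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    checks = [_ask(b64, f"Is there a {o} in the image?") for o in objects]
    checks += [_ask(b64, f"Is the {s} {p} the {o}?") for s, p, o in relations]
    return sum(checks) / len(checks)             # fraction of satisfied graph elements

# Example scene graph: objects plus (subject, predicate, object) triples.
# sg_consistency("gen.png", ["cat", "sofa"], [("cat", "sitting on", "sofa")])
```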


We welcome your valuable suggestions in the comments section, including but not limited to:

  • Point out shortcomings in this post's brief reviews!
  • Share recent papers that deserve a recommendation, along with your reasons!



END

Recommended Reading

2024-11-25 Paper Sharing | Latest Advances in Multimodal Large Models
2024-11-22 Paper Sharing | Latest Advances in Agents
2024-11-21 Paper Sharing | Latest Advances in Recommender Systems
2024-11-20 Paper Sharing | Latest Advances in Large Language Models


智荐阁
Introducing cutting-edge progress in generative large models and recommender systems, including but not limited to: large language models, recommender systems, agent learning, reinforcement learning, generative recommendation, guided recommendation, recommendation agents, and agent-based recommendation.