Paper Digest | Recent Advances in Large Language Model Research
From the 48 papers published between 2024-12-24 and 2024-12-27, we have selected 5 outstanding works to share with our readers.
1. SCBench: A Sports Commentary Benchmark for Video LLMs
2. Prompting Large Language Models with Rationale Heuristics for Knowledge-based Visual Question Answering
3. Self-guided Knowledgeable Network of Thoughts: Amplifying Reasoning with Large Language Models
4. Deliberative Alignment: Reasoning Enables Safer Language Models
5. Path-of-Thoughts: Extracting and Following Paths for Robust Relational Reasoning with Large Language Models
1.SCBench: A Sports Commentary Benchmark for Video LLMs
Authors: Kuangzhi Ge, Lingjun Chen, Kevin Zhang, Yulin Luo, Tianyu Shi, Liaoyuan Fan, Xiang Li, Guanqun Wang, Shanghang Zhang
https://arxiv.org/abs/2412.17637
Abstract
Recently, significant advances have been made in Video Large Language Models (Video LLMs) in both academia and industry. However, methods to evaluate and benchmark the performance of different Video LLMs, especially their fine-grained, temporal visual capabilities, remain very limited. On one hand, current benchmarks use relatively simple videos (e.g., subtitled movie clips) where the model can understand the entire video by processing just a few frames. On the other hand, their datasets lack diversity in task format, comprising only QA or multi-choice QA, which overlooks the models' capacity for generating in-depth and precise texts. Sports videos, which feature intricate visual information, sequential events, and emotionally charged commentary, present a critical challenge for Video LLMs, making sports commentary an ideal benchmarking task. Inspired by these challenges, we propose a novel task: sports video commentary generation, and develop SCBench for Video LLMs. To construct such a benchmark, we introduce (1) SCORES, a six-dimensional metric specifically designed for our task, upon which we propose a GPT-based evaluation method, and (2) CommentarySet, a dataset consisting of 5,775 annotated video clips and ground-truth labels tailored to our metric. Based on SCBench, we conduct comprehensive evaluations on multiple Video LLMs (e.g. VILA, Video-LLaVA, etc.) and chain-of-thought baseline methods. Our results found that InternVL-Chat-2 achieves the best performance with 5.44, surpassing the second-best by 1.04. Our work provides a fresh perspective for future research, aiming to enhance models' overall capabilities in complex visual understanding tasks. Our dataset will be released soon.
Brief Review
SCBench is a benchmark for evaluating the performance of Video Large Language Models (Video LLMs) on sports commentary generation. The paper introduces CommentarySet, a dataset built specifically for this task, containing 5,775 annotated video clips across 6 sports. By developing a new six-dimensional metric called SCORES, the paper proposes a finer-grained evaluation method that addresses the limitations of current benchmarks, with a focus on fine-grained, temporal visual capabilities. This work matters to the field because few studies have examined these specific weaknesses of Video LLMs in depth, making it a direction worth further exploration. Overall, the paper offers a fresh perspective and methodology for evaluating Video LLMs and should help advance this area.
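The GPT-based evaluation described above can be pictured as a judge prompt plus a score aggregation step. The sketch below is a minimal illustration of that loop; the six dimension names and the averaging rule are placeholder assumptions, not the SCORES definitions from the paper.

```python
# Hypothetical dimension names -- the paper defines its own six SCORES
# dimensions, which are not reproduced here.
DIMENSIONS = ["accuracy", "completeness", "timing", "detail", "emotion", "fluency"]

def build_eval_prompt(video_events: str, commentary: str) -> str:
    """Format a judge prompt asking a GPT-style model to rate a generated
    commentary against ground-truth events, one score per dimension."""
    dims = "\n".join(f"- {d}" for d in DIMENSIONS)
    return (
        "You are grading a sports commentary against the ground-truth events.\n"
        f"Events:\n{video_events}\n\n"
        f"Commentary:\n{commentary}\n\n"
        f"Rate each dimension from 0 to 10:\n{dims}"
    )

def aggregate(scores: dict) -> float:
    """One plausible aggregation: mean over the six dimension scores."""
    return sum(scores[d] for d in DIMENSIONS) / len(DIMENSIONS)

# Toy example with hand-written scores standing in for the judge's output.
example = {d: s for d, s in zip(DIMENSIONS, [6, 5, 7, 4, 6, 5])}
print(round(aggregate(example), 2))  # mean of the six placeholder scores
```

In the real benchmark the judge model would fill in the scores; here they are hand-written so the aggregation is deterministic.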
2.Prompting Large Language Models with Rationale Heuristics for Knowledge-based Visual Question Answering
Authors: Zhongjian Hu, Peng Yang, Bing Li, Fengyuan Liu
https://arxiv.org/abs/2412.16936
Abstract
Recently, Large Language Models (LLMs) have been used for knowledge-based Visual Question Answering (VQA). Despite the encouraging results of previous studies, prior methods prompt LLMs to predict answers directly, neglecting intermediate thought processes. We argue that prior methods do not sufficiently activate the capacities of LLMs. We propose a framework called PLRH that Prompts LLMs with Rationale Heuristics for knowledge-based VQA. The PLRH prompts LLMs with Chain of Thought (CoT) to generate rationale heuristics, i.e., intermediate thought processes, and then leverages the rationale heuristics to inspire LLMs to predict answers. Experiments show that our approach outperforms the existing baselines by more than 2.2 and 2.1 on OK-VQA and A-OKVQA, respectively.
Brief Review
This paper proposes a new framework called PLRH, which uses Chain-of-Thought (CoT) reasoning to guide large language models (LLMs) in answering questions that require both image understanding and external knowledge. The framework generates and exploits rationales in three stages to strengthen the LLM's reasoning on such questions. Experimental results show that PLRH outperforms existing baselines on popular VQA datasets. The study demonstrates that introducing CoT-based rationales can effectively improve LLM reasoning on VQA tasks and highlights PLRH's potential for amplifying LLM capabilities. Overall, the paper offers an in-depth analysis of CoT-based rationale generation and its application in knowledge-based VQA, providing a solid starting point for future research.
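The core data flow, first elicit a rationale with a CoT prompt, then feed that rationale back when asking for the final answer, can be sketched in a few lines. The prompt wording below is illustrative rather than the paper's, and `llm` stands for any text-in/text-out model callable.

```python
# A hedged sketch of rationale-heuristic prompting: stage 1 asks for a CoT
# rationale, stage 2 conditions the answer on that rationale.

def generate_rationale(llm, caption: str, question: str) -> str:
    """Stage 1: elicit an intermediate thought process (the rationale)."""
    prompt = (
        f"Image caption: {caption}\nQuestion: {question}\n"
        "Let's think step by step about what knowledge is needed."
    )
    return llm(prompt)

def answer_with_rationale(llm, caption: str, question: str, rationale: str) -> str:
    """Stage 2: use the rationale as a heuristic when predicting the answer."""
    prompt = (
        f"Image caption: {caption}\nQuestion: {question}\n"
        f"Rationale: {rationale}\nAnswer with a short phrase:"
    )
    return llm(prompt)

def plrh(llm, caption: str, question: str) -> str:
    rationale = generate_rationale(llm, caption, question)
    return answer_with_rationale(llm, caption, question, rationale)

# Toy stand-in for an LLM, just to show how the stages chain together.
fake_llm = lambda p: "umbrella" if "Rationale:" in p else "rain implies umbrellas"
print(plrh(fake_llm, "a person in the rain", "What might they carry?"))
```

A real pipeline would also feed in visual features or retrieved knowledge; the two-call structure is the part this sketch is meant to show.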
3.Self-guided Knowledgeable Network of Thoughts: Amplifying Reasoning with Large Language Models
Authors: Chao-Chi Chen, Chin-Yuan Yeh, Hsi-Wen Chen, De-Nian Yang, Ming-Syan Chen
https://arxiv.org/abs/2412.16533
Abstract
We introduce Knowledgeable Network of Thoughts (kNoT): a prompt scheme that advances the capabilities of large language models (LLMs) beyond existing paradigms like Chain-of-Thought (CoT), Tree of Thoughts (ToT), and Graph of Thoughts (GoT). The key innovation of kNoT is the LLM Workflow Template (LWT), which allows for an executable plan to be specified by LLMs for LLMs. LWT allows these plans to be arbitrary networks, where single-step LLM operations are nodes, and edges correspond to message passing between these steps. Furthermore, LWT supports selection of individual elements through indexing, facilitating kNoT to produce intricate plans where each LLM operation can be limited to elementary operations, greatly enhancing reliability over extended task sequences. We demonstrate that kNoT significantly outperforms the state of the art on six use cases, while reducing the need for extensive prompt engineering. For instance, kNoT finds 92% accuracy for sorting 32 numbers over 12% and 31% for ToT and GoT, while utilizing up to 84.4% and 87.3% less task-specific prompts, respectively.
Brief Review
This paper presents a new prompting framework called Knowledgeable Network of Thoughts (kNoT), which aims to strengthen large language models' (LLMs) reasoning on multi-step tasks. Its goals are to reduce the manual configuration of task-specific prompts and to offer a more flexible, more direct approach than existing methods such as Chain-of-Thought and Tree of Thoughts. By letting LLMs draft their own executable plans through the structured LLM Workflow Template (LWT), kNoT seeks to improve both the accuracy and the efficiency of problem solving.
The paper's key findings are as follows: first, the kNoT framework offers a novel way for LLMs to autonomously generate structured plans; second, experiments show that kNoT outperforms existing prompting strategies across multiple use cases, with clear advantages in accuracy and efficiency; finally, reducing the manual effort needed to use LLMs effectively is itself a significant contribution of the proposed approach. Overall, the paper presents an innovative idea that helps address the difficulties current LLMs face on complex tasks, while demonstrating its advantages in practical applications.
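The execution side of such a plan, a small DAG whose nodes are single-step operations and whose edges pass each node's output forward, can be sketched as below. In real kNoT the LLM itself writes the plan inside the workflow template; here both the plan and the per-node operations are hand-written stand-ins.

```python
from graphlib import TopologicalSorter

def run_plan(plan: dict, ops: dict, inputs: dict) -> dict:
    """Execute a kNoT/LWT-style plan. `plan` maps each node to its list of
    predecessor nodes; `ops` maps each non-input node to a callable taking
    the predecessors' outputs (each call stands in for one LLM operation)."""
    results = dict(inputs)
    for node in TopologicalSorter(plan).static_order():
        if node in results:  # input node, value already available
            continue
        args = [results[p] for p in plan[node]]
        results[node] = ops[node](*args)
    return results

# Toy plan: split a list, sort each half, merge -- every step is elementary,
# mirroring kNoT's idea of keeping each LLM call simple and reliable.
plan = {"xs": [], "lo": ["xs"], "hi": ["xs"], "out": ["lo", "hi"]}
ops = {
    "lo": lambda xs: sorted(xs[: len(xs) // 2]),
    "hi": lambda xs: sorted(xs[len(xs) // 2 :]),
    "out": lambda a, b: sorted(a + b),
}
print(run_plan(plan, ops, {"xs": [3, 1, 4, 1, 5, 9, 2, 6]})["out"])
```

The topological sort guarantees each node runs only after all of its message-passing predecessors have produced output, which is what lets the plan be an arbitrary network rather than a chain or tree.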
4.Deliberative Alignment: Reasoning Enables Safer Language Models
Authors: Melody Y. Guan, Manas Joglekar, Eric Wallace, Saachi Jain, Boaz Barak, Alec Heylar, Rachel Dias, Andrea Vallone, Hongyu Ren, Jason Wei, Hyung Won Chung, Sam Toyer, Johannes Heidecke, Alex Beutel, Amelia Glaese
https://arxiv.org/abs/2412.16339
Abstract
As large-scale language models increasingly impact safety-critical domains, ensuring their reliable adherence to well-defined principles remains a fundamental challenge. We introduce Deliberative Alignment, a new paradigm that directly teaches the model safety specifications and trains it to explicitly recall and accurately reason over the specifications before answering. We used this approach to align OpenAI's o-series models [1], and achieved highly precise adherence to OpenAI's safety policies, without requiring human-written chain-of-thoughts or answers. Deliberative Alignment pushes the Pareto frontier by simultaneously increasing robustness to jailbreaks while decreasing overrefusal rates, and also improves out-of-distribution generalization. We demonstrate that reasoning over explicitly specified policies enables more scalable, trustworthy, and interpretable alignment.
Brief Review
This paper, "Deliberative Alignment: Reasoning Enables Safer Language Models," introduces a training method that teaches large language models to follow safety specifications and resist adversarial prompts. The method aims to improve model safety, strengthen robustness to jailbreak attempts, and generalize better to out-of-distribution scenarios. Extensive experiments on OpenAI models validate its effectiveness and performance gains. By improving safety compliance and resistance to adversarial prompts, the paper points to a promising direction: guiding model behavior by having it reason explicitly over safety policies.
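The deliberate-then-answer pattern the paper describes, recall the relevant specification, reason over it, then reply, can be illustrated with a prompt template. The spec text and wording below are invented placeholders; the actual method trains this behavior into the model rather than relying on a prompt at inference time.

```python
# Hypothetical two-rule specification, standing in for a real safety policy.
SPEC = [
    "Refuse requests for instructions that enable physical harm.",
    "Comply with benign informational requests.",
]

def deliberation_prompt(user_request: str) -> str:
    """Build a prompt that asks the model to quote the applicable rule and
    reason over it before producing its final reply."""
    policy = "\n".join(f"{i + 1}. {rule}" for i, rule in enumerate(SPEC))
    return (
        f"Safety specification:\n{policy}\n\n"
        f"User request: {user_request}\n\n"
        "First quote the applicable rule, explain how it applies, "
        "then give the final reply."
    )

print(deliberation_prompt("How do I pick a good password?"))
```

Making the policy reasoning explicit is what the paper credits for the simultaneous gains in jailbreak robustness and reduced overrefusal: the model decides by checking rules, not by pattern-matching surface features of the request.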
5.Path-of-Thoughts: Extracting and Following Paths for Robust Relational Reasoning with Large Language Models
Authors: Ge Zhang, Mohammad Ali Alomrani, Hongjian Gu, Jiaming Zhou, Yaochen Hu, Bin Wang, Qun Liu, Mark Coates, Yingxue Zhang, Jianye Hao
https://arxiv.org/abs/2412.17963
Abstract
Large language models (LLMs) possess vast semantic knowledge but often struggle with complex reasoning tasks, particularly in relational reasoning problems such as kinship or spatial reasoning. In this paper, we present Path-of-Thoughts (PoT), a novel framework designed to tackle relation reasoning by decomposing the task into three key stages: graph extraction, path identification, and reasoning. Unlike previous approaches, PoT efficiently extracts a task-agnostic graph that identifies crucial entities, relations, and attributes within the problem context. Subsequently, PoT identifies relevant reasoning chains within the graph corresponding to the posed question, facilitating inference of potential answers. Experimental evaluations on four benchmark datasets, which demand long reasoning chains, demonstrate that PoT surpasses state-of-the-art baselines by a significant margin (maximum 21.3%) without necessitating fine-tuning or extensive LLM calls. Furthermore, as opposed to prior neuro-symbolic methods, PoT exhibits improved resilience against LLM errors by leveraging the compositional nature of graphs.
Brief Review
This research paper on the Path-of-Thoughts (PoT) framework addresses the challenges large language models (LLMs) face in relational reasoning by decomposing the reasoning process into a structured pipeline. The framework uses graph extraction, path identification, and reasoning stages to strengthen LLMs on complex relational problems. The authors report that PoT outperforms existing methods on benchmark datasets and is markedly more robust to LLM extraction errors.
The key contribution is an innovative method that integrates graph extraction with the reasoning task, in line with current trends in neuro-symbolic AI. Experimental results show performance gains on multiple datasets, demonstrating the framework's effectiveness and feasibility. This research is significant for the field, especially for complex relational problems, and offers a new perspective and fresh ideas for future work. Overall, the paper shows a strong theoretical foundation and practical value, and merits further investigation.
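The middle stage, identifying the reasoning path between the two entities named in the question, is plain graph search once the LLM has extracted relation triples. The triples below are a hand-written stand-in for the extraction step, and composing relations along the path (the final reasoning stage) is left to a lookup table or a last LLM call.

```python
from collections import deque

def find_path(triples, src, dst):
    """BFS over (head, relation, tail) triples; returns the chain of
    relations from src to dst, or None if no path exists."""
    adj = {}
    for head, rel, tail in triples:
        adj.setdefault(head, []).append((rel, tail))
    queue, seen = deque([(src, [])]), {src}
    while queue:
        node, chain = queue.popleft()
        if node == dst:
            return chain
        for rel, nxt in adj.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, chain + [rel]))
    return None

# Toy kinship example: Ann is Bob's mother, Bob is Cara's father.
triples = [("Ann", "mother_of", "Bob"), ("Bob", "father_of", "Cara")]
print(find_path(triples, "Ann", "Cara"))  # chain of relations Ann -> Cara
```

Because the graph is compositional, a single wrong triple corrupts only the paths through it, which is the source of the robustness to extraction errors the review mentions.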
We welcome your valuable suggestions in the comments section, including but not limited to:
Pointing out shortcomings in this post's paper reviews! Sharing recent papers you find even more worth recommending, along with your reasons!
END