Paper Sharing | Recent Research Progress on Large Language Models
Comprehensive and Practical Evaluation of Retrieval-Augmented Generation Systems for Medical Question Answering
I'm Spartacus, No, I'm Spartacus: Measuring and Understanding LLM Identity Confusion
MARS: Unleashing the Power of Variance Reduction for Training Large Models
Squeezed Attention: Accelerating Long Context Length LLM Inference
TEESlice: Protecting Sensitive Neural Network Models in Trusted Execution Environments When Attackers have Pre-Trained Models
1.Comprehensive and Practical Evaluation of Retrieval-Augmented Generation Systems for Medical Question Answering
Authors:Nghia Trung Ngo, Chien Van Nguyen, Franck Dernoncourt, Thien Huu Nguyen
https://arxiv.org/abs/2411.09213
Abstract
Retrieval-augmented generation (RAG) has emerged as a promising approach to enhance the performance of large language models (LLMs) in knowledge-intensive tasks such as those from medical domain. However, the sensitive nature of the medical domain necessitates a completely accurate and trustworthy system. While existing RAG benchmarks primarily focus on the standard retrieve-answer setting, they overlook many practical scenarios that measure crucial aspects of a reliable medical system. This paper addresses this gap by providing a comprehensive evaluation framework for medical question-answering (QA) systems in a RAG setting for these situations, including sufficiency, integration, and robustness. We introduce Medical Retrieval-Augmented Generation Benchmark (MedRGB) that provides various supplementary elements to four medical QA datasets for testing LLMs’ ability to handle these specific scenarios. Utilizing MedRGB, we conduct extensive evaluations of both state-of-the-art commercial LLMs and open-source models across multiple retrieval conditions. Our experimental results reveal current models’ limited ability to handle noise and misinformation in the retrieved documents. We further analyze the LLMs’ reasoning processes to provide valuable insights and future directions for developing RAG systems in this critical medical domain.
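Brief Review
This paper examines retrieval-augmented generation (RAG) for medical question answering and proposes a new evaluation framework, the Medical Retrieval-Augmented Generation Benchmark (MedRGB), intended to provide a more comprehensive and more demanding test standard for medical QA. The framework pays particular attention to how well systems handle noisy information, and in doing so exposes the limitations of current AI systems in medical applications. Through extensive experiments, the paper demonstrates the value of MedRGB for evaluating different models and offers useful lessons for future research. Overall, the work has both theoretical and practical significance for understanding and improving medical AI systems.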
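To make the noise-handling scenario more concrete, here is a minimal sketch of how a retrieval context with a controlled share of irrelevant documents could be assembled before it is passed to a model. This is not MedRGB's actual construction procedure: the function name `build_noisy_context`, its parameters, and the placeholder documents are all hypothetical and chosen only for illustration.

```python
import random

def build_noisy_context(signal_docs, noise_docs, k, noise_ratio, seed=0):
    """Return k (document, label) pairs, round(k * noise_ratio) of them noise.

    Hypothetical helper: it only illustrates mixing relevant and
    irrelevant documents at a fixed ratio.
    """
    if not 0.0 <= noise_ratio <= 1.0:
        raise ValueError("noise_ratio must be between 0 and 1")
    n_noise = round(k * noise_ratio)
    n_signal = k - n_noise
    if n_signal > len(signal_docs) or n_noise > len(noise_docs):
        raise ValueError("not enough documents for the requested mix")
    rng = random.Random(seed)  # fixed seed keeps the mix reproducible
    docs = [(d, "signal") for d in rng.sample(signal_docs, n_signal)]
    docs += [(d, "noise") for d in rng.sample(noise_docs, n_noise)]
    rng.shuffle(docs)  # avoid leaking relevance through position
    return docs

# Placeholder documents, not taken from any dataset.
signal = [f"relevant passage {i}" for i in range(5)]
noise = [f"unrelated passage {i}" for i in range(5)]
context = build_noisy_context(signal, noise, k=5, noise_ratio=0.4)
labels = [label for _, label in context]
```

Sweeping `noise_ratio` from 0 to 1 and scoring the model's answers at each setting is one simple way to observe how accuracy changes as the retrieved context gets noisier.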
2.I'm Spartacus, No, I'm Spartacus: Measuring and Understanding LLM Identity Confusion
Authors:Kun Li, Shichao Zhuang, Yue Zhang, Minghui Xu, Ruoxi Wang, Kaidi Xu, Xinwen Fu, Xiuzhen Cheng
https://arxiv.org/abs/2411.10683
Abstract
Large Language Models (LLMs) excel in diverse tasks such as text generation, data analysis, and software development, making them indispensable across domains like education, business, and creative industries. However, the rapid proliferation of LLMs (with over 560 companies developing or deploying them as of 2024) has raised concerns about their originality and trustworthiness. A notable issue, termed “identity confusion,” has emerged, where LLMs misrepresent their origins or identities. This study systematically examines identity confusion through three research questions: (1) How prevalent is identity confusion among LLMs? (2) Does it arise from model reuse, plagiarism, or hallucination? (3) What are the security and trust-related impacts of identity confusion? To address these, we developed an automated tool combining documentation analysis, self-identity recognition testing, and output similarity comparisons (established methods for LLM fingerprinting) and conducted a structured survey via Credamo to assess its impact on user trust. Our analysis of 27 LLMs revealed that 25.93% exhibit identity confusion. Output similarity analysis confirmed that these issues stem from hallucinations rather than replication or reuse. Survey results further highlighted that identity confusion significantly erodes trust, particularly in critical tasks like education and professional use, with declines exceeding those caused by logical errors or inconsistencies. Users attributed these failures to design flaws, incorrect training data, and perceived plagiarism, underscoring the systemic risks posed by identity confusion to LLM reliability and trustworthiness.
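Brief Review
This paper investigates identity confusion in large language models (LLMs), in which a model misrepresents its own origin or identity and thereby undermines user trust. The study poses three research questions: how prevalent identity confusion is, what causes it (model reuse, plagiarism, or hallucination), and what security and trust risks it creates. To analyze the phenomenon, the researchers ran a quantitative analysis with an automated tool and measured changes in user trust through a structured survey. The results show that 25.93% of the 27 evaluated LLMs exhibited identity confusion, and that the problem seriously erodes user trust, especially in high-stakes scenarios such as education and professional use. These findings matter for understanding the security and reliability of LLMs and offer a useful reference for practitioners. In short, the work enriches research on LLM security and can inform risk management for LLMs deployed in practice.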
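As a quick sanity check on the headline number, the short calculation below shows that 25.93% of 27 models corresponds to a whole-number count. The count itself is inferred from the two figures in the abstract and is not stated there, so treat it as an assumption.

```python
evaluated_models = 27      # from the abstract
reported_share = 25.93     # percent, from the abstract

# Nearest whole number of models that matches the reported share (inferred).
confused_models = round(evaluated_models * reported_share / 100)

# Recompute the share from that count to confirm it rounds to the same figure.
recomputed_share = round(100 * confused_models / evaluated_models, 2)
```

The recomputed share matches the reported one, which suggests the percentage is a plain ratio over the 27 evaluated models.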
3.MARS: Unleashing the Power of Variance Reduction for Training Large Models
Authors:Huizhuo Yuan, Yifeng Liu, Shuang Wu, Xun Zhou, Quanquan Gu
https://arxiv.org/abs/2411.10438
Abstract
Training deep neural networks—and more recently, large models—demands efficient and scalable optimizers. Adaptive gradient algorithms like Adam, AdamW, and their variants have been central to this task. Despite the development of numerous variance reduction algorithms in the past decade aimed at accelerating stochastic optimization in both convex and nonconvex settings, variance reduction has not found widespread success in training deep neural networks or large language models. Consequently, it has remained a less favored approach in modern AI. In this paper, to unleash the power of variance reduction for efficient training of large models, we propose a unified optimization framework, MARS (Make vAriance Reduction Shine), which reconciles preconditioned gradient methods with variance reduction via a scaled stochastic recursive momentum technique. Within our framework, we introduce three instances of MARS that leverage preconditioned gradient updates based on AdamW, Lion, and Shampoo, respectively. We also draw a connection between our algorithms and existing optimizers. Experimental results on training GPT-2 models indicate that MARS consistently outperforms AdamW by a large margin.
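Brief Review
This paper proposes MARS, a new optimization framework that integrates variance reduction into adaptive gradient methods for training large models, in particular large language models such as GPT-2. It introduces three MARS instances based on AdamW, Lion, and Shampoo, and reports experiments showing an advantage over AdamW. The main strengths are as follows. First, it addresses an important problem in large-model optimization, namely how to use variance reduction effectively. Second, it provides experimental evidence that MARS outperforms AdamW when training GPT-2 models. Third, it offers a framework that combines preconditioned gradient methods with variance reduction. Together, these points make MARS an attractive research direction.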
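The general idea of pairing a recursive-momentum gradient correction with an AdamW-style preconditioned step can be sketched in a few lines of plain Python. This is not the authors' reference implementation: the exact form of the correction term, the norm clipping, and every hyperparameter value (`gamma`, `beta1`, `beta2`, `lr`) are assumptions made for illustration, and `noisy_quadratic_grad` is a hypothetical toy objective.

```python
import math
import random

def noisy_quadratic_grad(x, sample, target=(1.0, -2.0)):
    """Toy stochastic gradient of 0.5 * ||x - target||^2 (hypothetical objective)."""
    noise = 0.2 * (sample - 0.5)  # noise depends only on the drawn sample
    return [xi - ti + noise for xi, ti in zip(x, target)]

def variance_reduced_adamw(grad_fn, x0, steps=300, lr=0.05, beta1=0.9,
                           beta2=0.99, gamma=0.025, weight_decay=0.0,
                           eps=1e-8, seed=0):
    """Sketch: AdamW-style update driven by a variance-reduced gradient."""
    rng = random.Random(seed)
    x, x_prev = list(x0), list(x0)
    m = [0.0] * len(x)
    v = [0.0] * len(x)
    scale = gamma * beta1 / (1.0 - beta1)  # assumed scaling of the correction
    for t in range(1, steps + 1):
        sample = rng.random()
        # Evaluate the SAME sample at the current and previous iterate,
        # so the shared noise cancels in the difference.
        g = grad_fn(x, sample)
        g_prev = grad_fn(x_prev, sample)
        c = [gi + scale * (gi - gpi) for gi, gpi in zip(g, g_prev)]
        norm = math.sqrt(sum(ci * ci for ci in c))
        if norm > 1.0:  # clip the corrected gradient to unit norm
            c = [ci / norm for ci in c]
        # Standard Adam moment estimates, applied to the corrected gradient.
        m = [beta1 * mi + (1.0 - beta1) * ci for mi, ci in zip(m, c)]
        v = [beta2 * vi + (1.0 - beta2) * ci * ci for vi, ci in zip(v, c)]
        m_hat = [mi / (1.0 - beta1 ** t) for mi in m]
        v_hat = [vi / (1.0 - beta2 ** t) for vi in v]
        x_prev = x
        x = [xi - lr * (mh / (math.sqrt(vh) + eps) + weight_decay * xi)
             for xi, mh, vh in zip(x, m_hat, v_hat)]
    return x

def distance(x, target=(1.0, -2.0)):
    return math.sqrt(sum((xi - ti) ** 2 for xi, ti in zip(x, target)))

start = [0.0, 0.0]
final = variance_reduced_adamw(noisy_quadratic_grad, start)
```

Setting `gamma=0.0` removes the correction term and leaves an ordinary AdamW-style update on clipped gradients, which makes it easy to compare the two variants on the same toy problem.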
4.Squeezed Attention: Accelerating Long Context Length LLM Inference
Authors:Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Monishwaran Maheswaran, June Paik, Michael W. Mahoney, Kurt Keutzer, Amir Gholami
https://arxiv.org/abs/2411.09688
Abstract
Emerging Large Language Model (LLM) applications require long input prompts in order to perform complex downstream tasks like document analysis and code generation. For these long context length applications, the length of the input prompt poses a significant challenge in terms of inference efficiency since the inference costs increase linearly with sequence length. However, for many of these applications, much of the context in the prompt is fixed across different user inputs, thereby providing the opportunity to perform offline optimizations to process user inputs quickly, as they are received. In this work, we propose SQUEEZED ATTENTION as a mechanism to accelerate LLM applications where a large portion of the input prompt is fixed. To accomplish this, we first leverage K-means clustering offline to group the keys for the fixed context based on semantic similarity and represent each cluster with a single centroid value. During inference, we compare query tokens from the user input with the centroids to predict which of the keys from the fixed context are semantically relevant and need to be loaded during inference. We then compute exact attention using only these important keys from the fixed context. This method maintains model accuracy while significantly reducing bandwidth and computational costs, as exact attention is computed with only a subset of the fixed context tokens. We also extend our method to use a hierarchical centroid lookup to identify important keys, which can reduce the complexity of attention from linear to logarithmic with respect to the fixed context length. To realize our method’s efficiency benefits, we implement optimized Triton kernels for centroid comparison and sparse FlashAttention with important keys, achieving more than 4× speedups during both the prefill and generation phases for long-context inference.
Furthermore, we have extensively evaluated our method on various long-context benchmarks including LongBench, where it achieves a 3.1× reduction in KV cache budget without accuracy loss. For applications where small accuracy degradation is allowed, we can achieve up to an 8× reduction with less than 0.5 point accuracy gap for the LLaMA-2-7B-32K, LWM-Text-Chat-1M, and Longchat-7B-v1.5-32K models. Our code is available at https://github.com/SqueezeAILab/SqueezedAttention.
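Brief Review
This paper proposes SQUEEZED ATTENTION, a method for improving the inference efficiency of large language models (LLMs) on long contexts. It targets applications in which a large portion of the prompt is fixed across user inputs, and reduces inference cost by clustering the keys of that fixed context offline and then loading only the keys whose centroids match the incoming query. The authors implement optimized Triton kernels to realize the speedup, and accuracy is maintained across several benchmarks.
The main contribution is a solution to the challenges LLMs face with long contexts, such as computational complexity and memory consumption. Detailed experiments support the proposed optimizations: the paper reports more than 4× speedups in both the prefill and generation phases, and a 3.1× reduction in KV cache budget on LongBench without accuracy loss.
Overall, the paper offers valuable insight into the performance bottlenecks of LLM inference and backs it up with empirical results. It improves efficiency and also opens up new directions for future research.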
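The offline clustering and online lookup described in the abstract can be illustrated with a small pure-Python sketch. It is a simplification under stated assumptions, not the authors' implementation: it uses Euclidean k-means with a deterministic initialization, picks a fixed number of top-scoring clusters instead of any thresholding scheme, and handles a single query. All function names and the toy keys and values are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def kmeans(points, k, iters=10):
    """Plain Lloyd's k-means; returns (centroids, cluster index per point)."""
    n = len(points)
    # Deterministic init from k evenly strided points (illustrative choice).
    centroids = [list(points[i * n // k]) for i in range(k)]
    assign = [0] * n
    for _ in range(iters):
        for i, p in enumerate(points):
            dists = [sum((x - c) ** 2 for x, c in zip(p, cen)) for cen in centroids]
            assign[i] = dists.index(min(dists))
        for j in range(k):
            members = [points[i] for i in range(n) if assign[i] == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assign

def select_keys(query, centroids, assign, num_clusters):
    """Indices of keys in the clusters whose centroids score highest for the query."""
    ranked = sorted(range(len(centroids)),
                    key=lambda j: dot(query, centroids[j]), reverse=True)
    keep = set(ranked[:num_clusters])
    return [i for i, j in enumerate(assign) if j in keep]

def attention(query, keys, values, indices=None):
    """Exact softmax attention for one query, optionally restricted to `indices`."""
    indices = list(range(len(keys))) if indices is None else list(indices)
    scores = [dot(query, keys[i]) / math.sqrt(len(query)) for i in indices]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * values[i][d] for w, i in zip(weights, indices)) / total
            for d in range(dim)]

# Toy fixed context: two well-separated groups of keys.
fixed_keys = [[1.0, 0.1], [0.9, -0.1], [1.1, 0.0],
              [-1.0, 0.1], [-0.9, -0.1], [-1.1, 0.0]]
fixed_values = [[1.0], [2.0], [3.0], [10.0], [20.0], [30.0]]

centroids, assign = kmeans(fixed_keys, k=2)                 # offline step
query = [8.0, 0.0]                                          # arrives at inference time
important = select_keys(query, centroids, assign, num_clusters=1)
sparse_out = attention(query, fixed_keys, fixed_values, important)
full_out = attention(query, fixed_keys, fixed_values)
```

In this toy setup the keys that were skipped carry almost no attention weight, so the sparse result stays close to full attention while touching only half of the keys. The savings in a real system come from not loading the skipped keys at all, which is what the paper's Triton kernels are for.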
5.TEESlice: Protecting Sensitive Neural Network Models in Trusted Execution Environments When Attackers have Pre-Trained Models
Authors:Ding Li, Ziqi Zhang, Mengyu Yao, Yifeng Cai, Yao Guo, Xiangqun Chen
https://arxiv.org/abs/2411.09945
Abstract
Trusted Execution Environments (TEE) are used to safeguard on-device models. However, directly employing TEEs to secure the entire DNN model is challenging due to the limited computational speed. Utilizing GPU can accelerate DNN’s computation speed but commercial widely-available GPUs usually lack security protection. To this end, scholars introduce TEE-shielded DNN partition (TSDP), a method that protects privacy-sensitive weights within TEEs and offloads insensitive weights to GPUs. Nevertheless, current methods do not consider the presence of a knowledgeable adversary who can access abundant publicly available pre-trained models and datasets. This paper investigates the security of existing methods against such a knowledgeable adversary and reveals their inability to fulfill their security promises. Consequently, we introduce a novel partition before training strategy, which effectively separates privacy-sensitive weights from other components of the model. Our evaluation demonstrates that our approach can offer full model protection with a computational cost reduced by a factor of 10. In addition to traditional CNN models, we also demonstrate the scalability to large language models. Our approach can compress the private functionalities of the large language model to lightweight slices and achieve the same level of protection as the shielding-whole-model baseline.
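Brief Review
This paper presents TEESlice, a new framework for protecting neural network models inside Trusted Execution Environments (TEEs) against knowledgeable adversaries who have access to public pre-trained models and datasets. TEESlice adopts a partition-before-training strategy that separates privacy-sensitive weights from the non-sensitive components of the model, which substantially lowers computational overhead while preserving security and model accuracy. This strategy is the paper's key innovation. According to the experiments, TEESlice provides black-box-level protection with a computational cost reduced by a factor of 10. The paper also shows that the approach scales to large language models, which underlines its relevance to current AI applications. Overall, the work offers an effective way to keep models in TEEs secure against well-informed attackers.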
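We welcome your suggestions in the comments, including but not limited to:
Point out shortcomings in the brief reviews in this post! Share recent papers that you think are more worth recommending, and tell us why!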