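Paper Sharing | Recent Research on Agents
From the 32 papers published between 2024-11-22 and 2024-11-28, we have selected 5 outstanding works to share with our readers.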
Safe Multi-Agent Reinforcement Learning with Convergence to Generalized Nash Equilibrium
ShowUI: One Vision-Language-Action Model for GUI Visual Agent
Why the Agent Made that Decision: Explaining Deep Reinforcement Learning with Vision Masks
Agent-Based Modelling Meets Generative AI in Social Network Simulations
Towards Next-Generation Medical Agent: How o1 is Reshaping Decision-Making in Medical Scenarios
1.Safe Multi-Agent Reinforcement Learning with Convergence to Generalized Nash Equilibrium
Authors: Zeyang Li, Navid Azizan
https://arxiv.org/abs/2411.15036
Abstract
Multi-agent reinforcement learning (MARL) has achieved notable success in cooperative tasks, demonstrating impressive performance and scalability. However, deploying MARL agents in real-world applications presents critical safety challenges. Current safe MARL algorithms are largely based on the constrained Markov decision process (CMDP) framework, which enforces constraints only on discounted cumulative costs and lacks an all-time safety assurance. Moreover, these methods often overlook the feasibility issue (where the system will inevitably violate state constraints within certain regions of the constraint set), resulting in either suboptimal performance or increased constraint violations. To address these challenges, we propose a novel theoretical framework for safe MARL with state-wise constraints, where safety requirements are enforced at every state the agents visit. To resolve the feasibility issue, we leverage a control-theoretic notion of the feasible region, the controlled invariant set (CIS), characterized by the safety value function. We develop a multi-agent method for identifying CISs, ensuring convergence to a Nash equilibrium on the safety value function. By incorporating CIS identification into the learning process, we introduce a multi-agent dual policy iteration algorithm that guarantees convergence to a generalized Nash equilibrium in state-wise constrained cooperative Markov games, achieving an optimal balance between feasibility and performance. Furthermore, for practical deployment in complex high-dimensional systems, we propose Multi-Agent Dual Actor-Critic (MADAC), a safe MARL algorithm that approximates the proposed iteration scheme within the deep RL paradigm. Empirical evaluations on safe MARL benchmarks demonstrate that MADAC consistently outperforms existing methods, delivering much higher rewards while reducing constraint violations.
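Brief Review
This paper on safe multi-agent reinforcement learning (MARL) proposes a new theoretical framework that enforces state-wise constraints and drives the agents to a generalized Nash equilibrium. It introduces a multi-agent dual policy iteration algorithm together with a deep RL approximation of it, Multi-Agent Dual Actor-Critic (MADAC), and shows empirically that MADAC outperforms existing methods.
The paper's main strength is that it tackles the safety challenges of MARL directly. By using controlled invariant sets (CISs) as the notion of feasible region, it addresses the feasibility issue that CMDP-based methods tend to overlook. In addition, MADAC delivers higher rewards with fewer constraint violations on safe MARL benchmarks.
Overall, the paper provides solid theoretical support and validates it empirically, opening a new direction for safety research in MARL and making a meaningful contribution to the field.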
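The contrast the abstract draws between CMDP constraints and state-wise constraints can be written out side by side. The notation below (cost function c, budget d, constraint function h, deterministic dynamics f, candidate set C) is assumed here for illustration and is not taken from the paper.

```latex
% CMDP-style constraint: only the expected discounted cumulative cost is bounded,
% so individual states along a trajectory may still be unsafe.
\[
\mathbb{E}_{\pi}\Big[\sum_{t=0}^{\infty} \gamma^{t}\, c(s_t, a_t)\Big] \le d
\]

% State-wise constraint: the safety condition must hold at every visited state.
\[
h(s_t) \le 0 \quad \text{for all } t \ge 0
\]

% Controlled invariant set (assuming deterministic dynamics s_{t+1} = f(s_t, a_t)):
% a subset of the constraint set from which some action always keeps the system inside.
\[
\mathcal{C} \subseteq \{\, s : h(s) \le 0 \,\}, \qquad
\forall s \in \mathcal{C}\ \ \exists a \ \text{such that}\ f(s, a) \in \mathcal{C}
\]
```

States that satisfy h(s) ≤ 0 but lie outside every such set are the ones where a later violation is unavoidable, which is the feasibility issue the abstract describes.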
2.ShowUI: One Vision-Language-Action Model for GUI Visual Agent
Authors: Kevin Qinghong Lin, Linjie Li, Difei Gao, Zhengyuan Yang, Shiwei Wu, Zechen Bai, Weixian Lei, Lijuan Wang, Mike Zheng Shou
https://arxiv.org/abs/2411.17465
Abstract
Building Graphical User Interface (GUI) assistants holds significant promise for enhancing human workflow productivity. While most agents are language-based, relying on closed-source API with text-rich meta-information (e.g., HTML or accessibility tree), they show limitations in perceiving UI visuals as humans do, highlighting the need for GUI visual agents. In this work, we develop a vision-language-action model in digital world, namely ShowUI, which features the following innovations: (i) UI-Guided Visual Token Selection to reduce computational costs by formulating screenshots as an UI connected graph, adaptively identifying their redundant relationship and serve as the criteria for token selection during self-attention blocks; (ii) Interleaved Vision-Language-Action Streaming that flexibly unifies diverse needs within GUI tasks, enabling effective management of visual-action history in navigation or pairing multi-turn query-action sequences per screenshot to enhance training efficiency; (iii) Small-scale High-quality GUI Instruction-following Datasets by careful data curation and employing a resampling strategy to address significant data type imbalances. With above components, ShowUI, a lightweight 2B model using 256K data, achieves a strong 75.1% accuracy in zero-shot screenshot grounding. Its UI-guided token selection further reduces 33% of redundant visual tokens during training and speeds up the performance by 1.4×. Navigation experiments across web [12], mobile [36], and online [40] environments further underscore the effectiveness and potential of our model in advancing GUI visual agents. The models are available at https://github.com/showlab/ShowUI.
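Brief Review
The paper "ShowUI: One Vision-Language-Action Model for GUI Visual Agent" proposes ShowUI, a model aimed at the challenges of GUI automation. It combines visual and language inputs to improve performance on graphical user interface (GUI) tasks. Using UI-guided visual token selection and interleaved vision-language-action streaming, ShowUI improves both efficiency and effectiveness.
The core contribution is an in-depth study of how vision and language interact, offering a new approach to GUI automation. The experiments show that ShowUI performs well on zero-shot screenshot grounding and on navigation tasks, which points to its practical value. That said, although the results are encouraging, further research is needed to verify the model's long-term stability and generalization.
Overall, this is a noteworthy piece of work: it demonstrates the potential of combining vision and language for GUI automation and suggests directions for future research.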
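To give a feel for the UI-guided token selection idea, here is a minimal Python sketch. It rests on assumptions that go beyond the abstract: each screenshot patch is a graph node, 4-adjacent patches with matching RGB values are connected, and one patch is sampled at random from each connected component. The function names `build_components` and `select_tokens` are hypothetical and are not ShowUI's API.

```python
import random


def build_components(patches, cols, tol=0):
    """Group 4-adjacent patches whose RGB values differ by at most `tol` per channel.

    `patches` is a row-major list of (r, g, b) tuples, one per screenshot patch.
    Returns a list mapping each patch index to its component root (union-find).
    """
    parent = list(range(len(patches)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    def similar(a, b):
        return all(abs(x - y) <= tol for x, y in zip(a, b))

    for i, color in enumerate(patches):
        right, down = i + 1, i + cols
        # Only link to the right neighbour when it is in the same row.
        if (i + 1) % cols != 0 and similar(color, patches[right]):
            parent[find(right)] = find(i)
        if down < len(patches) and similar(color, patches[down]):
            parent[find(down)] = find(i)
    return [find(i) for i in range(len(patches))]


def select_tokens(components, keep_per_component=1, seed=0):
    """Keep at most `keep_per_component` randomly chosen patches per component."""
    rng = random.Random(seed)
    groups = {}
    for idx, root in enumerate(components):
        groups.setdefault(root, []).append(idx)
    kept = []
    for members in groups.values():
        kept.extend(rng.sample(members, min(keep_per_component, len(members))))
    return sorted(kept)


# A 2x4 toy "screenshot": a white background with one two-patch blue button.
WHITE, BLUE = (255, 255, 255), (0, 0, 255)
patches = [WHITE, WHITE, WHITE, WHITE,
           WHITE, BLUE, BLUE, WHITE]
components = build_components(patches, cols=4)
kept = select_tokens(components)
```

In this toy case the eight patches collapse into two components (background and button), so only two tokens survive. Large uniform regions of a real screenshot would shrink in the same way, while small, visually distinct elements keep their tokens.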
3.Why the Agent Made that Decision: Explaining Deep Reinforcement Learning with Vision Masks
Authors: Rui Zuo, Zifan Wang, Simon Khan, Garrett Ethan Katz, Qinru Qiu
https://arxiv.org/abs/2411.16120
Abstract
Due to the inherent lack of transparency in deep neural networks, it is challenging for deep reinforcement learning (DRL) agents to gain trust and acceptance from users, especially in safety-critical applications such as medical diagnosis and military operations. Existing methods for explaining an agent's decision either require retraining the agent using models that support explanation generation or rely on perturbation-based techniques to reveal the significance of different input features in the decision-making process. However, retraining the agent may compromise its integrity and performance, while perturbation-based methods have limited performance and lack knowledge accumulation or learning capabilities. Moreover, since each perturbation is performed independently, the joint state of the perturbed inputs may not be physically meaningful. To address these challenges, we introduce VisionMask, a standalone explanation model trained end-to-end to identify the most critical regions in the agent's visual input that can explain its actions. VisionMask is trained in a self-supervised manner without relying on human-generated labels. Importantly, its training does not alter the agent model, hence preserving the agent's performance and integrity. We evaluate VisionMask on Super Mario Bros (SMB) and three Atari games. Compared to existing methods, VisionMask achieves a 14.9% higher insertion accuracy and a 30.08% higher F1-Score in reproducing original actions from the selected visual explanations. We also present examples illustrating how VisionMask can be used for counterfactual analysis.
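Brief Review
This paper studies how to identify the critical regions in the visual input of deep reinforcement learning (DRL) agents. It proposes VisionMask, a self-supervised explanation model that makes a DRL agent's decisions more transparent and trustworthy without modifying the agent model. In comparative experiments on Super Mario Bros and three Atari games, VisionMask outperforms existing methods, with 14.9% higher insertion accuracy and a 30.08% higher F1-score in reproducing the original actions from the selected visual explanations. The work highlights a key problem in DRL, agent explainability, and shows the value of self-supervised learning for addressing it. In short, VisionMask offers a new perspective for future DRL research and shows potential for practical applications.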
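The idea of judging an explanation by whether the masked input reproduces the agent's original action can be illustrated with a small Python sketch. This is neither VisionMask's training procedure nor the paper's insertion metric. `toy_agent`, `apply_mask`, and `action_fidelity` are hypothetical names, and the agent is a hand-written stand-in for a frozen policy.

```python
def toy_agent(obs):
    """Hypothetical frozen policy: move toward the brighter half of the frame."""
    half = len(obs[0]) // 2
    left = sum(sum(row[:half]) for row in obs)
    right = sum(sum(row[half:]) for row in obs)
    return "left" if left > right else "right"


def apply_mask(obs, mask, background=0):
    """Keep pixels where the mask is 1 and replace the rest with `background`."""
    return [[p if m else background for p, m in zip(obs_row, mask_row)]
            for obs_row, mask_row in zip(obs, mask)]


def action_fidelity(agent, samples):
    """Fraction of (obs, mask) pairs whose masked input reproduces the original action."""
    hits = sum(agent(apply_mask(obs, mask)) == agent(obs) for obs, mask in samples)
    return hits / len(samples)


obs = [[9, 1, 0, 2],
       [8, 0, 1, 0]]
good_mask = [[1, 0, 0, 0],
             [1, 0, 0, 0]]  # keeps the bright region that drives the decision
bad_mask = [[0, 0, 0, 1],
            [0, 0, 1, 0]]   # keeps only pixels that do not drive the decision

score = action_fidelity(toy_agent, [(obs, good_mask), (obs, bad_mask)])
```

The agent itself is never modified; only its input changes. A mask that keeps the decisive region preserves the action, while a mask over irrelevant pixels flips it, so the score separates useful explanations from poor ones.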
4.Agent-Based Modelling Meets Generative AI in Social Network Simulations
Authors: Antonino Ferraro, Antonio Galli, Valerio La Gatta, Marco Postiglione, Gian Marco Orlando, Diego Russo, Giuseppe Riccio, Antonio Romano, Vincenzo Moscato
https://arxiv.org/abs/2411.16031
Abstract
Agent-Based Modelling (ABM) has emerged as an essential tool for simulating social networks, encompassing diverse phenomena such as information dissemination, influence dynamics, and community formation. However, manually configuring varied agent interactions and information flow dynamics poses challenges, often resulting in oversimplified models that lack real-world generalizability. Integrating modern Large Language Models (LLMs) with ABM presents a promising avenue to address these challenges and enhance simulation fidelity, leveraging LLMs’ human-like capabilities in sensing, reasoning, and behavior. In this paper, we propose a novel framework utilizing LLM-empowered agents to simulate social network users based on their interests and personality traits. The framework allows for customizable agent interactions resembling various social network platforms, including mechanisms for content resharing and personalized recommendations. We validate our framework using a comprehensive Twitter dataset from the 2020 US election, demonstrating that LLM-agents accurately replicate real users’ behaviors, including linguistic patterns and political inclinations. These agents form homogeneous ideological clusters and retain the main themes of their community. Notably, preference-based recommendations significantly influence agent behavior, promoting increased engagement, network homophily and the formation of echo chambers. Overall, our findings underscore the potential of LLM-agents in advancing social media simulations and unraveling intricate online dynamics.
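Brief Review
This paper proposes a framework that combines large language models (LLMs) with agent-based modelling (ABM) to improve the fidelity of social network simulations. The focus is on simulating user behavior from users' interests and personality traits, and the framework is demonstrated on a Twitter dataset from the 2020 US election. The study shows that LLM-agents can replicate the behavior of real users, including community formation and engagement patterns.
Key points: 1) combining LLMs with ABM offers a novel way to simulate social networks and may make agent behavior more accurate; 2) the framework supports customizable interactions that can resemble various social media platforms; 3) using real-world data makes the simulation more realistic and matters for understanding online dynamics. Overall, the paper presents an innovative framework aimed at more accurate and reliable social network simulation, offering a new perspective on online dynamics.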
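The simulation loop behind such a framework can be sketched in a few lines of Python. Everything here is a toy assumption: the LLM call is replaced by a stub with fixed reshare probabilities, each agent has a single binary stance, and homophily is measured as the share of reshares between same-stance agents. The names `stub_llm_decides_reshare`, `recommend`, and `simulate` are hypothetical.

```python
import random


def stub_llm_decides_reshare(agent_stance, post_stance, rng):
    """Stand-in for an LLM call: reshare aligned posts often, opposing posts rarely."""
    p = 0.9 if agent_stance == post_stance else 0.1
    return rng.random() < p


def recommend(agent, posts, mode, rng):
    """Return one post; 'preference' mode picks an aligned post when one exists."""
    if mode == "preference":
        aligned = [p for p in posts if p["stance"] == agent["stance"]]
        if aligned:
            return rng.choice(aligned)
    return rng.choice(posts)


def simulate(mode, n_agents=40, steps=200, seed=0):
    """Run the toy simulation and return the share of same-stance reshares."""
    rng = random.Random(seed)
    agents = [{"id": i, "stance": 1 if i % 2 else -1} for i in range(n_agents)]
    posts = [{"author": a["id"], "stance": a["stance"]} for a in agents]
    same, total = 0, 0
    for _ in range(steps):
        agent = rng.choice(agents)
        post = recommend(agent, posts, mode, rng)
        # Agents do not reshare their own posts.
        if post["author"] != agent["id"] and stub_llm_decides_reshare(
                agent["stance"], post["stance"], rng):
            total += 1
            same += post["stance"] == agent["stance"]
    return same / total if total else 0.0


homophily_random = simulate("random")
homophily_preference = simulate("preference")
```

Because the stub hard-codes the agents' preferences, the sketch only illustrates the mechanism (preference-based feeds remove cross-stance exposure) and is not evidence for the paper's findings. In the paper's setting the resharing decision would come from an LLM conditioned on each user's interests and personality.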
5.Towards Next-Generation Medical Agent: How o1 is Reshaping Decision-Making in Medical Scenarios
Authors: Shaochen Xu, Yifan Zhou, Zhengliang Liu, Zihao Wu, Tianyang Zhong, Huaqin Zhao, Yiwei Li, Hanqi Jiang, Yi Pan, Junhao Chen, Jin Lu, Wei Zhang, Tuo Zhang, Lu Zhang, Dajiang Zhu, Xiang Li, Wei Liu, Quanzheng Li, Andrea Sikora, Xiaoming Zhai, Zhen Xiang, Tianming Liu
https://arxiv.org/abs/2411.14461
Abstract
Artificial Intelligence (AI) has become essential in modern healthcare, with large language models (LLMs) offering promising advances in clinical decision-making. Traditional model-based approaches, including those leveraging in-context demonstrations and those with specialized medical fine-tuning, have demonstrated strong performance in medical language processing but struggle with real-time adaptability, multi-step reasoning, and handling complex medical tasks. Agent-based AI systems address these limitations by incorporating reasoning traces, tool selection based on context, knowledge retrieval, and both short- and long-term memory. These additional features enable the medical AI agent to handle complex medical scenarios where decision-making should be built on real-time interaction with the environment. Therefore, unlike conventional model-based approaches that treat medical queries as isolated questions, medical AI agents approach them as complex tasks and behave more like human doctors. In this paper, we study the choice of the backbone LLM for medical AI agents, which is the foundation for the agent’s overall reasoning and action generation. In particular, we consider the emergent o1 model and examine its impact on agents' reasoning, tool-use adaptability, and real-time information retrieval across diverse clinical scenarios, including high-stakes settings such as intensive care units (ICUs). Our findings demonstrate o1’s ability to enhance diagnostic accuracy and consistency, paving the way for smarter, more responsive AI tools that support better patient outcomes and decision-making efficacy in clinical practice.
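Brief Review
This paper examines how integrating the o1 model into medical AI agents can strengthen decision-making in clinical scenarios. Traditional LLM-based approaches have limitations, whereas o1's strengths lie in reasoning, tool-use adaptability, and real-time information retrieval. The study highlights o1's contribution to diagnostic accuracy and consistency, especially in high-stakes settings such as intensive care units (ICUs). By analyzing several AI agent frameworks, the paper shows how o1 can be combined with them and describes how to integrate it. In short, the paper explores o1's potential in medical AI and its effect on decision-making efficiency.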
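The agent structure described in the abstract (a backbone model that produces reasoning traces and selects tools, plus memory) can be sketched as a small Python loop with a swappable backbone. This is an illustrative skeleton only. `stub_backbone`, `run_agent`, and the `calc_map` tool are hypothetical names, the backbone is a hand-written stub rather than a call to o1 or any other real model, and the input values are arbitrary.

```python
def calc_map(systolic, diastolic):
    """Toy tool: mean arterial pressure estimate, (SBP + 2 * DBP) / 3."""
    return (systolic + 2 * diastolic) / 3


TOOLS = {"calc_map": calc_map}


def stub_backbone(task, short_term):
    """Stand-in for the backbone LLM; returns (thought, tool_name, tool_args)."""
    if "calc_map" not in short_term:
        return ("Compute mean arterial pressure first.", "calc_map",
                {"systolic": task["systolic"], "diastolic": task["diastolic"]})
    return ("All required values are available; stop.", None, {})


def run_agent(task, backbone, tools, max_steps=5):
    """Loop: ask the backbone for a thought and a tool, run the tool, remember the result."""
    short_term = {}  # working memory for this task, keyed by tool name
    trace = []       # reasoning trace, one entry per step
    for _ in range(max_steps):
        thought, tool_name, args = backbone(task, short_term)
        trace.append(thought)
        if tool_name is None:
            break
        short_term[tool_name] = tools[tool_name](**args)
    return short_term, trace


memory, trace = run_agent({"systolic": 120, "diastolic": 80}, stub_backbone, TOOLS)
```

Because `run_agent` takes the backbone as a parameter, different backbones can be compared while the tools and loop stay fixed, which mirrors the kind of backbone comparison the paper studies.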
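We welcome your suggestions in the comments, including but not limited to:
Pointing out shortcomings in our brief reviews!
Sharing recent papers that deserve a recommendation even more, along with your reasons!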
END