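Paper Sharing | Recent Research Progress on Agents
From the 13 papers published between 2024-12-18 and 2024-12-20, we have selected 5 outstanding works to share with readers.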
Proposer-Agent-Evaluator(PAE): Autonomous Skill Discovery For Foundation Model Internet Agents
ChinaTravel: A Real-World Benchmark for Language Agents in Chinese Travel Planning
Exploring Multi-Modal Integration with Tool-Augmented LLM Agents for Precise Causal Discovery
TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks
Heuristic Planner for Communication-Constrained Multi-Agent Multi-Goal Path Planning
1.Proposer-Agent-Evaluator(PAE): Autonomous Skill Discovery For Foundation Model Internet Agents
Authors: Yifei Zhou, Qianlan Yang, Kaixiang Lin, Min Bai, Xiong Zhou, Yu-Xiong Wang, Sergey Levine, Erran Li
https://arxiv.org/abs/2412.13194
Abstract
The vision of a broadly capable and goal-directed agent, such as an Internet-browsing agent in the digital world and a household humanoid in the physical world, has rapidly advanced, thanks to the generalization capability of foundation models. Such a generalist agent needs to have a large and diverse skill repertoire, such as finding directions between two travel locations and buying specific items from the Internet. If each skill needs to be specified manually through a fixed set of human-annotated instructions, the agent’s skill repertoire will necessarily be limited due to the quantity and diversity of human-annotated instructions. In this work, we address this challenge by proposing Proposer-Agent-Evaluator (PAE), an effective learning system that enables foundation model agents to autonomously discover and practice skills in the wild. At the heart of PAE is a context-aware task proposer that autonomously proposes tasks for the agent to practice with context information of the environment such as user demos or even just the name of the website itself for Internet-browsing agents. Then, the agent policy attempts those tasks with thoughts and actual grounded operations in the real world with resulting trajectories evaluated by an autonomous VLM-based success evaluator. The success evaluation serves as the reward signal for the agent to refine its policies through RL. We validate PAE on challenging vision-based web navigation, using both real-world and self-hosted websites from WebVoyager and WebArena. Our results show that PAE significantly improves the zero-shot generalization capability of VLM Internet agents (more than 30% relative improvement) to both unseen tasks and websites. Our model also achieves an absolute advantage of over 10% (from 22.6% to 33.0%) comparing to other state-of-the-art open source VLM agents including Qwen2VL-72B.
To the best of our knowledge, this work represents the first effective learning system to apply autonomous task proposal with RL for agents that generalizes real-world human-annotated benchmarks with SOTA performances. Our open-source checkpoints and code can be found in https://yanqval.github.io/PAE/ .
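Brief review
This paper proposes a framework called Proposer-Agent-Evaluator (PAE), which aims to let foundation model agents autonomously discover and practice the skills needed for web navigation tasks. The framework combines a context-aware task proposer, an agent policy, and an autonomous evaluator, and the reported results show markedly better zero-shot generalization than existing models. The paper's key points are twofold: first, it addresses a central challenge of skill discovery for autonomous agents, removing the need to specify tasks manually; second, the experiments show that PAE clearly outperforms state-of-the-art agents on task completion. The paper is clearly structured and integrates the task proposer, agent policy, and evaluator effectively, reflecting a sound design and implementation. Overall, it offers a new perspective and methodology for skill discovery by autonomous agents in web navigation tasks.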
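The propose, attempt, evaluate, and update cycle described in the abstract can be illustrated with a toy loop. This is a minimal sketch under loud assumptions, not the authors' implementation: `ToyPolicy`, `propose_tasks`, `evaluate`, and `pae_loop` are hypothetical names, the proposer uses fixed templates instead of a foundation model, the evaluator reads a flag instead of calling a VLM, and the "RL update" is a simple probability nudge.

```python
import random
from dataclasses import dataclass, field


@dataclass
class ToyPolicy:
    """Hypothetical stand-in for the agent policy: one success probability per task."""
    skill: dict = field(default_factory=dict)

    def attempt(self, task, rng):
        # A "trajectory" here is only the task plus whether the attempt worked.
        p = self.skill.get(task, 0.2)
        return {"task": task, "worked": rng.random() < p}

    def update(self, task, reward, lr=0.1):
        # Placeholder for an RL update: raise the success probability when rewarded.
        p = self.skill.get(task, 0.2)
        self.skill[task] = min(1.0, p + lr * reward)


def propose_tasks(context, n, rng):
    # Hypothetical proposer: a real system would prompt a model with the context
    # (for example, the website name); here we fill in fixed templates.
    templates = [
        "search {} for an item",
        "find directions on {}",
        "open the help page of {}",
    ]
    return [rng.choice(templates).format(context) for _ in range(n)]


def evaluate(trajectory):
    # Hypothetical evaluator: stands in for the VLM-based success judge
    # and returns a 0/1 reward.
    return 1.0 if trajectory["worked"] else 0.0


def pae_loop(context, rounds, tasks_per_round, seed=0):
    """Run the propose -> attempt -> evaluate -> update cycle."""
    rng = random.Random(seed)
    policy = ToyPolicy()
    history = []
    for _ in range(rounds):
        for task in propose_tasks(context, tasks_per_round, rng):
            trajectory = policy.attempt(task, rng)
            reward = evaluate(trajectory)
            policy.update(task, reward)
            history.append((task, reward))
    return policy, history


if __name__ == "__main__":
    policy, history = pae_loop("example.com", rounds=20, tasks_per_round=3)
    print(len(history), "attempts;", "learned skills:", policy.skill)
```

The point of the sketch is the data flow: no human-written task list appears anywhere, and the only learning signal is the evaluator's success judgment.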
2.ChinaTravel: A Real-World Benchmark for Language Agents in Chinese Travel Planning
Authors: Jie-Jing Shao, Xiao-Wen Yang, Bo-Wen Zhang, Baizhi Chen, Wen-Da Wei, Lan-Zhe Guo, Yu-feng Li
https://arxiv.org/abs/2412.13682
Abstract
Recent advances in LLMs, particularly in language reasoning and tool integration, have rapidly sparked the real-world development of Language Agents. Among these, travel planning represents a prominent domain, combining academic challenges with practical value due to its complexity and market demand. However, existing benchmarks fail to reflect the diverse, real-world requirements crucial for deployment. To address this gap, we introduce ChinaTravel, a benchmark specifically designed for authentic Chinese travel planning scenarios. We collect travel requirements from questionnaires and propose a compositionally generalizable domain-specific language that enables a scalable evaluation process, covering feasibility, constraint satisfaction, and preference comparison. Empirical studies reveal the potential of neuro-symbolic agents in travel planning, achieving a constraint satisfaction rate of 27.9%, significantly surpassing purely neural models at 2.6%. Moreover, we identify key challenges in real-world travel planning deployments, including open language reasoning and unseen concept composition. These findings highlight the significance of ChinaTravel as a pivotal milestone for advancing language agents in complex, real-world planning scenarios. Webpage: https://www.lamda.nju.edu.cn/shaojj/chinatravel
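Brief review
ChinaTravel is a benchmark for evaluating language agents on Chinese travel planning. Its main goal is to address the shortcomings of existing benchmarks by focusing on real-world requirements and introducing a domain-specific language for travel planning to verify constraints. The empirical results show that neuro-symbolic agents outperform purely neural models in this domain. The benchmark not only fills a notable gap in existing benchmarks but also attends to real-world needs and adopts human-written queries, which makes it more realistic and applicable. In addition, it uses a travel-planning-specific language to express logical constraints, which makes the evaluation process more flexible and scalable. In short, ChinaTravel provides a new perspective and standard for evaluating language agents in Chinese travel planning and should help drive further progress in this area.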
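To make the idea of composable, machine-checkable constraints concrete, here is a small sketch. It does not reproduce ChinaTravel's actual domain-specific language: the helpers `budget_at_most`, `visits`, `all_of`, and `satisfaction_rate` are hypothetical, the plan format and costs are made up for illustration, and the rate is assumed to be the fraction of plans that satisfy every constraint.

```python
from typing import Callable

Plan = dict
Constraint = Callable[[Plan], bool]


def total_cost(plan: Plan) -> float:
    """Sum the cost of every activity in a plan."""
    return sum(activity["cost"] for activity in plan["activities"])


def budget_at_most(limit: float) -> Constraint:
    # Atomic constraint: the whole plan must fit the budget.
    return lambda plan: total_cost(plan) <= limit


def visits(place: str) -> Constraint:
    # Atomic constraint: the plan must include a given place.
    return lambda plan: any(a["name"] == place for a in plan["activities"])


def all_of(*constraints: Constraint) -> Constraint:
    # Composition: new requirements are built from existing ones.
    return lambda plan: all(c(plan) for c in constraints)


def satisfaction_rate(plans: list, constraint: Constraint) -> float:
    """Fraction of plans that satisfy the constraint (assumed definition)."""
    if not plans:
        return 0.0
    return sum(constraint(p) for p in plans) / len(plans)


# Illustrative data only; names and costs are invented.
plans = [
    {"activities": [{"name": "Park A", "cost": 0}, {"name": "Museum B", "cost": 75}]},
    {"activities": [{"name": "Park A", "cost": 0}, {"name": "Show C", "cost": 320}]},
]
requirement = all_of(budget_at_most(100), visits("Park A"))
print(satisfaction_rate(plans, requirement))
```

Because each constraint is an ordinary function, unseen combinations of requirements can be scored automatically, which is the scalability property the review highlights.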
3.Exploring Multi-Modal Integration with Tool-Augmented LLM Agents for Precise Causal Discovery
Authors: ChengAo Shen, Zhengzhang Chen, Dongsheng Luo, Dongkuan Xu, Haifeng Chen, Jingchao Ni
https://arxiv.org/abs/2412.13667
Abstract
Causal inference is an imperative foundation for decision-making across domains, such as smart health, AI for drug discovery and AIOps. Traditional statistical causal discovery methods, while well-established, predominantly rely on observational data and often overlook the semantic cues inherent in cause-and-effect relationships. The advent of Large Language Models (LLMs) has ushered in an affordable way of leveraging the semantic cues for knowledge-driven causal discovery, but the development of LLMs for causal discovery lags behind other areas, particularly in the exploration of multi-modality data. To bridge the gap, we introduce MATMCD, a multi-agent system powered by tool-augmented LLMs. MATMCD has two key agents: a Data Augmentation agent that retrieves and processes modality-augmented data, and a Causal Constraint agent that integrates multi-modal data for knowledge-driven inference. Delicate design of the inner-workings ensures successful cooperation of the agents. Our empirical study across seven datasets suggests the significant potential of multi-modality enhanced causal discovery.
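Brief review
This paper explores how integrating multi-modal data with large language models (LLMs) can improve causal discovery. The authors propose MATMCD, a multi-agent system powered by tool-augmented LLMs, which contains a Data Augmentation agent and a Causal Constraint agent. Experiments on multiple datasets indicate that MATMCD noticeably improves causal discovery performance, a result that may benefit follow-up research. It is also noted that, although MATMCD's effectiveness has been validated, the clarity of its methodology and the limits of its novelty still need further discussion and refinement. Overall, the study offers useful insights for cross-domain research and shows the potential value of cross-modal learning in causal reasoning.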
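The division of labor between the two agents can be pictured as a two-stage filter over a candidate causal graph. The sketch below is only an illustration under stated assumptions, not MATMCD itself: `augment`, `constrain`, and `toy_judge` are hypothetical names, the "retrieved" text comes from a hard-coded dictionary rather than a tool call, and the judge is a substring check standing in for an LLM.

```python
def augment(variables, knowledge_base):
    """Hypothetical Data Augmentation step: attach a text description to each variable."""
    return {v: knowledge_base.get(v, "") for v in variables}


def constrain(edges, descriptions, judge):
    """Hypothetical Causal Constraint step: keep only the edges the judge accepts."""
    return [(cause, effect) for cause, effect in edges
            if judge(cause, effect, descriptions)]


def toy_judge(cause, effect, descriptions):
    # Stand-in for an LLM call: accept an edge when the effect's
    # description mentions the cause.
    return cause in descriptions.get(effect, "")


# Candidate edges, as a statistical method might propose them (invented data).
variables = ["rain", "wet_ground", "umbrella_sales"]
candidate_edges = [
    ("rain", "wet_ground"),
    ("wet_ground", "umbrella_sales"),
    ("rain", "umbrella_sales"),
]
knowledge_base = {
    "wet_ground": "The ground gets wet after rain.",
    "umbrella_sales": "People buy umbrellas during rain.",
}

descriptions = augment(variables, knowledge_base)
refined_edges = constrain(candidate_edges, descriptions, toy_judge)
print(refined_edges)
```

In this toy run the edge from `wet_ground` to `umbrella_sales` is dropped because the added text gives no support for it, which mirrors the idea of using semantic cues to correct a purely data-driven graph.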
4.TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks
Authors: Frank F. Xu, Yufan Song, Boxuan Li, Yuxuan Tang, Kritanjali Jain, Mengxue Bao, Zora Z. Wang, Xuhui Zhou, Zhitong Guo, Murong Cao, Mingyang Yang, Hao Yang Lu, Amaad Martin, Zhe Su, Leander Maben, Raj Mehta, Wayne Chi, Lawrence Jang, Yiqing Xie, Shuyan Zhou, Graham Neubig
https://arxiv.org/abs/2412.14161
Abstract
We interact with computers on an everyday basis, be it in everyday life or work, and many aspects of work can be done entirely with access to a computer and the Internet. At the same time, thanks to improvements in large language models (LLMs), there has also been a rapid development in AI agents that interact with and affect change in their surrounding environments. But how performant are AI agents at helping to accelerate or even autonomously perform work-related tasks? The answer to this question has important implications for both industry looking to adopt AI into their workflows and for economic policy to understand the effects that the adoption of AI may have on the labor market. To measure the progress of these LLM agents' performance on performing real-world professional tasks, in this paper, we introduce TheAgentCompany, an extensible benchmark for evaluating AI agents that interact with the world in similar ways to those of a digital worker: by browsing the Web, writing code, running programs, and communicating with other coworkers. We build a self-contained environment with internal websites and data that mimics a small software company environment and create a variety of tasks that may be performed by workers in such a company. We test baseline agents powered by both closed API-based and open-weights language models (LMs) and find that with the most competitive agent, 24% of the tasks can be completed autonomously. This paints a nuanced picture of task automation with LM agents -- in a setting simulating a real workplace, a good portion of simpler tasks could be solved autonomously, but more difficult long-horizon tasks are still beyond the reach of current systems.
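Brief review
TheAgentCompany is a benchmark for evaluating how well AI agents automate tasks in a simulated software company environment. Its goal is to measure how effectively large language models can automate work-related tasks. By providing a custom, reproducible environment for assessing agent performance, the benchmark offers a structured methodology for this area. The tasks reflect real work scenarios and cover a variety of job roles. Overall, TheAgentCompany is an important research project for advancing AI in the automation of professional tasks.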
5.Heuristic Planner for Communication-Constrained Multi-Agent Multi-Goal Path Planning
Authors: Jáchym Herynek, Stefan Edelkamp
https://arxiv.org/abs/2412.13719
Abstract
In robotics, coordinating a group of robots is an essential task. This work presents the communication-constrained multi-agent multi-goal path planning problem and proposes a graph-search based algorithm to address this task. Given a fleet of robots, an environment represented by a weighted graph, and a sequence of goals, the aim is to visit all the goals without breaking the communication constraints between the agents, minimizing the completion time. The resulting paths produced by our approach show how the agents can coordinate their individual paths, not only with respect to the next goal but also with respect to all future goals, all the time keeping the communication within the fleet intact.
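Brief review
This paper studies the communication-constrained multi-agent multi-goal path planning (CC-PP) problem and proposes a graph-search-based algorithm that lets agents coordinate their paths while respecting communication constraints. The objective is to minimize completion time while visiting a set of goals in the specified order. By adding communication constraints to multi-agent path planning, the paper introduces a new layer of complexity and offers a novel solution for the field. Its detailed worked examples also help clarify the proposed strategy and where it applies. In sum, the paper has significant theoretical value and application prospects for research on multi-agent path planning.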
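The problem setting can be made concrete with a brute-force search over joint agent states. This is a simplified sketch, not the paper's heuristic planner, and it rests on several assumptions: edges have unit travel time (the paper uses a weighted graph), all agents move synchronously, two agents can communicate when they are on the same or adjacent vertices, the fleet must stay connected at every step, and a goal counts as visited when any agent stands on it after all earlier goals are done. The names `connected` and `plan` are hypothetical.

```python
from collections import deque
from itertools import product


def connected(positions, graph):
    """True if all agents form one communication group (assumed rule:
    agents on the same or adjacent vertices can talk to each other)."""
    seen, stack = {0}, [0]
    while stack:
        i = stack.pop()
        for j in range(len(positions)):
            linked = (positions[i] == positions[j]
                      or positions[j] in graph[positions[i]])
            if j not in seen and linked:
                seen.add(j)
                stack.append(j)
    return len(seen) == len(positions)


def plan(graph, starts, goals):
    """Breadth-first search over (joint positions, index of next goal).
    Returns the list of joint positions per time step, or None."""
    def advance(positions, k):
        # Mark goals as visited in order while some agent stands on the next one.
        while k < len(goals) and goals[k] in positions:
            k += 1
        return k

    start = tuple(starts)
    if not connected(start, graph):
        return None
    init = (start, advance(start, 0))
    parent = {init: None}
    queue = deque([init])
    while queue:
        state = queue.popleft()
        positions, k = state
        if k == len(goals):
            path = []
            while state is not None:
                path.append(state[0])
                state = parent[state]
            return path[::-1]
        # Each agent may wait in place or move to a neighboring vertex.
        options = [[p] + sorted(graph[p]) for p in positions]
        for nxt in product(*options):
            if not connected(nxt, graph):
                continue  # this joint move would break communication
            new_state = (nxt, advance(nxt, k))
            if new_state not in parent:
                parent[new_state] = state
                queue.append(new_state)
    return None


if __name__ == "__main__":
    # A line graph 0-1-2-3-4 with two agents that must stay within one hop.
    line = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
    for step, joint in enumerate(plan(line, (0, 1), [4])):
        print(step, joint)
```

The joint state space grows exponentially with the number of agents, which is why a heuristic planner like the one the paper proposes is needed beyond toy instances.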
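We welcome your suggestions in the comments, including but not limited to:
Point out shortcomings in the brief reviews in this post!
Share recent papers that are even more worth recommending, and tell us why!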