Paper Sharing | Recent Research Progress on Agents
OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis
Aviary: training language agents on challenging scientific tasks
Efficient Multi-Agent Collaboration with Tool Use for Online Planning in Complex Table Question Answering
Plancraft: an evaluation dataset for planning with LLM agents
Minimax-Optimal Multi-Agent Robust Reinforcement Learning
1. OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis
Authors: Qiushi Sun, Kanzhi Cheng, Zichen Ding, Chuanyang Jin, Yian Wang, Fangzhi Xu, Zhenyu Wu, Chengyou Jia, Liheng Chen, Zhoumianze Liu, Ben Kao, Guohao Li, Junxian He, Yu Qiao, Zhiyong Wu
Graphical User Interface (GUI) agents powered by Vision-Language Models (VLMs) have demonstrated human-like computer control capability. Despite their utility in advancing digital automation, a critical bottleneck persists: collecting high-quality trajectory data for training. Common practices for collecting such data rely on human supervision or synthetic data generation through executing pre-defined tasks, which are either resource-intensive or unable to guarantee data quality. Moreover, these methods suffer from limited data diversity and significant gaps between synthetic data and real-world environments. To address these challenges, we propose OS-Genesis, a novel GUI data synthesis pipeline that reverses the conventional trajectory collection process. Instead of relying on pre-defined tasks, OS-Genesis enables agents first to perceive environments and perform step-wise interactions, then retrospectively derive high-quality tasks to enable trajectory-level exploration. A trajectory reward model is then employed to ensure the quality of the generated trajectories. We demonstrate that training GUI agents with OS-Genesis significantly improves their performance on highly challenging online benchmarks. In-depth analysis further validates OS-Genesis's efficiency and its superior data quality and diversity compared to existing synthesis methods. Our codes, data, and checkpoints are available at the OS-Genesis Homepage.
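The "explore first, name the task afterwards" idea in this abstract can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not taken from the OS-Genesis code: the GUI is a hard-coded transition table, `derive_task` stands in for a model call that would describe what the interaction achieved, and `reward` is a hypothetical stand-in for the trajectory reward model.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Step:
    before: str  # screen observed before the action
    action: str  # interaction that was performed
    after: str   # screen observed after the action


# Hypothetical toy GUI: screen name -> list of (action, next screen).
ToyGui = Dict[str, List[Tuple[str, str]]]


def explore(gui: ToyGui, start: str, max_steps: int) -> List[Step]:
    """Task-free exploration: interact step by step and record the effects."""
    steps: List[Step] = []
    state = start
    for _ in range(max_steps):
        options = gui.get(state)
        if not options:
            break  # nothing left to click on this screen
        action, nxt = options[0]  # deterministic choice keeps the sketch reproducible
        steps.append(Step(state, action, nxt))
        state = nxt
    return steps


def derive_task(steps: List[Step]) -> str:
    """Retrospectively name a task from what was observed (stand-in for a model call)."""
    return f"Go from '{steps[0].before}' to '{steps[-1].after}'"


def reward(steps: List[Step]) -> float:
    """Hypothetical quality score: fraction of steps that changed the screen."""
    return sum(s.before != s.after for s in steps) / len(steps)


def synthesize(gui: ToyGui, start: str, max_steps: int, threshold: float) -> Optional[dict]:
    """Explore, derive a task in hindsight, and keep the pair only if it scores well."""
    steps = explore(gui, start, max_steps)
    if not steps or reward(steps) < threshold:
        return None
    return {"task": derive_task(steps), "trajectory": steps}


TOY_GUI: ToyGui = {
    "home": [("click Settings", "settings")],
    "settings": [("click Wi-Fi", "wifi")],
}
example = synthesize(TOY_GUI, "home", max_steps=5, threshold=0.5)
```

Here `example["task"]` is produced only after the interaction has happened, which is the reversal the abstract describes; a conventional pipeline would fix the task first and then try to execute it.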
2. Aviary: training language agents on challenging scientific tasks
Authors: Siddharth Narayanan, James D. Braza, Ryan-Rhys Griffiths, Manu Ponnapati, Albert Bou, Jon Laurent, Ori Kabeli, Geemi Wellawatte, Sam Cox, Samuel G. Rodriques, Andrew D. White
Solving complex real-world tasks requires cycles of actions and observations. This is particularly true in science, where tasks require many cycles of analysis, tool use, and experimentation. Language agents are promising for automating intellectual tasks in science because they can interact with tools via natural language or code. Yet their flexibility creates conceptual and practical challenges for software implementations, since agents may comprise non-standard components such as internal reasoning, planning, tool usage, as well as the inherent stochasticity of temperature-sampled language models. Here, we introduce Aviary, an extensible gymnasium for language agents. We formalize agents as policies solving language-grounded partially observable Markov decision processes, which we term language decision processes. We then implement five environments, including three challenging scientific environments: (1) manipulating DNA constructs for molecular cloning, (2) answering research questions by accessing scientific literature, and (3) engineering protein stability. These environments were selected for their focus on multi-step reasoning and their relevance to contemporary biology research. Finally, with online training and scaling inference-time compute, we show that language agents backed by open-source, non-frontier LLMs can match and exceed both frontier LLM agents and human experts on multiple tasks at up to 100x lower inference cost.
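To make the "agent as a policy over language observations" framing concrete, here is a minimal sketch of a gym-style loop in which both observations and actions are plain strings. This is not the Aviary API: the class `CountingEnv`, its `reset`/`step` methods, and the `rollout` helper are hypothetical names chosen for illustration.

```python
from typing import Callable, Tuple


class CountingEnv:
    """Toy environment: reach a target number by issuing text actions."""

    def __init__(self, target: int) -> None:
        self.target = target
        self.value = 0

    def _observe(self) -> str:
        # The agent only ever sees this text message, not the internal state object.
        return f"value={self.value} target={self.target}"

    def reset(self) -> str:
        self.value = 0
        return self._observe()

    def step(self, action: str) -> Tuple[str, float, bool]:
        """Apply a text action and return (observation, reward, done)."""
        if action == "increment":
            self.value += 1
        done = self.value == self.target
        return self._observe(), (1.0 if done else 0.0), done


def rollout(env: CountingEnv, policy: Callable[[str], str], max_steps: int) -> float:
    """Run one episode: the policy maps each observation string to an action string."""
    observation = env.reset()
    total = 0.0
    for _ in range(max_steps):
        observation, reward, done = env.step(policy(observation))
        total += reward
        if done:
            break
    return total


def always_increment(observation: str) -> str:
    """Trivial stand-in for a language agent."""
    return "increment"
```

In a real language agent the policy would wrap an LLM call, and its sampled output is what makes the rollout stochastic; the loop structure stays the same.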
3. Efficient Multi-Agent Collaboration with Tool Use for Online Planning in Complex Table Question Answering
Authors: Wei Zhou, Mohsen Mesgar, Annemarie Friedrich, Heike Adel
Complex table question answering (TQA) aims to answer questions that require complex reasoning, such as multi-step or multi-category reasoning, over data represented in tabular form. Previous approaches demonstrated notable performance by leveraging either closed-source large language models (LLMs) or fine-tuned open-weight LLMs. However, fine-tuning LLMs requires high-quality training data, which is costly to obtain, and utilizing closed-source LLMs poses accessibility challenges and leads to reproducibility issues. In this paper, we propose Multi-Agent Collaboration with Tool use (MACT), a framework that requires neither closed-source models nor fine-tuning. In MACT, a planning agent and a coding agent that also make use of tools collaborate to answer questions. Our experiments on four TQA benchmarks show that MACT outperforms previous state-of-the-art systems on three out of four benchmarks and that it performs comparably to the larger and more expensive closed-source model GPT-4 on two benchmarks, even when using only open-weight models without any fine-tuning. We conduct extensive analyses to prove the effectiveness of MACT's multi-agent collaboration in TQA.
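This paper proposes a framework called MACT for complex table question answering (TQA). The framework combines multi-agent collaboration with tool use while avoiding any reliance on closed-source models or fine-tuning. It consists of a planning agent and a coding agent and can handle multi-step and multi-category reasoning tasks. The authors claim that MACT outperforms previous state-of-the-art systems on several benchmarks and offers a more efficient solution for TQA.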
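The division of labor between a planning agent and a coding agent can be sketched as a simple loop. This is a toy illustration under stated assumptions, not MACT's implementation: `planner` and `coder` are rule-based stand-ins for what would be LLM calls, and the plan steps and table columns are invented for the example.

```python
from typing import Callable, Dict, List, Optional

Row = Dict[str, int]


def planner(question: str, history: List[str]) -> Optional[str]:
    """Stand-in planning agent: propose the next step, or None when finished."""
    plan = ["filter year >= 2020", "sum revenue"]
    return plan[len(history)] if len(history) < len(plan) else None


def coder(step: str) -> Callable:
    """Stand-in coding agent: turn a plan step into an executable tool."""
    if step == "filter year >= 2020":
        return lambda rows: [r for r in rows if r["year"] >= 2020]
    if step == "sum revenue":
        return lambda rows: sum(r["revenue"] for r in rows)
    raise ValueError(f"no tool for step: {step}")


def answer(question: str, table: List[Row]):
    """Alternate planning and execution until the planner has no further step."""
    state = table
    history: List[str] = []
    while True:
        step = planner(question, history)
        if step is None:
            return state
        state = coder(step)(state)  # execute the generated tool on the current state
        history.append(step)


TABLE = [
    {"year": 2019, "revenue": 10},
    {"year": 2020, "revenue": 20},
    {"year": 2021, "revenue": 30},
]
result = answer("What is the total revenue since 2020?", TABLE)
```

Because the planner sees the history before choosing each step, it can in principle adapt to intermediate results, which is what "online planning" in the paper title refers to.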
4. Plancraft: an evaluation dataset for planning with LLM agents
Authors: Gautier Dagan, Frank Keller, Alex Lascarides
We present Plancraft, a multi-modal evaluation dataset for LLM agents. Plancraft has both a text-only and multi-modal interface, based on the Minecraft crafting GUI. We include the Minecraft Wiki to evaluate tool use and Retrieval Augmented Generation (RAG), as well as an oracle planner and oracle RAG information extractor, to ablate the different components of a modern agent architecture. To evaluate decision-making, Plancraft also includes a subset of examples that are intentionally unsolvable, providing a realistic challenge that requires the agent not only to complete tasks but also to decide whether they are solvable at all. We benchmark both open-source and closed-source LLMs and strategies on our task and compare their performance to a handcrafted planner. We find that LLMs and VLMs struggle with the planning problems that Plancraft introduces, and we offer suggestions on how to improve their capabilities.
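One way to see why intentionally unsolvable examples matter is to write down a scoring rule that rewards recognizing them. The rule below is a hypothetical illustration and not Plancraft's actual metric; the `Episode` fields are assumed names.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Episode:
    solvable: bool             # ground truth: can the target item be crafted?
    completed: bool            # did the agent finish the task?
    declared_impossible: bool  # did the agent give up and call it unsolvable?


def is_correct(ep: Episode) -> bool:
    """Solvable tasks must be completed; unsolvable ones must be flagged."""
    if ep.solvable:
        return ep.completed and not ep.declared_impossible
    return ep.declared_impossible


def accuracy(episodes: List[Episode]) -> float:
    """Fraction of episodes handled correctly under the rule above."""
    if not episodes:
        return 0.0
    return sum(is_correct(ep) for ep in episodes) / len(episodes)


episodes = [
    Episode(solvable=True, completed=True, declared_impossible=False),    # correct
    Episode(solvable=True, completed=False, declared_impossible=True),    # gave up too early
    Episode(solvable=False, completed=False, declared_impossible=True),   # correct
    Episode(solvable=False, completed=False, declared_impossible=False),  # never noticed
]
score = accuracy(episodes)
```

Under such a rule, an agent that never gives up and an agent that always gives up are both penalized, so the benchmark measures judgment as well as execution.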
5. Minimax-Optimal Multi-Agent Robust Reinforcement Learning
Authors: Yuchen Jiao, Gen Li
Multi-agent robust reinforcement learning, also known as multi-player robust Markov games (RMGs), is a crucial framework for modeling competitive interactions under environmental uncertainties, with wide applications in multi-agent systems. However, existing results on sample complexity in RMGs suffer from at least one of three obstacles: restrictive range of uncertainty level or accuracy, the curse of multiple agents, and the barrier of long horizons, all of which cause existing results to significantly exceed the information-theoretic lower bound. To close this gap, we extend the Q-FTRL algorithm to the RMGs in finite-horizon setting, assuming access to a generative model. We prove that the proposed algorithm achieves an $\varepsilon$-robust coarse correlated equilibrium (CCE) with a sample complexity (up to log factors) of $\widetilde{O}\left(H^3 S \sum_{i=1}^{m} A_i \min\left\{\rho^{-1}, H\right\} / \varepsilon^2\right)$, where $S$ denotes the number of states, $A_i$ is the number of actions of the $i$-th agent, $H$ is the finite horizon length, and $\rho$ is uncertainty level. We also show that this sample complexity is minimax optimal by combining an information-theoretic lower bound. Additionally, in the special case of two-player zero-sum RMGs, the algorithm achieves an $\varepsilon$-robust Nash equilibrium (NE) with the same sample complexity.
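The "curse of multiple agents" named in the abstract is commonly understood as a sample complexity that scales with the size of the joint action space, that is, the product of the per-agent action counts, rather than with their sum. The short sketch below only compares those two quantities; it assumes that reading of the term and does not compute any bound from the paper. The function names are chosen for illustration.

```python
from math import prod
from typing import Sequence


def joint_action_count(actions_per_agent: Sequence[int]) -> int:
    """Size of the joint action space: grows exponentially with the number of agents."""
    return prod(actions_per_agent)


def summed_action_count(actions_per_agent: Sequence[int]) -> int:
    """Sum of per-agent action counts: grows linearly with the number of agents."""
    return sum(actions_per_agent)


# Five agents with ten actions each.
sizes = [10] * 5
joint = joint_action_count(sizes)    # 100000
summed = summed_action_count(sizes)  # 50
```

With five agents the gap is already a factor of 2000, which is why a dependence on the sum rather than the product is the property that matters as the number of agents grows.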
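Feel free to point out shortcomings in this post's paper commentary, or to share recent papers you think are more worth recommending, along with your reasons!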