Paper Roundup | Recent Advances in Agent Research
From the 33 papers published between 2024-12-26 and 2025-01-01, we have selected 5 outstanding works to share with our readers.
1. OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis
2. Aviary: training language agents on challenging scientific tasks
3. Efficient Multi-Agent Collaboration with Tool Use for Online Planning in Complex Table Question Answering
4. Plancraft: an evaluation dataset for planning with LLM agents
5. Minimax-Optimal Multi-Agent Robust Reinforcement Learning
1. OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis
Authors: Qiushi Sun, Kanzhi Cheng, Zichen Ding, Chuanyang Jin, Yian Wang, Fangzhi Xu, Zhenyu Wu, Chengyou Jia, Liheng Chen, Zhoumianze Liu, Ben Kao, Guohao Li, Junxian He, Yu Qiao, Zhiyong Wu
https://arxiv.org/abs/2412.19723
Abstract
Graphical User Interface (GUI) agents powered by Vision-Language Models (VLMs) have demonstrated human-like computer control capability. Despite their utility in advancing digital automation, a critical bottleneck persists: collecting high-quality trajectory data for training. Common practices for collecting such data rely on human supervision or synthetic data generation through executing pre-defined tasks, which are either resource-intensive or unable to guarantee data quality. Moreover, these methods suffer from limited data diversity and significant gaps between synthetic data and real-world environments. To address these challenges, we propose OS-Genesis, a novel GUI data synthesis pipeline that reverses the conventional trajectory collection process. Instead of relying on pre-defined tasks, OS-Genesis enables agents first to perceive environments and perform step-wise interactions, then retrospectively derive high-quality tasks to enable trajectory-level exploration. A trajectory reward model is then employed to ensure the quality of the generated trajectories. We demonstrate that training GUI agents with OS-Genesis significantly improves their performance on highly challenging online benchmarks. In-depth analysis further validates OS-Genesis's efficiency and its superior data quality and diversity compared to existing synthesis methods. Our codes, data, and checkpoints are available at OS-Genesis Homepage.
Brief Review
OS-Genesis is a novel method for GUI agents that automatically synthesizes high-quality, diverse trajectory datasets. The work proposes a reverse task synthesis process: agents first explore the environment freely, and meaningful tasks are then derived retrospectively from the observed interactions, which improves trajectory quality. Experiments show that the approach outperforms existing methods on challenging benchmarks, a potentially significant step forward for GUI automation.
The paper targets the central challenge of collecting training data for GUI agents and presents a comprehensive methodology combining interaction-driven exploration with reverse task synthesis. Detailed experimental results confirm the effectiveness of the approach and its performance gains, providing a valuable reference for future research.
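To make the reversed pipeline concrete, here is a minimal Python sketch of the explore-then-derive loop as we read it from the abstract. All names (`explore`, `propose_action`, `summarize_as_task`, `score`, the 0.8 threshold) are hypothetical illustrations, not the authors' actual code or API.

```python
from dataclasses import dataclass, field

@dataclass
class Trajectory:
    steps: list = field(default_factory=list)  # (observation, action) pairs
    task: str = ""                             # derived retrospectively
    reward: float = 0.0                        # trajectory reward model score

def explore(env, vlm, max_steps=10):
    """Interaction-driven exploration: act first, with no pre-defined task."""
    traj, obs = Trajectory(), env.reset()
    for _ in range(max_steps):
        action = vlm.propose_action(obs)       # step-wise interaction
        traj.steps.append((obs, action))
        obs = env.step(action)
    return traj

def reverse_task_synthesis(traj, vlm):
    """Retrospectively derive a meaningful task from the observed interactions."""
    traj.task = vlm.summarize_as_task(traj.steps)
    return traj

def build_dataset(env, vlm, reward_model, n=1000, threshold=0.8):
    """Keep only trajectories that the trajectory reward model rates highly."""
    dataset = []
    for _ in range(n):
        traj = reverse_task_synthesis(explore(env, vlm), vlm)
        traj.reward = reward_model.score(traj)
        if traj.reward >= threshold:           # quality filter
            dataset.append(traj)
    return dataset
```

The inversion is the point: because the task label is produced after the interactions, task diversity is bounded by what the agent encounters in the environment rather than by a hand-written task list.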
2. Aviary: training language agents on challenging scientific tasks
Authors: Siddharth Narayanan, James D. Braza, Ryan-Rhys Griffiths, Manu Ponnapati, Albert Bou, Jon Laurent, Ori Kabeli, Geemi Wellawatte, Sam Cox, Samuel G. Rodriques, Andrew D. White
https://arxiv.org/abs/2412.21154
Abstract
Solving complex real-world tasks requires cycles of actions and observations. This is particularly true in science, where tasks require many cycles of analysis, tool use, and experimentation. Language agents are promising for automating intellectual tasks in science because they can interact with tools via natural language or code. Yet their flexibility creates conceptual and practical challenges for software implementations, since agents may comprise non-standard components such as internal reasoning, planning, tool usage, as well as the inherent stochasticity of temperature-sampled language models. Here, we introduce Aviary, an extensible gymnasium for language agents. We formalize agents as policies solving language-grounded partially observable Markov decision processes, which we term language decision processes. We then implement five environments, including three challenging scientific environments: (1) manipulating DNA constructs for molecular cloning, (2) answering research questions by accessing scientific literature, and (3) engineering protein stability. These environments were selected for their focus on multi-step reasoning and their relevance to contemporary biology research. Finally, with online training and scaling inference-time compute, we show that language agents backed by open-source, non-frontier LLMs can match and exceed both frontier LLM agents and human experts on multiple tasks at up to 100x lower inference cost.
Brief Review
This paper introduces Aviary, a framework for training language agents on complex scientific tasks by formalizing them as language decision processes (LDPs). It implements five environments, including ones focused on molecular cloning and protein stability, and shows that smaller open-source models can reach strong performance at much lower inference cost.
The key strengths of the paper are its novelty and its focus on practical solutions to the challenge of applying large language models to scientific tasks. By implementing concrete environments, especially ones relevant to contemporary biology research, it emphasizes real-world applicability. The paper also reports concrete results showing that small open-source models can compete with large frontier models on performance, pointing toward cost-effective alternatives.
Overall, the paper is an interesting exploration of how small open-source models can tackle complex scientific problems. Its main contribution is a new perspective: the cost of scientific research can be lowered without sacrificing performance. This finding matters for scientific progress and deserves further study and application.
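The formalization of agents as policies over language-grounded POMDPs suggests a gymnasium-style interface. The sketch below is a conceptual guess at such an interface; `LanguageEnv`, `rollout`, and the method signatures are illustrative, not Aviary's actual API.

```python
from abc import ABC, abstractmethod

class LanguageEnv(ABC):
    """A language decision process: observations and actions are text."""

    @abstractmethod
    def reset(self) -> str:
        """Return the initial observation, e.g. the task description."""

    @abstractmethod
    def step(self, action: str) -> tuple[str, float, bool]:
        """Apply a text action (tool call or answer); return (obs, reward, done)."""

def rollout(env: LanguageEnv, policy, max_turns: int = 20) -> float:
    """Run one episode; `policy` maps the dialogue history to the next action."""
    history = [env.reset()]
    total_reward = 0.0
    for _ in range(max_turns):
        action = policy(history)        # e.g. a temperature-sampled LLM call
        obs, reward, done = env.step(action)
        history += [action, obs]
        total_reward += reward
        if done:
            break
    return total_reward
```

Exposing a reward at the environment boundary is what makes online training, and inference-time scaling by sampling multiple rollouts, possible in this framing.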
3. Efficient Multi-Agent Collaboration with Tool Use for Online Planning in Complex Table Question Answering
Authors: Wei Zhou, Mohsen Mesgar, Annemarie Friedrich, Heike Adel
https://arxiv.org/abs/2412.20145
Abstract
Complex table question answering (TQA) aims to answer questions that require complex reasoning, such as multi-step or multi-category reasoning, over data represented in tabular form. Previous approaches demonstrated notable performance by leveraging either closed-source large language models (LLMs) or fine-tuned open-weight LLMs. However, fine-tuning LLMs requires high-quality training data, which is costly to obtain, and utilizing closed-source LLMs poses accessibility challenges and leads to reproducibility issues. In this paper, we propose Multi-Agent Collaboration with Tool use (MACT), a framework that requires neither closed-source models nor fine-tuning. In MACT, a planning agent and a coding agent that also make use of tools collaborate to answer questions. Our experiments on four TQA benchmarks show that MACT outperforms previous state-of-the-art systems on three out of four benchmarks and that it performs comparably to the larger and more expensive closed-source model GPT-4 on two benchmarks, even when using only open-weight models without any fine-tuning. We conduct extensive analyses to prove the effectiveness of MACT's multi-agent collaboration in TQA.
Brief Review
This paper proposes MACT, a framework for complex table question answering (TQA) that relies on multi-agent collaboration and tool use while avoiding both closed-source models and fine-tuning. The framework pairs a planning agent with a coding agent and handles multi-step, multi-category reasoning effectively. The authors report that MACT outperforms previous state-of-the-art systems on several benchmarks, offering a more accessible TQA solution.
Key points: 1) the multi-agent collaboration approach is novel in the table QA setting; 2) requiring neither fine-tuning nor closed-source models improves accessibility and reproducibility; 3) experiments show MACT performs well across multiple TQA benchmarks, indicating practical value; 4) the experiments also verify that the collaborating agents support each other effectively.
In summary, the paper gives readers a general introduction to the MACT framework, its background, implementation, and prospects. Although this review does not quote specific experimental numbers, the overview conveys the framework's strengths and potential, and the paper is a useful reference for researchers tracking current trends in the field.
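The division of labor between the two agents can be sketched as follows. This is a hedged illustration of a planner/coder loop with a Python interpreter as the tool; the prompts, the `FINISH:` convention, and the helper names are placeholders rather than the protocol specified in the paper, and real use would sandbox the `exec` call.

```python
import pandas as pd

def plan_step(llm, question: str, table: pd.DataFrame, scratchpad: list) -> str:
    """Planning agent: propose the next step, or 'FINISH: <answer>' when done."""
    prompt = (f"Table columns: {list(table.columns)}\n"
              f"Question: {question}\nSteps so far: {scratchpad}\n"
              "Next step (or FINISH: <answer>):")
    return llm(prompt)

def code_step(llm, step: str, table: pd.DataFrame) -> str:
    """Coding agent: write and run pandas code for one step (tool use)."""
    code = llm(f"Write Python over the DataFrame `df` to: {step}\n"
               "Assign the output to `result`.")  # assumes plain-Python replies
    scope = {"df": table}
    exec(code, scope)                             # unsandboxed: sketch only
    return str(scope.get("result"))

def answer(llm, question: str, table: pd.DataFrame, max_steps: int = 5) -> str:
    """Alternate planning and coding until the planner emits a final answer."""
    scratchpad = []
    for _ in range(max_steps):
        step = plan_step(llm, question, table, scratchpad)
        if step.startswith("FINISH:"):
            return step.removeprefix("FINISH:").strip()
        scratchpad.append((step, code_step(llm, step, table)))
    return ""  # give up after the step budget
```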
4. Plancraft: an evaluation dataset for planning with LLM agents
Authors: Gautier Dagan, Frank Keller, Alex Lascarides
https://arxiv.org/abs/2412.21033
Abstract
We present Plancraft, a multi-modal evaluation dataset for LLM agents. Plancraft has both a text-only and multi-modal interface, based on the Minecraft crafting GUI. We include the Minecraft Wiki to evaluate tool use and Retrieval Augmented Generation (RAG), as well as an oracle planner and oracle RAG information extractor, to ablate the different components of a modern agent architecture. To evaluate decision-making, Plancraft also includes a subset of examples that are intentionally unsolvable, providing a realistic challenge that requires the agent not only to complete tasks but also to decide whether they are solvable at all. We benchmark both open-source and closed-source LLMs and strategies on our task and compare their performance to a handcrafted planner. We find that LLMs and VLMs struggle with the planning problems that Plancraft introduces, and we offer suggestions on how to improve their capabilities.
Brief Review
This paper presents Plancraft, a multi-modal evaluation dataset that probes the crafting abilities of language models (LLMs) in the Minecraft crafting GUI by posing both solvable and intentionally unsolvable tasks. The study examines not only task completion but places particular emphasis on an agent's ability to recognize that a task is unsolvable. Using a well-known environment like Minecraft also strengthens the evaluation's practical relevance, and the paper offers a thorough comparative analysis of the challenges facing current models. Overall, Plancraft provides a comprehensive framework for evaluating the planning and crafting abilities of language models, which should be valuable for future research and development.
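Because some instances are unsolvable by design, scoring must credit a correct refusal as well as a correct completion. Below is a minimal sketch of that evaluation logic, with illustrative field names and outcome labels rather than the dataset's actual schema.

```python
from dataclasses import dataclass

@dataclass
class Example:
    task: str
    solvable: bool   # some examples are intentionally unsolvable

def evaluate(agent, examples: list[Example]) -> float:
    """Fraction handled correctly: solve the solvable, refuse the unsolvable."""
    correct = 0
    for ex in examples:
        outcome = agent.run(ex.task)  # assumed: "success", "impossible", or "fail"
        if ex.solvable and outcome == "success":
            correct += 1              # completed a solvable task
        elif not ex.solvable and outcome == "impossible":
            correct += 1              # correctly declared an unsolvable one
    return correct / len(examples)
```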
5. Minimax-Optimal Multi-Agent Robust Reinforcement Learning
Authors: Yuchen Jiao, Gen Li
https://arxiv.org/abs/2412.19873
Abstract
Multi-agent robust reinforcement learning, also known as multi-player robust Markov games (RMGs), is a crucial framework for modeling competitive interactions under environmental uncertainties, with wide applications in multi-agent systems. However, existing results on sample complexity in RMGs suffer from at least one of three obstacles: restrictive range of uncertainty level or accuracy, the curse of multiple agents, and the barrier of long horizons, all of which cause existing results to significantly exceed the information-theoretic lower bound. To close this gap, we extend the Q-FTRL algorithm to RMGs in the finite-horizon setting, assuming access to a generative model. We prove that the proposed algorithm achieves an ε-robust coarse correlated equilibrium (CCE) with a sample complexity (up to log factors) of $\widetilde{O}\!\left(H^{3} S \sum_{i=1}^{m} A_i \min\{H, 1/R\} / \varepsilon^{2}\right)$, where $S$ denotes the number of states, $A_i$ is the number of actions of the $i$-th agent, $H$ is the finite horizon length, and $R$ is the uncertainty level. We also show that this sample complexity is minimax optimal by combining an information-theoretic lower bound. Additionally, in the special case of two-player zero-sum RMGs, the algorithm achieves an ε-robust Nash equilibrium (NE) with the same sample complexity.
Brief Review
This paper studies multi-agent robust reinforcement learning (robust Markov games, RMGs) and extends the Q-FTRL algorithm to the finite-horizon robust setting with the goal of minimizing sample complexity, addressing major challenges in multi-agent RL around adversarial behavior, robustness, and efficiency. By analyzing both ε-robust coarse correlated equilibria (CCE) and, in the two-player zero-sum case, Nash equilibria (NE), the authors prove that the new method attains optimal sample complexity, giving RMGs solid theoretical support. They also analyze the algorithm's performance in detail, establishing matching upper and lower bounds, which further strengthens the credibility and practical relevance of the result. Overall, the paper contributes an innovative algorithm and an important reference for future work.
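For reference, the bound from the abstract can be written out in display form; the reading of the $\min$ factor in the comments is our interpretation of the formula, not a statement quoted from the paper.

```latex
% Sample complexity of the extended Q-FTRL algorithm (up to log factors),
% where N counts draws from the generative model:
\[
  N \;=\; \widetilde{O}\!\left(
    \frac{H^{3}\, S \,\sum_{i=1}^{m} A_i \,\min\{H,\; 1/R\}}{\varepsilon^{2}}
  \right)
\]
% Reading the min factor: for small uncertainty (R <= 1/H) it equals H and the
% bound scales as H^4, while for larger R it equals 1/R and the horizon
% dependence weakens -- robustness need not make the problem harder to sample.
```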
We welcome your valuable suggestions in the comments, including but not limited to:
Pointing out shortcomings in this post's paper reviews!
Sharing recent papers you find even more worth recommending, with your reasons!
END