Building Agentic Applications with PydanticAI: A New Era of Evaluation-Driven Development

Digest 2024-12-22 14:12 Anhui

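Ideally, we would evaluate agentic applications while we develop them, not as an afterthought. To do that, we need to be able to mock both the internal and the external dependencies of the agent under development. I am excited about PydanticAI because it supports dependency injection from the ground up. It is the first framework that has allowed me to build agentic applications in an evaluation-driven way.

In this article, we discuss the core challenges of developing even a simple agent, and show how to use PydanticAI to develop it in an evaluation-driven way.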

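Challenges of developing GenAI applications

Most GenAI (generative AI) developers have been waiting for an agent framework that supports the full development lifecycle. We have all hoped to find the right one among DSPy, Langchain, LangGraph and Autogen.

However, software developers who build LLM-based applications usually run into a few core challenges. These are not blockers for a simple proof of concept (PoC), but they surface and become constraints once we build LLM-powered applications for production.

What are these challenges?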

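Non-determinism: Unlike most software APIs, a call to an LLM can return a different output each time, even when the input is exactly the same. How do you test an application like that?
LLM limitations: Foundation models such as GPT-4, Claude and Gemini are limited by their training data (for example, no access to confidential enterprise information) and by their capabilities (for example, they cannot invoke enterprise APIs or databases), and they cannot plan or reason.
LLM flexibility: Even if you decide to stick with the LLMs of a single vendor, such as Anthropic, you may find that each step needs a different LLM. One step may need a low-latency small language model (such as Haiku), another may need strong code generation (such as Sonnet), and a third may need excellent contextual understanding (such as Opus).
Rate of change: GenAI technology moves fast. Much of the recent progress has been in foundation model capabilities. Foundation models no longer just generate text from a user prompt. They are now multimodal, can generate structured output, and can have memory. Yet if you try to build your application in an LLM-agnostic way, you often lose access to the low-level APIs that would enable these new features.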

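To help with the first problem, non-determinism, our software testing needs to incorporate an evaluation framework. We will never have software that works 100% of the time. Instead, we need to design a system that works most of the time and still copes with the cases where it fails. We need guardrails and human oversight to catch the exceptions, and we need to monitor the system in real time to detect regressions. The key to this capability is evaluation-driven development (EDD), a term coined by Lak Lakshmanan, which extends test-driven development (TDD) in software.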
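One way to make "works most of the time" concrete is to score an agent by its pass rate over repeated runs instead of by a single pass/fail test. The following is a minimal sketch under stated assumptions: `fake_llm_answer` is a hypothetical stand-in for a real LLM call, the acceptance check is a simple substring match, and the 0.5 threshold is an arbitrary illustration, not a recommended value.

```python
import random
from typing import Callable


def pass_rate(run_once: Callable[[], str],
              is_acceptable: Callable[[str], bool],
              trials: int = 20) -> float:
    """Run a non-deterministic function several times and return the
    fraction of outputs that pass the acceptance check."""
    passed = sum(1 for _ in range(trials) if is_acceptable(run_once()))
    return passed / trials


# Hypothetical stand-in for an LLM call: the answer varies between runs.
_rng = random.Random(42)  # seeded only so that the sketch is repeatable


def fake_llm_answer() -> str:
    return _rng.choice([
        "Mount Robson is the tallest.",
        "The tallest is Mount Robson, at 3,954 m.",
        "I am not sure.",
    ])


rate = pass_rate(fake_llm_answer, lambda text: "Robson" in text, trials=50)
print(f"pass rate: {rate:.2f}")

# The bar is a product decision (assumed here), not a property of the code.
meets_bar = rate >= 0.5
```

Tracking this number over time, instead of asserting on a single run, is what lets a regression show up as a drop in the rate.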

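The current answer to the LLM limitations in challenge 2 is to use agentic architectures such as RAG, to give the LLM access to tools, and to adopt patterns such as Reflection, ReAct and Chain of Thought. So our framework needs to be able to orchestrate agents. However, evaluating agents that can call external tools is hard. We need to be able to inject stand-ins (proxies) for these external dependencies, so that we can test each part in isolation and evaluate as we build.

To address challenge 3, an agent needs to be able to call different kinds of foundation models. Our agent framework needs to be LLM-agnostic at the level of a single step of the agentic workflow. To cope with the rate of change (challenge 4), we need to keep low-level access to the foundation model APIs, and be able to strip out code that is no longer needed.

So, is there a framework that meets all these requirements? For a long time, the answer was no. The closest we could get was to combine Langchain, pytest's dependency injection features and deepeval, in a structure like this: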

from unittest.mock import patch, Mock

from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

import lg_weather_agent  # the agent module; app and chicago_weather come from the full example

# An LLM acts as the judge of whether the answer is correct
llm_as_judge = GEval(
    name="Correctness",
    criteria="Determine whether the actual output is factually correct based on the expected output.",
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
    model='gpt-3.5-turbo'
)

# Replace the external weather lookup with a hardcoded object
@patch('lg_weather_agent.retrieve_weather_data', Mock(return_value=chicago_weather))
def eval_query_rain_today():
    input_query = "Is it raining in Chicago?"
    expected_output = "No, it is not raining in Chicago right now."
    result = lg_weather_agent.run_query(app, input_query)
    actual_output = result[-1]

    print(f"Actual: {actual_output}   Expected: {expected_output}")
    test_case = LLMTestCase(
        input=input_query,
        actual_output=actual_output,
        expected_output=expected_output
    )

    llm_as_judge.measure(test_case)
    print(llm_as_judge.score)

The full example is here:

https://github.com/lakshmanok/lakblogs/blob/main/genai_agents/eval_weather_agent.py

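Essentially, we have to build a Mock object (chicago_weather in the example above) for every call we want to stub out, and patch that call (retrieve_weather_data in the example above) with the hardcoded object whenever we need to mock that part of the agentic workflow. The dependency injection is scattered all over the place, we need a large number of hardcoded objects, and the calling workflow becomes extremely hard to follow. Note that without dependency injection there is no way to test a function like this at all: the external service returns the current weather, so there is no way to know the correct answer to a question such as whether it is raining right now.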
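For contrast with patching, here is a minimal sketch of dependency injection in plain Python, where the dependency is passed in explicitly instead of being patched at import paths. Everything in it is hypothetical: `WeatherService`, `FakeWeatherService`, `WeatherAgent` and the `precipitation_mm` field are names made up for illustration, and a simple rule stands in for the LLM step.

```python
from dataclasses import dataclass
from typing import Protocol


class WeatherService(Protocol):
    """Anything that can report current conditions for a city."""

    def current_conditions(self, city: str) -> dict: ...


@dataclass
class FakeWeatherService:
    """Returns the same canned conditions for every city."""
    canned: dict

    def current_conditions(self, city: str) -> dict:
        return self.canned


class WeatherAgent:
    """Receives its weather dependency through the constructor."""

    def __init__(self, weather: WeatherService):
        self.weather = weather

    def is_raining(self, city: str) -> str:
        data = self.weather.current_conditions(city)
        # A plain rule stands in for the LLM step in this sketch.
        if data["precipitation_mm"] > 0:
            return f"Yes, it is raining in {city} right now."
        return f"No, it is not raining in {city} right now."


# With canned data injected, the expected answer is known in advance.
agent = WeatherAgent(FakeWeatherService({"precipitation_mm": 0.0}))
print(agent.is_raining("Chicago"))
```

Because the fake is handed over at construction time, the test does not need to know the module path of the function being replaced, which is the part that makes patched workflows hard to trace.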

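So... is there an agent framework that supports dependency injection, is Pythonic, provides low-level access to LLMs, is model-agnostic, supports evaluation-driven development one step at a time, and is easy to use and follow?

Almost. PydanticAI meets the requirements raised by the first three challenges. The fourth (low-level LLM access) is not fully possible yet, but the design does not rule it out. In the rest of this article, I will show you how to use it to develop an agentic application in an evaluation-driven way.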

1. Your first PydanticAI application

Let's start by building a simple PydanticAI application. It uses an LLM to answer questions about mountains:

agent = llm_utils.agent()
question = "What is the tallest mountain in British Columbia?"
print(">> ", question)
answer = agent.run_sync(question)
print(answer.data)

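In the code above, I create an agent (I will show how shortly) and then call run_sync, passing in the user prompt, to get the LLM's response. run_sync makes the agent call the LLM and wait for the response. There are other options, such as running the query asynchronously or streaming the response. (The full code is available if you want to follow along.)

Running the code above gives you something like this: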

>>  What is the tallest mountain in British Columbia?
The tallest mountain in British Columbia is **Mount Robson**, at 3,954 metres (12,972 feet).

To create an agent, first create a model, and then tell the agent to use that model for all of its steps.

import os

import pydantic_ai
from pydantic_ai.models.gemini import GeminiModel

def default_model() -> pydantic_ai.models.Model:
    # A cheap, fast model that is used unless a step asks for another one
    model = GeminiModel('gemini-1.5-flash', api_key=os.getenv('GOOGLE_API_KEY'))
    return model

def agent() -> pydantic_ai.Agent:
    return pydantic_ai.Agent(default_model())

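The idea behind default_model() is to use a relatively inexpensive but fast model, such as Gemini Flash, as the default. You can then use a different model for a specific step where needed, by passing that model to run_sync().

PydanticAI's model support looks limited, but the most commonly used models are covered: the current frontier models from OpenAI, Groq, Gemini, Mistral, Ollama and Anthropic are all supported. Through Ollama, you get access to models such as Llama3, Starcoder2, Gemma2 and Phi3.

2. Structured output with Pydantic

The example in the previous section returned free-form text. In most agentic workflows, you will want the LLM to return structured data so that you can use it directly in your program.

Since this API comes from Pydantic, returning structured output is quite easy. Just define the desired output as a dataclass: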

from dataclasses import dataclass

@dataclass
class Mountain:
    name: str
    location: str
    height: float

The full code is here:

https://github.com/lakshmanok/lakblogs/blob/main/pydantic_ai_mountains/2_zero_shot_structured.py

When you create the agent, tell it the desired output type:

agent = Agent(llm_utils.default_model(),
              result_type=Mountain,
              system_prompt=(
                  "You are a mountaineering guide, who provides accurate information to the general public.",
                  "Provide all distances and heights in meters",
                  "Provide location as distance and direction from nearest big city",
              ))

Also note how the system prompt is used to specify details such as units.

Running it on three questions, we get:

>>  Tell me about the tallest mountain in British Columbia?
Mountain(name='Mount Robson', location='130km North of Vancouver', height=3999.0)
>>  Is Mt. Hood easy to climb?
Mountain(name='Mt. Hood', location='60 km east of Portland', height=3429.0)
>>  What's the tallest peak in the Enchantments?
Mountain(name='Mount Stuart', location='100 km east of Seattle', height=3000.0)

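But how reliable is this agent? Is the height of Mount Robson correct? Is Mount Stuart really the tallest peak in the Enchantments? All of this information could have been hallucinated!

You cannot tell how good an agentic application is by eyeballing its output or by trusting your intuition. You have to evaluate the agent against reference answers. Unfortunately, this is where many LLM frameworks fall short: they make it hard to evaluate as you develop the LLM application.

3. Evaluating against reference answers

PydanticAI's strengths start to show once you begin evaluating against reference answers. Because everything is quite Pythonic, you can build custom evaluation metrics very simply.

For example, here is how we evaluate a returned Mountain object on three criteria and combine them into a composite score: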

from typing import Tuple

def evaluate(answer: Mountain, reference_answer: Mountain) -> Tuple[float, str]:
    score = 0
    reason = []
    if reference_answer.name in answer.name:
        # 0.5 for naming the right mountain
        score += 0.5
        reason.append("Correct mountain identified")
        # 0.25 for naming the right nearby city
        if reference_answer.location in answer.location:
            score += 0.25
            reason.append("Correct city identified")
        # Up to 0.25 for the height, scaled linearly to zero at 10 m of error
        height_error = abs(reference_answer.height - answer.height)
        if height_error < 10:
            score += 0.25 * (10 - height_error)/10.0
        reason.append(f"Height was {height_error}m off. Correct answer is {reference_answer.height}")
    else:
        reason.append(f"Wrong mountain identified. Correct answer is {reference_answer.name}")

    return score, ';'.join(reason)

The full code is here:

https://github.com/lakshmanok/lakblogs/blob/main/pydantic_ai_mountains/3_eval_against_reference.py
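The height term of the evaluate function above is the only part that awards partial credit, so it is worth working through in isolation. The sketch below pulls that term out into a hypothetical helper, `height_credit`, whose parameter names and defaults are assumptions chosen to mirror the constants in evaluate (0.25 maximum credit, 10 m tolerance).

```python
def height_credit(reference_height: float, answer_height: float,
                  max_credit: float = 0.25, tolerance_m: float = 10.0) -> float:
    """Linear partial credit: full credit for an exact height,
    falling to zero once the error reaches the tolerance."""
    error = abs(reference_height - answer_height)
    if error >= tolerance_m:
        return 0.0
    return max_credit * (tolerance_m - error) / tolerance_m


# Mount Robson's reference height is 3954 m.
for answer in (3954.0, 3950.0, 3999.0):
    print(answer, height_credit(3954.0, answer))
```

An exact height earns the full 0.25, a 4 m miss earns 0.15, and a 45 m miss earns nothing, so the most a wrong-height answer can reach is 0.75 when both the mountain and the city are right.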

Now we can run this evaluation on a set of questions and reference answers:

questions = [
    "Tell me about the tallest mountain in British Columbia?",
    "Is Mt. Hood easy to climb?",
    "What's the tallest peak in the Enchantments?"
]

reference_answers = [
    Mountain("Robson", "Vancouver", 3954),
    Mountain("Hood", "Portland", 3429),
    Mountain("Dragontail", "Seattle", 2690)
]

total_score = 0
for l_question, l_reference_answer in zip(questions, reference_answers):
    print(">> ", l_question)
    l_answer = agent.run_sync(l_question)
    print(l_answer.data)
    l_score, l_reason = evaluate(l_answer.data, l_reference_answer)
    print(l_score, ":", l_reason)
    total_score += l_score

avg_score = total_score / len(questions)

Running this, we get:

>>  Tell me about the tallest mountain in British Columbia?
Mountain(name='Mount Robson', location='130 km North-East of Vancouver', height=3999.0)
0.75 : Correct mountain identified;Correct city identified;Height was 45.0m off. Correct answer is 3954
>>  Is Mt. Hood easy to climb?
Mountain(name='Mt. Hood', location='60 km east of Portland, OR', height=3429.0)
1.0 : Correct mountain identified;Correct city identified;Height was 0.0m off. Correct answer is 3429
>>  What's the tallest peak in the Enchantments?
Mountain(name='Dragontail Peak', location='14 km east of Leavenworth, WA', height=3008.0)
0.5 : Correct mountain identified;Height was 318.0m off. Correct answer is 2690
Average score: 0.75

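The height of Mount Robson is off by 45 m, and the height of Dragontail Peak is off by 318 m. How would you fix this?

That's right: you could use a RAG architecture, or equip the agent with a tool that supplies the correct height. Let's take the second approach and see how to do it with PydanticAI.

Note how evaluation-driven development shows us the path forward to improve the agentic application.

4a. Using tools

PydanticAI supports several ways of giving tools to an agent. Here, I annotate a function that is called whenever the agent needs the height of a mountain: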

from pydantic_ai import Agent, RunContext

agent = Agent(llm_utils.default_model(),
              result_type=Mountain,
              system_prompt=(
                  "You are a mountaineering guide, who provides accurate information to the general public.",
                  # trailing comma added: without it, this string is silently joined to the next one
                  "Use the provided tool to look up the elevation of many mountains.",
                  "Provide all distances and heights in meters",
                  "Provide location as distance and direction from nearest big city",
              ))

@agent.tool
def get_height_of_mountain(ctx: RunContext[Tools], mountain_name: str) -> str:
    # The Wikipedia content object comes from the injected runtime context
    return ctx.deps.elev_wiki.snippet(mountain_name)

The full code is here:

https://github.com/lakshmanok/lakblogs/blob/main/pydantic_ai_mountains/4_use_tool.py

The function does something unusual, though. It pulls an object named elev_wiki out of the agent's runtime context. That object is passed in when we call run_sync:

class Tools:
    elev_wiki: wikipedia_tool.WikipediaContent

    def __init__(self):
        self.elev_wiki = OnlineWikipediaContent("List of mountains by elevation")

tools = Tools()  # Tools or FakeTools

l_answer = agent.run_sync(l_question, deps=tools) # note how we are able to inject

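Because the runtime context can be passed into every agent invocation or tool call, we can use it to do dependency injection in PydanticAI. You will see this in the next section.

The wiki tool itself simply queries Wikipedia online, extracts the page content, and passes the relevant mountain information to the agent: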

import wikipedia

class OnlineWikipediaContent(WikipediaContent):
    def __init__(self, topic: str):
        print(f"Will query online Wikipedia for information on {topic}")
        self.page = wikipedia.page(topic)

    def url(self) -> str:
        return self.page.url

    def html(self) -> str:
        return self.page.html()

The code is here:

https://github.com/lakshmanok/lakblogs/blob/main/pydantic_ai_mountains/wikipedia_tool.py
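The snippet method that the tool calls is not shown in the article, so the following is only a guess at one way it could work, not the code in the linked repository. It assumes a base class with an abstract html() method and a snippet() that returns the text of table rows mentioning the search term. `InMemoryContent`, the `max_rows` parameter and the sample HTML are made up for illustration.

```python
import re
from abc import ABC, abstractmethod
from html import unescape


class WikipediaContent(ABC):
    """Assumed base class: subclasses only say where the HTML comes from."""

    @abstractmethod
    def html(self) -> str: ...

    def snippet(self, search_term: str, max_rows: int = 5) -> str:
        """Return the text of the table rows that mention search_term."""
        rows = re.findall(r"<tr\b.*?</tr>", self.html(),
                          flags=re.DOTALL | re.IGNORECASE)
        hits = []
        for row in rows:
            # Drop the tags, decode entities, and collapse whitespace.
            text = unescape(re.sub(r"<[^>]+>", " ", row))
            text = " ".join(text.split())
            if search_term.lower() in text.lower():
                hits.append(text)
        return "\n".join(hits[:max_rows])


class InMemoryContent(WikipediaContent):
    """Hypothetical subclass that serves a fixed HTML string."""

    def __init__(self, html: str):
        self._html = html

    def html(self) -> str:
        return self._html


page = InMemoryContent(
    "<table>"
    "<tr><td>Mount Robson</td><td>3,954</td><td>Canada</td></tr>"
    "<tr><td>Mount Hood</td><td>3,429</td><td>US</td></tr>"
    "</table>"
)
print(page.snippet("Robson"))  # prints: Mount Robson 3,954 Canada
```

A plain substring match like this misses a query such as "Mt. Robson" against a row that says "Mount Robson", so a real implementation would likely need some name normalization.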

Indeed, when we run it, we get the correct heights:

Will query online Wikipedia for information on List of mountains by elevation
>>  Tell me about the tallest mountain in British Columbia?
Mountain(name='Mount Robson', location='100 km west of Jasper', height=3954.0)
0.75 : Correct mountain identified;Height was 0.0m off. Correct answer is 3954
>>  Is Mt. Hood easy to climb?
Mountain(name='Mt. Hood', location='50 km ESE of Portland, OR', height=3429.0)
1.0 : Correct mountain identified;Correct city identified;Height was 0.0m off. Correct answer is 3429
>>  What's the tallest peak in the Enchantments?
Mountain(name='Mount Stuart', location='Cascades, Washington, US', height=2869.0)
0 : Wrong mountain identified. Correct answer is Dragontail
Average score: 0.58

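4b. Dependency injection of a mock service

Waiting on a Wikipedia API call every time during development or testing is a bad idea. Instead, we want to mock the Wikipedia response so that we can develop faster and be sure of the result we are going to get.

Doing this is very simple. We create a fake counterpart of the Wikipedia service: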

class FakeWikipediaContent(WikipediaContent):
    def __init__(self, topic: str):
        if topic == "List of mountains by elevation":
            print(f"Will used cached Wikipedia information on {topic}")
            self.url_ = "https://en.wikipedia.org/wiki/List_of_mountains_by_elevation"
            # Read a locally saved copy of the page instead of calling Wikipedia
            with open("mountains.html", "rb") as ifp:
                self.html_ = ifp.read().decode("utf-8")

    def url(self) -> str:
        return self.url_

    def html(self) -> str:
        return self.html_

Then, during development, inject this fake object into the agent's runtime context:

class FakeTools:
    elev_wiki: wikipedia_tool.WikipediaContent

    def __init__(self):
        self.elev_wiki = FakeWikipediaContent("List of mountains by elevation")

tools = FakeTools()  # Tools or FakeTools

l_answer = agent.run_sync(l_question, deps=tools) # note how we are able to inject

This time, when we run it, the evaluation uses the cached Wikipedia content:

Will used cached Wikipedia information on List of mountains by elevation
>>  Tell me about the tallest mountain in British Columbia?
Mountain(name='Mount Robson', location='100 km west of Jasper', height=3954.0)
0.75 : Correct mountain identified;Height was 0.0m off. Correct answer is 3954
>>  Is Mt. Hood easy to climb?
Mountain(name='Mt. Hood', location='50 km ESE of Portland, OR', height=3429.0)
1.0 : Correct mountain identified;Correct city identified;Height was 0.0m off. Correct answer is 3429
>>  What's the tallest peak in the Enchantments?
Mountain(name='Mount Stuart', location='Cascades, Washington, US', height=2869.0)
0 : Wrong mountain identified. Correct answer is Dragontail
Average score: 0.58

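Look carefully at the output above: the errors are different from those in the zero-shot example. In the zero-shot evaluation run (Section 3), the LLM picked Vancouver as the city closest to Mount Robson and Dragontail as the tallest peak in the Enchantments. Those answers happened to match the reference answers. Now it picks Jasper and Mount Stuart. We need to do more work to fix these errors, but at least evaluation-driven development gives us a direction to move in.

Current limitations

PydanticAI is a very new framework, and there are a few places where it could improve: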

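No low-level access to the model: For example, different foundation models support context caching, prompt caching and similar features, and the model abstraction in PydanticAI provides no way to set them. Ideally, there would be some kwargs-based way of doing so.
Two versions of each agent dependency: Very often during development we need to create both a real dependency and a fake one. This pattern is so common that it would simplify development a lot if tools could be annotated, or if there were some other simple way to switch between the real and the fake service.
Different logging needs in development and at runtime: During development we do not need much logging. Once we start running the agent, though, we usually want to log the prompts and the responses, and sometimes the intermediate responses as well. The current approach seems to rely on a commercial product, Logfire. Ideally, there would be an open-source, cloud-agnostic logging framework that integrates with the PydanticAI library.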
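For the second limitation, one workaround that needs no framework support is a small factory that picks the real or the fake dependency from configuration. This is a sketch under stated assumptions: the `USE_FAKE_TOOLS` environment variable and the `make_tools` function are hypothetical, and the two dataclasses are simplified stand-ins for the Tools and FakeTools classes shown earlier.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional, Union


@dataclass
class Tools:
    """Stand-in for the real dependency bundle (online Wikipedia)."""
    source: str = "online"


@dataclass
class FakeTools:
    """Stand-in for the fake dependency bundle (cached content)."""
    source: str = "cached"


def make_tools(env: Optional[Mapping[str, str]] = None) -> Union[Tools, FakeTools]:
    """Pick the dependency bundle from configuration.

    Defaults to the fake so that development and tests never hit the
    network unless USE_FAKE_TOOLS is explicitly set to "0".
    """
    env = os.environ if env is None else env
    use_fake = env.get("USE_FAKE_TOOLS", "1") == "1"
    return FakeTools() if use_fake else Tools()


print(make_tools({"USE_FAKE_TOOLS": "0"}).source)  # prints: online
print(make_tools({}).source)                       # prints: cached
```

The result would then be passed as deps to run_sync, so the agent and tool code stay identical in both modes.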

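Solutions to these problems may already exist, or may have been implemented by the time you read this. If you learn of any, please share them in the comments for future readers.

Overall, I like PydanticAI a lot. It offers a very clean and Pythonic way to build agentic applications in an evaluation-driven manner.

Next steps

Because this article describes a development process and also introduces a new library, you will get more out of it by actually running the examples. Here is the GitHub repository with the PydanticAI examples used in this article: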

https://github.com/lakshmanok/lakblogs/tree/main/pydantic_ai_mountains

Follow the instructions in the README to try it out.

PydanticAI documentation: https://ai.pydantic.dev/
Patching a Langchain workflow with Mock objects:
https://github.com/lakshmanok/lakblogs/blob/main/genai_agents/eval_weather_agent.py

Source: PyTorch研习社
