SmartFlowAI
Author: 企鹅火烈鸟🦩
About 3,400 characters; estimated reading time: 10 minutes
This article is translated and adapted from: Using LLM-as-a-judge 🧑⚖️ for an automated and versatile evaluation
Introduction
Evaluating large language models (LLMs) is often a daunting task: given their broad capabilities, the tasks they are given usually have to be judged against requirements that are very wide-ranging and loosely defined. For example, an assistant's answer to a question might be:
- not grounded in the context
- repetitive, repetitive, repetitive
- grammatically incorrect
- excessively long and wordy, making the spoken or written content overly detailed and drawn out
- incoherent
- ...
The list of criteria goes on and on. And even if we settle on a reasonable set, each criterion is hard to measure: "designing a rule-based program to assess the outputs is extremely challenging. Traditional evaluation metrics based on the similarity between outputs and reference answers (e.g., ROUGE, BLEU) are also ineffective for these questions."
✅ A powerful solution for assessing outputs in a human-like way, without requiring a lot of human time, is to use an LLM as a judge.
The idea is simple: ask an LLM to do the grading for you. 🤖
Time for some code!
Let's dive right into the code! You will need to set things up carefully to get good results.
!pip install huggingface_hub datasets pandas tqdm -q
import re
import pandas as pd
from tqdm.auto import tqdm
from datasets import load_dataset
from huggingface_hub import InferenceClient, notebook_login
tqdm.pandas() # load tqdm's pandas support
pd.set_option("display.max_colwidth", None)
notebook_login()
repo_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"
llm_client = InferenceClient(
model=repo_id,
timeout=120,
)
# Test your LLM client
llm_client.text_generation(prompt="How are you today?", max_new_tokens=20)
1. Prepare the creation and evaluation of our LLM judge
Suppose you want to give an LLM a specific task, such as answering open-ended questions. The difficulty is that the quality of an answer is hard to measure: for instance, exact string matching would flag many correct but differently worded answers as wrong. You could have human annotators rate the outputs, but that is very time-consuming for them, and if you want to update the model or the questions, you have to start all over again.
✅ In this situation, you can set up an LLM as a judge.
But to use an LLM as a judge, you first need to evaluate how reliably it rates your model's outputs.
➡️ So the first step is... to create a human evaluation dataset. The good news is that you only need human annotations for a small number of examples (around 30). That is enough to get a good sense of performance, and you will be able to reuse this dataset every time you want to test your LLM-as-a-judge.
In our case, we will use feedbackQA, which contains two human evaluations and scores for each question/answer pair: a sample of 30 examples will stand in for what your small evaluation dataset could look like.
ratings = load_dataset("McGill-NLP/feedbackQA")["train"]
ratings = pd.DataFrame(ratings)
ratings["review_1"] = ratings["feedback"].apply(lambda x: x["rating"][0])
ratings["explanation_1"] = ratings["feedback"].apply(lambda x: x["explanation"][0])
ratings["review_2"] = ratings["feedback"].apply(lambda x: x["rating"][1])
ratings["explanation_2"] = ratings["feedback"].apply(lambda x: x["explanation"][1])
ratings = ratings.drop(columns=["feedback"])
# Map scores to numeric values
conversion_dict = {"Excellent": 4, "Acceptable": 3, "Could be Improved": 2, "Bad": 1}
ratings["score_1"] = ratings["review_1"].map(conversion_dict)
ratings["score_2"] = ratings["review_2"].map(conversion_dict)
It is always a good idea to compute a baseline for the scores: here it can be, for example, the agreement between the two human raters, measured by the Pearson correlation of the scores they give.
print("Correlation between 2 human raters:")
print(f"{ratings['score_1'].corr(ratings['score_2'], method='pearson'):.3f}")
Correlation between 2 human raters:
0.563
This correlation between the two human raters is not that good. If your human ratings are really poor, it probably means the rating criteria are not clear enough. It also means our "ground truth" contains noise, so we cannot expect any algorithmic evaluation to match it very closely. However, we can reduce this noise:
- by taking the average score as our ground truth instead of any single score, which should iron out some of the irregularities (a minimal sketch of this option is shown at the end of this section);
- by keeping only the samples where the human reviewers agree.
Here we will go with the latter option and keep only the examples on which the two human reviewers agree.
# Sample examples
ratings_where_raters_agree = ratings.loc[ratings["score_1"] == ratings["score_2"]]
examples = ratings_where_raters_agree.groupby("score_1").sample(7, random_state=1214)
examples["human_score"] = examples["score_1"]
# Visualize 1 sample for each score
display(examples.groupby("human_score").first())
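For reference, here is a minimal sketch of the first noise-reduction option mentioned above, averaging the two human scores into a single ground-truth value; the mean_human_score column name is an illustrative choice and is not used in the rest of this article:
# Alternative noise-reduction option (not used below): average the two human
# scores instead of keeping only the samples where the raters agree
ratings["mean_human_score"] = ratings[["score_1", "score_2"]].mean(axis=1)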
2. Create our LLM judge
We build our LLM judge with a basic prompt containing the following elements:
- a description of the task
- a description of the scale: minimum, maximum, and value type (here a float)
- the expected output format
- the beginning of an answer, to nudge the LLM as far as possible
JUDGE_PROMPT = """
You will be given a user_question and system_answer couple.
Your task is to provide a 'total rating' scoring how well the system_answer answers the user concerns expressed in the user_question.
Give your answer as a float on a scale of 0 to 10, where 0 means that the system_answer is not helpful at all, and 10 means that the answer completely and helpfully addresses the question.
Provide your feedback as follows:
Feedback:::
Total rating: (your rating, as a float between 0 and 10)
Now here are the question and answer.
Question: {question}
Answer: {answer}
Feedback:::
Total rating: """
examples["llm_judge"] = examples.progress_apply(
lambda x: llm_client.text_generation(
prompt=JUDGE_PROMPT.format(question=x["question"], answer=x["answer"]),
max_new_tokens=1000,
),
axis=1,
)
def extract_judge_score(answer: str, split_str: str = "Total rating:") -> float:
    # Extract the first numeric value that follows the "Total rating:" marker
    try:
        if split_str in answer:
            rating = answer.split(split_str)[1]
        else:
            rating = answer
        digit_groups = [el.strip() for el in re.findall(r"\d+(?:\.\d+)?", rating)]
        return float(digit_groups[0])
    except Exception as e:
        print(e)
        return None
examples["llm_judge_score"] = examples["llm_judge"].apply(extract_judge_score)
# Rescale the 0-10 LLM score to the 1-4 scale of the human score
# (a linear rescaling does not change the Pearson correlation computed below)
examples["llm_judge_score"] = (examples["llm_judge_score"] / 10) * 3 + 1
print("Correlation between LLM-as-a-judge and the human raters:")
print(f"{examples['llm_judge_score'].corr(examples['human_score'], method='pearson'):.3f}")
Correlation between LLM-as-a-judge and the human raters:
0.567
Not bad, considering that the Pearson correlation between two random, independent variables would be 0! But we can easily do better. 🔝
3. Improve the LLM judge
As shown by Aparna Dhinakaran, LLMs are not great at rating outputs on continuous ranges. Her article gives us some best practices for building a better prompt:
⏳ Leave more time for thought by adding an Evaluation field before the final answer.
🔢 Use a small integer scale such as 1-4 or 1-5 instead of the large float scale we used before.
👩🏫 Provide an indicative scale for guidance.
We even add a carrot to motivate the LLM!
IMPROVED_JUDGE_PROMPT = """
You will be given a user_question and system_answer couple.
Your task is to provide a 'total rating' scoring how well the system_answer answers the user concerns expressed in the user_question.
Give your answer on a scale of 1 to 4, where 1 means that the system_answer is not helpful at all, and 4 means that the system_answer completely and helpfully addresses the user_question.
Here is the scale you should use to build your answer:
1: The system_answer is terrible: completely irrelevant to the question asked, or very partial
2: The system_answer is mostly not helpful: misses some key aspects of the question
3: The system_answer is mostly helpful: provides support, but still could be improved
4: The system_answer is excellent: relevant, direct, detailed, and addresses all the concerns raised in the question
Provide your feedback as follows:
Feedback:::
Evaluation: (your rationale for the rating, as a text)
Total rating: (your rating, as a number between 1 and 4)
You MUST provide values for 'Evaluation:' and 'Total rating:' in your answer.
Now here are the question and answer.
Question: {question}
Answer: {answer}
Provide your feedback. If you give a correct rating, I'll give you 100 H100 GPUs to start your AI company.
Feedback:::
Evaluation: """
examples["llm_judge_improved"] = examples.progress_apply(
lambda x: llm_client.text_generation(
prompt=IMPROVED_JUDGE_PROMPT.format(question=x["question"], answer=x["answer"]),
max_new_tokens=500,
),
axis=1,
)
examples["llm_judge_improved_score"] = examples["llm_judge_improved"].apply(extract_judge_score)
print("Correlation between LLM-as-a-judge and the human raters:")
print(f"{examples['llm_judge_improved_score'].corr(examples['human_score'], method='pearson'):.3f}")
Correlation between LLM-as-a-judge and the human raters:
0.843
The correlation improved by nearly 30% with only a few tweaks to the prompt (a few of those points come from my shameless tip to the LLM, which I hereby declare to be legally non-binding). Quite impressive! 👏
Let's display a few errors made by our LLM judge and analyze them:
errors = pd.concat(
[
examples.loc[examples["llm_judge_improved_score"] > examples["human_score"]].head(1),
examples.loc[examples["llm_judge_improved_score"] < examples["human_score"]].head(2),
]
)
display(
errors[
[
"question",
"answer",
"human_score",
"explanation_1",
"llm_judge_improved_score",
"llm_judge_improved",
]
]
)
The disagreements are minor: overall, our system seems to have reached a good level of performance!
4. How can we take our LLM judge even further?
🎯 You will never reach 100%: first, note that our human ground truth certainly contains some noise, so agreement/correlation will never go up to 100% even with a perfect LLM judge.
🧭 Provide a reference: if you can get a reference answer for each question, you should definitely feed it to the judge LLM in its prompt to get better results! (A sketch of such a prompt is shown after the additive-scale example below.)
▶️ Provide few-shot examples: adding a few few-shot examples of questions and ground-truth evaluations to the prompt can improve the results. (I tried it here; it did not improve the results in this case, so I skipped it, but it might work for your dataset!)
➕ Additive scale: when the judgment can be split into atomic criteria, using an additive scale can further improve the results: see below 👇.
ADDITIVE_PROMPT = """
(...)
- Award 1 point if the answer is related to the question.
- Give 1 additional point if the answer is clear and precise.
- Provide 1 further point if the answer is true.
- One final point should be awarded if the answer provides additional resources to support the user.
...
"""
Use structured generation:
With structured generation, you can configure the LLM judge to return its output directly as JSON with Evaluation and Total rating fields, which makes parsing much easier.
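As a rough sketch of this idea, the snippet below constrains the judge's output to a JSON schema via grammar-constrained decoding. It assumes a recent huggingface_hub version whose text_generation accepts a grammar argument, an inference backend (such as TGI) that supports it, and an illustrative judge_schema; in practice you would also adapt the judge prompt to ask for JSON output:
import json

# Illustrative JSON schema for the judge's output (an assumption, not from the source)
judge_schema = {
    "type": "object",
    "properties": {
        "evaluation": {"type": "string"},
        "total_rating": {"type": "integer", "minimum": 1, "maximum": 4},
    },
    "required": ["evaluation", "total_rating"],
}

row = examples.iloc[0]
raw_output = llm_client.text_generation(
    prompt=IMPROVED_JUDGE_PROMPT.format(question=row["question"], answer=row["answer"]),
    max_new_tokens=500,
    # Grammar-constrained decoding: requires backend and client support for `grammar`
    grammar={"type": "json", "value": judge_schema},
)
parsed = json.loads(raw_output)  # no regex parsing needed
print(parsed["total_rating"])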
Conclusion
That's all for today: that wraps up using an LLM as a judge for automated and versatile evaluation.