Can benchmarks really tell which AI model is better? | The Economist Close Reading (199)


Summary: New AI models routinely claim to beat all of their peers on some benchmark, yet those claims do not always match real-world performance. This article takes a close look at the various problems with the metrics used to evaluate AI models.

GPT, Claude, Llama? How to tell which AI model is best


Beware model-makers marking their own homework


August 1, 2024, 09:21 AM


WHEN META, the parent company of Facebook, announced its latest open-source large language model (LLM) on July 23rd, it claimed that the most powerful version of Llama 3.1 had “state-of-the-art capabilities that rival the best closed-source models” such as GPT-4o and Claude 3.5 Sonnet. Meta’s announcement included a table, showing the scores achieved by these and other models on a series of popular benchmarks with names such as MMLU, GSM8K and GPQA.


On MMLU, for example, the most powerful version of Llama 3.1 scored 88.6%, against 88.7% for GPT-4o and 88.3% for Claude 3.5 Sonnet, rival models made by OpenAI and Anthropic, two AI startups, respectively. Claude 3.5 Sonnet had itself been unveiled on June 20th, again with a table of impressive benchmark scores. And on July 24th, the day after Llama 3.1’s debut, Mistral, a French AI startup, announced Mistral Large 2, its latest LLM, with—you’ve guessed it—yet another table of benchmarks. Where do such numbers come from, and can they be trusted?


Having accurate, reliable benchmarks for AI models matters, and not just for the bragging rights of the firms making them. Benchmarks “define and drive progress”, telling model-makers where they stand and incentivising them to improve, says Percy Liang of the Institute for Human-Centred Artificial Intelligence at Stanford University. Benchmarks chart the field’s overall progress and show how AI systems compare with humans at specific tasks. They can also help users decide which model to use for a particular job and identify promising new entrants in the space, says Clémentine Fourrier, a specialist in evaluating LLMs at Hugging Face, a startup that provides tools for AI developers.


But, says Dr Fourrier, benchmark scores “should be taken with a pinch of salt”. Model-makers are, in effect, marking their own homework—and then using the results to hype their products and talk up their company valuations. Yet all too often, she says, their grandiose claims fail to match real-world performance, because existing benchmarks, and the ways they are applied, are flawed in various ways.


One problem with benchmarks such as MMLU (massive multi-task language understanding) is that they are simply too easy for today’s models. MMLU was created in 2020 and consists of 15,908 multiple-choice questions, each with four possible answers, across 57 topics including maths, American history, science and law. At the time, most language models scored little better than 25% on MMLU, which is what you would get by picking answers at random; OpenAI’s GPT-3 did best, with a score of 43.9%. But since then, models have improved, with the best now scoring between 88% and 90%.

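To make the arithmetic behind that 25% baseline concrete, here is a minimal sketch, in Python, of how a four-option multiple-choice benchmark is scored. It is not MMLU's official evaluation harness, and the answer key is randomly generated purely for illustration: accuracy is simply the share of questions answered correctly, so a model that guesses uniformly at random converges on about one in four.

```python
# A minimal sketch of four-option multiple-choice scoring (not MMLU's
# official harness; the answer key below is invented for illustration).
import random

def accuracy(predictions: list[str], answer_key: list[str]) -> float:
    """Share of questions where the predicted letter matches the key."""
    correct = sum(p == a for p, a in zip(predictions, answer_key))
    return correct / len(answer_key)

# MMLU has 15,908 questions across 57 topics, each with choices A-D.
answer_key = [random.choice("ABCD") for _ in range(15_908)]

# A "model" that picks an answer uniformly at random.
random_guesses = [random.choice("ABCD") for _ in answer_key]

print(f"random-guess accuracy: {accuracy(random_guesses, answer_key):.1%}")  # ~25%
```

Published scores for the same model can still differ slightly in practice, because real harnesses vary in how they elicit and extract the chosen letter, for example by comparing answer log-probabilities rather than parsing generated text.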

This means it is difficult to draw meaningful distinctions from their scores, a problem known as “saturation” (see chart). “It’s like grading high-school students on middle-school tests,” says Dr Fourrier. More difficult benchmarks have been devised—MMLU-Pro has tougher questions and ten possible answers rather than four. GPQA is like MMLU at PhD level, on selected science topics; today’s best models tend to score between 50% and 60% on it. Another benchmark, MuSR (multi-step soft reasoning), tests reasoning ability using, for example, murder-mystery scenarios. When a person reads such a story and works out who the killer is, they are combining an understanding of motivation with language comprehension and logical deduction. AI models are not so good at this kind of “soft reasoning” over multiple steps. So far, few models score better than random on MuSR.

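As a rough back-of-the-envelope illustration of saturation (my own calculation, not the article's analysis), one can treat a benchmark accuracy as a binomial proportion over MMLU's 15,908 questions. On that assumption, each headline score carries sampling noise of roughly a quarter of a percentage point, which is about the size of the gaps separating the leading models quoted above; harder variants also lower the guessing floor, since ten options put MMLU-Pro's random baseline at 10% rather than 25%.

```python
# A back-of-the-envelope sketch (an illustrative assumption, not the
# article's analysis): treat accuracy on MMLU's 15,908 questions as a
# binomial proportion and compare its standard error with the gaps
# between the leading models' reported scores.
import math

N_QUESTIONS = 15_908  # MMLU question count cited above

def standard_error(score: float, n: int = N_QUESTIONS) -> float:
    """Approximate standard error of an accuracy measured on n questions."""
    return math.sqrt(score * (1.0 - score) / n)

reported = [("GPT-4o", 0.887), ("Llama 3.1", 0.886), ("Claude 3.5 Sonnet", 0.883)]
for model, score in reported:
    print(f"{model}: {score:.1%} ± {standard_error(score):.2%}")

# Each score comes with roughly ±0.25 percentage points of noise, so the
# 0.1-0.4 point gaps between these models are hard to read as meaningful.
```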
