[AI Learning] An Introduction to OpenAI o1-mini

Digest   2024-09-14 00:01   Singapore


▌锅头's Foreword

On September 12, 2024, OpenAI released OpenAI o1-mini, a cost-efficient reasoning model. o1-mini excels at STEM, especially math and coding, nearly matching the performance of OpenAI o1 on evaluation benchmarks such as AIME and Codeforces.
This article records 锅头's study notes on the OpenAI o1-mini model, taken directly from the original announcement so as not to be misled or swayed by the hype and exaggerated commentary circulating online; it is also shared as a reference for readers who likewise want to go to the source.



OpenAI o1-mini: The Original Announcement

We're releasing OpenAI o1-mini, a cost-efficient reasoning model. o1-mini excels at STEM, especially math and coding—nearly matching the performance of OpenAI o1 on evaluation benchmarks such as AIME and Codeforces. We expect o1-mini will be a faster, cost-effective model for applications that require reasoning without broad world knowledge.

Today, we are launching o1-mini to tier 5 API users at a cost that is 80% cheaper than OpenAI o1-preview. ChatGPT Plus, Team, Enterprise, and Edu users can use o1-mini as an alternative to o1-preview, with higher rate limits and lower latency (see Model Speed).

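For API users, a request to o1-mini looks like any other Chat Completions call with the model id swapped in. The sketch below only assembles the JSON request body (it does not send it, which would require a tier-5 API key); the endpoint URL and payload shape follow the standard OpenAI REST API, and the prompt string is just a placeholder.

```python
import json

# Standard OpenAI Chat Completions endpoint (sending requires an API key).
API_URL = "https://api.openai.com/v1/chat/completions"


def build_o1_mini_request(prompt: str) -> dict:
    """Assemble the JSON body for a Chat Completions request to o1-mini."""
    return {
        "model": "o1-mini",
        "messages": [{"role": "user", "content": prompt}],
    }


payload = build_o1_mini_request("Prove that the sum of two odd integers is even.")
print(json.dumps(payload, indent=2))
```

The same body can be POSTed with any HTTP client, with the API key passed in an `Authorization: Bearer` header.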

Optimized for STEM Reasoning


Large language models such as o1 are pre-trained on vast text datasets. While these high-capacity models have broad world knowledge, they can be expensive and slow for real-world applications. In contrast, o1-mini is a smaller model optimized for STEM reasoning during pretraining. After training with the same high-compute reinforcement learning (RL) pipeline as o1, o1-mini achieves comparable performance on many useful reasoning tasks, while being significantly more cost efficient.
When evaluated on benchmarks requiring intelligence and reasoning, o1-mini performs well compared to o1-preview and o1. However, o1-mini performs worse on tasks requiring non-STEM factual knowledge (see Limitations).

Mathematics: In the high school AIME math competition, o1-mini (70.0%) is competitive with o1 (74.4%) while being significantly cheaper, and it outperforms o1-preview (44.6%). o1-mini's score (about 11/15 questions) places it among approximately the top 500 US high-school students.
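As a quick sanity check on the "about 11/15" figure, the accuracy percentages quoted above can be converted back into approximate question counts (an AIME exam has 15 questions; the scores are taken from the text):

```python
# Each AIME exam has 15 questions; scores below are the percentages
# quoted in the announcement.
AIME_QUESTIONS = 15
scores = {"o1-mini": 0.700, "o1": 0.744, "o1-preview": 0.446}

for model, acc in scores.items():
    # e.g. o1-mini: 0.700 * 15 = 10.5 questions, which rounds to ~11/15
    print(f"{model}: ~{acc * AIME_QUESTIONS:.1f} / {AIME_QUESTIONS} questions")
```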
Coding: On the Codeforces competition website, o1-mini achieves 1650 Elo, which is again competitive with o1 (1673) and higher than o1-preview (1258). This Elo score puts the model at approximately the 86th percentile of programmers who compete on the Codeforces platform. o1-mini also performs well on the HumanEval coding benchmark and high-school level cybersecurity capture the flag challenges (CTFs).

STEM: On some academic benchmarks requiring reasoning, such as GPQA (science) and MATH-500, o1-mini outperforms GPT-4o. o1-mini does not perform as well as GPT-4o on tasks such as MMLU and lags behind o1-preview on GPQA due to its lack of broad world knowledge.

Human preference evaluation: We had human raters compare o1-mini to GPT-4o on challenging, open-ended prompts in various domains, using the same methodology as our o1-preview vs GPT-4o comparison. Similar to o1-preview, o1-mini is preferred to GPT-4o in reasoning-heavy domains, but is not preferred to GPT-4o in language-focused domains.


Model Speed


As a concrete example, we compared responses from GPT-4o, o1-mini, and o1-preview on a word reasoning question. While GPT-4o did not answer correctly, both o1-mini and o1-preview did, and o1-mini reached the answer around 3-5x faster.


Safety


o1-mini is trained using the same alignment and safety techniques as o1-preview. The model has 59% higher jailbreak robustness on an internal version of the StrongREJECT dataset compared to GPT-4o. Before deployment, we carefully assessed the safety risks of o1-mini using the same approach to preparedness, external red-teaming, and safety evaluations as o1-preview. We are publishing the detailed results from these evaluations in the accompanying system card.


Limitations and What’s Next
Due to its specialization in STEM reasoning capabilities, o1-mini's factual knowledge on non-STEM topics such as dates, biographies, and trivia is comparable to small LLMs such as GPT-4o mini. We will address these limitations in future versions, and we will also experiment with extending the model to other modalities and specialties outside of STEM.



▌Sources

[1] OpenAI o1-mini original announcement: openai.com/index/openai-o1-mini-advancing-cost-efficient-reasoning/
