【AI Learning】Learning to Reason with LLMs (from the original OpenAI post)

Digest · 2024-09-14 00:01 · Singapore


▌Introduction by 锅头

On September 12, 2024, OpenAI released a new series of AI models: OpenAI o1, a new large language model trained with reinforcement learning to perform complex reasoning. o1 thinks before it answers: it can produce a long internal chain of thought before responding to the user.
This article is 锅头's set of notes on the OpenAI o1 model, written while studying the original announcement so as not to be misled or swayed by public hype and exaggerated commentary, and shared as a reference for readers who likewise want to get at the source.



Learning to Reason with LLMs
We are introducing OpenAI o1, a new large language model trained with reinforcement learning to perform complex reasoning. o1 thinks before it answers—it can produce a long internal chain of thought before responding to the user.
OpenAI o1 ranks in the 89th percentile on competitive programming questions (Codeforces), places among the top 500 students in the US in a qualifier for the USA Math Olympiad (AIME), and exceeds human PhD-level accuracy on a benchmark of physics, biology, and chemistry problems (GPQA). While the work needed to make this new model as easy to use as current models is still ongoing, we are releasing an early version of this model, OpenAI o1-preview, for immediate use in ChatGPT and to trusted API users.
Our large-scale reinforcement learning algorithm teaches the model how to think productively using its chain of thought in a highly data-efficient training process. We have found that the performance of o1 consistently improves with more reinforcement learning (train-time compute) and with more time spent thinking (test-time compute). The constraints on scaling this approach differ substantially from those of LLM pretraining, and we are continuing to investigate them.


Evals
To highlight the reasoning improvement over GPT-4o, we tested our models on a diverse set of human exams and ML benchmarks. We show that o1 significantly outperforms GPT-4o on the vast majority of these reasoning-heavy tasks. Unless otherwise specified, we evaluated o1 on the maximal test-time compute setting.
In many reasoning-heavy benchmarks, o1 rivals the performance of human experts. Recent frontier models do so well on MATH and GSM8K that these benchmarks are no longer effective at differentiating models. We evaluated math performance on AIME, an exam designed to challenge the brightest high school math students in America. On the 2024 AIME exams, GPT-4o only solved on average 12% (1.8/15) of problems. o1 averaged 74% (11.1/15) with a single sample per problem, 83% (12.5/15) with consensus among 64 samples, and 93% (13.9/15) when re-ranking 1000 samples with a learned scoring function. A score of 13.9 places it among the top 500 students nationally and above the cutoff for the USA Mathematical Olympiad.
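To make the "consensus among 64 samples" setting concrete, one simple reading is majority voting over the final answers of independently sampled solutions. The sketch below illustrates that idea; the function name and sample data are illustrative assumptions, not OpenAI's actual evaluation code.

```python
from collections import Counter

def consensus_answer(sampled_answers):
    """Return the most frequent final answer among independently sampled solutions
    (a majority-vote reading of 'consensus among 64 samples')."""
    answer, _ = Counter(sampled_answers).most_common(1)[0]
    return answer

# Hypothetical usage with a handful of sampled final answers to one AIME problem:
samples = ["204", "204", "197", "204", "210", "204"]
print(consensus_answer(samples))  # -> "204"
```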
We also evaluated o1 on GPQA diamond, a difficult intelligence benchmark which tests for expertise in chemistry, physics and biology. In order to compare models to humans, we recruited experts with PhDs to answer GPQA-diamond questions. We found that o1 surpassed the performance of those human experts, becoming the first model to do so on this benchmark. These results do not imply that o1 is more capable than a PhD in all respects — only that the model is more proficient in solving some problems that a PhD would be expected to solve. On several other ML benchmarks, o1 improved over the state-of-the-art. With its vision perception capabilities enabled, o1 scored 78.2% on MMMU, making it the first model to be competitive with human experts. It also outperformed GPT-4o on 54 out of 57 MMLU subcategories.

Chain of Thought



Similar to how a human may think for a long time before responding to a difficult question, o1 uses a chain of thought when attempting to solve a problem. Through reinforcement learning, o1 learns to hone its chain of thought and refine the strategies it uses. It learns to recognize and correct its mistakes. It learns to break down tricky steps into simpler ones. It learns to try a different approach when the current one isn’t working. This process dramatically improves the model’s ability to reason. To illustrate this leap forward, we showcase the chain of thought from o1-preview on several difficult problems below.


Coding
We trained a model that scored 213 points and ranked in the 49th percentile in the 2024 International Olympiad in Informatics (IOI), by initializing from o1 and training to further improve programming skills. This model competed in the 2024 IOI under the same conditions as the human contestants. It had ten hours to solve six challenging algorithmic problems and was allowed 50 submissions per problem.
For each problem, our system sampled many candidate submissions and submitted 50 of them based on a test-time selection strategy. Submissions were selected based on performance on the IOI public test cases, model-generated test cases, and a learned scoring function. If we had instead submitted at random, we would have only scored 156 points on average, suggesting that this strategy was worth nearly 60 points under competition constraints.
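As a rough sketch of what such a test-time selection strategy could look like, the snippet below ranks candidate programs by a weighted mix of public-test pass rate, model-generated-test pass rate, and a learned score, then keeps the top 50. The weights, field names, and data format are assumptions made for illustration; the post does not publish the actual selection code.

```python
def select_submissions(candidates, k=50):
    """candidates: list of dicts such as
       {"code": "...", "public_pass": 0.8, "generated_pass": 0.6, "learned_score": 0.7}.
    Rank by a weighted combination of the three signals and keep the top k."""
    def score(c):
        # Placeholder weights; any real weighting used in the evaluation is not disclosed.
        return 0.5 * c["public_pass"] + 0.3 * c["generated_pass"] + 0.2 * c["learned_score"]
    return sorted(candidates, key=score, reverse=True)[:k]
```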
With a relaxed submission constraint, we found that model performance improved significantly. When allowed 10,000 submissions per problem, the model achieved a score of 362.14 – above the gold medal threshold – even without any test-time selection strategy.
Finally, we simulated competitive programming contests hosted by Codeforces to demonstrate this model's coding skill. Our evaluations closely matched competition rules and allowed for 10 submissions. GPT-4o achieved an Elo rating of 808, which is in the 11th percentile of human competitors. This model far exceeded both GPT-4o and o1: it achieved an Elo rating of 1807, performing better than 93% of competitors.
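For intuition about what a gap of roughly 1,000 Elo points implies, the standard Elo expected-score formula can be applied to the two ratings reported above. The snippet below is just that textbook formula, not part of OpenAI's evaluation pipeline.

```python
def elo_expected_score(r_a, r_b):
    """Standard Elo formula: expected head-to-head score of a player rated r_a
    against a player rated r_b."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

# Ratings reported above: the fine-tuned model at 1807 vs GPT-4o at 808.
print(round(elo_expected_score(1807, 808), 3))  # ~0.997
```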



Human preference evaluation



In addition to exams and academic benchmarks, we also evaluated human preference of o1-preview vs GPT-4o on challenging, open-ended prompts in a broad spectrum of domains. In this evaluation, human trainers were shown anonymized responses to a prompt from o1-preview and GPT-4o, and voted for which response they preferred. o1-preview is preferred to gpt-4o by a large margin in reasoning-heavy categories like data analysis, coding, and math. However, o1-preview is not preferred on some natural language tasks, suggesting that it is not well-suited for all use cases.
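A minimal way to turn such pairwise votes into a win rate is sketched below; the labels and the tie-handling choice are assumptions, since the exact aggregation protocol is not described in the post.

```python
def win_rate(votes, model="o1-preview"):
    """votes: list of labels, each 'o1-preview', 'gpt-4o', or 'tie', from pairwise comparisons.
    Returns the fraction of non-tie votes won by `model`."""
    decisive = [v for v in votes if v != "tie"]
    if not decisive:
        return 0.5  # no decisive votes: treat the comparison as even
    return sum(v == model for v in decisive) / len(decisive)

# Hypothetical tally of four trainer votes on one prompt:
print(win_rate(["o1-preview", "o1-preview", "gpt-4o", "tie"]))  # ~0.667
```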




Safety



Chain of thought reasoning provides new opportunities for alignment and safety. We found that integrating our policies for model behavior into the chain of thought of a reasoning model is an effective way to robustly teach human values and principles. By teaching the model our safety rules and how to reason about them in context, we found evidence of reasoning capability directly benefiting model robustness: o1-preview achieved substantially improved performance on key jailbreak evaluations and our hardest internal benchmarks for evaluating our model's safety refusal boundaries. We believe that using a chain of thought offers significant advances for safety and alignment because (1) it enables us to observe the model thinking in a legible way, and (2) the model reasoning about safety rules is more robust to out-of-distribution scenarios.
To stress-test our improvements, we conducted a suite of safety tests and red-teaming before deployment, in accordance with our Preparedness Framework. We found that chain of thought reasoning contributed to capability improvements across our evaluations. Of particular note, we observed interesting instances of reward hacking. Detailed results from these evaluations can be found in the accompanying System Card.


Hiding the Chains of Thought



We believe that a hidden chain of thought presents a unique opportunity for monitoring models. Assuming it is faithful and legible, the hidden chain of thought allows us to "read the mind" of the model and understand its thought process. For example, in the future we may wish to monitor the chain of thought for signs of manipulating the user. However, for this to work the model must have freedom to express its thoughts in unaltered form, so we cannot train any policy compliance or user preferences onto the chain of thought. We also do not want to make an unaligned chain of thought directly visible to users.
Therefore, after weighing multiple factors including user experience, competitive advantage, and the option to pursue the chain of thought monitoring, we have decided not to show the raw chains of thought to users. We acknowledge this decision has disadvantages. We strive to partially make up for it by teaching the model to reproduce any useful ideas from the chain of thought in the answer. For the o1 model series we show a model-generated summary of the chain of thought.



Conclusion



o1 significantly advances the state-of-the-art in AI reasoning. We plan to release improved versions of this model as we continue iterating. We expect these new reasoning capabilities will improve our ability to align models to human values and principles. We believe o1 – and its successors – will unlock many new use cases for AI in science, coding, math, and related fields. We are excited for users and API developers to discover how it can improve their daily work.



▌Source

[1] Learning to Reason with LLMs: openai.com/index/learning-to-reason-with-llms/
