【AI Learning】Learning to Reason with LLMs (from the original OpenAI post)

Digest · 2024-09-14 00:01 · Singapore


▌Introduction by 锅头

On September 12, 2024, OpenAI released a new series of AI models: OpenAI o1, a new large language model trained with reinforcement learning to perform complex reasoning. o1 thinks before it answers: it can produce a long internal chain of thought before responding to the user.
This article is 锅头's set of notes on the OpenAI o1 model, written while studying the original announcement so as not to be misled or swayed by public hype and exaggerated commentary, and shared as a reference for readers who likewise want to get at the source.



Learning to Reason with LLMs
We are introducing OpenAI o1, a new large language model trained with reinforcement learning to perform complex reasoning. o1 thinks before it answers—it can produce a long internal chain of thought before responding to the user.
OpenAI o1 ranks in the 89th percentile on competitive programming questions (Codeforces), places among the top 500 students in the US in a qualifier for the USA Math Olympiad (AIME), and exceeds human PhD-level accuracy on a benchmark of physics, biology, and chemistry problems (GPQA). While the work needed to make this new model as easy to use as current models is still ongoing, we are releasing an early version of this model, OpenAI o1-preview, for immediate use in ChatGPT and to trusted API users.
Our large-scale reinforcement learning algorithm teaches the model how to think productively using its chain of thought in a highly data-efficient training process. We have found that the performance of o1 consistently improves with more reinforcement learning (train-time compute) and with more time spent thinking (test-time compute). The constraints on scaling this approach differ substantially from those of LLM pretraining, and we are continuing to investigate them.


Evals
To highlight the reasoning improvement over GPT-4o, we tested our models on a diverse set of human exams and ML benchmarks. We show that o1 significantly outperforms GPT-4o on the vast majority of these reasoning-heavy tasks. Unless otherwise specified, we evaluated o1 on the maximal test-time compute setting.
In many reasoning-heavy benchmarks, o1 rivals the performance of human experts. Recent frontier models do so well on MATH and GSM8K that these benchmarks are no longer effective at differentiating models. We evaluated math performance on AIME, an exam designed to challenge the brightest high school math students in America. On the 2024 AIME exams, GPT-4o only solved on average 12% (1.8/15) of problems. o1 averaged 74% (11.1/15) with a single sample per problem, 83% (12.5/15) with consensus among 64 samples, and 93% (13.9/15) when re-ranking 1000 samples with a learned scoring function. A score of 13.9 places it among the top 500 students nationally and above the cutoff for the USA Mathematical Olympiad.
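To make the "consensus among 64 samples" setting concrete, one simple reading is majority voting over the final answers of independently sampled solutions. The sketch below illustrates that idea; the function name and sample data are illustrative assumptions, not OpenAI's actual evaluation code.

```python
from collections import Counter

def consensus_answer(sampled_answers):
    """Return the most frequent final answer among independently sampled solutions
    (a majority-vote reading of 'consensus among 64 samples')."""
    answer, _ = Counter(sampled_answers).most_common(1)[0]
    return answer

# Hypothetical usage with a handful of sampled final answers to one AIME problem:
samples = ["204", "204", "197", "204", "210", "204"]
print(consensus_answer(samples))  # -> "204"
```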
We also evaluated o1 on GPQA diamond, a difficult intelligence benchmark which tests for expertise in chemistry, physics and biology. In order to compare models to humans, we recruited experts with PhDs to answer GPQA-diamond questions. We found that o1 surpassed the performance of those human experts, becoming the first model to do so on this benchmark. These results do not imply that o1 is more capable than a PhD in all respects — only that the model is more proficient in solving some problems that a PhD would be expected to solve. On several other ML benchmarks, o1 improved over the state-of-the-art. With its vision perception capabilities enabled, o1 scored 78.2% on MMMU, making it the first model to be competitive with human experts. It also outperformed GPT-4o on 54 out of 57 MMLU subcategories.

Chain of Thought



Similar to how a human may think for a long time before responding to a difficult question, o1 uses a chain of thought when attempting to solve a problem. Through reinforcement learning, o1 learns to hone its chain of thought and refine the strategies it uses. It learns to recognize and correct its mistakes. It learns to break down tricky steps into simpler ones. It learns to try a different approach when the current one isn’t working. This process dramatically improves the model’s ability to reason. To illustrate this leap forward, we showcase the chain of thought from o1-preview on several difficult problems below.


Coding
We trained a model that scored 213 points and ranked in the 49th percentile in the 2024 International Olympiad in Informatics (IOI), by initializing from o1 and training to further improve programming skills. This model competed in the 2024 IOI under the same conditions as the human contestants. It had ten hours to solve six challenging algorithmic problems and was allowed 50 submissions per problem.
For each problem, our system sampled many candidate submissions and submitted 50 of them based on a test-time selection strategy. Submissions were selected based on performance on the IOI public test cases, model-generated test cases, and a learned scoring function. If we had instead submitted at random, we would have only scored 156 points on average, suggesting that this strategy was worth nearly 60 points under competition constraints.
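As a rough sketch of what such a test-time selection strategy could look like, the snippet below ranks candidate programs by a weighted mix of public-test pass rate, model-generated-test pass rate, and a learned score, then keeps the top 50. The weights, field names, and data format are assumptions made for illustration; the post does not publish the actual selection code.

```python
def select_submissions(candidates, k=50):
    """candidates: list of dicts such as
       {"code": "...", "public_pass": 0.8, "generated_pass": 0.6, "learned_score": 0.7}.
    Rank by a weighted combination of the three signals and keep the top k."""
    def score(c):
        # Placeholder weights; any real weighting used in the evaluation is not disclosed.
        return 0.5 * c["public_pass"] + 0.3 * c["generated_pass"] + 0.2 * c["learned_score"]
    return sorted(candidates, key=score, reverse=True)[:k]
```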
With a relaxed submission constraint, we found that model performance improved significantly. When allowed 10,000 submissions per problem, the model achieved a score of 362.14 – above the gold medal threshold – even without any test-time selection strategy.
Finally, we simulated competitive programming contests hosted by Codeforces to demonstrate this model's coding skill. Our evaluations closely matched competition rules and allowed for 10 submissions. GPT-4o achieved an Elo rating of 808, which is in the 11th percentile of human competitors. This model far exceeded both GPT-4o and o1: it achieved an Elo rating of 1807, performing better than 93% of competitors.
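For intuition about what a gap of roughly 1,000 Elo points implies, the standard Elo expected-score formula can be applied to the two ratings reported above. The snippet below is just that textbook formula, not part of OpenAI's evaluation pipeline.

```python
def elo_expected_score(r_a, r_b):
    """Standard Elo formula: expected head-to-head score of a player rated r_a
    against a player rated r_b."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

# Ratings reported above: the fine-tuned model at 1807 vs GPT-4o at 808.
print(round(elo_expected_score(1807, 808), 3))  # ~0.997
```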



Human preference evaluation



In addition to exams and academic benchmarks, we also evaluated human preference of o1-preview vs GPT-4o on challenging, open-ended prompts in a broad spectrum of domains. In this evaluation, human trainers were shown anonymized responses to a prompt from o1-preview and GPT-4o, and voted for which response they preferred. o1-preview is preferred to gpt-4o by a large margin in reasoning-heavy categories like data analysis, coding, and math. However, o1-preview is not preferred on some natural language tasks, suggesting that it is not well-suited for all use cases.
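A minimal way to turn such pairwise votes into a win rate is sketched below; the labels and the tie-handling choice are assumptions, since the exact aggregation protocol is not described in the post.

```python
def win_rate(votes, model="o1-preview"):
    """votes: list of labels, each 'o1-preview', 'gpt-4o', or 'tie', from pairwise comparisons.
    Returns the fraction of non-tie votes won by `model`."""
    decisive = [v for v in votes if v != "tie"]
    if not decisive:
        return 0.5  # no decisive votes: treat the comparison as even
    return sum(v == model for v in decisive) / len(decisive)

# Hypothetical tally of four trainer votes on one prompt:
print(win_rate(["o1-preview", "o1-preview", "gpt-4o", "tie"]))  # ~0.667
```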




Safety



Chain of thought reasoning provides new opportunities for alignment and safety. We found that integrating our policies for model behavior into the chain of thought of a reasoning model is an effective way to robustly teach human values and principles. By teaching the model our safety rules and how to reason about them in context, we found evidence of reasoning capability directly benefiting model robustness: o1-preview achieved substantially improved performance on key jailbreak evaluations and our hardest internal benchmarks for evaluating our model's safety refusal boundaries. We believe that using a chain of thought offers significant advances for safety and alignment because (1) it enables us to observe the model thinking in a legible way, and (2) the model reasoning about safety rules is more robust to out-of-distribution scenarios.
To stress-test our improvements, we conducted a suite of safety tests and red-teaming before deployment, in accordance with our Preparedness Framework. We found that chain of thought reasoning contributed to capability improvements across our evaluations. Of particular note, we observed interesting instances of reward hacking. Detailed results from these evaluations can be found in the accompanying System Card.


Hiding the Chains of Thought



We believe that a hidden chain of thought presents a unique opportunity for monitoring models. Assuming it is faithful and legible, the hidden chain of thought allows us to "read the mind" of the model and understand its thought process. For example, in the future we may wish to monitor the chain of thought for signs of manipulating the user. However, for this to work the model must have freedom to express its thoughts in unaltered form, so we cannot train any policy compliance or user preferences onto the chain of thought. We also do not want to make an unaligned chain of thought directly visible to users.
Therefore, after weighing multiple factors including user experience, competitive advantage, and the option to pursue the chain of thought monitoring, we have decided not to show the raw chains of thought to users. We acknowledge this decision has disadvantages. We strive to partially make up for it by teaching the model to reproduce any useful ideas from the chain of thought in the answer. For the o1 model series we show a model-generated summary of the chain of thought.



Conclusion



o1 significantly advances the state-of-the-art in AI reasoning. We plan to release improved versions of this model as we continue iterating. We expect these new reasoning capabilities will improve our ability to align models to human values and principles. We believe o1 – and its successors – will unlock many new use cases for AI in science, coding, math, and related fields. We are excited for users and API developers to discover how it can improve their daily work.



▌Source

[1] Learning to Reason with LLMs: openai.com/index/learning-to-reason-with-llms/
