DeepSeek FAQ - Stratechery's In-Depth Column on the Impact and Significance of DeepSeek R1 (Full Translation)


 



So what did DeepSeek announce?


The most proximate announcement to this weekend’s meltdown was R1, a reasoning model that is similar to OpenAI’s o1. However, many of the revelations that contributed to the meltdown — including DeepSeek’s training costs — actually accompanied the V3 announcement over Christmas. Moreover, many of the breakthroughs that undergirded V3 were actually revealed with the release of the V2 model last January.


Is this model naming convention the greatest crime that OpenAI has committed?


Second greatest; we’ll get to the greatest momentarily.


Let’s work backwards: what was the V2 model, and why was it important?


The DeepSeek-V2 model introduced two important breakthroughs: DeepSeekMoE and DeepSeekMLA. The “MoE” in DeepSeekMoE refers to “mixture of experts”. Some models, like GPT-3.5, activate the entire model during both training and inference; it turns out, however, that not every part of the model is necessary for the topic at hand. MoE splits the model into multiple “experts” and only activates the ones that are necessary; GPT-4 was a MoE model that was believed to have 16 experts with approximately 110 billion parameters each.
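To make the “only activate the experts you need” idea concrete, here is a minimal, illustrative sketch of top-k expert routing in plain Python. It is not DeepSeek’s implementation (the real thing runs as optimized GPU kernels with a learned router); the expert count, dimensions, and weights below are toy values chosen only to show the mechanism.

```python
import math
import random

# Toy mixture-of-experts (MoE) layer: a router scores every expert for a
# token, but only the top-k experts are actually evaluated, so most of the
# model's parameters are never touched for that token. All sizes here are
# illustrative, not DeepSeek-V3's real configuration.
NUM_EXPERTS = 16  # e.g. GPT-4 was rumored to use 16 experts
TOP_K = 2         # experts activated per token
DIM = 8           # toy hidden dimension

random.seed(0)
ROUTER = [[random.uniform(-1, 1) for _ in range(DIM)] for _ in range(NUM_EXPERTS)]


def router_scores(token):
    # Stand-in for a learned gating network: one score per expert.
    return [sum(w * x for w, x in zip(weights, token)) for weights in ROUTER]


def expert_forward(expert_id, token):
    # Stand-in for one expert's feed-forward network.
    return [x * 0.01 * (expert_id + 1) for x in token]


def moe_layer(token):
    scores = router_scores(token)
    # Keep only the k best-scoring experts; the other experts cost nothing.
    top = sorted(range(NUM_EXPERTS), key=lambda i: scores[i], reverse=True)[:TOP_K]
    gates = [math.exp(scores[i]) for i in top]
    total = sum(gates)
    output = [0.0] * len(token)
    for gate, i in zip(gates, top):
        for j, value in enumerate(expert_forward(i, token)):
            output[j] += (gate / total) * value
    return output


print(moe_layer([random.uniform(-1, 1) for _ in range(DIM)]))
```

Of the 16 toy experts, only 2 do any work per token; scaled up, that is how a model can carry a very large total parameter count while paying for only a small active fraction of it on every token.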


I’m not sure I understood any of that.


The key implications of these breakthroughs — and the part you need to understand — only became apparent with V3, which added a new approach to load balancing (further reducing communications overhead) and multi-token prediction in training (further densifying each training step, again reducing overhead): V3 was shockingly cheap to train. DeepSeek claimed the model training took 2,788 thousand H800 GPU hours, which, at a cost of $2/GPU hour, comes out to only $5.576 million.
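The headline number is just arithmetic on the figures DeepSeek itself reported; here is the quick sanity check (the $2/GPU-hour rental rate is the assumption the V3 paper itself uses, quoted below):

```python
# Sanity check of the reported V3 training cost.
gpu_hours = 2_788_000           # 2,788 thousand H800 GPU hours (V3 paper)
rental_usd_per_gpu_hour = 2.0   # rental price assumed in the V3 paper

print(f"${gpu_hours * rental_usd_per_gpu_hour:,.0f}")  # -> $5,576,000, i.e. ~$5.576 million
```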


That seems impossibly low.


DeepSeek is clear that these costs are only for the final training run, and exclude all other expenses; from the V3 paper:

"Lastly, we emphasize again the economical training costs of DeepSeek-V3, summarized in Table 1, achieved through our optimized co-design of algorithms, frameworks, and hardware. During the pre-training stage, training DeepSeek-V3 on each trillion tokens requires only 180K H800 GPU hours, i.e., 3.7 days on our cluster with 2048 H800 GPUs. Consequently, our pre- training stage is completed in less than two months and costs 2664K GPU hours. Combined with 119K GPU hours for the context length extension and 5K GPU hours for post-training, DeepSeek-V3 costs only 2.788M GPU hours for its full training. Assuming the rental price of the H800 GPU is 5.576M. Note that the aforementioned costs include only the official training of DeepSeek-V3, excluding the costs associated with prior research and ablation experiments on architectures, algorithms, or data."

So no, you cannot replicate DeepSeek the company for just $5.576 million.


I still don’t believe that number.


Actually, the burden of proof is on the doubters, at least once you understand the V3 architecture. Remember that bit about DeepSeekMoE: V3 has 671 billion parameters, but only 37 billion parameters in the active expert are computed per token; this equates to 333.3 billion FLOPs of compute per token. Here I should mention another DeepSeek innovation: while parameters were stored with BF16 or FP32 precision, they were reduced to FP8 precision for calculations; 2048 H800 GPUs have a capacity of 3.97 exaflops, i.e. 3.97 billion billion FLOPS. The training set, meanwhile, consisted of 14.8 trillion tokens; once you do all of the math it becomes apparent that 2.8 million H800 hours is sufficient for training V3. Again, this was just the final run, not the total cost, but it’s a plausible number.
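If you want to redo that math yourself, the sketch below uses only the figures from the paragraph above; the one thing it adds is the implied hardware utilization, and my reading that a figure in the mid-20-percent range is realistic for a large FP8 training run is an assumption, not something DeepSeek states.

```python
# Back-of-envelope check: is 2.788M H800 GPU-hours enough to train V3?
flops_per_token = 333.3e9   # compute for the ~37B active parameters per token
tokens = 14.8e12            # V3 training set size
cluster_flops = 3.97e18     # 2048 H800s at FP8: ~3.97 exaflops peak
num_gpus = 2048
claimed_gpu_hours = 2.788e6

total_flops = flops_per_token * tokens
ideal_gpu_hours = total_flops / cluster_flops / 3600 * num_gpus
print(f"GPU-hours at 100% utilization: {ideal_gpu_hours:,.0f}")                       # ~707,000
print(f"utilization implied by the claim: {ideal_gpu_hours / claimed_gpu_hours:.0%}")  # ~25%
```

In other words, the claimed budget is roughly four times the theoretical minimum, which sits in the range real large-scale training runs plausibly achieve rather than being evidence of impossibility.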


Scale AI CEO Alexandr Wang said they have 50,000 H100s.


I don’t know where Wang got his information; I’m guessing he’s referring to this November 2024 tweet from Dylan Patel, which says that DeepSeek had “over 50k Hopper GPUs”. H800s, however, are Hopper GPUs; they just have much more constrained memory bandwidth than H100s because of U.S. sanctions.


So was this a violation of the chip ban?


Nope. H100s were prohibited by the chip ban, but not H800s. Everyone assumed that training leading edge models required more interchip memory bandwidth, but that is exactly what DeepSeek optimized both their model structure and infrastructure around.


So V3 is a leading edge model?


It’s definitely competitive with OpenAI’s 4o and Anthropic’s Sonnet-3.5, and appears to be better than Llama’s biggest model. What does seem likely is that DeepSeek was able to distill those models to give V3 high quality tokens to train on.


What is distillation?


Distillation is a means of extracting understanding from another model; you can send inputs to the teacher model and record the outputs, and use that to train the student model. This is how you get models like GPT-4 Turbo from GPT-4. Distillation is easier for a company to do on its own models, because they have full access, but you can still do distillation in a somewhat more unwieldy way via API, or even, if you get creative, via chat clients.
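As a concrete picture of the “send inputs to the teacher model and record the outputs” step, here is a minimal sketch that builds a distillation dataset. The teacher_generate function is a hypothetical stand-in for however you reach the teacher (a first-party model you own, an API, or, more awkwardly, a chat client); the resulting JSONL file is what a student model would then be fine-tuned on, a step not shown here.

```python
import json

def teacher_generate(prompt: str) -> str:
    # Hypothetical stand-in for a call to the teacher model
    # (replace with an API call or a locally hosted model).
    return f"<teacher's answer to: {prompt}>"

prompts = [
    "Explain mixture-of-experts routing in two sentences.",
    "Write a Python function that reverses a string.",
]

# Record (input, teacher output) pairs; this file becomes the student's
# supervised fine-tuning data.
with open("distillation_data.jsonl", "w", encoding="utf-8") as f:
    for prompt in prompts:
        record = {"prompt": prompt, "completion": teacher_generate(prompt)}
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```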


Distillation seems terrible for leading edge models.


It is! On the positive side, OpenAI and Anthropic and Google are almost certainly using distillation to optimize the models they use for inference for their consumer-facing apps; on the negative side, they are effectively bearing the entire cost of training the leading edge, while everyone else is free-riding on their investment.


Is this why all of the Big Tech stock prices are down?


In the long run, model commoditization and cheaper inference — which DeepSeek has also demonstrated — is great for Big Tech. A world where Microsoft gets to provide inference to its customers for a fraction of the cost means that Microsoft has to spend less on data centers and GPUs, or, just as likely, sees dramatically higher usage given that inference is so much cheaper. Another big winner is Amazon: AWS has by-and-large failed to make their own quality model, but that doesn’t matter if there are very high quality open source models that they can serve at far lower costs than expected.


I asked why the stock prices are down; you just painted a positive picture!


My picture is of the long run; today is the short run, and it seems likely the market is working through the shock of R1’s existence.


Wait, you haven’t even talked about R1 yet.


R1 is a reasoning model like OpenAI’s o1. It has the ability to think through a problem, producing much higher quality results, particularly in areas like coding, math, and logic (but I repeat myself).


Is this more impressive than V3?


Actually, the reason why I spent so much time on V3 is that that was the model that actually demonstrated a lot of the dynamics that seem to be generating so much surprise and controversy. R1 is notable, however, because o1 stood alone as the only reasoning model on the market, and the clearest sign that OpenAI was the market leader.


How did DeepSeek make R1?


DeepSeek actually made two models: R1 and R1-Zero. I actually think that R1-Zero is the bigger deal; as I noted above, it was my biggest focus in last Tuesday’s Update.


So are we close to AGI?


It definitely seems like it. This also explains why Softbank (and whatever investors Masayoshi Son brings together) would provide the funding for OpenAI that Microsoft will not: the belief that we are reaching a takeoff point where there will in fact be real returns towards being first.


But isn’t R1 now in the lead?


I don’t think so; this has been overstated. R1 is competitive with o1, although there do seem to be some holes in its capability that point towards some amount of distillation from o1-Pro. OpenAI, meanwhile, has demonstrated o3, a far more powerful reasoning model. DeepSeek is absolutely the leader in efficiency, but that is different than being the leader overall.


So why is everyone freaking out?


I think there are multiple factors. First, there is the shock that China has caught up to the leading U.S. labs, despite the widespread assumption that China isn’t as good at software as the U.S. This is probably the biggest thing I missed in my surprise over the reaction. The reality is that China has an extremely proficient software industry generally, and a very good track record in AI model building specifically.


I own Nvidia! Am I screwed?


There are real challenges this news presents to the Nvidia story. Nvidia has two big moats:

  • CUDA is the language of choice for anyone programming these models, and CUDA only works on Nvidia chips.
  • Nvidia has a massive lead in terms of its ability to combine multiple chips together into one large virtual GPU.

So what about the chip ban?


The easiest argument to make is that the importance of the chip ban has only been accentuated given the U.S.’s rapidly evaporating lead in software. Software and know-how can’t be embargoed — we’ve had these debates and realizations before — but chips are physical objects and the U.S. is justified in keeping them away from China.


So you’re not worried about AI doom scenarios?


I definitely understand the concern, and just noted above that we are reaching the stage where AIs are training AIs and learning reasoning on their own. I recognize, though, that there is no stopping this train. More than that, this is exactly why openness is so important: we need more AIs in the world, not an unaccountable board ruling all of us.


Wait, why is China open-sourcing their model?


Well DeepSeek is, to be clear; CEO Liang Wenfeng said in a must-read interview that open source is key to attracting talent:

"In the face of disruptive technologies, moats created by closed source are temporary. Even OpenAI’s closed source approach can’t prevent others from catching up. So we anchor our value in our team — our colleagues grow through this process, accumulate know-how, and form an organization and culture capable of innovation. That’s our moat."


"面对破坏性技术,闭源创造的护城河是暂时的。即使OpenAI的闭源方法也无法阻止他人赶上。因此,我们将我们的价值锚定在团队上——我们的同事通过这个过程成长,积累知道如何,并形成一个能够创新的组织和文化。这就是我们的护城河。”


So is OpenAI screwed?


Not necessarily. ChatGPT made OpenAI the accidental consumer tech company, which is to say a product company; there is a route to building a sustainable consumer business on commoditizable models through some combination of subscriptions and advertisements. And, of course, there is the bet on winning the race to AI take-off.


So this is all pretty depressing, then?


Actually, no. I think that DeepSeek has provided a massive gift to nearly everyone. The biggest winners are consumers and businesses who can anticipate a future of effectively-free AI products and services. Jevons Paradox will rule the day in the long run, and everyone who uses AI will be the biggest winners.


Another set of winners are the big consumer tech companies. A world of free AI is a world where product and distribution matters most, and those companies already won that game; The End of the Beginning was right.


China is also a big winner, in ways that I suspect will only become apparent over time. Not only does the country have access to DeepSeek, but I suspect that DeepSeek’s relative success to America’s leading AI labs will result in a further unleashing of Chinese innovation as they realize they can compete.


That leaves America, and a choice we have to make. We could, for very logical reasons, double down on defensive measures, like massively expanding the chip ban and imposing a permission-based regulatory regime on chips and semiconductor equipment that mirrors the E.U.’s approach to tech; alternatively, we could realize that we have real competition, and actually give ourselves permission to compete. Stop wringing our hands, stop campaigning for regulations — indeed, go the other way, and cut out all of the cruft in our companies that has nothing to do with winning. If we choose to compete we can still win, and, if we do, we will have a Chinese company to thank.


[The End]

 

