451 Research | OCP Global Summit: A large party dominated by GB200 NVL72 architecture

Technology   2024-11-01 11:51   Beijing

OCP Global Summit: A large party dominated by GB200 NVL72 architecture


Analysts - Perkins Liu

Publication date: Tuesday, October 29 2024

Introduction

The Open Compute Project Foundation held its 2024 Global Summit from October 15-17 in San Jose, California. The summit set new records for the third consecutive year, with 7,047 people attending the event, a 60% increase over 2023, marking the event as the biggest yet. Artificial intelligence further fortified its position as the biggest application on the Open Compute Project platform, demonstrated by the dominance of NVIDIA GB200 NVL72 architecture and infrastructure solutions around it.


The Take


NVIDIA Corp.'s GB200 NVL72 architecture emerged as the standout highlight at the 2024 Open Compute Project (OCP) Global Summit, capturing significant attention from attendees. Server vendors prominently featured NVL72 architecture rack solutions at the center of their booths, while infrastructure manufacturers showcased complementary offerings throughout the expo floor. This convergence of focus reflects an unprecedented collaborative effort within the industry to address the increasing rack density challenge driven by AI advances, underscoring high expectations for the GB200 NVL72. As excitement builds around this innovative architecture, the sustainability of this frenzy remains uncertain — only time and the market will tell.



Context

The OCP Foundation was initiated in 2011 by Facebook (now Meta Platforms Inc.) with the mission of using open source and open collaboration to speed up and foster hardware innovation, starting from the core of computing (servers, storage and networking equipment) and extending to the supporting racks and the entire datacenter infrastructure. The inspiration came from Facebook's design and construction of its hyperscale datacenter in Prineville, Oregon, where a small group of engineers gathered in 2009 and spent the next two years designing and building the datacenter from the ground up: software, servers, racks, power supplies and cooling.


The datacenter was 38% more energy efficient to build and 24% less expensive to run than the company's previous facilities. Although it started with hyperscale datacenters, OCP has since expanded, bringing the collaboration model to non-hyperscale and edge datacenters and further to the telecom industry.


OCP has more than 400 member companies and roughly 7,000 people actively participating in its discussions, which are open to all. The OCP marketplace lists more than 270 products and more than 400 approved member contributions, including specifications, designs and documents (best practice recommendations, reference architectures, etc.), numbered at roughly 300 as the Summit opened.


The Summit featured 23 dynamic content tracks (19 project-focused and four special focus), with over 610 speakers and more than 425 sessions. Eight OCP project-related stations and seven emerging technology demonstrations were at the Innovation Village. Three colocated events were held between OCP and partners, including the Software for Open Networking in the Cloud (SONiC) Workshop, Memory Fabric Forum and DMTF Manageability Workshop. The event attracted 121 sponsors, and two full halls at the San Jose Convention Center were filled with 100 booths.


The foundation itself has 15 full-time staff, three more than one year ago, with more than 250 volunteers taking leadership roles across the many OCP projects including server, networking, storage, rack and cooling in a datacenter environment and beyond, as well as regional communities.


Major announcements

As usual, a few announcements were made at the Summit. However, none was more significant than NVIDIA's contribution to the OCP of foundational elements of its Blackwell accelerated computing platform design. This includes sharing critical aspects of the NVIDIA GB200 NVL72 system's electro-mechanical design at the OCP Global Summit. The contributions encompass specifications for rack architecture, compute and switch tray mechanics, liquid cooling, thermal environments and NVIDIA NVLink cable cartridge volumetrics. These technical details aim to enhance compute density and networking bandwidth in datacenters.

This system is engineered to deliver impressive computational power, boasting 720 petaflops for training and 1.4 exaflops for inference tasks. To efficiently manage high-density workloads, the GB200 NVL72 employs a fully liquid-cooled design, using coolant temperatures ranging from 45°C (113°F) for inlet to 65°C (149°F) for outlet.
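
To give a feel for what those coolant temperatures imply, the following is a minimal back-of-envelope sketch, not a vendor sizing method. It applies the standard heat balance Q = m_dot * c_p * delta_T to the 45°C inlet and 65°C outlet quoted above and to the 132 kW rack density quoted later in this article. The coolant properties (plain water, round-number density) and the idea that the liquid loop captures all of the rack's heat are assumptions made here for illustration; the function and constant names are hypothetical.

```python
# Rough coolant flow estimate from the heat balance Q = m_dot * c_p * delta_T.
# Assumptions (not from the article): plain water as coolant, all rack heat
# captured by the liquid loop, density of about 1000 kg/m^3.

WATER_SPECIFIC_HEAT_J_PER_KG_K = 4186.0  # assumed; glycol mixes are lower
WATER_DENSITY_KG_PER_M3 = 1000.0         # assumed round figure


def coolant_mass_flow_kg_s(heat_load_w, inlet_c, outlet_c,
                           specific_heat=WATER_SPECIFIC_HEAT_J_PER_KG_K):
    """Mass flow needed to carry heat_load_w across the given temperature rise."""
    delta_t = outlet_c - inlet_c
    if delta_t <= 0:
        raise ValueError("outlet temperature must exceed inlet temperature")
    return heat_load_w / (specific_heat * delta_t)


def to_litres_per_minute(mass_flow_kg_s, density=WATER_DENSITY_KG_PER_M3):
    """Convert a mass flow in kg/s to a volumetric flow in litres per minute."""
    return mass_flow_kg_s / density * 1000.0 * 60.0


if __name__ == "__main__":
    # 132 kW rack, 45 C inlet, 65 C outlet (figures quoted in the article).
    flow = coolant_mass_flow_kg_s(132_000.0, 45.0, 65.0)
    print(f"{flow:.2f} kg/s, about {to_litres_per_minute(flow):.0f} L/min")
```

Under these assumptions the rack needs roughly 1.6 kg/s of coolant, on the order of 95 litres per minute. A real deployment would differ, since glycol mixtures carry less heat per kilogram and some heat is still rejected to air.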


Previously, NVIDIA has made several contributions to the OCP across various hardware generations, including the NVIDIA HGX H100 baseboard design specification. These efforts are intended to provide a wider array of options for computer manufacturers and facilitate the broader adoption of AI technologies. In addition, NVIDIA has expanded the alignment of its Spectrum-X Ethernet networking platform with OCP-developed specifications. This alignment allows organizations to optimize the performance of AI infrastructures that use OCP-recognized equipment, ensuring that existing investments are preserved while maintaining software consistency.



Furthermore, the next-generation NVIDIA ConnectX-8 SuperNIC, part of the Spectrum-X platform, supports OCP Communities' Switch Abstraction Interface and SONiC standards. This enables adaptive routing and telemetry-based congestion control, enhancing Ethernet performance for large-scale AI infrastructure. The ConnectX-8 SuperNICs, capable of delivering up to 800 Gb/s, will be available next year, equipping organizations to develop highly adaptable networking solutions.


Meta is in the process of contributing Catalina, its high-powered rack designed for AI workloads, to the OCP. Built on the NVIDIA Blackwell platform, Catalina emphasizes modularity and flexibility while supporting the NVIDIA GB200 Grace Blackwell Superchip to meet modern AI infrastructure demands. It features the Open Rack v3 (Orv3), a high-power rack (HPR) capable of supporting up to 140 kW, addressing the increasing power needs of GPUs. The liquid-cooled solution includes a power shelf, compute tray, switch tray, Orv3 HPR, Wedge 400 fabric switch, management switch, battery backup unit and rack management controller. Catalina's modular design allows users to customize the rack for specific AI workloads while adhering to both existing and emerging industry standards.


In 2022, Meta introduced Grand Teton, a next-generation AI platform succeeding the Zion-EX platform, to handle the demands of memory-bandwidth-bound workloads. The company has expanded the Grand Teton platform to support the AMD Instinct MI300X and is also contributing this updated version to the OCP.


Other announcements include a strategic alliance between OCP and Ecma International, a leading global standards developing organization dedicated to the open standardization of information and communication systems; a strategic alliance between OCP and the Net Zero Innovation Hub for Data Centers; and the opening of the OCP Chiplet Marketplace to help establish an open chiplet economy.


Collaboration of ecosystem

The GB200 NVL72 features a rack density of 132 kW and requires direct-to-chip liquid cooling as a standard design feature, highlighting the rapid increase in rack density driven by AI advances. From 2010 to 2022, datacenter rack density rose from 4 kW to 12 kW, a steady increase over roughly a decade. While the ecosystem responded with marginal improvements, the recent surge in AI has caused a dramatic spike in rack density, projected to exceed 130 kW in just two years. This rapid change has left the ecosystem unprepared, underscoring the need for collaboration across all levels, from chips to servers, racks and overall datacenter infrastructure. NVIDIA has evidently put considerable effort into this.
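
The contrast between the two periods can be made concrete with a compound annual growth rate. The sketch below is illustrative only: it uses the 4 kW, 12 kW and 130 kW figures from the paragraph above, and it assumes that the two-year jump starts from the 12 kW level, which the article implies but does not state outright. The function name is hypothetical.

```python
def compound_annual_growth_rate(start_value, end_value, years):
    """Return the constant yearly growth rate that turns start_value into end_value."""
    if start_value <= 0 or end_value <= 0 or years <= 0:
        raise ValueError("values and years must be positive")
    return (end_value / start_value) ** (1.0 / years) - 1.0


if __name__ == "__main__":
    # 2010 to 2022: 4 kW to 12 kW over 12 years (figures from the article).
    steady = compound_annual_growth_rate(4.0, 12.0, 12)
    # Assumed baseline: 12 kW growing to 130 kW over two years.
    surge = compound_annual_growth_rate(12.0, 130.0, 2)
    print(f"steady period: {steady:.1%} per year")
    print(f"AI surge:      {surge:.1%} per year")
```

On these assumptions the earlier period works out to a bit under 10% growth per year, while the AI-driven jump corresponds to well over 200% per year, which is the gap the ecosystem is being asked to close.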

At the 2023 summit, server vendors universally showcased liquid-cooling-ready racks capable of handling 70 kW to 100 kW densities with direct-to-chip cooling, each with its own design variations. This year, however, on the same floor, vendors like Super Micro Computer Inc., Asus, Wiwynn Corp., Pegatron Corp., Inventec Corp., Ampere, Aivres, QCT and GIGA-BYTE Technology Co., Ltd. prominently featured GB200 NVL72 solutions at the center of their booths, reflecting strong industry unity. Datacenter infrastructure manufacturers, including Shenzhen Envicool Technology Co. Ltd., Delta Electronics Inc., Rittal, Lite-On Technology Corp., Auras Technology Co. Ltd., CoolIT, JETCOOL Technologies and LiquidStack, also centered their products around the GB200 NVL72 architecture. Vertiv announced its reference design for the GB200 NVL72 platform.


This collective focus indicates high expectations for the GB200 NVL72 and is generating significant market excitement. How long will this frenzy last? Only the market will reveal the outcome.


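Public account statement: This article is the Chinese-language version authorized and approved by 451 Research for DeepKnowledge. It is provided for readers' study and reference only and may not be used for any commercial purpose.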
