Chip-Level Liquid Cooling Technology



Precision cooling at the chip level

January 30, 2024 

By Sebastian Moss



As ADAS feature sets grow richer, more algorithms are involved and data processing consumes considerable compute resources, generating a great deal of heat. Passive cooling can no longer satisfy the thermal demands of such intensive computation, so some manufacturers have adopted liquid cooling, for example in the Autopilot HW 3.0 autonomous-driving controller of the Tesla Model Y.



The module integrates two independent boards in a sandwich arrangement, and both are cooled together by a shared liquid-cooling loop:



AI's energy consumption includes not only the electricity that powers the servers, but also the energy spent cooling the data center.


Recently, a company called ZutaCore demonstrated the industry's first dielectric direct-to-chip liquid-cooling cold plate for NVIDIA GPUs. It is a waterless, direct-to-chip, two-phase liquid cooling system designed for AI and high-performance computing workloads. The company already works with vendors including Intel, Dell, and Rittal, and several more server manufacturers are partnering with ZutaCore to complete qualification and testing on NVIDIA GPU platforms.



Figure | Dielectric direct-to-chip liquid-cooling cold plate (Source: ZutaCore website)


While traditional air-based cooling is gradually being phased out, liquid cooling opens new possibilities for data centers. Water-based schemes, however, consume large volumes of water and still face the challenge of raising energy efficiency while lowering environmental impact.


ZutaCore's HyperCool solution does not rely on water as the cooling medium; it uses a special dielectric fluid instead. The coolant contacts the chips that need cooling directly, absorbing and removing heat more effectively than traditional air cooling or indirect liquid cooling. HyperCool can also recover and reuse the heat the data center produces, achieving what the company describes as 100% heat reuse.


The figure below shows how the HyperCool system operates and how the recovered thermal energy can be reused in schools, offices, and homes.



Figure | Principle of waterless, direct-to-chip dielectric liquid cooling (Source: ZutaCore website)


The HyperCool Dielectric Cold Plate at the core of the system mounts directly on the chip to be cooled. Its waterless dielectric fluid transfers heat well, is electrically non-conductive, and has extremely low global warming potential (GWP) and ozone depletion potential (ODP).


When the dielectric fluid absorbs the heat generated by the chip, it boils into vapor. The HyperCool Heat Rejection Unit then extracts that heat from the vapor; in the process the fluid condenses back to liquid and circulates to the cold plate to absorb heat again.
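
To make the two-phase cycle concrete, here is a back-of-the-envelope energy balance: in a two-phase loop nearly all heat is carried as latent heat of vaporization, so the required coolant flow follows directly from the chip's power. The fluid property below is an assumed placeholder, not a published HyperCool specification.

```python
# Rough energy balance for a two-phase, direct-to-chip cooling loop.
# The latent heat is an illustrative placeholder for a dielectric
# coolant, not ZutaCore's published HyperCool fluid specification.

CHIP_POWER_W = 700.0           # heat load of one H100-class GPU
LATENT_HEAT_J_PER_KG = 100e3   # assumed enthalpy of vaporization

# Nearly all heat is absorbed as latent heat, so mass flow = Q / h_fg:
mass_flow_kg_s = CHIP_POWER_W / LATENT_HEAT_J_PER_KG
print(f"Coolant flow needed: {mass_flow_kg_s * 1000:.0f} g/s per chip")  # ~7 g/s
```

Because the fluid boils at the cold plate rather than merely warming up, the chip surface also sits near the fluid's saturation temperature instead of climbing along the flow path.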

The heat discharged by the heat rejection unit can be recovered through the facility water system and used to warm offices and homes, or to supply a school's heating system, in what the company bills as 100% sustainability.


This direct-to-chip approach is more efficient, using less than half the energy and space of conventional systems. The design aims to recover the data center's waste heat effectively, cutting both wasted energy and environmental impact.


By adopting this cooling technology, data centers can significantly reduce operating costs, especially cooling-system maintenance and energy consumption, for a claimed 50% reduction in total cost of ownership.

Traditional cooling can force throttling or other thermal management as temperatures rise, capping performance. Because HyperCool cools more efficiently, a data center can install more servers and processors and carry heavier workloads without overheating. With temperatures well controlled, processors can run near their design limits for long periods, raising overall compute output.


This not only eliminates water consumption and the risk of leaks; the company says a data center's compute performance could rise as much as tenfold.


Notably, HyperCool lets operators upgrade with almost no changes to existing infrastructure, increasing processing capacity while reducing energy and space use. That suits cloud providers and large enterprises that often need to scale compute quickly.


Meanwhile, each NVIDIA H100 GPU draws up to 700 W, no small challenge for data centers already under pressure on heat, power, and space. HyperCool reportedly cuts cooling energy by 80%, supports GPUs above 1,500 W, and increases rack density by 300%.
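
To put the claimed 80% reduction in context, here is a quick sketch of its effect on Power Usage Effectiveness (PUE); the baseline overheads assumed below are typical illustrative values, not figures from the article:

```python
# Back-of-the-envelope: effect of an 80% cooling-energy cut on PUE.
# The baseline cooling and fixed overheads are assumed illustrative
# values for an air-cooled facility, not numbers from ZutaCore.

IT_LOAD_KW = 1000.0                      # IT equipment power
BASELINE_COOLING_KW = 0.4 * IT_LOAD_KW   # assumed air-cooling overhead
OTHER_OVERHEAD_KW = 0.1 * IT_LOAD_KW     # power distribution, lighting, ...

def pue(cooling_kw: float) -> float:
    """PUE = total facility power / IT power."""
    return (IT_LOAD_KW + cooling_kw + OTHER_OVERHEAD_KW) / IT_LOAD_KW

print(f"Baseline PUE:        {pue(BASELINE_COOLING_KW):.2f}")        # 1.50
print(f"After 80% reduction: {pue(0.2 * BASELINE_COOLING_KW):.2f}")  # 1.18
```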



JetCool's CEO on how thousands of tiny jets could be coming to a data center near you




As chip temperatures and rack densities rise, a plethora of companies have come forward to pitch their vision of the future.




Cooling demands of artificial intelligence and other high-density workloads have outstripped the capabilities of air systems, requiring some form of liquid cooling.




“When you think about the landscape of liquid cooling, we see three different technical categories,” JetCool CEO Bernie Malouin explained.




“There’s single phase immersion, dipping in the oil. And that's interesting, but there are some limitations on chip power - for a long time, they've been stuck at 400W. There are some that are trying to get that a little bit better, but not as much as is needed.”





The second category is two-phase dielectrics: “We see those handling the higher [thermal design point (TDP)] processors, so those can get to 900-1,000W. Those are fit technologically for the future of compute, but they’re held back by the chemicals.”




Many two-phase solutions use perfluoroalkyl substances (PFAS), otherwise known as forever chemicals, which are linked to human health risks, and face restrictions in the US and Europe. Companies like ZutaCore have pledged to shift to other solutions by 2026, but the move has proved slow.




“It’s a concern for a lot of our customers, they're coming to us instead because they're worried about the safety of those fluids,” Malouin said. “They're concerned about the continued availability of those fluids.”




And then there’s the third category: Direct Liquid Cooling (DLC) cold plates. “We’re one of them,” Malouin said. “There are others.”




DLC cold plates are one of the oldest forms of IT liquid cooling - simply shuttling cold liquid to metal plates mounted directly on the hottest components. They have long been used by the high-performance computing community, but JetCool believes that the concept is due for a refresh.




Instead of passing fluid over a surface, its cooling jets route fluid directly at the surface of a chip. “We have these arrays of a thousand tiny fluid jets, and we work directly with the major chipmakers - the Intels, AMDs, Nvidias - and we intelligently landscape these jets to align to where the heat sources are on a given processor.”




Rather than treating the entire chip as a whole with a singular cooling requirement, the microconvective cooling approach “tries to balance the disparate heat loads, disparate thermal requirements of certain parts of that chip stack,” Malouin said.




“When you start thinking about really integrated packages, the cores themselves might be able to run a little higher temperature, but then you might have high bandwidth memory (HBM) sections that aren't as power hungry, but have a lower temperature limit.”





Instead of trying to design for the high-power cores and the temperature-sensitive HBM, each section can be cooled at a slightly different rate. “This allows you to decouple those things and allows you to have precision cooling where you need,” Malouin said.


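A minimal sketch of the zone-aware idea Malouin describes: size the cooling for each region of the package against that region's own heat load and temperature ceiling, rather than designing the whole plate for the worst case. The zones, loads, and simple thermal-resistance model are illustrative assumptions, not JetCool's actual design data.

```python
# Toy model of precision (zone-aware) cooling: each region of a package
# has its own heat load and junction-temperature limit, so each needs a
# different coolant-to-junction thermal resistance. All numbers and the
# 1-D resistance model are illustrative, not JetCool design data.

COOLANT_IN_C = 40.0  # assumed coolant inlet temperature

# (zone name, heat load in W, max junction temperature in C)
zones = [
    ("compute cores", 500.0, 105.0),  # hot, but tolerant of temperature
    ("HBM stacks",    120.0,  85.0),  # lighter load, tighter limit
]

for name, q_w, t_max_c in zones:
    # Required thermal resistance from coolant to junction, R = dT / Q:
    r_required = (t_max_c - COOLANT_IN_C) / q_w
    print(f"{name:13s}: needs R <= {r_required:.3f} C/W")
```

Under these numbers the cores need R <= 0.130 C/W while the HBM only needs R <= 0.375 C/W; a uniform plate would have to hit the tightest figure everywhere, whereas aiming more jets at the cores lets each zone just meet its own target.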


While Malouin believes that facility-level liquid cooling is the future of data centers, the company also has a self-contained system for those looking to dip their toe in cooler waters, with a Dell partnership focused on dual socket deployments.




Two small pumping modules provide the fluid circulation and an air heat exchanger ejects heat at the other end of the Smart Plate system.




"When we add these pumps, you add some electrical draw, but you don't need the fans to be running nearly as hard, so it makes it 15-20 decibels quieter - and in net, we pull out about 100W per server after we've taken the penalty off of the pumps," Malouin claimed.


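Malouin's net figure is easy to frame as a simple power balance; the individual fan and pump numbers below are assumptions chosen to be consistent with his quoted ~100 W, not measured values:

```python
# Power balance behind the "~100 W net per server" claim. The fan and
# pump figures are assumptions consistent with the quoted net number,
# not measured data from JetCool or Dell.

FAN_POWER_BEFORE_W = 150.0  # assumed fan draw with air-only cooling
FAN_POWER_AFTER_W  = 30.0   # assumed fan draw once liquid carries most heat
PUMP_POWER_W       = 20.0   # assumed draw of the two pumping modules

net_savings_w = (FAN_POWER_BEFORE_W - FAN_POWER_AFTER_W) - PUMP_POWER_W
print(f"Net power reduction per server: {net_savings_w:.0f} W")  # 100 W
```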


When you go to 10 racks or more, going to the facility level makes more sense, he said. Asked about the preferred inlet temperature, Malouin said the system was flexible but added, "we actually really like the warm fluids."




He said: "We have facilities today that are feeding us inlet cooling temperatures that are 60°C (140°F) and over. And we're still cooling those devices under full load." That's not common just yet, but Malouin believes that warmer waters will grow in popularity in places like Europe due to the heat reuse potential.


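Those outlet temperatures are what make heat reuse attractive: essentially all IT power leaves the rack as heat, and at around 60°C it is warm enough to feed space heating. A rough sizing, with the per-home demand an assumed illustrative figure:

```python
# Rough sizing of the heat-reuse potential of a warm-water loop.
# The recoverable fraction and per-home heating demand are assumed
# illustrative figures, not numbers from the article.

IT_LOAD_KW = 1000.0         # a 1 MW data hall
RECOVERABLE_FRACTION = 0.9  # assume 90% of the heat is captured by the loop
AVG_HOME_DEMAND_KW = 2.0    # assumed average space-heating demand per home

homes = IT_LOAD_KW * RECOVERABLE_FRACTION / AVG_HOME_DEMAND_KW
print(f"A 1 MW hall could heat roughly {homes:.0f} homes on average")  # ~450
```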



Back in the US, the company is part of the Department of Energy's COOLERCHIPS project, aimed at dramatically advancing data center cooling systems.




The focus of JetCool's $1m+ award is not just on the cooling potential, but a tantalizing secondary benefit: "We have instances where we've made the silicon intrinsically between eight and 10 percent more electrically efficient," Malouin claimed.




"That has nothing to do with the cooling system power usage, but with leakage."




Malouin doesn't mean leakage of the cooling system, but rather the quantum phenomenon of semiconductor leakage currents that can significantly impact a chip's performance.




The recent history of data center cooling has tended to assume that allowing temperatures to rise higher will save energy because less is used in cooling. 




Results, including research by Jon Summers at the Swedish research institute RISE, are finding that leakage currents in the silicon limit the benefits of running hotter.


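The mechanism is straightforward to sketch: subthreshold leakage grows roughly exponentially with junction temperature, so the silicon itself burns more power as the facility runs warmer. The coefficients below are illustrative only, not figures from RISE's research or any specific process node:

```python
# Toy model: leakage power doubling every ~20 C of junction temperature.
# Reference power and doubling interval are illustrative assumptions,
# not figures from RISE or any particular process node.

P_LEAK_REF_W = 50.0  # assumed leakage power at the reference temperature
T_REF_C = 60.0       # reference junction temperature
DOUBLING_C = 20.0    # assumed temperature rise that doubles leakage

def leakage_power_w(t_junction_c: float) -> float:
    return P_LEAK_REF_W * 2 ** ((t_junction_c - T_REF_C) / DOUBLING_C)

for t in (60, 70, 80, 90):
    print(f"Tj = {t} C -> leakage ~ {leakage_power_w(t):.0f} W")
# 60 C -> 50 W, 80 C -> 100 W: the extra silicon power can offset
# what a warmer facility saves on cooling.
```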


"A big part of our COOLERCHIPS endeavor is to substantiate that through more rigorous scientific evidence and extrapolate it to different environments to see where it holds or where it doesn't go."




Looking even further ahead, Malouin sees an opportunity to get deeper into the silicon. “In some cases, it might actually be integrated as an embedded layer within the silicon, and then coupling that to a system that's outside that's doing some heat reuse. When we think about that holistically, we think that there's a real opportunity for a step change in data center efficiency.”




For now, the company says that it is able to support the 900W loads of the biggest Nvidia GPUs and is currently cooling undisclosed ‘bespoke’ chips that use 1,500W.




“Ultimately, you're really going to have to look at liquid cooling if you want to run not just the future of generative AI, but if you want to run the now of generative AI.”






