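Editor's note:
QSimulate has also brought QUELO-G to the latest GPU hardware from NVIDIA, including the NVIDIA A100 and H100. The deeply optimized software algorithms make use of conditional nodes, which NVIDIA only recently introduced in CUDA Graphs.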
Quantum Mechanics-Enhanced Drug Discovery Using QUELO-G and CUDA Graphs
In drug discovery, approaches based on classical force fields are routinely used and have proven useful. However, it is also widely recognized that force field models omit some important physics, which limits their applicability.
For example, force field models do not give accurate predictions when comparing two molecules with different formal charges, because the model does not account for protein polarization. Furthermore, these models do not apply to covalent drug molecules, because they cannot describe the formation and breaking of chemical bonds.
Many agree that simulation based on quantum mechanics, the fundamental physics that governs the microscopic world, is the solution to these problems. The obstacle has been that quantum mechanics simulations were thought to be too costly and time-consuming to be practical.
QSimulate recently announced the launch of QUELO-G, which enables quantum mechanics-based free energy perturbation (FEP) simulation with unprecedented throughput. With a throughput of more than 100 nanoseconds per day per GPU card, these simulations complete within a few hours. Directly applying quantum mechanics-based simulation to identify new drug molecules will transform computer-aided drug discovery.
This post explains how QSimulate is addressing this challenge by developing innovative algorithms and implementing them in software that takes advantage of the latest GPU hardware offered by NVIDIA, such as NVIDIA A100 and NVIDIA H100. The tightly optimized software uses some of the features recently made available in the CUDA Toolkit, including conditional graph nodes in CUDA Graphs (introduced in version 12.3 and enhanced in version 12.4).
Throughput challenge for quantum mechanics
An important requirement of simulation for drug discovery is that each energy and force evaluation must complete within a few milliseconds. The reason is that proteins are flexible and change shape over time, so a thermodynamic ensemble must be sampled by molecular dynamics (MD) simulation over multiple nanoseconds to accurately predict properties such as the binding affinity of drug molecules to a protein target.
If the goal is a throughput of 100 nanoseconds of dynamics per day with the standard time step of 2 femtoseconds, the simulation must take 50 million time steps per day, so each time step must complete in about 1.7 milliseconds of wall time. Classical force-field simulations have been demonstrated to achieve this throughput. See, for example, relevant posts about GROMACS and NAMD.
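The per-step budget follows from simple arithmetic. A minimal sketch of the calculation, using only the two figures quoted above (the function name is made up for this illustration):

```python
# Back-of-the-envelope wall-time budget per MD time step.
SECONDS_PER_DAY = 24 * 60 * 60            # 86,400 s
FEMTOSECONDS_PER_NANOSECOND = 1_000_000


def wall_time_budget_ms(ns_per_day: float, timestep_fs: float) -> float:
    """Wall-clock milliseconds available per MD step at a target throughput."""
    steps_per_day = ns_per_day * FEMTOSECONDS_PER_NANOSECOND / timestep_fs
    return SECONDS_PER_DAY / steps_per_day * 1000.0


budget = wall_time_budget_ms(ns_per_day=100.0, timestep_fs=2.0)
print(f"{budget:.3f} ms per step")  # prints "1.728 ms per step"
```

Everything that happens in a time step, including every iteration of every quantum mechanics solver, has to fit inside that window.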
Conventionally, it was believed that it is difficult, if not impossible, to accelerate quantum mechanics simulations to the millisecond-per-time-step regime, even with modern NVIDIA GPU hardware. This is partly why most of the existing implementations of quantum mechanics simulations on GPUs are focused on the acceleration of large calculations (such as coupled cluster calculations and large-scale density functional theory calculations), typically from hours to minutes or from days to hours. To learn more, see GPU-Accelerated Quantum Chemistry and Molecular Dynamics.
While useful, these are not directly applicable to free energy calculations for drug discovery, as the throughput is off by several orders of magnitude. Technically speaking, the throughput challenge for quantum mechanics originates from the fact that quantum mechanics is, in essence, an optimization problem with a large number of parameters that must be solved iteratively.
This inevitably introduces complex control logic and limits the degree of concurrency. When naively implemented, the complex logic for quantum mechanics results in frequent device-to-host communication, which lowers utilization of the GPU.
QUELO-G has overcome this challenge using CUDA Graphs in conjunction with a tight-binding quantum mechanics approach (GFN-xTB) and the hybrid quantum mechanics/molecular mechanics (QM/MM) scheme.
Technical solutions using CUDA Graphs and conditional nodes
Consider the Krylov subspace algorithm as an example of iterative quantum mechanics algorithms. This approach constructs a subspace consisting of trial vectors, in which a new trial vector is generated from the best solution from the previous iteration. The process is iterated until convergence criteria are met. Figure 1 shows the schematic representation of the algorithm.
Figure 1. A standard approach to implementing iterative solvers on GPU with a control code on CPU (left) compared to an approach based on CUDA Graphs using the conditional graph node feature (right)
The standard way to implement this algorithm is to make the host responsible for the loop and the conditional branch while offloading all other parts to the device. The trial vectors (and hence the subspace) can be kept in the device memory. However, the residual error, which is a scalar value, needs to be communicated from the device to the host in each iteration for the control logic performed on the host.
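The host-controlled pattern can be sketched in plain Python. The example below swaps in conjugate gradient, a simple Krylov subspace method for linear systems, as a stand-in; it is not the solver used in QUELO-G, and the matrix and function names are illustrative assumptions. The point is the shape of the control flow: the vector work would run as device kernels, while the `if` on the scalar residual is the step that forces one device-to-host transfer per iteration.

```python
from typing import Callable, List, Tuple

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def conjugate_gradient(
    apply_a: Callable[[Vector], Vector],
    b: Vector,
    tol: float = 1e-10,
    max_iter: int = 100,
) -> Tuple[Vector, int]:
    """Solve A x = b for a symmetric positive-definite A.

    Returns the solution and the number of scalar residual checks made by
    the control loop; on a GPU each check would be a device-to-host copy.
    """
    x = [0.0] * len(b)
    r = list(b)                  # residual b - A x for the initial guess x = 0
    p = list(r)                  # first search direction
    rr = dot(r, r)
    checks = 0
    for _ in range(max_iter):
        checks += 1
        if rr ** 0.5 < tol:      # control logic: the host needs this scalar
            break
        ap = apply_a(p)          # would be a device kernel
        alpha = rr / dot(p, ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, ap)]
        rr_new = dot(r, r)
        p = [ri + (rr_new / rr) * pi for ri, pi in zip(r, p)]
        rr = rr_new
    return x, checks


# Small symmetric positive-definite matrix (illustrative values only).
A = [[4.0, 1.0, 0.0],
     [1.0, 3.0, 1.0],
     [0.0, 1.0, 2.0]]


def apply_a(v: Vector) -> Vector:
    return [dot(row, v) for row in A]


solution, residual_checks = conjugate_gradient(apply_a, [1.0, 2.0, 3.0])
print(solution, residual_checks)
```

In this layout `residual_checks` counts how many times the host had to look at a value produced by the device, which is the cost that matters at millisecond time scales.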
The latency of device-to-host communication is far from negligible, even for a scalar variable. Because the transfer is repeated in every iteration of the underlying algorithm, it becomes a significant part of the overall cost of quantum mechanics-based dynamics simulation when targeting millisecond-per-time-step throughput.
The software implementation in QUELO-G uses conditional graph nodes, available in CUDA Graphs since CUDA 12.3. Using this feature, the entire iterative procedure in quantum mechanics algorithms can be mapped to CUDA Graphs, in which the loop and conditional branch are performed on the device (Figure 2). This eliminates the need to communicate a scalar from the device to the host in each iteration. This approach not only enables significant performance increases, but also simplifies the code structure thanks to better abstraction.
In addition, by abstracting the code using CUDA Graphs, the software will automatically benefit from further development and optimization of the CUDA Graphs runtime for existing NVIDIA A100 and NVIDIA H100 GPUs, and for future NVIDIA GPUs.
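The effect of moving the loop onto the device can be illustrated with a deliberately crude cost model. Every number below is a hypothetical placeholder chosen for illustration, not a measurement from QUELO-G or from any GPU, and the model ignores any per-iteration overhead of the graph itself. It only shows how a fixed per-iteration round trip adds up against a per-step budget of under 2 milliseconds.

```python
def step_time_host_loop_us(n_iter: int, kernel_us: float, sync_us: float) -> float:
    """Host-controlled loop: each iteration pays kernel time plus one
    device-to-host round trip for the residual."""
    return n_iter * (kernel_us + sync_us)


def step_time_device_loop_us(n_iter: int, kernel_us: float, launch_us: float) -> float:
    """Loop inside a graph: one launch, then iterations run on the device."""
    return launch_us + n_iter * kernel_us


# Hypothetical placeholder values (microseconds), NOT measurements.
N_ITER = 20        # solver iterations per time step
KERNEL_US = 30.0   # device work per iteration
SYNC_US = 15.0     # device-to-host round trip per iteration
LAUNCH_US = 10.0   # one-off graph launch cost

host = step_time_host_loop_us(N_ITER, KERNEL_US, SYNC_US)
device = step_time_device_loop_us(N_ITER, KERNEL_US, LAUNCH_US)
print(f"host loop: {host:.0f} us, device loop: {device:.0f} us")
# prints "host loop: 900 us, device loop: 610 us"
```

With these placeholder inputs the round trips alone account for 300 of the 900 microseconds, which is why removing them matters more as the per-step budget shrinks.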
Quantum mechanics-based free energy perturbation simulation performance
Using the implementation strategy above, QM/MM dynamics throughput of more than 100 nanoseconds per day has been achieved. Because QSimulate’s production platform is designed to minimize the cost per simulation for commercial reasons, it typically uses Multi-Instance GPU (MIG) to run seven concurrent dynamics simulations on a single GPU. Figure 3 compiles the throughput measured on A100 and H100 GPUs for a protein-ligand system consisting of about 25,000 classical atoms in a unit cell.
Already with an A100, the QM/MM MD throughput for simulations with a quantum mechanics region of 74 atoms, a typical size for small-molecule drugs, surpassed 100 nanoseconds per day. With one H100, the QM/MM MD and FEP throughputs were measured to be 120 and 90 nanoseconds per day, respectively. With the larger quantum mechanics region of 200 atoms, a greater speedup of more than 50% was observed between A100 and H100. Note that these timings can be further improved by future innovation and software optimization.
These results are compared against an optimized CPU implementation previously developed by the QSimulate team in collaboration with the Sugita group at RIKEN. Figure 4 shows the results. The four columns in Figure 4 correspond to the four quantum mechanics regions in Figure 3. The underlying algorithms in the CPU and GPU implementations are identical, including cutoffs, thresholds, and parameters. Because the CPU code is optimized as tightly as the GPU implementation, Figure 4 represents a fair comparison between different hardware architectures.
The CPU timing benchmarks were measured with four CPU cores of Intel Xeon 8375C (Ice Lake), and the results were converted to those for one CPU core for simplicity. Note that QSimulate’s production platform that targets CPU hardware uses four CPU cores, as in this timing benchmark, mainly because it is difficult to scale the CPU simulation to a larger number of threads and CPU cores.
Even with the smallest quantum mechanics region of 74 atoms, the ratio of the throughput on one GPU card to that on one CPU core was about 140 for the A100 and about 170 for the H100. As the quantum mechanics region becomes larger, the ratio grows. With 200 quantum mechanics atoms, the throughput of one H100 card equals that of 265 CPU cores.
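The per-core normalization can be turned around to estimate what the CPU numbers look like in absolute terms. The sketch below rests on two assumptions that the post does not confirm: that the ratio of about 170 was computed for the same QM/MM MD workload as the 120 nanoseconds per day H100 figure, and that the four-core result was converted to one core by linear scaling. The function name is made up for this illustration.

```python
def implied_cpu_core_throughput(gpu_ns_per_day: float, gpu_to_core_ratio: float) -> float:
    """CPU per-core throughput implied by a GPU figure and a GPU-to-core ratio."""
    return gpu_ns_per_day / gpu_to_core_ratio


# Figures quoted in the post: 120 ns/day QM/MM MD on one H100, ratio of about 170.
# Assumption: the ratio refers to that same MD workload.
cpu_core = implied_cpu_core_throughput(120.0, 170.0)

# Assumption: linear scaling between one core and the four-core benchmark.
four_core = cpu_core * 4

print(f"{cpu_core:.2f} ns/day per core, {four_core:.2f} ns/day on four cores")
# prints "0.71 ns/day per core, 2.82 ns/day on four cores"
```

Under these assumptions, a four-core CPU run would need more than a month to cover the 100 nanoseconds that a single GPU covers in about a day.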
Simulation in the era of AI
As AI is transforming society, physics-based simulation is more important than ever. Digital transformation of complex workflows, like those in drug discovery, requires AI predictions to be quickly and reliably assessed by accurate simulation approaches, which then feed more data to the AI models. The advancement summarized in this post (the first-ever quantum mechanics-based simulation in the millisecond-per-time-step regime) shows how QSimulate’s domain knowledge and software engineering, together with NVIDIA hardware and software, including the latest CUDA Toolkit and software stack, have helped make a leap toward the next paradigm of digital discovery.
To learn more and get started with GPU-accelerated quantum mechanics simulation for drug discovery, visit the QUELO-G product page.