A few days ago I came across a paper that I found quite interesting; take a look if you are interested. The original is in PDF format, and I have adjusted the placement of the figures within the text. If you want to read the original, you can search for it online.

Huaicheng Li, Daniel S. Berger, Stanko Novakovic, Lisa Hsu, Dan Ernst, Pantea Zardoshti, Monish Shah, Samir Rajadnya, Scott Lee, Ishwar Agarwal, Mark D. Hill, Marcus Fontoura, Ricardo Bianchini

Public cloud providers
seek to meet stringent performance requirements and low hardware cost. A key
driver of performance and cost is main memory. Memory pooling promises to improve
DRAM utilization and thereby reduce costs. However, pooling is challenging
under cloud performance requirements. This paper proposes Pond, the first
memory pooling system that both meets cloud performance goals and significantly
reduces DRAM cost. Pond builds on the Compute Express Link (CXL) standard for
load/store access to pool memory and two key insights. First, our analysis of
cloud production traces shows that pooling across 8-16 sockets is enough to
achieve most of the benefits. This enables a small-pool design with low access
latency. Second, it is possible to create machine learning models that can
accurately predict how much local and pool memory to allocate to a virtual
machine (VM) to resemble same-NUMA-node memory performance. Our evaluation with
158 workloads shows that Pond reduces DRAM costs by 7% with performance within
1-5% of same-NUMA-node VM allocations.

Motivation. Many public cloud
customers deploy their workloads in the form of virtual machines (VMs), for
which they get virtualized compute with performance approaching that of a
dedicated cloud, but without having to manage their own on-premises datacenter.
This creates a major challenge for public cloud providers: achieving excellent
performance for opaque VMs (i.e., providers do not know and should not
inspect what is running inside the VMs) at a competitive hardware cost.

A key driver of both
performance and cost is main memory. The gold standard for memory performance
is for accesses to be served by the same NUMA node as the cores that issue
them, leading to latencies in tens of nanoseconds. A common approach is to
preallocate all VM memory on the same NUMA node as the VM’s cores. Preallocating
and statically pinning memory also facilitate the use of virtualization
accelerators [1–6], which are enabled by
default, for example, on AWS and Azure [7, 8]. At the same time, DRAM
has become a major portion of hardware cost due to its poor scaling properties
with only nascent alternatives [9–15]. For example, DRAM can be 50% of server
cost [16].

Through analysis of
production traces from Azure, we identify memory stranding as a dominant source of
memory waste and a potential source of massive cost savings. Stranding happens
when all cores of a server are rented (i.e., allocated to customer
VMs) but unallocated memory capacity remains and cannot be rented. We find that
up to 25% of DRAM becomes stranded as more cores become allocated to VMs.

Limitations of the state
of the art. Despite
this significant amount of stranding, reducing DRAM usage in the public cloud
is challenging due to its stringent performance requirements. For example,
existing techniques for process-level memory compression [17, 18] require page fault
handling, which adds microseconds of latency, and moving away from statically
preallocated memory.

Pooling memory via memory
disaggregation is a promising approach because stranded memory can be returned to
the disaggregated pool and used by other servers. Unfortunately, existing
pooling systems also have microsecond access latencies and require page faults
[1, 19–24] or changes to the VM
guest [17, 21–23, 25–38].

Our work. This work describes Pond, the
first system to achieve both same-NUMA-node memory performance and competitive
cost for public cloud platforms. To achieve this, Pond combines hardware and
systems techniques. It relies on the Compute Express Link (CXL) interconnect
standard [39], which enables cacheable
load/store (ld/st) accesses to pooled
memory on Intel, AMD, and ARM processors [40–42] at nanosecond-scale latencies.
CXL access via loads/stores is a game changer as it allows memory to remain
statically preallocated while physically being located in a shared pool. However,
even with loads/stores, CXL accesses still face higher latencies than
same-NUMA-node accesses. Pond introduces systems support for CXL-based pooling
that essentially eliminates the impact of this higher latency.

Pond is feasible because
of four key insights. First, by analyzing traces from 100 production clusters
at Azure, we find that pool sizes between 8-16 sockets lead to sufficient DRAM
savings. The pool size defines the number of CPU sockets able to use pool
memory. Further, analysis of CXL topologies leads us to estimate that CXL will add
70-90ns to access latencies over same-NUMA-node DRAM with a pool size of 8-16
sockets, and add more than 180ns for rack-scale pooling. We conclude that
grouping 8 dual-socket (or 16 single-socket) servers is enough to achieve most of
the benefits of pooling.

Second, by emulating
either 64ns or 140ns of memory access overheads, we find that 43% and 37% of
158 workloads are within 5% of the performance on same-NUMA-node DRAM when
entirely allocated in pool memory. However, more than 21% of workloads suffer a
performance loss above 25%. This emphasizes the need for small pools and shows
the challenge with achieving same-NUMA-node performance. This characterization
also allows us to train a machine learning (ML) model that can identify a subset
of insensitive workloads ahead of time to be allocated on the Pond memory pool.

Third, we observe through
measurements at Azure that roughly 50%
of all VMs touch less than 50% of their rented memory. Conceptually, allocating
untouched memory from the pool should not have any performance impact even for
latency-sensitive VMs. We find that — while this concept does not hold for the
uniform address spaces assumed in prior work [1, 19–24] — it does hold if we
expose pool memory to a VM’s guest OS as a zero-core virtual NUMA (zNUMA) node, i.e., a node with memory but
no cores, like Linux’s CPU-less NUMA [43]. Our experiments show zNUMA effectively
biases memory allocations away from the zNUMA node. Thus, a VM with a zNUMA
sized to match its untouched memory will indeed not see any performance impact.

Fourth, Pond can allocate
CXL memory with same-NUMA-node performance using correct predictions of a) whether a VM will be
latency-sensitive and b) a VM’s amount of untouched memory. For
incorrect predictions, Pond introduces a novel monitoring system that detects
poor memory performance and triggers a mitigation that migrates the VM to use
only same-NUMA-node memory. Further, we find that all inputs to train and run
Pond’s ML models can be obtained from existing hardware telemetry with no
measurable overhead.

Artifacts. CXL is still a year from
broad deployment. Meanwhile, deploying Pond requires extensive testing within
Azure’s system software and distributed software stack. We implement Pond on
top of an emulation layer that is deployed on production servers. This allows
us to prove the key concepts behind Pond by exercising the VM allocation
workflow, zNUMA, and by measuring guest performance. Additionally, we support
the four insights from above by reporting from extensive experiments and
measurements in Azure’s datacenters. We evaluate the effectiveness of pooling
using simulations based on VM traces from 100 production clusters.

Contributions. Our main contributions are:

• The first public characterization of memory stranding and untouched memory at a large public cloud provider.
• The first analysis of the effectiveness and latency of different CXL memory pool sizes.
• Pond, the first CXL-based full-stack memory pool that is practical and performant for cloud deployment.
• An accurate prediction model for latency and resource management at datacenter scale. These models enable a configurable performance slowdown of 1-5%.
• An extensive evaluation that validates Pond's design including the performance of zNUMA and our prediction models in a production setting. Our analysis shows that we can reduce DRAM needs by 7% with a Pond pool spanning 16 sockets, which corresponds to hundreds of millions of dollars for a large cloud provider.

Hypervisor memory
management. Public
cloud workloads are virtualized [44]. To maximize performance and minimize
overheads, hypervisors perform minimal memory management and rely on
virtualization accelerators to improve I/O performance [1, 45–47]. Examples of common
accelerators are direct I/O device assignment (DDA) [1, 45] and Single Root I/O
Virtualization (SRIOV) [46, 47]. Accelerated networking is enabled by
default on AWS and Azure [7, 8]. As pointed out in prior work,
virtualization acceleration requires statically preallocating (or “pinning”) a
VM’s entire address space [1–6].

Memory stranding. Cloud VMs demand a vector
of resources (e.g., CPUs, memory, etc.) [48–51]. Scheduling VMs thus
leads to a multi-dimensional bin-packing problem [49, 52–54] which is complicated by
constraints such as spreading VMs across multiple failure domains.
Consequently, it is difficult to provision servers that closely match the
resource demands of the incoming VM mix. When the DRAM-to-core ratio of VM
arrivals and the server resources do not match, tight packing becomes more
difficult. We define a resource as stranded when it is technically
available to be rented to a customer, but is practically unavailable because some
other resource has been exhausted. The typical scenario for memory stranding is that all cores have
been rented, but there is still memory available in the server.

Reducing stranding. Multiple techniques can reduce stranding; for example, oversubscribing
cores [55, 56] enables more memory to
be rented. However, oversubscription only applies to a subset of VMs for
performance reasons. Our measurements at Azure (§3.1) include clusters that
enable oversubscription and still show significant memory stranding.

The approach we target is
to disaggregate a portion of memory into a pool that is accessible by multiple hosts
[31, 57, 58]. This breaks the fixed
hardware configuration of servers. By dynamically reassigning memory to
different hosts at different times, we can shift memory resources to where they
are needed, instead of relying on each individual server to be configured for
all cases pessimistically. Thus, we can provision servers close to the average
DRAM-to-core ratios and tackle deviations via the memory pool.

Pooling via CXL. CXL contains multiple
protocols including ld/st memory semantics (CXL.mem)
and I/O semantics (CXL.io). CXL.mem maps device memory to the system address
space. Last-level cache (LLC) misses to CXL memory addresses translate into
requests on a CXL port whose responses bring the missing cachelines (Figure 1). Similarly, LLC
write-backs translate into CXL data writes. Neither action involves page faults
or DMAs. CXL memory is virtualized using hypervisor page tables and the
memory-management unit and is thus compatible with virtualization acceleration.
The CXL.io protocol facilitates device discovery and configuration. CXL 1.1 targets
directly-attached devices, 2.0 [59, 60] adds switch-based pooling, and 3.0 [61, 62] standardizes switch-less
pooling (§4) and higher bandwidth.

CXL.mem uses PCIe’s electrical
interface with custom link and transaction layers for low latency. With PCIe
5.0, the bandwidth of a bidirectional ×8 CXL
port at a typical 2:1 read:write-ratio matches a DDR5-4800 channel. CXL request
latencies are largely determined by the CXL port. Intel measures round-trip CXL
port traversals at 25ns [63] which, when combined with expected
controller-side latencies, leads to an end-to-end overhead of 70ns for CXL reads,
compared to NUMA-local DRAM reads. While FPGA-based prototypes report higher
latency [64, 65], Intel’s measurements
match industry expectations for ASIC-based memory controllers [62–64].

3. Memory Stranding & Workload Sensitivity to Memory Latency

This section quantifies
the severity of memory stranding and untouched memory at Azure using production
data.

Dataset. We measure stranding in
100 cloud clusters over a 75-day period. These clusters host mainstream first-party
and third-party VM workloads. They are representative of the majority of the
server fleet. We select clusters with similar deployment years, but spanning
all major regions on the planet. A trace from each cluster contains millions of
per-VM arrival/departure events, with the time, duration, resource demands, and
server-id.

Memory stranding. Figure 2a shows the daily average amount
of stranded DRAM across clusters, bucketed by the percentage of scheduled CPU
cores. In clusters where 75% of CPU cores are scheduled for VMs, 6% of memory is
stranded. This grows to over 10% when roughly 85%
of CPU cores are allocated to VMs. This makes sense since stranding is an
artifact of highly utilized nodes, which correlates with highly utilized
clusters. Outliers are shown by the error bars, representing the 5th and 95th percentiles.
At the 95th percentile, stranding reaches 25%
during high utilization periods. Individual outliers even reach 30% stranding.

Figure 2b shows stranding over
time across 8 racks. A workload change (around day 36) suddenly increased stranding
significantly. Furthermore, stranding can affect many racks concurrently (e.g., racks 2, 4–7) and it is
generally hard to predict which clusters/racks will have stranded memory.

NUMA spanning. Many VMs are small and can
fit on a single socket. On two-socket systems, the hypervisor at Azure seeks to
schedule such that VMs fit entirely (cores and memory) on a single NUMA node.
In rare cases, we see NUMA spanning where a VM has all of its
cores on one socket and a small amount of memory from another socket. We find
that spanning occurs for about 2-3% of VMs and fewer than 1% of memory pages,
on average.

Savings from pooling. Azure currently does not
pool memory. However, by analyzing its VM-to-server traces, we can estimate the
amount of DRAM that could be saved via pooling. Figure 3 presents average
reductions from pooling DRAM when VMs are scheduled with a fixed percentage of
either 10%, 30%, or 50% of pool DRAM. The pool size refers to the number of
sockets that can access the same DRAM pool. As the pool size increases, the
figure shows that required overall DRAM decreases. However, this effect
diminishes for larger pools. For example, with a fixed 50% pool DRAM, a pool
with 32 sockets saves 12% of DRAM while a pool with 64 sockets saves 13% of
DRAM. Note that allocating a fixed 50% of memory to pool DRAM leads to significant
performance loss compared to socket-local DRAM (§6). Pond overcomes this
challenge with multiple techniques (§4).

Summary and implications. From this analysis, we draw
a few important observations and implications for Pond:

• We observe 3-27% of stranded memory in production at the 95th percentile, with some outliers at 36%.
• Almost all VMs fit into one NUMA node.
• Pooling memory across 16-32 sockets can reduce cluster memory demand by 10%. This suggests that memory pooling can produce significant cost reductions but assumes that a high percentage of DRAM can be allocated on memory pools. When implementing DRAM pools with cross-NUMA latencies, providers must carefully mitigate potential performance impacts.

3.2. VM Memory Usage at Azure

We use Pond’s telemetry on
opaque VMs (§4.2) to characterize the
percentage of untouched memory across our cloud clusters. Generally, we find
that while VM memory usage varies across clusters, all clusters have a
significant fraction of VMs with untouched memory. Overall, the 50th percentile
is 50% untouched memory.

Summary and implications. From this analysis, we draw
key observations and implications for Pond:

• VM memory usage varies widely.
• In the cluster with the least amount of untouched memory, still over 50% of VMs have more than 20% untouched memory. Thus, there is plenty of untouched memory that can be disaggregated at no performance penalty.
• The challenges are (1) predicting how much untouched memory a VM is likely to have and (2) confining the VM’s accesses to local memory. Pond addresses both.

3.3. Workload Sensitivity to Memory Latency

To characterize the
performance impact of CXL latency for typical workloads in Azure’s datacenters,
we evaluate 158 workloads under two scenarios of emulated CXL access latencies:
182% and 222% increase in memory latency, respectively. We then compare the
workload performance to NUMA-local memory placement. Experimental details are
in §6.1. Figures 4 and 5 show workload slowdowns
relative to NUMA-local performance for both scenarios.

Under a 182% increase in
memory latency, we find that 26% of the 158 workloads experience less than 1%
slowdown under CXL. An additional 17% of workloads see less than 5% slowdowns.
At the same time, some workloads are severely affected with 21% of the
workloads facing >25% slowdowns.

Different workload classes
are affected differently, e.g., GAPBS (graph processing) workloads
generally see higher slowdowns. However, the variability within each workload
class is typically much higher than across workload classes. For example,
within GAPBS even the same graph kernel reacts very differently to CXL latency,
based on different graph datasets. Overall, every workload class has at least
one workload with less than 5% slowdown and one workload with more than 25%
slowdown (except SPLASH2x).

Azure’s proprietary
workloads are less impacted than the overall workload set. Of the 13 production
workloads, 6 do not see noticeable impact (<1%); 2 see roughly 5% slowdown; and the remaining 5 are impacted
by 10–28%. This is in part because these production workloads are NUMA-aware
and often include data placement optimizations.

Under a 222% increase in
memory latency, we find that 23% of the 158 workloads experience less than 1%
slowdown under CXL. An additional 14% of workloads see less than 5% slowdowns.
More than 37% of workloads face >25%
slowdowns. Generally, we find that higher latency magnifies the effects seen
under lower latency: workloads performing well under 182% latency also tend to
perform well under 222% latency; workloads severely affected by 182% are even
more affected by 222%.

Summary and implications. While the performance of
some workloads is insensitive to disaggregated memory latency, some are heavily
impacted. This motivates our design decision to include socket-local DRAM
alongside pool DRAM to mitigate CXL latency impact for those latency-sensitive
workloads. Memory pooling solutions can only be effective if they are effective
at identifying sensitive workloads.

Our measurements and
observations at Azure (§2–3) lead us to define the following design goals.

G1 Performance comparable to NUMA-local DRAM
G2 Compatibility with virtualization accelerators
G3 Compatibility with opaque VMs and unchanged guest OSes/applications
G4 Low host resource overhead

To quantify (G1), we
define a performance
degradation margin (PDM) for
a given workload as the allowable slowdown relative to running the workload
entirely on NUMA-local DRAM. Pond seeks to achieve a configurable PDM, e.g., 1%, for a configurable
tail-percentage (TP) of VMs, e.g., 98% (§3.1). To achieve this high
performance, Pond uses a small but fast CXL pool (§4.1). As Pond’s memory
savings come from pooling instead of oversubscription, Pond must minimize pool
fragmentation and wastage in its system software layer (§4.2). To achieve (G2), Pond
preallocates local and pool memory at VM start. Pond decides this allocation in
its allocation, performance monitoring, and mitigation pipeline (§4.3). This pipeline uses
novel prediction models to achieve the PDM (§4.4). Finally, Pond overcomes VM-opaqueness (G3)
and host-overheads (G4) using lightweight hardware counter telemetry.

Hosts within a Pond pool
have separate cache coherency domains and run separate hypervisors. Pond uses
an ownership model where pool memory is explicitly moved among hosts. A new
external memory controller (EMC) ASIC implements the pool using multiple DDR5
channels accessed through a collection of CXL ports running at PCIe 5 speeds.

EMC memory management. The EMC offers multiple CXL
ports and appears to each host as a single logical memory device [59, 60]. In CXL 3.0 [61, 62], this configuration is
standardized as a multi-headed device (MHD) [62, §2.5]. The EMC exposes
its entire capacity on each port (e.g., to hosts) via a Host-managed Device Memory
(HDM) decoder. Hosts program each EMC’s address range but treat them initially
as offline. Pond dynamically assigns memory at the granularity of 1GB memory
slices. Each slice is assigned to at most one host at a given time and hosts
are explicitly notified about changes (§4.2). Tracking 1024 slices
(1TB) and 64 hosts (6 bits) requires 768B of EMC state. The EMC implements dynamic
slice assignment by checking the permission of each memory access, i.e., whether the requestor and
owner of the cacheline’s slice match. Disallowed accesses result in fatal
memory errors.

EMC ASIC design. The EMC offers multiple ×8 CXL ports, which
communicate with DDR5 memory controllers (MC) via an on-chip network (NOC). The
MCs must offer the same reliability, availability, and serviceability capabilities
[67, 68] as server-grade memory
controllers including memory error correction, management, and isolation. A key
design parameter of Pond’s EMC is the pool size, which defines the number of
CPU sockets able to use pool memory. We first observe that the EMC’s IO,
(De)Serializer, and MC requirements resemble AMD Genoa’s 397mm² IO-die (IOD) [42, 66]. Figure 6 shows that EMC
requirements for a 16-socket Pond parallel the IOD’s requirements, with a small
8-socket Pond paralleling half an IOD. Thus, up to 16 sockets can directly connect
to an EMC. Pool sizes of 32-64 would combine CXL switches with Pond’s
multi-headed EMC. The optimal design point balances the potential pool savings
for larger pool sizes (§6) with the added cost of larger EMCs and
switches.

EMC Latency. While latency is affected
by propagation delays, it is dominated by CXL port latency, and any use of CXL
retimers and CXL switches. Port latencies are discussed in §2 and [63]. Retimers are devices used
to maintain CXL/PCIe signal integrity over longer distances and add about 10ns
of latency in each direction [69, 70]. In datacenter conditions, signal integrity
simulations [71] indicate that CXL could
require retimers above 500mm. Switches add at least 70ns of latency due to
ports/arbitration/NOC with estimates above 100ns [72].

Figure 7 breaks down Pond’s latency
for different pool sizes. Figure 8 compares Pond’s latency to a design that relies
only on switches instead of a multi-headed EMC. We find that Pond reduces
latencies by 1/3, with 8- and 16-socket pools adding only 70-90ns relative to
NUMA local DRAM. In practice, we expect Pond to be deployed primarily with
small 8/16-socket pools, given the latency and cost overheads, and diminishing
returns of larger pools (§3). Modern CPUs can connect to multiple EMCs which
allows scaling to meet bandwidth and capacity goals for different clusters.

4.2. System Software Layer

Pond’s system software
involves multiple components.

Pool memory ownership. Pool management involves assigning
Pond’s memory slices to hosts and reclaiming them for the pool (Figure 9). It involves 1)
implementing the control paths for pool-level memory assignment and 2)
preventing pool memory fragmentation.

Hosts discover local and
pool capacity through CXL device discovery and map them to their address space.
Once mapped, the pool address range is marked hot-pluggable and “not enabled.”
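Before looking at the runtime assignment flow, a rough sketch of the ownership state involved may help: 1GB slices with at most one owner host each (§4.1), tracked consistently by the EMC's permission table and the Pool Manager. The Python below is purely illustrative; the class and method names are not Pond's actual interfaces:

```python
# Illustrative bookkeeping for pool-slice ownership (not Pond's actual code).
# In hardware, this state lives in the EMC's permission table; accesses from
# non-owner hosts are reported as fatal memory errors.

UNOWNED = -1

class SliceTable:
    def __init__(self, num_slices: int = 1024):     # 1024 x 1GB slices model a 1TB pool
        self.owner = [UNOWNED] * num_slices          # one owner host id per slice

    def assign(self, slice_id: int, host_id: int) -> None:
        """Hand a free slice to a host (the Add_capacity flow described below)."""
        if self.owner[slice_id] != UNOWNED:
            raise ValueError(f"slice {slice_id} already owned by host {self.owner[slice_id]}")
        self.owner[slice_id] = host_id

    def release(self, slice_id: int, host_id: int) -> None:
        """Return a slice to the pool once the host has offlined it (Release_capacity)."""
        if self.owner[slice_id] != host_id:
            raise ValueError(f"host {host_id} does not own slice {slice_id}")
        self.owner[slice_id] = UNOWNED

    def check_access(self, slice_id: int, requestor: int) -> bool:
        """Model the EMC's per-access check: the requestor must own the cacheline's slice."""
        return self.owner[slice_id] == requestor


pool = SliceTable()
pool.assign(slice_id=7, host_id=3)
assert pool.check_access(7, requestor=3)
assert not pool.check_access(7, requestor=5)   # on real hardware: fatal memory error
pool.release(slice_id=7, host_id=3)
```

With 64 possible hosts, this is about 6 bits of owner state per slice, so 1024 slices amount to the 768B of EMC state mentioned in §4.1.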
Slice assignment is controlled at runtime via a Pool Manager (PM) that is
colocated on the same blade as the EMCs (Figure 7). In Pond’s current design,
the PM is connected to EMCs and CPU sockets via a low-power management bus (e.g., [73]). To allocate pool
memory, the Pool Manager triggers two types of interrupts at the EMC and host
driver. Add_capacity(host, slice) interrupts
the host driver which reads the address range to be hot-plugged. The driver
then communicates with the OS memory manager to bring the memory online. The
EMC adds the host id to its permission table at the slice offset. Release_capacity(host, slice) works similarly by
offlining the slice on the host and resetting the slice’s permission table
entry on the EMC. An alternative to this design would be in-band communication using
the Dynamic Capacity Device (DCD) feature in CXL 3.0 [62, §9.13]. This change
would maintain the same functionality for Pond.

Pond must avoid
fragmenting its online pool memory as the contiguous 1GB address range must be
free before it can be offlined for reassignment to another host. Pool memory is
allocated to VMs in 1GB-aligned increments (§4.3). While this prevents
fragmentation due to VM starts and completions, our experience has shown that
host agents and drivers can allocate pool memory and cause fragmentation. Pond
thus uses a special-purpose memory partition that is only available to the
hypervisor. Host agents and drivers allocate memory in the host-local memory partition,
which effectively contains fragmentation.

With these optimizations,
offlining 1GB slices empirically takes 10-100 milliseconds/GB. Onlining memory is
near instantaneous with microseconds/GB. These observations are reflected in
Pond’s asynchronous release strategy (§4.3).

Failure management. Hosts only interleave
across local memory. This minimizes the EMCs’ blast radius and facilitates
memory hot-plugging. EMC failures affect only VMs with memory on that EMC,
while VMs with memory on other EMCs continue normally. CPU/host failures are isolated
and associated pool memory is reallocated to other hosts. Pool Manager failures
prevent reallocating pool memory but do not affect the datapath.

Exposing pool memory to
VMs. VMs
that use both NUMA-local and pool memory see pool memory as a zNUMA node. The
hypervisor creates a zNUMA node by adding a memory block (node_memblk) without an entry in the node_cpuid in the SLIT/SRAT tables [74]. We later show the
guest-OS preferentially allocates memory from the local NUMA node before going
to zNUMA (§6). Thus, if zNUMA is sized
to the amount of untouched memory, it is never going to be used. Figure 10 shows the view of a Linux
VM which includes the correct latency in the NUMA distance matrix (numa_slit). This facilitates guest-OS
NUMA-aware memory management [75, 76] for the rare case that the zNUMA is used (§4.4).

Reconfiguration of memory
allocation. To
remain compatible with (G2), local and pool memory mapping generally remain
static during a VM’s lifetime. There are two exceptions that are implemented
today. When live-migrating a VM or when remapping a page with a memory fault, the
hypervisor temporarily disables virtualization acceleration and the VM falls
back to a slower I/O path [77]. Both events are quick and transient and
typically only happen once during a VM’s lifetime. We implement a third variant
which allows Pond a one-time correction to a suboptimal memory allocation.
Specifically, if the host has local memory available, Pond disables the accelerator,
copies all of the VM’s memory to local memory, and enables the accelerator
again. This takes about 50ms for every GB of pool memory that Pond allocated to
the VM.

Telemetry for opaque VMs. Pond requires two types of
telemetry for VMs. First, we use the core-performance measurement unit (PMU) to
gather hardware counters related to memory performance. Specifically, we use
the top-down-method for analysis (TMA) [78, 79]. TMA characterizes how
the core pipeline slots are used. For example, we use the “memory-bound”
metric, which is defined as pipeline stalls due to memory loads and stores. Figure 12 lists these metrics. While
TMA was developed for Intel, its relevant parts are available on AMD and ARM as
well [80]. We modify Azure’s
production hypervisor to associate these metrics with individual VMs (§5) and record VM counter
samples in a distributed database. All our core-PMU-metrics use simple counters
and induce negligible overhead (unlike event-based sampling [81, 82]).

Second, we use hypervisor
telemetry to track a VM’s untouched pages. We use an existing hypervisor
counter that tracks guest-committed memory, which overestimates used memory.
This counter is available for 98% of Azure VMs. We also scan access bits in the
hypervisor page table (§5). Since we only seek untouched pages,
frequent access-bit resets are not required. This minimizes overhead.

4.3. Distributed Control Plane Layer

Figure 11 shows the two tasks
performed by Pond’s control plane: (A) predictions to allocate memory during VM
scheduling and (B) QoS monitoring and resolution.

Predictions and VM
scheduling (A). Pond uses ML-based prediction models (§4.4) to decide how much pool memory
to allocate for a VM during scheduling. After a VM request arrives (A1), the
scheduler queries the distributed ML serving system (A2) for a prediction on
how much local memory to allocate for the VM. The scheduler then informs the
Pool Manager about the target host and associated pool memory needs (A3). The
Pool Manager triggers a memory onlining workflow using the configuration bus to
the EMCs and host (A4). Memory onlining is fast enough to not block a VM’s
start time (§4.2). The scheduler informs
the hypervisor to start the VM on a zNUMA node matching the onlined memory
amount.

Memory offlining is slow
and cannot happen on the critical path of VM starts (§4.2). Pond resolves this by always
keeping a buffer of unallocated pool memory. This buffer is replenished when
VMs terminate and hosts asynchronously release associated slices.

QoS monitoring (B). Pond continuously inspects
the performance of all running VMs via its QoS monitor. The monitor queries
hypervisor and hardware performance counters (B1) and uses an ML model of
latency sensitivity (§4.4) to decide whether the VM’s performance impact
exceeds the PDM. In this case, the
monitor asks its mitigation manager (B2) to trigger a memory reconfiguration
(§4.2) through the hypervisor
(B3). After this reconfiguration, the VM uses only local memory.

Pond’s VM scheduling (A)
and QoS monitoring (B) algorithms rely on two prediction models (in Figure 13).

Predictions for VM
scheduling (A). For scheduling, we first check if we can correlate a
workload history with the VM requested. This works by checking if there have
been previous VMs with the same metadata as the requested VM, e.g., the customer-id, VM type,
and location. This is based on the observation that VMs from the same customer
tend to exhibit similar behavior [48].

If we have prior workload
history, we make a prediction on whether this VM is likely to be memory latency
insensitive, i.e., its performance would be
within the PDM while using only pool
memory. (Model details appear below.) Latency-insensitive VMs are allocated
entirely on pool DRAM.

If the VM has no workload
history or is predicted to be latency-sensitive, we predict untouched memory (UM) over its lifetime.
Interestingly, UM predictions with only generic
VM metadata such as customer history, VM type, guest OS, and location are
accurate (§6). VMs without untouched
memory (UM = 0) are allocated entirely with local DRAM.
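Taking this case together with the UM > 0 case described next, the scheduling-time split can be condensed into the following sketch; the function and the two model interfaces are illustrative stand-ins, not Pond's actual code:

```python
# Illustrative sketch of Pond's per-VM memory split at scheduling time (not the real implementation).

def plan_vm_memory(vm, history, latency_model, um_model):
    """Return (local_gb, pool_gb) for a VM requesting vm["mem_gb"] of memory."""
    total_gb = vm["mem_gb"]

    # Known latency-insensitive workloads can be backed entirely by pool DRAM.
    if history is not None and latency_model.is_insensitive(history):
        return 0, total_gb

    # Otherwise only predicted-untouched memory goes to the pool (0 if no history or UM = 0).
    um_gb = int(um_model.predict_untouched_gb(vm["metadata"]))   # round down to whole GB
    pool_gb = min(max(um_gb, 0), total_gb)
    local_gb = total_gb - pool_gb
    return local_gb, pool_gb          # pool_gb > 0 is exposed to the guest as a zNUMA node
```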
VMs with UM > 0 are allocated with a rounded-down
GB-aligned percentage of pool memory and a corresponding zNUMA node; the
remaining memory is allocated on local DRAM.

If we underpredict UM, the VM will not touch
the slower pool memory as the guest OS prioritizes allocating local DRAM. If we
overpredict UM, we rely on the QoS
monitor for mitigation. Importantly, Pond keeps a VM’s memory mapped in
hypervisor page tables at all times. This means that even if our predictions
happen to be incorrect, performance does not fall off a cliff.

QoS monitoring (B). For zNUMA VMs, Pond
monitors if it overpredicted the amount of untouched memory during scheduling.
For pool-backed VMs and zNUMA VMs with less untouched memory than predicted, we
use the sensitivity model to determine whether the VM workload is suffering
excessive performance loss. If so, the QoS monitor initiates a live VM
migration to a configuration allocated entirely on local DRAM.

Model details. Pond’s two ML prediction
models consume telemetry that is available for opaque VMs from Pond’s system
software layer (§4.2). Figure 12 shows features, labels,
and the training procedure for the latency insensitivity model. The model uses
supervised learning (§5) with core-PMU metrics as features and the
slowdown of pool memory relative to NUMA-local memory as labels. Pond gets
samples of slowdowns from offline test runs and A/B tests of internal workloads
which make their performance numbers available. These feature-label-pairs are
used to retrain the model daily. As the core-PMU is lightweight (§5), Pond continuously
measures core-PMU metrics at VM runtime. This enables the QoS monitor to react
quickly and enables retaining a history of VMs that have been latency
sensitive.

Figure 14 shows the inputs and
training procedure for the untouched-memory model. The model uses supervised learning
(details in §5) with VM metadata as
features and the minimum untouched memory over each VM’s lifetime as labels.
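As a rough sketch of how the two predictors map onto the frameworks named in §5 (the feature and label arrays below are random placeholders for Pond's telemetry; only the model choices follow the text):

```python
# Illustrative training sketch for Pond's two predictors.
# All feature/label arrays are random placeholders standing in for the real telemetry.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
import lightgbm as lgb

rng = np.random.default_rng(0)
PDM = 0.05  # 5% performance degradation margin

# 1) Latency-insensitivity model: core-PMU counters -> "does the slowdown stay within the PDM?"
pmu_features = rng.random((1000, 200))            # ~200 hardware counters per VM sample
slowdown = rng.random(1000) * 0.3                 # slowdown measured on pool vs. local memory
within_pdm = (slowdown <= PDM).astype(int)        # label: 1 = latency insensitive
latency_model = RandomForestClassifier(n_estimators=100, random_state=0)
latency_model.fit(pmu_features, within_pdm)

# 2) Untouched-memory model: VM metadata -> conservative (low-quantile) untouched GB,
#    so that overpredictions (OP) stay rare.
meta_features = rng.random((1000, 16))            # encoded customer/VM-type/guest-OS metadata
min_untouched_gb = rng.random(1000) * 32          # label: minimum untouched memory over lifetime
um_model = lgb.LGBMRegressor(objective="quantile", alpha=0.05)
um_model.fit(meta_features, min_untouched_gb)

# Scheduling-time use: classify insensitivity, otherwise size the zNUMA node from predicted UM.
print(latency_model.predict(pmu_features[:1]), um_model.predict(meta_features[:1]))
```

In production, both models are retrained regularly from aggregated telemetry and served via ONNX on the VM request path (§5). As for the untouched-memory model's inputs: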
Its most important feature is a range of percentiles (e.g., 80th–99th) of the recorded untouched
memory by a customer’s VMs in the last week.

Parameterization of
prediction models. Pond’s latency insensitivity model is parameterized to
stay below a target
rate of false positives (FP), i.e., workloads it incorrectly specifies as
latency insensitive but which are actually sensitive to memory latency. This
parameter enforces a tradeoff as the percentage of workloads that are labeled
as latency insensitive (LI) is
a function of FP. For example, a rate of
0.1% FP may force the model to 5%
of LI.

Similarly, Pond’s
untouched memory model is parameterized to stay below a target
rate of overpredictions (OP), i.e., workloads that touch more memory than
predicted and thus would use memory pages on the zNUMA node. This parameter
enforces a tradeoff as the percentage of untouched memory (UM) is a function of OP. For example, a rate of
0.1% OP may force the model to 3%
of UM.

With two models and their
respective parameters, Pond needs to decide how to balance FP and OP between the two models.
This balance is done by solving an optimization problem based on the given
performance degradation margin (PDM) and
the target percentage of VMs that meet this margin (TP). Specifically, Pond seeks to maximize the
average amount of memory that is allocated on the CXL pool, which is defined by LI and UM, while keeping the
percentage of false positives (FP) and
untouched overpredictions (OP)
below what the tail-percentage target (TP) allows:

maximize    LI_PDM + UM
subject to  FP_PDM + OP ≤ (100 − TP)        (1)
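To make Eq. (1) concrete: each model exposes a tradeoff curve (false-positive rate versus the share of VMs labeled latency insensitive, and overprediction rate versus average untouched memory), and Pond picks the pair of operating points that maximizes pool-allocated memory while the combined error stays within 100 − TP. A minimal grid-search sketch with made-up curves (the numbers are illustrative, not Azure's):

```python
# Illustrative solver for Eq. (1); the tradeoff curves below are made up, not Azure's data.

# Operating points of the latency-insensitivity model: (FP %, % of VMs labeled insensitive)
li_curve = [(0.0, 0.0), (0.1, 5.0), (0.5, 15.0), (1.0, 22.0), (2.0, 30.0)]
# Operating points of the untouched-memory model: (OP %, average % of memory deemed untouched)
um_curve = [(0.0, 0.0), (0.1, 3.0), (0.5, 10.0), (1.0, 18.0), (2.0, 25.0)]

TP = 98.0                # target percentage of VMs that must stay within the PDM
budget = 100.0 - TP      # combined misprediction budget from Eq. (1)

# Maximize LI + UM subject to FP + OP <= 100 - TP.
best = max(
    ((li + um, fp, op) for fp, li in li_curve for op, um in um_curve if fp + op <= budget),
    key=lambda t: t[0],
)
pool_share, fp, op = best
print(f"pool share {pool_share:.1f}% at FP={fp}%, OP={op}%")
```

In Pond, these curves correspond to the measured tradeoffs reported in Figures 17 and 18 rather than hand-written tables.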
Note that TP essentially defines how
often the QoS monitor has to engage and initiate memory reconfigurations.

Besides PDM and TP, Pond has no other
parameters as it automatically solves the optimization problem from Eq.(1). The models rely on
their respective framework’s default hyperparameters (§5).

We implement and evaluate
Pond on production servers that emulate pool latency. Pond artifacts are
open-sourced at https://github.com/vtess/Pond.

System software. This implementation
comprises three parts. First, we emulate a single-socket system with a CXL pool
on a two-socket server by disabling all cores in one socket, while keeping its
memory accessible from the other socket. This memory mimics the pool.

Second, we change Azure’s
hypervisor to instantiate arbitrary zNUMA topologies. We extend the API between
the control plane and the host to pass the desired zNUMA topology to the
hypervisor.

Third, we implement
support in Azure’s hypervisor for the telemetry required for training Pond’s
models. We extend each virtual core’s metadata with a copy of its core-PMU
state and transfer this state when it gets scheduled on different physical
cores. Pond samples core-PMU counters once per second, which takes 1ms. We
enable access bit scanning in hypervisor page tables. We scan and reset access
bits every 30 minutes, which takes 10s.

Distributed control plane. We
train our prediction models by aggregating daily telemetry into a central
database. The latency insensitivity model uses a simple random forest (Random Forest)
from Scikit-learn [83] to classify whether a workload exceeds the PDM. The model uses a set of
200 hardware counters as supported by current Intel processors. The untouched
memory model uses a gradient boosted regression model (GBM) from LightGBM [84] and makes a quantile
regression prediction with a configurable target percentile. After exporting to
ONNX [85], the prototype adds the
prediction (the size of zNUMA) on the VM request path using a custom inference
serving system similar to [86–88]. Azure’s VM scheduler incorporates zNUMA
requests and pool memory as an additional dimension into its bin packing,
similar to other cluster schedulers [49, 89–93].

Our evaluation addresses
the performance of zNUMA VMs (§6.2, §6.3), the accuracy of Pond’s prediction models (§6.4), and Pond’s end-to-end
DRAM savings (§6.5).

We evaluate the
performance of our prototype using 158 cloud workloads. Specifically, our
workloads span in-memory databases and KV-stores (Redis [94], VoltDB [95], and TPC-H on MySQL [96]), data and graph
processing (Spark [97] and GAPBS [98]), HPC (SPLASH2x [99]), CPU and shared-memory
benchmarks (SPEC CPU [100] and PARSEC [101]), and a range of Azure’s
internal workloads (Proprietary). Figure 4 overviews these workloads.
We quantify DRAM savings with simulations.

Prototype setup. We run experiments on
production servers at Azure and similarly-configured lab servers. The production
servers use either two Intel Skylake 8157M sockets with 384GB of DDR4 each, or
two AMD EPYC 7452 sockets with 512GB of DDR4 each. On Intel, we measure 78ns
NUMA-local latency and 80GB/s bandwidth and 142ns remote latency and 30GB/s
bandwidth (3/4 of a CXL ×8
link). On AMD, we measure 115ns NUMA-local latency and 255ns remote latency.
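For reference, the emulated remote latencies work out to roughly 182% and 222% of the corresponding local latencies, which matches the two latency scenarios used in §3.3 and §6:

142ns / 78ns ≈ 1.82 (182%)
255ns / 115ns ≈ 2.22 (222%)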
Our BIOS disables hyper-threading, turbo-boost, and C-states.

We use performance results
of VMs entirely backed by NUMA-local DRAM as our baseline. We present zNUMA
performance as normalized slowdowns, i.e., the ratio to the baseline. Performance
metrics are workload specific, e.g., job runtime, throughput and tail latency,
etc.

Each experiment involves
running the application with one of 7 zNUMA sizes (as percentages of the
workload’s memory footprint in Figure 16). With at least three repetitions of each
run and 158 workloads, our evaluation spans more than 3,500 experiments and
10,000 machine hours. Most experiments used lab servers; we spot check outliers
on production servers.

Simulations. Our simulations are based
on traces of production VM requests and their placement on servers. The traces
are from 100 randomly selected clusters across 34 datacenters globally over 75
days.

The simulator implements
different memory allocation policies and tracks each server and each pool’s
memory capacity at second accuracy. Generally, the simulator schedules VMs on
the same nodes as in the trace and changes their memory allocation to match the
policy. For rare cases where a VM does not fit on a server, e.g., due to insufficient pool
memory, the simulator moves the VMs to another server.

Model evaluation. We evaluate our model with
production resource logs. About 80% of VMs have sufficient history to make a
sensitivity prediction. Our deployment does not report each workload’s
perceived performance (opaque VMs). We thus evaluate the latency-sensitivity model based
on our 158 workloads.

6.2. zNUMA VMs on Production Nodes

We perform a small-scale
experiment on Azure production nodes to validate zNUMA VMs. The experiment evaluates
four internal workloads: an audio/video conferencing application, a database
service, a key-value store, and a business analytics service. To see the
effectiveness of zNUMA, we assume a correct prediction of untouched memory, i.e., the local footprint fits
into the VM’s local vNUMA node. Figure 15 shows access bit scans
over 48 hours from the video workload and a table that shows the traffic to the
zNUMA node for the four workloads.

Finding 1. We find that zNUMA nodes
are effective at containing the memory access to the local vNUMA node. A small
fraction of accesses goes to the zNUMA node. We suspect that this is in part
due to the guest OS memory manager’s metadata that is explicitly allocated on
each vNUMA node. We find that the video workload sends fewer than 0.25% of
memory accesses to the zNUMA node. Similarly, the other three workloads send
0.06-0.38% of memory accesses to the zNUMA node. Accesses within the local vNUMA
node are spread out.

Implications. With a negligible fraction
of memory accesses on zNUMA, we expect negligible performance impact given a
correct prediction of untouched memory.6.3. zNUMA VMs in the LabWe scale up our evaluation
to 158 workloads in a lab setting. Since we fully control these workloads, we
can now also explicitly measure their performance. We rerun each workload on
all-local memory, a correctly sized zNUMA (0% spilled), and differently-sized zNUMA
nodes between 10-100% of the workload’s footprint. Figure 16 shows a violin plot of
associated slowdowns. This setup covers both normal behavior (all-local and 0%
spill) and misprediction behavior for latency sensitive workloads. Thus, this
is effectively a sensitivity study.

Finding 2. With a correct prediction
of untouched memory, workload slowdowns have a similar distribution to all-local
memory.

Implications. This performance result is
expected since the zNUMA node is rarely accessed (§6.2). Our evaluation can thus
assume no performance impact under correct predictions of untouched memory (§6.5).

Finding 3. For overpredictions of
untouched memory (and correspondingly undersized local vNUMA nodes), the
workload spills into zNUMA. Many workloads see an immediate impact on slowdown.
Slowdowns further increase if more workload memory spills into zNUMA. Some
workloads are slowed down by as much as 30-35% with 20-75% of workload memory
spilled and up to 50% if entirely allocated on pool memory. We use access bit scans
to verify that these workloads indeed actively access their entire working set.

Implications. Allocating a fixed
percentage of pool DRAM to VMs would lead to significant performance slowdowns.
There are only two strategies to reduce this impact: 1) identify which
workloads will see slowdowns and 2) allocate untouched memory on the pool. Pond
employs both strategies.

6.4. Performance of Prediction Models

We evaluate Pond’s
prediction models (§4.4) and its combined prediction model based on
Eq.(1).

6.4.1. Predicting Latency Sensitivity

Pond seeks to predict whether
a VM is latency insensitive, i.e., whether running the workload on pool memory
would stay within the performance degradation margin (PDM). We tested the model for PDM between 1-10% and on both
182% and 222% latency increases, but report details only for 5% and 182%. Other PDM values lead to
qualitatively similar results. The 222% model is 16% less effective given the
same false positive rate target. We compare thresholds on memory and DRAM
boundedness [78, 79] to our RandomForest (§5).

Figure 17 shows the model’s false
positive rate as a function of the percentage of workloads labeled as latency insensitive,
similar to a precision-recall curve [102]. Error bars show 99% confidence from a
100-fold validation based on randomly splitting into equal-sized training and testing
datasets.

Finding 4. While DRAM boundedness is correlated
with slowdown, we find examples where high slowdown occurs even for a small
percentage of DRAM boundedness. For example, multiple workloads exceed 20%
slowdown with just two percent of DRAM boundedness.

Implication. This shows the general
hardness of predicting whether workloads exceed the PDM. Heuristics as well as predictors will make
statistical errors.

Finding 5. We find that “DRAM bound”
significantly outperforms “Memory bound” (Figure 17). Our RandomForest
performs slightly better than “DRAM bound”.

Implication. Our RandomForest can place
30% of workloads on the pool with only 2% of false positives.

6.4.2. Predicting Untouched Memory

Pond predicts the amount
of untouched memory over a VM’s future lifetime (§4.4). We evaluate this model
using metadata and resource usage logs from 100 clusters over 75 days. The model
is trained nightly and evaluated on the subsequent day. Figure 18 compares our GBM model to
the heuristic that assumes a fixed fraction of memory as untouched across all
VMs. The figure shows the overprediction rate as a function of the average
amount of untouched memory. Figure 19 shows a production version of the untouched memory
model during the first 110 days of 2022.

Finding 6. We find that the GBM model
is 5× more accurate than the
static policy, e.g., when labeling 20% of memory as untouched,
GBM overpredicts only 2.5% of VMs while the static policy overpredicts 12%.

Implication. Our prediction model
identifies 25% of untouched memory while only overpredicting 4% of VMs.

Finding 7. The production version of
our model performs similarly to the simulated model. Distributional shifts lead
to some variability over time.

Implication. We find that accurately
predicting untouched memory is practical and a realistic assumption.

6.4.3. Combined Prediction Models

We characterize Pond’s combined models
(Eq.(1)) using “scheduling
mispredictions”, i.e., the fraction of VMs that will exceed the PDM. This incorporates the
overpredictions of untouched memory, how much the model overpredicted, and the probability
of this overprediction leading to a workload exceeding the PDM. Further, Pond uses its
QoS monitor to mitigate up to 1% of mispredictions. Figure 20 shows scheduling
mispredictions as a function of the average amount of cluster DRAM that is
allocated on its pools for 182% and 222% memory latency increases,
respectively.

Finding 8. Pond’s combined model
outperforms its individual models by finding their optimal combination.

Implication. With a 2% scheduling
misprediction target, Pond can schedule 44% and 35% of DRAM on pools with 182%
and 222% memory latency increases, respectively.

6.5. End-to-end Reduction in Stranding

We characterize Pond’s
end-to-end performance while constraining its rate of scheduling mispredictions.
Figure 21 shows the reduction in
aggregate cluster memory as a function of pool size for Pond under 182% and
222% memory latency increase, respectively, and a strawman static allocation
policy. We evaluate multiple scenarios; the figure shows PDM = 5% and TP = 98%. In this scenario, the
strawman statically allocates each VM with 15% of pool DRAM. About 10% of VMs
would touch the pool DRAM (Figure 18). Of those touching pool DRAM, we’d expect
that about 1/4 would see a slowdown exceeding
a PDM = 5% (Figure 16). So, the strawman would
have about 2.5% of scheduling mispredictions (10% × 1/4 = 2.5%).

Finding 9. At a pool size of 16
sockets, Pond reduces overall DRAM requirements by 9% and 7% under 182% and
222% latency increases, respectively. Static reduces DRAM by 3%. When varying PDM between 1 and 10% and TP between 90 and 99.9%, we
find the relative savings of the three systems to be qualitatively similar.

Implication. Pond can safely reduce
cost. A QoS monitor that mitigates more than 1% of mispredictions can achieve
more aggressive performance targets (PDM).

Finding 10. Throughout the
simulations, Pond’s pool memory offlining speeds remain below 1GB/s and 10GB/s for
99.99% and 99.999% of VM starts, respectively.

Implication. Pond is practical and
achieves its design goals.

Robustness of ML. Similar to other
oversubscribed resources (CPU [55] and disks [103]), customers may overuse
resources to get local memory. When multiplexing resources for millions of
customers, any individual customer’s behavior will have a small impact.
Providers can also offer small discounts when resources are not fully
utilized.

Alternatives to static
memory preallocation. Pond is designed for compatibility with statically preallocated
memory, as potential workarounds are not yet practical. The PCIe Address Translation
Service (ATS/PRI) [104] enables compatibility with page faults.
Unfortunately, ATS/PRI-devices are not yet broadly available [1]. Virtual IOMMUs [2, 5, 6] allow fine-grained
pinning but require guest OS changes and introduce overhead.

Hardware-level
disaggregation: Hardware-based disaggregation designs [19, 58, 105–109] are not easily
deployable as they do not rely on commodity hardware. For instance,
ThymesisFlow [58] and Clio [105] propose FPGA-based
rack-scale memory disaggregation designs on top of OpenCAPI [110] and RDMA. Their hardware
layer shares goals with Pond. Their software goals differ fundamentally, e.g., ThymesisFlow advocates
application changes for performance, while Pond focuses on platform-level
ML-driven pool memory management that is transparent to users.

Hypervisor/OS level
disaggregation: Hypervisor/OS level approaches [17, 21, 22, 31–38] rely on page faults and
access monitoring to maintain the working set in local DRAM. Such OS-based
approaches bring significant overhead, jitter, and are incompatible with
virtualization acceleration (e.g., DDA).

Runtime/application level
disaggregation: Runtime-based disaggregation designs [23, 25, 26] propose customized APIs
for remote memory access. While effective, this approach requires developers to
explicitly use these mechanisms at the application level.

Memory tiering: Prior works have
considered the broader impact of extended memory hierarchies and how to handle
them [17, 111–114]. For example, Google achieves
6µs latency via proactive
hot/cold page detection and compression [17, 115]. Nimble [75] optimizes Linux’s page
tracking mechanism to tier pages for increased migration bandwidth. Pond takes
a different ML-based approach looking at memory pooling design at the platform level
and is orthogonal to these works.

ML for systems: ML is increasingly applied
to tackle systems problems, such as cloud efficiency [48, 55], memory/storage
optimizations [116, 117], microservices [118], and caching/prefetching
policies [119, 120]. We uniquely apply ML
methods for untouched memory prediction to support pooled memory provisioning
to VMs without jeopardizing QoS.

Coherent memory and NUMA
optimizations: Traditional cache coherent NUMA architectures [121] use specialized
interconnects to implement a shared address space. There are also system-level
optimizations for NUMA, such as NUMA-aware data placement [122] and proactive page
migration [76]. NUMA scheduling
policies [123–125] balance compute and
memory across NUMA nodes. Pond’s ownership overcomes the need for coherence
across the memory pool. zNUMA’s zero-core nature requires rethinking
existing optimizations, which are largely designed for symmetric NUMA systems.

DRAM is an
increasing cost factor for cloud providers. This paper is motivated by the
observation of stranded and untouched memory across 100 production cloud
clusters. We proposed Pond, the first full-stack memory pool that satisfies the
requirements of cloud providers. Pond comprises contributions at the hardware, system
software, and distributed system layers. Our results showed that Pond can
reduce the amount of needed DRAM by 7% with a pool size of 16 sockets and
assuming CXL increases latency by 222%. This translates into an overall
reduction of 3.5% in cloud server cost.