We can distinguish between systems that contain:
• A single processor containing a single core.
• A multi-core processor, such as the Cortex-A53, with several cores that can each execute instructions independently. It can be viewed externally as a single unit or cluster, either by the system designer or by an operating system that abstracts the underlying resources from the application layer.
• Multiple clusters, in which each cluster contains multiple cores.
• Heterogeneous clusters. ARM uses HMP (heterogeneous multi-processing) to mean a system composed of clusters of application processors that are 100% identical in their instruction set architecture but very different in their microarchitecture. All the processors are fully cache coherent and part of the same coherency domain.
L2 memory system
The Cortex-A53 L2 memory system contains the L2 cache pipeline and all logic required to maintain memory coherence between the cores of the cluster. It has the following features:
• A Snoop Control Unit (SCU) that connects the cores to the external memory system through the master memory interface. The SCU maintains data cache coherency between the cores and arbitrates L2 requests from the cores.
• The SCU is still included when the Cortex-A53 processor is implemented with a single core.
Snoop Control Unit
The Cortex-A53 processor supports between one and four individual cores with L1 Data cache coherency maintained by the SCU.
• The SCU is clocked synchronously and at the same frequency as the cores.
• The SCU maintains coherency between the individual data caches in the processor using ACE modified equivalents of **MOESI** state, as described in Data Cache Unit.
• The SCU contains buffers that can handle direct cache-to-cache transfers between cores without having to read or write any data to the external memory system. Cache line migration enables dirty cache lines to be moved between cores, and there is no requirement to write back transferred cache line data to the external memory system.
• Each core has tag and dirty RAMs that contain the state of the cache line. Rather than access these for each snoop request, the SCU contains a set of duplicate tags that permit each coherent data request to be checked against the contents of the other caches in the cluster. The duplicate tags filter coherent requests from the system so that the cores and system can function efficiently even with a high volume of snoops from the system.
• When an external snoop hits in the duplicate tags, a request is made to the appropriate core.
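The duplicate-tag filtering and cache-to-cache migration described above can be sketched as a toy model. This is only an illustration, not how the Cortex-A53 SCU is implemented: the class `ToyScu`, its method names, and the single-owner simplification (each line lives in at most one core) are all assumptions made for the example.

```python
class ToyScu:
    """Toy SCU model: duplicate tags plus direct cache-to-cache line migration.

    Simplifying assumption: each line lives in at most one core at a time.
    """

    def __init__(self, num_cores):
        # Per-core L1 data cache: line address -> (data, dirty flag).
        self.caches = [dict() for _ in range(num_cores)]
        # Duplicate tags kept inside the SCU: line address -> owning core.
        self.dup_tags = {}
        # Number of times a core cache had to be accessed for a snoop.
        self.core_snoops = 0

    def migrate(self, dest, line):
        """Move a line to core `dest`; return its data, or None on a tag miss."""
        owner = self.dup_tags.get(line)
        if owner is None:
            # Filtered by the duplicate tags: no core cache is touched.
            return None
        if owner != dest:
            self.core_snoops += 1
            # The line moves core to core and keeps its dirty flag;
            # nothing is written back to external memory.
            self.caches[dest][line] = self.caches[owner].pop(line)
            self.dup_tags[line] = dest
        return self.caches[dest][line][0]

    def write(self, core, line, data):
        """Write a line from `core`, pulling it from another core if needed."""
        self.migrate(core, line)
        self.caches[core][line] = (data, True)
        self.dup_tags[line] = core


scu = ToyScu(num_cores=2)
scu.write(0, 0x40, "payload")        # core 0 now holds the line dirty
moved = scu.migrate(1, 0x40)         # dirty line migrates to core 1
missed = scu.migrate(1, 0x80)        # duplicate-tag miss, no core is snooped
```

In this sketch only the one real migration touches a core cache; the miss is answered from the duplicate tags alone.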
In this example, slave interfaces S5 and S6 support the ACE protocol for connecting masters such as the Cortex-A53 or Cortex-A72 processors. The CCI-500 manages full coherency and data sharing between L1 and L2 caches of all connected processor clusters.
Snoop filter
The CCI-500 contains an inclusive snoop filter that records the addresses of data that is stored in the ACE master caches.
The snoop filter can respond to snoop transactions in the case of a miss, and snoop appropriate masters only in the case of a hit. Snoop filter entries are maintained by observing transactions from ACE masters to determine when entries have to be allocated and deallocated.
The snoop filter can respond to multiple coherency requests without it being necessary to broadcast to all ACE interfaces. For example, if the address is not in any cache, the snoop filter responds with a miss and directs the request to memory. If the address is in a processor cache, the request is considered a hit and is directed to the ACE port containing that address in its cache.
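As a rough sketch of the hit and miss routing just described, the function below decides where a coherent request goes. The function name, the dictionary representation of the filter, and the port labels are hypothetical and chosen only for illustration; the real CCI-500 logic is not described at this level in the text.

```python
def route_coherent_request(filter_entries, line_addr, requester):
    """Decide where a coherent request is sent.

    filter_entries maps a line address to the set of ACE port names whose
    caches hold that line (an illustrative representation only).
    """
    holders = filter_entries.get(line_addr, set()) - {requester}
    if not holders:
        # Miss: the filter answers on behalf of the caches; go to memory.
        return ("memory", [])
    # Hit: snoop only the ports that actually hold the line.
    return ("snoop", sorted(holders))


entries = {0x80: {"S5", "S6"}}
hit = route_coherent_request(entries, 0x80, requester="S5")
miss = route_coherent_request(entries, 0x100, requester="S5")
```

The point of the design is that a miss never generates snoop traffic, and a hit generates traffic only toward the ports that hold the line.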
Arm recommends that you configure the snoop filter directory to be 0.75-1 times the total size of exclusive caches of processors that are attached to the CCI-500. The snoop filter is 8-way set associative and, to minimize conflicts, stores twice as many tags as the configured size. An example of a conflict is when the CCI-500 is unable to insert a new entry in an available position in the snoop filter. If a conflict occurs, an existing entry is evicted, and the snoop filter issues a CleanInvalid snoop to processors that might be holding the evicted lines. This type of eviction is known as a back-invalidation, and is expected to occur rarely if you configure the snoop filter size as Arm recommends.
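The sizing recommendation and the back-invalidation behaviour can be illustrated with a small model. This is a sketch under stated assumptions: the names `recommended_directory_range` and `SetAssociativeSnoopFilter` are made up for the example, and the 64-byte line size and oldest-first victim choice are assumptions, since the text does not state the CCI-500 replacement policy.

```python
from collections import OrderedDict


def recommended_directory_range(exclusive_cache_bytes):
    """Return (low, high) directory sizes for the 0.75x to 1x recommendation."""
    total = sum(exclusive_cache_bytes)
    return (total * 3 // 4, total)


class SetAssociativeSnoopFilter:
    """Toy set-associative snoop filter that reports back-invalidations."""

    def __init__(self, num_sets, ways=8, line_bytes=64):
        self.num_sets = num_sets
        self.ways = ways
        self.line_bytes = line_bytes  # assumed line size
        self.sets = [OrderedDict() for _ in range(num_sets)]

    def allocate(self, addr, port):
        """Track addr for port; return the evicted (line, ports) pair or None."""
        line = addr // self.line_bytes
        entries = self.sets[line % self.num_sets]
        if line in entries:
            entries[line].add(port)
            return None
        victim = None
        if len(entries) >= self.ways:
            # Conflict: evict the oldest entry (assumed policy). The holders
            # of the victim line would receive a CleanInvalid snoop.
            victim = entries.popitem(last=False)
        entries[line] = {port}
        return victim


# Two clusters with 1 MiB and 2 MiB of exclusive cache.
size_range = recommended_directory_range([1 << 20, 2 << 20])

snoop_filter = SetAssociativeSnoopFilter(num_sets=1, ways=2)
snoop_filter.allocate(0, "S5")
snoop_filter.allocate(64, "S5")
snoop_filter.allocate(0, "S6")
back_invalidated = snoop_filter.allocate(128, "S6")
```

With only two ways in the single set, the third distinct line forces out line 0, and both ports that held it would need to be back-invalidated.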
The snoop filter is updated by monitoring transactions from the attached masters that allocate and deallocate data into their caches. In the ACE protocol, the deallocation of clean data is indicated using the Evict transaction.
ACE uses a MOESI state machine for cross-cluster coherency.
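For orientation, the five ACE cache line states correspond to the MOESI letters as shown in the sketch below. The ACE state names are the ones used by the AMBA specification; the enum and the two helper functions are illustrative names introduced here.

```python
from enum import Enum


class AceState(Enum):
    UNIQUE_DIRTY = "UniqueDirty"
    SHARED_DIRTY = "SharedDirty"
    UNIQUE_CLEAN = "UniqueClean"
    SHARED_CLEAN = "SharedClean"
    INVALID = "Invalid"


# MOESI letter -> ACE state name.
MOESI_TO_ACE = {
    "M": AceState.UNIQUE_DIRTY,
    "O": AceState.SHARED_DIRTY,
    "E": AceState.UNIQUE_CLEAN,
    "S": AceState.SHARED_CLEAN,
    "I": AceState.INVALID,
}


def must_write_back_on_evict(state):
    """A dirty line is responsible for updating main memory when evicted."""
    return state in (AceState.UNIQUE_DIRTY, AceState.SHARED_DIRTY)


def can_write_without_notifying_others(state):
    """Only a line held in a Unique state can be stored to directly."""
    return state in (AceState.UNIQUE_DIRTY, AceState.UNIQUE_CLEAN)
```

For example, a line in SharedDirty (the MOESI "Owned" state) must be written back on eviction, but a store to it first requires the other copies to be invalidated.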
A cluster can be configured with up to three different types of cores, each type targeting a different power efficiency and performance level. This arrangement allows for an intermediate core type with an intermediate performance and efficiency level. The cluster also supports complexes.
A cluster can be configured in many arrangements. Examples of cluster arrangements are:
• One or more cores of the same type.
• Various arrangements of two types of cores. For example, one or more cores targeting either a high-performance level or a higher power efficiency level.
• Various arrangements of three types of cores. For example, one or more high-performance cores, power-efficient cores, and intermediate cores.
• One or more complexes and no individual cores.
• One or more complexes and individual cores.
DSU-120
A DSU-120 DynamIQ™ cluster consists of between one and 14 cores, with up to three different types of cores in the same cluster. Cores can be configured for various performance points during macrocell implementation and run at different frequencies and voltages.
The DSU-120 DynamIQ™ cluster also supports complexes where typically two cores are linked together and share logic. Examples of shared logic include a floating-point unit and an L2 cache.
All cores in the DSU-120 DynamIQ™ cluster, including those in complexes, are coherently connected to an L3 memory system that includes an L3 cache and a Snoop Control Unit (SCU). The SCU maintains coherency between caches in the cores and the L3 cache, and includes a **snoop filter** to optimize coherency maintenance operations. The shared L3 cache simplifies process migration between the cores.
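The two cluster limits stated above (one to 14 cores, at most three core types) can be expressed as a small check. This is a minimal sketch: the function name and the core-type labels are hypothetical, and complexes are left out because the text does not say how they count toward the limits.

```python
from collections import Counter

MAX_CORES = 14        # from the DSU-120 description above
MAX_CORE_TYPES = 3    # from the DSU-120 description above


def check_cluster(core_types):
    """Return a list of problems for a proposed cluster; empty means it fits.

    core_types holds one label per core, for example ["perf", "eff", "eff"].
    """
    problems = []
    if not 1 <= len(core_types) <= MAX_CORES:
        problems.append(
            f"core count {len(core_types)} is outside 1..{MAX_CORES}"
        )
    kinds = Counter(core_types)
    if len(kinds) > MAX_CORE_TYPES:
        problems.append(
            f"{len(kinds)} core types exceed the limit of {MAX_CORE_TYPES}"
        )
    return problems


ok = check_cluster(["perf"] * 2 + ["mid"] * 4 + ["eff"] * 4)
too_many = check_cluster(["eff"] * 15)
```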
Coherency and snoop control
The DSU-120 has the following coherency and snoop control features:
• The Snoop Control Unit (SCU) maintains coherency and consistency in the memory system internal to the cluster and, optionally, external to the cluster.
• The SCU includes a set of automatically sized snoop filters, one for each cache slice.
Snoop Control Unit
The Snoop Control Unit (SCU) maintains coherency between all the data caches in the cluster.
The SCU contains buffers that can handle direct cache-to-cache transfers between cores without having to read or write data to the L3 cache. Cache line migration enables dirty lines to be moved between cores.
The SCU contains a set of snoop filters that track the addresses for locations cached in the core caches. Including the snoop filters means that the SCU does not need to request a lookup in the core caches when it receives a coherent memory request. These snoop filters are accessed by the coherent requests from the other cores or from the system. If there is a simultaneous hit in the L3 tags and the SCU snoop filters, then the L3 cache normally provides the data in preference to a core. The size of the snoop filter is automatically determined from the configured number of cores and the cache sizes in those cores.
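The preference order in the paragraph above can be sketched as a simple decision function. The function name and the set and dictionary arguments are illustrative assumptions, and the sketch ignores the cases hinted at by "normally", where the L3 cache might not be the supplier.

```python
def choose_data_source(line, l3_tags, snoop_filter):
    """Pick where a coherent read gets its data in this toy model.

    l3_tags is a set of lines present in the L3 cache; snoop_filter maps a
    line to the set of cores whose caches hold it.
    """
    if line in l3_tags:
        # On a simultaneous hit, the L3 cache is preferred over a core.
        return "L3"
    if snoop_filter.get(line):
        return "core"
    return "memory"


l3_tags = {0x100, 0x140}
snoop_filter = {0x100: {0}, 0x180: {1}}
sources = [choose_data_source(a, l3_tags, snoop_filter)
           for a in (0x100, 0x180, 0x1C0)]
```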
Cortex®-A710
To maintain data coherency between multiple cores, the Cortex®-A710 core uses the Modified Exclusive Shared Invalid (MESI) protocol.
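Figure 1-11 CHI MESI specification
MESI (Modified, Exclusive, Shared, Invalid) is an invalidate-based cache coherence protocol. It is also known as the Illinois protocol, named for its development at the University of Illinois at Urbana-Champaign. MESI is one of the most common protocols that support write-back caches, and is mainly used to keep cached data consistent in multiprocessor systems.
Definition: The MESI protocol maintains the consistency of cached data in a multiprocessor system. It defines four states for a cache line (Modified, Exclusive, Shared, Invalid) and the rules for transitions between them, so that every processor sees a consistent view of shared data.
Background: In modern computer systems the CPU computes far faster than memory can be accessed. Caches were introduced as a buffer between the CPU and memory to reduce this gap. In a multiprocessor system, however, several CPUs may access the same memory address at the same time, which can leave their cached copies inconsistent. The MESI protocol was designed to solve this problem.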
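The four MESI states and their main transitions can be sketched as a toy simulation. This is a simplified textbook model rather than the Cortex-A710 or CHI implementation: the class `MesiSystem` and its methods are made-up names, data and write-backs are not modeled, and a write is treated as a single step that invalidates all other copies.

```python
class MesiSystem:
    """Toy MESI model tracking only the state of each line in each cache."""

    def __init__(self, num_caches):
        # One dict per cache: line address -> "M", "E", or "S".
        # A missing entry means the line is Invalid ("I").
        self.caches = [dict() for _ in range(num_caches)]

    def state(self, cache, line):
        return self.caches[cache].get(line, "I")

    def states(self, line):
        return [self.state(c, line) for c in range(len(self.caches))]

    def read(self, cache, line):
        if self.state(cache, line) != "I":
            return  # read hit: no state change
        sharers = [c for c in range(len(self.caches))
                   if c != cache and self.state(c, line) != "I"]
        for c in sharers:
            # A Modified copy would be written back here; M and E drop to S.
            self.caches[c][line] = "S"
        # Exclusive if nobody else holds the line, otherwise Shared.
        self.caches[cache][line] = "S" if sharers else "E"

    def write(self, cache, line):
        for c in range(len(self.caches)):
            if c != cache:
                self.caches[c].pop(line, None)  # invalidate other copies
        self.caches[cache][line] = "M"


system = MesiSystem(num_caches=2)
trace = []
system.read(0, 0x40)
trace.append(system.states(0x40))    # first reader gets Exclusive
system.read(1, 0x40)
trace.append(system.states(0x40))    # second reader makes both Shared
system.write(1, 0x40)
trace.append(system.states(0x40))    # writer becomes Modified, other Invalid
system.read(0, 0x40)
trace.append(system.states(0x40))    # read of a Modified line shares it again
```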
The AMBA 4 ACE-Lite interface is a subset of the full interface, designed for one-way IO coherent system masters such as DMA engines, network interfaces, and GPUs.
SRAM - Static Random-Access Memory
DRAM - Dynamic Random-Access Memory
SSD - Solid-State Drive
HDD - Hard Disk Drive
SoC - System on a Chip
AMBA - Advanced Microcontroller Bus Architecture
TLB - Translation Lookaside Buffer
VIVT - Virtually Indexed, Virtually Tagged
PIPT - Physically Indexed, Physically Tagged
VIPT - Virtually Indexed, Physically Tagged
AHB - Advanced High-performance Bus
ASB - Advanced System Bus
APB - Advanced Peripheral Bus
AXI - Advanced eXtensible Interface
DSU - DynamIQ Shared Unit
ACE - AXI Coherency Extensions
CHI - Coherent Hub Interface
CCI - Cache Coherent Interconnect
ADB - AMBA Domain Bridge
CMN - Coherent Mesh Network
MESI - Modified, Exclusive, Shared, Invalid