[A-17] ARMv8/ARMv9-Memory: Memory Barrier Mechanism (Observer & Barrier)

Digest   2024-11-09 23:16   Liaoning





1.1 Observer


1.1.1 The Observer Concept


Observer: A PE or mechanism in the system, such as a peripheral device, that can generate reads from or writes to memory.

• An Observer refers to either a Processor Element (PE) or some other mechanism in the system, such as a peripheral device, that can generate reads from, or writes to, memory. Observers can observe memory accesses.

• The ARM Architecture Reference Manual defines certain key words, in particular, the terms observe and must be observed. In typical systems, this defines how the bus interface of a master, for example, a core or GPU and the interconnect, must handle bus transactions. Only masters are able to observe transfers. All bus transactions are initiated by a master. 
The architecture considers the following as separate Observers:

• The instruction interface of the core, typically called the Instruction Fetch Unit (IFU)

• The data interface, typically called the Load Store Unit (LSU)

• The MMU table walk unit

As described in Who is an Observer?, an Observer is something that can make memory accesses. For example, the MMU generates reads to walk translation tables.


Figure 1-1 High-Level ARM SoC Architecture


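(1) An Observer must be a functional unit inside the SoC, such as a PE, GPU, VPU, or DMA controller.

(2) An Observer can be subdivided further into finer-grained functional units in the front end or back end of a PE, such as the MMU or the LSU.

(3) An Observer is always tied to the system's memory model: the memory space it observes consists of two memory types, Device and Normal.

(4) An Observer can be understood as another name for a Master in the SoC bus architecture, with the emphasis on bus transactions that access memory.

(5) Observers are related to one another, and coherency between them must be maintained where necessary, either in hardware or in software.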


1.1.2 Observability


The order that a master performs transactions in is not necessarily the same order that such transactions complete at the slave device, because transactions might be re-ordered by the interconnect unless some ordering is explicitly enforced.

A write to memory is observed when it reaches a point in the memory system in which it becomes visible. When it is visible, it is coherent to all the Observers in the specified Shareability domain, as specified in the memory barrier instruction. If a PE writes to a memory location, the write is observable if another PE would see the updated value if it read the same location. For example, if the memory is Normal cacheable, the write is observable when it reaches the coherent data caches of that Shareability domain.

A simple way to describe observability is to say that “I have observed your write when I can read what you wrote and I have observed your read when I can no longer change the value you read” where both I and you refer to cores or other masters in the system.


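(1) Observers are first grouped: each Observer watches the shared memory within its own Shareability domain.

(2) What is observed has two aspects:

• The action: the shared memory is written or read by an Observer.

• The result: from the viewpoint of the Observer that operates on the shared memory, the point at which the other Observers in the same Shareability domain can see the effect.

(3) The result can be expressed with a single term: observability.

• A PE is an Observer. When PE-Core-A writes to a memory address, the write takes some number of cycles to complete, and it has been observed only once another core, PE-Core-B, can see the new value. Conversely, when PE-Core-B reads that address, the read has been observed by PE-Core-A only once PE-Core-A can no longer change the value that PE-Core-B reads.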


1.1.3 Order


While the effect of ordering is largely hidden from the programmer within a single PE, the microarchitectural innovations have a profound impact on the ordering of memory accesses. Write buffering, speculation, and cache coherency protocols, in particular, can all mean that the order in which memory accesses occur, as seen by an external observer, differs significantly from the order of accesses that would appear in the SEM. This is usually invisible in a uniprocessor environment, but the effect becomes much more significant when multiple PEs are trying to communicate with memory. In reality, these effects are often only significant at particular synchronization boundaries between the different threads of execution.


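1.2 Memory Barriers

A memory barrier, also called a memory fence or synchronization barrier, is an instruction that enforces memory access ordering and forces synchronization events. It is an essential concept in modern computer systems, especially multiprocessor and multi-core systems. The ARM architecture, one of the mainstream architectures for mobile and embedded devices, likewise provides memory barriers to ensure data consistency and the correct ordering of instruction execution.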

The Arm architecture is a weakly ordered memory architecture that supports out of order completion. Memory barrier is the general term applied to an instruction, or sequence of instructions, that forces synchronization events by a PE with respect to retiring load/store instructions. The memory barriers defined by the architecture provide a range of functionality, including:

• Ordering of load/store instructions.

• Completion of load/store instructions.

• Context synchronization.

The following subsections describe the Arm memory barrier instructions:

• Instruction Synchronization Barrier (ISB).

• Data Memory Barrier (DMB).

• Speculation Barrier (SB).

• Consumption of Speculative Data Barrier (CSDB).

• Speculative Store Bypass Barrier (SSBB).

• Profiling Synchronization Barrier (PSB).

• Physical Speculative Store Bypass Barrier (PSSBB).

• Trace Synchronization Barrier (TSB).

• Data Synchronization Barrier (DSB).

• Shareability and access limitations on the data barrier operations.

• Load-Acquire, Load-AcquirePC, and Store-Release.

• LoadLOAcquire, StoreLORelease.

• Guarded Control Stack Barrier (GCSB).


1.2.1  Data Memory Barrier(DMB)


The Data Memory Barrier (DMB) prevents the reordering of specified explicit data accesses across the barrier instruction. All explicit data load or store instructions, which are executed by the PE in program order before the DMB, are observed by all Observers within a specified Shareability domain before the data accesses after the DMB in program order.

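A DMB constrains ordering, not completion. It guarantees the relative order of all memory accesses before the DMB and all memory accesses after it: accesses that follow the DMB in program order cannot be reordered ahead of it, and accesses that precede it cannot be observed after them. A DMB does not guarantee that the earlier accesses have completed by the time the barrier executes; that stronger guarantee is what a DSB provides. A DMB affects only memory access instructions and cache maintenance instructions such as data cache operations; it does not affect the ordering of any other instructions.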


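(1) As shown in Figure 1-2, one Observer (PE-Core-A) executes the following instruction stream. Assume the memory locations addressed by X1 and X3 both initially hold 0x0.

Figure 1-2 Without a memory barrier

(2) If no barrier is applied on Observer (PE-Core-A), a read of the shared data by Observer (PE-Core-B) can encounter any of the following contexts, as shown in Figure 1-3.

Figure 1-3 Contexts without a memory barrier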


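Context 1, neither write observed:

• Observer (PE-Core-A) has not yet performed any operation on the shared data.

• Or Observer (PE-Core-A) has issued both writes but neither has completed, so other Observers cannot observe them (the writes to X1 and X3 are not yet observable).

• Observer (PE-Core-B) reads the shared memory.


Context 2, only the write to X3 observed:

• Observer (PE-Core-A) has been reordered, and the update of X3 is complete.

• Observer (PE-Core-A) has not yet written X1, or that write has not completed, so other Observers cannot observe it (the write to X1 is not yet observable).

• Observer (PE-Core-B) reads the shared memory.


Context 3, only the write to X1 observed:

• Observer (PE-Core-A) has completed the write to X1.

• Observer (PE-Core-A) has not started or not completed the write to X3 (the write to X3 is not yet observable).

• Observer (PE-Core-B) reads the shared memory.


Context 4, both writes observed:

• Observer (PE-Core-A) has completed the writes to X1 and X3.

• Observer (PE-Core-B) reads the shared memory.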
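The contexts listed above can be enumerated mechanically. The following is a minimal model rather than the code from Figure 1-2, which is not reproduced here. It assumes PE-Core-A performs two stores in program order, first to the location addressed by X1 and then to the location addressed by X3, and it treats each store as either visible or not yet visible to PE-Core-B. The names `outcome_allowed` and `count_outcomes` are hypothetical.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical model: each flag says whether PE-Core-B already observes
 * the corresponding store from PE-Core-A (program order: X1, then X3). */
static bool outcome_allowed(bool x1_visible, bool x3_visible, bool ordered)
{
    /* If the two stores are ordered by a barrier, observing the store
     * to [X3] implies that the store to [X1] is observable as well. */
    if (ordered && x3_visible && !x1_visible)
        return false;
    return true;
}

/* Prints every context PE-Core-B may encounter and returns how many exist. */
static int count_outcomes(bool ordered)
{
    int n = 0;
    for (int x1 = 0; x1 <= 1; x1++) {
        for (int x3 = 0; x3 <= 1; x3++) {
            if (outcome_allowed(x1 != 0, x3 != 0, ordered)) {
                printf("[X1] %s, [X3] %s\n",
                       x1 ? "new" : "old", x3 ? "new" : "old");
                n++;
            }
        }
    }
    return n;
}
```

Calling `count_outcomes(false)` lists the contexts of the unordered case. Passing `true` models two stores separated by an ordering barrier and removes the single context in which X3 is updated while X1 still holds its old value.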

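(3) Add a DMB memory barrier to the code fragment in Figure 1-2, leaving everything else unchanged.

Figure 1-4 With a memory barrier

(4) With the barrier applied on Observer (PE-Core-A), a read of the shared data by Observer (PE-Core-B) can encounter the following contexts, as shown in Figure 1-5.

Figure 1-5 Contexts with a memory barrier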


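Context 1, neither write observed:

• Observer (PE-Core-A) has not yet performed any operation on the shared data.

• Or Observer (PE-Core-A) has issued both writes but neither has completed, so other Observers cannot observe them (the writes to X1 and X3 are not yet observable).

• Observer (PE-Core-B) reads the shared memory.


Context 2, only the write to X1 observed:

• Observer (PE-Core-A) has completed the write to X1, while the write to X3 has not started or not completed (the write to X3 is not yet observable).

• Observer (PE-Core-B) reads the shared memory.


Context 3, both writes observed:

• Observer (PE-Core-A) has completed the writes to X1 and X3.

• Observer (PE-Core-B) reads the shared memory.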
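In C, this data-then-flag pattern is usually written with C11 atomics rather than a raw DMB. The sketch below is an illustration under stated assumptions and is not the code from Figure 1-4: `payload`, `ready`, `publish` and `try_consume` are hypothetical names, and although AArch64 compilers typically lower these fences to a DMB variant, the exact instruction is the compiler's choice. The reading side needs ordering of its own: a barrier on the writer alone does not stop the reader from reordering its two loads.

```c
#include <stdatomic.h>
#include <stdbool.h>

static int payload;        /* plays the role of the location addressed by X1 */
static atomic_int ready;   /* plays the role of the location addressed by X3 */

/* Writer (PE-Core-A): data first, then the barrier, then the flag. */
static void publish(int value)
{
    payload = value;
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&ready, 1, memory_order_relaxed);
}

/* Reader (PE-Core-B): flag first, then the barrier, then the data.
 * Returns false if the flag has not been observed yet. */
static bool try_consume(int *out)
{
    if (atomic_load_explicit(&ready, memory_order_relaxed) == 0)
        return false;
    atomic_thread_fence(memory_order_acquire);
    *out = payload;
    return true;
}
```

When `publish` and `try_consume` run on different threads, a reader that sees `ready == 1` is guaranteed to read the published `payload`, which corresponds to ruling out the context where X3 is updated but X1 is not.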


1.2.2 Data Synchronization Barrier(DSB)


A DSB is a memory barrier that ensures that those memory accesses that occur before the DSB have completed before the completion of the DSB instruction. In doing this, it acts as a stronger barrier than a DMB. All the ordering that a DMB creates with specific arguments is also generated by a DSB with the same arguments.

A DSB that is executed by a PE completes when:

• All explicit memory accesses of the required access types that appear in program order before the DSB are complete for the set of observers in the required Shareability domain.

• If the argument specified in the DSB is reads and writes, then all cache maintenance instructions and all TLB maintenance instructions that are issued by the PE before the DSB are complete for the required Shareability domain.

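A DSB is stricter than a DMB: the instructions after it execute only once all the memory accesses before it have completed, so every later instruction waits for the accesses that precede the DSB. All cache, branch predictor, and TLB maintenance operations issued before the DSB must also have completed. Figure 1-6 illustrates this with an example:

Figure 1-6 Using the DSB memory barrier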





The completion of a read is easier to explain than the completion of a write. This is because the completion of a read is the point at which read data is returned to the architectural general-purpose registers of the PE.
The completion of a write is more complicated. For a write to Device memory, the point at which the write is complete depends on the Early-write acknowledgment attribute that is specified in the Device memory type, as described in Device memory in the Armv8-A memory model guide. If the memory system supports Early-write acknowledgment, then the DSB can retire before the write has reached the end peripheral. A write to memory that is defined as Device-nGnRnE can only complete when the write response comes from the end peripheral.


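(1) Completion of a memory access is split into "read completion" and "write completion".

(2) A read is complete when the data has travelled through the PE core's memory subsystem and reached the execution units (EUs) in the back end.

(3) Write completion is more involved and has to be judged case by case, but it likewise counts as complete only once the corresponding response has been delivered back to the back-end EUs.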
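A common place where completion, rather than mere ordering, matters is a write to a device register that must take effect before the PE carries on. The sketch below assumes a hypothetical interrupt-clear register and a hypothetical helper `ack_irq`. The `dsb sy` instruction is only emitted when compiling for AArch64; on other targets a C11 fence stands in so that the sketch still builds, and that fence is not equivalent to a DSB. Whether completion means the write has actually reached the peripheral depends on the Device memory type, as the quoted text on Early-write acknowledgment explains.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Full-system DSB on AArch64. Elsewhere a C11 fence is used only so the
 * sketch compiles; it does NOT provide DSB completion semantics. */
static inline void dsb_sy(void)
{
#if defined(__aarch64__)
    __asm__ volatile("dsb sy" ::: "memory");
#else
    atomic_thread_fence(memory_order_seq_cst);
#endif
}

/* Hypothetical helper: clear interrupt bits at a device and do not
 * continue until that write is complete. */
static void ack_irq(volatile uint32_t *irq_clear_reg, uint32_t mask)
{
    *irq_clear_reg = mask;   /* write to what would be Device memory */
    dsb_sy();                /* wait for completion, not just ordering */
}
```

On real hardware `irq_clear_reg` would point into a Device-mapped region; an ordinary variable can stand in when trying the sketch out in user space.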


For all memory, the completion rules are defined as:

• A Memory Read effect R1 to a Location is complete for a shareability domain when all of the following are true:

— Any write to the same Location by an observer within the shareability domain will be Coherence-after R1.

— Any translation table walks associated with R1 are complete for that shareability domain.

• A Memory Write effect W1 to a Location is complete for a shareability domain when all of the following are true:

— Any write to the same Location by an observer within the shareability domain will be Coherence-after W1.

— Any read to the same Location by an observer within the shareability domain will either Reads-from W1 or Reads-from a Memory Write effect that is Coherence-after W1.

— Any translation table walks associated with the write are complete for that shareability domain.

• A translation table walk is complete for a shareability domain when the memory accesses, including the updates to translation table entries, associated with the translation table walk are complete for that shareability domain, and the TLB is updated.

• A cache maintenance instruction is complete for a shareability domain when the memory effects of the instruction are complete for that shareability domain, and any translation table walks that arise from the instruction are complete for that shareability domain.

• A TLB invalidate instruction is complete when all memory accesses using the TLB entries that have been invalidated are complete.



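Figure 1-7 Using the DSB memory barrier


Figure 1-8 Specifying the scope of a memory barrier

In the example the barrier instruction is given the argument ISHST, so it applies only to Observers inside the Inner Shareable domain and only to store operations. The following contexts are then possible:

(1) Instructions in the code fragment other than stores may still be reordered on the current Observer, PE0.

(2) Other Observers in the current Inner Shareable domain see the writes to the shared data X1 and X3 in unchanged order.

(3) Observers outside the current Inner Shareable domain may see the writes to X1 and X3 in any order; that is, the write to X3 may take effect before the write to X1.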
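The barrier argument is encoded in the 4-bit CRm field of the DMB and DSB instructions, and the two halves of that field line up with the two restrictions described above: bits [3:2] select the Shareability domain and bits [1:0] select the access types. The decoder below is a small sketch of that mapping; the function names are hypothetical, and encodings whose low two bits are zero are not modeled.

```c
#include <stdio.h>

/* Bits [3:2] of CRm: which Shareability domain the barrier applies to. */
static const char *barrier_domain(unsigned crm)
{
    switch ((crm >> 2) & 3u) {
    case 0:  return "Outer Shareable";
    case 1:  return "Non-shareable";
    case 2:  return "Inner Shareable";
    default: return "Full system";
    }
}

/* Bits [1:0] of CRm: which accesses the barrier orders. */
static const char *barrier_accesses(unsigned crm)
{
    switch (crm & 3u) {
    case 1:  return "loads before, loads and stores after";
    case 2:  return "stores only";
    case 3:  return "loads and stores";
    default: return "not modeled in this sketch";
    }
}

/* Formats, for example, CRm = 0xA (ISHST) as a readable description. */
static void describe_barrier(unsigned crm, char *buf, size_t len)
{
    snprintf(buf, len, "%s, %s", barrier_domain(crm), barrier_accesses(crm));
}
```

For instance, ISHST is CRm = 0b1010, which decodes to the Inner Shareable domain with stores only, and SY is CRm = 0b1111, the full-system barrier for loads and stores.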

1.2.3 Instruction Synchronization Barrier(ISB)


This is used to guarantee that any subsequent instructions are fetched, again, so that privilege and access are checked with the current MMU configuration. It is used to ensure any previously executed context-changing operations, such as writes to system control registers, have completed by the time the ISB completes. In hardware terms, this might mean that the instruction pipeline is flushed, for example. Typical uses of this would be in memory management, cache control, and context switching code, or where code is being moved about in memory.

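An ISB waits for all preceding instructions to complete, then flushes the pipeline and the instruction prefetch queue so that the instructions executed afterwards are fetched afresh and see the latest context. This avoids executing stale or invalid instructions. Use an ISB wherever program state is modified, the instruction cache is invalidated, or instruction ordering must be guaranteed, so that the instructions already in the pipeline are discarded and the following ones execute correctly. It is often described as stricter than DMB and DSB, but it is better understood as a different kind of barrier: it synchronizes context and the instruction stream rather than ordering data accesses, which is why it is commonly paired with a DSB. An ISB is typically used to make context changes take effect, such as VMID and ASID changes or TLB maintenance operations. This touches on ARM context switching, which is a large topic of its own; a dedicated article is planned, so it is not discussed further here.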

1.3 Usage Scenarios for Memory Barriers




Figure 1-9 Memory barriers in multithreaded programming




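Figure 1-10 Memory barriers in a device driver

In this example, assume that the PE core running the driver code and the DMA controller form one group of Observers. The wmb() call then guarantees that the DMA controller observes the start address of the data block to be transferred before it observes the command that starts the transfer, so that when the controller later performs the transfer it sends the correct data block.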
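The driver pattern described above can be sketched in user-space C. Everything here is hypothetical and is not the code from Figure 1-10: `struct dma_regs` is an invented register layout, `dma_kick` is an invented helper, and `sketch_wmb` is a stand-in for the kernel's write barrier, built on a C11 fence so that the example compiles outside a kernel.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical register block of a DMA controller (not a real device). */
struct dma_regs {
    volatile uint64_t src_addr;  /* start address of the data block */
    volatile uint32_t length;    /* number of bytes to transfer */
    volatile uint32_t start;     /* writing 1 starts the transfer */
};

/* Stand-in for a kernel-style write barrier; an assumption made so the
 * sketch builds in user space, not the kernel's actual implementation. */
static inline void sketch_wmb(void)
{
    atomic_thread_fence(memory_order_release);
}

static void dma_kick(struct dma_regs *regs, uint64_t buf, uint32_t len)
{
    regs->src_addr = buf;   /* 1. describe the transfer */
    regs->length   = len;
    sketch_wmb();           /* 2. descriptor writes ordered before start */
    regs->start    = 1;     /* 3. tell the controller to begin */
}
```

Without step 2, the controller could observe the start command before the address, and begin transferring from whatever address the register held previously.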



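(1) Operations on certain system registers also affect the order that Observers see.

(2) Further memory barrier scenarios, such as context switching, memory management, and synchronization in multiprocessor systems.

(3) Performance considerations for memory barriers, and what counts as reasonable use of them.

(4) The scenarios in which the criteria for memory access completion apply.

(5) And so on.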





MMU - Memory Management Unit

TLB - Translation Lookaside Buffer

VIPT - Virtual Index Physical Tag

VIVT - Virtual Index Virtual Tag

PIPT - Physical Index Physical Tag

VA - Virtual Address

PA - Physical Address

IPS - Intermediate Physical Space

IPA - Intermediate Physical Address

VMID - Virtual Machine Identifier

VTTBR_EL2 - Virtualization Translation Table Base Register

ASID - Address Space Identifier

DMC - Dynamic Memory Controller

DDR SDRAM - Double Data Rate Synchronous Dynamic Random Access Memory

TBI - Top Byte Ignore

DMB - Data Memory Barrier

DSB - Data Synchronization Barrier

ISB - Instruction Synchronization Barrier

DSU - DynamIQ™ Shared Unit

SoC - System on Chip
