[A-17] ARMv8/ARMv9 Memory: The Memory Barrier Mechanism (Observer & Barrier)

Digest 2024-11-09 23:16 Liaoning

ver0.3

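Preface

If you have read this far in the series, you have already done well, because by now you understand ARM's weakly ordered memory architecture. You know the memory types, the memory sharing mechanism, and the memory coherency mechanism, and you also understand the CPU architecture and the processor microarchitecture. All of this lays a solid foundation for today's topic. Processors evolve in three directions: faster, higher, stronger. "Faster" means a higher clock frequency and more instructions executed per clock cycle. "Higher" means that the PE-Core introduces more hardware mechanisms so that the "faster" performance can be fully exploited, such as the ARM weakly ordered memory model mentioned in earlier articles. "Stronger" refers to richer processing capabilities at the functional level, such as floating-point and vector computation. Seen from the PE-Core, more hardware capability means more efficient code execution. These optimizations act like an invisible hand steering the program's instruction stream, but they also bring a degree of uncertainty into the computing world. The direct manifestation of this uncertainty is the reordering of memory accesses, which in some situations is not what we want. In automotive functional safety, for example, the interaction between the system and its counterpart components must fit together seamlessly, with no inconsistency anywhere. In such cases the ARM architecture has to provide mechanisms that programmers can use to write programs they can trust. These mechanisms act like a visible hand controlling the execution of the whole system. With that, let us turn to the topic of this article: the memory barrier mechanism.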

Main Text

1.1 Observer

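Before introducing the memory barrier mechanism, we need to understand an important concept: the Observer. Although earlier articles covered much of the background for this one, it is still worth spending a little time on a precise explanation of the Observer, because it helps a great deal in understanding ARM's memory barrier mechanism.

1.1.1 The Concept of an Observer

First, let us look at how the ARM manuals describe an Observer: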

Observer: A PE or mechanism in the system, such as a peripheral device, that can generate reads from or writes to memory.

• An Observer refers to either a Processor Element (PE) or some other mechanism in the system, such as a peripheral device, that can generate reads from, or writes to, memory. Observers can observe memory accesses.

• The ARM Architecture Reference Manual defines certain key words, in particular, the terms observe and must be observed. In typical systems, this defines how the bus interface of a master, for example, a core or GPU and the interconnect, must handle bus transactions. Only masters are able to observe transfers. All bus transactions are initiated by a master.

The architecture considers the following as separate Observers:

• The instruction interface of the core, typically called the Instruction Fetch Unit (IFU)

• The data interface, typically called the Load Store Unit (LSU)

• The MMU table walk unit

As described in Who is an Observer?, an Observer is something that can make memory accesses. For example, the MMU generates reads to walk translation tables.

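While writing this article I read a lot of material, hoping to find one definitive statement of what an Observer is in the ARM architecture. It was rather frustrating, but in the end I decided to simply quote all of it and let readers judge for themselves; there are surely sharp minds among you.

Figure 1-1 High-Level ARM SoC Architecture

Based on the manual's descriptions and Figure 1-1, we can summarize the characteristics that an Observer in the ARM architecture must have.

(1) An Observer must be a functional unit inside the SoC, such as a PE, GPU, VPU, or DMA controller.

(2) An Observer can also be a finer-grained functional unit in the front end or back end of a PE, such as the MMU or the LSU.

(3) An Observer must be tied to the system's memory model; the memory it observes is of either the Device or the Normal type.

(4) An Observer can be understood as another name for a Master in the SoC bus architecture, with the emphasis on bus transactions related to memory operations.

(5) Observers are related to one another, and coherency must be maintained between them when necessary (in hardware or in software).

Put simply, any functional unit in the SoC with the characteristics above can be regarded as an Observer in the ARM architecture. We can extend this to devices outside the SoC: in a sense they also count as Observers, because they are nodes connected indirectly to the SoC bus through peripheral interfaces, and their behavior indirectly generates bus transactions.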

1.1.2 Observability

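The previous subsection introduced Observers in the ARM architecture; they turn out to be a ring of Masters circling around memory. This raises a question: what exactly are these observers doing while the system runs? What do they observe, and when? Let us first look at the manual's description.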

The order that a master performs transactions in is not necessarily the same order that such transactions complete at the slave device, because transactions might be re-ordered by the interconnect unless some ordering is explicitly enforced.

A write to memory is observed when it reaches a point in the memory system in which it becomes visible. When it is visible, it is coherent to all the Observers in the specified Shareability domain, as specified in the memory barrier instruction. If a PE writes to a memory location, the write is observable if another PE would see the updated value if it read the same location. For example, if the memory is Normal cacheable, the write is observable when it reaches the coherent data caches of that Shareability domain.

A simple way to describe observability is to say that “I have observed your write when I can read what you wrote and I have observed your read when I can no longer change the value you read” where both I and you refer to cores or other masters in the system.

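Let us state the conclusions directly:

(1) Observers are first divided into groups: each one observes the shared memory within its own territory, according to the shareability domain.

(2) What is observed has two aspects:

• One is the action: the shared memory being written or read by an Observer.

• The other is the result: from the viewpoint of an Observer operating on shared memory, when the other Observers in the shareability domain can see that result.

(3) The fact that a result has been produced can be expressed in another way: observability.

• A PE is an Observer. When PE-Core-A writes data to a memory address, the write takes a number of cycles to complete, and only when another core, PE-Core-B, can see the result is PE-Core-A's write observable. Conversely, when PE-Core-B reads the data at that address, the read has been observed only once PE-Core-A can no longer change the value that PE-Core-B read.

The core point of the passage above is that reads and writes of shared data take time. Whether reading or writing, the shared data must be in an observable state; otherwise you bear the consequences.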

1.1.3 Order

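In the previous subsection we saw that reads and writes of shared data between Observers have to be timed correctly: only in the observable state does an Observer obtain valid data that reflects the intent of the other Observers. Clearly, this requires a mechanism between Observers that guarantees a certain order of memory accesses; otherwise, under ARM's weakly ordered memory model, there would always be a lingering worry. Here is the manual's description: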

While the effect of ordering is largely hidden from the programmer within a single PE, the microarchitectural innovations have a profound impact on the ordering of memory accesses. Write buffering, speculation, and cache coherency protocols, in particular, can all mean that the order in which memory accesses occur, as seen by an external observer, differs significantly from the order of accesses that would appear in the SEM. This is usually invisible in a uniprocessor environment, but the effect becomes much more significant when multiple PEs are trying to communicate with memory. In reality, these effects are often only significant at particular synchronization boundaries between the different threads of execution.

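We have discussed the content of this passage repeatedly in earlier articles; here it just sums things up once more: the world of memory needs barriers.

1.2 Memory Barriers (Barrier)

A memory barrier, also called a memory fence or synchronization barrier, is an instruction used to enforce the order of memory accesses and to force synchronization events. In modern computer systems, especially multiprocessor or multicore systems, the memory barrier is a crucial concept. As one of the mainstream architectures for mobile devices and embedded systems, ARM likewise provides memory barriers to ensure data consistency and the correct ordering of instruction execution.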

The Arm architecture is a weakly ordered memory architecture that supports out of order completion. Memory barrier is the general term applied to an instruction, or sequence of instructions, that forces synchronization events by a PE with respect to retiring load/store instructions. The memory barriers defined by the architecture provide a range of functionality, including:

• Ordering of load/store instructions.

• Completion of load/store instructions.

• Context synchronization.

The following subsections describe the Arm memory barrier instructions:

• Instruction Synchronization Barrier (ISB).

• Data Memory Barrier (DMB).

• Speculation Barrier (SB).

• Consumption of Speculative Data Barrier (CSDB).

• Speculative Store Bypass Barrier (SSBB).

• Profiling Synchronization Barrier (PSB).

• Physical Speculative Store Bypass Barrier (PSSBB).

• Trace Synchronization Barrier (TSB).

• Data Synchronization Barrier (DSB).

• Shareability and access limitations on the data barrier operations.

• Load-Acquire, Load-AcquirePC, and Store-Release.

• LoadLOAcquire, StoreLORelease.

• Guarded Control Stack Barrier (GCSB).

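In the ARM architecture, memory barriers mainly use instructions to prevent the processor from reordering memory accesses, ensuring data consistency and the correct order of instructions. We do not intend to cover every one of these instructions. Instead we focus on the three best known (DMB, DSB, and ISB) to help readers understand the basic principles of memory barriers in the ARM architecture.

1.2.1 Data Memory Barrier (DMB)

First, the manual's description: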

The Data Memory Barrier (DMB) prevents the reordering of specified explicit data accesses across the barrier instruction. All explicit data load or store instructions, which are executed by the PE in program order before the DMB, are observed by all Observers within a specified Shareability domain before the data accesses after the DMB in program order.

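In other words, a DMB guarantees the order between all memory accesses before it and all memory accesses after it: as seen by the Observers in the specified shareability domain, the accesses before the DMB take effect before the accesses after it, and the processor does not reorder accesses from behind the DMB to in front of it. A DMB does not guarantee that the earlier memory accesses have completed by the time the barrier itself finishes; it only guarantees the relative order of the memory accesses on either side of it. A DMB affects only memory access instructions and data cache or cache maintenance instructions; it does not affect the order of other instructions.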

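We explain DMB through an example:

(1) As shown in Figure 1-2, an Observer (PE-Core-A) executes the following instruction stream. Assume that the locations addressed by X1 and X3 both initially hold 0x0.

Figure 1-2 Without a memory barrier

(2) If no barrier is used on Observer (PE-Core-A), a read of the shared data by Observer (PE-Core-B) may happen in any of the following contexts, as shown in Figure 1-3.

Figure 1-3 Contexts without a memory barrier

Context-1:

• Observer (PE-Core-A) has not performed any operation on the shared data.

• Or Observer (PE-Core-A) has started both operations but neither has completed, so other Observers cannot observe its writes (X1 and X3 are not yet observable).

• Observer (PE-Core-B) reads the shared memory.

Context-2:

• Observer (PE-Core-A) was reordered, and the write to X3 has completed.

• Observer (PE-Core-A) has not yet operated on X1, or its operation on X1 has not completed, so other Observers cannot observe its write to X1 (X1 is not yet observable).

• Observer (PE-Core-B) reads the shared memory.

Context-3:

• Observer (PE-Core-A) has completed the write to X1.

• Observer (PE-Core-A) has not started or has not completed the write to X3 (X3 is not yet observable).

• Observer (PE-Core-B) reads the shared memory.

Context-4:

• Observer (PE-Core-A) has completed the writes to X1 and X3.

• Observer (PE-Core-B) reads the shared memory.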

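(3) Add a DMB memory barrier to the code fragment in Figure 1-2, leaving everything else unchanged.

Figure 1-4 With a memory barrier

(4) With the barrier in place on Observer (PE-Core-A), a read of the shared data by Observer (PE-Core-B) may happen in any of the following contexts, as shown in Figure 1-5.

Figure 1-5 Contexts with a memory barrier

Context-1:

• Observer (PE-Core-A) has not performed any operation on the shared data.

• Or Observer (PE-Core-A) has started both operations but neither has completed, so other Observers cannot observe its writes (X1 and X3 are not yet observable).

• Observer (PE-Core-B) reads the shared memory.

Context-2:

• Observer (PE-Core-A) has completed the operation on X1; the operation on X3 has not started or has not completed (X3 is not yet observable).

• Observer (PE-Core-B) reads the shared memory.

Context-3:

• Observer (PE-Core-A) has completed the operations on X1 and X3.

• Observer (PE-Core-B) reads the shared memory.

The small experiment above shows that with a DMB in place, Observer (PE-Core-A) no longer reorders the operation on X3 ahead of the one on X1, which guarantees the order in which other Observers, such as Observer (PE-Core-B), observe the operations on X1 and X3. At this point only the order is guaranteed, though; the DMB does not guarantee that the operation on X1 has actually taken effect.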
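The two context lists above can be reproduced with a small model. The following C sketch is not ARM code and does not execute any barrier. It assumes that the fragment in the figures is two stores, first to the location addressed by X1 and then to the location addressed by X3 (the figures are not reproduced here, so this is an assumption), and it models only the order in which PE-Core-A's writes become visible. The names `state_allowed`, `count_contexts`, and `print_contexts` are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Bit flags: which of PE-Core-A's two writes PE-Core-B can already see. */
enum { X1_VISIBLE = 1u << 0, X3_VISIBLE = 1u << 1 };

/* Returns true if PE-Core-B may find the shared data in this state. */
bool state_allowed(unsigned state, bool dmb_between_stores)
{
    assert(state < 4u);
    bool x1 = (state & X1_VISIBLE) != 0;
    bool x3 = (state & X3_VISIBLE) != 0;
    /* With a DMB between the stores, X3 may not become visible before X1. */
    if (dmb_between_stores && x3 && !x1)
        return false;
    return true;
}

/* Counts the distinct states (the "contexts" in the text) that remain possible. */
int count_contexts(bool dmb_between_stores)
{
    int count = 0;
    for (unsigned state = 0; state < 4u; state++) {
        if (state_allowed(state, dmb_between_stores))
            count++;
    }
    return count;
}

/* Prints one line per allowed state, for example "X1=new X3=old". */
void print_contexts(bool dmb_between_stores)
{
    for (unsigned state = 0; state < 4u; state++) {
        if (!state_allowed(state, dmb_between_stores))
            continue;
        printf("X1=%s X3=%s\n",
               (state & X1_VISIBLE) ? "new" : "old",
               (state & X3_VISIBLE) ? "new" : "old");
    }
}
```

In this model `count_contexts(false)` yields 4 and `count_contexts(true)` yields 3, matching the four contexts without a barrier and the three contexts with one. The only state the DMB removes is "X3 new, X1 old".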

1.2.2 Data Synchronization Barrier (DSB)

First, the manual:

A DSB is a memory barrier that ensures that those memory accesses that occur before the DSB have completed before the completion of the DSB instruction. In doing this, it acts as a stronger barrier than a DMB. All the ordering that a DMB creates with specific arguments is also generated by a DSB with the same arguments.

A DSB that is executed by a PE completes when:

• All explicit memory accesses of the required access types appear in program order before the DSB are complete for the set of observers in the required Shareability domain.

• If the argument specified in the DSB is reads and writes, then all cache maintenance instructions and all TLB maintenance instructions that are issued by the PE before the DSB are complete for the required Shareability domain.

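A DSB is stricter than a DMB: instructions after it execute only once all memory accesses before it have completed. That is, every later instruction, not just memory accesses, waits for the accesses before the DSB to complete. All cache, branch predictor, and TLB maintenance operations issued before the DSB must also have completed. Figure 1-6 illustrates this with an example:

Figure 1-6 Using the DSB memory barrier

If a DMB were used in the example above, the ADD instruction could still be reordered on the current Observer (PE-Core). With a DSB, ADD can never execute at any point before the DSB, and it can execute only after the current Observer (PE-Core) has completed its operation on X1.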

Completion

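The difference between DSB and DMB is fairly easy to grasp, but to understand it thoroughly you also need to know what it means for an access to have finished executing, that is, completion.

First, a simple version: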

The completion of a read is easier to explain than the completion of a write. This is because the completion of a read is the point at which read data is returned to the architectural general-purpose registers of the PE.

The completion of a write is more complicated. For a write to Device memory, the point at which the write is complete depends on the Early-write acknowledgment attribute that is specified in the Device memory type, as described in Device memory in the Armv8-A memory model guide. If the memory system supports Early-write acknowledgment, then the DSB can retire before the write has reached the end peripheral. A write to memory that is defined as Device-nGnRnE can only complete when the write response comes from the end peripheral.

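To summarize briefly:

(1) Completion of a memory access is divided into "read completion" and "write completion".

(2) A read is complete when the data has travelled through the PE-Core's memory subsystem and reached the execution units (EUs) in the back end, that is, when it has been returned to the PE's general-purpose registers.

(3) Write completion is more complicated and has to be judged case by case (for example, by the Device memory type and its Early-write acknowledgment attribute), but here too the write counts as complete only once the corresponding response has been delivered back to the EUs in the back end.

That was the simple version. In practice the cases inside the ARM architecture are more complex, and the criteria differ with context. We only quote part of the rules from the manual here and do not discuss them further.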

For all memory, the completion rules are defined as:

• A Memory Read effect R1 to a Location is complete for a shareability domain when all of the following are true:

— Any write to the same Location by an observer within the shareability domain will be Coherence-after R1.

— Any translation table walks associated with R1 are complete for that shareability domain.

• A Memory Write effect W1 to a Location is complete for a shareability domain when all of the following are true:

— Any write to the same Location by an observer within the shareability domain will be Coherence-after W1.

— Any read to the same Location by an observer within the shareability domain will either Reads-from W1 or Reads-from a Memory Write effect that is Coherence-after W1.

— Any translation table walks associated with the write are complete for that shareability domain.

• A translation table walk is complete for a shareability domain when the memory accesses, including the updates to translation table entries, associated with the translation table walk are complete for that shareability domain, and the TLB is updated.

• A cache maintenance instruction is complete for a shareability domain when the memory effects of the instruction are complete for that shareability domain, and any translation table walks that arise from the instruction are complete for that shareability domain.

• A TLB invalidate instruction is complete when all memory accesses using the TLB entries that have been invalidated are complete.

Scope

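DSB and DMB also each have a scope in which they operate, as shown in Figure 1-7:

Figure 1-7 Using the DSB memory barrier

Once a DSB or DMB is given an option argument, the barrier takes effect within a defined scope, as shown in Figure 1-8.

Figure 1-8 Specifying the scope of a memory barrier

In the example the barrier instruction is given the argument ISHST, which means the barrier applies only to Observers within the Inner Shareable domain and only to store operations. The following contexts can then occur: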

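(1) Instructions in the fragment other than stores may still be reordered on the current Observer, PE0.

(2) Other Observers in the current Inner Shareable domain observe the writes to the shared data X1 and X3 in unchanged order.

(3) Observers outside the current Inner Shareable domain may observe the writes to the shared data X1 and X3 in any order; that is, X3 may take effect before X1.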
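The option argument of DMB and DSB combines a shareability domain with the access types to be ordered. As a minimal sketch, the lookup table below lists the option mnemonics as I understand them from the Arm architecture reference manual; the type `barrier_scope` and the function `lookup_barrier_scope` are hypothetical names introduced only for this illustration, and the table should be checked against the manual before relying on it.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct {
    const char *option;   /* mnemonic written after DMB or DSB */
    const char *domain;   /* shareability domain the barrier covers */
    const char *accesses; /* accesses ordered, written as "before-after" */
} barrier_scope;

static const barrier_scope BARRIER_SCOPES[] = {
    {"OSHLD", "Outer Shareable", "Load-Load, Load-Store"},
    {"OSHST", "Outer Shareable", "Store-Store"},
    {"OSH",   "Outer Shareable", "Any-Any"},
    {"NSHLD", "Non-shareable",   "Load-Load, Load-Store"},
    {"NSHST", "Non-shareable",   "Store-Store"},
    {"NSH",   "Non-shareable",   "Any-Any"},
    {"ISHLD", "Inner Shareable", "Load-Load, Load-Store"},
    {"ISHST", "Inner Shareable", "Store-Store"},
    {"ISH",   "Inner Shareable", "Any-Any"},
    {"LD",    "Full system",     "Load-Load, Load-Store"},
    {"ST",    "Full system",     "Store-Store"},
    {"SY",    "Full system",     "Any-Any"},
};

/* Returns the scope for an option mnemonic, or NULL if it is unknown. */
const barrier_scope *lookup_barrier_scope(const char *option)
{
    assert(option != NULL);
    size_t n = sizeof BARRIER_SCOPES / sizeof BARRIER_SCOPES[0];
    for (size_t i = 0; i < n; i++) {
        if (strcmp(BARRIER_SCOPES[i].option, option) == 0)
            return &BARRIER_SCOPES[i];
    }
    return NULL;
}
```

Looking up "ISHST" returns the Inner Shareable domain with Store-Store ordering, which is the case discussed above: only stores are ordered, and only for Observers inside the Inner Shareable domain.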

1.2.3 Instruction Synchronization Barrier (ISB)

First, the manual:

This is used to guarantee that any subsequent instructions are fetched, again, so that privilege and access are checked with the current MMU configuration. It is used to ensure any previously executed context-changing operations, such as writes to system control registers, have completed by the time the ISB completes. In hardware terms, this might mean that the instruction pipeline is flushed, for example. Typical uses of this would be in memory management, cache control, and context switching code, or where code is being moved about in memory.

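An ISB waits for all preceding instructions to complete and flushes the pipeline and the instruction prefetch queue, so that the instructions after it are fetched again and execute with the latest context. This avoids executing stale or invalid instructions. Use an ISB wherever program state is modified, the instruction cache is refreshed, or instruction ordering must be guaranteed. It is often described as stricter than DMB and DSB, and it is typically used to make context-changing operations take effect, such as VMID or ASID changes and TLB maintenance operations. This touches on ARM context switching, a very large topic for which we plan dedicated articles, so we do not go into it here.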
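To make the three instructions from sections 1.2.1 to 1.2.3 concrete, here is a minimal C sketch of how they might be wrapped with GCC/Clang inline assembly. The helper names and the function `update_config` are hypothetical. On non-AArch64 targets the helpers fall back to C11 fences so the file still compiles, but those fallbacks are only stand-ins and are not equivalent to the ARM instructions.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Orders earlier memory accesses before later ones (Inner Shareable). */
static inline void barrier_dmb_ish(void)
{
#if defined(__aarch64__)
    __asm__ volatile("dmb ish" ::: "memory");
#else
    atomic_thread_fence(memory_order_seq_cst); /* stand-in only */
#endif
}

/* Waits until earlier memory accesses have completed (Inner Shareable). */
static inline void barrier_dsb_ish(void)
{
#if defined(__aarch64__)
    __asm__ volatile("dsb ish" ::: "memory");
#else
    atomic_thread_fence(memory_order_seq_cst); /* stand-in only */
#endif
}

/* Makes later instructions be fetched again after the barrier. */
static inline void barrier_isb(void)
{
#if defined(__aarch64__)
    __asm__ volatile("isb" ::: "memory");
#else
    atomic_signal_fence(memory_order_seq_cst); /* compiler barrier only */
#endif
}

/*
 * Hypothetical update sequence. In real system code the first store would
 * typically be a system register write or a TLB/cache maintenance operation.
 */
int update_config(volatile int *config, volatile int *generation, int new_value)
{
    assert(config != NULL && generation != NULL);
    *config = new_value;
    barrier_dmb_ish(); /* config store is observed before the generation store */
    *generation += 1;
    barrier_dsb_ish(); /* both stores have completed before we continue */
    barrier_isb();     /* later instructions run with the updated state */
    return *generation;
}
```

The sequence shows the usual division of labour: DMB when only the order matters, DSB when later work must wait for completion, and ISB after the DSB when the change affects how subsequent instructions are fetched or executed.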

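1.3 Usage Scenarios for Memory Barriers

We outline two concrete scenarios to help readers further understand how memory barriers work.

(1) Data synchronization in multithreaded programming

In multithreaded programming, several threads may access shared data at the same time. To keep the data correct and consistent, memory barrier instructions are needed to enforce the order of accesses to the shared data. For example, after updating shared data, a DMB or DSB ensures that the update becomes visible to other threads, as shown in Figure 1-9.

Figure 1-9 Memory barriers in multithreaded programming

In this example we assume that PE0 and PE1 form one group of Observers. On PE0, smp_wmb() ensures that the write to value completes before the write to flag; on PE1, smp_rmb() ensures that the read of flag has completed before value is read. Without these barriers, the processor or the compiler could reorder the operations, and the consumer might read uninitialized or partially initialized data.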
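The same value/flag pattern can be sketched in portable C11 rather than with the Linux kernel macros. This is an illustration under assumptions, not the code from Figure 1-9: the names `shared_value`, `ready_flag`, `producer_publish`, and `consumer_try_read` are invented here, and the C11 release and acquire fences only play a role similar to smp_wmb() and smp_rmb() (they are somewhat stronger than a store-only or load-only barrier).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

static int shared_value;      /* payload, ordinary non-atomic data */
static atomic_int ready_flag; /* 0 = not published, 1 = published */

/* Producer side (PE0 in the text): write the value, then raise the flag. */
void producer_publish(int v)
{
    shared_value = v;
    /* Similar in role to smp_wmb(): the value store is ordered before the flag store. */
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&ready_flag, 1, memory_order_relaxed);
}

/* Consumer side (PE1 in the text): returns false if nothing is published yet. */
bool consumer_try_read(int *out)
{
    assert(out != NULL);
    if (atomic_load_explicit(&ready_flag, memory_order_relaxed) == 0)
        return false;
    /* Similar in role to smp_rmb(): the flag load is ordered before the value load. */
    atomic_thread_fence(memory_order_acquire);
    *out = shared_value;
    return true;
}
```

In a real program `producer_publish` and `consumer_try_read` would run on different threads, with the consumer polling until it returns true. Removing either fence would allow the consumer to see the flag set while still reading a stale value.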

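(2) Register access in device drivers

In device driver programming, reads and writes of device registers must follow a strict order. Memory barrier instructions enforce the ordering and synchronization of these register accesses, avoiding potential errors and side effects.

Figure 1-10 Memory barriers in a driver

In this example we assume that the PE-Core executing the driver code and the DMA controller form one group of Observers. Here vmb() guarantees that the DMA controller observes the start address of the data block to be transferred before it observes the command that starts the transfer, so that when the controller subsequently performs the transfer it sends the correct block.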
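The ordering requirement in this driver scenario can be sketched as follows. Everything here is hypothetical: the register layout `fake_dma_regs`, the function `dma_kick`, and the address used in any call are invented for illustration, and the structure lives in ordinary memory so that the sketch runs on any machine. A C11 release fence stands in for the barrier shown in the figure; a real Linux driver would use the kernel's own barrier or MMIO accessor functions on a mapped device.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical register block; a real DMA controller defines its own layout. */
typedef struct {
    volatile uint64_t src_addr; /* start address of the block to transfer */
    volatile uint32_t length;   /* number of bytes to transfer */
    volatile uint32_t start;    /* writing 1 starts the transfer */
} fake_dma_regs;

/* Programs the transfer parameters, then starts the transfer. */
void dma_kick(fake_dma_regs *regs, uint64_t buf_addr, uint32_t len)
{
    assert(regs != NULL);
    regs->src_addr = buf_addr;
    regs->length = len;
    /*
     * Stand-in for the barrier in the driver example: the address and length
     * must be observed by the controller before the start command.
     */
    atomic_thread_fence(memory_order_release);
    regs->start = 1;
}
```

The design point is the position of the barrier: it sits between the writes that describe the transfer and the write that triggers it, so the controller never starts with a stale address.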

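Closing Remarks

Beyond what has been covered above, many aspects of ARM's memory barrier mechanism deserve further study:

(1) Operations on some system registers also affect the order that Observers see.

(2) More usage scenarios for memory barriers, such as context switching, memory management, and synchronization in multiprocessor systems.

(3) The performance cost of memory barriers, and what counts as reasonable use.

(4) The scenarios in which the criteria for completion of memory accesses apply.

(5) And so on.

Limited by my own time and ability, I cannot present all of them here. I hope this article serves as a starting point that draws more readers into ARM technology so that we can study it together. One reader left me a message saying they did not dare ask questions for fear of sounding unprofessional. I would say: among any three people walking together, one can be my teacher. Knowledge comes from repeated study, ability grows only through continual exchange, and truth emerges only through debate and practice. Do not be afraid of making mistakes; be a little bolder, as I am, and you will gain something. Please stay tuned.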

Reference

[00] <corelink_dmc520_technical_reference_manual_en.pdf>

[01] <corelink_dmc620_dynamic_memory_controller_trm.pdf>

[02] <IP-Controller/DDI0331G_dmc340_r4p0_trm.pdf>

[03] <80-ARM-IP-cs0001_ARMv8基础篇-400系列控制器IP.pdf>

[04] <arm_cortex_a725_core_trm_107652_0001_04_en.pdf>

[05] <DDI0487K_a_a-profile_architecture_reference_manual.pdf>

[06] <armv8_a_address_translation.pdf>

[07] <cortex_a55_trm_100442_0200_02_en.pdf>

[08] <learn_the_architecture_aarch64_memory_management_guide_en.pdf>

[09] <learn_the_architecture_armv8-a_memory_systems_en.pdf>

[10] <79-LX-LK-z0002_奔跑吧Linux内核-V-2-卷1_基础架构.pdf>

[11] <79-LX-LD-s003-Linux设备驱动开发详解4_0内核-3rd.pdf>

[12] <learn_the_architecture_memory_systems_ordering_and_barriers.pdf>

[13] <arm_dsu_120_trm_102547_0201_07_en.pdf>

[14] <80-ARM-MM-AL0001_内存学习(三):物理地址空间.pdf>

[15] <80-LX-MM-cs0002_Linux内存屏障.pdf>

Glossary

MMU - Memory Management Unit

TLB - translation lookaside buffer

VIPT - Virtual Index Physical Tag

VIVT - Virtual Index Virtual Tag

PIPT - Physical Index Physical Tag

VA - Virtual Address

PA - Physical Address

IPS - Intermediate Physical Space

IPA - Intermediate Physical Address

VMID - virtual machine identifier

VTTBR_EL2 - Virtualization Translation Table Base Registers

ASID - Address Space Identifier

DMC - Dynamic Memory Controller

DDR SDRAM - Double Data Rate Synchronous Dynamic Random Access Memory

TBI - Top Byte Ignore

DMB - Data Memory Barrier

DSB - Data Synchronization Barrier

ISB - Instruction Synchronization Barrier

DSU - DynamIQ™ Shared Unit

SOC - System on Chip

浩瀚架构师

Exploring this wondrous world together with everyone.