eBPF Talk: 耗时 10 个月，修复了又一个 tailcall 的 bug

文摘 2024-07-22 08:10 新加坡

在上周六（ 7 月 20 日）早上醒来的时候，发现个好消息：修复 tailcall hierarchy 问题的 PATCH 被合并啦。

并得到 BPF 社区大佬 Alexei 的肯定：

Thank you for sticking to it. Applied to bpf-next.
It took almost a year, but the final fix is imo much better than all the previous attempts. I suspect some tail_call users may see a small performance degradation, but it's a trade off we had to make. The evolution of the fix made the perf hit as small as possible.

哪里来的 又一个 bug？之前，在 2023 年 9 月 12 日，修复 tailcall 无限循环问题的 PATCH 被合并：

eBPF Talk: 内核 bpf 子系统贡献之修复 tailcall 无限循环

太长不读：以下记录一下修 tailcall hierarchy 问题的升仙历程，中途几度想要放弃。

第一回合：确认问题 / 2023 年 9 月 4 日

在跟 s390x BPF JIT maintainer Ilya Leoshkevich 讨论修复 tailcall 无限循环问题时，我写了个 demo 确认了 tailcall hierarchy 问题。

struct {
    __uint(type, BPF_MAP_TYPE_PROG_ARRAY);
    __uint(max_entries, 4);
    __uint(key_size, sizeof(__u32));
    __uint(value_size, sizeof(__u32));
} jmp_table SEC(".maps");

static __noinline
int subprog_tail_01(struct __sk_buff *skb)
{
    if (load_byte(skb, 0))
        bpf_tail_call_static(skb, &jmp_table, 1);
    else
        bpf_tail_call_static(skb, &jmp_table, 0);
    return 1;
}

static __noinline
int subprog_tail_23(struct __sk_buff *skb)
{
    if (load_byte(skb, 0))
        bpf_tail_call_static(skb, &jmp_table, 3);
    else
        bpf_tail_call_static(skb, &jmp_table, 2);
    return 1;
}

int count0 = 0;

SEC("tc")
int classifier_01(struct __sk_buff *skb)
{
    count0++;
    return subprog_tail_01(skb);
}

int count1 = 0;

SEC("tc")
int classifier_23(struct __sk_buff *skb)
{
    count1++;
    return subprog_tail_23(skb);
}

static __noinline
int subprog_tailcall(struct __sk_buff *skb, int index)
{
    volatile int retval = 0;
    bpf_tail_call(skb, &jmp_table, index);
    return retval;
}

SEC("tc")
int entry(struct __sk_buff *skb)
{
    subprog_tailcall(skb, 0);
    subprog_tailcall(skb, 2);

    return 0;
}

Finally, count0 is 33. And count1 is 33, too. It breaks the
MAX_TAIL_CALL_CNT limit by the way tailcall in subprog.

详情：Re: [RFC PATCH bpf-next v4 0/4] bpf, x64: Fix tailcall infinite loop^[1]。

第二回合：RFC PATCH v1 / 2023 年 10 月 5 日

在 2023 年 9 月 12 日修复 tailcall 无限循环问题的 PATCH 被合并了之后，便投入到修复 tailcall hierarchy 问题中。

不过不过，该怎么修复呢？

想法是，在栈上保存 tail call count pointer 和 tail call count，然后在传递时就只传 tail call count pointer。

 |  STACK  |
 +---------+ RBP
 |         |
 |         |
 |         |
 | tcc_ptr |
 |   tcc   |
 |   rbx   |
 +---------+ RSP

不过，看着那 diff，就觉得复杂，肯定会出问题；所以，后面就没继续这个方向了。

详情：[RFC PATCH bpf-next 0/3] bpf, x64: Fix tailcall hierarchy^[2]。

P.S. 最终被合并的 PATCH 的思路来源于这个想法；兜兜转转的，绕了不少弯路。

第三回合：RFC PATCH v2 / 2023 年 10 月 11 日

不满意前一个思路，又想了个新的思路：

    |  STACK  |
    |         |
    |   rip   |
 +->|   tcc   |
 |  |   rip   |
 |  |   rbp   |
 |  +---------+ RBP
 |  |         |
 |  |         |
 |  |         |
 +--| tcc_ptr |
    |   rbx   |
    +---------+ RSP

使用一个伪栈帧来保存 tail call count，然后传递的还是 tail call count pointer。

相比于前一个思路，这个思路更简单；但觉得 BPF 社区不会接受这样的思路，因为伪栈帧可能会破坏 unwind。

但，问题比较艰深、复杂，在跟 reviewer Maciej Fijalkowski 讨论的过程中，做了更加详细的解释，譬如：

    |  STACK  | STACK of entry()'s caller
    |         |
    |   rip   |
 +->|   tcc   | its value is 33
 |  |   rip   |
 |  |   rbp   |
 |  +---------+ rbp
 |  |   ret   | STACK of entry()
 +--| tcc_ptr |
 |  |   rbx   | saved regs
 |  |   rip   |
 |  |   rbp   |
 |  +---------+ rbp
 |  |   ret   | STACK of entry(), reuse STACK of subprog_tail()
 +--| tcc_ptr |
 |  |   rip   |
 |  |   rbp   |
 |  +---------+ rbp
 |  |   ret   | STACK of entry(), reuse STACK of subprog_tail()
 +--| tcc_ptr |
 |  |   rip   |
 |  |   rbp   |
 |  +---------+ rbp
 |  |    *    |
 |  |    *    |
 |  |    *    |
 |  +---------+ RBP
 +--| tcc_ptr | STACK of subprog_tail()
    +---------+ RSP

详情：[RFC PATCH bpf-next v2 0/4] bpf, x64: Fix tailcall hierarchy^[3]。

第四回合：PATCH v1 / 2024 年 1 月 4 日

RFC PATCH v2 在讨论了许久后，Maciej Fijalkowski 终于肯定了这个思路。于是，在写了更好的 commit message 之后，便投递了 PATCH v1。

修复问题的核心之处：

diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index fe30b9ebb8de4..67fa337fc2e0c 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -406,14 +406,21 @@ static void emit_prologue(u8 **pprog, u32 stack_depth, bool ebpf_from_cbpf,
      */
     emit_nops(&prog, X86_PATCH_SIZE);
     if (!ebpf_from_cbpf) {
-       if (tail_call_reachable && !is_subprog)
+       if (tail_call_reachable && !is_subprog) {
             /* When it's the entry of the whole tailcall context,
              * zeroing rax means initialising tail_call_cnt.
              */
-           EMIT2(0x31, 0xC0); /* xor eax, eax */
-       else
-           /* Keep the same instruction layout. */
-           EMIT2(0x66, 0x90); /* nop2 */
+           EMIT2(0x31, 0xC0);       /* xor eax, eax */
+           EMIT1(0x50);             /* push rax */
+           /* Make rax as ptr that points to tail_call_cnt. */
+           EMIT3(0x48, 0x89, 0xE0); /* mov rax, rsp */
+           EMIT1_off32(0xE8, 2);    /* call main prog */
+           EMIT1(0x59);             /* pop rcx, get rid of tail_call_cnt */
+           EMIT1(0xC3);             /* ret */
+       } else {
+           /* Keep the same instruction size. */
+           emit_nops(&prog, 13);
+       }
     }
     /* Exception callback receives FP as third parameter */
     if (is_exception_cb) {
@@ -439,6 +446,7 @@ static void emit_prologue(u8 **pprog, u32 stack_depth, bool ebpf_from_cbpf,
     if (stack_depth)
         EMIT3_off32(0x48, 0x81, 0xEC, round_up(stack_depth, 8));
     if (tail_call_reachable)
+       /* Here, rax is tail_call_cnt_ptr. */
         EMIT1(0x50);         /* push rax */
     *pprog = prog;
 }

果真如我担心的那样，Alexei 担心这里的 call 指令可能会破坏 unwind 以及其它许多东西，并提出：

diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
index 0bdbbbeab155..0b45571559be 100644
--- a/kernel/bpf/arraymap.c
+++ b/kernel/bpf/arraymap.c
@@ -910,7 +910,7 @@ static void *prog_fd_array_get_ptr(struct bpf_map *map,
        if (IS_ERR(prog))
                return prog;

-       if (!bpf_prog_map_compatible(map, prog)) {
+       if (!bpf_prog_map_compatible(map, prog) || prog->aux->func_cnt) {
                bpf_prog_put(prog);
                return ERR_PTR(-EINVAL);
        }

我：提起 RFC PATCH v1 的思路。

Alexei: prologue 里使用 15 字节的 nop 的做法太重了，想一想适合所有 BPF JIT 的更简单的办法吧。

我：提出了在 __bpf_prog_run_dfunc() 里使用 bpf_prog_run_ctx 的思路。

Alexei: 在 __bpf_prog_run_dfunc() 里不能使用额外的 if 条件判断；要不试试 PERCPU tail call count 的思路？

详情：[PATCH bpf-next 0/4] bpf, x64: Fix tailcall hierarchy^[4]。

在尝试了 PERCPU tail call count 的思路后，便递交了 PATCH v2。

第五回合：PATCH v2 / 2024 年 2 月 22 日

在深入思考 PATCH v2 的更多 case 后，发现该思路存在缺陷：不同 bpf prog 运行在同一 CPU 上，会共享同一个 tail call count。

详情：[PATCH bpf-next v2 0/2] bpf, x64: Fix tailcall hierarchy^[5]。

第六回合：PATCH v3 / 2024 年 4 月 2 日

继续在 PERCPU tail call count 的思路上前进，比如将 tail call count 挪到 task_struct 里。

继续跟 Alexei 讨论，他发现其中的缺陷：当一个带有 tail call 的 bpf prog 的 subprog 被 fentry、且 fentry 里也有 tail call 时，会共享同一个 tail call count。

详情：[PATCH bpf-next v3 0/3] bpf, x64: Fix tailcall hierarchy^[6]。

第七回合：PATCH v4 / 2024 年 5 月 9 日

在 PATCH v4 里，将 arm64 BPF JIT 也一并修复了吧，就不需要等待 arm64 BPF JIT maintainer 的加入了。

将思路摇摆回 bpf_prog_run_ctx 的方向上，但发现该思路存在缺陷：如果 freplace bpf prog 里使用了 tail call 呢？

再度陷入困境。。。

详情：[PATCH bpf-next v4 0/5] bpf: Fix tailcall hierarchy^[7]。

第八回合：PATCH v5 / 2024 年 7 月 24 日

既然 PERCPU tail call count 和 bpf_prog_run_ctx 的思路都存在缺陷，那么，就剩下 RFC PATCH v1 的思路了；解决 RFC PATCH v1 的缺陷，大概就可以了吧。

失败的尝试："push tail_call_cnt and propagate its pointer"

diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index 5159c7a229229..f9f3aea5d9d2a 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -420,14 +420,18 @@ static void emit_prologue(u8 **pprog, u32 stack_depth, bool ebpf_from_cbpf,
         */
        emit_nops(&prog, X86_PATCH_SIZE);
        if (!ebpf_from_cbpf) {
-               if (tail_call_reachable && !is_subprog)
+               if (tail_call_reachable && !is_subprog) {
                        /* When it's the entry of the whole tailcall context,
                         * zeroing rax means initialising tail_call_cnt.
                         */
-                       EMIT2(0x31, 0xC0); /* xor eax, eax */
-               else
+                       EMIT2(0x31, 0xC0);       /* xor eax, eax */
+                       EMIT1(0x50);             /* push rax */
+                       EMIT3(0x48, 0x89, 0xE0); /* mov rax, rsp */
+               } else {
                        /* Keep the same instruction layout. */
-                       EMIT2(0x66, 0x90); /* nop2 */
+                       emit_nops(&prog, 5);     /* nop5 */
+                       EMIT1(0x50);             /* push rax */
+               }
        }
        /* Exception callback receives FP as third parameter */
        if (is_exception_cb) {
@@ -439,6 +443,11 @@ static void emit_prologue(u8 **pprog, u32 stack_depth, bool ebpf_from_cbpf,
                 */
                pop_callee_regs(&prog, all_callee_regs_used);
                pop_r12(&prog);
+               if (tail_call_reachable) {
+                       /* rax is tail_call_cnt_ptr */
+                       EMIT1(0x58);     /* pop rax */
+                       EMIT2_off32(0xC7, 0x00, 0); /* mov dword ptr [rax], 0 */
+               }
                /* Reset the stack frame. */
                EMIT3(0x48, 0x89, 0xEC); /* mov rsp, rbp */
        } else {
@@ -453,6 +462,7 @@ static void emit_prologue(u8 **pprog, u32 stack_depth, bool ebpf_from_cbpf,
        if (stack_depth)
                EMIT3_off32(0x48, 0x81, 0xEC, round_up(stack_depth, 8));
        if (tail_call_reachable)
+               /* rax is tail_call_cnt_ptr */
                EMIT1(0x50);         /* push rax */
        *pprog = prog;
 }
@@ -2321,6 +2333,7 @@ st:                    if (is_imm8(insn->off))
                                        pop_r12(&prog);
                        }
                        EMIT1(0xC9);         /* leave */
+                       EMIT1(0x59);         /* pop rcx, get rid of tail_call_cnt */
                        emit_return(&prog, image + addrs[i - 1] + (prog - temp));
                        break;

如上代码片段，在 prologue 里，初始化 rax 寄存器并 push 到栈上、并将 rax 寄存器当作指向栈帧的指针；在 epilogue 里，在 ret 之前，需要 pop 一下。

但，这个思路真的破坏了 unwind 机制。。。

BPF exceptions 特性依赖 unwind 机制，我便联系 BPF exceptions 的作者 Kumar Kartikeya Dwivedi，想要讨论如何解决这个问题。

成功的尝试："use rax to propagate tail call count or its pointer"

不过，在联系他的时候，我想到了一个更好的思路：只使用 rax 寄存器传递 tail call count 信息。

在 x86_64 BPF JIT 里，保留使用 rax 寄存器来传递 tail call count 信息，而不是只传递 tail call count pointer；那么，该如何区分 rax 寄存器里的值是 tail call count 还是 tail call count pointer 呢？

因为 tail_call_cnt 的最大值是 MAX_TAIL_CALL_CNT，所以，只要 rax 寄存器里的值小于等于 MAX_TAIL_CALL_CNT，就说明是 tail call count；否则，就说明是 tail call count pointer。

diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index d25d81c8ecc00..074b41fafbe3f 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -403,6 +403,37 @@ static void emit_cfi(u8 **pprog, u32 hash)
    *pprog = prog;
 }

+static void emit_prologue_tail_call(u8 **pprog, bool is_subprog)
+{
+   u8 *prog = *pprog;
+
+   if (!is_subprog) {
+       /* cmp rax, MAX_TAIL_CALL_CNT */
+       EMIT4(0x48, 0x83, 0xF8, MAX_TAIL_CALL_CNT);
+       EMIT2(X86_JA, 6);        /* ja 6 */
+       /* rax is tail_call_cnt if <= MAX_TAIL_CALL_CNT.
+        * case1: entry of main prog.
+        * case2: tail callee of main prog.
+        */
+       EMIT1(0x50);             /* push rax */
+       /* Make rax as tail_call_cnt_ptr. */
+       EMIT3(0x48, 0x89, 0xE0); /* mov rax, rsp */
+       EMIT2(0xEB, 1);          /* jmp 1 */
+       /* rax is tail_call_cnt_ptr if > MAX_TAIL_CALL_CNT.
+        * case: tail callee of subprog.
+        */
+       EMIT1(0x50);             /* push rax */
+       /* push tail_call_cnt_ptr */
+       EMIT1(0x50);             /* push rax */
+   } else { /* is_subprog */
+       /* rax is tail_call_cnt_ptr. */
+       EMIT1(0x50);             /* push rax */
+       EMIT1(0x50);             /* push rax */
+   }
+
+   *pprog = prog;
+}

如上代码片段，对于主 prog 的 prologue，需要判断 rax 寄存器是 tail call count 还是 tail call count pointer；对于 subprog 的 prologue，只需要将 rax 寄存器当作 tail call count pointer 即可。

x86_64 BPF JIT 的修改得到社区大佬 Eduard Zingerman 的肯定；就是要补充一下 JITed prog 的反汇编讲解：

When bpftool p d j id 42:

int entry(struct __sk_buff * skb):
bpf_prog_0c0f4c2413ef19b1_entry:
; int entry(struct __sk_buff *skb)
   0:   endbr64
   4:   nopl    (%rax,%rax)
   9:   xorq    %rax, %rax      ;; rax = 0 (tail_call_cnt)
   c:   pushq   %rbp
   d:   movq    %rsp, %rbp
  10:   endbr64
  14:   cmpq    $33, %rax       ;; if rax > 33, rax = tcc_ptr
  18:   ja      0x20            ;; if rax > 33 goto 0x20 ---+
  1a:   pushq   %rax            ;; [rbp - 8] = rax = 0      |
  1b:   movq    %rsp, %rax      ;; rax = rbp - 8            |
  1e:   jmp     0x21            ;; ---------+               |
  20:   pushq   %rax            ;; <--------|---------------+
  21:   pushq   %rax            ;; <--------+ [rbp - 16] = rax
  22:   pushq   %rbx            ;; callee saved
  23:   movq    %rdi, %rbx      ;; rbx = skb (callee saved)
; count++;
  26:   movabsq $-82417199407104, %rdi
  30:   movl    (%rdi), %esi
  33:   addl    $1, %esi
  36:   movl    %esi, (%rdi)
; subprog_tail(skb);
  39:   movq    %rbx, %rdi      ;; rdi = skb
  3c:   movq    -16(%rbp), %rax     ;; rax = tcc_ptr
  43:   callq   0x80            ;; call subprog_tail()
; subprog_tail(skb);
  48:   movq    %rbx, %rdi      ;; rdi = skb
  4b:   movq    -16(%rbp), %rax     ;; rax = tcc_ptr
  52:   callq   0x80            ;; call subprog_tail()
; return ret;
  57:   movl    $1, %eax
  5c:   popq    %rbx
  5d:   leave
  5e:   retq

int subprog_tail(struct __sk_buff * skb):
bpf_prog_3a140cef239a4b4f_subprog_tail:
; int subprog_tail(struct __sk_buff *skb)
   0:   endbr64
   4:   nopl    (%rax,%rax)
   9:   nopl    (%rax)          ;; do not touch tail_call_cnt
   c:   pushq   %rbp
   d:   movq    %rsp, %rbp
  10:   endbr64
  14:   pushq   %rax            ;; [rbp - 8]  = rax (tcc_ptr)
  15:   pushq   %rax            ;; [rbp - 16] = rax (tcc_ptr)
  16:   pushq   %rbx            ;; callee saved
  17:   pushq   %r13            ;; callee saved
  19:   movq    %rdi, %rbx      ;; rbx = skb
; asm volatile("r1 = %[ctx]\n\t"
  1c:   movabsq $-105487587488768, %r13 ;; r13 = jmp_table
  26:   movq    %rbx, %rdi      ;; 1st arg, skb
  29:   movq    %r13, %rsi      ;; 2nd arg, jmp_table
  2c:   xorl    %edx, %edx      ;; 3rd arg, index = 0
  2e:   movq    -16(%rbp), %rax     ;; rax = [rbp - 16] (tcc_ptr)
  35:   cmpq    $33, (%rax)
  39:   jae     0x4e            ;; if *tcc_ptr >= 33 goto 0x4e --------+
  3b:   jmp     0x4e            ;; jmp bypass, toggled by poking       |
  40:   addq    $1, (%rax)      ;; (*tcc_ptr)++                        |
  44:   popq    %r13            ;; callee saved                        |
  46:   popq    %rbx            ;; callee saved                        |
  47:   popq    %rax            ;; undo rbp-16 push                    |
  48:   popq    %rax            ;; undo rbp-8  push                    |
  49:   nopl    (%rax,%rax)     ;; tail call target, toggled by poking |
; return 0;                     ;;                                     |
  4e:   popq    %r13            ;; restore callee saved <--------------+
  50:   popq    %rbx            ;; restore callee saved
  51:   leave
  52:   retq

而 arm64 BPF JIT 的修改，则得到 arm64 BPF JIT maintainer Puranjay Mohan 的 ack。

详情：[PATCH v5 bpf-next 0/3] bpf: Fix tailcall hierarchy^[8]。

第九回合：PATCH v6 / 2024 年 7 月 14 日

补充了 JITed prog 的反汇编讲解之后，便递交了 PATCH v6。

在 7 月 20 日上午，PATCH v6 被合并啦。

详情：[PATCH v6 bpf-next 0/3] bpf: Fix tailcall hierarchy^[9]。

总结

这次修复 tailcall hierarchy 问题，耗时 10 个月，中途几度想要放弃，但最终还是坚持了下来，坚持将这个问题给修复了。

接下来，还会继续修 tailcall 的问题！

参考资料

[1]

Re: [RFC PATCH bpf-next v4 0/4] bpf, x64: Fix tailcall infinite loop: https://lore.kernel.org/bpf/6203dd01-789d-f02c-5293-def4c1b18aef@gmail.com/

[2]

[RFC PATCH bpf-next 0/3] bpf, x64: Fix tailcall hierarchy: https://lore.kernel.org/bpf/20231005145814.83122-1-hffilwlqm@gmail.com/

[3]

[RFC PATCH bpf-next v2 0/4] bpf, x64: Fix tailcall hierarchy: https://lore.kernel.org/bpf/20231011152725.95895-1-hffilwlqm@gmail.com/

[4]

[PATCH bpf-next 0/4] bpf, x64: Fix tailcall hierarchy: https://lore.kernel.org/bpf/20240104142226.87869-1-hffilwlqm@gmail.com/

[5]

[PATCH bpf-next v2 0/2] bpf, x64: Fix tailcall hierarchy: https://lore.kernel.org/bpf/20240222085232.62483-1-hffilwlqm@gmail.com/

[6]

[PATCH bpf-next v3 0/3] bpf, x64: Fix tailcall hierarchy: https://lore.kernel.org/bpf/20240402152638.31377-1-hffilwlqm@gmail.com/

[7]

[PATCH bpf-next v4 0/5] bpf: Fix tailcall hierarchy: https://lore.kernel.org/bpf/20240509150541.81799-1-hffilwlqm@gmail.com/

[8]

[PATCH v5 bpf-next 0/3] bpf: Fix tailcall hierarchy: https://lore.kernel.org/bpf/20240623161528.68946-1-hffilwlqm@gmail.com/

[9]

[PATCH v6 bpf-next 0/3] bpf: Fix tailcall hierarchy: https://lore.kernel.org/bpf/20240714123902.32305-1-hffilwlqm@gmail.com/

http://mp.weixin.qq.com/s?__biz=MjM5MTQxNTk5MA==&mid=2247485636&idx=1&sn=b5eb3ab9b0be8759ac3c2aab6152ef16

eBPF Talk

专注于 eBPF 技术，以及 Linux 网络上的 eBPF 技术应用