cilium/pwru Source Code Analysis, Part 1: How It Traces Every Network Packet

Digest 2024-07-17 18:15 Jiangsu

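Project Overview

pwru (packet, where are you?)^[1] is an eBPF-based tool for tracing network packets in the Linux kernel, with advanced filtering capabilities. It gives fine-grained visibility into kernel state, which helps when debugging network connectivity problems. Its main features and highlights are as follows: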

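Main Features

• Network packet tracing: uses eBPF to trace the path a packet takes through the kernel, which helps diagnose problems such as packet drops.
• Advanced filtering: offers filters by kernel module, function name, network namespace, skb mark, and more, so tracing can be narrowed to packets that match specific conditions.
• Flexible backends: supports different tracing backends (kprobe and kprobe-multi) and automatically detects the most suitable one.
• Multiple output formats: can output JSON, skb data, caller function names, and more, to suit different analysis needs.
• Cross-platform binaries: provides statically linked executables for x86_64 and arm64, which simplifies installation.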

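Highlights

• Easy to install and use: download a statically linked executable from the release page, or deploy and run it via Docker or Kubernetes.
• Broad runtime support: besides running directly on a Linux host, it can run in Docker, Kubernetes, and Vagrant environments.
• Open source: pwru welcomes community contributions and provides development and contribution guides.
• Community support: developers, maintainers, and users can ask questions and share experience in the Slack channel.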

Requirements

• Linux kernel >= 5.3.
• --output-skb requires kernel >= 5.9.
• --backend=kprobe-multi requires kernel >= 5.18.
• debugfs must be mounted at /sys/kernel/debug.
• Certain kernel config options must be enabled, such as CONFIG_DEBUG_INFO_BTF=y and CONFIG_KPROBES=y. The table below lists them.

Option	Backend	Note
CONFIG_DEBUG_INFO_BTF=y	both	available since >= 5.3
CONFIG_KPROBES=y	both
CONFIG_PERF_EVENTS=y	both
CONFIG_BPF=y	both
CONFIG_BPF_SYSCALL=y	both
CONFIG_FUNCTION_TRACER=y	kprobe-multi	/sys/kernel/debug/tracing/available_filter_functions
CONFIG_FPROBE=y	kprobe-multi	available since >= 5.18

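How to Trace Every Network Packet

In this code^[2], the PWRU_ADD_KPROBE macro defines a template that generates a family of functions, one per possible position of the skb argument. The macro parameter X is used both to build the function name and to select which argument holds the skb.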

#define PWRU_ADD_KPROBE(X)                                                     \
  SEC(PWRU_KPROBE_TYPE "/skb-" #X)                                             \
  int kprobe_skb_##X(struct pt_regs *ctx) {                                    \
    struct sk_buff *skb = (struct sk_buff *) PT_REGS_PARM##X(ctx);             \
    return kprobe_skb(skb, ctx, PWRU_HAS_GET_FUNC_IP, NULL);                   \
  }

The PWRU_ADD_KPROBE macro is invoked five times in a row, each time with a different number (1 through 5), generating five handler functions:

PWRU_ADD_KPROBE(1)
PWRU_ADD_KPROBE(2)
PWRU_ADD_KPROBE(3)
PWRU_ADD_KPROBE(4)
PWRU_ADD_KPROBE(5)

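These invocations generate the functions kprobe_skb_1 through kprobe_skb_5, which read the 1st through 5th argument of the probed function respectively. The code casts each of those arguments to a struct sk_buff pointer. But what probed function takes an sk_buff in all five of its arguments? The section name kprobe/skb-1 also looks odd: is that really a valid attach point? And why are only the first five arguments read?

Why Only the First Five Arguments?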
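To make the naming scheme concrete, here is a minimal Go sketch that reproduces the five program names and section names the macro yields. The helpers progName and sectionName are hypothetical, and the value "kprobe" passed in place of PWRU_KPROBE_TYPE is an assumption, since the definition of that macro is not shown here.

```go
package main

import "fmt"

// progName mirrors the C identifier produced by kprobe_skb_##X.
func progName(pos int) string {
	return fmt.Sprintf("kprobe_skb_%d", pos)
}

// sectionName mirrors SEC(PWRU_KPROBE_TYPE "/skb-" #X).
// probeType stands in for PWRU_KPROBE_TYPE; "kprobe" below is an assumed value.
func sectionName(probeType string, pos int) string {
	return fmt.Sprintf("%s/skb-%d", probeType, pos)
}

func main() {
	for pos := 1; pos <= 5; pos++ {
		fmt.Printf("%-14s %s\n", progName(pos), sectionName("kprobe", pos))
	}
}
```

As far as the rest of this walkthrough shows, the "skb-N" part of the section name is only a label: the kernel functions that actually get hooked are chosen in user space and paired with these programs by name.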

PT_REGS_PARM

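The PT_REGS_PARM macros make it easy to extract function arguments from a pt_regs structure. pt_regs holds the CPU register state saved when the probe fires, and an eBPF program uses these macros to read the register that carries a given argument.

Suppose an eBPF program needs to capture the sys_open system call and read its arguments. The PT_REGS_PARM macros can extract them: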

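SEC("kprobe/sys_open")
int bpf_prog1(struct pt_regs *ctx) {
    // sys_open(const char *filename, int flags, umode_t mode)
    // Illustrative only: on kernels that use syscall wrappers (e.g. __x64_sys_open),
    // the real arguments sit in an inner pt_regs instead.
    const char *filename = (const char *)PT_REGS_PARM1(ctx);
    int flags = (int)PT_REGS_PARM2(ctx);
    // process filename and flags
    return 0;
}

In this example:

• PT_REGS_PARM1(ctx) reads the first argument of sys_open, filename.
• PT_REGS_PARM2(ctx) reads the second argument of sys_open, flags.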
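What these macros do can be modelled in a few lines of Go. This is a toy sketch, not libbpf code: ptRegs and parm are hypothetical names, and the struct keeps only the x86-64 registers that carry the first five arguments of an ordinary function call (RDI, RSI, RDX, RCX, R8).

```go
package main

import "fmt"

// ptRegs is a toy stand-in for the x86-64 struct pt_regs. It holds only the
// registers that carry the first five function-call arguments.
type ptRegs struct {
	DI, SI, DX, CX, R8 uint64
}

// parm returns the n-th (1-based) function argument, the way PT_REGS_PARMn
// selects a register on x86-64. Positions beyond 5 are rejected, matching
// the five-argument limit discussed in this section.
func parm(regs ptRegs, n int) (uint64, error) {
	switch n {
	case 1:
		return regs.DI, nil
	case 2:
		return regs.SI, nil
	case 3:
		return regs.DX, nil
	case 4:
		return regs.CX, nil
	case 5:
		return regs.R8, nil
	}
	return 0, fmt.Errorf("argument %d is not available in this model", n)
}

func main() {
	// Pretend a probed function was entered with a pointer as its 2nd argument.
	regs := ptRegs{DI: 0x1, SI: 0xffff888012345600, DX: 0x3}

	arg2, err := parm(regs, 2)
	fmt.Printf("arg2=%#x err=%v\n", arg2, err)

	_, err = parm(regs, 6)
	fmt.Println("arg6:", err)
}
```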

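Registers

This table^[3] summarizes the registers used to pass system call arguments on different architectures and ABIs (application binary interfaces). Each row lists the registers an architecture uses for the first through seventh argument.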

The second table shows the registers used to pass the system call arguments.

       Arch/ABI      arg1  arg2  arg3  arg4  arg5  arg6  arg7  Notes
       --------------------------------------------------------------
       alpha         a0    a1    a2    a3    a4    a5    -
       arc           r0    r1    r2    r3    r4    r5    -
       arm/OABI      r0    r1    r2    r3    r4    r5    r6
       arm/EABI      r0    r1    r2    r3    r4    r5    r6
       arm64         x0    x1    x2    x3    x4    x5    -
       blackfin      R0    R1    R2    R3    R4    R5    -
       i386          ebx   ecx   edx   esi   edi   ebp   -
       ia64          out0  out1  out2  out3  out4  out5  -
       m68k          d1    d2    d3    d4    d5    a0    -
       microblaze    r5    r6    r7    r8    r9    r10   -
       mips/o32      a0    a1    a2    a3    -     -     -     1
       mips/n32,64   a0    a1    a2    a3    a4    a5    -
       nios2         r4    r5    r6    r7    r8    r9    -
       parisc        r26   r25   r24   r23   r22   r21   -
       powerpc       r3    r4    r5    r6    r7    r8    r9
       powerpc64     r3    r4    r5    r6    r7    r8    -
       riscv         a0    a1    a2    a3    a4    a5    -
       s390          r2    r3    r4    r5    r6    r7    -
       s390x         r2    r3    r4    r5    r6    r7    -
       superh        r4    r5    r6    r7    r0    r1    r2
       sparc/32      o0    o1    o2    o3    o4    o5    -
       sparc/64      o0    o1    o2    o3    o4    o5    -
       tile          R00   R01   R02   R03   R04   R05   -
       x86-64        rdi   rsi   rdx   r10   r8    r9    -
       x32           rdi   rsi   rdx   r10   r8    r9    -
       xtensa        a6    a3    a4    a5    a8    a9    -

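System V AMD64 ABI:

1. The System V ABI defines register usage for function calls: the first six integer or pointer arguments are passed in RDI, RSI, RDX, RCX, R8, and R9, in that order.
2. System calls use a slightly different set: the fourth argument goes in R10 instead of RCX, because the syscall instruction clobbers RCX.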

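On x86_64, system call arguments are passed in the following registers:

1. RDI: first argument
2. RSI: second argument
3. RDX: third argument
4. R10: fourth argument (unlike a function call, where the fourth argument is in RCX)
5. R8: fifth argument
6. R9: sixth argument
7. RAX: system call number

The return value of a system call comes back in RAX; on failure, RAX holds a negative error code.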

Below is a simple example of making system calls in assembly. It shows how the three arguments of write are passed in RDI, RSI, and RDX:

section .data
    msg db 'Hello, world!',0

section .text
    global _start

_start:
    ; write(int fd, const void *buf, size_t count)
    mov rax, 1          ; system call number (sys_write)
    mov rdi, 1          ; file descriptor (stdout)
    mov rsi, msg        ; buffer address
    mov rdx, 13         ; number of bytes to write
    syscall             ; invoke the system call

    ; exit(int status)
    mov rax, 60         ; system call number (sys_exit)
    xor rdi, rdi        ; exit status 0
    syscall             ; invoke the system call

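So the registers themselves can carry six arguments, yet libbpf historically supported only five. Note also that a kprobe on an ordinary kernel function sees the function-call convention (fourth argument in RCX), not the system call convention. The issues below show that the community wanted more argument registers to be readable, and recent libbpf versions can read up to eight arguments.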

1. why not provide __PT_PARM6_REG^[4]

there is no technical reason, just no one added 6th arg macro. See libbpf/libbpf-bootstrap#111 (comment) and yes, please do send a patch.

2. provide __PT_PARM6_REG. libbpf/libbpf#574^[5]

Libbpf's BPF_KPROBE macro currently doesn't support more than 5 arguments. Please contribute the patch to extend it.

For now to unblock yourself you can add this before BPF_KPROBE macro use:

#define ___bpf_kprobe_args6(x, args...) \
    ___bpf_kprobe_args5(args), (void *)(ctx)->r9
** But note that this will eventually be added libbpf (probably pretty soon) and at that point your code will stop compiling again, most probably. ** So it's best to fix this in libbpf properly.

3. [v2,bpf-next,01/25] libbpf: add support for fetching up to 8 arguments in kprobes^[6]

Add BPF_KPROBE() and PT_REGS_PARMx() support for up to 8 arguments, if
target architecture supports this. Currently all architectures are
limited to only 5 register-placed arguments, which is limiting even on
x86-64.

This patch adds generic macro machinery to support up to 8 arguments
both when explicitly fetching it from pt_regs through PT_REGS_PARMx()
macros, as well as more ergonomic access in BPF_KPROBE().

Also, for i386 architecture we now don't have to define fake PARM4 and
PARM5 definitions, they will be generically substituted, just like for
PARM6 through PARM8.

Subsequent patches will fill out architecture-specific definitions,
where appropriate.

How Are the Probes Attached?

Loading BTF

    var btfSpec *btf.Spec
    var err error
    if flags.KernelBTF != "" {
        // Load BTF from a user-supplied path.
        // If the running kernel was built without BTF, a BTF file can be obtained from btfhub or generated manually
        // (earlier articles from this account describe how).
        btfSpec, err = btf.LoadSpec(flags.KernelBTF)
    } else {
        // Load BTF from the default location (/sys/kernel/btf/vmlinux).
        btfSpec, err = btf.LoadKernelSpec()
    }
    if err != nil {
        log.Fatalf("Failed to load BTF spec: %s", err)
    }

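This snippet^[7] loads the BPF Type Format (BTF) information so that the rest of the program can use the kernel's type information. Step by step:

1. Declare btfSpec, a pointer to btf.Spec, to hold the loaded BTF information.
2. Declare an error variable err to capture any failure during loading.
3. Check flags.KernelBTF to see whether a kernel BTF file path was specified. If so, load BTF from that path; otherwise load the BTF of the running kernel.
4. The loading itself is done by btf.LoadSpec for an explicit path, or by btf.LoadKernelSpec for the running kernel.
5. If loading failed, log.Fatalf prints the error and terminates the program.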
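The branching in step 3 can be tried out without any eBPF dependency. The sketch below uses only the Go standard library and merely checks whether a BTF file exists; it does not parse BTF. The names btfPath and defaultKernelBTF, and the KERNEL_BTF environment variable, are hypothetical and exist only for this demonstration.

```go
package main

import (
	"fmt"
	"os"
)

// defaultKernelBTF is where a kernel built with CONFIG_DEBUG_INFO_BTF=y
// exposes its own BTF.
const defaultKernelBTF = "/sys/kernel/btf/vmlinux"

// btfPath mirrors the branch in the snippet above: a user-supplied path wins,
// otherwise fall back to the running kernel's BTF.
func btfPath(custom string) string {
	if custom != "" {
		return custom
	}
	return defaultKernelBTF
}

func main() {
	// KERNEL_BTF is a made-up variable standing in for flags.KernelBTF.
	path := btfPath(os.Getenv("KERNEL_BTF"))
	if _, err := os.Stat(path); err != nil {
		fmt.Printf("BTF not available at %s: %v\n", path, err)
		return
	}
	fmt.Printf("BTF found at %s\n", path)
}
```

A check like this is a quick way to tell whether a machine needs an external BTF file before running a BTF-dependent tool.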

Filtering Functions

    for _, it := range iters {
        for it.iter.Next() {
            typ := it.iter.Type
            // Skip anything that is not a function (btf.Func).
            fn, ok := typ.(*btf.Func)
            if !ok {
                continue
            }

            // Name of the current function.
            fnName := string(fn.Name)

            // Skip functions whose name does not fully match the pattern.
            if pattern != "" && reg.FindString(fnName) != fnName {
                continue
            }

            fnProto := fn.Type.(*btf.FuncProto)
            i := 1
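            // Walk the parameters of the function prototype (fnProto) and check whether each one is a pointer to struct sk_buff.
            // If it is, and its position is within the first 5, record the function name (with the module name where applicable) and the position in funcs.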
            for _, p := range fnProto.Params {
                if ptr, ok := p.Type.(*btf.Pointer); ok {
                    if strct, ok := ptr.Target.(*btf.Struct); ok {
                        if strct.Name == "sk_buff" && i <= 5 {
                            name := fnName
                            if kprobeMulti && it.kmod != "" {
                                name = fmt.Sprintf("%s [%s]", fnName, it.kmod)
                            }
                            funcs[name] = i
                            continue
                        }
                    }
                }
                i += 1
            }
        }
    }

After filtering, funcs maps each kernel function name (key) to the position of its sk_buff argument (value). Sample contents:

Index	Function	Parm Index
0	__tcf_em_tree_match	1
1	eth_type_trans	1
2	icmp_unreach	1
3	neigh_hh_output	2
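The core of the filter can be sketched without the btf package. The version below is a simplification, not pwru's code: param is a hypothetical stand-in for a BTF parameter type, skbPos is a hypothetical helper, and it returns only the first struct sk_buff pointer found among the first five parameters. The kfree entry is a toy example of a function that should be filtered out.

```go
package main

import "fmt"

// param is a simplified stand-in for a BTF function parameter.
type param struct {
	pointer bool   // true if the parameter is a pointer
	target  string // name of the pointed-to struct, empty if not a struct
}

// maxPos is the highest argument position for which a kprobe_skb_N program exists.
const maxPos = 5

// skbPos returns the 1-based position of the first `struct sk_buff *`
// parameter within the first maxPos parameters, or false if there is none.
func skbPos(params []param) (int, bool) {
	for i, p := range params {
		if i >= maxPos {
			break
		}
		if p.pointer && p.target == "sk_buff" {
			return i + 1, true
		}
	}
	return 0, false
}

func main() {
	protos := map[string][]param{
		"eth_type_trans":  {{true, "sk_buff"}, {true, "net_device"}},
		"neigh_hh_output": {{true, "hh_cache"}, {true, "sk_buff"}},
		"kfree":           {{true, ""}}, // no sk_buff parameter: filtered out
	}

	funcs := map[string]int{}
	for name, params := range protos {
		if pos, ok := skbPos(params); ok {
			funcs[name] = pos
		}
	}
	fmt.Println(funcs)
}
```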

Attaching the Custom Programs

func NewKprober(ctx context.Context, funcs Funcs, coll *ebpf.Collection, a2n Addr2Name, useKprobeMulti bool, batch uint) *kprober {
    ...
    pwruKprobes := make([]Kprobe, 0, len(funcs))
    funcsByPos := GetFuncsByPos(funcs)
    for pos, fns := range funcsByPos {
        fn, ok := coll.Programs[fmt.Sprintf("kprobe_skb_%d", pos)]
        if ok {
            pwruKprobes = append(pwruKprobes, Kprobe{HookFuncs: fns, Prog: fn})
        } else {
            ignored += len(fns)
            bar.Add(len(fns))
        }
    }
  
    ...
}

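This snippet builds one Kprobe struct per argument position from the position information collected above. The steps are:

1. Initialize pwruKprobes, a slice of Kprobe, to hold the instances about to be created.
2. Call GetFuncsByPos to build a map (funcsByPos) whose keys are argument positions and whose values are the lists of function names with an skb at that position.
3. Iterate over funcsByPos; for each position and its function list:

• Build the program's key in the ebpf.Collection from the position, using the format kprobe_skb_%d.
• Look the program up in the ebpf.Collection. If it exists, create a Kprobe whose HookFuncs is the function list for this position and whose Prog is the eBPF program, and append it to pwruKprobes.
• If no such program exists in the ebpf.Collection, add the number of functions at this position to the ignored counter and advance the progress bar by the same amount.

Sample funcsByPos contents: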

Parm Index	Functions
1	__tcf_em_tree_match, eth_type_trans, icmp_unreach, rtnl_fill_vf ...
2	neigh_hh_output, xfrm4_prepare_output, skb_consume_udp ...
3	ip_forward_finish, ip_local_deliver_finish, ip6_pkt_discard_out ...
4	tcp_collapse, tcf_chain_dump ...
5	dcbnl_ieee_set, dcbnl_pgtx_getcfg ...
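The grouping step is a plain map inversion. Here is a minimal sketch of what a function like GetFuncsByPos has to do; groupByPos is a hypothetical name and not necessarily how pwru implements it, and the sorting is added only to make the output deterministic.

```go
package main

import (
	"fmt"
	"sort"
)

// groupByPos inverts a "function name -> skb position" map into a
// "skb position -> function names" map.
func groupByPos(funcs map[string]int) map[int][]string {
	byPos := make(map[int][]string)
	for name, pos := range funcs {
		byPos[pos] = append(byPos[pos], name)
	}
	// Sort each list so repeated runs print the same result.
	for _, names := range byPos {
		sort.Strings(names)
	}
	return byPos
}

func main() {
	funcs := map[string]int{
		"__tcf_em_tree_match": 1,
		"eth_type_trans":      1,
		"icmp_unreach":        1,
		"neigh_hh_output":     2,
	}
	for pos, names := range groupByPos(funcs) {
		fmt.Println(pos, names)
	}
}
```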

Sample pwruKprobes contents:

HookFuncs	name
netlink_broadcast_filtered ...	kprobe_skb_2
dcbnl_pgtx_getcfg ...	kprobe_skb_5
trace_event_get_offsets_devlink_trap_report ...	kprobe_skb_3
devlink_nl_cmd_sb_occ_snapshot_doit ...	kprobe_skb_1
ipmr_queue_xmit ...	kprobe_skb_4
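The pairing loop from NewKprober can be imitated with plain maps. In this sketch the kprobe struct stores the program name instead of a real eBPF program, progs stands in for the collection's program table, and buildKprobes is a hypothetical helper; position 6 is included only to exercise the ignored branch.

```go
package main

import "fmt"

// kprobe pairs a list of kernel functions with the program that handles them.
// The real struct holds an eBPF program; a name is enough for this sketch.
type kprobe struct {
	HookFuncs []string
	Prog      string
}

// buildKprobes looks up kprobe_skb_<pos> for every position and counts the
// functions whose position has no matching program.
func buildKprobes(byPos map[int][]string, progs map[string]bool) ([]kprobe, int) {
	var kprobes []kprobe
	ignored := 0
	for pos, fns := range byPos {
		name := fmt.Sprintf("kprobe_skb_%d", pos)
		if !progs[name] {
			ignored += len(fns)
			continue
		}
		kprobes = append(kprobes, kprobe{HookFuncs: fns, Prog: name})
	}
	return kprobes, ignored
}

func main() {
	byPos := map[int][]string{
		1: {"eth_type_trans", "icmp_unreach"},
		2: {"neigh_hh_output"},
		6: {"hypothetical_fn"}, // no kprobe_skb_6 program exists
	}
	progs := map[string]bool{"kprobe_skb_1": true, "kprobe_skb_2": true}

	kprobes, ignored := buildKprobes(byPos, progs)
	for _, kp := range kprobes {
		fmt.Println(kp.Prog, kp.HookFuncs)
	}
	fmt.Println("ignored:", ignored)
}
```

Go does not define the iteration order of a map, so the resulting slice is not ordered by position, which is consistent with the unordered sample above.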

References

[1] pwru (packet, where are you?): https://github.com/cilium/pwru
[2] code: https://github.com/cilium/pwru/blob/main/bpf/kprobe_pwru.c#L495-L529
[3] table: https://github.com/libbpf/libbpf/issues/616#issuecomment-1396170448
[4] why not provide __PT_PARM6_REG: https://github.com/libbpf/libbpf/issues/575#top
[5] provide __PT_PARM6_REG. libbpf/libbpf#574: https://github.com/libbpf/libbpf/pull/574
[6] [v2,bpf-next,01/25] libbpf: add support for fetching up to 8 arguments in kprobes: https://patchwork.kernel.org/project/netdevbpf/patch/20230120200914.3008030-2-andrii@kernel.org/
[7] snippet: https://github.com/cilium/pwru/blob/ce305f0a7af92dcc3b8d535a913559232435a7ad/main.go#L55-L62

朱慧君

大龄yaml工程师逼逼叨