* [PATCH bpf-next v3 0/3] bpf, x64: Fix tailcall hierarchy
@ 2024-04-02 15:26 Leon Hwang
  2024-04-02 15:26 ` [PATCH bpf-next v3 1/3] bpf: Add bpf_tail_call_cnt to task_struct Leon Hwang
                   ` (2 more replies)
  0 siblings, 3 replies; 10+ messages in thread
From: Leon Hwang @ 2024-04-02 15:26 UTC
  To: bpf
  Cc: ast, daniel, andrii, maciej.fijalkowski, jakub, pulehui,
	hengqi.chen, hffilwlqm, kernel-patches-bot

The patchset fixes a tailcall hierarchy issue.

The issue is confirmed in the discussions of "bpf, x64: Fix tailcall infinite
loop"[0].

This patchset resolves the issue on x86 only.

This CI history[1] confirms the issue on aarch64.

I provide a long commit message in the second patch to describe how the issue
happens and how this patchset resolves the issue in detail.

In short, it stores tail_call_cnt in task_struct.

First, at the prologue of the bpf prog, it initialises the tail_call_cnt in
task_struct, like "current->bpf_tail_call_cnt = 0;".

Then, when a tailcall happens, it compares the tail_call_cnt with
MAX_TAIL_CALL_CNT, and then increments it.
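
A minimal sketch of the idea, in the same pseudo-C style as the JIT comments
(the actual JIT-emitted code is in the second patch):

	/* prologue of the entry bpf prog (not of subprogs) */
	current->bpf_tail_call_cnt = 0;

	/* at every tailcall site */
	u32 *tcc_ptr = &current->bpf_tail_call_cnt;

	if ((*tcc_ptr)++ >= MAX_TAIL_CALL_CNT)
		goto out;	/* abort this tailcall */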

v2 -> v3:
  * Solution changes from percpu tail_call_cnt to tail_call_cnt at task_struct.

v1 -> v2:
  * Solution changes from an extra run-time call insn to percpu tail_call_cnt.
  * Address comments from Alexei:
    * Use percpu tail_call_cnt.
    * Use asm to make sure no callee saved registers are touched.

RFC v2 -> v1:
  * Solution changes from propagating tail_call_cnt with its pointer to an
    extra run-time call insn.
  * Address comments from Maciej:
    * Replace all memcpy(prog, x86_nops[5], X86_PATCH_SIZE) with
        emit_nops(&prog, X86_PATCH_SIZE)

RFC v1 -> RFC v2:
  * Address comments from Stanislav:
    * Separate moving emit_nops() as first patch.

Links:
[0] https://lore.kernel.org/bpf/6203dd01-789d-f02c-5293-def4c1b18aef@gmail.com/
[1] https://github.com/kernel-patches/bpf/pull/6721/checks

Leon Hwang (3):
  bpf: Add bpf_tail_call_cnt to task_struct
  bpf, x64: Fix tailcall hierarchy
  selftests/bpf: Add testcases for tailcall hierarchy fixing

 arch/x86/net/bpf_jit_comp.c                   | 137 +++---
 include/linux/sched.h                         |   2 +
 .../selftests/bpf/prog_tests/tailcalls.c      | 418 ++++++++++++++++++
 .../bpf/progs/tailcall_bpf2bpf_hierarchy1.c   |  38 ++
 .../bpf/progs/tailcall_bpf2bpf_hierarchy2.c   |  63 +++
 .../bpf/progs/tailcall_bpf2bpf_hierarchy3.c   |  50 +++
 6 files changed, 652 insertions(+), 56 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/progs/tailcall_bpf2bpf_hierarchy1.c
 create mode 100644 tools/testing/selftests/bpf/progs/tailcall_bpf2bpf_hierarchy2.c
 create mode 100644 tools/testing/selftests/bpf/progs/tailcall_bpf2bpf_hierarchy3.c


base-commit: 8f1ff3cf139bc1269eebae5d43ffbe482675f360
-- 
2.44.0


* [PATCH bpf-next v3 1/3] bpf: Add bpf_tail_call_cnt to task_struct
  2024-04-02 15:26 [PATCH bpf-next v3 0/3] bpf, x64: Fix tailcall hierarchy Leon Hwang
@ 2024-04-02 15:26 ` Leon Hwang
  2024-04-02 15:26 ` [PATCH bpf-next v3 2/3] bpf, x64: Fix tailcall hierarchy Leon Hwang
  2024-04-02 15:26 ` [PATCH bpf-next v3 3/3] selftests/bpf: Add testcases for tailcall hierarchy fixing Leon Hwang
  2 siblings, 0 replies; 10+ messages in thread
From: Leon Hwang @ 2024-04-02 15:26 UTC
  To: bpf
  Cc: ast, daniel, andrii, maciej.fijalkowski, jakub, pulehui,
	hengqi.chen, hffilwlqm, kernel-patches-bot

In order to get rid of propagating the tail call counter via rax and the
stack, store the tail call counter in task_struct instead.

Then, the prologue of a bpf prog has to initialise the tail call counter
in task_struct. And when a tail call happens, it has to compare and
increment the tail call counter in task_struct.

Signed-off-by: Leon Hwang <hffilwlqm@gmail.com>
---
 include/linux/sched.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 3c2abbc587b49..d0696fcabf14f 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1501,6 +1501,8 @@ struct task_struct {
 	struct bpf_local_storage __rcu	*bpf_storage;
 	/* Used for BPF run context */
 	struct bpf_run_ctx		*bpf_ctx;
+	/* Used for BPF run time */
+	u32				bpf_tail_call_cnt;
 #endif
 
 #ifdef CONFIG_GCC_PLUGIN_STACKLEAK
-- 
2.44.0


* [PATCH bpf-next v3 2/3] bpf, x64: Fix tailcall hierarchy
  2024-04-02 15:26 [PATCH bpf-next v3 0/3] bpf, x64: Fix tailcall hierarchy Leon Hwang
  2024-04-02 15:26 ` [PATCH bpf-next v3 1/3] bpf: Add bpf_tail_call_cnt to task_struct Leon Hwang
@ 2024-04-02 15:26 ` Leon Hwang
  2024-04-05  1:03   ` Alexei Starovoitov
  2024-04-02 15:26 ` [PATCH bpf-next v3 3/3] selftests/bpf: Add testcases for tailcall hierarchy fixing Leon Hwang
  2 siblings, 1 reply; 10+ messages in thread
From: Leon Hwang @ 2024-04-02 15:26 UTC
  To: bpf
  Cc: ast, daniel, andrii, maciej.fijalkowski, jakub, pulehui,
	hengqi.chen, hffilwlqm, kernel-patches-bot

From commit ebf7d1f508a73871 ("bpf, x64: rework pro/epilogue and tailcall
handling in JIT"), the tailcall on x64 works better than before.

From commit e411901c0b775a3a ("bpf: allow for tailcalls in BPF subprograms
for x64 JIT"), tailcall is able to run in BPF subprograms on x64.

But what happens when:

1. More than one subprogram is called in a bpf program.
2. The tailcalls in the subprograms tailcall the bpf program itself.

Because of missing tail_call_cnt back-propagation, a tailcall hierarchy
comes up. And the MAX_TAIL_CALL_CNT limit does not work for this case.

Let's take a look at an example:

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>
#include "bpf_legacy.h"

struct {
	__uint(type, BPF_MAP_TYPE_PROG_ARRAY);
	__uint(max_entries, 1);
	__uint(key_size, sizeof(__u32));
	__uint(value_size, sizeof(__u32));
} jmp_table SEC(".maps");

int count = 0;

static __noinline
int subprog_tail(struct __sk_buff *skb)
{
	bpf_tail_call_static(skb, &jmp_table, 0);
	return 0;
}

SEC("tc")
int entry(struct __sk_buff *skb)
{
	volatile int ret = 1;

	count++;
	subprog_tail(skb); /* subprog call1 */
	subprog_tail(skb); /* subprog call2 */

	return ret;
}

char __license[] SEC("license") = "GPL";

The entry bpf prog is populated into the 0th slot of jmp_table. Then, what
happens when the entry bpf prog runs? The CPU will be stalled because of too
many tailcalls, e.g. the test_progs failed to run on aarch64 and s390x
because of "rcu: INFO: rcu_sched self-detected stall on CPU".

So, if the CPU did not stall because of too many tailcalls, how many
tailcalls would there be in this case? And why does the MAX_TAIL_CALL_CNT
limit not work for it?

Let's step through the execution.

The very first time subprog_tail() is called, it tailcalls the entry bpf
prog. Then, subprog_tail() is called a second time at the position of subprog
call1, and it tailcalls the entry bpf prog again.

Then, again and again. By the first time the MAX_TAIL_CALL_CNT limit kicks
in, subprog_tail() has been called 34 times at the position of subprog call1.
At this point, tail_call_cnt is 33 in subprog_tail().

Next, the 34th subprog_tail() returns to entry() because of the
MAX_TAIL_CALL_CNT limit.

In the 34th entry(), after the 34th subprog_tail() at the position of
subprog call1 finishes and before the 1st subprog_tail() at the position of
subprog call2 is called, what's the value of tail_call_cnt in entry()?
It's 33.

As we know, tail_call_cnt is pushed onto the stack of entry(), and
propagates to subprog_tail() via %rax from the stack.
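
For reference, a simplified sketch of how the pre-patch JIT propagates
tail_call_cnt (it mirrors the code removed by this patch; see the removed
lines in the diff below):

	entry() prologue:
		xor eax, eax		/* tail_call_cnt = 0 */
		push rax		/* spill it onto entry()'s stack */
	before calling subprog_tail():
		mov rax, qword ptr [rbp - rounded_stack_depth - 8]
					/* reload tail_call_cnt into rax */
	subprog_tail() prologue:
		push rax		/* spill it onto subprog's stack */
	at the tailcall:
		mov eax, dword ptr [rbp - tcc_off]
		cmp eax, MAX_TAIL_CALL_CNT
		jae out			/* abort when the limit is reached */
		add eax, 1
		mov dword ptr [rbp - tcc_off], eax
		...
		pop rax			/* rax carries tail_call_cnt into
					 * the tailcall target */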

Then, when subprog_tail() at the position of subprog call2 is called for the
first time, tail_call_cnt 33 propagates to subprog_tail() via %rax. And the
tailcall in subprog_tail() is aborted because of
tail_call_cnt >= MAX_TAIL_CALL_CNT, too.

Then, subprog_tail() at the position of subprog call2 ends, and the 34th
entry() ends. And it returns to the 33rd subprog_tail() called from the
position of subprog call1. But wait, at this time, what's the value of
tail_call_cnt on the stack of subprog_tail()? It's 33.

Then, in the 33rd entry(), after the 33rd subprog_tail() at the position of
subprog call1 finishes and before the 2nd subprog_tail() at the position of
subprog call2 is called, what's the value of tail_call_cnt in the current
entry()? It's *32*. Why not 33?

Before stepping into subprog_tail() at the position of subprog call2 in the
33rd entry(), let's stop the time machine and have a look at the stack
memory:

  |  STACK  |
  +---------+ RBP  <-- current rbp
  |   ret   | STACK of 33rd entry()
  |   tcc   | its value is 32
  +---------+ RSP  <-- current rsp
  |   rip   | STACK of 34th entry()
  |   rbp   | reuse the STACK of 33rd subprog_tail() at the position
  |   ret   |                                        subprog call1
  |   tcc   | its value is 33
  +---------+ rsp
  |   rip   | STACK of 1st subprog_tail() at the position subprog call2
  |   rbp   |
  |   tcc   | its value is 33
  +---------+ rsp

Why not 33? It's because tail_call_cnt does not back-propagate from
subprog_tail() to entry().

Then, stepping into subprog_tail() at the position of subprog call2 in the
33rd entry():

  |  STACK  |
  +---------+
  |   ret   | STACK of 33rd entry()
  |   tcc   | its value is 32
  |   rip   |
  |   rbp   |
  +---------+ RBP  <-- current rbp
  |   tcc   | its value is 32; STACK of subprog_tail() at the position
  +---------+ RSP  <-- current rsp                        subprog call2

Then, pausing right after the tail_call_cnt check-and-increment in the 2nd
subprog_tail() at the position of subprog call2:

  |  STACK  |
  +---------+
  |   ret   | STACK of 33rd entry()
  |   tcc   | its value is 32
  |   rip   |
  |   rbp   |
  +---------+ RBP  <-- current rbp
  |   tcc   | its value is 33; STACK of subprog_tail() at the position
  +---------+ RSP  <-- current rsp                        subprog call2

Note: what happens to tail_call_cnt:
	/*
	 * if (tail_call_cnt++ >= MAX_TAIL_CALL_CNT)
	 *	goto out;
	 */
It checks tail_call_cnt against MAX_TAIL_CALL_CNT first and only then
increments it. Here tail_call_cnt is 32, which is below MAX_TAIL_CALL_CNT
(33), so the check passes and the counter becomes 33.

So, the current tailcall is allowed to run.

Then, entry() is tailcalled. And the stack memory status is:

  |  STACK  |
  +---------+
  |   ret   | STACK of 33rd entry()
  |   tcc   | its value is 32
  |   rip   |
  |   rbp   |
  +---------+ RBP  <-- current rbp
  |   ret   | STACK of 35th entry(); reuse STACK of subprog_tail() at the
  |   tcc   | its value is 33                       position subprog call2
  +---------+ RSP  <-- current rsp

So, the tailcalls in the 35th entry() will be aborted.

And, ..., again and again.  :(

By now, I hope you have understood why the MAX_TAIL_CALL_CNT limit does not
work for this case.

So, how many tailcalls are there for this case if the CPU does not stall?

From a top-down view, doesn't it look like a hierarchy, layer by layer?

I think it is a hierarchy-layer model with 2+4+8+...+2**33 tailcalls. As a
result, if the CPU did not stall, there would be 2**34 - 2 = 17,179,869,182
tailcalls. That's what stalls the CPU.
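
For the record, that geometric sum works out as:

	2 + 4 + 8 + ... + 2**33 = 2**34 - 2 = 17,179,869,182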

What if there are N subprog_tail() calls in entry()? If the CPU does not
stall because of too many tailcalls, there will be almost N**34 tailcalls.

Now that we understand the issue, how does this patch resolve it?

In this patch, it stores tail_call_cnt in task_struct. When a tailcall
happens, the caller and callee bpf progs are in the same thread context,
which means the current task_struct won't change across the tailcall.

First, at the prologue of the bpf prog, it initialises the tail_call_cnt in
task_struct.

Then, when a tailcall happens, it compares tail_call_cnt with
MAX_TAIL_CALL_CNT, and then increments it.

Additionally, in order to avoid touching any registers other than %rax, it
uses asm to access tail_call_cnt in task_struct.

As a result, the previous tailcall way can be removed totally, including

1. "push rax" at prologue.
2. load tail_call_cnt to rax before calling function.
3. "pop rax" before jumping to tailcallee when tailcall.
4. "push rax" and load tail_call_cnt to rax at trampoline.

Fixes: ebf7d1f508a7 ("bpf, x64: rework pro/epilogue and tailcall handling in JIT")
Fixes: e411901c0b77 ("bpf: allow for tailcalls in BPF subprograms for x64 JIT")
Signed-off-by: Leon Hwang <hffilwlqm@gmail.com>
---
 arch/x86/net/bpf_jit_comp.c | 137 +++++++++++++++++++++---------------
 1 file changed, 81 insertions(+), 56 deletions(-)

diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index 3b639d6f2f54d..cd06e02e83b64 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -11,6 +11,7 @@
 #include <linux/bpf.h>
 #include <linux/memory.h>
 #include <linux/sort.h>
+#include <linux/sched.h>
 #include <asm/extable.h>
 #include <asm/ftrace.h>
 #include <asm/set_memory.h>
@@ -18,6 +19,8 @@
 #include <asm/text-patching.h>
 #include <asm/unwind.h>
 #include <asm/cfi.h>
+#include <asm/current.h>
+#include <asm/percpu.h>
 
 static bool all_callee_regs_used[4] = {true, true, true, true};
 
@@ -273,7 +276,7 @@ struct jit_context {
 /* Number of bytes emit_patch() needs to generate instructions */
 #define X86_PATCH_SIZE		5
 /* Number of bytes that will be skipped on tailcall */
-#define X86_TAIL_CALL_OFFSET	(11 + ENDBR_INSN_SIZE)
+#define X86_TAIL_CALL_OFFSET	(14 + ENDBR_INSN_SIZE)
 
 static void push_r12(u8 **pprog)
 {
@@ -403,6 +406,9 @@ static void emit_cfi(u8 **pprog, u32 hash)
 	*pprog = prog;
 }
 
+static int emit_call(u8 **pprog, void *func, void *ip);
+static __used void bpf_tail_call_cnt_init(void);
+
 /*
  * Emit x86-64 prologue code for BPF program.
  * bpf_tail_call helper will skip the first X86_TAIL_CALL_OFFSET bytes
@@ -410,9 +416,9 @@ static void emit_cfi(u8 **pprog, u32 hash)
  */
 static void emit_prologue(u8 **pprog, u32 stack_depth, bool ebpf_from_cbpf,
 			  bool tail_call_reachable, bool is_subprog,
-			  bool is_exception_cb)
+			  bool is_exception_cb, u8 *ip)
 {
-	u8 *prog = *pprog;
+	u8 *prog = *pprog, *start = *pprog;
 
 	emit_cfi(&prog, is_subprog ? cfi_bpf_subprog_hash : cfi_bpf_hash);
 	/* BPF trampoline can be made to work without these nops,
@@ -421,13 +427,14 @@ static void emit_prologue(u8 **pprog, u32 stack_depth, bool ebpf_from_cbpf,
 	emit_nops(&prog, X86_PATCH_SIZE);
 	if (!ebpf_from_cbpf) {
 		if (tail_call_reachable && !is_subprog)
-			/* When it's the entry of the whole tailcall context,
-			 * zeroing rax means initialising tail_call_cnt.
+			/* Call bpf_tail_call_cnt_init to initilise
+			 * tail_call_cnt.
 			 */
-			EMIT2(0x31, 0xC0); /* xor eax, eax */
+			emit_call(&prog, bpf_tail_call_cnt_init,
+				  ip + (prog - start));
 		else
 			/* Keep the same instruction layout. */
-			EMIT2(0x66, 0x90); /* nop2 */
+			emit_nops(&prog, X86_PATCH_SIZE);
 	}
 	/* Exception callback receives FP as third parameter */
 	if (is_exception_cb) {
@@ -452,8 +459,6 @@ static void emit_prologue(u8 **pprog, u32 stack_depth, bool ebpf_from_cbpf,
 	/* sub rsp, rounded_stack_depth */
 	if (stack_depth)
 		EMIT3_off32(0x48, 0x81, 0xEC, round_up(stack_depth, 8));
-	if (tail_call_reachable)
-		EMIT1(0x50);         /* push rax */
 	*pprog = prog;
 }
 
@@ -589,13 +594,61 @@ static void emit_return(u8 **pprog, u8 *ip)
 	*pprog = prog;
 }
 
+static __used void bpf_tail_call_cnt_init(void)
+{
+	/* The following asm equals to
+	 *
+	 * u32 *tcc_ptr = &current->bpf_tail_call_cnt;
+	 *
+	 * *tcc_ptr = 0;
+	 */
+
+	asm volatile (
+	    "addq " __percpu_arg(0) ", %1\n\t"
+	    "addq %2, %1\n\t"
+	    "movq (%1), %1\n\t"
+	    "addq %3, %1\n\t"
+	    "movl $0, (%1)\n\t"
+	    :
+	    : "m" (this_cpu_off), "r" (&pcpu_hot),
+	      "i" (offsetof(struct pcpu_hot, current_task)),
+	      "i" (offsetof(struct task_struct, bpf_tail_call_cnt))
+	);
+}
+
+static __used u32 *bpf_tail_call_cnt_ptr(void)
+{
+	u32 *tcc_ptr;
+
+	/* The following asm equals to
+	 *
+	 * u32 *tcc_ptr = &current->bpf_tail_call_cnt;
+	 *
+	 * return tcc_ptr;
+	 */
+
+	asm volatile (
+	    "addq " __percpu_arg(1) ", %2\n\t"
+	    "addq %3, %2\n\t"
+	    "movq (%2), %2\n\t"
+	    "addq %4, %2\n\t"
+	    "movq %2, %0\n\t"
+	    : "=r" (tcc_ptr)
+	    : "m" (this_cpu_off), "r" (&pcpu_hot),
+	      "i" (offsetof(struct pcpu_hot, current_task)),
+	      "i" (offsetof(struct task_struct, bpf_tail_call_cnt))
+	);
+
+	return tcc_ptr;
+}
+
 /*
  * Generate the following code:
  *
  * ... bpf_tail_call(void *ctx, struct bpf_array *array, u64 index) ...
  *   if (index >= array->map.max_entries)
  *     goto out;
- *   if (tail_call_cnt++ >= MAX_TAIL_CALL_CNT)
+ *   if ((*tcc_ptr)++ >= MAX_TAIL_CALL_CNT)
  *     goto out;
  *   prog = array->ptrs[index];
  *   if (prog == NULL)
@@ -608,7 +661,6 @@ static void emit_bpf_tail_call_indirect(struct bpf_prog *bpf_prog,
 					u32 stack_depth, u8 *ip,
 					struct jit_context *ctx)
 {
-	int tcc_off = -4 - round_up(stack_depth, 8);
 	u8 *prog = *pprog, *start = *pprog;
 	int offset;
 
@@ -630,16 +682,16 @@ static void emit_bpf_tail_call_indirect(struct bpf_prog *bpf_prog,
 	EMIT2(X86_JBE, offset);                   /* jbe out */
 
 	/*
-	 * if (tail_call_cnt++ >= MAX_TAIL_CALL_CNT)
+	 * if ((*tcc_ptr)++ >= MAX_TAIL_CALL_CNT)
 	 *	goto out;
 	 */
-	EMIT2_off32(0x8B, 0x85, tcc_off);         /* mov eax, dword ptr [rbp - tcc_off] */
-	EMIT3(0x83, 0xF8, MAX_TAIL_CALL_CNT);     /* cmp eax, MAX_TAIL_CALL_CNT */
+	/* call bpf_tail_call_cnt_ptr */
+	emit_call(&prog, bpf_tail_call_cnt_ptr, ip + (prog - start));
+	EMIT3(0x83, 0x38, MAX_TAIL_CALL_CNT);     /* cmp dword ptr [rax], MAX_TAIL_CALL_CNT */
 
 	offset = ctx->tail_call_indirect_label - (prog + 2 - start);
 	EMIT2(X86_JAE, offset);                   /* jae out */
-	EMIT3(0x83, 0xC0, 0x01);                  /* add eax, 1 */
-	EMIT2_off32(0x89, 0x85, tcc_off);         /* mov dword ptr [rbp - tcc_off], eax */
+	EMIT2(0xFF, 0x00);                        /* inc dword ptr [rax] */
 
 	/* prog = array->ptrs[index]; */
 	EMIT4_off32(0x48, 0x8B, 0x8C, 0xD6,       /* mov rcx, [rsi + rdx * 8 + offsetof(...)] */
@@ -663,7 +715,6 @@ static void emit_bpf_tail_call_indirect(struct bpf_prog *bpf_prog,
 			pop_r12(&prog);
 	}
 
-	EMIT1(0x58);                              /* pop rax */
 	if (stack_depth)
 		EMIT3_off32(0x48, 0x81, 0xC4,     /* add rsp, sd */
 			    round_up(stack_depth, 8));
@@ -691,21 +742,20 @@ static void emit_bpf_tail_call_direct(struct bpf_prog *bpf_prog,
 				      bool *callee_regs_used, u32 stack_depth,
 				      struct jit_context *ctx)
 {
-	int tcc_off = -4 - round_up(stack_depth, 8);
 	u8 *prog = *pprog, *start = *pprog;
 	int offset;
 
 	/*
-	 * if (tail_call_cnt++ >= MAX_TAIL_CALL_CNT)
+	 * if ((*tcc_ptr)++ >= MAX_TAIL_CALL_CNT)
 	 *	goto out;
 	 */
-	EMIT2_off32(0x8B, 0x85, tcc_off);             /* mov eax, dword ptr [rbp - tcc_off] */
-	EMIT3(0x83, 0xF8, MAX_TAIL_CALL_CNT);         /* cmp eax, MAX_TAIL_CALL_CNT */
+	/* call bpf_tail_call_cnt_ptr */
+	emit_call(&prog, bpf_tail_call_cnt_ptr, ip);
+	EMIT3(0x83, 0x38, MAX_TAIL_CALL_CNT);         /* cmp dword ptr [rax], MAX_TAIL_CALL_CNT */
 
 	offset = ctx->tail_call_direct_label - (prog + 2 - start);
 	EMIT2(X86_JAE, offset);                       /* jae out */
-	EMIT3(0x83, 0xC0, 0x01);                      /* add eax, 1 */
-	EMIT2_off32(0x89, 0x85, tcc_off);             /* mov dword ptr [rbp - tcc_off], eax */
+	EMIT2(0xFF, 0x00);                            /* inc dword ptr [rax] */
 
 	poke->tailcall_bypass = ip + (prog - start);
 	poke->adj_off = X86_TAIL_CALL_OFFSET;
@@ -724,7 +774,6 @@ static void emit_bpf_tail_call_direct(struct bpf_prog *bpf_prog,
 			pop_r12(&prog);
 	}
 
-	EMIT1(0x58);                                  /* pop rax */
 	if (stack_depth)
 		EMIT3_off32(0x48, 0x81, 0xC4, round_up(stack_depth, 8));
 
@@ -1262,10 +1311,6 @@ static void emit_shiftx(u8 **pprog, u32 dst_reg, u8 src_reg, bool is64, u8 op)
 
 #define INSN_SZ_DIFF (((addrs[i] - addrs[i - 1]) - (prog - temp)))
 
-/* mov rax, qword ptr [rbp - rounded_stack_depth - 8] */
-#define RESTORE_TAIL_CALL_CNT(stack)				\
-	EMIT3_off32(0x48, 0x8B, 0x85, -round_up(stack, 8) - 8)
-
 static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image, u8 *rw_image,
 		  int oldproglen, struct jit_context *ctx, bool jmp_padding)
 {
@@ -1293,7 +1338,8 @@ static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image, u8 *rw_image
 
 	emit_prologue(&prog, bpf_prog->aux->stack_depth,
 		      bpf_prog_was_classic(bpf_prog), tail_call_reachable,
-		      bpf_is_subprog(bpf_prog), bpf_prog->aux->exception_cb);
+		      bpf_is_subprog(bpf_prog), bpf_prog->aux->exception_cb,
+		      image);
 	/* Exception callback will clobber callee regs for its own use, and
 	 * restore the original callee regs from main prog's stack frame.
 	 */
@@ -1973,17 +2019,11 @@ st:			if (is_imm8(insn->off))
 		case BPF_JMP | BPF_CALL: {
 			int offs;
 
+			if (!imm32)
+				return -EINVAL;
+
 			func = (u8 *) __bpf_call_base + imm32;
-			if (tail_call_reachable) {
-				RESTORE_TAIL_CALL_CNT(bpf_prog->aux->stack_depth);
-				if (!imm32)
-					return -EINVAL;
-				offs = 7 + x86_call_depth_emit_accounting(&prog, func);
-			} else {
-				if (!imm32)
-					return -EINVAL;
-				offs = x86_call_depth_emit_accounting(&prog, func);
-			}
+			offs = x86_call_depth_emit_accounting(&prog, func);
 			if (emit_call(&prog, func, image + addrs[i - 1] + offs))
 				return -EINVAL;
 			break;
@@ -2773,7 +2813,6 @@ static int __arch_prepare_bpf_trampoline(struct bpf_tramp_image *im, void *rw_im
 	 *                     [ ...        ]
 	 *                     [ stack_arg2 ]
 	 * RBP - arg_stack_off [ stack_arg1 ]
-	 * RSP                 [ tail_call_cnt ] BPF_TRAMP_F_TAIL_CALL_CTX
 	 */
 
 	/* room for return value of orig_call or fentry prog */
@@ -2845,8 +2884,6 @@ static int __arch_prepare_bpf_trampoline(struct bpf_tramp_image *im, void *rw_im
 		/* sub rsp, stack_size */
 		EMIT4(0x48, 0x83, 0xEC, stack_size);
 	}
-	if (flags & BPF_TRAMP_F_TAIL_CALL_CTX)
-		EMIT1(0x50);		/* push rax */
 	/* mov QWORD PTR [rbp - rbx_off], rbx */
 	emit_stx(&prog, BPF_DW, BPF_REG_FP, BPF_REG_6, -rbx_off);
 
@@ -2901,16 +2938,9 @@ static int __arch_prepare_bpf_trampoline(struct bpf_tramp_image *im, void *rw_im
 		restore_regs(m, &prog, regs_off);
 		save_args(m, &prog, arg_stack_off, true);
 
-		if (flags & BPF_TRAMP_F_TAIL_CALL_CTX) {
-			/* Before calling the original function, restore the
-			 * tail_call_cnt from stack to rax.
-			 */
-			RESTORE_TAIL_CALL_CNT(stack_size);
-		}
-
 		if (flags & BPF_TRAMP_F_ORIG_STACK) {
-			emit_ldx(&prog, BPF_DW, BPF_REG_6, BPF_REG_FP, 8);
-			EMIT2(0xff, 0xd3); /* call *rbx */
+			emit_ldx(&prog, BPF_DW, BPF_REG_0, BPF_REG_FP, 8);
+			EMIT2(0xff, 0xd0); /* call *rax */
 		} else {
 			/* call original function */
 			if (emit_rsb_call(&prog, orig_call, image + (prog - (u8 *)rw_image))) {
@@ -2963,11 +2993,6 @@ static int __arch_prepare_bpf_trampoline(struct bpf_tramp_image *im, void *rw_im
 			ret = -EINVAL;
 			goto cleanup;
 		}
-	} else if (flags & BPF_TRAMP_F_TAIL_CALL_CTX) {
-		/* Before running the original function, restore the
-		 * tail_call_cnt from stack to rax.
-		 */
-		RESTORE_TAIL_CALL_CNT(stack_size);
 	}
 
 	/* restore return value of orig_call or fentry prog back into RAX */
-- 
2.44.0


* [PATCH bpf-next v3 3/3] selftests/bpf: Add testcases for tailcall hierarchy fixing
  2024-04-02 15:26 [PATCH bpf-next v3 0/3] bpf, x64: Fix tailcall hierarchy Leon Hwang
  2024-04-02 15:26 ` [PATCH bpf-next v3 1/3] bpf: Add bpf_tail_call_cnt to task_struct Leon Hwang
  2024-04-02 15:26 ` [PATCH bpf-next v3 2/3] bpf, x64: Fix tailcall hierarchy Leon Hwang
@ 2024-04-02 15:26 ` Leon Hwang
  2 siblings, 0 replies; 10+ messages in thread
From: Leon Hwang @ 2024-04-02 15:26 UTC
  To: bpf
  Cc: ast, daniel, andrii, maciej.fijalkowski, jakub, pulehui,
	hengqi.chen, hffilwlqm, kernel-patches-bot

Add some test cases to confirm the tailcall hierarchy issue has been fixed.

tools/testing/selftests/bpf/test_progs -t tailcalls
311/18  tailcalls/tailcall_bpf2bpf_hierarchy_1:OK
311/19  tailcalls/tailcall_bpf2bpf_hierarchy_fentry:OK
311/20  tailcalls/tailcall_bpf2bpf_hierarchy_fexit:OK
311/21  tailcalls/tailcall_bpf2bpf_hierarchy_fentry_fexit:OK
311/22  tailcalls/tailcall_bpf2bpf_hierarchy_2:OK
311/23  tailcalls/tailcall_bpf2bpf_hierarchy_3:OK
311     tailcalls:OK
Summary: 1/23 PASSED, 0 SKIPPED, 0 FAILED

Signed-off-by: Leon Hwang <hffilwlqm@gmail.com>
---
 .../selftests/bpf/prog_tests/tailcalls.c      | 418 ++++++++++++++++++
 .../bpf/progs/tailcall_bpf2bpf_hierarchy1.c   |  38 ++
 .../bpf/progs/tailcall_bpf2bpf_hierarchy2.c   |  63 +++
 .../bpf/progs/tailcall_bpf2bpf_hierarchy3.c   |  50 +++
 4 files changed, 569 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/progs/tailcall_bpf2bpf_hierarchy1.c
 create mode 100644 tools/testing/selftests/bpf/progs/tailcall_bpf2bpf_hierarchy2.c
 create mode 100644 tools/testing/selftests/bpf/progs/tailcall_bpf2bpf_hierarchy3.c

diff --git a/tools/testing/selftests/bpf/prog_tests/tailcalls.c b/tools/testing/selftests/bpf/prog_tests/tailcalls.c
index 59993fc9c0d7e..6b7baafb855af 100644
--- a/tools/testing/selftests/bpf/prog_tests/tailcalls.c
+++ b/tools/testing/selftests/bpf/prog_tests/tailcalls.c
@@ -1187,6 +1187,412 @@ static void test_tailcall_poke(void)
 	tailcall_poke__destroy(call);
 }
 
+static void test_tailcall_hierarchy_count(const char *which, bool test_fentry,
+					  bool test_fexit)
+{
+	int err, map_fd, prog_fd, main_data_fd, fentry_data_fd, fexit_data_fd, i, val;
+	struct bpf_object *obj = NULL, *fentry_obj = NULL, *fexit_obj = NULL;
+	struct bpf_link *fentry_link = NULL, *fexit_link = NULL;
+	struct bpf_map *prog_array, *data_map;
+	struct bpf_program *prog;
+	char buff[128] = {};
+
+	LIBBPF_OPTS(bpf_test_run_opts, topts,
+		.data_in = buff,
+		.data_size_in = sizeof(buff),
+		.repeat = 1,
+	);
+
+	err = bpf_prog_test_load(which, BPF_PROG_TYPE_SCHED_CLS, &obj,
+				 &prog_fd);
+	if (!ASSERT_OK(err, "load obj"))
+		return;
+
+	prog = bpf_object__find_program_by_name(obj, "entry");
+	if (!ASSERT_OK_PTR(prog, "find entry prog"))
+		goto out;
+
+	prog_fd = bpf_program__fd(prog);
+	if (!ASSERT_GE(prog_fd, 0, "prog_fd"))
+		goto out;
+
+	prog_array = bpf_object__find_map_by_name(obj, "jmp_table");
+	if (!ASSERT_OK_PTR(prog_array, "find jmp_table"))
+		goto out;
+
+	map_fd = bpf_map__fd(prog_array);
+	if (!ASSERT_GE(map_fd, 0, "map_fd"))
+		goto out;
+
+	i = 0;
+	err = bpf_map_update_elem(map_fd, &i, &prog_fd, BPF_ANY);
+	if (!ASSERT_OK(err, "update jmp_table"))
+		goto out;
+
+	if (test_fentry) {
+		fentry_obj = bpf_object__open_file("tailcall_bpf2bpf_fentry.bpf.o",
+						   NULL);
+		if (!ASSERT_OK_PTR(fentry_obj, "open fentry_obj file"))
+			goto out;
+
+		prog = bpf_object__find_program_by_name(fentry_obj, "fentry");
+		if (!ASSERT_OK_PTR(prog, "find fentry prog"))
+			goto out;
+
+		err = bpf_program__set_attach_target(prog, prog_fd,
+						     "subprog_tail");
+		if (!ASSERT_OK(err, "set_attach_target subprog_tail"))
+			goto out;
+
+		err = bpf_object__load(fentry_obj);
+		if (!ASSERT_OK(err, "load fentry_obj"))
+			goto out;
+
+		fentry_link = bpf_program__attach_trace(prog);
+		if (!ASSERT_OK_PTR(fentry_link, "attach_trace"))
+			goto out;
+	}
+
+	if (test_fexit) {
+		fexit_obj = bpf_object__open_file("tailcall_bpf2bpf_fexit.bpf.o",
+						  NULL);
+		if (!ASSERT_OK_PTR(fexit_obj, "open fexit_obj file"))
+			goto out;
+
+		prog = bpf_object__find_program_by_name(fexit_obj, "fexit");
+		if (!ASSERT_OK_PTR(prog, "find fexit prog"))
+			goto out;
+
+		err = bpf_program__set_attach_target(prog, prog_fd,
+						     "subprog_tail");
+		if (!ASSERT_OK(err, "set_attach_target subprog_tail"))
+			goto out;
+
+		err = bpf_object__load(fexit_obj);
+		if (!ASSERT_OK(err, "load fexit_obj"))
+			goto out;
+
+		fexit_link = bpf_program__attach_trace(prog);
+		if (!ASSERT_OK_PTR(fexit_link, "attach_trace"))
+			goto out;
+	}
+
+	err = bpf_prog_test_run_opts(prog_fd, &topts);
+	ASSERT_OK(err, "tailcall");
+	ASSERT_EQ(topts.retval, 1, "tailcall retval");
+
+	data_map = bpf_object__find_map_by_name(obj, ".bss");
+	if (!ASSERT_FALSE(!data_map || !bpf_map__is_internal(data_map),
+			  "find data_map"))
+		goto out;
+
+	main_data_fd = bpf_map__fd(data_map);
+	if (!ASSERT_GE(main_data_fd, 0, "main_data_fd"))
+		goto out;
+
+	i = 0;
+	err = bpf_map_lookup_elem(main_data_fd, &i, &val);
+	ASSERT_OK(err, "tailcall count");
+	ASSERT_EQ(val, 34, "tailcall count");
+
+	if (test_fentry) {
+		data_map = bpf_object__find_map_by_name(fentry_obj, ".bss");
+		if (!ASSERT_FALSE(!data_map || !bpf_map__is_internal(data_map),
+				  "find tailcall_bpf2bpf_fentry.bss map"))
+			goto out;
+
+		fentry_data_fd = bpf_map__fd(data_map);
+		if (!ASSERT_GE(fentry_data_fd, 0,
+				  "find tailcall_bpf2bpf_fentry.bss map fd"))
+			goto out;
+
+		i = 0;
+		err = bpf_map_lookup_elem(fentry_data_fd, &i, &val);
+		ASSERT_OK(err, "fentry count");
+		ASSERT_EQ(val, 68, "fentry count");
+	}
+
+	if (test_fexit) {
+		data_map = bpf_object__find_map_by_name(fexit_obj, ".bss");
+		if (!ASSERT_FALSE(!data_map || !bpf_map__is_internal(data_map),
+				  "find tailcall_bpf2bpf_fexit.bss map"))
+			goto out;
+
+		fexit_data_fd = bpf_map__fd(data_map);
+		if (!ASSERT_GE(fexit_data_fd, 0,
+				  "find tailcall_bpf2bpf_fexit.bss map fd"))
+			goto out;
+
+		i = 0;
+		err = bpf_map_lookup_elem(fexit_data_fd, &i, &val);
+		ASSERT_OK(err, "fexit count");
+		ASSERT_EQ(val, 68, "fexit count");
+	}
+
+	i = 0;
+	err = bpf_map_delete_elem(map_fd, &i);
+	if (!ASSERT_OK(err, "delete_elem from jmp_table"))
+		goto out;
+
+	err = bpf_prog_test_run_opts(prog_fd, &topts);
+	ASSERT_OK(err, "tailcall");
+	ASSERT_EQ(topts.retval, 1, "tailcall retval");
+
+	i = 0;
+	err = bpf_map_lookup_elem(main_data_fd, &i, &val);
+	ASSERT_OK(err, "tailcall count");
+	ASSERT_EQ(val, 35, "tailcall count");
+
+	if (test_fentry) {
+		i = 0;
+		err = bpf_map_lookup_elem(fentry_data_fd, &i, &val);
+		ASSERT_OK(err, "fentry count");
+		ASSERT_EQ(val, 70, "fentry count");
+	}
+
+	if (test_fexit) {
+		i = 0;
+		err = bpf_map_lookup_elem(fexit_data_fd, &i, &val);
+		ASSERT_OK(err, "fexit count");
+		ASSERT_EQ(val, 70, "fexit count");
+	}
+
+out:
+	bpf_link__destroy(fentry_link);
+	bpf_link__destroy(fexit_link);
+	bpf_object__close(fentry_obj);
+	bpf_object__close(fexit_obj);
+	bpf_object__close(obj);
+}
+
+/* test_tailcall_bpf2bpf_hierarchy_1 checks that the count value of the tail
+ * call limit enforcement matches with expectations when tailcalls are preceded
+ * with two bpf2bpf calls.
+ *
+ *              subprog --tailcall-> entry prog
+ * entry prog <
+ *              subprog --tailcall-> entry prog
+ */
+static void test_tailcall_bpf2bpf_hierarchy_1(void)
+{
+	test_tailcall_hierarchy_count("tailcall_bpf2bpf_hierarchy1.bpf.o",
+				      false, false);
+}
+
+/* test_tailcall_bpf2bpf_hierarchy_fentry checks that the count value of the
+ * tail call limit enforcement matches with expectations when tailcalls are
+ * preceded with two bpf2bpf calls, and the two subprogs are traced by fentry.
+ */
+static void test_tailcall_bpf2bpf_hierarchy_fentry(void)
+{
+	test_tailcall_hierarchy_count("tailcall_bpf2bpf_hierarchy1.bpf.o",
+				      true, false);
+}
+
+/* test_tailcall_bpf2bpf_hierarchy_fexit checks that the count value of the tail
+ * call limit enforcement matches with expectations when tailcalls are preceded
+ * with two bpf2bpf calls, and the two subprogs are traced by fexit.
+ */
+static void test_tailcall_bpf2bpf_hierarchy_fexit(void)
+{
+	test_tailcall_hierarchy_count("tailcall_bpf2bpf_hierarchy1.bpf.o",
+				      false, true);
+}
+
+/* test_tailcall_bpf2bpf_hierarchy_fentry_fexit checks that the count value of
+ * the tail call limit enforcement matches with expectations when tailcalls are
+ * preceded with two bpf2bpf calls, and the two subprogs are traced by both
+ * fentry and fexit.
+ */
+static void test_tailcall_bpf2bpf_hierarchy_fentry_fexit(void)
+{
+	test_tailcall_hierarchy_count("tailcall_bpf2bpf_hierarchy1.bpf.o",
+				      true, true);
+}
+
+/* test_tailcall_bpf2bpf_hierarchy_2 checks that the count value of the tail
+ * call limit enforcement matches with expectations:
+ *
+ *         subprog_tail0 --tailcall-> classifier_0 -> subprog_tail0
+ * entry <
+ *         subprog_tail1 --tailcall-> classifier_1 -> subprog_tail1
+ */
+static void test_tailcall_bpf2bpf_hierarchy_2(void)
+{
+	int err, map_fd, prog_fd, data_fd, main_fd, i, val[2];
+	struct bpf_map *prog_array, *data_map;
+	struct bpf_object *obj = NULL;
+	struct bpf_program *prog;
+	char buff[128] = {};
+
+	LIBBPF_OPTS(bpf_test_run_opts, topts,
+		.data_in = buff,
+		.data_size_in = sizeof(buff),
+		.repeat = 1,
+	);
+
+	err = bpf_prog_test_load("tailcall_bpf2bpf_hierarchy2.bpf.o",
+				 BPF_PROG_TYPE_SCHED_CLS,
+				 &obj, &prog_fd);
+	if (!ASSERT_OK(err, "load obj"))
+		return;
+
+	prog = bpf_object__find_program_by_name(obj, "entry");
+	if (!ASSERT_OK_PTR(prog, "find entry prog"))
+		goto out;
+
+	main_fd = bpf_program__fd(prog);
+	if (!ASSERT_GE(main_fd, 0, "main_fd"))
+		goto out;
+
+	prog_array = bpf_object__find_map_by_name(obj, "jmp_table");
+	if (!ASSERT_OK_PTR(prog_array, "find jmp_table map"))
+		goto out;
+
+	map_fd = bpf_map__fd(prog_array);
+	if (!ASSERT_GE(map_fd, 0, "find jmp_table map fd"))
+		goto out;
+
+	prog = bpf_object__find_program_by_name(obj, "classifier_0");
+	if (!ASSERT_OK_PTR(prog, "find classifier_0 prog"))
+		goto out;
+
+	prog_fd = bpf_program__fd(prog);
+	if (!ASSERT_GE(prog_fd, 0, "find classifier_0 prog fd"))
+		goto out;
+
+	i = 0;
+	err = bpf_map_update_elem(map_fd, &i, &prog_fd, BPF_ANY);
+	if (!ASSERT_OK(err, "update jmp_table"))
+		goto out;
+
+	prog = bpf_object__find_program_by_name(obj, "classifier_1");
+	if (!ASSERT_OK_PTR(prog, "find classifier_1 prog"))
+		goto out;
+
+	prog_fd = bpf_program__fd(prog);
+	if (!ASSERT_GE(prog_fd, 0, "find classifier_1 prog fd"))
+		goto out;
+
+	i = 1;
+	err = bpf_map_update_elem(map_fd, &i, &prog_fd, BPF_ANY);
+	if (!ASSERT_OK(err, "update jmp_table"))
+		goto out;
+
+	err = bpf_prog_test_run_opts(main_fd, &topts);
+	ASSERT_OK(err, "tailcall");
+	ASSERT_EQ(topts.retval, 1, "tailcall retval");
+
+	data_map = bpf_object__find_map_by_name(obj, ".bss");
+	if (!ASSERT_FALSE(!data_map || !bpf_map__is_internal(data_map),
+			  "find .bss map"))
+		goto out;
+
+	data_fd = bpf_map__fd(data_map);
+	if (!ASSERT_GE(data_fd, 0, "find .bss map fd"))
+		goto out;
+
+	i = 0;
+	err = bpf_map_lookup_elem(data_fd, &i, &val);
+	ASSERT_OK(err, "tailcall counts");
+	ASSERT_EQ(val[0], 33, "tailcall count0");
+	ASSERT_EQ(val[1], 0, "tailcall count1");
+
+out:
+	bpf_object__close(obj);
+}
+
+/* test_tailcall_bpf2bpf_hierarchy_3 checks that the count value of the tail
+ * call limit enforcement matches with expectations:
+ *
+ *                                   subprog with jmp_table0 to classifier_0
+ * entry --tailcall-> classifier_0 <
+ *                                   subprog with jmp_table1 to classifier_0
+ */
+static void test_tailcall_bpf2bpf_hierarchy_3(void)
+{
+	int err, map_fd, prog_fd, data_fd, main_fd, i, val;
+	struct bpf_map *prog_array, *data_map;
+	struct bpf_object *obj = NULL;
+	struct bpf_program *prog;
+	char buff[128] = {};
+
+	LIBBPF_OPTS(bpf_test_run_opts, topts,
+		.data_in = buff,
+		.data_size_in = sizeof(buff),
+		.repeat = 1,
+	);
+
+	err = bpf_prog_test_load("tailcall_bpf2bpf_hierarchy3.bpf.o",
+				 BPF_PROG_TYPE_SCHED_CLS,
+				 &obj, &prog_fd);
+	if (!ASSERT_OK(err, "load obj"))
+		return;
+
+	prog = bpf_object__find_program_by_name(obj, "entry");
+	if (!ASSERT_OK_PTR(prog, "find entry prog"))
+		goto out;
+
+	main_fd = bpf_program__fd(prog);
+	if (!ASSERT_GE(main_fd, 0, "main_fd"))
+		goto out;
+
+	prog_array = bpf_object__find_map_by_name(obj, "jmp_table0");
+	if (!ASSERT_OK_PTR(prog_array, "find jmp_table0 map"))
+		goto out;
+
+	map_fd = bpf_map__fd(prog_array);
+	if (!ASSERT_GE(map_fd, 0, "find jmp_table0 map fd"))
+		goto out;
+
+	prog = bpf_object__find_program_by_name(obj, "classifier_0");
+	if (!ASSERT_OK_PTR(prog, "find classifier_0 prog"))
+		goto out;
+
+	prog_fd = bpf_program__fd(prog);
+	if (!ASSERT_GE(prog_fd, 0, "find classifier_0 prog fd"))
+		goto out;
+
+	i = 0;
+	err = bpf_map_update_elem(map_fd, &i, &prog_fd, BPF_ANY);
+	if (!ASSERT_OK(err, "update jmp_table0"))
+		goto out;
+
+	prog_array = bpf_object__find_map_by_name(obj, "jmp_table1");
+	if (!ASSERT_OK_PTR(prog_array, "find jmp_table1 map"))
+		goto out;
+
+	map_fd = bpf_map__fd(prog_array);
+	if (!ASSERT_GE(map_fd, 0, "find jmp_table1 map fd"))
+		goto out;
+
+	i = 0;
+	err = bpf_map_update_elem(map_fd, &i, &prog_fd, BPF_ANY);
+	if (!ASSERT_OK(err, "update jmp_table1"))
+		goto out;
+
+	err = bpf_prog_test_run_opts(main_fd, &topts);
+	ASSERT_OK(err, "tailcall");
+	ASSERT_EQ(topts.retval, 1, "tailcall retval");
+
+	data_map = bpf_object__find_map_by_name(obj, ".bss");
+	if (!ASSERT_FALSE(!data_map || !bpf_map__is_internal(data_map),
+			  "find .bss map"))
+		goto out;
+
+	data_fd = bpf_map__fd(data_map);
+	if (!ASSERT_GE(data_fd, 0, "find .bss map fd"))
+		goto out;
+
+	i = 0;
+	err = bpf_map_lookup_elem(data_fd, &i, &val);
+	ASSERT_OK(err, "tailcall count");
+	ASSERT_EQ(val, 33, "tailcall count");
+
+out:
+	bpf_object__close(obj);
+}
+
 void test_tailcalls(void)
 {
 	if (test__start_subtest("tailcall_1"))
@@ -1223,4 +1629,16 @@ void test_tailcalls(void)
 		test_tailcall_bpf2bpf_fentry_entry();
 	if (test__start_subtest("tailcall_poke"))
 		test_tailcall_poke();
+	if (test__start_subtest("tailcall_bpf2bpf_hierarchy_1"))
+		test_tailcall_bpf2bpf_hierarchy_1();
+	if (test__start_subtest("tailcall_bpf2bpf_hierarchy_fentry"))
+		test_tailcall_bpf2bpf_hierarchy_fentry();
+	if (test__start_subtest("tailcall_bpf2bpf_hierarchy_fexit"))
+		test_tailcall_bpf2bpf_hierarchy_fexit();
+	if (test__start_subtest("tailcall_bpf2bpf_hierarchy_fentry_fexit"))
+		test_tailcall_bpf2bpf_hierarchy_fentry_fexit();
+	if (test__start_subtest("tailcall_bpf2bpf_hierarchy_2"))
+		test_tailcall_bpf2bpf_hierarchy_2();
+	if (test__start_subtest("tailcall_bpf2bpf_hierarchy_3"))
+		test_tailcall_bpf2bpf_hierarchy_3();
 }
diff --git a/tools/testing/selftests/bpf/progs/tailcall_bpf2bpf_hierarchy1.c b/tools/testing/selftests/bpf/progs/tailcall_bpf2bpf_hierarchy1.c
new file mode 100644
index 0000000000000..375b486573395
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/tailcall_bpf2bpf_hierarchy1.c
@@ -0,0 +1,38 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/bpf.h>
+#include <bpf/bpf_helpers.h>
+#include "bpf_legacy.h"
+
+struct {
+	__uint(type, BPF_MAP_TYPE_PROG_ARRAY);
+	__uint(max_entries, 1);
+	__uint(key_size, sizeof(__u32));
+	__uint(value_size, sizeof(__u32));
+} jmp_table SEC(".maps");
+
+int count = 0;
+
+static __noinline
+int subprog_tail(struct __sk_buff *skb)
+{
+	bpf_tail_call_static(skb, &jmp_table, 0);
+	return 0;
+}
+
+SEC("tc")
+int entry(struct __sk_buff *skb)
+{
+	int ret = 1;
+
+	if (count >= 100)
+		/* exit for abnormal case */
+		return count;
+
+	count++;
+	subprog_tail(skb);
+	subprog_tail(skb);
+
+	return ret;
+}
+
+char __license[] SEC("license") = "GPL";
diff --git a/tools/testing/selftests/bpf/progs/tailcall_bpf2bpf_hierarchy2.c b/tools/testing/selftests/bpf/progs/tailcall_bpf2bpf_hierarchy2.c
new file mode 100644
index 0000000000000..4bf65d1c73d98
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/tailcall_bpf2bpf_hierarchy2.c
@@ -0,0 +1,63 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/bpf.h>
+#include <bpf/bpf_helpers.h>
+#include "bpf_legacy.h"
+
+struct {
+	__uint(type, BPF_MAP_TYPE_PROG_ARRAY);
+	__uint(max_entries, 2);
+	__uint(key_size, sizeof(__u32));
+	__uint(value_size, sizeof(__u32));
+} jmp_table SEC(".maps");
+
+int count0 = 0;
+int count1 = 0;
+
+static __noinline
+int subprog_tail0(struct __sk_buff *skb)
+{
+	bpf_tail_call_static(skb, &jmp_table, 0);
+	return 0;
+}
+
+SEC("tc")
+int classifier_0(struct __sk_buff *skb)
+{
+	if (count0 >= 100)
+		/* exit for abnormal case */
+		return count0;
+
+	count0++;
+	subprog_tail0(skb);
+	return 0;
+}
+
+static __noinline
+int subprog_tail1(struct __sk_buff *skb)
+{
+	bpf_tail_call_static(skb, &jmp_table, 1);
+	return 0;
+}
+
+SEC("tc")
+int classifier_1(struct __sk_buff *skb)
+{
+	if (count1 >= 100)
+		/* exit for abnormal case */
+		return count1;
+
+	count1++;
+	subprog_tail1(skb);
+	return 0;
+}
+
+SEC("tc")
+int entry(struct __sk_buff *skb)
+{
+	subprog_tail0(skb);
+	subprog_tail1(skb);
+
+	return 1;
+}
+
+char __license[] SEC("license") = "GPL";
diff --git a/tools/testing/selftests/bpf/progs/tailcall_bpf2bpf_hierarchy3.c b/tools/testing/selftests/bpf/progs/tailcall_bpf2bpf_hierarchy3.c
new file mode 100644
index 0000000000000..68c69d4e97fc9
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/tailcall_bpf2bpf_hierarchy3.c
@@ -0,0 +1,50 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/bpf.h>
+#include <bpf/bpf_helpers.h>
+#include "bpf_legacy.h"
+
+struct {
+	__uint(type, BPF_MAP_TYPE_PROG_ARRAY);
+	__uint(max_entries, 1);
+	__uint(key_size, sizeof(__u32));
+	__uint(value_size, sizeof(__u32));
+} jmp_table0 SEC(".maps");
+
+struct {
+	__uint(type, BPF_MAP_TYPE_PROG_ARRAY);
+	__uint(max_entries, 1);
+	__uint(key_size, sizeof(__u32));
+	__uint(value_size, sizeof(__u32));
+} jmp_table1 SEC(".maps");
+
+int count = 0;
+
+static __noinline
+int subprog_tail(struct __sk_buff *skb, void *jmp_table)
+{
+	bpf_tail_call_static(skb, jmp_table, 0);
+	return 0;
+}
+
+SEC("tc")
+int classifier_0(struct __sk_buff *skb)
+{
+	if (count >= 100)
+		/* exit for abnormal case */
+		return count;
+
+	count++;
+	subprog_tail(skb, &jmp_table0);
+	subprog_tail(skb, &jmp_table1);
+	return 1;
+}
+
+SEC("tc")
+int entry(struct __sk_buff *skb)
+{
+	bpf_tail_call_static(skb, &jmp_table0, 0);
+
+	return 0;
+}
+
+char __license[] SEC("license") = "GPL";
-- 
2.44.0


* Re: [PATCH bpf-next v3 2/3] bpf, x64: Fix tailcall hierarchy
  2024-04-02 15:26 ` [PATCH bpf-next v3 2/3] bpf, x64: Fix tailcall hierarchy Leon Hwang
@ 2024-04-05  1:03   ` Alexei Starovoitov
  2024-04-07 11:34     ` Leon Hwang
  0 siblings, 1 reply; 10+ messages in thread
From: Alexei Starovoitov @ 2024-04-05  1:03 UTC
  To: Leon Hwang
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Fijalkowski, Maciej, Jakub Sitnicki, Pu Lehui, Hengqi Chen,
	kernel-patches-bot

>   * Solution changes from percpu tail_call_cnt to tail_call_cnt at task_struct.

Please remind us what was wrong with per-cpu approach?

Also notice we have pseudo per-cpu bpf insns now,
so things might be easier today.

On Tue, Apr 2, 2024 at 8:27 AM Leon Hwang <hffilwlqm@gmail.com> wrote:
>
> From commit ebf7d1f508a73871 ("bpf, x64: rework pro/epilogue and tailcall
> handling in JIT"), the tailcall on x64 works better than before.
...
>
> As a result, the previous tailcall way can be removed totally, including
>
> 1. "push rax" at prologue.
> 2. load tail_call_cnt to rax before calling function.
> 3. "pop rax" before jumping to tailcallee when tailcall.
> 4. "push rax" and load tail_call_cnt to rax at trampoline.

Please trim it.
It looks like you've been copy pasting it and it's no longer
accurate.
Short description of the problem will do.

> Fixes: ebf7d1f508a7 ("bpf, x64: rework pro/epilogue and tailcall handling in JIT")
> Fixes: e411901c0b77 ("bpf: allow for tailcalls in BPF subprograms for x64 JIT")
> Signed-off-by: Leon Hwang <hffilwlqm@gmail.com>
> ---
>  arch/x86/net/bpf_jit_comp.c | 137 +++++++++++++++++++++---------------
>  1 file changed, 81 insertions(+), 56 deletions(-)
>
> diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
> index 3b639d6f2f54d..cd06e02e83b64 100644
> --- a/arch/x86/net/bpf_jit_comp.c
> +++ b/arch/x86/net/bpf_jit_comp.c
> @@ -11,6 +11,7 @@
>  #include <linux/bpf.h>
>  #include <linux/memory.h>
>  #include <linux/sort.h>
> +#include <linux/sched.h>
>  #include <asm/extable.h>
>  #include <asm/ftrace.h>
>  #include <asm/set_memory.h>
> @@ -18,6 +19,8 @@
>  #include <asm/text-patching.h>
>  #include <asm/unwind.h>
>  #include <asm/cfi.h>
> +#include <asm/current.h>
> +#include <asm/percpu.h>
>
>  static bool all_callee_regs_used[4] = {true, true, true, true};
>
> @@ -273,7 +276,7 @@ struct jit_context {
>  /* Number of bytes emit_patch() needs to generate instructions */
>  #define X86_PATCH_SIZE         5
>  /* Number of bytes that will be skipped on tailcall */
> -#define X86_TAIL_CALL_OFFSET   (11 + ENDBR_INSN_SIZE)
> +#define X86_TAIL_CALL_OFFSET   (14 + ENDBR_INSN_SIZE)
>
>  static void push_r12(u8 **pprog)
>  {
> @@ -403,6 +406,9 @@ static void emit_cfi(u8 **pprog, u32 hash)
>         *pprog = prog;
>  }
>
> +static int emit_call(u8 **pprog, void *func, void *ip);
> +static __used void bpf_tail_call_cnt_init(void);
> +
>  /*
>   * Emit x86-64 prologue code for BPF program.
>   * bpf_tail_call helper will skip the first X86_TAIL_CALL_OFFSET bytes
> @@ -410,9 +416,9 @@ static void emit_cfi(u8 **pprog, u32 hash)
>   */
>  static void emit_prologue(u8 **pprog, u32 stack_depth, bool ebpf_from_cbpf,
>                           bool tail_call_reachable, bool is_subprog,
> -                         bool is_exception_cb)
> +                         bool is_exception_cb, u8 *ip)
>  {
> -       u8 *prog = *pprog;
> +       u8 *prog = *pprog, *start = *pprog;
>
>         emit_cfi(&prog, is_subprog ? cfi_bpf_subprog_hash : cfi_bpf_hash);
>         /* BPF trampoline can be made to work without these nops,
> @@ -421,13 +427,14 @@ static void emit_prologue(u8 **pprog, u32 stack_depth, bool ebpf_from_cbpf,
>         emit_nops(&prog, X86_PATCH_SIZE);
>         if (!ebpf_from_cbpf) {
>                 if (tail_call_reachable && !is_subprog)
> -                       /* When it's the entry of the whole tailcall context,
> -                        * zeroing rax means initialising tail_call_cnt.
> +                       /* Call bpf_tail_call_cnt_init to initilise
> +                        * tail_call_cnt.
>                          */
> -                       EMIT2(0x31, 0xC0); /* xor eax, eax */
> +                       emit_call(&prog, bpf_tail_call_cnt_init,
> +                                 ip + (prog - start));

You're repeating the same bug we discussed before.
There is nothing in bpf_tail_call_cnt_init() that
prevents the compiler from scratching rdi,rsi,...
bpf_tail_call_cnt_init() is a normal function from compiler pov
and it's allowed to use those regs.
Must have been lucky that CI is not showing crashes.

>                 else
>                         /* Keep the same instruction layout. */
> -                       EMIT2(0x66, 0x90); /* nop2 */
> +                       emit_nops(&prog, X86_PATCH_SIZE);
>         }
>         /* Exception callback receives FP as third parameter */
>         if (is_exception_cb) {
> @@ -452,8 +459,6 @@ static void emit_prologue(u8 **pprog, u32 stack_depth, bool ebpf_from_cbpf,
>         /* sub rsp, rounded_stack_depth */
>         if (stack_depth)
>                 EMIT3_off32(0x48, 0x81, 0xEC, round_up(stack_depth, 8));
> -       if (tail_call_reachable)
> -               EMIT1(0x50);         /* push rax */
>         *pprog = prog;
>  }
>
> @@ -589,13 +594,61 @@ static void emit_return(u8 **pprog, u8 *ip)
>         *pprog = prog;
>  }
>
> +static __used void bpf_tail_call_cnt_init(void)
> +{
> +       /* The following asm equals to
> +        *
> +        * u32 *tcc_ptr = &current->bpf_tail_call_cnt;
> +        *
> +        * *tcc_ptr = 0;
> +        */
> +
> +       asm volatile (
> +           "addq " __percpu_arg(0) ", %1\n\t"
> +           "addq %2, %1\n\t"
> +           "movq (%1), %1\n\t"
> +           "addq %3, %1\n\t"
> +           "movl $0, (%1)\n\t"
> +           :
> +           : "m" (this_cpu_off), "r" (&pcpu_hot),
> +             "i" (offsetof(struct pcpu_hot, current_task)),
> +             "i" (offsetof(struct task_struct, bpf_tail_call_cnt))
> +       );
> +}
> +
> +static __used u32 *bpf_tail_call_cnt_ptr(void)
> +{
> +       u32 *tcc_ptr;
> +
> +       /* The following asm equals to
> +        *
> +        * u32 *tcc_ptr = &current->bpf_tail_call_cnt;
> +        *
> +        * return tcc_ptr;
> +        */
> +
> +       asm volatile (
> +           "addq " __percpu_arg(1) ", %2\n\t"
> +           "addq %3, %2\n\t"
> +           "movq (%2), %2\n\t"
> +           "addq %4, %2\n\t"
> +           "movq %2, %0\n\t"
> +           : "=r" (tcc_ptr)
> +           : "m" (this_cpu_off), "r" (&pcpu_hot),
> +             "i" (offsetof(struct pcpu_hot, current_task)),
> +             "i" (offsetof(struct task_struct, bpf_tail_call_cnt))
> +       );
> +
> +       return tcc_ptr;
> +}
> +
>  /*
>   * Generate the following code:
>   *
>   * ... bpf_tail_call(void *ctx, struct bpf_array *array, u64 index) ...
>   *   if (index >= array->map.max_entries)
>   *     goto out;
> - *   if (tail_call_cnt++ >= MAX_TAIL_CALL_CNT)
> + *   if ((*tcc_ptr)++ >= MAX_TAIL_CALL_CNT)
>   *     goto out;
>   *   prog = array->ptrs[index];
>   *   if (prog == NULL)
> @@ -608,7 +661,6 @@ static void emit_bpf_tail_call_indirect(struct bpf_prog *bpf_prog,
>                                         u32 stack_depth, u8 *ip,
>                                         struct jit_context *ctx)
>  {
> -       int tcc_off = -4 - round_up(stack_depth, 8);
>         u8 *prog = *pprog, *start = *pprog;
>         int offset;
>
> @@ -630,16 +682,16 @@ static void emit_bpf_tail_call_indirect(struct bpf_prog *bpf_prog,
>         EMIT2(X86_JBE, offset);                   /* jbe out */
>
>         /*
> -        * if (tail_call_cnt++ >= MAX_TAIL_CALL_CNT)
> +        * if ((*tcc_ptr)++ >= MAX_TAIL_CALL_CNT)
>          *      goto out;
>          */
> -       EMIT2_off32(0x8B, 0x85, tcc_off);         /* mov eax, dword ptr [rbp - tcc_off] */
> -       EMIT3(0x83, 0xF8, MAX_TAIL_CALL_CNT);     /* cmp eax, MAX_TAIL_CALL_CNT */
> +       /* call bpf_tail_call_cnt_ptr */
> +       emit_call(&prog, bpf_tail_call_cnt_ptr, ip + (prog - start));

same issue.

> +       EMIT3(0x83, 0x38, MAX_TAIL_CALL_CNT);     /* cmp dword ptr [rax], MAX_TAIL_CALL_CNT */
>
>         offset = ctx->tail_call_indirect_label - (prog + 2 - start);
>         EMIT2(X86_JAE, offset);                   /* jae out */
> -       EMIT3(0x83, 0xC0, 0x01);                  /* add eax, 1 */
> -       EMIT2_off32(0x89, 0x85, tcc_off);         /* mov dword ptr [rbp - tcc_off], eax */
> +       EMIT2(0xFF, 0x00);                        /* inc dword ptr [rax] */
>
>         /* prog = array->ptrs[index]; */
>         EMIT4_off32(0x48, 0x8B, 0x8C, 0xD6,       /* mov rcx, [rsi + rdx * 8 + offsetof(...)] */
> @@ -663,7 +715,6 @@ static void emit_bpf_tail_call_indirect(struct bpf_prog *bpf_prog,
>                         pop_r12(&prog);
>         }
>
> -       EMIT1(0x58);                              /* pop rax */
>         if (stack_depth)
>                 EMIT3_off32(0x48, 0x81, 0xC4,     /* add rsp, sd */
>                             round_up(stack_depth, 8));
> @@ -691,21 +742,20 @@ static void emit_bpf_tail_call_direct(struct bpf_prog *bpf_prog,
>                                       bool *callee_regs_used, u32 stack_depth,
>                                       struct jit_context *ctx)
>  {
> -       int tcc_off = -4 - round_up(stack_depth, 8);
>         u8 *prog = *pprog, *start = *pprog;
>         int offset;
>
>         /*
> -        * if (tail_call_cnt++ >= MAX_TAIL_CALL_CNT)
> +        * if ((*tcc_ptr)++ >= MAX_TAIL_CALL_CNT)
>          *      goto out;
>          */
> -       EMIT2_off32(0x8B, 0x85, tcc_off);             /* mov eax, dword ptr [rbp - tcc_off] */
> -       EMIT3(0x83, 0xF8, MAX_TAIL_CALL_CNT);         /* cmp eax, MAX_TAIL_CALL_CNT */
> +       /* call bpf_tail_call_cnt_ptr */
> +       emit_call(&prog, bpf_tail_call_cnt_ptr, ip);

and here as well.

pw-bot: cr

* Re: [PATCH bpf-next v3 2/3] bpf, x64: Fix tailcall hierarchy
  2024-04-05  1:03   ` Alexei Starovoitov
@ 2024-04-07 11:34     ` Leon Hwang
  2024-04-07 16:30       ` Alexei Starovoitov
  0 siblings, 1 reply; 10+ messages in thread
From: Leon Hwang @ 2024-04-07 11:34 UTC
  To: Alexei Starovoitov
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Fijalkowski, Maciej, Jakub Sitnicki, Pu Lehui, Hengqi Chen,
	kernel-patches-bot



On 2024/4/5 09:03, Alexei Starovoitov wrote:
>>   * Solution changes from percpu tail_call_cnt to tail_call_cnt at task_struct.
> 
> Please remind us what was wrong with per-cpu approach?

There are some cases that the per-cpu approach cannot handle properly.
Especially on a non-SMP machine, if many bpf progs that use tail calls run
simultaneously (interleaved on the same CPU, e.g. one prog interrupted by
another), they share the same per-cpu counter, so the MAX_TAIL_CALL_CNT
limit is unable to limit the tail calls as expected.

> 
> Also notice we have pseudo per-cpu bpf insns now,
> so things might be easier today.

Great to hear that. With pseudo per-cpu bpf insns, it is possible to get
tcc_ptr from task_struct without a function call.

> 
> On Tue, Apr 2, 2024 at 8:27 AM Leon Hwang <hffilwlqm@gmail.com> wrote:
>>
>> From commit ebf7d1f508a73871 ("bpf, x64: rework pro/epilogue and tailcall
>> handling in JIT"), the tailcall on x64 works better than before.
> ...
>>
>> As a result, the previous tailcall way can be removed totally, including
>>
>> 1. "push rax" at prologue.
>> 2. load tail_call_cnt to rax before calling function.
>> 3. "pop rax" before jumping to tailcallee when tailcall.
>> 4. "push rax" and load tail_call_cnt to rax at trampoline.
> 
> Please trim it.
> It looks like you've been copy pasting it and it's no longer
> accurate.
> Short description of the problem will do.

Got it.

> 
>> Fixes: ebf7d1f508a7 ("bpf, x64: rework pro/epilogue and tailcall handling in JIT")
>> Fixes: e411901c0b77 ("bpf: allow for tailcalls in BPF subprograms for x64 JIT")
>> Signed-off-by: Leon Hwang <hffilwlqm@gmail.com>
>> ---
>>  arch/x86/net/bpf_jit_comp.c | 137 +++++++++++++++++++++---------------
>>  1 file changed, 81 insertions(+), 56 deletions(-)
>>
>> diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
>> index 3b639d6f2f54d..cd06e02e83b64 100644
>> --- a/arch/x86/net/bpf_jit_comp.c
>> +++ b/arch/x86/net/bpf_jit_comp.c
>> @@ -11,6 +11,7 @@
>>  #include <linux/bpf.h>
>>  #include <linux/memory.h>
>>  #include <linux/sort.h>
>> +#include <linux/sched.h>
>>  #include <asm/extable.h>
>>  #include <asm/ftrace.h>
>>  #include <asm/set_memory.h>
>> @@ -18,6 +19,8 @@
>>  #include <asm/text-patching.h>
>>  #include <asm/unwind.h>
>>  #include <asm/cfi.h>
>> +#include <asm/current.h>
>> +#include <asm/percpu.h>
>>
>>  static bool all_callee_regs_used[4] = {true, true, true, true};
>>
>> @@ -273,7 +276,7 @@ struct jit_context {
>>  /* Number of bytes emit_patch() needs to generate instructions */
>>  #define X86_PATCH_SIZE         5
>>  /* Number of bytes that will be skipped on tailcall */
>> -#define X86_TAIL_CALL_OFFSET   (11 + ENDBR_INSN_SIZE)
>> +#define X86_TAIL_CALL_OFFSET   (14 + ENDBR_INSN_SIZE)
>>
>>  static void push_r12(u8 **pprog)
>>  {
>> @@ -403,6 +406,9 @@ static void emit_cfi(u8 **pprog, u32 hash)
>>         *pprog = prog;
>>  }
>>
>> +static int emit_call(u8 **pprog, void *func, void *ip);
>> +static __used void bpf_tail_call_cnt_init(void);
>> +
>>  /*
>>   * Emit x86-64 prologue code for BPF program.
>>   * bpf_tail_call helper will skip the first X86_TAIL_CALL_OFFSET bytes
>> @@ -410,9 +416,9 @@ static void emit_cfi(u8 **pprog, u32 hash)
>>   */
>>  static void emit_prologue(u8 **pprog, u32 stack_depth, bool ebpf_from_cbpf,
>>                           bool tail_call_reachable, bool is_subprog,
>> -                         bool is_exception_cb)
>> +                         bool is_exception_cb, u8 *ip)
>>  {
>> -       u8 *prog = *pprog;
>> +       u8 *prog = *pprog, *start = *pprog;
>>
>>         emit_cfi(&prog, is_subprog ? cfi_bpf_subprog_hash : cfi_bpf_hash);
>>         /* BPF trampoline can be made to work without these nops,
>> @@ -421,13 +427,14 @@ static void emit_prologue(u8 **pprog, u32 stack_depth, bool ebpf_from_cbpf,
>>         emit_nops(&prog, X86_PATCH_SIZE);
>>         if (!ebpf_from_cbpf) {
>>                 if (tail_call_reachable && !is_subprog)
>> -                       /* When it's the entry of the whole tailcall context,
>> -                        * zeroing rax means initialising tail_call_cnt.
>> +                       /* Call bpf_tail_call_cnt_init to initilise
>> +                        * tail_call_cnt.
>>                          */
>> -                       EMIT2(0x31, 0xC0); /* xor eax, eax */
>> +                       emit_call(&prog, bpf_tail_call_cnt_init,
>> +                                 ip + (prog - start));
> 
> You're repeating the same bug we discussed before.
> There is nothing in bpf_tail_call_cnt_init() that
> prevents the compiler from scratching rdi,rsi,...
> bpf_tail_call_cnt_init() is a normal function from compiler pov
> and it's allowed to use those regs.
> Must have been lucky that CI is not showing crashes.

Oh, got it. To prevent the compiler from scratching rdi, rsi, ..., the
asm clobber list in bpf_tail_call_cnt_init() must name "rdi", "rsi",
"rdx", "rcx", "r8". I learned this from the GCC doc[0].

static __used void bpf_tail_call_cnt_init(void)
{
	/* In short:
	 * current->bpf_tail_call_cnt = 0;
	 */

	asm volatile (
	    "addq " __percpu_arg(0) ", %1\n\t"
	    "movq (%1), %1\n\t"
	    "movl $0x0, %c2(%1)\n\t"
	    :
	    : "m" (__my_cpu_var(this_cpu_off)), "r" (&pcpu_hot.current_task),
	      "i" (offsetof(struct task_struct, bpf_tail_call_cnt))
	    : "rdi", "rsi", "rdx", "rcx", "r8" /* to avoid scratching these regs */
	);
}

[0]
https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html#Clobbers-and-Scratch-Registers-1

> 
>>                 else
>>                         /* Keep the same instruction layout. */
>> -                       EMIT2(0x66, 0x90); /* nop2 */
>> +                       emit_nops(&prog, X86_PATCH_SIZE);
>>         }
>>         /* Exception callback receives FP as third parameter */
>>         if (is_exception_cb) {
>> @@ -452,8 +459,6 @@ static void emit_prologue(u8 **pprog, u32 stack_depth, bool ebpf_from_cbpf,
>>         /* sub rsp, rounded_stack_depth */
>>         if (stack_depth)
>>                 EMIT3_off32(0x48, 0x81, 0xEC, round_up(stack_depth, 8));
>> -       if (tail_call_reachable)
>> -               EMIT1(0x50);         /* push rax */
>>         *pprog = prog;
>>  }
>>
>> @@ -589,13 +594,61 @@ static void emit_return(u8 **pprog, u8 *ip)
>>         *pprog = prog;
>>  }
>>
>> +static __used void bpf_tail_call_cnt_init(void)
>> +{
>> +       /* The following asm equals to
>> +        *
>> +        * u32 *tcc_ptr = &current->bpf_tail_call_cnt;
>> +        *
>> +        * *tcc_ptr = 0;
>> +        */
>> +
>> +       asm volatile (
>> +           "addq " __percpu_arg(0) ", %1\n\t"
>> +           "addq %2, %1\n\t"
>> +           "movq (%1), %1\n\t"
>> +           "addq %3, %1\n\t"
>> +           "movl $0, (%1)\n\t"
>> +           :
>> +           : "m" (this_cpu_off), "r" (&pcpu_hot),
>> +             "i" (offsetof(struct pcpu_hot, current_task)),
>> +             "i" (offsetof(struct task_struct, bpf_tail_call_cnt))
>> +       );
>> +}
>> +
>> +static __used u32 *bpf_tail_call_cnt_ptr(void)
>> +{
>> +       u32 *tcc_ptr;
>> +
>> +       /* The following asm equals to
>> +        *
>> +        * u32 *tcc_ptr = &current->bpf_tail_call_cnt;
>> +        *
>> +        * return tcc_ptr;
>> +        */
>> +
>> +       asm volatile (
>> +           "addq " __percpu_arg(1) ", %2\n\t"
>> +           "addq %3, %2\n\t"
>> +           "movq (%2), %2\n\t"
>> +           "addq %4, %2\n\t"
>> +           "movq %2, %0\n\t"
>> +           : "=r" (tcc_ptr)
>> +           : "m" (this_cpu_off), "r" (&pcpu_hot),
>> +             "i" (offsetof(struct pcpu_hot, current_task)),
>> +             "i" (offsetof(struct task_struct, bpf_tail_call_cnt))
>> +       );
>> +
>> +       return tcc_ptr;
>> +}
>> +
>>  /*
>>   * Generate the following code:
>>   *
>>   * ... bpf_tail_call(void *ctx, struct bpf_array *array, u64 index) ...
>>   *   if (index >= array->map.max_entries)
>>   *     goto out;
>> - *   if (tail_call_cnt++ >= MAX_TAIL_CALL_CNT)
>> + *   if ((*tcc_ptr)++ >= MAX_TAIL_CALL_CNT)
>>   *     goto out;
>>   *   prog = array->ptrs[index];
>>   *   if (prog == NULL)
>> @@ -608,7 +661,6 @@ static void emit_bpf_tail_call_indirect(struct bpf_prog *bpf_prog,
>>                                         u32 stack_depth, u8 *ip,
>>                                         struct jit_context *ctx)
>>  {
>> -       int tcc_off = -4 - round_up(stack_depth, 8);
>>         u8 *prog = *pprog, *start = *pprog;
>>         int offset;
>>
>> @@ -630,16 +682,16 @@ static void emit_bpf_tail_call_indirect(struct bpf_prog *bpf_prog,
>>         EMIT2(X86_JBE, offset);                   /* jbe out */
>>
>>         /*
>> -        * if (tail_call_cnt++ >= MAX_TAIL_CALL_CNT)
>> +        * if ((*tcc_ptr)++ >= MAX_TAIL_CALL_CNT)
>>          *      goto out;
>>          */
>> -       EMIT2_off32(0x8B, 0x85, tcc_off);         /* mov eax, dword ptr [rbp - tcc_off] */
>> -       EMIT3(0x83, 0xF8, MAX_TAIL_CALL_CNT);     /* cmp eax, MAX_TAIL_CALL_CNT */
>> +       /* call bpf_tail_call_cnt_ptr */
>> +       emit_call(&prog, bpf_tail_call_cnt_ptr, ip + (prog - start));
> 
> same issue.

I will rewrite it to emit_bpf_tail_call_cnt_ptr(), which will use pseudo
per-cpu bpf insns to get tcc_ptr from task_struct.

static void emit_bpf_tail_call_cnt_ptr(u8 **pprog)
{
	u8 *prog = *pprog;

	/* In short:
	 * return &current->bpf_tail_call_cnt;
	 */

	/* mov rax, &pcpu_hot.current_task */
	EMIT3_off32(0x48, 0xC7, 0xC0,
		    ((u32)(unsigned long)&pcpu_hot.current_task));

#ifdef CONFIG_SMP
	/* add rax, gs:[&this_cpu_off] */
	EMIT1(0x65);
	EMIT4_off32(0x48, 0x03, 0x04, 0x25, ((u32)(unsigned long)&this_cpu_off));
#endif

	/* mov rax, qword ptr [rax] */
	EMIT3(0x48, 0x8B, 0x00);
	/* add rax, offsetof(struct task_struct, bpf_tail_call_cnt) */
	EMIT2_off32(0x48, 0x05,
		    ((u32)offsetof(struct task_struct, bpf_tail_call_cnt)));

	*pprog = prog;
}

> 
>> +       EMIT3(0x83, 0x38, MAX_TAIL_CALL_CNT);     /* cmp dword ptr [rax], MAX_TAIL_CALL_CNT */
>>
>>         offset = ctx->tail_call_indirect_label - (prog + 2 - start);
>>         EMIT2(X86_JAE, offset);                   /* jae out */
>> -       EMIT3(0x83, 0xC0, 0x01);                  /* add eax, 1 */
>> -       EMIT2_off32(0x89, 0x85, tcc_off);         /* mov dword ptr [rbp - tcc_off], eax */
>> +       EMIT2(0xFF, 0x00);                        /* inc dword ptr [rax] */
>>
>>         /* prog = array->ptrs[index]; */
>>         EMIT4_off32(0x48, 0x8B, 0x8C, 0xD6,       /* mov rcx, [rsi + rdx * 8 + offsetof(...)] */
>> @@ -663,7 +715,6 @@ static void emit_bpf_tail_call_indirect(struct bpf_prog *bpf_prog,
>>                         pop_r12(&prog);
>>         }
>>
>> -       EMIT1(0x58);                              /* pop rax */
>>         if (stack_depth)
>>                 EMIT3_off32(0x48, 0x81, 0xC4,     /* add rsp, sd */
>>                             round_up(stack_depth, 8));
>> @@ -691,21 +742,20 @@ static void emit_bpf_tail_call_direct(struct bpf_prog *bpf_prog,
>>                                       bool *callee_regs_used, u32 stack_depth,
>>                                       struct jit_context *ctx)
>>  {
>> -       int tcc_off = -4 - round_up(stack_depth, 8);
>>         u8 *prog = *pprog, *start = *pprog;
>>         int offset;
>>
>>         /*
>> -        * if (tail_call_cnt++ >= MAX_TAIL_CALL_CNT)
>> +        * if ((*tcc_ptr)++ >= MAX_TAIL_CALL_CNT)
>>          *      goto out;
>>          */
>> -       EMIT2_off32(0x8B, 0x85, tcc_off);             /* mov eax, dword ptr [rbp - tcc_off] */
>> -       EMIT3(0x83, 0xF8, MAX_TAIL_CALL_CNT);         /* cmp eax, MAX_TAIL_CALL_CNT */
>> +       /* call bpf_tail_call_cnt_ptr */
>> +       emit_call(&prog, bpf_tail_call_cnt_ptr, ip);
> 
> and here as well.

Replace with emit_bpf_tail_call_cnt_ptr(), too.

> 
> pw-bot: cr

I would like to send the next version with these updates.

Thanks,
Leon

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH bpf-next v3 2/3] bpf, x64: Fix tailcall hierarchy
  2024-04-07 11:34     ` Leon Hwang
@ 2024-04-07 16:30       ` Alexei Starovoitov
  2024-04-10 14:09         ` Leon Hwang
  0 siblings, 1 reply; 10+ messages in thread
From: Alexei Starovoitov @ 2024-04-07 16:30 UTC (permalink / raw
  To: Leon Hwang
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Fijalkowski, Maciej, Jakub Sitnicki, Pu Lehui, Hengqi Chen,
	kernel-patches-bot

On Sun, Apr 7, 2024 at 4:34 AM Leon Hwang <hffilwlqm@gmail.com> wrote:
>
>
>
> On 2024/4/5 09:03, Alexei Starovoitov wrote:
> >>   * Solution changes from percpu tail_call_cnt to tail_call_cnt at task_struct.
> >
> > Please remind us what was wrong with per-cpu approach?
>
> There are some cases that the per-cpu approach cannot handle properly.
> Especialy, on non-SMP machine, if there are many bpf progs to run with
> tail call simultaneously, MAX_TAIL_CALL_CNT limit is unable to limit the
> tail call expectedly.

That's not a very helpful explanation.
Last, I recall, the concern was that tcc will be a bit off.
The per-cpu tcc will limit recursion sooner when multiple progs
tail_call-ing on the same cpu?
If so, I think it's a trade-off that should be discussed.
tcc in task_struct will have the same issue.
It will limit tailcalls sooner in some cases.

There were some issues with overriding of per-cpu tcc.
The same concerns apply to per-task tcc.

> >
> > Also notice we have pseudo per-cpu bpf insns now,
> > so things might be easier today.
>
> Great to hear that. With pseudo per-cpu bpf insns, it is able to get
> tcc_ptr from task_struct without a function call.
>
> >
> > On Tue, Apr 2, 2024 at 8:27 AM Leon Hwang <hffilwlqm@gmail.com> wrote:
> >>
> >> From commit ebf7d1f508a73871 ("bpf, x64: rework pro/epilogue and tailcall
> >> handling in JIT"), the tailcall on x64 works better than before.
> > ...
> >>
> >> As a result, the previous tailcall way can be removed totally, including
> >>
> >> 1. "push rax" at prologue.
> >> 2. load tail_call_cnt to rax before calling function.
> >> 3. "pop rax" before jumping to tailcallee when tailcall.
> >> 4. "push rax" and load tail_call_cnt to rax at trampoline.
> >
> > Please trim it.
> > It looks like you've been copy pasting it and it's no longer
> > accurate.
> > Short description of the problem will do.
>
> Got it.
>
> >
> >> Fixes: ebf7d1f508a7 ("bpf, x64: rework pro/epilogue and tailcall handling in JIT")
> >> Fixes: e411901c0b77 ("bpf: allow for tailcalls in BPF subprograms for x64 JIT")
> >> Signed-off-by: Leon Hwang <hffilwlqm@gmail.com>
> >> ---
> >>  arch/x86/net/bpf_jit_comp.c | 137 +++++++++++++++++++++---------------
> >>  1 file changed, 81 insertions(+), 56 deletions(-)
> >>
> >> diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
> >> index 3b639d6f2f54d..cd06e02e83b64 100644
> >> --- a/arch/x86/net/bpf_jit_comp.c
> >> +++ b/arch/x86/net/bpf_jit_comp.c
> >> @@ -11,6 +11,7 @@
> >>  #include <linux/bpf.h>
> >>  #include <linux/memory.h>
> >>  #include <linux/sort.h>
> >> +#include <linux/sched.h>
> >>  #include <asm/extable.h>
> >>  #include <asm/ftrace.h>
> >>  #include <asm/set_memory.h>
> >> @@ -18,6 +19,8 @@
> >>  #include <asm/text-patching.h>
> >>  #include <asm/unwind.h>
> >>  #include <asm/cfi.h>
> >> +#include <asm/current.h>
> >> +#include <asm/percpu.h>
> >>
> >>  static bool all_callee_regs_used[4] = {true, true, true, true};
> >>
> >> @@ -273,7 +276,7 @@ struct jit_context {
> >>  /* Number of bytes emit_patch() needs to generate instructions */
> >>  #define X86_PATCH_SIZE         5
> >>  /* Number of bytes that will be skipped on tailcall */
> >> -#define X86_TAIL_CALL_OFFSET   (11 + ENDBR_INSN_SIZE)
> >> +#define X86_TAIL_CALL_OFFSET   (14 + ENDBR_INSN_SIZE)
> >>
> >>  static void push_r12(u8 **pprog)
> >>  {
> >> @@ -403,6 +406,9 @@ static void emit_cfi(u8 **pprog, u32 hash)
> >>         *pprog = prog;
> >>  }
> >>
> >> +static int emit_call(u8 **pprog, void *func, void *ip);
> >> +static __used void bpf_tail_call_cnt_init(void);
> >> +
> >>  /*
> >>   * Emit x86-64 prologue code for BPF program.
> >>   * bpf_tail_call helper will skip the first X86_TAIL_CALL_OFFSET bytes
> >> @@ -410,9 +416,9 @@ static void emit_cfi(u8 **pprog, u32 hash)
> >>   */
> >>  static void emit_prologue(u8 **pprog, u32 stack_depth, bool ebpf_from_cbpf,
> >>                           bool tail_call_reachable, bool is_subprog,
> >> -                         bool is_exception_cb)
> >> +                         bool is_exception_cb, u8 *ip)
> >>  {
> >> -       u8 *prog = *pprog;
> >> +       u8 *prog = *pprog, *start = *pprog;
> >>
> >>         emit_cfi(&prog, is_subprog ? cfi_bpf_subprog_hash : cfi_bpf_hash);
> >>         /* BPF trampoline can be made to work without these nops,
> >> @@ -421,13 +427,14 @@ static void emit_prologue(u8 **pprog, u32 stack_depth, bool ebpf_from_cbpf,
> >>         emit_nops(&prog, X86_PATCH_SIZE);
> >>         if (!ebpf_from_cbpf) {
> >>                 if (tail_call_reachable && !is_subprog)
> >> -                       /* When it's the entry of the whole tailcall context,
> >> -                        * zeroing rax means initialising tail_call_cnt.
> >> +                       /* Call bpf_tail_call_cnt_init to initilise
> >> +                        * tail_call_cnt.
> >>                          */
> >> -                       EMIT2(0x31, 0xC0); /* xor eax, eax */
> >> +                       emit_call(&prog, bpf_tail_call_cnt_init,
> >> +                                 ip + (prog - start));
> >
> > You're repeating the same bug we discussed before.
> > There is nothing in bpf_tail_call_cnt_init() that
> > prevents the compiler from scratching rdi,rsi,...
> > bpf_tail_call_cnt_init() is a normal function from compiler pov
> > and it's allowed to use those regs.
> > Must have been lucky that CI is not showing crashes.
>
> Oh, get it. In order to prevent the compiler from scratching
> rdi,rsi,..., the asm clobbered register list in bpf_tail_call_cnt_init()
> must be "rdi", "rsi", "rdx", "rcx", "r8". I learn it from the GCC doc[0].
>
> static __used void bpf_tail_call_cnt_init(void)
> {
>         /* In short:
>          * current->bpf_tail_call_cnt = 0;
>          */
>
>         asm volatile (
>             "addq " __percpu_arg(0) ", %1\n\t"
>             "movq (%1), %1\n\t"
>             "movl $0x0, %c2(%1)\n\t"
>             :
>             : "m" (__my_cpu_var(this_cpu_off)), "r" (&pcpu_hot.current_task),
>               "i" (offsetof(struct task_struct, bpf_tail_call_cnt))
>             : "rdi", "rsi", "rdx", "rcx", "r8" /* to avoid scratching these regs */

That will only prevent the compiler from allocating these regs
into "r" constraint, but the compiler can still use them outside of asm.
You need __naked too.
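
To illustrate the shape difference with a simplified sketch (the two toy
functions below are only an illustration, not the actual kernel helper;
the per-cpu address math is omitted):

/* Clobbers only stop the compiler from keeping values in the clobbered
 * regs across the asm; the compiler may still emit its own code in this
 * function that uses rdi/rsi/... before or after the asm statement.
 */
static __used void init_plain(void)
{
	asm volatile ("movl $0, %%eax" ::: "rax");
}

/* With __naked there is no compiler-generated prologue/epilogue at all:
 * the emitted function body is exactly the asm below, so no
 * compiler-generated code can scratch the BPF argument registers, and
 * the asm must supply its own ret.
 */
static __naked __used void init_naked(void)
{
	asm volatile (
	    "movl $0, %eax\n\t"
	    "ret\n\t");
}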

>         );
> }
>
> [0]
> https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html#Clobbers-and-Scratch-Registers-1
>
> >
> >>                 else
> >>                         /* Keep the same instruction layout. */
> >> -                       EMIT2(0x66, 0x90); /* nop2 */
> >> +                       emit_nops(&prog, X86_PATCH_SIZE);
> >>         }
> >>         /* Exception callback receives FP as third parameter */
> >>         if (is_exception_cb) {
> >> @@ -452,8 +459,6 @@ static void emit_prologue(u8 **pprog, u32 stack_depth, bool ebpf_from_cbpf,
> >>         /* sub rsp, rounded_stack_depth */
> >>         if (stack_depth)
> >>                 EMIT3_off32(0x48, 0x81, 0xEC, round_up(stack_depth, 8));
> >> -       if (tail_call_reachable)
> >> -               EMIT1(0x50);         /* push rax */
> >>         *pprog = prog;
> >>  }
> >>
> >> @@ -589,13 +594,61 @@ static void emit_return(u8 **pprog, u8 *ip)
> >>         *pprog = prog;
> >>  }
> >>
> >> +static __used void bpf_tail_call_cnt_init(void)
> >> +{
> >> +       /* The following asm equals to
> >> +        *
> >> +        * u32 *tcc_ptr = &current->bpf_tail_call_cnt;
> >> +        *
> >> +        * *tcc_ptr = 0;
> >> +        */
> >> +
> >> +       asm volatile (
> >> +           "addq " __percpu_arg(0) ", %1\n\t"
> >> +           "addq %2, %1\n\t"
> >> +           "movq (%1), %1\n\t"
> >> +           "addq %3, %1\n\t"
> >> +           "movl $0, (%1)\n\t"
> >> +           :
> >> +           : "m" (this_cpu_off), "r" (&pcpu_hot),
> >> +             "i" (offsetof(struct pcpu_hot, current_task)),
> >> +             "i" (offsetof(struct task_struct, bpf_tail_call_cnt))
> >> +       );
> >> +}
> >> +
> >> +static __used u32 *bpf_tail_call_cnt_ptr(void)
> >> +{
> >> +       u32 *tcc_ptr;
> >> +
> >> +       /* The following asm equals to
> >> +        *
> >> +        * u32 *tcc_ptr = &current->bpf_tail_call_cnt;
> >> +        *
> >> +        * return tcc_ptr;
> >> +        */
> >> +
> >> +       asm volatile (
> >> +           "addq " __percpu_arg(1) ", %2\n\t"
> >> +           "addq %3, %2\n\t"
> >> +           "movq (%2), %2\n\t"
> >> +           "addq %4, %2\n\t"
> >> +           "movq %2, %0\n\t"
> >> +           : "=r" (tcc_ptr)
> >> +           : "m" (this_cpu_off), "r" (&pcpu_hot),
> >> +             "i" (offsetof(struct pcpu_hot, current_task)),
> >> +             "i" (offsetof(struct task_struct, bpf_tail_call_cnt))
> >> +       );
> >> +
> >> +       return tcc_ptr;
> >> +}
> >> +
> >>  /*
> >>   * Generate the following code:
> >>   *
> >>   * ... bpf_tail_call(void *ctx, struct bpf_array *array, u64 index) ...
> >>   *   if (index >= array->map.max_entries)
> >>   *     goto out;
> >> - *   if (tail_call_cnt++ >= MAX_TAIL_CALL_CNT)
> >> + *   if ((*tcc_ptr)++ >= MAX_TAIL_CALL_CNT)
> >>   *     goto out;
> >>   *   prog = array->ptrs[index];
> >>   *   if (prog == NULL)
> >> @@ -608,7 +661,6 @@ static void emit_bpf_tail_call_indirect(struct bpf_prog *bpf_prog,
> >>                                         u32 stack_depth, u8 *ip,
> >>                                         struct jit_context *ctx)
> >>  {
> >> -       int tcc_off = -4 - round_up(stack_depth, 8);
> >>         u8 *prog = *pprog, *start = *pprog;
> >>         int offset;
> >>
> >> @@ -630,16 +682,16 @@ static void emit_bpf_tail_call_indirect(struct bpf_prog *bpf_prog,
> >>         EMIT2(X86_JBE, offset);                   /* jbe out */
> >>
> >>         /*
> >> -        * if (tail_call_cnt++ >= MAX_TAIL_CALL_CNT)
> >> +        * if ((*tcc_ptr)++ >= MAX_TAIL_CALL_CNT)
> >>          *      goto out;
> >>          */
> >> -       EMIT2_off32(0x8B, 0x85, tcc_off);         /* mov eax, dword ptr [rbp - tcc_off] */
> >> -       EMIT3(0x83, 0xF8, MAX_TAIL_CALL_CNT);     /* cmp eax, MAX_TAIL_CALL_CNT */
> >> +       /* call bpf_tail_call_cnt_ptr */
> >> +       emit_call(&prog, bpf_tail_call_cnt_ptr, ip + (prog - start));
> >
> > same issue.
>
> I will rewrite it to emit_bpf_tail_call_cnt_ptr(), which will use pseudo
> per-cpu bpf insns to get tcc_ptr from task_struct.
>
> static void emit_bpf_tail_call_cnt_ptr(u8 **pprog)
> {
>         u8 *prog = *pprog;
>
>         /* In short:
>          * return &current->bpf_tail_call_cnt;
>          */
>
>         /* mov rax, &pcpu_hot.current_task */
>         EMIT3_off32(0x48, 0xC7, 0xC0, ((u32)(unsigned
> long)&pcpu_hot.current_task));
>
> #ifdef CONFIG_SMP
>         /* add rax, gs:[&this_cpu_off] */
>         EMIT1(0x65);
>         EMIT4_off32(0x48, 0x03, 0x04, 0x25, ((u32)(unsigned long)&this_cpu_off));
> #endif
>
>         /* mov rax, qword ptr [rax] */
>         EMIT3(0x48, 0x8B, 0x00);
>         /* add rax, offsetof(struct task_struct, bpf_tail_call_cnt) */
>         EMIT2_off32(0x48, 0x05, ((u32)offsetof(struct task_struct,
> bpf_tail_call_cnt)));
>
>         *pprog = prog;
> }

I think it's cleaner to use __naked func with asm volatile
and explicit 'rax'.

The suggestion to consider BPF_ADDR_PERCPU insn was in the context
of generating it in the verifier, so that JITs don't need
to do anything special with tcc.
Like if the verifier emits tcc checks that JIT's
emit_bpf_tail_call_[in]direct() will only deal with the actual call.
That was a bit of an orthogonal optimization/cleanup idea.

>
> >
> >> +       EMIT3(0x83, 0x38, MAX_TAIL_CALL_CNT);     /* cmp dword ptr [rax], MAX_TAIL_CALL_CNT */
> >>
> >>         offset = ctx->tail_call_indirect_label - (prog + 2 - start);
> >>         EMIT2(X86_JAE, offset);                   /* jae out */
> >> -       EMIT3(0x83, 0xC0, 0x01);                  /* add eax, 1 */
> >> -       EMIT2_off32(0x89, 0x85, tcc_off);         /* mov dword ptr [rbp - tcc_off], eax */
> >> +       EMIT2(0xFF, 0x00);                        /* inc dword ptr [rax] */
> >>
> >>         /* prog = array->ptrs[index]; */
> >>         EMIT4_off32(0x48, 0x8B, 0x8C, 0xD6,       /* mov rcx, [rsi + rdx * 8 + offsetof(...)] */
> >> @@ -663,7 +715,6 @@ static void emit_bpf_tail_call_indirect(struct bpf_prog *bpf_prog,
> >>                         pop_r12(&prog);
> >>         }
> >>
> >> -       EMIT1(0x58);                              /* pop rax */
> >>         if (stack_depth)
> >>                 EMIT3_off32(0x48, 0x81, 0xC4,     /* add rsp, sd */
> >>                             round_up(stack_depth, 8));
> >> @@ -691,21 +742,20 @@ static void emit_bpf_tail_call_direct(struct bpf_prog *bpf_prog,
> >>                                       bool *callee_regs_used, u32 stack_depth,
> >>                                       struct jit_context *ctx)
> >>  {
> >> -       int tcc_off = -4 - round_up(stack_depth, 8);
> >>         u8 *prog = *pprog, *start = *pprog;
> >>         int offset;
> >>
> >>         /*
> >> -        * if (tail_call_cnt++ >= MAX_TAIL_CALL_CNT)
> >> +        * if ((*tcc_ptr)++ >= MAX_TAIL_CALL_CNT)
> >>          *      goto out;
> >>          */
> >> -       EMIT2_off32(0x8B, 0x85, tcc_off);             /* mov eax, dword ptr [rbp - tcc_off] */
> >> -       EMIT3(0x83, 0xF8, MAX_TAIL_CALL_CNT);         /* cmp eax, MAX_TAIL_CALL_CNT */
> >> +       /* call bpf_tail_call_cnt_ptr */
> >> +       emit_call(&prog, bpf_tail_call_cnt_ptr, ip);
> >
> > and here as well.
>
> Replace with emit_bpf_tail_call_cnt_ptr(), too.
>
> >
> > pw-bot: cr
>
> I would like to send next version with these update.

pw-bot is a special keyword that is recognized by "patchwork bot".
"cr" stands for "changes requested".
It's a patch status in patchwork.
It means that the patch will not be applied as-is.
So it means that you have to make changes and resend.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH bpf-next v3 2/3] bpf, x64: Fix tailcall hierarchy
  2024-04-07 16:30       ` Alexei Starovoitov
@ 2024-04-10 14:09         ` Leon Hwang
  2024-04-11  3:42           ` Alexei Starovoitov
  0 siblings, 1 reply; 10+ messages in thread
From: Leon Hwang @ 2024-04-10 14:09 UTC (permalink / raw
  To: Alexei Starovoitov
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Fijalkowski, Maciej, Jakub Sitnicki, Pu Lehui, Hengqi Chen,
	kernel-patches-bot



On 2024/4/8 00:30, Alexei Starovoitov wrote:
> On Sun, Apr 7, 2024 at 4:34 AM Leon Hwang <hffilwlqm@gmail.com> wrote:
>>
>>
>>
>> On 2024/4/5 09:03, Alexei Starovoitov wrote:
>>>>   * Solution changes from percpu tail_call_cnt to tail_call_cnt at task_struct.
>>>
>>> Please remind us what was wrong with per-cpu approach?
>>
>> There are some cases that the per-cpu approach cannot handle properly.
>> Especialy, on non-SMP machine, if there are many bpf progs to run with
>> tail call simultaneously, MAX_TAIL_CALL_CNT limit is unable to limit the
>> tail call expectedly.
> 
> That's not a very helpful explanation.

I apologize for my poor communication skills. I hope I can help fix
this issue.

Why did I propose this approach of storing tcc in task_struct? While
looking for a better place to store tcc than a stack variable or a
per-cpu variable, I asked: why not store it in the runtime context?
Around a tail call, the tail caller and the tail callee run on the same
thread, and the thread won't be migrated because of migrate_disable(),
if I understand correctly. Consequently, it's better to store tcc in
the thread struct or in thread-local storage. In the kernel,
task_struct is the thread struct, if I understand correctly.
Therefore, when multiple progs tail_call on the same cpu, the per-task
tcc should limit them independently, e.g.

   prog1     prog2
  thread1   thread2
     |         |
     |--sched->|
     |         |
     |<-sched--|
     |         |
   ---------------
        CPU1

NOTE: prog1 is different from prog2, and each has tail calls to handle
while it is scheduled.

The tcc in thread2 would not override the tcc in thread1.

When the same scenario as in the above diagram happens with the per-cpu
tcc approach, the tcc in thread2 will override the tcc in thread1. As a
result, per-cpu tcc cannot limit the tail calls in thread1 and thread2
independently. This is my concern about the per-cpu tcc approach.


> Last, I recall, the concern was that tcc will be a bit off.
> The per-cpu tcc will limit recursion sooner when multiple progs
> tail_call-ing on the same cpu?

Yes.

> If so, I think it's a trade-off that should be discussed.
> tcc in task_struct will have the same issue.
> It will limit tailcalls sooner in some cases.

Could you explain one of them in detail?

> 
> There were some issues with overriding of per-cpu tcc.
> The same concerns apply to per-task tcc.
> 
>>>
>>> Also notice we have pseudo per-cpu bpf insns now,
>>> so things might be easier today.
>>
>> Great to hear that. With pseudo per-cpu bpf insns, it is able to get
>> tcc_ptr from task_struct without a function call.
>>

[SNIP]

>>>>                 if (tail_call_reachable && !is_subprog)
>>>> -                       /* When it's the entry of the whole tailcall context,
>>>> -                        * zeroing rax means initialising tail_call_cnt.
>>>> +                       /* Call bpf_tail_call_cnt_init to initilise
>>>> +                        * tail_call_cnt.
>>>>                          */
>>>> -                       EMIT2(0x31, 0xC0); /* xor eax, eax */
>>>> +                       emit_call(&prog, bpf_tail_call_cnt_init,
>>>> +                                 ip + (prog - start));
>>>
>>> You're repeating the same bug we discussed before.
>>> There is nothing in bpf_tail_call_cnt_init() that
>>> prevents the compiler from scratching rdi,rsi,...
>>> bpf_tail_call_cnt_init() is a normal function from compiler pov
>>> and it's allowed to use those regs.
>>> Must have been lucky that CI is not showing crashes.
>>
>> Oh, get it. In order to prevent the compiler from scratching
>> rdi,rsi,..., the asm clobbered register list in bpf_tail_call_cnt_init()
>> must be "rdi", "rsi", "rdx", "rcx", "r8". I learn it from the GCC doc[0].
>>
>> static __used void bpf_tail_call_cnt_init(void)
>> {
>>         /* In short:
>>          * current->bpf_tail_call_cnt = 0;
>>          */
>>
>>         asm volatile (
>>             "addq " __percpu_arg(0) ", %1\n\t"
>>             "movq (%1), %1\n\t"
>>             "movl $0x0, %c2(%1)\n\t"
>>             :
>>             : "m" (__my_cpu_var(this_cpu_off)), "r" (&pcpu_hot.current_task),
>>               "i" (offsetof(struct task_struct, bpf_tail_call_cnt))
>>             : "rdi", "rsi", "rdx", "rcx", "r8" /* to avoid scratching these regs */
> 
> That will only prevent the compiler from allocating these regs
> into "r" constraint, but the compiler can still use them outside of asm.
> You need __naked too.

Got it.

> 
>>         );
>> }
>>
>> [0]
>> https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html#Clobbers-and-Scratch-Registers-1
>>
>>>
>>>>                 else
>>>>                         /* Keep the same instruction layout. */
>>>> -                       EMIT2(0x66, 0x90); /* nop2 */
>>>> +                       emit_nops(&prog, X86_PATCH_SIZE);
>>>>         }
>>>>         /* Exception callback receives FP as third parameter */
>>>>         if (is_exception_cb) {
>>>> @@ -452,8 +459,6 @@ static void emit_prologue(u8 **pprog, u32 stack_depth, bool ebpf_from_cbpf,
>>>>         /* sub rsp, rounded_stack_depth */
>>>>         if (stack_depth)
>>>>                 EMIT3_off32(0x48, 0x81, 0xEC, round_up(stack_depth, 8));
>>>> -       if (tail_call_reachable)
>>>> -               EMIT1(0x50);         /* push rax */
>>>>         *pprog = prog;
>>>>  }
>>>>
>>>> @@ -589,13 +594,61 @@ static void emit_return(u8 **pprog, u8 *ip)
>>>>         *pprog = prog;
>>>>  }
>>>>

[SNIP]

>>>> +       emit_call(&prog, bpf_tail_call_cnt_ptr, ip + (prog - start));
>>>
>>> same issue.
>>
>> I will rewrite it to emit_bpf_tail_call_cnt_ptr(), which will use pseudo
>> per-cpu bpf insns to get tcc_ptr from task_struct.
>>
>> static void emit_bpf_tail_call_cnt_ptr(u8 **pprog)
>> {
>>         u8 *prog = *pprog;
>>
>>         /* In short:
>>          * return &current->bpf_tail_call_cnt;
>>          */
>>
>>         /* mov rax, &pcpu_hot.current_task */
>>         EMIT3_off32(0x48, 0xC7, 0xC0, ((u32)(unsigned
>> long)&pcpu_hot.current_task));
>>
>> #ifdef CONFIG_SMP
>>         /* add rax, gs:[&this_cpu_off] */
>>         EMIT1(0x65);
>>         EMIT4_off32(0x48, 0x03, 0x04, 0x25, ((u32)(unsigned long)&this_cpu_off));
>> #endif
>>
>>         /* mov rax, qword ptr [rax] */
>>         EMIT3(0x48, 0x8B, 0x00);
>>         /* add rax, offsetof(struct task_struct, bpf_tail_call_cnt) */
>>         EMIT2_off32(0x48, 0x05, ((u32)offsetof(struct task_struct,
>> bpf_tail_call_cnt)));
>>
>>         *pprog = prog;
>> }
> 
> I think it's cleaner to use __naked func with asm volatile
> and explicit 'rax'.

Yeah. With asm volatile, these two bpf_tail_call_cnt helpers look
similar, and it's easier to understand them together.

> 
> The suggestion to consider BPF_ADDR_PERCPU insn was in the context
> of generating it in the verifier, so that JITs don't need
> to do anything special with tcc.
> Like if the verifier emits tcc checks that JIT's
> emit_bpf_tail_call_[in]direct() will only deal with the actual call.
> That was a bit of an orthogonal optimization/cleanup idea.

Awesome! It's great to handle it in the verifier with the BPF_ADDR_PERCPU insn.

> 
>>
>>>
>>>> +       EMIT3(0x83, 0x38, MAX_TAIL_CALL_CNT);     /* cmp dword ptr [rax], MAX_TAIL_CALL_CNT */
>>>>
>>>>         offset = ctx->tail_call_indirect_label - (prog + 2 - start);
>>>>         EMIT2(X86_JAE, offset);                   /* jae out */
>>>> -       EMIT3(0x83, 0xC0, 0x01);                  /* add eax, 1 */
>>>> -       EMIT2_off32(0x89, 0x85, tcc_off);         /* mov dword ptr [rbp - tcc_off], eax */
>>>> +       EMIT2(0xFF, 0x00);                        /* inc dword ptr [rax] */
>>>>
>>>>         /* prog = array->ptrs[index]; */
>>>>         EMIT4_off32(0x48, 0x8B, 0x8C, 0xD6,       /* mov rcx, [rsi + rdx * 8 + offsetof(...)] */
>>>> @@ -663,7 +715,6 @@ static void emit_bpf_tail_call_indirect(struct bpf_prog *bpf_prog,
>>>>                         pop_r12(&prog);
>>>>         }
>>>>
>>>> -       EMIT1(0x58);                              /* pop rax */
>>>>         if (stack_depth)
>>>>                 EMIT3_off32(0x48, 0x81, 0xC4,     /* add rsp, sd */
>>>>                             round_up(stack_depth, 8));
>>>> @@ -691,21 +742,20 @@ static void emit_bpf_tail_call_direct(struct bpf_prog *bpf_prog,
>>>>                                       bool *callee_regs_used, u32 stack_depth,
>>>>                                       struct jit_context *ctx)
>>>>  {
>>>> -       int tcc_off = -4 - round_up(stack_depth, 8);
>>>>         u8 *prog = *pprog, *start = *pprog;
>>>>         int offset;
>>>>
>>>>         /*
>>>> -        * if (tail_call_cnt++ >= MAX_TAIL_CALL_CNT)
>>>> +        * if ((*tcc_ptr)++ >= MAX_TAIL_CALL_CNT)
>>>>          *      goto out;
>>>>          */
>>>> -       EMIT2_off32(0x8B, 0x85, tcc_off);             /* mov eax, dword ptr [rbp - tcc_off] */
>>>> -       EMIT3(0x83, 0xF8, MAX_TAIL_CALL_CNT);         /* cmp eax, MAX_TAIL_CALL_CNT */
>>>> +       /* call bpf_tail_call_cnt_ptr */
>>>> +       emit_call(&prog, bpf_tail_call_cnt_ptr, ip);
>>>
>>> and here as well.
>>
>> Replace with emit_bpf_tail_call_cnt_ptr(), too.
>>
>>>
>>> pw-bot: cr
>>
>> I would like to send next version with these update.
> 
> pw-bot is a special keyword that is recognized by "patchwork bot".
> "cr" stands for "changes requested".
> It's a patch status in patchwork.
> It means that the patch will not be applied as-is.
> So it means that you have to make changes and resend.

Thanks for your explanation. Thanks for your patience.


Thanks,
Leon

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH bpf-next v3 2/3] bpf, x64: Fix tailcall hierarchy
  2024-04-10 14:09         ` Leon Hwang
@ 2024-04-11  3:42           ` Alexei Starovoitov
  2024-04-14 11:47             ` Leon Hwang
  0 siblings, 1 reply; 10+ messages in thread
From: Alexei Starovoitov @ 2024-04-11  3:42 UTC (permalink / raw
  To: Leon Hwang
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Fijalkowski, Maciej, Jakub Sitnicki, Pu Lehui, Hengqi Chen,
	kernel-patches-bot

On Wed, Apr 10, 2024 at 7:09 AM Leon Hwang <hffilwlqm@gmail.com> wrote:
>
>
>
> On 2024/4/8 00:30, Alexei Starovoitov wrote:
> > On Sun, Apr 7, 2024 at 4:34 AM Leon Hwang <hffilwlqm@gmail.com> wrote:
> >>
> >>
> >>
> >> On 2024/4/5 09:03, Alexei Starovoitov wrote:
> >>>>   * Solution changes from percpu tail_call_cnt to tail_call_cnt at task_struct.
> >>>
> >>> Please remind us what was wrong with per-cpu approach?
> >>
> >> There are some cases that the per-cpu approach cannot handle properly.
> >> Especialy, on non-SMP machine, if there are many bpf progs to run with
> >> tail call simultaneously, MAX_TAIL_CALL_CNT limit is unable to limit the
> >> tail call expectedly.
> >
> > That's not a very helpful explanation.
>
> I apologize for my poor communication skill. I hope I can help to fix
> this issue.
>
> Why did I raise this approach, tcc in task_struct? When I tried to
> figure out a better position to store tcc instead as a stack variable or
> a per-cpu variable, why not store it in runtime context?
> Around a tail call, the tail caller and the tail callee run on the same
> thread, and the thread won't be migrated because of migrate_disable(),
> if I understand correctly. As a consequence, it's better to store tcc in
> thread struct or in thread local storage. In kernel, task_struct is the
> thread struct, if I understand correctly. Thereafter, when multiple
> progs tail_call-ing on the same cpu, the per-task tcc should limit them
> independently, e.g.
>
>    prog1     prog2
>   thread1   thread2
>      |         |
>      |--sched->|
>      |         |
>      |<-sched--|
>      |         |
>    ---------------
>         CPU1
>
> NOTE: prog1 is diff from prog2. And they have tail call to handle while
> they're scheduled.
>
> The tcc in thread2 would not override the tcc in thread1.
>
> When the same scenario as the above diagram shows happens to per-cpu tcc
> approach, the tcc in thread2 will override the tcc in thread1. As a
> result, per-cpu tcc cannot limit the tail call in thread1 and thread2
> independently. This is what I concern about per-cpu tcc approach.

The same issue exists with per-task tcc.
In the above example prog1 and prog2 can be in the same thread1.
Example: prog1 is a kprobe-prog and prog2 is fentry prog that attaches
to something that prog1 is going to call.
When prog2 starts it will overwrite tcc in task.
So same issue as with per-cpu tcc.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH bpf-next v3 2/3] bpf, x64: Fix tailcall hierarchy
  2024-04-11  3:42           ` Alexei Starovoitov
@ 2024-04-14 11:47             ` Leon Hwang
  0 siblings, 0 replies; 10+ messages in thread
From: Leon Hwang @ 2024-04-14 11:47 UTC (permalink / raw
  To: Alexei Starovoitov
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Fijalkowski, Maciej, Jakub Sitnicki, Pu Lehui, Hengqi Chen,
	kernel-patches-bot



On 2024/4/11 11:42, Alexei Starovoitov wrote:
> On Wed, Apr 10, 2024 at 7:09 AM Leon Hwang <hffilwlqm@gmail.com> wrote:
>>
>> Why did I raise this approach, tcc in task_struct? When I tried to
>> figure out a better position to store tcc instead as a stack variable or
>> a per-cpu variable, why not store it in runtime context?
>> Around a tail call, the tail caller and the tail callee run on the same
>> thread, and the thread won't be migrated because of migrate_disable(),
>> if I understand correctly. As a consequence, it's better to store tcc in
>> thread struct or in thread local storage. In kernel, task_struct is the
>> thread struct, if I understand correctly. Thereafter, when multiple
>> progs tail_call-ing on the same cpu, the per-task tcc should limit them
>> independently, e.g.
>>
>>    prog1     prog2
>>   thread1   thread2
>>      |         |
>>      |--sched->|
>>      |         |
>>      |<-sched--|
>>      |         |
>>    ---------------
>>         CPU1
>>
>> NOTE: prog1 is diff from prog2. And they have tail call to handle while
>> they're scheduled.
>>
>> The tcc in thread2 would not override the tcc in thread1.
>>
>> When the same scenario as the above diagram shows happens to per-cpu tcc
>> approach, the tcc in thread2 will override the tcc in thread1. As a
>> result, per-cpu tcc cannot limit the tail call in thread1 and thread2
>> independently. This is what I concern about per-cpu tcc approach.
> 
> The same issue exists with per-task tcc.
> In the above example prog1 and prog2 can be in the same thread1.
> Example: prog1 is a kprobe-prog and prog2 is fentry prog that attaches
> to something that prog1 is going to call.
> When prog2 starts it will overwrite tcc in task.
> So same issue as with per-cpu tcc.

Oh, that's a horrible case for the per-cpu/per-task approach.

It pushes me back to the tcc_ptr-propagating approach; even though it
is not as elegant as the per-cpu approach, it works.

It stores tcc on the stack of the dispatcher function, like:

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 5034c1b4ded7b..c53e81102c150 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1225,7 +1225,7 @@ struct bpf_dispatcher {
 #define __bpfcall __nocfi
 #endif

-static __always_inline __bpfcall unsigned int bpf_dispatcher_nop_func(
+static notrace __used __bpfcall unsigned int bpf_dispatcher_nop_func(
 	const void *ctx,
 	const struct bpf_insn *insnsi,
 	bpf_func_t bpf_func)
@@ -1233,6 +1233,25 @@ static __always_inline __bpfcall unsigned int bpf_dispatcher_nop_func(
 	return bpf_func(ctx, insnsi);
 }

+struct bpf_tail_call_run_ctx {
+	const void *ctx;
+	u32 *tail_call_cnt;
+};
+
+static notrace __used __bpfcall unsigned int bpf_dispatcher_tail_call_func(
+	const void *ctx,
+	const struct bpf_insn *insnsi,
+	bpf_func_t bpf_func)
+{
+	struct bpf_tail_call_run_ctx run_ctx = {};
+	u32 tail_call_cnt = 0;
+
+	run_ctx.ctx = ctx;
+	run_ctx.tail_call_cnt = &tail_call_cnt;
+
+	return bpf_func(&run_ctx, insnsi);
+}
+
 /* the implementation of the opaque uapi struct bpf_dynptr */
 struct bpf_dynptr_kern {
 	void *data;


Then, it passes the original ctx together with tcc_ptr via
bpf_tail_call_run_ctx, reusing the original ctx argument position.

And the prologue loads tcc_ptr and recovers the original ctx, like:

diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index 2b5a475c4dd0d..a8ef1dbf141cc 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -273,7 +273,7 @@ struct jit_context {
 /* Number of bytes emit_patch() needs to generate instructions */
 #define X86_PATCH_SIZE		5
 /* Number of bytes that will be skipped on tailcall */
-#define X86_TAIL_CALL_OFFSET	(11 + ENDBR_INSN_SIZE)
+#define X86_TAIL_CALL_OFFSET	(16 + ENDBR_INSN_SIZE)

 static void push_r12(u8 **pprog)
 {
@@ -420,14 +420,16 @@ static void emit_prologue(u8 **pprog, u32 stack_depth, bool ebpf_from_cbpf,
 	 */
 	emit_nops(&prog, X86_PATCH_SIZE);
 	if (!ebpf_from_cbpf) {
-		if (tail_call_reachable && !is_subprog)
-			/* When it's the entry of the whole tailcall context,
-			 * zeroing rax means initialising tail_call_cnt.
-			 */
-			EMIT2(0x31, 0xC0); /* xor eax, eax */
-		else
+		if (tail_call_reachable && !is_subprog) {
+			/* Make rax as tcc_ptr. */
+			/* mov rax, qword ptr [rdi + 8] */
+			EMIT4(0x48, 0x8B, 0x47, 0x08);
+			/* Recover the original ctx. */
+			EMIT3(0x48, 0x8B, 0x3F); /* mov rdi, qword ptr [rdi] */
+		} else {
 			/* Keep the same instruction layout. */
-			EMIT2(0x66, 0x90); /* nop2 */
+			emit_nops(&prog, 7);
+		}
 	}


Thereafter, it propagates tcc_ptr via rax and the stack.

But when does it use bpf_dispatcher_tail_call_func()?

The bpf prog's dispatcher function is stored in prog->aux by
bpf_prog_select_runtime() at prog load time.

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 5034c1b4ded7b..c53e81102c150 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1425,6 +1444,10 @@ struct btf_mod_pair {

 struct bpf_kfunc_desc_tab;

+typedef unsigned int (*bpf_dispatcher_func)(const void *ctx,
+					    const struct bpf_insn *insnsi,
+					    bpf_func_t bpf_func);
+
 struct bpf_prog_aux {
 	atomic64_t refcnt;
 	u32 used_map_cnt;
@@ -1485,6 +1508,7 @@ struct bpf_prog_aux {
 	struct bpf_map *cgroup_storage[MAX_BPF_CGROUP_STORAGE_TYPE];
 	char name[BPF_OBJ_NAME_LEN];
 	u64 (*bpf_exception_cb)(u64 cookie, u64 sp, u64 bp, u64, u64);
+	bpf_dispatcher_func dfunc;
 #ifdef CONFIG_SECURITY
 	void *security;
 #endif

diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index a41718eaeefe7..00cd48eb70de0 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -2368,6 +2368,19 @@ static void bpf_prog_select_func(struct bpf_prog *fp)
 #endif
 }

+static void bpf_prog_select_dispatcher_func(struct bpf_prog *fp)
+{
+	if (fp->aux->tail_call_reachable && fp->jited &&
+	    bpf_jit_supports_tail_call_cnt_ptr()) {
+		fp->aux->dfunc = bpf_dispatcher_tail_call_func;
+		return;
+	}
+
+	fp->aux->dfunc = fp->type == BPF_PROG_TYPE_XDP ?
+			 BPF_DISPATCHER_FUNC(xdp) :
+			 bpf_dispatcher_nop_func;
+}
+
 /**
  *	bpf_prog_select_runtime - select exec runtime for BPF program
  *	@fp: bpf_prog populated with BPF program
@@ -2429,6 +2442,10 @@ struct bpf_prog *bpf_prog_select_runtime(struct bpf_prog *fp, int *err)
 	 * all eBPF JITs might immediately support all features.
 	 */
 	*err = bpf_check_tail_call(fp);
+	if (*err)
+		return fp;
+
+	bpf_prog_select_dispatcher_func(fp);

 	return fp;
 }


Yeah, here it adds bpf_jit_supports_tail_call_cnt_ptr() to determine
whether the arch JIT supports tcc_ptr.
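
It can follow the existing bpf_jit_supports_*() convention — roughly
this shape (a sketch, assuming a __weak default in kernel/bpf/core.c
that the x64 JIT overrides):

/* kernel/bpf/core.c: default for JITs that don't pass tcc_ptr */
bool __weak bpf_jit_supports_tail_call_cnt_ptr(void)
{
	return false;
}

/* arch/x86/net/bpf_jit_comp.c: the x64 JIT propagates tcc_ptr */
bool bpf_jit_supports_tail_call_cnt_ptr(void)
{
	return true;
}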

Finally, when running a bpf prog, the dispatcher function stored in
prog->aux is used to run it.

diff --git a/include/linux/filter.h b/include/linux/filter.h
index 7a27f19bf44d0..b0278305bdc51 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -662,14 +662,9 @@ extern int (*nfct_btf_struct_access)(struct bpf_verifier_log *log,
 				     const struct bpf_reg_state *reg,
 				     int off, int size);

-typedef unsigned int (*bpf_dispatcher_fn)(const void *ctx,
-					  const struct bpf_insn *insnsi,
-					  unsigned int (*bpf_func)(const void *,
-								   const struct bpf_insn *));
-
 static __always_inline u32 __bpf_prog_run(const struct bpf_prog *prog,
 					  const void *ctx,
-					  bpf_dispatcher_fn dfunc)
+					  bpf_dispatcher_func dfunc)
 {
 	u32 ret;

@@ -695,7 +690,7 @@ static __always_inline u32 __bpf_prog_run(const struct bpf_prog *prog,

 static __always_inline u32 bpf_prog_run(const struct bpf_prog *prog, const void *ctx)
 {
-	return __bpf_prog_run(prog, ctx, bpf_dispatcher_nop_func);
+	return __bpf_prog_run(prog, ctx, prog->aux->dfunc);
 }


With these changes, the PoC passes all selftests[0] on x86_64.

[0] https://github.com/kernel-patches/bpf/pull/6794/checks
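
For reference, the hierarchy testcases boil down to a shape roughly
like the following (a hand-written sketch, not the actual selftest
source): the entry prog counts its runs and calls a subprog twice, and
the subprog tail-calls back to the entry prog through a prog_array.

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
	__uint(type, BPF_MAP_TYPE_PROG_ARRAY);
	__uint(max_entries, 1);
	__uint(key_size, sizeof(__u32));
	__uint(value_size, sizeof(__u32));
} jmp_table SEC(".maps");

int count = 0;

static __noinline int subprog_tail(struct __sk_buff *skb)
{
	/* slot 0 is expected to hold the entry prog */
	bpf_tail_call_static(skb, &jmp_table, 0);
	return 0;
}

SEC("tc")
int entry(struct __sk_buff *skb)
{
	count++;
	subprog_tail(skb);
	subprog_tail(skb);

	return 1;
}

char _license[] SEC("license") = "GPL";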

Besides these changes, there are some details this approach still has
to handle.

I would like to send this approach as the next version of the patchset.

Thanks,
Leon

^ permalink raw reply related	[flat|nested] 10+ messages in thread
