BPF Archive mirror
 help / color / mirror / Atom feed
* [PATCH bpf-next v2 0/2] riscv, bpf: Support per-CPU insn and inline bpf_get_smp_processor_id()
@ 2024-04-30 17:58 Puranjay Mohan
  2024-04-30 17:58 ` [PATCH bpf-next v2 1/2] riscv, bpf: add internal-only MOV instruction to resolve per-CPU addrs Puranjay Mohan
  2024-04-30 17:58 ` [PATCH bpf-next v2 2/2] riscv, bpf: inline bpf_get_smp_processor_id() Puranjay Mohan
  0 siblings, 2 replies; 11+ messages in thread
From: Puranjay Mohan @ 2024-04-30 17:58 UTC (permalink / raw
  To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Eduard Zingerman, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Björn Töpel, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	bpf, linux-riscv, linux-kernel, Pu Lehui
  Cc: puranjay12

Changes in v1->v2:
v1: https://lore.kernel.org/all/20240405124348.27644-1-puranjay12@gmail.com/
- Use emit_addr() in place of emit_imm() in the first patch.
- Add the second patch to inline bpf_get_smp_processor_id()

The first patch add the support for the internal-only MOV instruction to
resolve per-CPU addrs to absolute addresses.
RISC-V uses generic per-cpu implementation where the offsets for CPUs are
kept in an array called __per_cpu_offset[cpu_number]. RISCV stores the
address of the task_struct in TP register. The first element in task_struct
is struct thread_info, and we can get the cpu number by reading from the TP
register + offsetof(struct thread_info, cpu).

The second patch inlines the calls bpf_get_smp_processor_id() in the JIT
by emitting a load from the TP + offsetof(struct thread_info, cpu).

Benchmark using [1] on Qemu.

  ./benchs/run_bench_trigger.sh glob-arr-inc arr-inc hash-inc

  +---------------+------------------+------------------+--------------+
  |      Name     |     Before       |       After      |   % change   |
  |---------------+------------------+------------------+--------------|
  | glob-arr-inc  | 1.077 ± 0.006M/s | 1.336 ± 0.010M/s |   + 24.04%   |
  | arr-inc       | 1.078 ± 0.002M/s | 1.332 ± 0.015M/s |   + 23.56%   |
  | hash-inc      | 0.494 ± 0.004M/s | 0.653 ± 0.001M/s |   + 32.18%   |
  +---------------+------------------+------------------+--------------+

[1] https://github.com/anakryiko/linux/commit/8dec900975ef

Puranjay Mohan (2):
  riscv, bpf: add internal-only MOV instruction to resolve per-CPU addrs
  riscv, bpf: inline bpf_get_smp_processor_id()

 arch/riscv/net/bpf_jit_comp64.c | 50 +++++++++++++++++++++++++++++++++
 include/linux/filter.h          |  1 +
 kernel/bpf/core.c               | 11 ++++++++
 kernel/bpf/verifier.c           |  2 ++
 4 files changed, 64 insertions(+)

-- 
2.40.1


^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH bpf-next v2 1/2] riscv, bpf: add internal-only MOV instruction to resolve per-CPU addrs
  2024-04-30 17:58 [PATCH bpf-next v2 0/2] riscv, bpf: Support per-CPU insn and inline bpf_get_smp_processor_id() Puranjay Mohan
@ 2024-04-30 17:58 ` Puranjay Mohan
  2024-05-01 16:39   ` Andrii Nakryiko
  2024-04-30 17:58 ` [PATCH bpf-next v2 2/2] riscv, bpf: inline bpf_get_smp_processor_id() Puranjay Mohan
  1 sibling, 1 reply; 11+ messages in thread
From: Puranjay Mohan @ 2024-04-30 17:58 UTC (permalink / raw
  To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Eduard Zingerman, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Björn Töpel, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	bpf, linux-riscv, linux-kernel, Pu Lehui
  Cc: puranjay12

Support an instruction for resolving absolute addresses of per-CPU
data from their per-CPU offsets. This instruction is internal-only and
users are not allowed to use them directly. They will only be used for
internal inlining optimizations for now between BPF verifier and BPF
JITs.

RISC-V uses generic per-cpu implementation where the offsets for CPUs
are kept in an array called __per_cpu_offset[cpu_number]. RISCV stores
the address of the task_struct in TP register. The first element in
task_struct is struct thread_info, and we can get the cpu number by
reading from the TP register + offsetof(struct thread_info, cpu).

Once we have the cpu number in a register we read the offset for that
cpu from address: &__per_cpu_offset + cpu_number << 3. Then we add this
offset to the destination register.

To measure the improvement from this change, the benchmark in [1] was
used on Qemu:

Before:
glob-arr-inc   :    1.127 ± 0.013M/s
arr-inc        :    1.121 ± 0.004M/s
hash-inc       :    0.681 ± 0.052M/s

After:
glob-arr-inc   :    1.138 ± 0.011M/s
arr-inc        :    1.366 ± 0.006M/s
hash-inc       :    0.676 ± 0.001M/s

[1] https://github.com/anakryiko/linux/commit/8dec900975ef

Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
---
 arch/riscv/net/bpf_jit_comp64.c | 24 ++++++++++++++++++++++++
 1 file changed, 24 insertions(+)

diff --git a/arch/riscv/net/bpf_jit_comp64.c b/arch/riscv/net/bpf_jit_comp64.c
index 15e482f2c657..99d7006f1420 100644
--- a/arch/riscv/net/bpf_jit_comp64.c
+++ b/arch/riscv/net/bpf_jit_comp64.c
@@ -12,6 +12,7 @@
 #include <linux/stop_machine.h>
 #include <asm/patch.h>
 #include <asm/cfi.h>
+#include <asm/percpu.h>
 #include "bpf_jit.h"
 
 #define RV_FENTRY_NINSNS 2
@@ -1089,6 +1090,24 @@ int bpf_jit_emit_insn(const struct bpf_insn *insn, struct rv_jit_context *ctx,
 			emit_or(RV_REG_T1, rd, RV_REG_T1, ctx);
 			emit_mv(rd, RV_REG_T1, ctx);
 			break;
+		} else if (insn_is_mov_percpu_addr(insn)) {
+			if (rd != rs)
+				emit_mv(rd, rs, ctx);
+#ifdef CONFIG_SMP
+				/* Load current CPU number in T1 */
+				emit_ld(RV_REG_T1, offsetof(struct thread_info, cpu),
+					RV_REG_TP, ctx);
+				/* << 3 because offsets are 8 bytes */
+				emit_slli(RV_REG_T1, RV_REG_T1, 3, ctx);
+				/* Load address of __per_cpu_offset array in T2 */
+				emit_addr(RV_REG_T2, (u64)&__per_cpu_offset, extra_pass, ctx);
+				/* Add offset of current CPU to  __per_cpu_offset */
+				emit_add(RV_REG_T1, RV_REG_T2, RV_REG_T1, ctx);
+				/* Load __per_cpu_offset[cpu] in T1 */
+				emit_ld(RV_REG_T1, 0, RV_REG_T1, ctx);
+				/* Add the offset to Rd */
+				emit_add(rd, rd, RV_REG_T1, ctx);
+#endif
 		}
 		if (imm == 1) {
 			/* Special mov32 for zext */
@@ -2038,3 +2057,8 @@ bool bpf_jit_supports_arena(void)
 {
 	return true;
 }
+
+bool bpf_jit_supports_percpu_insn(void)
+{
+	return true;
+}
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH bpf-next v2 2/2] riscv, bpf: inline bpf_get_smp_processor_id()
  2024-04-30 17:58 [PATCH bpf-next v2 0/2] riscv, bpf: Support per-CPU insn and inline bpf_get_smp_processor_id() Puranjay Mohan
  2024-04-30 17:58 ` [PATCH bpf-next v2 1/2] riscv, bpf: add internal-only MOV instruction to resolve per-CPU addrs Puranjay Mohan
@ 2024-04-30 17:58 ` Puranjay Mohan
  2024-04-30 19:18   ` Kumar Kartikeya Dwivedi
                     ` (2 more replies)
  1 sibling, 3 replies; 11+ messages in thread
From: Puranjay Mohan @ 2024-04-30 17:58 UTC (permalink / raw
  To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Eduard Zingerman, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Björn Töpel, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	bpf, linux-riscv, linux-kernel, Pu Lehui
  Cc: puranjay12

Inline the calls to bpf_get_smp_processor_id() in the riscv bpf jit.

RISCV saves the pointer to the CPU's task_struct in the TP (thread
pointer) register. This makes it trivial to get the CPU's processor id.
As thread_info is the first member of task_struct, we can read the
processor id from TP + offsetof(struct thread_info, cpu).

          RISCV64 JIT output for `call bpf_get_smp_processor_id`
	  ======================================================

                Before                           After
               --------                         -------

         auipc   t1,0x848c                  ld    a5,32(tp)
         jalr    604(t1)
         mv      a5,a0

Benchmark using [1] on Qemu.

./benchs/run_bench_trigger.sh glob-arr-inc arr-inc hash-inc

+---------------+------------------+------------------+--------------+
|      Name     |     Before       |       After      |   % change   |
|---------------+------------------+------------------+--------------|
| glob-arr-inc  | 1.077 ± 0.006M/s | 1.336 ± 0.010M/s |   + 24.04%   |
| arr-inc       | 1.078 ± 0.002M/s | 1.332 ± 0.015M/s |   + 23.56%   |
| hash-inc      | 0.494 ± 0.004M/s | 0.653 ± 0.001M/s |   + 32.18%   |
+---------------+------------------+------------------+--------------+

NOTE: This benchmark includes changes from this patch and the previous
      patch that implemented the per-cpu insn.

[1] https://github.com/anakryiko/linux/commit/8dec900975ef

Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
---
 arch/riscv/net/bpf_jit_comp64.c | 26 ++++++++++++++++++++++++++
 include/linux/filter.h          |  1 +
 kernel/bpf/core.c               | 11 +++++++++++
 kernel/bpf/verifier.c           |  2 ++
 4 files changed, 40 insertions(+)

diff --git a/arch/riscv/net/bpf_jit_comp64.c b/arch/riscv/net/bpf_jit_comp64.c
index 99d7006f1420..5789b7afae47 100644
--- a/arch/riscv/net/bpf_jit_comp64.c
+++ b/arch/riscv/net/bpf_jit_comp64.c
@@ -1493,6 +1493,22 @@ int bpf_jit_emit_insn(const struct bpf_insn *insn, struct rv_jit_context *ctx,
 		bool fixed_addr;
 		u64 addr;
 
+		/* Inline calls to bpf_get_smp_processor_id()
+		 *
+		 * RV_REG_TP holds the address of the current CPU's task_struct and thread_info is
+		 * at offset 0 in task_struct.
+		 * Load cpu from thread_info:
+		 *     Set R0 to ((struct thread_info *)(RV_REG_TP))->cpu
+		 *
+		 * This replicates the implementation of raw_smp_processor_id() on RISCV
+		 */
+		if (insn->src_reg == 0 && insn->imm == BPF_FUNC_get_smp_processor_id) {
+			/* Load current CPU number in R0 */
+			emit_ld(bpf_to_rv_reg(BPF_REG_0, ctx), offsetof(struct thread_info, cpu),
+				RV_REG_TP, ctx);
+			break;
+		}
+
 		mark_call(ctx);
 		ret = bpf_jit_get_func_addr(ctx->prog, insn, extra_pass,
 					    &addr, &fixed_addr);
@@ -2062,3 +2078,13 @@ bool bpf_jit_supports_percpu_insn(void)
 {
 	return true;
 }
+
+bool bpf_jit_inlines_helper_call(s32 imm)
+{
+	switch (imm) {
+	case BPF_FUNC_get_smp_processor_id:
+		return true;
+	}
+
+	return false;
+}
diff --git a/include/linux/filter.h b/include/linux/filter.h
index 7a27f19bf44d..3e19bb62ed1a 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -993,6 +993,7 @@ u64 __bpf_call_base(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5);
 struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog);
 void bpf_jit_compile(struct bpf_prog *prog);
 bool bpf_jit_needs_zext(void);
+bool bpf_jit_inlines_helper_call(s32 imm);
 bool bpf_jit_supports_subprog_tailcalls(void);
 bool bpf_jit_supports_percpu_insn(void);
 bool bpf_jit_supports_kfunc_call(void);
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 99b8b1c9a248..aa59af9f9bd9 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -2941,6 +2941,17 @@ bool __weak bpf_jit_needs_zext(void)
 	return false;
 }
 
+/* Return true if the JIT inlines the call to the helper corresponding to
+ * the imm.
+ *
+ * The verifier will not patch the insn->imm for the call to the helper if
+ * this returns true.
+ */
+bool __weak bpf_jit_inlines_helper_call(s32 imm)
+{
+	return false;
+}
+
 /* Return TRUE if the JIT backend supports mixing bpf2bpf and tailcalls. */
 bool __weak bpf_jit_supports_subprog_tailcalls(void)
 {
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 5d42db05315e..e78f766d7f91 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -20013,6 +20013,8 @@ static int do_misc_fixups(struct bpf_verifier_env *env)
 			goto next_insn;
 		}
 
+		if (bpf_jit_inlines_helper_call(insn->imm))
+			goto next_insn;
 		if (insn->imm == BPF_FUNC_get_route_realm)
 			prog->dst_needed = 1;
 		if (insn->imm == BPF_FUNC_get_prandom_u32)
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH bpf-next v2 2/2] riscv, bpf: inline bpf_get_smp_processor_id()
  2024-04-30 17:58 ` [PATCH bpf-next v2 2/2] riscv, bpf: inline bpf_get_smp_processor_id() Puranjay Mohan
@ 2024-04-30 19:18   ` Kumar Kartikeya Dwivedi
  2024-05-01 16:46   ` Andrii Nakryiko
  2024-05-02 16:19   ` Björn Töpel
  2 siblings, 0 replies; 11+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2024-04-30 19:18 UTC (permalink / raw
  To: Puranjay Mohan
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Eduard Zingerman, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Björn Töpel, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	bpf, linux-riscv, linux-kernel, Pu Lehui, puranjay12

On Tue, 30 Apr 2024 at 20:00, Puranjay Mohan <puranjay@kernel.org> wrote:
>
> Inline the calls to bpf_get_smp_processor_id() in the riscv bpf jit.
>
> RISCV saves the pointer to the CPU's task_struct in the TP (thread
> pointer) register. This makes it trivial to get the CPU's processor id.
> As thread_info is the first member of task_struct, we can read the
> processor id from TP + offsetof(struct thread_info, cpu).
>
>           RISCV64 JIT output for `call bpf_get_smp_processor_id`
>           ======================================================
>
>                 Before                           After
>                --------                         -------
>
>          auipc   t1,0x848c                  ld    a5,32(tp)
>          jalr    604(t1)
>          mv      a5,a0
>
> Benchmark using [1] on Qemu.
>
> ./benchs/run_bench_trigger.sh glob-arr-inc arr-inc hash-inc
>
> +---------------+------------------+------------------+--------------+
> |      Name     |     Before       |       After      |   % change   |
> |---------------+------------------+------------------+--------------|
> | glob-arr-inc  | 1.077 ± 0.006M/s | 1.336 ± 0.010M/s |   + 24.04%   |
> | arr-inc       | 1.078 ± 0.002M/s | 1.332 ± 0.015M/s |   + 23.56%   |
> | hash-inc      | 0.494 ± 0.004M/s | 0.653 ± 0.001M/s |   + 32.18%   |
> +---------------+------------------+------------------+--------------+
>
> NOTE: This benchmark includes changes from this patch and the previous
>       patch that implemented the per-cpu insn.
>
> [1] https://github.com/anakryiko/linux/commit/8dec900975ef
>
> Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
> ---

For non-riscv bits (& fwiw):

Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH bpf-next v2 1/2] riscv, bpf: add internal-only MOV instruction to resolve per-CPU addrs
  2024-04-30 17:58 ` [PATCH bpf-next v2 1/2] riscv, bpf: add internal-only MOV instruction to resolve per-CPU addrs Puranjay Mohan
@ 2024-05-01 16:39   ` Andrii Nakryiko
  2024-05-02 16:18     ` Björn Töpel
  0 siblings, 1 reply; 11+ messages in thread
From: Andrii Nakryiko @ 2024-05-01 16:39 UTC (permalink / raw
  To: Puranjay Mohan
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Eduard Zingerman, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Björn Töpel, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	bpf, linux-riscv, linux-kernel, Pu Lehui, puranjay12

On Tue, Apr 30, 2024 at 10:58 AM Puranjay Mohan <puranjay@kernel.org> wrote:
>
> Support an instruction for resolving absolute addresses of per-CPU
> data from their per-CPU offsets. This instruction is internal-only and
> users are not allowed to use them directly. They will only be used for
> internal inlining optimizations for now between BPF verifier and BPF
> JITs.
>
> RISC-V uses generic per-cpu implementation where the offsets for CPUs
> are kept in an array called __per_cpu_offset[cpu_number]. RISCV stores
> the address of the task_struct in TP register. The first element in
> task_struct is struct thread_info, and we can get the cpu number by
> reading from the TP register + offsetof(struct thread_info, cpu).
>
> Once we have the cpu number in a register we read the offset for that
> cpu from address: &__per_cpu_offset + cpu_number << 3. Then we add this
> offset to the destination register.
>
> To measure the improvement from this change, the benchmark in [1] was
> used on Qemu:
>
> Before:
> glob-arr-inc   :    1.127 ± 0.013M/s
> arr-inc        :    1.121 ± 0.004M/s
> hash-inc       :    0.681 ± 0.052M/s
>
> After:
> glob-arr-inc   :    1.138 ± 0.011M/s
> arr-inc        :    1.366 ± 0.006M/s
> hash-inc       :    0.676 ± 0.001M/s
>
> [1] https://github.com/anakryiko/linux/commit/8dec900975ef
>
> Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
> ---
>  arch/riscv/net/bpf_jit_comp64.c | 24 ++++++++++++++++++++++++
>  1 file changed, 24 insertions(+)
>
> diff --git a/arch/riscv/net/bpf_jit_comp64.c b/arch/riscv/net/bpf_jit_comp64.c
> index 15e482f2c657..99d7006f1420 100644
> --- a/arch/riscv/net/bpf_jit_comp64.c
> +++ b/arch/riscv/net/bpf_jit_comp64.c
> @@ -12,6 +12,7 @@
>  #include <linux/stop_machine.h>
>  #include <asm/patch.h>
>  #include <asm/cfi.h>
> +#include <asm/percpu.h>
>  #include "bpf_jit.h"
>
>  #define RV_FENTRY_NINSNS 2
> @@ -1089,6 +1090,24 @@ int bpf_jit_emit_insn(const struct bpf_insn *insn, struct rv_jit_context *ctx,
>                         emit_or(RV_REG_T1, rd, RV_REG_T1, ctx);
>                         emit_mv(rd, RV_REG_T1, ctx);
>                         break;
> +               } else if (insn_is_mov_percpu_addr(insn)) {
> +                       if (rd != rs)
> +                               emit_mv(rd, rs, ctx);
> +#ifdef CONFIG_SMP
> +                               /* Load current CPU number in T1 */
> +                               emit_ld(RV_REG_T1, offsetof(struct thread_info, cpu),
> +                                       RV_REG_TP, ctx);
> +                               /* << 3 because offsets are 8 bytes */
> +                               emit_slli(RV_REG_T1, RV_REG_T1, 3, ctx);
> +                               /* Load address of __per_cpu_offset array in T2 */
> +                               emit_addr(RV_REG_T2, (u64)&__per_cpu_offset, extra_pass, ctx);
> +                               /* Add offset of current CPU to  __per_cpu_offset */
> +                               emit_add(RV_REG_T1, RV_REG_T2, RV_REG_T1, ctx);
> +                               /* Load __per_cpu_offset[cpu] in T1 */
> +                               emit_ld(RV_REG_T1, 0, RV_REG_T1, ctx);
> +                               /* Add the offset to Rd */
> +                               emit_add(rd, rd, RV_REG_T1, ctx);

is this the right level of code indentation?

> +#endif
>                 }
>                 if (imm == 1) {
>                         /* Special mov32 for zext */
> @@ -2038,3 +2057,8 @@ bool bpf_jit_supports_arena(void)
>  {
>         return true;
>  }
> +
> +bool bpf_jit_supports_percpu_insn(void)
> +{
> +       return true;
> +}
> --
> 2.40.1
>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH bpf-next v2 2/2] riscv, bpf: inline bpf_get_smp_processor_id()
  2024-04-30 17:58 ` [PATCH bpf-next v2 2/2] riscv, bpf: inline bpf_get_smp_processor_id() Puranjay Mohan
  2024-04-30 19:18   ` Kumar Kartikeya Dwivedi
@ 2024-05-01 16:46   ` Andrii Nakryiko
  2024-05-02 13:16     ` Puranjay Mohan
  2024-05-02 16:19   ` Björn Töpel
  2 siblings, 1 reply; 11+ messages in thread
From: Andrii Nakryiko @ 2024-05-01 16:46 UTC (permalink / raw
  To: Puranjay Mohan
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Eduard Zingerman, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Björn Töpel, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	bpf, linux-riscv, linux-kernel, Pu Lehui, puranjay12

On Tue, Apr 30, 2024 at 10:59 AM Puranjay Mohan <puranjay@kernel.org> wrote:
>
> Inline the calls to bpf_get_smp_processor_id() in the riscv bpf jit.
>
> RISCV saves the pointer to the CPU's task_struct in the TP (thread
> pointer) register. This makes it trivial to get the CPU's processor id.
> As thread_info is the first member of task_struct, we can read the
> processor id from TP + offsetof(struct thread_info, cpu).
>
>           RISCV64 JIT output for `call bpf_get_smp_processor_id`
>           ======================================================
>
>                 Before                           After
>                --------                         -------
>
>          auipc   t1,0x848c                  ld    a5,32(tp)
>          jalr    604(t1)
>          mv      a5,a0
>

Nice, great find! Would you be able to do similar inlining for x86-64
as well? Disassembling bpf_get_smp_processor_id for x86-64 shows this:

Dump of assembler code for function bpf_get_smp_processor_id:
   0xffffffff810f91a0 <+0>:     0f 1f 44 00 00  nopl   0x0(%rax,%rax,1)
   0xffffffff810f91a5 <+5>:     65 8b 05 60 79 f3 7e    mov
%gs:0x7ef37960(%rip),%eax        # 0x30b0c <pcpu_hot+12>
   0xffffffff810f91ac <+12>:    48 98   cltq
   0xffffffff810f91ae <+14>:    c3      ret
End of assembler dump.

We should be able to do the same in x86-64 BPF JIT. (it's actually how
I started initially, I had a dedicated instruction reading per-cpu
memory, but ended up with more general "calculate per-cpu address").

Anyways, great work, a small nit below.

Acked-by: Andrii Nakryiko <andrii@kernel.org>

> Benchmark using [1] on Qemu.
>
> ./benchs/run_bench_trigger.sh glob-arr-inc arr-inc hash-inc
>
> +---------------+------------------+------------------+--------------+
> |      Name     |     Before       |       After      |   % change   |
> |---------------+------------------+------------------+--------------|
> | glob-arr-inc  | 1.077 ± 0.006M/s | 1.336 ± 0.010M/s |   + 24.04%   |
> | arr-inc       | 1.078 ± 0.002M/s | 1.332 ± 0.015M/s |   + 23.56%   |
> | hash-inc      | 0.494 ± 0.004M/s | 0.653 ± 0.001M/s |   + 32.18%   |
> +---------------+------------------+------------------+--------------+
>
> NOTE: This benchmark includes changes from this patch and the previous
>       patch that implemented the per-cpu insn.
>
> [1] https://github.com/anakryiko/linux/commit/8dec900975ef
>
> Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
> ---
>  arch/riscv/net/bpf_jit_comp64.c | 26 ++++++++++++++++++++++++++
>  include/linux/filter.h          |  1 +
>  kernel/bpf/core.c               | 11 +++++++++++
>  kernel/bpf/verifier.c           |  2 ++
>  4 files changed, 40 insertions(+)
>
> diff --git a/arch/riscv/net/bpf_jit_comp64.c b/arch/riscv/net/bpf_jit_comp64.c
> index 99d7006f1420..5789b7afae47 100644
> --- a/arch/riscv/net/bpf_jit_comp64.c
> +++ b/arch/riscv/net/bpf_jit_comp64.c
> @@ -1493,6 +1493,22 @@ int bpf_jit_emit_insn(const struct bpf_insn *insn, struct rv_jit_context *ctx,
>                 bool fixed_addr;
>                 u64 addr;
>
> +               /* Inline calls to bpf_get_smp_processor_id()
> +                *
> +                * RV_REG_TP holds the address of the current CPU's task_struct and thread_info is
> +                * at offset 0 in task_struct.
> +                * Load cpu from thread_info:
> +                *     Set R0 to ((struct thread_info *)(RV_REG_TP))->cpu
> +                *
> +                * This replicates the implementation of raw_smp_processor_id() on RISCV
> +                */
> +               if (insn->src_reg == 0 && insn->imm == BPF_FUNC_get_smp_processor_id) {
> +                       /* Load current CPU number in R0 */
> +                       emit_ld(bpf_to_rv_reg(BPF_REG_0, ctx), offsetof(struct thread_info, cpu),
> +                               RV_REG_TP, ctx);
> +                       break;
> +               }
> +
>                 mark_call(ctx);
>                 ret = bpf_jit_get_func_addr(ctx->prog, insn, extra_pass,
>                                             &addr, &fixed_addr);
> @@ -2062,3 +2078,13 @@ bool bpf_jit_supports_percpu_insn(void)
>  {
>         return true;
>  }
> +
> +bool bpf_jit_inlines_helper_call(s32 imm)
> +{
> +       switch (imm) {
> +       case BPF_FUNC_get_smp_processor_id:
> +               return true;
> +       }
> +
> +       return false;

nit: why not

default:
    return false;

to keep everything within the switch?

> +}
> diff --git a/include/linux/filter.h b/include/linux/filter.h
> index 7a27f19bf44d..3e19bb62ed1a 100644
> --- a/include/linux/filter.h
> +++ b/include/linux/filter.h
> @@ -993,6 +993,7 @@ u64 __bpf_call_base(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5);
>  struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog);
>  void bpf_jit_compile(struct bpf_prog *prog);
>  bool bpf_jit_needs_zext(void);
> +bool bpf_jit_inlines_helper_call(s32 imm);
>  bool bpf_jit_supports_subprog_tailcalls(void);
>  bool bpf_jit_supports_percpu_insn(void);
>  bool bpf_jit_supports_kfunc_call(void);
> diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
> index 99b8b1c9a248..aa59af9f9bd9 100644
> --- a/kernel/bpf/core.c
> +++ b/kernel/bpf/core.c
> @@ -2941,6 +2941,17 @@ bool __weak bpf_jit_needs_zext(void)
>         return false;
>  }
>
> +/* Return true if the JIT inlines the call to the helper corresponding to
> + * the imm.
> + *
> + * The verifier will not patch the insn->imm for the call to the helper if
> + * this returns true.
> + */
> +bool __weak bpf_jit_inlines_helper_call(s32 imm)
> +{
> +       return false;
> +}
> +
>  /* Return TRUE if the JIT backend supports mixing bpf2bpf and tailcalls. */
>  bool __weak bpf_jit_supports_subprog_tailcalls(void)
>  {
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index 5d42db05315e..e78f766d7f91 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -20013,6 +20013,8 @@ static int do_misc_fixups(struct bpf_verifier_env *env)
>                         goto next_insn;
>                 }
>
> +               if (bpf_jit_inlines_helper_call(insn->imm))
> +                       goto next_insn;

It's nice to be able to allow BPF JIT to do a higher-performance
implementation. Let's add a short comment above to mention that this
is bypassing normal inlining because BPF JIT will do it better (I know
you have this description for the function definition, but a short
remark here would be helpful).

And please add an empty line after this check to logically separate it
from the rest of helper inlining logic in verifier, thanks!

pw-bot: cr

>                 if (insn->imm == BPF_FUNC_get_route_realm)
>                         prog->dst_needed = 1;
>                 if (insn->imm == BPF_FUNC_get_prandom_u32)
> --
> 2.40.1
>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH bpf-next v2 2/2] riscv, bpf: inline bpf_get_smp_processor_id()
  2024-05-01 16:46   ` Andrii Nakryiko
@ 2024-05-02 13:16     ` Puranjay Mohan
  2024-05-02 16:03       ` Andrii Nakryiko
  0 siblings, 1 reply; 11+ messages in thread
From: Puranjay Mohan @ 2024-05-02 13:16 UTC (permalink / raw
  To: Andrii Nakryiko
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Eduard Zingerman, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Björn Töpel, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	bpf, linux-riscv, linux-kernel, Pu Lehui

Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:

> On Tue, Apr 30, 2024 at 10:59 AM Puranjay Mohan <puranjay@kernel.org> wrote:
>>
>> Inline the calls to bpf_get_smp_processor_id() in the riscv bpf jit.
>>
>> RISCV saves the pointer to the CPU's task_struct in the TP (thread
>> pointer) register. This makes it trivial to get the CPU's processor id.
>> As thread_info is the first member of task_struct, we can read the
>> processor id from TP + offsetof(struct thread_info, cpu).
>>
>>           RISCV64 JIT output for `call bpf_get_smp_processor_id`
>>           ======================================================
>>
>>                 Before                           After
>>                --------                         -------
>>
>>          auipc   t1,0x848c                  ld    a5,32(tp)
>>          jalr    604(t1)
>>          mv      a5,a0
>>
>
> Nice, great find! Would you be able to do similar inlining for x86-64
> as well? Disassembling bpf_get_smp_processor_id for x86-64 shows this:
>
> Dump of assembler code for function bpf_get_smp_processor_id:
>    0xffffffff810f91a0 <+0>:     0f 1f 44 00 00  nopl   0x0(%rax,%rax,1)
>    0xffffffff810f91a5 <+5>:     65 8b 05 60 79 f3 7e    mov
> %gs:0x7ef37960(%rip),%eax        # 0x30b0c <pcpu_hot+12>
>    0xffffffff810f91ac <+12>:    48 98   cltq
>    0xffffffff810f91ae <+14>:    c3      ret
> End of assembler dump.
> We should be able to do the same in x86-64 BPF JIT. (it's actually how
> I started initially, I had a dedicated instruction reading per-cpu
> memory, but ended up with more general "calculate per-cpu address").

I feel in x86-64's case JIT can not do a (much) better job compared to the
current approach in the verifier.

On RISC-V and ARM64, JIT was able to do it better because both of these
architectures save a pointer to the task struct in a special CPU
register. As x86-64 doesn't have enough extra registers, it uses a
percpu variable to store task struct, thread_info, and the cpu
number.

P.S. - While doing this for BPF, I realized that ARM64 kernel code is
also not optimal as it is using the percpu variable and is not reading
the CPU register directly. So, I sent a patch[1] to fix it in the kernel
and get rid of the per-cpu variable in ARM64.


[1] https://lore.kernel.org/all/20240502123449.2690-2-puranjay@kernel.org/

> Anyways, great work, a small nit below.
>
> Acked-by: Andrii Nakryiko <andrii@kernel.org>

Thanks,
Puranjay

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH bpf-next v2 2/2] riscv, bpf: inline bpf_get_smp_processor_id()
  2024-05-02 13:16     ` Puranjay Mohan
@ 2024-05-02 16:03       ` Andrii Nakryiko
  0 siblings, 0 replies; 11+ messages in thread
From: Andrii Nakryiko @ 2024-05-02 16:03 UTC (permalink / raw
  To: Puranjay Mohan
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Eduard Zingerman, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Björn Töpel, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	bpf, linux-riscv, linux-kernel, Pu Lehui

On Thu, May 2, 2024 at 6:16 AM Puranjay Mohan <puranjay@kernel.org> wrote:
>
> Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:
>
> > On Tue, Apr 30, 2024 at 10:59 AM Puranjay Mohan <puranjay@kernel.org> wrote:
> >>
> >> Inline the calls to bpf_get_smp_processor_id() in the riscv bpf jit.
> >>
> >> RISCV saves the pointer to the CPU's task_struct in the TP (thread
> >> pointer) register. This makes it trivial to get the CPU's processor id.
> >> As thread_info is the first member of task_struct, we can read the
> >> processor id from TP + offsetof(struct thread_info, cpu).
> >>
> >>           RISCV64 JIT output for `call bpf_get_smp_processor_id`
> >>           ======================================================
> >>
> >>                 Before                           After
> >>                --------                         -------
> >>
> >>          auipc   t1,0x848c                  ld    a5,32(tp)
> >>          jalr    604(t1)
> >>          mv      a5,a0
> >>
> >
> > Nice, great find! Would you be able to do similar inlining for x86-64
> > as well? Disassembling bpf_get_smp_processor_id for x86-64 shows this:
> >
> > Dump of assembler code for function bpf_get_smp_processor_id:
> >    0xffffffff810f91a0 <+0>:     0f 1f 44 00 00  nopl   0x0(%rax,%rax,1)
> >    0xffffffff810f91a5 <+5>:     65 8b 05 60 79 f3 7e    mov
> > %gs:0x7ef37960(%rip),%eax        # 0x30b0c <pcpu_hot+12>
> >    0xffffffff810f91ac <+12>:    48 98   cltq
> >    0xffffffff810f91ae <+14>:    c3      ret
> > End of assembler dump.
> > We should be able to do the same in x86-64 BPF JIT. (it's actually how
> > I started initially, I had a dedicated instruction reading per-cpu
> > memory, but ended up with more general "calculate per-cpu address").
>
> I feel in x86-64's case JIT can not do a (much) better job compared to the
> current approach in the verifier.

This direct memory read (using gs segment) ought to be a bit faster
than calculating offset and then doing memory dereference, but yes,
the difference won't be as big as you got with RISC-V and ARM64. Ok,
never mind, we can always benchmark and add that later, no big deal.

>
> On RISC-V and ARM64, JIT was able to do it better because both of these
> architectures save a pointer to the task struct in a special CPU
> register. As x86-64 doesn't have enough extra registers, it uses a
> percpu variable to store task struct, thread_info, and the cpu
> number.
>
> P.S. - While doing this for BPF, I realized that ARM64 kernel code is
> also not optimal as it is using the percpu variable and is not reading
> the CPU register directly. So, I sent a patch[1] to fix it in the kernel
> and get rid of the per-cpu variable in ARM64.
>
>
> [1] https://lore.kernel.org/all/20240502123449.2690-2-puranjay@kernel.org/
>
> > Anyways, great work, a small nit below.
> >
> > Acked-by: Andrii Nakryiko <andrii@kernel.org>
>
> Thanks,
> Puranjay

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH bpf-next v2 1/2] riscv, bpf: add internal-only MOV instruction to resolve per-CPU addrs
  2024-05-01 16:39   ` Andrii Nakryiko
@ 2024-05-02 16:18     ` Björn Töpel
  2024-05-02 16:20       ` Puranjay Mohan
  0 siblings, 1 reply; 11+ messages in thread
From: Björn Töpel @ 2024-05-02 16:18 UTC (permalink / raw
  To: Andrii Nakryiko, Puranjay Mohan
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Eduard Zingerman, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, bpf, linux-riscv,
	linux-kernel, Pu Lehui, puranjay12

Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:

> On Tue, Apr 30, 2024 at 10:58 AM Puranjay Mohan <puranjay@kernel.org> wrote:
>>
>> Support an instruction for resolving absolute addresses of per-CPU
>> data from their per-CPU offsets. This instruction is internal-only and
>> users are not allowed to use them directly. They will only be used for
>> internal inlining optimizations for now between BPF verifier and BPF
>> JITs.
>>
>> RISC-V uses generic per-cpu implementation where the offsets for CPUs
>> are kept in an array called __per_cpu_offset[cpu_number]. RISCV stores
>> the address of the task_struct in TP register. The first element in
>> task_struct is struct thread_info, and we can get the cpu number by
>> reading from the TP register + offsetof(struct thread_info, cpu).
>>
>> Once we have the cpu number in a register we read the offset for that
>> cpu from address: &__per_cpu_offset + cpu_number << 3. Then we add this
>> offset to the destination register.
>>
>> To measure the improvement from this change, the benchmark in [1] was
>> used on Qemu:
>>
>> Before:
>> glob-arr-inc   :    1.127 ± 0.013M/s
>> arr-inc        :    1.121 ± 0.004M/s
>> hash-inc       :    0.681 ± 0.052M/s
>>
>> After:
>> glob-arr-inc   :    1.138 ± 0.011M/s
>> arr-inc        :    1.366 ± 0.006M/s
>> hash-inc       :    0.676 ± 0.001M/s
>>
>> [1] https://github.com/anakryiko/linux/commit/8dec900975ef
>>
>> Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
>> ---
>>  arch/riscv/net/bpf_jit_comp64.c | 24 ++++++++++++++++++++++++
>>  1 file changed, 24 insertions(+)
>>
>> diff --git a/arch/riscv/net/bpf_jit_comp64.c b/arch/riscv/net/bpf_jit_comp64.c
>> index 15e482f2c657..99d7006f1420 100644
>> --- a/arch/riscv/net/bpf_jit_comp64.c
>> +++ b/arch/riscv/net/bpf_jit_comp64.c
>> @@ -12,6 +12,7 @@
>>  #include <linux/stop_machine.h>
>>  #include <asm/patch.h>
>>  #include <asm/cfi.h>
>> +#include <asm/percpu.h>
>>  #include "bpf_jit.h"
>>
>>  #define RV_FENTRY_NINSNS 2
>> @@ -1089,6 +1090,24 @@ int bpf_jit_emit_insn(const struct bpf_insn *insn, struct rv_jit_context *ctx,
>>                         emit_or(RV_REG_T1, rd, RV_REG_T1, ctx);
>>                         emit_mv(rd, RV_REG_T1, ctx);
>>                         break;
>> +               } else if (insn_is_mov_percpu_addr(insn)) {
>> +                       if (rd != rs)
>> +                               emit_mv(rd, rs, ctx);

No biggie, but you did not fold this check into emit_mv().

>> +#ifdef CONFIG_SMP
>> +                               /* Load current CPU number in T1 */
>> +                               emit_ld(RV_REG_T1, offsetof(struct thread_info, cpu),
>> +                                       RV_REG_TP, ctx);
>> +                               /* << 3 because offsets are 8 bytes */
>> +                               emit_slli(RV_REG_T1, RV_REG_T1, 3, ctx);
>> +                               /* Load address of __per_cpu_offset array in T2 */
>> +                               emit_addr(RV_REG_T2, (u64)&__per_cpu_offset, extra_pass, ctx);
>> +                               /* Add offset of current CPU to  __per_cpu_offset */
>> +                               emit_add(RV_REG_T1, RV_REG_T2, RV_REG_T1, ctx);
>> +                               /* Load __per_cpu_offset[cpu] in T1 */
>> +                               emit_ld(RV_REG_T1, 0, RV_REG_T1, ctx);
>> +                               /* Add the offset to Rd */
>> +                               emit_add(rd, rd, RV_REG_T1, ctx);
>
> is this the right level of code indentation?

Looks wrong.

When the indent is fixed, feel free to add:

Acked-by: Björn Töpel <bjorn@kernel.org>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH bpf-next v2 2/2] riscv, bpf: inline bpf_get_smp_processor_id()
  2024-04-30 17:58 ` [PATCH bpf-next v2 2/2] riscv, bpf: inline bpf_get_smp_processor_id() Puranjay Mohan
  2024-04-30 19:18   ` Kumar Kartikeya Dwivedi
  2024-05-01 16:46   ` Andrii Nakryiko
@ 2024-05-02 16:19   ` Björn Töpel
  2 siblings, 0 replies; 11+ messages in thread
From: Björn Töpel @ 2024-05-02 16:19 UTC (permalink / raw
  To: Puranjay Mohan, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman, Song Liu,
	Yonghong Song, John Fastabend, KP Singh, Stanislav Fomichev,
	Hao Luo, Jiri Olsa, Paul Walmsley, Palmer Dabbelt, Albert Ou, bpf,
	linux-riscv, linux-kernel, Pu Lehui
  Cc: puranjay12

Puranjay Mohan <puranjay@kernel.org> writes:

> Inline the calls to bpf_get_smp_processor_id() in the riscv bpf jit.
>
> RISCV saves the pointer to the CPU's task_struct in the TP (thread
> pointer) register. This makes it trivial to get the CPU's processor id.
> As thread_info is the first member of task_struct, we can read the
> processor id from TP + offsetof(struct thread_info, cpu).
>
>           RISCV64 JIT output for `call bpf_get_smp_processor_id`
> 	  ======================================================
>
>                 Before                           After
>                --------                         -------
>
>          auipc   t1,0x848c                  ld    a5,32(tp)
>          jalr    604(t1)
>          mv      a5,a0
>
> Benchmark using [1] on Qemu.
>
> ./benchs/run_bench_trigger.sh glob-arr-inc arr-inc hash-inc
>
> +---------------+------------------+------------------+--------------+
> |      Name     |     Before       |       After      |   % change   |
> |---------------+------------------+------------------+--------------|
> | glob-arr-inc  | 1.077 ± 0.006M/s | 1.336 ± 0.010M/s |   + 24.04%   |
> | arr-inc       | 1.078 ± 0.002M/s | 1.332 ± 0.015M/s |   + 23.56%   |
> | hash-inc      | 0.494 ± 0.004M/s | 0.653 ± 0.001M/s |   + 32.18%   |
> +---------------+------------------+------------------+--------------+
>
> NOTE: This benchmark includes changes from this patch and the previous
>       patch that implemented the per-cpu insn.
>
> [1] https://github.com/anakryiko/linux/commit/8dec900975ef
>
> Signed-off-by: Puranjay Mohan <puranjay@kernel.org>

Acked-by: Björn Töpel <bjorn@kernel.org>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH bpf-next v2 1/2] riscv, bpf: add internal-only MOV instruction to resolve per-CPU addrs
  2024-05-02 16:18     ` Björn Töpel
@ 2024-05-02 16:20       ` Puranjay Mohan
  0 siblings, 0 replies; 11+ messages in thread
From: Puranjay Mohan @ 2024-05-02 16:20 UTC (permalink / raw
  To: Björn Töpel
  Cc: Andrii Nakryiko, Puranjay Mohan, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau,
	Eduard Zingerman, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa, Paul Walmsley,
	Palmer Dabbelt, Albert Ou, bpf, linux-riscv, linux-kernel,
	Pu Lehui

On Thu, May 2, 2024 at 6:18 PM Björn Töpel <bjorn@kernel.org> wrote:
>
> Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:
>
> > On Tue, Apr 30, 2024 at 10:58 AM Puranjay Mohan <puranjay@kernel.org> wrote:
> >>
> >> Support an instruction for resolving absolute addresses of per-CPU
> >> data from their per-CPU offsets. This instruction is internal-only and
> >> users are not allowed to use them directly. They will only be used for
> >> internal inlining optimizations for now between BPF verifier and BPF
> >> JITs.
> >>
> >> RISC-V uses generic per-cpu implementation where the offsets for CPUs
> >> are kept in an array called __per_cpu_offset[cpu_number]. RISCV stores
> >> the address of the task_struct in TP register. The first element in
> >> task_struct is struct thread_info, and we can get the cpu number by
> >> reading from the TP register + offsetof(struct thread_info, cpu).
> >>
> >> Once we have the cpu number in a register we read the offset for that
> >> cpu from address: &__per_cpu_offset + cpu_number << 3. Then we add this
> >> offset to the destination register.
> >>
> >> To measure the improvement from this change, the benchmark in [1] was
> >> used on Qemu:
> >>
> >> Before:
> >> glob-arr-inc   :    1.127 ± 0.013M/s
> >> arr-inc        :    1.121 ± 0.004M/s
> >> hash-inc       :    0.681 ± 0.052M/s
> >>
> >> After:
> >> glob-arr-inc   :    1.138 ± 0.011M/s
> >> arr-inc        :    1.366 ± 0.006M/s
> >> hash-inc       :    0.676 ± 0.001M/s
> >>
> >> [1] https://github.com/anakryiko/linux/commit/8dec900975ef
> >>
> >> Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
> >> ---
> >>  arch/riscv/net/bpf_jit_comp64.c | 24 ++++++++++++++++++++++++
> >>  1 file changed, 24 insertions(+)
> >>
> >> diff --git a/arch/riscv/net/bpf_jit_comp64.c b/arch/riscv/net/bpf_jit_comp64.c
> >> index 15e482f2c657..99d7006f1420 100644
> >> --- a/arch/riscv/net/bpf_jit_comp64.c
> >> +++ b/arch/riscv/net/bpf_jit_comp64.c
> >> @@ -12,6 +12,7 @@
> >>  #include <linux/stop_machine.h>
> >>  #include <asm/patch.h>
> >>  #include <asm/cfi.h>
> >> +#include <asm/percpu.h>
> >>  #include "bpf_jit.h"
> >>
> >>  #define RV_FENTRY_NINSNS 2
> >> @@ -1089,6 +1090,24 @@ int bpf_jit_emit_insn(const struct bpf_insn *insn, struct rv_jit_context *ctx,
> >>                         emit_or(RV_REG_T1, rd, RV_REG_T1, ctx);
> >>                         emit_mv(rd, RV_REG_T1, ctx);
> >>                         break;
> >> +               } else if (insn_is_mov_percpu_addr(insn)) {
> >> +                       if (rd != rs)
> >> +                               emit_mv(rd, rs, ctx);
>
> No biggie, but you did not fold this check into emit_mv().
>
> >> +#ifdef CONFIG_SMP
> >> +                               /* Load current CPU number in T1 */
> >> +                               emit_ld(RV_REG_T1, offsetof(struct thread_info, cpu),
> >> +                                       RV_REG_TP, ctx);
> >> +                               /* << 3 because offsets are 8 bytes */
> >> +                               emit_slli(RV_REG_T1, RV_REG_T1, 3, ctx);
> >> +                               /* Load address of __per_cpu_offset array in T2 */
> >> +                               emit_addr(RV_REG_T2, (u64)&__per_cpu_offset, extra_pass, ctx);
> >> +                               /* Add offset of current CPU to  __per_cpu_offset */
> >> +                               emit_add(RV_REG_T1, RV_REG_T2, RV_REG_T1, ctx);
> >> +                               /* Load __per_cpu_offset[cpu] in T1 */
> >> +                               emit_ld(RV_REG_T1, 0, RV_REG_T1, ctx);
> >> +                               /* Add the offset to Rd */
> >> +                               emit_add(rd, rd, RV_REG_T1, ctx);
> >
> > is this the right level of code indentation?
>
> Looks wrong.
>
> When the indent is fixed, feel free to add:
>
> Acked-by: Björn Töpel <bjorn@kernel.org>

I fixed the indent and sent it as part of:
https://lore.kernel.org/all/20240502151854.9810-1-puranjay@kernel.org/

Also, for the emit_mv() thing, I wanted to verify if we are using that
somewhere for zero-extension or something.
So, I thought I would send a separate patch for it.

Thanks,
Puranjay

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2024-05-02 16:21 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-04-30 17:58 [PATCH bpf-next v2 0/2] riscv, bpf: Support per-CPU insn and inline bpf_get_smp_processor_id() Puranjay Mohan
2024-04-30 17:58 ` [PATCH bpf-next v2 1/2] riscv, bpf: add internal-only MOV instruction to resolve per-CPU addrs Puranjay Mohan
2024-05-01 16:39   ` Andrii Nakryiko
2024-05-02 16:18     ` Björn Töpel
2024-05-02 16:20       ` Puranjay Mohan
2024-04-30 17:58 ` [PATCH bpf-next v2 2/2] riscv, bpf: inline bpf_get_smp_processor_id() Puranjay Mohan
2024-04-30 19:18   ` Kumar Kartikeya Dwivedi
2024-05-01 16:46   ` Andrii Nakryiko
2024-05-02 13:16     ` Puranjay Mohan
2024-05-02 16:03       ` Andrii Nakryiko
2024-05-02 16:19   ` Björn Töpel

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).