* [PATCH 0/2] Optimize the return_instance management of uretprobe
@ 2024-07-09  0:51 Liao Chang
  2024-07-09  0:51 ` [PATCH 1/2] uprobes: Optimize the return_instance related routines Liao Chang
  2024-07-09  0:51 ` [PATCH 2/2] selftests/bpf: Add uretprobe test for return_instance management Liao Chang
  0 siblings, 2 replies; 7+ messages in thread
From: Liao Chang @ 2024-07-09  0:51 UTC (permalink / raw)
  To: peterz, mingo, acme, namhyung, mark.rutland, alexander.shishkin,
	jolsa, irogers, adrian.hunter, kan.liang, ast, daniel, andrii,
	martin.lau, eddyz87, song, yonghong.song, john.fastabend, kpsingh,
	sdf, haoluo, mykolal, shuah, liaochang1
  Cc: linux-kernel, linux-perf-users, bpf, linux-kselftest

While exploring the uretprobe syscall and trampoline for ARM64, we observed
a slight performance gain in the Redis benchmark when using the uretprobe
syscall. This patchset aims to further improve uretprobe performance by
optimizing the management of struct return_instance data.

In detail, uretprobe currently uses dynamically allocated memory for struct
return_instance data, which tracks the call chain of instrumented functions.
This approach is not efficient, especially considering the inherent locality
of function invocation.

This patchset proposes a rework of return_instance management. It replaces
dynamic memory allocation with a statically allocated array. This approach
leverages the stack-style usage of return_instance and removes the need for
kmalloc/kfree operations.
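
The idea, in a simplified sketch (the push/pop helper names below are
illustrative only; the actual implementation is in patch 1, and the depth
check plus the one-time allocation of the array are omitted here):

/* Sketch only: the real code lives in kernel/events/uprobes.c (patch 1). */
struct return_frame {
	struct return_instance	*return_instance;	/* top of stack, NULL if empty */
	struct return_instance	*vaddr;			/* base of the fixed array */
};

/* "push" on function entry: no kmalloc() on the hot path */
static struct return_instance *push_return_instance(struct return_frame *frame)
{
	if (!frame->return_instance)
		return frame->return_instance = frame->vaddr;
	return ++frame->return_instance;
}

/* "pop" on function return: no kfree() either */
static void pop_return_instance(struct return_frame *frame)
{
	frame->return_instance = (frame->return_instance == frame->vaddr) ?
				 NULL : frame->return_instance - 1;
}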

This patchset has been tested on a Kunpeng916 (Hi1616) with 4 NUMA nodes and
64 cores @ 2.4GHz. Redis benchmarks show a throughput gain of about 2% for
the Redis GET and SET commands:

------------------------------------------------------------------
Test case       | No uretprobes | uretprobes     | uretprobes
                |               | (current)      | (optimized)
==================================================================
Redis SET (RPS) | 47025         | 40619 (-13.6%) | 41529 (-11.6%)
------------------------------------------------------------------
Redis GET (RPS) | 46715         | 41426 (-11.3%) | 42306 (-9.4%)
------------------------------------------------------------------

Liao Chang (2):
  uprobes: Optimize the return_instance related routines
  selftests/bpf: Add uretprobe test for return_instance management

 include/linux/uprobes.h                       |  10 +-
 kernel/events/uprobes.c                       | 162 +++++++++++-------
 .../bpf/prog_tests/uretprobe_depth.c          | 150 ++++++++++++++++
 .../selftests/bpf/progs/uretprobe_depth.c     |  19 ++
 4 files changed, 274 insertions(+), 67 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/uretprobe_depth.c
 create mode 100644 tools/testing/selftests/bpf/progs/uretprobe_depth.c

-- 
2.34.1



* [PATCH 1/2] uprobes: Optimize the return_instance related routines
  2024-07-09  0:51 [PATCH 0/2] Optimize the return_instance management of uretprobe Liao Chang
@ 2024-07-09  0:51 ` Liao Chang
  2024-07-09 23:55   ` Andrii Nakryiko
  2024-07-09  0:51 ` [PATCH 2/2] selftests/bpf: Add uretprobe test for return_instance management Liao Chang
  1 sibling, 1 reply; 7+ messages in thread
From: Liao Chang @ 2024-07-09  0:51 UTC (permalink / raw)
  To: peterz, mingo, acme, namhyung, mark.rutland, alexander.shishkin,
	jolsa, irogers, adrian.hunter, kan.liang, ast, daniel, andrii,
	martin.lau, eddyz87, song, yonghong.song, john.fastabend, kpsingh,
	sdf, haoluo, mykolal, shuah, liaochang1
  Cc: linux-kernel, linux-perf-users, bpf, linux-kselftest

Reduce the runtime overhead of the struct return_instance data managed by
uretprobe. This patch replaces the dynamic allocation with a statically
allocated array, leveraging two facts: the limited nesting depth of
uretprobe (max 64) and the function-call style of return_instance usage
(created at entry, freed at exit).

This patch has been tested on a Kunpeng916 (Hi1616) with 4 NUMA nodes and
64 cores @ 2.4GHz. Redis benchmarks show a throughput gain of about 2% for
the Redis GET and SET commands:

------------------------------------------------------------------
Test case       | No uretprobes | uretprobes     | uretprobes
                |               | (current)      | (optimized)
==================================================================
Redis SET (RPS) | 47025         | 40619 (-13.6%) | 41529 (-11.6%)
------------------------------------------------------------------
Redis GET (RPS) | 46715         | 41426 (-11.3%) | 42306 (-9.4%)
------------------------------------------------------------------

Signed-off-by: Liao Chang <liaochang1@huawei.com>
---
 include/linux/uprobes.h |  10 ++-
 kernel/events/uprobes.c | 162 ++++++++++++++++++++++++----------------
 2 files changed, 105 insertions(+), 67 deletions(-)

diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index f46e0ca0169c..ec50ff010b1d 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -55,6 +55,12 @@ enum uprobe_task_state {
 	UTASK_SSTEP_TRAPPED,
 };
 
+struct return_frame {
+	/* the frames of return instances */
+	struct return_instance	*return_instance;
+	struct return_instance	*vaddr;
+};
+
 /*
  * uprobe_task: Metadata of a task while it singlesteps.
  */
@@ -76,7 +82,7 @@ struct uprobe_task {
 	struct uprobe			*active_uprobe;
 	unsigned long			xol_vaddr;
 
-	struct return_instance		*return_instances;
+	struct return_frame		frame;
 	unsigned int			depth;
 };
 
@@ -86,8 +92,6 @@ struct return_instance {
 	unsigned long		stack;		/* stack pointer */
 	unsigned long		orig_ret_vaddr; /* original return address */
 	bool			chained;	/* true, if instance is nested */
-
-	struct return_instance	*next;		/* keep as stack */
 };
 
 enum rp_check {
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 2c83ba776fc7..81c56fd2811c 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -1697,12 +1697,89 @@ unsigned long uprobe_get_trap_addr(struct pt_regs *regs)
 	return instruction_pointer(regs);
 }
 
-static struct return_instance *free_ret_instance(struct return_instance *ri)
+static inline
+struct return_instance *next_ret_instance(struct return_frame *frame,
+					  struct return_instance *ri)
+{
+	return ri == frame->vaddr ? NULL : ri - 1;
+}
+
+static inline
+struct return_instance *curr_ret_instance(struct uprobe_task *task)
+{
+	return task->frame.return_instance;
+}
+
+static struct return_instance *find_next_ret_chain(struct uprobe_task *utask,
+						   struct return_instance *ri)
+{
+	bool chained;
+
+	do {
+		chained = ri->chained;
+		ri = next_ret_instance(&utask->frame, ri);
+	} while (chained);
+
+	return ri;
+}
+
+static inline
+struct return_instance *free_ret_instance(struct uprobe_task *utask,
+					  struct return_instance *ri)
 {
-	struct return_instance *next = ri->next;
 	put_uprobe(ri->uprobe);
-	kfree(ri);
-	return next;
+	return next_ret_instance(&utask->frame, ri);
+}
+
+static void free_return_instances(struct uprobe_task *task)
+{
+	struct return_frame *frame = &task->frame;
+	struct return_instance *ri = frame->return_instance;
+
+	while (ri) {
+		put_uprobe(ri->uprobe);
+		ri = next_ret_instance(frame, ri);
+	}
+
+	kfree(frame->vaddr);
+}
+
+static void cleanup_return_instances(struct uprobe_task *utask, bool chained,
+				     struct pt_regs *regs)
+{
+	struct return_frame *frame = &utask->frame;
+	struct return_instance *ri = frame->return_instance;
+	enum rp_check ctx = chained ? RP_CHECK_CHAIN_CALL : RP_CHECK_CALL;
+
+	while (ri && !arch_uretprobe_is_alive(ri, ctx, regs)) {
+		ri = next_ret_instance(frame, ri);
+		utask->depth--;
+	}
+	frame->return_instance = ri;
+}
+
+static struct return_instance *alloc_return_instance(struct uprobe_task *task)
+{
+	struct return_frame *frame = &task->frame;
+
+	if (!frame->vaddr) {
+		frame->vaddr = kcalloc(MAX_URETPROBE_DEPTH,
+				sizeof(struct return_instance), GFP_KERNEL);
+		if (!frame->vaddr)
+			return NULL;
+	}
+
+	if (!frame->return_instance) {
+		frame->return_instance = frame->vaddr;
+		return frame->return_instance;
+	}
+
+	return ++frame->return_instance;
+}
+
+static inline bool return_frame_empty(struct uprobe_task *task)
+{
+	return !task->frame.return_instance;
 }
 
 /*
@@ -1712,7 +1789,6 @@ static struct return_instance *free_ret_instance(struct return_instance *ri)
 void uprobe_free_utask(struct task_struct *t)
 {
 	struct uprobe_task *utask = t->utask;
-	struct return_instance *ri;
 
 	if (!utask)
 		return;
@@ -1720,10 +1796,7 @@ void uprobe_free_utask(struct task_struct *t)
 	if (utask->active_uprobe)
 		put_uprobe(utask->active_uprobe);
 
-	ri = utask->return_instances;
-	while (ri)
-		ri = free_ret_instance(ri);
-
+	free_return_instances(utask);
 	xol_free_insn_slot(t);
 	kfree(utask);
 	t->utask = NULL;
@@ -1747,26 +1820,20 @@ static struct uprobe_task *get_utask(void)
 static int dup_utask(struct task_struct *t, struct uprobe_task *o_utask)
 {
 	struct uprobe_task *n_utask;
-	struct return_instance **p, *o, *n;
+	struct return_instance *o, *n;
 
 	n_utask = kzalloc(sizeof(struct uprobe_task), GFP_KERNEL);
 	if (!n_utask)
 		return -ENOMEM;
 	t->utask = n_utask;
 
-	p = &n_utask->return_instances;
-	for (o = o_utask->return_instances; o; o = o->next) {
-		n = kmalloc(sizeof(struct return_instance), GFP_KERNEL);
-		if (!n)
-			return -ENOMEM;
-
+	o = curr_ret_instance(o_utask);
+	while (o) {
+		n = alloc_return_instance(n_utask);
+		n_utask->depth++;
 		*n = *o;
 		get_uprobe(n->uprobe);
-		n->next = NULL;
-
-		*p = n;
-		p = &n->next;
-		n_utask->depth++;
+		o = next_ret_instance(&o_utask->frame, o);
 	}
 
 	return 0;
@@ -1799,7 +1866,7 @@ void uprobe_copy_process(struct task_struct *t, unsigned long flags)
 
 	t->utask = NULL;
 
-	if (!utask || !utask->return_instances)
+	if (!utask || return_frame_empty(utask))
 		return;
 
 	if (mm == t->mm && !(flags & CLONE_VFORK))
@@ -1840,19 +1907,6 @@ static unsigned long get_trampoline_vaddr(void)
 	return trampoline_vaddr;
 }
 
-static void cleanup_return_instances(struct uprobe_task *utask, bool chained,
-					struct pt_regs *regs)
-{
-	struct return_instance *ri = utask->return_instances;
-	enum rp_check ctx = chained ? RP_CHECK_CHAIN_CALL : RP_CHECK_CALL;
-
-	while (ri && !arch_uretprobe_is_alive(ri, ctx, regs)) {
-		ri = free_ret_instance(ri);
-		utask->depth--;
-	}
-	utask->return_instances = ri;
-}
-
 static void prepare_uretprobe(struct uprobe *uprobe, struct pt_regs *regs)
 {
 	struct return_instance *ri;
@@ -1874,10 +1928,6 @@ static void prepare_uretprobe(struct uprobe *uprobe, struct pt_regs *regs)
 		return;
 	}
 
-	ri = kmalloc(sizeof(struct return_instance), GFP_KERNEL);
-	if (!ri)
-		return;
-
 	trampoline_vaddr = get_trampoline_vaddr();
 	orig_ret_vaddr = arch_uretprobe_hijack_return_addr(trampoline_vaddr, regs);
 	if (orig_ret_vaddr == -1)
@@ -1893,7 +1943,7 @@ static void prepare_uretprobe(struct uprobe *uprobe, struct pt_regs *regs)
 	 * instances. This also makes breakpoint unwrapping easier.
 	 */
 	if (chained) {
-		if (!utask->return_instances) {
+		if (return_frame_empty(utask)) {
 			/*
 			 * This situation is not possible. Likely we have an
 			 * attack from user-space.
@@ -1901,22 +1951,19 @@ static void prepare_uretprobe(struct uprobe *uprobe, struct pt_regs *regs)
 			uprobe_warn(current, "handle tail call");
 			goto fail;
 		}
-		orig_ret_vaddr = utask->return_instances->orig_ret_vaddr;
+		orig_ret_vaddr = curr_ret_instance(utask)->orig_ret_vaddr;
 	}
 
+	ri = alloc_return_instance(utask);
 	ri->uprobe = get_uprobe(uprobe);
 	ri->func = instruction_pointer(regs);
 	ri->stack = user_stack_pointer(regs);
 	ri->orig_ret_vaddr = orig_ret_vaddr;
 	ri->chained = chained;
-
 	utask->depth++;
-	ri->next = utask->return_instances;
-	utask->return_instances = ri;
 
-	return;
  fail:
-	kfree(ri);
+	return;
 }
 
 /* Prepare to single-step probed instruction out of line. */
@@ -2111,18 +2158,6 @@ handle_uretprobe_chain(struct return_instance *ri, struct pt_regs *regs)
 	up_read(&uprobe->register_rwsem);
 }
 
-static struct return_instance *find_next_ret_chain(struct return_instance *ri)
-{
-	bool chained;
-
-	do {
-		chained = ri->chained;
-		ri = ri->next;	/* can't be NULL if chained */
-	} while (chained);
-
-	return ri;
-}
-
 static void handle_trampoline(struct pt_regs *regs)
 {
 	struct uprobe_task *utask;
@@ -2133,7 +2168,7 @@ static void handle_trampoline(struct pt_regs *regs)
 	if (!utask)
 		goto sigill;
 
-	ri = utask->return_instances;
+	ri = curr_ret_instance(utask);
 	if (!ri)
 		goto sigill;
 
@@ -2144,25 +2179,24 @@ static void handle_trampoline(struct pt_regs *regs)
 		 * or NULL; the latter case means that nobody but ri->func
 		 * could hit this trampoline on return. TODO: sigaltstack().
 		 */
-		next = find_next_ret_chain(ri);
+		next = find_next_ret_chain(utask, ri);
 		valid = !next || arch_uretprobe_is_alive(next, RP_CHECK_RET, regs);
 
 		instruction_pointer_set(regs, ri->orig_ret_vaddr);
 		do {
 			if (valid)
 				handle_uretprobe_chain(ri, regs);
-			ri = free_ret_instance(ri);
+			ri = free_ret_instance(utask, ri);
 			utask->depth--;
 		} while (ri != next);
 	} while (!valid);
 
-	utask->return_instances = ri;
+	utask->frame.return_instance = ri;
 	return;
 
  sigill:
 	uprobe_warn(current, "handle uretprobe, sending SIGILL.");
 	force_sig(SIGILL);
-
 }
 
 bool __weak arch_uprobe_ignore(struct arch_uprobe *aup, struct pt_regs *regs)
@@ -2315,7 +2349,7 @@ int uprobe_pre_sstep_notifier(struct pt_regs *regs)
 		return 0;
 
 	if (!test_bit(MMF_HAS_UPROBES, &current->mm->flags) &&
-	    (!current->utask || !current->utask->return_instances))
+	    (!current->utask || return_frame_empty(current->utask)))
 		return 0;
 
 	set_thread_flag(TIF_UPROBE);
-- 
2.34.1



* [PATCH 2/2] selftests/bpf: Add uretprobe test for return_instance management
  2024-07-09  0:51 [PATCH 0/2] Optimize the return_instance management of uretprobe Liao Chang
  2024-07-09  0:51 ` [PATCH 1/2] uprobes: Optimize the return_instance related routines Liao Chang
@ 2024-07-09  0:51 ` Liao Chang
  1 sibling, 0 replies; 7+ messages in thread
From: Liao Chang @ 2024-07-09  0:51 UTC (permalink / raw)
  To: peterz, mingo, acme, namhyung, mark.rutland, alexander.shishkin,
	jolsa, irogers, adrian.hunter, kan.liang, ast, daniel, andrii,
	martin.lau, eddyz87, song, yonghong.song, john.fastabend, kpsingh,
	sdf, haoluo, mykolal, shuah, liaochang1
  Cc: linux-kernel, linux-perf-users, bpf, linux-kselftest

This patch adds three test cases to verify the proper management of
return_instance data by uretprobes:

- uretprobe_longjmp() verifies that longjmp() bypasses the uretprobe BPF
  program attached to the exit of the instrumented function.

- uretprobe_cleanup_return_instances() verifies that uretprobe reclaims
  the return_instance data created before a longjmp(), which causes the
  kernel to recount the nesting depth of the instrumented function.

- uretprobe_reach_nestedness_limit() confirms that uretprobe neither
  reclaims the return_instance data created before longjmp() nor hijacks
  the return address of the instrumented function once the nesting depth
  reaches the limit (64).

Signed-off-by: Liao Chang <liaochang1@huawei.com>
---
 .../bpf/prog_tests/uretprobe_depth.c          | 150 ++++++++++++++++++
 .../selftests/bpf/progs/uretprobe_depth.c     |  19 +++
 2 files changed, 169 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/uretprobe_depth.c
 create mode 100644 tools/testing/selftests/bpf/progs/uretprobe_depth.c

diff --git a/tools/testing/selftests/bpf/prog_tests/uretprobe_depth.c b/tools/testing/selftests/bpf/prog_tests/uretprobe_depth.c
new file mode 100644
index 000000000000..ba03b1868e37
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/uretprobe_depth.c
@@ -0,0 +1,150 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <test_progs.h>
+
+#include <unistd.h>
+#include <setjmp.h>
+#include <asm/ptrace.h>
+#include <linux/compiler.h>
+#include <linux/stringify.h>
+#include "uretprobe_depth.skel.h"
+
+#define RETVAL			0xFFFF
+#define JMPVAL			0x4B1D
+#define MAX_URETPROBE_DEPTH	64 // See include/linux/uprobes.h
+#define NR_OMITTED_URETPROBE	16
+static jmp_buf jmp;
+
+unsigned long __uretprobe_longjmp(int nest, int jmpval, int retval)
+{
+	if (nest) {
+		nest = ++retval < MAX_URETPROBE_DEPTH + NR_OMITTED_URETPROBE;
+		return __uretprobe_longjmp(nest, jmpval, retval);
+	}
+
+	if (jmpval == JMPVAL) {
+		longjmp(jmp, jmpval);
+		return 0;
+	} else
+		return retval;
+}
+
+static void uretprobe_longjmp(void)
+{
+	if (setjmp(jmp) == JMPVAL) {
+		__uretprobe_longjmp(0, 0, JMPVAL);
+		return;
+	}
+
+	__uretprobe_longjmp(0, JMPVAL, RETVAL);
+}
+
+static void uretprobe_cleanup_return_instances(void)
+{
+	if (setjmp(jmp) == JMPVAL) {
+		/*
+		 * Cleanup these return instance data created before longjmp
+		 * firstly. Then create 16 new return_instance data from here.
+		 */
+		__uretprobe_longjmp(1, 0, MAX_URETPROBE_DEPTH);
+		return;
+	}
+
+	/* Create 8 return_instance data from here. */
+	__uretprobe_longjmp(1, JMPVAL,
+			    MAX_URETPROBE_DEPTH + NR_OMITTED_URETPROBE / 2);
+}
+
+static void uretprobe_reach_nestedness_limit(void)
+{
+	if (setjmp(jmp) == JMPVAL) {
+		/*
+		 * Due to uretprobe reach to the nestedness limit, it doesn't
+		 * cleanup the return instance created before longjmp.
+		 */
+		__uretprobe_longjmp(1, 0, MAX_URETPROBE_DEPTH);
+		return;
+	}
+
+	/* Create 64 return_instance from here. */
+	__uretprobe_longjmp(1, JMPVAL, 0);
+}
+
+static void test_uretprobe_longjmp(void)
+{
+	struct uretprobe_depth *skel = NULL;
+	int err;
+
+	skel = uretprobe_depth__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "uretprobe_depth__open_and_load"))
+		goto cleanup;
+
+	err = uretprobe_depth__attach(skel);
+	if (!ASSERT_OK(err, "uretprobe_depth__attach"))
+		goto cleanup;
+
+	skel->bss->retval = -1;
+
+	uretprobe_longjmp();
+
+	ASSERT_EQ(skel->bss->retval, JMPVAL, "return value");
+
+cleanup:
+	uretprobe_depth__destroy(skel);
+}
+
+static void test_uretprobe_reach_nestedness_limit(void)
+{
+	struct uretprobe_depth *skel = NULL;
+	int err;
+
+	skel = uretprobe_depth__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "uretprobe_depth__open_and_load"))
+		goto cleanup;
+
+	err = uretprobe_depth__attach(skel);
+	if (!ASSERT_OK(err, "uretprobe_depth__attach"))
+		goto cleanup;
+
+	skel->bss->depth = 0;
+
+	uretprobe_reach_nestedness_limit();
+
+	ASSERT_EQ(skel->bss->depth, 0, "nest depth");
+
+cleanup:
+	uretprobe_depth__destroy(skel);
+}
+
+static void test_uretprobe_cleanup_return_instances(void)
+{
+	struct uretprobe_depth *skel = NULL;
+	int err;
+
+	skel = uretprobe_depth__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "uretprobe_depth__open_and_load"))
+		goto cleanup;
+
+	err = uretprobe_depth__attach(skel);
+	if (!ASSERT_OK(err, "uretprobe_depth__attach"))
+		goto cleanup;
+
+	skel->bss->depth = 0;
+
+	uretprobe_cleanup_return_instances();
+
+	ASSERT_EQ(skel->bss->depth, NR_OMITTED_URETPROBE + 1, "nest depth");
+
+cleanup:
+	uretprobe_depth__destroy(skel);
+}
+
+void test_uretprobe_return_instance(void)
+{
+	if (test__start_subtest("uretprobe_longjmp"))
+		test_uretprobe_longjmp();
+	if (test__start_subtest("uretprobe_cleanup_return_instances"))
+		test_uretprobe_cleanup_return_instances();
+	if (test__start_subtest("uretprobe_reach_nestedness_limit"))
+		test_uretprobe_reach_nestedness_limit();
+}
diff --git a/tools/testing/selftests/bpf/progs/uretprobe_depth.c b/tools/testing/selftests/bpf/progs/uretprobe_depth.c
new file mode 100644
index 000000000000..b71f2de52b5e
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/uretprobe_depth.c
@@ -0,0 +1,19 @@
+// SPDX-License-Identifier: GPL-2.0
+#include "vmlinux.h"
+#include <bpf/bpf_helpers.h>
+#include <string.h>
+
+int depth;
+unsigned long retval;
+
+char _license[] SEC("license") = "GPL";
+
+SEC("uretprobe//proc/self/exe:__uretprobe_longjmp")
+int uretprobe(struct pt_regs *ctx)
+{
+	depth++;
+#if defined(__TARGET_ARCH_arm64) || defined(__aarch64__)
+	retval = ctx->regs[0];
+#endif
+	return 0;
+}
-- 
2.34.1



* Re: [PATCH 1/2] uprobes: Optimize the return_instance related routines
  2024-07-09  0:51 ` [PATCH 1/2] uprobes: Optimize the return_instance related routines Liao Chang
@ 2024-07-09 23:55   ` Andrii Nakryiko
  2024-07-10  8:19     ` Liao, Chang
  0 siblings, 1 reply; 7+ messages in thread
From: Andrii Nakryiko @ 2024-07-09 23:55 UTC (permalink / raw)
  To: Liao Chang
  Cc: peterz, mingo, acme, namhyung, mark.rutland, alexander.shishkin,
	jolsa, irogers, adrian.hunter, kan.liang, ast, daniel, andrii,
	martin.lau, eddyz87, song, yonghong.song, john.fastabend, kpsingh,
	sdf, haoluo, mykolal, shuah, linux-kernel, linux-perf-users, bpf,
	linux-kselftest

On Mon, Jul 8, 2024 at 6:00 PM Liao Chang <liaochang1@huawei.com> wrote:
>
> Reduce the runtime overhead for struct return_instance data managed by
> uretprobe. This patch replaces the dynamic allocation with statically
> allocated array, leverage two facts that are limited nesting depth of
> uretprobe (max 64) and the function call style of return_instance usage
> (create at entry, free at exit).
>
> This patch has been tested on Kunpeng916 (Hi1616), 4 NUMA nodes, 64
> cores @ 2.4GHz. Redis benchmarks show a throughput gain by 2% for Redis
> GET and SET commands:
>
> ------------------------------------------------------------------
> Test case       | No uretprobes | uretprobes     | uretprobes
>                 |               | (current)      | (optimized)
> ==================================================================
> Redis SET (RPS) | 47025         | 40619 (-13.6%) | 41529 (-11.6%)
> ------------------------------------------------------------------
> Redis GET (RPS) | 46715         | 41426 (-11.3%) | 42306 (-9.4%)
> ------------------------------------------------------------------
>
> Signed-off-by: Liao Chang <liaochang1@huawei.com>
> ---
>  include/linux/uprobes.h |  10 ++-
>  kernel/events/uprobes.c | 162 ++++++++++++++++++++++++----------------
>  2 files changed, 105 insertions(+), 67 deletions(-)
>

[...]

> +static void cleanup_return_instances(struct uprobe_task *utask, bool chained,
> +                                    struct pt_regs *regs)
> +{
> +       struct return_frame *frame = &utask->frame;
> +       struct return_instance *ri = frame->return_instance;
> +       enum rp_check ctx = chained ? RP_CHECK_CHAIN_CALL : RP_CHECK_CALL;
> +
> +       while (ri && !arch_uretprobe_is_alive(ri, ctx, regs)) {
> +               ri = next_ret_instance(frame, ri);
> +               utask->depth--;
> +       }
> +       frame->return_instance = ri;
> +}
> +
> +static struct return_instance *alloc_return_instance(struct uprobe_task *task)
> +{
> +       struct return_frame *frame = &task->frame;
> +
> +       if (!frame->vaddr) {
> +               frame->vaddr = kcalloc(MAX_URETPROBE_DEPTH,
> +                               sizeof(struct return_instance), GFP_KERNEL);

Are you just pre-allocating MAX_URETPROBE_DEPTH instances always?
I.e., even if we need just one (because there is no recursion), you'd
still waste memory on all 64 of them?

That seems rather wasteful.

Have you considered using objpool for fast reuse across multiple CPUs?
Check lib/objpool.c.
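
For illustration, an objpool-based variant might look roughly like the
sketch below (the objpool_init() arguments, the NULL callbacks, and the
pool sizing are assumptions on my part; see lib/objpool.c and its
kretprobe user for the authoritative API):

#include <linux/objpool.h>

/* Sketch only: one shared pool, sized for the worst-case nesting depth. */
static struct objpool_head ri_pool;

static int init_return_instance_pool(void)
{
	return objpool_init(&ri_pool,
			    MAX_URETPROBE_DEPTH * num_possible_cpus(),
			    sizeof(struct return_instance), GFP_KERNEL,
			    NULL, NULL, NULL);
}

static struct return_instance *alloc_return_instance(void)
{
	/* lock-free pop from per-CPU slots, no kmalloc() on the hot path */
	return objpool_pop(&ri_pool);
}

static void free_return_instance(struct return_instance *ri)
{
	objpool_push(ri, &ri_pool);
}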

> +               if (!frame->vaddr)
> +                       return NULL;
> +       }
> +
> +       if (!frame->return_instance) {
> +               frame->return_instance = frame->vaddr;
> +               return frame->return_instance;
> +       }
> +
> +       return ++frame->return_instance;
> +}
> +
> +static inline bool return_frame_empty(struct uprobe_task *task)
> +{
> +       return !task->frame.return_instance;
>  }
>
>  /*

[...]


* Re: [PATCH 1/2] uprobes: Optimize the return_instance related routines
  2024-07-09 23:55   ` Andrii Nakryiko
@ 2024-07-10  8:19     ` Liao, Chang
  2024-07-10 21:21       ` Andrii Nakryiko
  0 siblings, 1 reply; 7+ messages in thread
From: Liao, Chang @ 2024-07-10  8:19 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: peterz, mingo, acme, namhyung, mark.rutland, alexander.shishkin,
	jolsa, irogers, adrian.hunter, kan.liang, ast, daniel, andrii,
	martin.lau, eddyz87, song, yonghong.song, john.fastabend, kpsingh,
	sdf, haoluo, mykolal, shuah, linux-kernel, linux-perf-users, bpf,
	linux-kselftest



在 2024/7/10 7:55, Andrii Nakryiko 写道:
> On Mon, Jul 8, 2024 at 6:00 PM Liao Chang <liaochang1@huawei.com> wrote:
>>
>> Reduce the runtime overhead for struct return_instance data managed by
>> uretprobe. This patch replaces the dynamic allocation with statically
>> allocated array, leverage two facts that are limited nesting depth of
>> uretprobe (max 64) and the function call style of return_instance usage
>> (create at entry, free at exit).
>>
>> This patch has been tested on Kunpeng916 (Hi1616), 4 NUMA nodes, 64
>> cores @ 2.4GHz. Redis benchmarks show a throughput gain by 2% for Redis
>> GET and SET commands:
>>
>> ------------------------------------------------------------------
>> Test case       | No uretprobes | uretprobes     | uretprobes
>>                 |               | (current)      | (optimized)
>> ==================================================================
>> Redis SET (RPS) | 47025         | 40619 (-13.6%) | 41529 (-11.6%)
>> ------------------------------------------------------------------
>> Redis GET (RPS) | 46715         | 41426 (-11.3%) | 42306 (-9.4%)
>> ------------------------------------------------------------------
>>
>> Signed-off-by: Liao Chang <liaochang1@huawei.com>
>> ---
>>  include/linux/uprobes.h |  10 ++-
>>  kernel/events/uprobes.c | 162 ++++++++++++++++++++++++----------------
>>  2 files changed, 105 insertions(+), 67 deletions(-)
>>
> 
> [...]
> 
>> +static void cleanup_return_instances(struct uprobe_task *utask, bool chained,
>> +                                    struct pt_regs *regs)
>> +{
>> +       struct return_frame *frame = &utask->frame;
>> +       struct return_instance *ri = frame->return_instance;
>> +       enum rp_check ctx = chained ? RP_CHECK_CHAIN_CALL : RP_CHECK_CALL;
>> +
>> +       while (ri && !arch_uretprobe_is_alive(ri, ctx, regs)) {
>> +               ri = next_ret_instance(frame, ri);
>> +               utask->depth--;
>> +       }
>> +       frame->return_instance = ri;
>> +}
>> +
>> +static struct return_instance *alloc_return_instance(struct uprobe_task *task)
>> +{
>> +       struct return_frame *frame = &task->frame;
>> +
>> +       if (!frame->vaddr) {
>> +               frame->vaddr = kcalloc(MAX_URETPROBE_DEPTH,
>> +                               sizeof(struct return_instance), GFP_KERNEL);
> 
> Are you just pre-allocating MAX_URETPROBE_DEPTH instances always?
> I.e., even if we need just one (because there is no recursion), you'd
> still waste memory for all 64 ones?

That is correct. On my test machines, each struct return_instance is 28
bytes, resulting in a total of 1792 pre-allocated bytes when the first
instrumented function is hit.

> 
> That seems rather wasteful.
> 
> Have you considered using objpool for fast reuse across multiple CPUs?
> Check lib/objpool.c.

After studying how kretprobe uses objpool, I'm convinced it is the right
solution for managing return_instance in uretprobe. However, I need some time
to fully understand the objpool code itself and run some benchmarks to verify
its performance.

Thanks for the suggestion.

> 
>> +               if (!frame->vaddr)
>> +                       return NULL;
>> +       }
>> +
>> +       if (!frame->return_instance) {
>> +               frame->return_instance = frame->vaddr;
>> +               return frame->return_instance;
>> +       }
>> +
>> +       return ++frame->return_instance;
>> +}
>> +
>> +static inline bool return_frame_empty(struct uprobe_task *task)
>> +{
>> +       return !task->frame.return_instance;
>>  }
>>
>>  /*
> 
> [...]

-- 
BR
Liao, Chang


* Re: [PATCH 1/2] uprobes: Optimize the return_instance related routines
  2024-07-10  8:19     ` Liao, Chang
@ 2024-07-10 21:21       ` Andrii Nakryiko
  2024-07-11  2:05         ` Liao, Chang
  0 siblings, 1 reply; 7+ messages in thread
From: Andrii Nakryiko @ 2024-07-10 21:21 UTC (permalink / raw)
  To: Liao, Chang
  Cc: peterz, mingo, acme, namhyung, mark.rutland, alexander.shishkin,
	jolsa, irogers, adrian.hunter, kan.liang, ast, daniel, andrii,
	martin.lau, eddyz87, song, yonghong.song, john.fastabend, kpsingh,
	sdf, haoluo, mykolal, shuah, linux-kernel, linux-perf-users, bpf,
	linux-kselftest

On Wed, Jul 10, 2024 at 1:19 AM Liao, Chang <liaochang1@huawei.com> wrote:
>
>
>
> 在 2024/7/10 7:55, Andrii Nakryiko 写道:
> > On Mon, Jul 8, 2024 at 6:00 PM Liao Chang <liaochang1@huawei.com> wrote:
> >>
> >> Reduce the runtime overhead for struct return_instance data managed by
> >> uretprobe. This patch replaces the dynamic allocation with statically
> >> allocated array, leverage two facts that are limited nesting depth of
> >> uretprobe (max 64) and the function call style of return_instance usage
> >> (create at entry, free at exit).
> >>
> >> This patch has been tested on Kunpeng916 (Hi1616), 4 NUMA nodes, 64
> >> cores @ 2.4GHz. Redis benchmarks show a throughput gain by 2% for Redis
> >> GET and SET commands:
> >>
> >> ------------------------------------------------------------------
> >> Test case       | No uretprobes | uretprobes     | uretprobes
> >>                 |               | (current)      | (optimized)
> >> ==================================================================
> >> Redis SET (RPS) | 47025         | 40619 (-13.6%) | 41529 (-11.6%)
> >> ------------------------------------------------------------------
> >> Redis GET (RPS) | 46715         | 41426 (-11.3%) | 42306 (-9.4%)
> >> ------------------------------------------------------------------
> >>
> >> Signed-off-by: Liao Chang <liaochang1@huawei.com>
> >> ---
> >>  include/linux/uprobes.h |  10 ++-
> >>  kernel/events/uprobes.c | 162 ++++++++++++++++++++++++----------------
> >>  2 files changed, 105 insertions(+), 67 deletions(-)
> >>
> >
> > [...]
> >
> >> +static void cleanup_return_instances(struct uprobe_task *utask, bool chained,
> >> +                                    struct pt_regs *regs)
> >> +{
> >> +       struct return_frame *frame = &utask->frame;
> >> +       struct return_instance *ri = frame->return_instance;
> >> +       enum rp_check ctx = chained ? RP_CHECK_CHAIN_CALL : RP_CHECK_CALL;
> >> +
> >> +       while (ri && !arch_uretprobe_is_alive(ri, ctx, regs)) {
> >> +               ri = next_ret_instance(frame, ri);
> >> +               utask->depth--;
> >> +       }
> >> +       frame->return_instance = ri;
> >> +}
> >> +
> >> +static struct return_instance *alloc_return_instance(struct uprobe_task *task)
> >> +{
> >> +       struct return_frame *frame = &task->frame;
> >> +
> >> +       if (!frame->vaddr) {
> >> +               frame->vaddr = kcalloc(MAX_URETPROBE_DEPTH,
> >> +                               sizeof(struct return_instance), GFP_KERNEL);
> >
> > Are you just pre-allocating MAX_URETPROBE_DEPTH instances always?
> > I.e., even if we need just one (because there is no recursion), you'd
> > still waste memory for all 64 ones?
>
> This is the truth. On my testing machines, each struct return_instance data
> is 28 bytes, resulting in a total pre-allocated 1792 bytes when the first
> instrumented function is hit.
>
> >
> > That seems rather wasteful.
> >
> > Have you considered using objpool for fast reuse across multiple CPUs?
> > Check lib/objpool.c.
>
> After studying how kretprobe uses objpool, I'm convinced it is a right solution for
> managing return_instance in uretporbe. While I need some time to fully understand
> the objpool code itself and run some benchmark to verify its performance.
>
> Thanks for the suggestion.

Keep in mind that there are two patch sets under development/review,
both of which touch this code. [0] will make return_instance
variable-sized, so think how to accommodate that. And [1] in general
touches a bunch of this code. So I'd let those two settle and land
before optimizing return_instance allocations further.

  [0] https://lore.kernel.org/linux-trace-kernel/20240701164115.723677-1-jolsa@kernel.org/
  [1] https://lore.kernel.org/linux-kernel/20240708091241.544262971@infradead.org/

>
> >
> >> +               if (!frame->vaddr)
> >> +                       return NULL;
> >> +       }
> >> +
> >> +       if (!frame->return_instance) {
> >> +               frame->return_instance = frame->vaddr;
> >> +               return frame->return_instance;
> >> +       }
> >> +
> >> +       return ++frame->return_instance;
> >> +}
> >> +
> >> +static inline bool return_frame_empty(struct uprobe_task *task)
> >> +{
> >> +       return !task->frame.return_instance;
> >>  }
> >>
> >>  /*
> >
> > [...]
>
> --
> BR
> Liao, Chang


* Re: [PATCH 1/2] uprobes: Optimize the return_instance related routines
  2024-07-10 21:21       ` Andrii Nakryiko
@ 2024-07-11  2:05         ` Liao, Chang
  0 siblings, 0 replies; 7+ messages in thread
From: Liao, Chang @ 2024-07-11  2:05 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: peterz, mingo, acme, namhyung, mark.rutland, alexander.shishkin,
	jolsa, irogers, adrian.hunter, kan.liang, ast, daniel, andrii,
	martin.lau, eddyz87, song, yonghong.song, john.fastabend, kpsingh,
	sdf, haoluo, mykolal, shuah, linux-kernel, linux-perf-users, bpf,
	linux-kselftest



在 2024/7/11 5:21, Andrii Nakryiko 写道:
> On Wed, Jul 10, 2024 at 1:19 AM Liao, Chang <liaochang1@huawei.com> wrote:
>>
>>
>>
>> 在 2024/7/10 7:55, Andrii Nakryiko 写道:
>>> On Mon, Jul 8, 2024 at 6:00 PM Liao Chang <liaochang1@huawei.com> wrote:
>>>>
>>>> Reduce the runtime overhead for struct return_instance data managed by
>>>> uretprobe. This patch replaces the dynamic allocation with statically
>>>> allocated array, leverage two facts that are limited nesting depth of
>>>> uretprobe (max 64) and the function call style of return_instance usage
>>>> (create at entry, free at exit).
>>>>
>>>> This patch has been tested on Kunpeng916 (Hi1616), 4 NUMA nodes, 64
>>>> cores @ 2.4GHz. Redis benchmarks show a throughput gain by 2% for Redis
>>>> GET and SET commands:
>>>>
>>>> ------------------------------------------------------------------
>>>> Test case       | No uretprobes | uretprobes     | uretprobes
>>>>                 |               | (current)      | (optimized)
>>>> ==================================================================
>>>> Redis SET (RPS) | 47025         | 40619 (-13.6%) | 41529 (-11.6%)
>>>> ------------------------------------------------------------------
>>>> Redis GET (RPS) | 46715         | 41426 (-11.3%) | 42306 (-9.4%)
>>>> ------------------------------------------------------------------
>>>>
>>>> Signed-off-by: Liao Chang <liaochang1@huawei.com>
>>>> ---
>>>>  include/linux/uprobes.h |  10 ++-
>>>>  kernel/events/uprobes.c | 162 ++++++++++++++++++++++++----------------
>>>>  2 files changed, 105 insertions(+), 67 deletions(-)
>>>>
>>>
>>> [...]
>>>
>>>> +static void cleanup_return_instances(struct uprobe_task *utask, bool chained,
>>>> +                                    struct pt_regs *regs)
>>>> +{
>>>> +       struct return_frame *frame = &utask->frame;
>>>> +       struct return_instance *ri = frame->return_instance;
>>>> +       enum rp_check ctx = chained ? RP_CHECK_CHAIN_CALL : RP_CHECK_CALL;
>>>> +
>>>> +       while (ri && !arch_uretprobe_is_alive(ri, ctx, regs)) {
>>>> +               ri = next_ret_instance(frame, ri);
>>>> +               utask->depth--;
>>>> +       }
>>>> +       frame->return_instance = ri;
>>>> +}
>>>> +
>>>> +static struct return_instance *alloc_return_instance(struct uprobe_task *task)
>>>> +{
>>>> +       struct return_frame *frame = &task->frame;
>>>> +
>>>> +       if (!frame->vaddr) {
>>>> +               frame->vaddr = kcalloc(MAX_URETPROBE_DEPTH,
>>>> +                               sizeof(struct return_instance), GFP_KERNEL);
>>>
>>> Are you just pre-allocating MAX_URETPROBE_DEPTH instances always?
>>> I.e., even if we need just one (because there is no recursion), you'd
>>> still waste memory for all 64 ones?
>>
>> This is the truth. On my testing machines, each struct return_instance data
>> is 28 bytes, resulting in a total pre-allocated 1792 bytes when the first
>> instrumented function is hit.
>>
>>>
>>> That seems rather wasteful.
>>>
>>> Have you considered using objpool for fast reuse across multiple CPUs?
>>> Check lib/objpool.c.
>>
>> After studying how kretprobe uses objpool, I'm convinced it is a right solution for
>> managing return_instance in uretporbe. While I need some time to fully understand
>> the objpool code itself and run some benchmark to verify its performance.
>>
>> Thanks for the suggestion.
> 
> Keep in mind that there are two patch sets under development/review,
> both of which touch this code. [0] will make return_instance
> variable-sized, so think how to accommodate that. And [1] in general
> touches a bunch of this code. So I'd let those two settle and land
> before optimizing return_instance allocations further.
> 
>   [0] https://lore.kernel.org/linux-trace-kernel/20240701164115.723677-1-jolsa@kernel.org/
>   [1] https://lore.kernel.org/linux-kernel/20240708091241.544262971@infradead.org/

Thanks for letting me know. I've made a note to track the progress of these patch sets.

> 
>>
>>>
>>>> +               if (!frame->vaddr)
>>>> +                       return NULL;
>>>> +       }
>>>> +
>>>> +       if (!frame->return_instance) {
>>>> +               frame->return_instance = frame->vaddr;
>>>> +               return frame->return_instance;
>>>> +       }
>>>> +
>>>> +       return ++frame->return_instance;
>>>> +}
>>>> +
>>>> +static inline bool return_frame_empty(struct uprobe_task *task)
>>>> +{
>>>> +       return !task->frame.return_instance;
>>>>  }
>>>>
>>>>  /*
>>>
>>> [...]
>>
>> --
>> BR
>> Liao, Chang

-- 
BR
Liao, Chang


