* [PATCH v2 0/4] Dump off-cpu samples directly
@ 2024-04-24  2:48 Howard Chu
  2024-04-24  2:48 ` [PATCH v2 1/4] perf record off-cpu: Parse off-cpu event, change config location Howard Chu
                   ` (4 more replies)
  0 siblings, 5 replies; 9+ messages in thread
From: Howard Chu @ 2024-04-24  2:48 UTC (permalink / raw)
  To: peterz
  Cc: mingo, acme, namhyung, mark.rutland, alexander.shishkin, jolsa,
	irogers, adrian.hunter, kan.liang, zegao2021, leo.yan,
	ravi.bangoria, linux-perf-users, linux-kernel, bpf

As mentioned in: https://bugzilla.kernel.org/show_bug.cgi?id=207323

Currently, off-cpu samples are dumped when perf record exits. This
results in off-cpu samples appearing after the regular samples. Also,
samples are stored in large BPF maps, which contain all the stack traces
and accumulated off-cpu time but will eventually fill up when running
for an extended period. This patchset fixes those problems by dumping
samples directly into the perf ring buffer and converting them to the
correct format.

Before, off-cpu samples come after the regular samples:

```
         swapper       0 [000] 963432.136150:    2812933    cycles:P:  ffffffffb7db1bc2 intel_idle+0x62 ([kernel.kallsyms])
         swapper       0 [000] 963432.637911:    4932876    cycles:P:  ffffffffb7db1bc2 intel_idle+0x62 ([kernel.kallsyms])
         swapper       0 [001] 963432.798072:    6273398    cycles:P:  ffffffffb7db1bc2 intel_idle+0x62 ([kernel.kallsyms])
         swapper       0 [000] 963433.541152:    5279005    cycles:P:  ffffffffb7db1bc2 intel_idle+0x62 ([kernel.kallsyms])
sh 1410180 [000] 18446744069.414584:    2528851 offcpu-time: 
	    7837148e6e87 wait4+0x17 (/usr/lib/libc.so.6)


sh 1410185 [000] 18446744069.414584:    2314223 offcpu-time: 
	    7837148e6e87 wait4+0x17 (/usr/lib/libc.so.6)


awk 1409644 [000] 18446744069.414584:     191785 offcpu-time: 
	    702609d03681 read+0x11 (/usr/lib/libc.so.6)
	          4a02a4 [unknown] ([unknown])
```


After, regular samples (cycles:P) and off-cpu (offcpu-time) samples are
collected simultaneously:

```
upowerd     741 [000] 963757.428701:     297848 offcpu-time: 
	    72b2da11e6bc read+0x4c (/usr/lib/libc.so.6)


      irq/9-acpi      56 [000] 963757.429116:    8760875    cycles:P:  ffffffffb779849f acpi_os_read_port+0x2f ([kernel.kallsyms])
upowerd     741 [000] 963757.429172:     459522 offcpu-time: 
	    72b2da11e6bc read+0x4c (/usr/lib/libc.so.6)


         swapper       0 [002] 963757.434529:    5759904    cycles:P:  ffffffffb7db1bc2 intel_idle+0x62 ([kernel.kallsyms])
perf 1419260 [000] 963757.434550: 1001012116 offcpu-time: 
	    7274e5d190bf __poll+0x4f (/usr/lib/libc.so.6)
	    591acfc5daf0 perf_evlist__poll+0x24 (/root/hw/perf-tools-next/tools/perf/perf)
	    591acfb1ca50 perf_evlist__poll_thread+0x160 (/root/hw/perf-tools-next/tools/perf/perf)
	    7274e5ca955a [unknown] (/usr/lib/libc.so.6)
```

Here's a simple flowchart:

[parse_event (sample type: PERF_SAMPLE_RAW)]
  --> [config (bind fds, sample_id, sample_type)]
  --> [off_cpu_strip (sample type: PERF_SAMPLE_RAW)]
  --> [record_done (hooks off_cpu_finish)]
  --> [prepare_parse (sample type: OFFCPU_SAMPLE_TYPES)]
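
In call order, that lands in builtin-record.c roughly like this (a
simplified sketch of the ordering, not literal code from the series;
error handling and unrelated steps omitted):

```
static int record_off_cpu_flow(struct record *rec)
{
	/* cover letter flowchart, expressed as the call sequence */
	parse_event(rec->evlist, "bpf-output/no-inherit=1,name=offcpu-time/");
	record__open(rec);                  /* mmapped fds exist now */
	record__config_off_cpu(rec);        /* bind fds, sample_id, sample_type */
	/* ... record loop: record__pushfn() -> off_cpu_strip() ... */
	perf_hooks__invoke_record_done();   /* off_cpu_finish stops BPF output */
	evlist__disable(rec->evlist);
	off_cpu_prepare_parse(rec->evlist); /* sample_type -> OFFCPU_SAMPLE_TYPES */
	return 0;
}
```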

Changes in v2:
 - Remove unnecessary comments.
 - Rename function off_cpu_change_type to off_cpu_prepare_parse

Howard Chu (4):
  perf record off-cpu: Parse off-cpu event, change config location
  perf record off-cpu: BPF perf_event_output on sched_switch
  perf record off-cpu: extract off-cpu sample data from raw_data
  perf record off-cpu: delete bound-to-fail test

 tools/perf/builtin-record.c             |  98 +++++++++-
 tools/perf/tests/shell/record_offcpu.sh |  29 ---
 tools/perf/util/bpf_off_cpu.c           | 242 +++++++++++-------------
 tools/perf/util/bpf_skel/off_cpu.bpf.c  | 163 +++++++++++++---
 tools/perf/util/evsel.c                 |   8 -
 tools/perf/util/off_cpu.h               |  14 +-
 tools/perf/util/perf-hooks-list.h       |   1 +
 7 files changed, 344 insertions(+), 211 deletions(-)

-- 
2.44.0



* [PATCH v2 1/4] perf record off-cpu: Parse off-cpu event, change config location
  2024-04-24  2:48 [PATCH v2 0/4] Dump off-cpu samples directly Howard Chu
@ 2024-04-24  2:48 ` Howard Chu
  2024-04-24  2:48 ` [PATCH v2 2/4] perf record off-cpu: BPF perf_event_output on sched_switch Howard Chu
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 9+ messages in thread
From: Howard Chu @ 2024-04-24  2:48 UTC (permalink / raw)
  To: peterz
  Cc: mingo, acme, namhyung, mark.rutland, alexander.shishkin, jolsa,
	irogers, adrian.hunter, kan.liang, zegao2021, leo.yan,
	ravi.bangoria, linux-perf-users, linux-kernel, bpf

Parse the off-cpu event using parse_event(). Move record__config_off_cpu()
after record__open(), because we need to write the mmapped fds into BPF's
perf_event_array map and also write the sample_id/sample_type into BPF. In
record__pushfn() and record__aio_pushfn(), handle off-cpu samples with
off_cpu_strip(). This is needed because the sample data we actually want
to write to perf.data lives in the off-cpu samples' raw_data section:

regular samples:
[sample: sample_data]

off-cpu samples:
[sample: [raw_data: sample_data]]

We need to extract the useful sample data before writing.
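
A minimal sketch of that extraction, assuming the nesting above (the
strip_one_event() helper is illustrative, not the code this series adds;
the real work happens in off_cpu_strip()):

```
/*
 * Illustrative only: rebuild a regular sample from the raw_data
 * payload of one off-cpu bpf-output sample.
 */
static size_t strip_one_event(struct perf_sample *sample, void *dst)
{
	/* raw_data carries an extra u32 of alignment padding; drop it */
	size_t payload = sample->raw_size - sizeof(u32);
	struct perf_event_header hdr = {
		.type = PERF_RECORD_SAMPLE,
		.misc = PERF_RECORD_MISC_USER,
		.size = sizeof(hdr) + payload,
	};

	memcpy(dst, &hdr, sizeof(hdr));
	memcpy((char *)dst + sizeof(hdr), sample->raw_data, payload);
	return hdr.size;
}
```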

Hook record_done just before evlist__disable() to stop the BPF program
from outputting; otherwise, we lose some samples.

After samples are collected, change the off-cpu event's sample_type to
OFFCPU_SAMPLE_TYPES so that the samples are parsed correctly. It was
PERF_SAMPLE_RAW plus some others before, because BPF can only output to
that specific type of perf_event, which is also why
`evsel->core.attr.sample_type &= OFFCPU_SAMPLE_TYPES;` is deleted in
util/evsel.c.

Signed-off-by: Howard Chu <howardchu95@gmail.com>
---
 tools/perf/builtin-record.c | 98 ++++++++++++++++++++++++++++++++++---
 tools/perf/util/evsel.c     |  8 ---
 2 files changed, 91 insertions(+), 15 deletions(-)

diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index 2ff718d3e202..3994adaf4607 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -389,6 +389,8 @@ struct record_aio {
 static int record__aio_pushfn(struct mmap *map, void *to, void *buf, size_t size)
 {
 	struct record_aio *aio = to;
+	char *bf_stripped = NULL;
+	ssize_t stripped;
 
 	/*
 	 * map->core.base data pointed by buf is copied into free map->aio.data[] buffer
@@ -404,6 +406,31 @@ static int record__aio_pushfn(struct mmap *map, void *to, void *buf, size_t size
 	 * from the beginning of the kernel buffer till the end of the data chunk.
 	 */
 
+	if (aio->rec->off_cpu) {
+		if (size == 0)
+			return 0;
+
+		map->core.start -= size;
+		size = map->core.end - map->core.start;
+
+		bf_stripped = malloc(size);
+
+		if (bf_stripped == NULL) {
+			pr_err("Failed to allocate off-cpu strip buffer\n");
+			return -ENOMEM;
+		}
+
+		stripped = off_cpu_strip(aio->rec->evlist, map, bf_stripped, size);
+
+		if (stripped < 0) {
+			size = (int)stripped;
+			goto out;
+		}
+
+		size = stripped;
+		buf = bf_stripped;
+	}
+
 	if (record__comp_enabled(aio->rec)) {
 		ssize_t compressed = zstd_compress(aio->rec->session, NULL, aio->data + aio->size,
 						   mmap__mmap_len(map) - aio->size,
@@ -432,6 +459,9 @@ static int record__aio_pushfn(struct mmap *map, void *to, void *buf, size_t size
 
 	aio->size += size;
 
+out:
+	free(bf_stripped);
+
 	return size;
 }
 
@@ -635,6 +665,38 @@ static int process_locked_synthesized_event(struct perf_tool *tool,
 static int record__pushfn(struct mmap *map, void *to, void *bf, size_t size)
 {
 	struct record *rec = to;
+	int err;
+	char *bf_stripped = NULL;
+	ssize_t stripped;
+
+	if (rec->off_cpu) {
+		/*
+		 * We'll read all the events at once without masking.
+		 * When reading the remainder from a map, the size is 0, because
+		 * start is shifted to the end so no more data is to be read.
+		 */
+		if (size == 0)
+			return 0;
+
+		map->core.start -= size;
+		/* size in total */
+		size = map->core.end - map->core.start;
+
+		bf_stripped = malloc(size);
+
+		if (bf_stripped == NULL) {
+			pr_err("Failed to allocate off-cpu strip buffer\n");
+			return -ENOMEM;
+		}
+
+		stripped = off_cpu_strip(rec->evlist, map, bf_stripped, size);
+
+		if (stripped < 0)
+			return (int)stripped;
+
+		size = stripped;
+		bf = bf_stripped;
+	}
 
 	if (record__comp_enabled(rec)) {
 		ssize_t compressed = zstd_compress(rec->session, map, map->data,
@@ -648,7 +710,11 @@ static int record__pushfn(struct mmap *map, void *to, void *bf, size_t size)
 	}
 
 	thread->samples++;
-	return record__write(rec, map, bf, size);
+	err = record__write(rec, map, bf, size);
+
+	free(bf_stripped);
+
+	return err;
 }
 
 static volatile sig_atomic_t signr = -1;
@@ -1790,6 +1856,7 @@ record__finish_output(struct record *rec)
 		if (rec->buildid_all)
 			perf_session__dsos_hit_all(rec->session);
 	}
+
 	perf_session__write_header(rec->session, rec->evlist, fd, true);
 
 	return;
@@ -2501,6 +2568,14 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
 		}
 	}
 
+	if (rec->off_cpu) {
+		err = record__config_off_cpu(rec);
+		if (err) {
+			pr_err("record__config_off_cpu failed, error %d\n", err);
+			goto out_free_threads;
+		}
+	}
+
 	/*
 	 * Normally perf_session__new would do this, but it doesn't have the
 	 * evlist.
@@ -2764,6 +2839,7 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
 		 * disable events in this case.
 		 */
 		if (done && !disabled && !target__none(&opts->target)) {
+			perf_hooks__invoke_record_done();
 			trigger_off(&auxtrace_snapshot_trigger);
 			evlist__disable(rec->evlist);
 			disabled = true;
@@ -2827,14 +2903,17 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
 	} else
 		status = err;
 
-	if (rec->off_cpu)
-		rec->bytes_written += off_cpu_write(rec->session);
-
 	record__read_lost_samples(rec);
 	record__synthesize(rec, true);
 	/* this will be recalculated during process_buildids() */
 	rec->samples = 0;
 
+	/* change to the correct sample type for parsing */
+	if (rec->off_cpu && off_cpu_prepare_parse(rec->evlist)) {
+		pr_err("ERROR: Failed to change sample type for off-cpu event\n");
+		goto out_delete_session;
+	}
+
 	if (!err) {
 		if (!rec->timestamp_filename) {
 			record__finish_output(rec);
@@ -3198,7 +3277,7 @@ static int switch_output_setup(struct record *rec)
 	unsigned long val;
 
 	/*
-	 * If we're using --switch-output-events, then we imply its 
+	 * If we're using --switch-output-events, then we imply its
 	 * --switch-output=signal, as we'll send a SIGUSR2 from the side band
 	 *  thread to its parent.
 	 */
@@ -4221,9 +4300,14 @@ int cmd_record(int argc, const char **argv)
 	}
 
 	if (rec->off_cpu) {
-		err = record__config_off_cpu(rec);
+		char off_cpu_event[64];
+
+		snprintf(off_cpu_event, sizeof(off_cpu_event),
+			 "bpf-output/no-inherit=1,name=%s/", OFFCPU_EVENT);
+
+		err = parse_event(rec->evlist, off_cpu_event);
 		if (err) {
-			pr_err("record__config_off_cpu failed, error %d\n", err);
+			pr_err("Failed to open off-cpu event\n");
 			goto out;
 		}
 	}
diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index 3536404e9447..c08ae6a3c8d6 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -1092,11 +1092,6 @@ static void evsel__set_default_freq_period(struct record_opts *opts,
 	}
 }
 
-static bool evsel__is_offcpu_event(struct evsel *evsel)
-{
-	return evsel__is_bpf_output(evsel) && evsel__name_is(evsel, OFFCPU_EVENT);
-}
-
 /*
  * The enable_on_exec/disabled value strategy:
  *
@@ -1363,9 +1358,6 @@ void evsel__config(struct evsel *evsel, struct record_opts *opts,
 	if (evsel__is_dummy_event(evsel))
 		evsel__reset_sample_bit(evsel, BRANCH_STACK);
 
-	if (evsel__is_offcpu_event(evsel))
-		evsel->core.attr.sample_type &= OFFCPU_SAMPLE_TYPES;
-
 	arch__post_evsel_config(evsel, attr);
 }
 
-- 
2.44.0



* [PATCH v2 2/4] perf record off-cpu: BPF perf_event_output on sched_switch
  2024-04-24  2:48 [PATCH v2 0/4] Dump off-cpu samples directly Howard Chu
  2024-04-24  2:48 ` [PATCH v2 1/4] perf record off-cpu: Parse off-cpu event, change config location Howard Chu
@ 2024-04-24  2:48 ` Howard Chu
  2024-04-24  2:48 ` [PATCH v2 3/4] perf record off-cpu: extract off-cpu sample data from raw_data Howard Chu
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 9+ messages in thread
From: Howard Chu @ 2024-04-24  2:48 UTC (permalink / raw)
  To: peterz
  Cc: mingo, acme, namhyung, mark.rutland, alexander.shishkin, jolsa,
	irogers, adrian.hunter, kan.liang, zegao2021, leo.yan,
	ravi.bangoria, linux-perf-users, linux-kernel, bpf

bpf_perf_event_output() the off-cpu samples on sched_switch. Because
can_record() returns 0 most of the time, we can't always collect stacks,
so when a stack trace is collectable, store it in stack_save for later
output. If we don't do that, most of the off-cpu samples won't have a
stack trace. And since stack traces are kept in task storage, we don't
need to worry about the maps overflowing.

There is a threshold, OUTPUT_THRESHOLD (in ns), that sets the minimum
off-CPU time required to trigger output; it is currently set to zero.
I need opinions on this value.

Signed-off-by: Howard Chu <howardchu95@gmail.com>
---
 tools/perf/util/bpf_skel/off_cpu.bpf.c | 163 ++++++++++++++++++++-----
 1 file changed, 135 insertions(+), 28 deletions(-)

diff --git a/tools/perf/util/bpf_skel/off_cpu.bpf.c b/tools/perf/util/bpf_skel/off_cpu.bpf.c
index d877a0a9731f..81114de2436d 100644
--- a/tools/perf/util/bpf_skel/off_cpu.bpf.c
+++ b/tools/perf/util/bpf_skel/off_cpu.bpf.c
@@ -17,9 +17,13 @@
 
 #define MAX_STACKS   32
 #define MAX_ENTRIES  102400
+#define MAX_CPUS  4096
+#define MAX_OFFCPU_LEN 128
+
+/* minimum offcpu time to trigger output */
+#define OUTPUT_THRESHOLD 0ULL
 
 struct tstamp_data {
-	__u32 stack_id;
 	__u32 state;
 	__u64 timestamp;
 };
@@ -27,17 +31,17 @@ struct tstamp_data {
 struct offcpu_key {
 	__u32 pid;
 	__u32 tgid;
-	__u32 stack_id;
 	__u32 state;
 	__u64 cgroup_id;
 };
 
-struct {
-	__uint(type, BPF_MAP_TYPE_STACK_TRACE);
-	__uint(key_size, sizeof(__u32));
-	__uint(value_size, MAX_STACKS * sizeof(__u64));
-	__uint(max_entries, MAX_ENTRIES);
-} stacks SEC(".maps");
+struct offcpu_array {
+	u64 array[MAX_OFFCPU_LEN];
+};
+
+struct stack_array {
+	u64 array[MAX_STACKS];
+};
 
 struct {
 	__uint(type, BPF_MAP_TYPE_TASK_STORAGE);
@@ -46,13 +50,6 @@ struct {
 	__type(value, struct tstamp_data);
 } tstamp SEC(".maps");
 
-struct {
-	__uint(type, BPF_MAP_TYPE_HASH);
-	__uint(key_size, sizeof(struct offcpu_key));
-	__uint(value_size, sizeof(__u64));
-	__uint(max_entries, MAX_ENTRIES);
-} off_cpu SEC(".maps");
-
 struct {
 	__uint(type, BPF_MAP_TYPE_HASH);
 	__uint(key_size, sizeof(__u32));
@@ -74,6 +71,34 @@ struct {
 	__uint(max_entries, 1);
 } cgroup_filter SEC(".maps");
 
+struct {
+	__uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY);
+	__uint(key_size, sizeof(__u32));
+	__uint(value_size, sizeof(__u32));
+	__uint(max_entries, MAX_CPUS);
+} offcpu_output SEC(".maps");
+
+struct {
+	__uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
+	__uint(key_size, sizeof(__u32));
+	__uint(value_size, sizeof(struct offcpu_array));
+	__uint(max_entries, 1);
+} offcpu_data SEC(".maps");
+
+struct {
+	__uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
+	__uint(key_size, sizeof(__u32));
+	__uint(value_size, sizeof(struct stack_array));
+	__uint(max_entries, 1);
+} stack_frame SEC(".maps");
+
+struct {
+	__uint(type, BPF_MAP_TYPE_TASK_STORAGE);
+	__uint(map_flags, BPF_F_NO_PREALLOC);
+	__type(key, int);
+	__type(value, struct stack_array);
+} stack_save SEC(".maps");
+
 /* new kernel task_struct definition */
 struct task_struct___new {
 	long __state;
@@ -96,6 +121,8 @@ const volatile bool uses_cgroup_v1 = false;
 
 int perf_subsys_id = -1;
 
+u64 sample_id, sample_type;
+
 /*
  * Old kernel used to call it task_struct->state and now it's '__state'.
  * Use BPF CO-RE "ignored suffix rule" to deal with it like below:
@@ -182,50 +209,130 @@ static inline int can_record(struct task_struct *t, int state)
 	return 1;
 }
 
+static inline bool check_bounds(int index)
+{
+	if (index >= 0 && index < MAX_OFFCPU_LEN)
+		return true;
+
+	return false;
+}
+
+static inline int copy_stack(struct stack_array *from,
+			     struct offcpu_array *to, int n)
+{
+	int max_stacks = MAX_STACKS, len = 0;
+
+	if (!from)
+		return len;
+
+	for (int i = 0; i < max_stacks && from->array[i]; ++i) {
+		if (check_bounds(n + 2 + i)) {
+			to->array[n + 2 + i] = from->array[i];
+			++len;
+		}
+	}
+	return len;
+}
+
 static int off_cpu_stat(u64 *ctx, struct task_struct *prev,
 			struct task_struct *next, int state)
 {
 	__u64 ts;
-	__u32 stack_id;
 	struct tstamp_data *pelem;
-
+	struct stack_array *frame, *stack_save_p;
 	ts = bpf_ktime_get_ns();
+	int zero = 0, len = 0, size;
 
 	if (!can_record(prev, state))
 		goto next;
 
-	stack_id = bpf_get_stackid(ctx, &stacks,
-				   BPF_F_FAST_STACK_CMP | BPF_F_USER_STACK);
+	frame = bpf_map_lookup_elem(&stack_frame, &zero);
+	if (frame)
+		len = bpf_get_stack(ctx, frame->array, MAX_STACKS * sizeof(u64),
+				    BPF_F_USER_STACK) / sizeof(u64);
+
+	/* save stacks if collectable */
+	if (len > 0) {
+		stack_save_p = bpf_task_storage_get(&stack_save, prev, NULL,
+						    BPF_LOCAL_STORAGE_GET_F_CREATE);
+		if (stack_save_p)
+			for (int i = 0; i < len && i < MAX_STACKS; ++i)
+				stack_save_p->array[i] = frame->array[i];
+	}
 
 	pelem = bpf_task_storage_get(&tstamp, prev, NULL,
 				     BPF_LOCAL_STORAGE_GET_F_CREATE);
+
 	if (!pelem)
 		goto next;
 
 	pelem->timestamp = ts;
 	pelem->state = state;
-	pelem->stack_id = stack_id;
 
 next:
 	pelem = bpf_task_storage_get(&tstamp, next, NULL, 0);
 
+	stack_save_p = bpf_task_storage_get(&stack_save, next, NULL, 0);
+
 	if (pelem && pelem->timestamp) {
 		struct offcpu_key key = {
 			.pid = next->pid,
 			.tgid = next->tgid,
-			.stack_id = pelem->stack_id,
 			.state = pelem->state,
 			.cgroup_id = needs_cgroup ? get_cgroup_id(next) : 0,
 		};
-		__u64 delta = ts - pelem->timestamp;
-		__u64 *total;
 
-		total = bpf_map_lookup_elem(&off_cpu, &key);
-		if (total)
-			*total += delta;
-		else
-			bpf_map_update_elem(&off_cpu, &key, &delta, BPF_ANY);
+		__u64 delta = ts - pelem->timestamp;
 
+		struct offcpu_array *data = bpf_map_lookup_elem(&offcpu_data, &zero);
+
+		if (data && delta >= OUTPUT_THRESHOLD) {
+			int n = 0;
+			int ip_pos = -1;
+
+			if (sample_type & PERF_SAMPLE_IDENTIFIER && check_bounds(n))
+				data->array[n++] = sample_id;
+			if (sample_type & PERF_SAMPLE_IP && check_bounds(n)) {
+				ip_pos = n;
+				data->array[n++] = 0;  /* will be updated */
+			}
+			if (sample_type & PERF_SAMPLE_TID && check_bounds(n))
+				data->array[n++] = (u64)key.pid << 32 | key.tgid;
+			if (sample_type & PERF_SAMPLE_TIME && check_bounds(n))
+				data->array[n++] = pelem->timestamp;
+			if (sample_type & PERF_SAMPLE_ID && check_bounds(n))
+				data->array[n++] = sample_id;
+			if (sample_type & PERF_SAMPLE_CPU && check_bounds(n))
+				data->array[n++] = 0;
+			if (sample_type & PERF_SAMPLE_PERIOD && check_bounds(n))
+				data->array[n++] = delta;
+			if (sample_type & PERF_SAMPLE_CALLCHAIN && check_bounds(n + 2)) {
+				len = 0;
+
+				/* data->array[n] is callchain->nr (updated later) */
+				data->array[n + 1] = PERF_CONTEXT_USER;
+				data->array[n + 2] = 0;
+
+				len = copy_stack(stack_save_p, data, n);
+
+				/* update length of callchain */
+				data->array[n] = len + 1;
+
+				/* update sample ip with the first callchain entry */
+				if (ip_pos >= 0)
+					data->array[ip_pos] = data->array[n + 2];
+
+				/* calculate sample callchain data->array length */
+				n += len + 2;
+			}
+			if (sample_type & PERF_SAMPLE_CGROUP && check_bounds(n))
+				data->array[n++] = key.cgroup_id;
+
+			size = n * sizeof(u64);
+			if (size >= 0 && size <= MAX_OFFCPU_LEN * sizeof(u64))
+				bpf_perf_event_output(ctx, &offcpu_output, BPF_F_CURRENT_CPU,
+						      data, size);
+		}
 		/* prevent to reuse the timestamp later */
 		pelem->timestamp = 0;
 	}
-- 
2.44.0



* [PATCH v2 3/4] perf record off-cpu: extract off-cpu sample data from raw_data
  2024-04-24  2:48 [PATCH v2 0/4] Dump off-cpu samples directly Howard Chu
  2024-04-24  2:48 ` [PATCH v2 1/4] perf record off-cpu: Parse off-cpu event, change config location Howard Chu
  2024-04-24  2:48 ` [PATCH v2 2/4] perf record off-cpu: BPF perf_event_output on sched_switch Howard Chu
@ 2024-04-24  2:48 ` Howard Chu
  2024-04-24  2:48 ` [PATCH v2 4/4] perf record off-cpu: delete bound-to-fail test Howard Chu
  2024-04-24 19:12 ` [PATCH v2 0/4] Dump off-cpu samples directly Namhyung Kim
  4 siblings, 0 replies; 9+ messages in thread
From: Howard Chu @ 2024-04-24  2:48 UTC (permalink / raw)
  To: peterz
  Cc: mingo, acme, namhyung, mark.rutland, alexander.shishkin, jolsa,
	irogers, adrian.hunter, kan.liang, zegao2021, leo.yan,
	ravi.bangoria, linux-perf-users, linux-kernel, bpf

Because we need to extract the real sample data from raw_data, an
off_cpu_strip() function is needed. In this function, we read events one
by one and extract the actual samples from the raw data. After stripping
is done, the new stripped buffer is written to the file (or compressed).

The size is end - start without masking, because masking is handled by
perf_mmap__read_event(). Also, there's no need to call
perf_mmap__consume(), as it will be called by perf_mmap__push(). We read
all the data at once, so start is moved to end; when perf_mmap__push()
calls the push function a second time, size is zero and we return
directly.

Hook off_cpu_finish to record_done instead of record_end.

Moving the record_end hook would also work, because currently only
off-cpu uses these hooks, but technically the record isn't over when we
need to turn off BPF output, so moving it could be confusing. The reason
there's a separate off_cpu_prepare_parse() is that it cannot be folded
into off_cpu_finish(), because there are still samples to be read (which
requires sample_type to stay PERF_SAMPLE_RAW), and it cannot be hooked
to record_end, because it has to run before record__finish_output().

Signed-off-by: Howard Chu <howardchu95@gmail.com>
---
 tools/perf/util/bpf_off_cpu.c     | 242 +++++++++++++-----------------
 tools/perf/util/off_cpu.h         |  14 +-
 tools/perf/util/perf-hooks-list.h |   1 +
 3 files changed, 118 insertions(+), 139 deletions(-)

diff --git a/tools/perf/util/bpf_off_cpu.c b/tools/perf/util/bpf_off_cpu.c
index 6af36142dc5a..4b055b77d734 100644
--- a/tools/perf/util/bpf_off_cpu.c
+++ b/tools/perf/util/bpf_off_cpu.c
@@ -12,6 +12,9 @@
 #include "util/thread_map.h"
 #include "util/cgroup.h"
 #include "util/strlist.h"
+#include "util/mmap.h"
+#include "util/sample.h"
+#include <perf/mmap.h>
 #include <bpf/bpf.h>
 
 #include "bpf_skel/off_cpu.skel.h"
@@ -23,51 +26,6 @@
 
 static struct off_cpu_bpf *skel;
 
-struct off_cpu_key {
-	u32 pid;
-	u32 tgid;
-	u32 stack_id;
-	u32 state;
-	u64 cgroup_id;
-};
-
-union off_cpu_data {
-	struct perf_event_header hdr;
-	u64 array[1024 / sizeof(u64)];
-};
-
-static int off_cpu_config(struct evlist *evlist)
-{
-	struct evsel *evsel;
-	struct perf_event_attr attr = {
-		.type	= PERF_TYPE_SOFTWARE,
-		.config = PERF_COUNT_SW_BPF_OUTPUT,
-		.size	= sizeof(attr), /* to capture ABI version */
-	};
-	char *evname = strdup(OFFCPU_EVENT);
-
-	if (evname == NULL)
-		return -ENOMEM;
-
-	evsel = evsel__new(&attr);
-	if (!evsel) {
-		free(evname);
-		return -ENOMEM;
-	}
-
-	evsel->core.attr.freq = 1;
-	evsel->core.attr.sample_period = 1;
-	/* off-cpu analysis depends on stack trace */
-	evsel->core.attr.sample_type = PERF_SAMPLE_CALLCHAIN;
-
-	evlist__add(evlist, evsel);
-
-	free(evsel->name);
-	evsel->name = evname;
-
-	return 0;
-}
-
 static void off_cpu_start(void *arg)
 {
 	struct evlist *evlist = arg;
@@ -125,18 +83,29 @@ static void check_sched_switch_args(void)
 	btf__free(btf);
 }
 
+int off_cpu_prepare_parse(struct evlist *evlist)
+{
+	struct evsel *evsel;
+
+	evsel = evlist__find_evsel_by_str(evlist, OFFCPU_EVENT);
+	if (evsel == NULL)
+		return -1;
+
+	evsel->core.attr.sample_type = OFFCPU_SAMPLE_TYPES;
+
+	return 0;
+}
+
 int off_cpu_prepare(struct evlist *evlist, struct target *target,
 		    struct record_opts *opts)
 {
 	int err, fd, i;
 	int ncpus = 1, ntasks = 1, ncgrps = 1;
+	u64 sid = 0;
 	struct strlist *pid_slist = NULL;
 	struct str_node *pos;
-
-	if (off_cpu_config(evlist) < 0) {
-		pr_err("Failed to config off-cpu BPF event\n");
-		return -1;
-	}
+	struct evsel *evsel;
+	struct perf_cpu pcpu;
 
 	skel = off_cpu_bpf__open();
 	if (!skel) {
@@ -250,7 +219,6 @@ int off_cpu_prepare(struct evlist *evlist, struct target *target,
 	}
 
 	if (evlist__first(evlist)->cgrp) {
-		struct evsel *evsel;
 		u8 val = 1;
 
 		skel->bss->has_cgroup = 1;
@@ -272,6 +240,25 @@ int off_cpu_prepare(struct evlist *evlist, struct target *target,
 		}
 	}
 
+	evsel = evlist__find_evsel_by_str(evlist, OFFCPU_EVENT);
+	if (evsel == NULL) {
+		pr_err("%s evsel not found\n", OFFCPU_EVENT);
+		goto out;
+	}
+
+	if (evsel->core.id)
+		sid = evsel->core.id[0];
+
+	skel->bss->sample_id = sid;
+	skel->bss->sample_type = OFFCPU_SAMPLE_TYPES;
+
+	perf_cpu_map__for_each_cpu(pcpu, i, evsel->core.cpus) {
+		bpf_map__update_elem(skel->maps.offcpu_output,
+				     &pcpu.cpu, sizeof(int),
+				     xyarray__entry(evsel->core.fd, pcpu.cpu, 0),
+				     sizeof(__u32), BPF_ANY);
+	}
+
 	err = off_cpu_bpf__attach(skel);
 	if (err) {
 		pr_err("Failed to attach off-cpu BPF skeleton\n");
@@ -279,7 +266,7 @@ int off_cpu_prepare(struct evlist *evlist, struct target *target,
 	}
 
 	if (perf_hooks__set_hook("record_start", off_cpu_start, evlist) ||
-	    perf_hooks__set_hook("record_end", off_cpu_finish, evlist)) {
+	    perf_hooks__set_hook("record_done", off_cpu_finish, evlist)) {
 		pr_err("Failed to attach off-cpu skeleton\n");
 		goto out;
 	}
@@ -291,105 +278,88 @@ int off_cpu_prepare(struct evlist *evlist, struct target *target,
 	return -1;
 }
 
-int off_cpu_write(struct perf_session *session)
+ssize_t off_cpu_strip(struct evlist *evlist, struct mmap *mp, char *dst, size_t size)
 {
-	int bytes = 0, size;
-	int fd, stack;
-	u64 sample_type, val, sid = 0;
+	/*
+	 * In this function, we read events one by one,
+	 * stripping actual samples from raw data.
+	 */
+
+	union perf_event *event, tmp;
+	u64 sample_type = OFFCPU_SAMPLE_TYPES;
+	size_t written = 0, event_sz, write_sz, raw_sz_aligned, offset = 0;
+	void *src;
+	int err = 0, n = 0;
+	struct perf_sample sample;
 	struct evsel *evsel;
-	struct perf_data_file *file = &session->data->file;
-	struct off_cpu_key prev, key;
-	union off_cpu_data data = {
-		.hdr = {
-			.type = PERF_RECORD_SAMPLE,
-			.misc = PERF_RECORD_MISC_USER,
-		},
-	};
-	u64 tstamp = OFF_CPU_TIMESTAMP;
-
-	skel->bss->enabled = 0;
 
-	evsel = evlist__find_evsel_by_str(session->evlist, OFFCPU_EVENT);
+	evsel = evlist__find_evsel_by_str(evlist, OFFCPU_EVENT);
 	if (evsel == NULL) {
 		pr_err("%s evsel not found\n", OFFCPU_EVENT);
-		return 0;
-	}
-
-	sample_type = evsel->core.attr.sample_type;
-
-	if (sample_type & ~OFFCPU_SAMPLE_TYPES) {
-		pr_err("not supported sample type: %llx\n",
-		       (unsigned long long)sample_type);
 		return -1;
 	}
 
-	if (sample_type & (PERF_SAMPLE_ID | PERF_SAMPLE_IDENTIFIER)) {
-		if (evsel->core.id)
-			sid = evsel->core.id[0];
-	}
+	/* for writing sample time */
+	if (sample_type & PERF_SAMPLE_IDENTIFIER)
+		++n;
+	if (sample_type & PERF_SAMPLE_IP)
+		++n;
+	if (sample_type & PERF_SAMPLE_TID)
+		++n;
+
+	/* no need for perf_mmap__consume(), it will be handled by perf_mmap__push() */
+	while ((event = perf_mmap__read_event(&mp->core)) != NULL) {
+		event_sz = event->header.size;
+		write_sz = event_sz;
+		src = event;
+
+		if (event->header.type == PERF_RECORD_SAMPLE) {
+			err = evlist__parse_sample(evlist, event, &sample);
+			if (err) {
+				pr_err("Failed to parse off-cpu sample\n");
+				return -1;
+			}
 
-	fd = bpf_map__fd(skel->maps.off_cpu);
-	stack = bpf_map__fd(skel->maps.stacks);
-	memset(&prev, 0, sizeof(prev));
+			if (sample.raw_data && evsel->core.id) {
+				bool flag = false;
 
-	while (!bpf_map_get_next_key(fd, &prev, &key)) {
-		int n = 1;  /* start from perf_event_header */
-		int ip_pos = -1;
+				for (u32 i = 0; i < evsel->core.ids; i++) {
+					if (sample.id == evsel->core.id[i]) {
+						flag = true;
+						break;
+					}
+				}
+				if (flag) {
+					memcpy(&tmp, event, event_sz);
 
-		bpf_map_lookup_elem(fd, &key, &val);
+					/* raw data has extra bits for alignment, discard them */
+					raw_sz_aligned = sample.raw_size - sizeof(u32);
+					memcpy(tmp.sample.array, sample.raw_data, raw_sz_aligned);
 
-		if (sample_type & PERF_SAMPLE_IDENTIFIER)
-			data.array[n++] = sid;
-		if (sample_type & PERF_SAMPLE_IP) {
-			ip_pos = n;
-			data.array[n++] = 0;  /* will be updated */
-		}
-		if (sample_type & PERF_SAMPLE_TID)
-			data.array[n++] = (u64)key.pid << 32 | key.tgid;
-		if (sample_type & PERF_SAMPLE_TIME)
-			data.array[n++] = tstamp;
-		if (sample_type & PERF_SAMPLE_ID)
-			data.array[n++] = sid;
-		if (sample_type & PERF_SAMPLE_CPU)
-			data.array[n++] = 0;
-		if (sample_type & PERF_SAMPLE_PERIOD)
-			data.array[n++] = val;
-		if (sample_type & PERF_SAMPLE_CALLCHAIN) {
-			int len = 0;
-
-			/* data.array[n] is callchain->nr (updated later) */
-			data.array[n + 1] = PERF_CONTEXT_USER;
-			data.array[n + 2] = 0;
-
-			bpf_map_lookup_elem(stack, &key.stack_id, &data.array[n + 2]);
-			while (data.array[n + 2 + len])
-				len++;
-
-			/* update length of callchain */
-			data.array[n] = len + 1;
-
-			/* update sample ip with the first callchain entry */
-			if (ip_pos >= 0)
-				data.array[ip_pos] = data.array[n + 2];
-
-			/* calculate sample callchain data array length */
-			n += len + 2;
-		}
-		if (sample_type & PERF_SAMPLE_CGROUP)
-			data.array[n++] = key.cgroup_id;
+					write_sz = sizeof(struct perf_event_header) +
+							  raw_sz_aligned;
+
+					/* without this we'll have out of order events */
+					if (sample_type & PERF_SAMPLE_TIME)
+						tmp.sample.array[n] = sample.time;
 
-		size = n * sizeof(u64);
-		data.hdr.size = size;
-		bytes += size;
+					tmp.header.size = write_sz;
+					tmp.header.type = PERF_RECORD_SAMPLE;
+					tmp.header.misc = PERF_RECORD_MISC_USER;
 
-		if (perf_data_file__write(file, &data, size) < 0) {
-			pr_err("failed to write perf data, error: %m\n");
-			return bytes;
+					src = &tmp;
+				}
+			}
 		}
+		if (offset + event_sz > size || written + write_sz > size)
+			break;
 
-		prev = key;
-		/* increase dummy timestamp to sort later samples */
-		tstamp++;
+		memcpy(dst, src, write_sz);
+
+		dst += write_sz;
+		written += write_sz;
+		offset += event_sz;
 	}
-	return bytes;
+
+	return written;
 }
diff --git a/tools/perf/util/off_cpu.h b/tools/perf/util/off_cpu.h
index 2dd67c60f211..1d195cd045ad 100644
--- a/tools/perf/util/off_cpu.h
+++ b/tools/perf/util/off_cpu.h
@@ -20,7 +20,9 @@ struct record_opts;
 #ifdef HAVE_BPF_SKEL
 int off_cpu_prepare(struct evlist *evlist, struct target *target,
 		    struct record_opts *opts);
-int off_cpu_write(struct perf_session *session);
+ssize_t off_cpu_strip(struct evlist *evlist, struct mmap *mp,
+		      char *dst, size_t size);
+int off_cpu_prepare_parse(struct evlist *evlist);
 #else
 static inline int off_cpu_prepare(struct evlist *evlist __maybe_unused,
 				  struct target *target __maybe_unused,
@@ -28,8 +30,14 @@ static inline int off_cpu_prepare(struct evlist *evlist __maybe_unused,
 {
 	return -1;
 }
-
-static inline int off_cpu_write(struct perf_session *session __maybe_unused)
+static inline ssize_t off_cpu_strip(struct evlist *evlist __maybe_unused,
+				    struct mmap *mp __maybe_unused,
+				    char *dst __maybe_unused,
+				    size_t size __maybe_unused)
+{
+	return -1;
+}
+static inline int off_cpu_prepare_parse(struct evlist *evlist __maybe_unused)
 {
 	return -1;
 }
diff --git a/tools/perf/util/perf-hooks-list.h b/tools/perf/util/perf-hooks-list.h
index 2867c07ee84e..1ce4d44ace35 100644
--- a/tools/perf/util/perf-hooks-list.h
+++ b/tools/perf/util/perf-hooks-list.h
@@ -1,3 +1,4 @@
 PERF_HOOK(record_start)
 PERF_HOOK(record_end)
+PERF_HOOK(record_done)
 PERF_HOOK(test)
-- 
2.44.0



* [PATCH v2 4/4] perf record off-cpu: delete bound-to-fail test
  2024-04-24  2:48 [PATCH v2 0/4] Dump off-cpu samples directly Howard Chu
                   ` (2 preceding siblings ...)
  2024-04-24  2:48 ` [PATCH v2 3/4] perf record off-cpu: extract off-cpu sample data from raw_data Howard Chu
@ 2024-04-24  2:48 ` Howard Chu
  2024-04-24 19:12 ` [PATCH v2 0/4] Dump off-cpu samples directly Namhyung Kim
  4 siblings, 0 replies; 9+ messages in thread
From: Howard Chu @ 2024-04-24  2:48 UTC (permalink / raw)
  To: peterz
  Cc: mingo, acme, namhyung, mark.rutland, alexander.shishkin, jolsa,
	irogers, adrian.hunter, kan.liang, zegao2021, leo.yan,
	ravi.bangoria, linux-perf-users, linux-kernel, bpf

Since `--off-cpu` now uses the same ring buffer as hardware samples, and
`perf record --off-cpu -e dummy sleep 1` does not enable the evlist,
off-cpu samples cannot be read. `test_offcpu_basic` fails and is no
longer necessary.

Signed-off-by: Howard Chu <howardchu95@gmail.com>
---
 tools/perf/tests/shell/record_offcpu.sh | 29 -------------------------
 1 file changed, 29 deletions(-)

diff --git a/tools/perf/tests/shell/record_offcpu.sh b/tools/perf/tests/shell/record_offcpu.sh
index 67c925f3a15a..c446c0cdee4f 100755
--- a/tools/perf/tests/shell/record_offcpu.sh
+++ b/tools/perf/tests/shell/record_offcpu.sh
@@ -36,30 +36,6 @@ test_offcpu_priv() {
   fi
 }
 
-test_offcpu_basic() {
-  echo "Basic off-cpu test"
-
-  if ! perf record --off-cpu -e dummy -o ${perfdata} sleep 1 2> /dev/null
-  then
-    echo "Basic off-cpu test [Failed record]"
-    err=1
-    return
-  fi
-  if ! perf evlist -i ${perfdata} | grep -q "offcpu-time"
-  then
-    echo "Basic off-cpu test [Failed no event]"
-    err=1
-    return
-  fi
-  if ! perf report -i ${perfdata} -q --percent-limit=90 | grep -E -q sleep
-  then
-    echo "Basic off-cpu test [Failed missing output]"
-    err=1
-    return
-  fi
-  echo "Basic off-cpu test [Success]"
-}
-
 test_offcpu_child() {
   echo "Child task off-cpu test"
 
@@ -88,13 +64,8 @@ test_offcpu_child() {
   echo "Child task off-cpu test [Success]"
 }
 
-
 test_offcpu_priv
 
-if [ $err = 0 ]; then
-  test_offcpu_basic
-fi
-
 if [ $err = 0 ]; then
   test_offcpu_child
 fi
-- 
2.44.0



* Re: [PATCH v2 0/4] Dump off-cpu samples directly
  2024-04-24  2:48 [PATCH v2 0/4] Dump off-cpu samples directly Howard Chu
                   ` (3 preceding siblings ...)
  2024-04-24  2:48 ` [PATCH v2 4/4] perf record off-cpu: delete bound-to-fail test Howard Chu
@ 2024-04-24 19:12 ` Namhyung Kim
  2024-04-24 21:11   ` Arnaldo Carvalho de Melo
  4 siblings, 1 reply; 9+ messages in thread
From: Namhyung Kim @ 2024-04-24 19:12 UTC (permalink / raw)
  To: Howard Chu
  Cc: peterz, mingo, acme, mark.rutland, alexander.shishkin, jolsa,
	irogers, adrian.hunter, kan.liang, zegao2021, leo.yan,
	ravi.bangoria, linux-perf-users, linux-kernel, bpf

Hello,

On Tue, Apr 23, 2024 at 7:46 PM Howard Chu <howardchu95@gmail.com> wrote:
>
> As mentioned in: https://bugzilla.kernel.org/show_bug.cgi?id=207323
>
> Currently, off-cpu samples are dumped when perf record exits. This
> results in off-cpu samples appearing after the regular samples. Also,
> samples are stored in large BPF maps, which contain all the stack traces
> and accumulated off-cpu time but will eventually fill up when running
> for an extended period. This patchset fixes those problems by dumping
> samples directly into the perf ring buffer and converting them to the
> correct format.

Thanks for working on this.

But the problem of dumping all sched-switch events is that it can be
too frequent on loaded machines.  Copying many events to the buffer
can result in losing other records.  As perf report doesn't care about
timing much, I decided to aggregate the result in a BPF map and dump
them at the end of the profiling session.

Maybe that's not a concern for you (or smaller systems).  Then I think
we can keep the original behavior and add a new option (I'm not good
at naming things, but maybe --off-cpu-sample?) to work differently
instead of removing the old behavior.

Thanks,
Namhyung

>
> Before, off-cpu samples come after the regular samples:
>
> ```
>          swapper       0 [000] 963432.136150:    2812933    cycles:P:  ffffffffb7db1bc2 intel_idle+0x62 ([kernel.kallsyms])
>          swapper       0 [000] 963432.637911:    4932876    cycles:P:  ffffffffb7db1bc2 intel_idle+0x62 ([kernel.kallsyms])
>          swapper       0 [001] 963432.798072:    6273398    cycles:P:  ffffffffb7db1bc2 intel_idle+0x62 ([kernel.kallsyms])
>          swapper       0 [000] 963433.541152:    5279005    cycles:P:  ffffffffb7db1bc2 intel_idle+0x62 ([kernel.kallsyms])
> sh 1410180 [000] 18446744069.414584:    2528851 offcpu-time:
>             7837148e6e87 wait4+0x17 (/usr/lib/libc.so.6)
>
>
> sh 1410185 [000] 18446744069.414584:    2314223 offcpu-time:
>             7837148e6e87 wait4+0x17 (/usr/lib/libc.so.6)
>
>
> awk 1409644 [000] 18446744069.414584:     191785 offcpu-time:
>             702609d03681 read+0x11 (/usr/lib/libc.so.6)
>                   4a02a4 [unknown] ([unknown])
> ```
>
>
> After, regular samples (cycles:P) and off-cpu (offcpu-time) samples are
> collected simultaneously:
>
> ```
> upowerd     741 [000] 963757.428701:     297848 offcpu-time:
>             72b2da11e6bc read+0x4c (/usr/lib/libc.so.6)
>
>
>       irq/9-acpi      56 [000] 963757.429116:    8760875    cycles:P:  ffffffffb779849f acpi_os_read_port+0x2f ([kernel.kallsyms])
> upowerd     741 [000] 963757.429172:     459522 offcpu-time:
>             72b2da11e6bc read+0x4c (/usr/lib/libc.so.6)
>
>
>          swapper       0 [002] 963757.434529:    5759904    cycles:P:  ffffffffb7db1bc2 intel_idle+0x62 ([kernel.kallsyms])
> perf 1419260 [000] 963757.434550: 1001012116 offcpu-time:
>             7274e5d190bf __poll+0x4f (/usr/lib/libc.so.6)
>             591acfc5daf0 perf_evlist__poll+0x24 (/root/hw/perf-tools-next/tools/perf/perf)
>             591acfb1ca50 perf_evlist__poll_thread+0x160 (/root/hw/perf-tools-next/tools/perf/perf)
>             7274e5ca955a [unknown] (/usr/lib/libc.so.6)
> ```
>
> Here's a simple flowchart:
>
> [parse_event (sample type: PERF_SAMPLE_RAW)]
>   --> [config (bind fds, sample_id, sample_type)]
>   --> [off_cpu_strip (sample type: PERF_SAMPLE_RAW)]
>   --> [record_done (hooks off_cpu_finish)]
>   --> [prepare_parse (sample type: OFFCPU_SAMPLE_TYPES)]
>
> Changes in v2:
>  - Remove unnecessary comments.
>  - Rename function off_cpu_change_type to off_cpu_prepare_parse
>
> Howard Chu (4):
>   perf record off-cpu: Parse off-cpu event, change config location
>   perf record off-cpu: BPF perf_event_output on sched_switch
>   perf record off-cpu: extract off-cpu sample data from raw_data
>   perf record off-cpu: delete bound-to-fail test
>
>  tools/perf/builtin-record.c             |  98 +++++++++-
>  tools/perf/tests/shell/record_offcpu.sh |  29 ---
>  tools/perf/util/bpf_off_cpu.c           | 242 +++++++++++-------------
>  tools/perf/util/bpf_skel/off_cpu.bpf.c  | 163 +++++++++++++---
>  tools/perf/util/evsel.c                 |   8 -
>  tools/perf/util/off_cpu.h               |  14 +-
>  tools/perf/util/perf-hooks-list.h       |   1 +
>  7 files changed, 344 insertions(+), 211 deletions(-)
>
> --
> 2.44.0
>


* Re: [PATCH v2 0/4] Dump off-cpu samples directly
  2024-04-24 19:12 ` [PATCH v2 0/4] Dump off-cpu samples directly Namhyung Kim
@ 2024-04-24 21:11   ` Arnaldo Carvalho de Melo
  2024-04-24 22:19     ` Ian Rogers
  0 siblings, 1 reply; 9+ messages in thread
From: Arnaldo Carvalho de Melo @ 2024-04-24 21:11 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: Howard Chu, peterz, mingo, mark.rutland, alexander.shishkin,
	jolsa, irogers, adrian.hunter, kan.liang, zegao2021, leo.yan,
	ravi.bangoria, linux-perf-users, linux-kernel, bpf

On Wed, Apr 24, 2024 at 12:12:26PM -0700, Namhyung Kim wrote:
> Hello,
> 
> On Tue, Apr 23, 2024 at 7:46 PM Howard Chu <howardchu95@gmail.com> wrote:
> >
> > As mentioned in: https://bugzilla.kernel.org/show_bug.cgi?id=207323
> >
> > Currently, off-cpu samples are dumped when perf record exits. This
> > results in off-cpu samples appearing after the regular samples. Also,
> > samples are stored in large BPF maps, which contain all the stack traces
> > and accumulated off-cpu time but will eventually fill up when running
> > for an extended period. This patchset fixes those problems by dumping
> > samples directly into the perf ring buffer and converting them to the
> > correct format.
> 
> Thanks for working on this.
> 
> But the problem of dumping all sched-switch events is that it can be
> too frequent on loaded machines.  Copying many events to the buffer
> can result in losing other records.  As perf report doesn't care about
> timing much, I decided to aggregate the result in a BPF map and dump
> them at the end of the profiling session.

Should we try to adapt when there are too many context switches, i.e.
the BPF program can notice that the interval from the last context
switch is too small and then avoid adding samples, while if the interval
is a long one then indeed this is a problem where the workload is
waiting for a long time for something and we want to know what that is,
and in that case capturing callchains is both desirable and not costly,
no?
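
Something along these lines on the BPF side would be a rough sketch of
that adaptive idea (the last_switch task-storage map and the ADAPT_MIN_NS
cutoff are hypothetical, not part of this series):

```
/*
 * Rough sketch only: skip the costly stack capture when context
 * switches come too fast, and sample normally after long gaps.
 */
struct {
	__uint(type, BPF_MAP_TYPE_TASK_STORAGE);
	__uint(map_flags, BPF_F_NO_PREALLOC);
	__type(key, int);
	__type(value, __u64);
} last_switch SEC(".maps");

#define ADAPT_MIN_NS 50000ULL	/* placeholder: 50us */

static __always_inline bool worth_sampling(struct task_struct *prev, __u64 now)
{
	__u64 *last = bpf_task_storage_get(&last_switch, prev, NULL,
					   BPF_LOCAL_STORAGE_GET_F_CREATE);
	bool ok = !last || now - *last >= ADAPT_MIN_NS;

	if (last)
		*last = now;
	return ok;	/* long gap: capture a callchain; short gap: skip */
}
```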

The tool could then at the end produce one of two outputs: the most
common reasons for being off cpu, or some sort of counter stating that
there are way too many context switches?

And perhaps we should think about what is best to have as a default, not
to present just plain old cycles, but point out that the workload is
most of the time waiting for IO, etc, i.e. the default should give
interesting clues instead of expecting that the tool user knows all the
possible knobs and try them in all sorts of combinations to then reach
some conclusion.

The default should use stuff that isn't that costly, thus not getting in
the way of what is being observed, but at the same time look for common
patterns, etc.

- Arnaldo
 
> Maybe that's not a concern for you (or smaller systems).  Then I think
> we can keep the original behavior and add a new option (I'm not good
> at naming things, but maybe --off-cpu-sample?) to work differently
> instead of removing the old behavior.
> 
> Thanks,
> Namhyung
> 
> >
> > Before, off-cpu samples come after the regular samples:
> >
> > ```
> >          swapper       0 [000] 963432.136150:    2812933    cycles:P:  ffffffffb7db1bc2 intel_idle+0x62 ([kernel.kallsyms])
> >          swapper       0 [000] 963432.637911:    4932876    cycles:P:  ffffffffb7db1bc2 intel_idle+0x62 ([kernel.kallsyms])
> >          swapper       0 [001] 963432.798072:    6273398    cycles:P:  ffffffffb7db1bc2 intel_idle+0x62 ([kernel.kallsyms])
> >          swapper       0 [000] 963433.541152:    5279005    cycles:P:  ffffffffb7db1bc2 intel_idle+0x62 ([kernel.kallsyms])
> > sh 1410180 [000] 18446744069.414584:    2528851 offcpu-time:
> >             7837148e6e87 wait4+0x17 (/usr/lib/libc.so.6)
> >
> >
> > sh 1410185 [000] 18446744069.414584:    2314223 offcpu-time:
> >             7837148e6e87 wait4+0x17 (/usr/lib/libc.so.6)
> >
> >
> > awk 1409644 [000] 18446744069.414584:     191785 offcpu-time:
> >             702609d03681 read+0x11 (/usr/lib/libc.so.6)
> >                   4a02a4 [unknown] ([unknown])
> > ```
> >
> >
> > After, regular samples (cycles:P) and off-cpu (offcpu-time) samples are
> > collected simultaneously:
> >
> > ```
> > upowerd     741 [000] 963757.428701:     297848 offcpu-time:
> >             72b2da11e6bc read+0x4c (/usr/lib/libc.so.6)
> >
> >
> >       irq/9-acpi      56 [000] 963757.429116:    8760875    cycles:P:  ffffffffb779849f acpi_os_read_port+0x2f ([kernel.kallsyms])
> > upowerd     741 [000] 963757.429172:     459522 offcpu-time:
> >             72b2da11e6bc read+0x4c (/usr/lib/libc.so.6)
> >
> >
> >          swapper       0 [002] 963757.434529:    5759904    cycles:P:  ffffffffb7db1bc2 intel_idle+0x62 ([kernel.kallsyms])
> > perf 1419260 [000] 963757.434550: 1001012116 offcpu-time:
> >             7274e5d190bf __poll+0x4f (/usr/lib/libc.so.6)
> >             591acfc5daf0 perf_evlist__poll+0x24 (/root/hw/perf-tools-next/tools/perf/perf)
> >             591acfb1ca50 perf_evlist__poll_thread+0x160 (/root/hw/perf-tools-next/tools/perf/perf)
> >             7274e5ca955a [unknown] (/usr/lib/libc.so.6)
> > ```
> >
> > Here's a simple flowchart:
> >
> > [parse_event (sample type: PERF_SAMPLE_RAW)]
> >   --> [config (bind fds, sample_id, sample_type)]
> >   --> [off_cpu_strip (sample type: PERF_SAMPLE_RAW)]
> >   --> [record_done (hooks off_cpu_finish)]
> >   --> [prepare_parse (sample type: OFFCPU_SAMPLE_TYPES)]
> >
> > Changes in v2:
> >  - Remove unnecessary comments.
> >  - Rename function off_cpu_change_type to off_cpu_prepare_parse
> >
> > Howard Chu (4):
> >   perf record off-cpu: Parse off-cpu event, change config location
> >   perf record off-cpu: BPF perf_event_output on sched_switch
> >   perf record off-cpu: extract off-cpu sample data from raw_data
> >   perf record off-cpu: delete bound-to-fail test
> >
> >  tools/perf/builtin-record.c             |  98 +++++++++-
> >  tools/perf/tests/shell/record_offcpu.sh |  29 ---
> >  tools/perf/util/bpf_off_cpu.c           | 242 +++++++++++-------------
> >  tools/perf/util/bpf_skel/off_cpu.bpf.c  | 163 +++++++++++++---
> >  tools/perf/util/evsel.c                 |   8 -
> >  tools/perf/util/off_cpu.h               |  14 +-
> >  tools/perf/util/perf-hooks-list.h       |   1 +
> >  7 files changed, 344 insertions(+), 211 deletions(-)
> >
> > --
> > 2.44.0
> >


* Re: [PATCH v2 0/4] Dump off-cpu samples directly
  2024-04-24 21:11   ` Arnaldo Carvalho de Melo
@ 2024-04-24 22:19     ` Ian Rogers
  2024-04-24 22:57       ` Namhyung Kim
  0 siblings, 1 reply; 9+ messages in thread
From: Ian Rogers @ 2024-04-24 22:19 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo
  Cc: Namhyung Kim, Howard Chu, peterz, mingo, mark.rutland,
	alexander.shishkin, jolsa, adrian.hunter, kan.liang, zegao2021,
	leo.yan, ravi.bangoria, linux-perf-users, linux-kernel, bpf

On Wed, Apr 24, 2024 at 2:11 PM Arnaldo Carvalho de Melo
<acme@kernel.org> wrote:
>
> On Wed, Apr 24, 2024 at 12:12:26PM -0700, Namhyung Kim wrote:
> > Hello,
> >
> > On Tue, Apr 23, 2024 at 7:46 PM Howard Chu <howardchu95@gmail.com> wrote:
> > >
> > > As mentioned in: https://bugzilla.kernel.org/show_bug.cgi?id=207323
> > >
> > > Currently, off-cpu samples are dumped when perf record exits. This
> > > results in off-cpu samples appearing after the regular samples. Also,
> > > samples are stored in large BPF maps, which contain all the stack traces
> > > and accumulated off-cpu time but will eventually fill up when running
> > > for an extended period. This patchset fixes those problems by dumping
> > > samples directly into the perf ring buffer and converting them to the
> > > correct format.
> >
> > Thanks for working on this.
> >
> > But the problem of dumping all sched-switch events is that it can be
> > too frequent on loaded machines.  Copying many events to the buffer
> > can result in losing other records.  As perf report doesn't care about
> > timing much, I decided to aggregate the result in a BPF map and dump
> > them at the end of the profiling session.
>
> Should we try to adapt when there are too many context switches, i.e.
> the BPF program can notice that the interval from the last context
> switch is too small and then avoid adding samples, while if the interval
> is a long one then indeed this is a problem where the workload is
> waiting for a long time for something and we want to know what that is,
> and in that case capturing callchains is both desirable and not costly,
> no?
>
> The tool could then at the end produce one of two outputs: the most
> common reasons for being off cpu, or some sort of counter stating that
> there are way too many context switches?
>
> And perhaps we should think about what is best to have as a default, not
> to present just plain old cycles, but point out that the workload is
> most of the time waiting for IO, etc, i.e. the default should give
> interesting clues instead of expecting that the tool user knows all the
> possible knobs and try them in all sorts of combinations to then reach
> some conclusion.
>
> The default should use stuff that isn't that costly, thus not getting in
> the way of what is being observed, but at the same time look for common
> patterns, etc.
>
> - Arnaldo

I really appreciate Howard doing this work!

I wonder if there are other cases where we want to synthesize events in
BPF, for example, we may have fast and slow memory on a system, we
could turn memory events on a system into either fast or slow ones in
BPF based on the memory accessed, so that fast/slow memory systems can
be simulated without access to hardware. This also feels like a perf
script type problem. Perhaps we can add something to the bpf-output
event so it can have multiple uses and not just off-cpu.

To turn the bpf-output samples into off-cpu events there is a pass
added to the save path. I wonder if that can be more generic, like a
save-time perf inject.

I worry that by dropping short samples we create a situation where
off-cpu time + on-cpu time != wall clock time. Perhaps such short
things can get pushed into Namhyung's "at the end" approach while
longer things get samples. Perhaps we only do that when the frequency
is too great.
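
A hedged sketch of that hybrid dispatch on sched_switch, reusing the
map aggregation this series removes (THRESHOLD_NS is a placeholder, not
an option this series adds):

```
/*
 * Sketch only: long waits become samples right away, short ones are
 * folded into the old off_cpu map and dumped at the end, so off-cpu
 * time still adds up to wall clock time.
 */
if (delta >= THRESHOLD_NS) {
	bpf_perf_event_output(ctx, &offcpu_output, BPF_F_CURRENT_CPU,
			      data, size);
} else {
	__u64 *total = bpf_map_lookup_elem(&off_cpu, &key);

	if (total)
		*total += delta;
	else
		bpf_map_update_elem(&off_cpu, &key, &delta, BPF_ANY);
}
```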

It would be nice to start landing this work so I'm wondering what the
minimal way to do that is. It seems putting behavior behind a flag is
a first step.

Thanks,
Ian

> > Maybe that's not a concern for you (or smaller systems).  Then I think
> > we can keep the original behavior and add a new option (I'm not good
> > at naming things, but maybe --off-cpu-sample?) to work differently
> > instead of removing the old behavior.
> >
> > Thanks,
> > Namhyung
> >
> > >
> > > Before, off-cpu samples come after the regular samples:
> > >
> > > ```
> > >          swapper       0 [000] 963432.136150:    2812933    cycles:P:  ffffffffb7db1bc2 intel_idle+0x62 ([kernel.kallsyms])
> > >          swapper       0 [000] 963432.637911:    4932876    cycles:P:  ffffffffb7db1bc2 intel_idle+0x62 ([kernel.kallsyms])
> > >          swapper       0 [001] 963432.798072:    6273398    cycles:P:  ffffffffb7db1bc2 intel_idle+0x62 ([kernel.kallsyms])
> > >          swapper       0 [000] 963433.541152:    5279005    cycles:P:  ffffffffb7db1bc2 intel_idle+0x62 ([kernel.kallsyms])
> > > sh 1410180 [000] 18446744069.414584:    2528851 offcpu-time:
> > >             7837148e6e87 wait4+0x17 (/usr/lib/libc.so.6)
> > >
> > >
> > > sh 1410185 [000] 18446744069.414584:    2314223 offcpu-time:
> > >             7837148e6e87 wait4+0x17 (/usr/lib/libc.so.6)
> > >
> > >
> > > awk 1409644 [000] 18446744069.414584:     191785 offcpu-time:
> > >             702609d03681 read+0x11 (/usr/lib/libc.so.6)
> > >                   4a02a4 [unknown] ([unknown])
> > > ```
> > >
> > >
> > > After, regular samples (cycles:P) and off-cpu (offcpu-time) samples are
> > > collected simultaneously:
> > >
> > > ```
> > > upowerd     741 [000] 963757.428701:     297848 offcpu-time:
> > >             72b2da11e6bc read+0x4c (/usr/lib/libc.so.6)
> > >
> > >
> > >       irq/9-acpi      56 [000] 963757.429116:    8760875    cycles:P:  ffffffffb779849f acpi_os_read_port+0x2f ([kernel.kallsyms])
> > > upowerd     741 [000] 963757.429172:     459522 offcpu-time:
> > >             72b2da11e6bc read+0x4c (/usr/lib/libc.so.6)
> > >
> > >
> > >          swapper       0 [002] 963757.434529:    5759904    cycles:P:  ffffffffb7db1bc2 intel_idle+0x62 ([kernel.kallsyms])
> > > perf 1419260 [000] 963757.434550: 1001012116 offcpu-time:
> > >             7274e5d190bf __poll+0x4f (/usr/lib/libc.so.6)
> > >             591acfc5daf0 perf_evlist__poll+0x24 (/root/hw/perf-tools-next/tools/perf/perf)
> > >             591acfb1ca50 perf_evlist__poll_thread+0x160 (/root/hw/perf-tools-next/tools/perf/perf)
> > >             7274e5ca955a [unknown] (/usr/lib/libc.so.6)
> > > ```
> > >
> > > Here's a simple flowchart:
> > >
> > > [parse_event (sample type: PERF_SAMPLE_RAW)]
> > >   --> [config (bind fds, sample_id, sample_type)]
> > >   --> [off_cpu_strip (sample type: PERF_SAMPLE_RAW)]
> > >   --> [record_done (hooks off_cpu_finish)]
> > >   --> [prepare_parse (sample type: OFFCPU_SAMPLE_TYPES)]
> > >
> > > Changes in v2:
> > >  - Remove unnecessary comments.
> > >  - Rename function off_cpu_change_type to off_cpu_prepare_parse
> > >
> > > Howard Chu (4):
> > >   perf record off-cpu: Parse off-cpu event, change config location
> > >   perf record off-cpu: BPF perf_event_output on sched_switch
> > >   perf record off-cpu: extract off-cpu sample data from raw_data
> > >   perf record off-cpu: delete bound-to-fail test
> > >
> > >  tools/perf/builtin-record.c             |  98 +++++++++-
> > >  tools/perf/tests/shell/record_offcpu.sh |  29 ---
> > >  tools/perf/util/bpf_off_cpu.c           | 242 +++++++++++-------------
> > >  tools/perf/util/bpf_skel/off_cpu.bpf.c  | 163 +++++++++++++---
> > >  tools/perf/util/evsel.c                 |   8 -
> > >  tools/perf/util/off_cpu.h               |  14 +-
> > >  tools/perf/util/perf-hooks-list.h       |   1 +
> > >  7 files changed, 344 insertions(+), 211 deletions(-)
> > >
> > > --
> > > 2.44.0
> > >


* Re: [PATCH v2 0/4] Dump off-cpu samples directly
  2024-04-24 22:19     ` Ian Rogers
@ 2024-04-24 22:57       ` Namhyung Kim
  0 siblings, 0 replies; 9+ messages in thread
From: Namhyung Kim @ 2024-04-24 22:57 UTC (permalink / raw)
  To: Ian Rogers
  Cc: Arnaldo Carvalho de Melo, Howard Chu, peterz, mingo, mark.rutland,
	alexander.shishkin, jolsa, adrian.hunter, kan.liang, zegao2021,
	leo.yan, ravi.bangoria, linux-perf-users, linux-kernel, bpf

On Wed, Apr 24, 2024 at 3:19 PM Ian Rogers <irogers@google.com> wrote:
>
> On Wed, Apr 24, 2024 at 2:11 PM Arnaldo Carvalho de Melo
> <acme@kernel.org> wrote:
> >
> > On Wed, Apr 24, 2024 at 12:12:26PM -0700, Namhyung Kim wrote:
> > > Hello,
> > >
> > > On Tue, Apr 23, 2024 at 7:46 PM Howard Chu <howardchu95@gmail.com> wrote:
> > > >
> > > > As mentioned in: https://bugzilla.kernel.org/show_bug.cgi?id=207323
> > > >
> > > > Currently, off-cpu samples are dumped when perf record is exiting. This
> > > > results in off-cpu samples being after the regular samples. Also, samples
> > > > are stored in large BPF maps which contain all the stack traces and
> > > > accumulated off-cpu time, but they are eventually going to fill up after
> > > > running for an extensive period. This patch fixes those problems by dumping
> > > > samples directly into perf ring buffer, and dispatching those samples to the
> > > > correct format.
> > >
> > > Thanks for working on this.
> > >
> > > But the problem of dumping all sched-switch events is that it can be
> > > too frequent on loaded machines.  Copying many events to the buffer
> > > can result in losing other records.  As perf report doesn't care about
> > > timing much, I decided to aggregate the result in a BPF map and dump
> > > them at the end of the profiling session.
> >
> > Should we try to adapt when there are too many context switches, i.e.
> > the BPF program can notice that the interval from the last context
> > switch is too small and then avoid adding samples, while if the interval
> > is a long one then indeed this is a problem where the workload is
> > waiting for a long time for something and we want to know what is that,
> > and in that case capturing callchains is both desirable and not costly,
> > no?

Sounds interesting.  Yeah we could make it adaptive based on the
off-cpu time at the moment.

> >
> > The tool could then at the end produce one of two outputs: the most
> > common reasons for being off cpu, or some sort of counter stating that
> > there are way too many context switches?
> >
> > And perhaps we should think about what is best to have as a default, not
> > to present just plain old cycles, but point out that the workload is
> > most of the time waiting for IO, etc, i.e. the default should give
> > interesting clues instead of expecting that the tool user knows all the
> > possible knobs and try them in all sorts of combinations to then reach
> > some conclusion.
> >
> > The default should use stuff that isn't that costly, thus not getting in
> > the way of what is being observed, but at the same time look for common
> > patterns, etc.
> >
> > - Arnaldo
>
> I really appreciate Howard doing this work!
>
> I wonder if there are other cases where we want to synthesize events in
> BPF, for example, we may have fast and slow memory on a system, we
> could turn memory events on a system into either fast or slow ones in
> BPF based on the memory accessed, so that fast/slow memory systems can
> be simulated without access to hardware. This also feels like a perf
> script type problem. Perhaps we can add something to the bpf-output
> event so it can have multiple uses and not just off-cpu.
>
> To turn the bpf-output samples into off-cpu events there is a pass
> added to the save path. I wonder if that can be more generic, like a
> save-time perf inject.
>
> I worry that by dropping short samples we create a situation where
> off-cpu time + on-cpu time != wall clock time. Perhaps such short
> things can get pushed into Namhyung's "at the end" approach while
> longer things get samples. Perhaps we only do that when the frequency
> is too great.

Sounds good.  We might add an option to specify the threshold to
determine whether to dump the data or to save it for later.  But ideally
it should be able to find a good default.

>
> It would be nice to start landing this work so I'm wondering what the
> minimal way to do that is. It seems putting behavior behind a flag is
> a first step.

Agreed!

Thanks,
Namhyung


end of thread

Thread overview: 9+ messages
2024-04-24  2:48 [PATCH v2 0/4] Dump off-cpu samples directly Howard Chu
2024-04-24  2:48 ` [PATCH v2 1/4] perf record off-cpu: Parse off-cpu event, change config location Howard Chu
2024-04-24  2:48 ` [PATCH v2 2/4] perf record off-cpu: BPF perf_event_output on sched_switch Howard Chu
2024-04-24  2:48 ` [PATCH v2 3/4] perf record off-cpu: extract off-cpu sample data from raw_data Howard Chu
2024-04-24  2:48 ` [PATCH v2 4/4] perf record off-cpu: delete bound-to-fail test Howard Chu
2024-04-24 19:12 ` [PATCH v2 0/4] Dump off-cpu samples directly Namhyung Kim
2024-04-24 21:11   ` Arnaldo Carvalho de Melo
2024-04-24 22:19     ` Ian Rogers
2024-04-24 22:57       ` Namhyung Kim
