Hello Peter,

On 4/5/2024 3:58 PM, Peter Zijlstra wrote:
> Extend / fix 86bfbb7ce4f6 ("sched/fair: Add lag based placement") by
> noting that lag is fundamentally a temporal measure. It should not be
> carried around indefinitely.
>
> OTOH it should also not be instantly discarded, doing so will allow a
> task to game the system by purposefully (micro) sleeping at the end of
> its time quantum.
>
> Since lag is intimately tied to the virtual time base, a wall-time
> based decay is also insufficient, notably competition is required for
> any of this to make sense.
>
> Instead, delay the dequeue and keep the 'tasks' on the runqueue,
> competing until they are eligible.
>
> Strictly speaking, we only care about keeping them until the 0-lag
> point, but that is a difficult proposition, instead carry them around
> until they get picked again, and dequeue them at that point.
>
> Since we should have dequeued them at the 0-lag point, truncate lag
> (eg. don't let them earn positive lag).
>
> XXX test the cfs-throttle stuff

I ran into a few issues when testing the series on top of tip:sched/core
at commit 4475cd8bfd9b ("sched/balancing: Simplify the sg_status bitmask
and use separate ->overloaded and ->overutilized flags"). All of these
splats surfaced when running unixbench with Delayed Dequeue (echoing
NO_DELAY_DEQUEUE to /sys/kernel/debug/sched/features seems to make the
system stable when running Unixbench spawn).

Unixbench (https://github.com/kdlucas/byte-unixbench.git) command:

    ./Run spawn -c 512

Splats appear soon into the run. Following are the splats and their
corresponding code blocks from my 3rd Generation EPYC system
(2 x 64C/128T):

1. NULL pointer dereference in can_migrate_task():

    BUG: kernel NULL pointer dereference, address: 0000000000000040
    #PF: supervisor read access in kernel mode
    #PF: error_code(0x0000) - not-present page
    PGD 0 P4D 0
    Oops: 0000 [#1] PREEMPT SMP NOPTI
    CPU: 154 PID: 1507736 Comm: spawn Not tainted 6.9.0-rc1-test+ #958
    Hardware name: Dell Inc. PowerEdge R6525/024PW1, BIOS 2.7.3 03/30/2022
    RIP: 0010:can_migrate_task+0x2b/0x6c0
    Code: ...
    RSP: 0018:ffffb6bb9e6a3bc0 EFLAGS: 00010086
    RAX: 0000000000000000 RBX: ffffb6bb9e6a3c80 RCX: ffff90ad0d209400
    RDX: 0000000000000008 RSI: ffffb6bb9e6a3c80 RDI: ffff90eb3b236438
    RBP: ffff90eb3b236438 R08: 0000005c743512ab R09: ffffffffffff0000
    R10: 0000000000000001 R11: 0000000000000100 R12: ffff90eb3b236438
    R13: ffff90eb3b2364f0 R14: ffff90eb3b6359c0 R15: ffff90eb3b6359c0
    FS: 0000000000000000(0000) GS:ffff90eb3df00000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 0000000000000040 CR3: 000000807da3c006 CR4: 0000000000770ef0
    PKRU: 55555554
    Call Trace:
     ? __die+0x24/0x70
     ? page_fault_oops+0x14a/0x510
     ? exc_page_fault+0x77/0x170
     ? asm_exc_page_fault+0x26/0x30
     ? can_migrate_task+0x2b/0x6c0
     sched_balance_rq+0x7a8/0x1190
     sched_balance_newidle+0x1e2/0x490
     pick_next_task_fair+0x36/0x4a0
     __schedule+0x1c0/0x1710
     ? srso_alias_return_thunk+0x5/0xfbef5
     ? refill_stock+0x1a/0x30
     ? srso_alias_return_thunk+0x5/0xfbef5
     ? obj_cgroup_uncharge_pages+0x4d/0xd0
     do_task_dead+0x42/0x50
     do_exit+0x777/0xad0
     do_group_exit+0x30/0x80
     __x64_sys_exit_group+0x18/0x20
     do_syscall_64+0x79/0x120
     ? srso_alias_return_thunk+0x5/0xfbef5
     ? srso_alias_return_thunk+0x5/0xfbef5
     ? irqentry_exit_to_user_mode+0x5b/0x170
     entry_SYSCALL_64_after_hwframe+0x6c/0x74
    RIP: 0033:0x7f963a6eac31
    Code: Unable to access opcode bytes at 0x7f963a6eac07.
    RSP: 002b:00007ffc6b7158c8 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
    RAX: ffffffffffffffda RBX: 00007f963a816a00 RCX: 00007f963a6eac31
    RDX: 000000000000003c RSI: 00000000000000e7 RDI: 0000000000000000
    RBP: 0000000000000000 R08: ffffffffffffff80 R09: 0000000000000020
    R10: 0000000000000000 R11: 0000000000000246 R12: 00007f963a816a00
    R13: 0000000000000000 R14: 00007f963a81bee8 R15: 00007f963a81bf00
    Modules linked in: ...
    CR2: 0000000000000040
    ---[ end trace 0000000000000000 ]---

    $ scripts/faddr2line vmlinux can_migrate_task+0x2b/0x6c0
    can_migrate_task+0x2b/0x6c0:
    throttled_lb_pair at kernel/sched/fair.c:5738
    (inlined by) can_migrate_task at kernel/sched/fair.c:9090

Corresponds to:

    static inline int throttled_lb_pair(struct task_group *tg,
                                        int src_cpu, int dest_cpu)
    {
            struct cfs_rq *src_cfs_rq, *dest_cfs_rq;

            src_cfs_rq = tg->cfs_rq[src_cpu]; /* <----- Here -----< */
            dest_cfs_rq = tg->cfs_rq[dest_cpu];

            return throttled_hierarchy(src_cfs_rq) ||
                   throttled_hierarchy(dest_cfs_rq);
    }

    (inlined by)

    int can_migrate_task(struct task_struct *p, struct lb_env *env)
    {
            /* Called here */
            if (throttled_lb_pair(task_group(p), env->src_cpu, env->dst_cpu))
                    return 0;
            ...
    }

2. NULL pointer dereference in pick_next_task_fair():

    BUG: kernel NULL pointer dereference, address: 0000000000000098
    #PF: supervisor read access in kernel mode
    #PF: error_code(0x0000) - not-present page
    PGD 0 P4D 0
    Oops: 0000 [#1] PREEMPT SMP NOPTI
    CPU: 107 PID: 1206665 Comm: spawn Tainted: G W 6.9.0-rc1-test+ #958
    Hardware name: Dell Inc. PowerEdge R6525/024PW1, BIOS 2.7.3 03/30/2022
    RIP: 0010:pick_next_task_fair+0x327/0x4a0
    Code: ...
    RSP: 0018:ffffb613c212fd28 EFLAGS: 00010002
    RAX: 0000004ed2799383 RBX: 0000000000000000 RCX: ffff8f65baf3f800
    RDX: ffff8f65baf3ca00 RSI: 0000000000000000 RDI: 000000825ae302ab
    RBP: ffff8f64b13b59c0 R08: 0000000000000015 R09: 0000000000000314
    R10: 0000000000000001 R11: 0000000000000001 R12: ffff8f261ac199c0
    R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
    FS: 00007f768b79e740(0000) GS:ffff8f64b1380000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 0000000000000098 CR3: 00000040d1010006 CR4: 0000000000770ef0
    PKRU: 55555554
    Call Trace:
     ? __die+0x24/0x70
     ? page_fault_oops+0x14a/0x510
     ? srso_alias_return_thunk+0x5/0xfbef5
     ? report_bug+0x18e/0x1a0
     ? srso_alias_return_thunk+0x5/0xfbef5
     ? exc_page_fault+0x77/0x170
     ? asm_exc_page_fault+0x26/0x30
     ? pick_next_task_fair+0x327/0x4a0
     ? pick_next_task_fair+0x320/0x4a0
     __schedule+0x1c0/0x1710
     ? release_task+0x2fc/0x4c0
     ? srso_alias_return_thunk+0x5/0xfbef5
     schedule+0x30/0x120
     syscall_exit_to_user_mode+0x98/0x1b0
     do_syscall_64+0x85/0x120
     ? srso_alias_return_thunk+0x5/0xfbef5
     ? __count_memcg_events+0x69/0x100
     ? srso_alias_return_thunk+0x5/0xfbef5
     ? count_memcg_events.constprop.0+0x1a/0x30
     ? srso_alias_return_thunk+0x5/0xfbef5
     ? handle_mm_fault+0x17d/0x2e0
     ? srso_alias_return_thunk+0x5/0xfbef5
     ? do_user_addr_fault+0x33d/0x6f0
     ? srso_alias_return_thunk+0x5/0xfbef5
     ? srso_alias_return_thunk+0x5/0xfbef5
     ? irqentry_exit_to_user_mode+0x5b/0x170
     entry_SYSCALL_64_after_hwframe+0x6c/0x74
    RIP: 0033:0x7f768b4eab57
    Code: ...
    RSP: 002b:00007fff5f6e2018 EFLAGS: 00000246 ORIG_RAX: 0000000000000038
    RAX: 000000000026d13b RBX: 00007f768b7ee040 RCX: 00007f768b4eab57
    RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000001200011
    RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
    R10: 00007f768b79ea10 R11: 0000000000000246 R12: 0000000000000001
    R13: 000055b811d1a140 R14: 000055b811d1cd88 R15: 00007f768b7ee040
    Modules linked in: ...
    CR2: 0000000000000098
    ---[ end trace 0000000000000000 ]---

    $ scripts/faddr2line vmlinux pick_next_task_fair+0x327/0x4a0
    pick_next_task_fair+0x327/0x4a0:
    is_same_group at kernel/sched/fair.c:418
    (inlined by) pick_next_task_fair at kernel/sched/fair.c:8625

    struct cfs_rq *
    is_same_group(struct sched_entity *se, struct sched_entity *pse)
    {
            if (se->cfs_rq == pse->cfs_rq) /* <----- HERE -----< */
                    return se->cfs_rq;

            return NULL;
    }

    (inlined by)

    struct task_struct *
    pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
    {
            ...
            if (prev != p) {
                    ...
                    while (!(cfs_rq = is_same_group(se, pse) /* <---- HERE ----< */)) {
                            ...
                    }
                    ...
            }
            ...
    }

3. NULL pointer dereference in __dequeue_entity():

    BUG: kernel NULL pointer dereference, address: 0000000000000000
    #PF: supervisor read access in kernel mode
    #PF: error_code(0x0000) - not-present page
    PGD 0 P4D 0
    Oops: 0000 [#1] PREEMPT SMP NOPTI
    CPU: 95 PID: 60896 Comm: spawn Not tainted 6.9.0-rc1-test+ #958
    Hardware name: Dell Inc. PowerEdge R6525/024PW1, BIOS 2.7.3 03/30/2022
    RIP: 0010:__rb_erase_color+0x88/0x260
    Code: ...
    RSP: 0018:ffffab158755fc08 EFLAGS: 00010046
    RAX: 0000000000000000 RBX: ffffffff841314b0 RCX: 0000000017fc8dd0
    RDX: 0000000000000000 RSI: ffff8decfb1fe450 RDI: ffff8decf80bcdd0
    RBP: ffff8decf80bcdd0 R08: ffff8decf80bcdd0 R09: ffffffffffffbb60
    R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000000000
    R13: ffff8decfb1fe450 R14: ffff8ded0ec03400 R15: ffff8decfb1fe400
    FS: 00007f1ded0a2740(0000) GS:ffff8e2bb0d80000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 0000000000000000 CR3: 00000040e66f0005 CR4: 0000000000770ef0
    PKRU: 55555554
    Call Trace:
     ? __die+0x24/0x70
     ? page_fault_oops+0x14a/0x510
     ? exc_page_fault+0x77/0x170
     ? asm_exc_page_fault+0x26/0x30
     ? __pfx_min_vruntime_cb_rotate+0x10/0x10
     ? __rb_erase_color+0x88/0x260
     __dequeue_entity+0x1b7/0x310
     set_next_entity+0xc0/0x1e0
     pick_next_task_fair+0x355/0x4a0
     __schedule+0x1c0/0x1710
     ? native_queued_spin_lock_slowpath+0x2a4/0x2f0
     schedule+0x30/0x120
     do_wait+0xad/0x100
     kernel_wait4+0xa9/0x150
     ? __pfx_child_wait_callback+0x10/0x10
     do_syscall_64+0x79/0x120
     ? srso_alias_return_thunk+0x5/0xfbef5
     ? __count_memcg_events+0x69/0x100
     ? srso_alias_return_thunk+0x5/0xfbef5
     ? count_memcg_events.constprop.0+0x1a/0x30
     ? srso_alias_return_thunk+0x5/0xfbef5
     ? handle_mm_fault+0x17d/0x2e0
     ? srso_alias_return_thunk+0x5/0xfbef5
     ? do_user_addr_fault+0x33d/0x6f0
     ? srso_alias_return_thunk+0x5/0xfbef5
     ? srso_alias_return_thunk+0x5/0xfbef5
     ? irqentry_exit_to_user_mode+0x5b/0x170
     entry_SYSCALL_64_after_hwframe+0x6c/0x74
    RIP: 0033:0x7f1deceea3ea
    Code: ...
    RSP: 002b:00007ffd7fd37ca8 EFLAGS: 00000246 ORIG_RAX: 000000000000003d
    RAX: ffffffffffffffda RBX: 00007ffd7fd37cb4 RCX: 00007f1deceea3ea
    RDX: 0000000000000000 RSI: 00007ffd7fd37cb4 RDI: 00000000ffffffff
    RBP: 0000000000000002 R08: 00000000000136f5 R09: 0000000000000000
    R10: 0000000000000000 R11: 0000000000000246 R12: 00007ffd7fd37dd8
    R13: 000055fd8debf140 R14: 000055fd8dec1d88 R15: 00007f1ded0f2040
    Modules linked in: ...
    CR2: 0000000000000000
    ---[ end trace 0000000000000000 ]---

Note: I only ran into this issue with unixbench spawn. A bunch of other
benchmarks (hackbench, stream, tbench, netperf, schbench, other variants
of unixbench) ran fine without bringing down the machine.

Attaching my config below in case this is config specific.

>
> Signed-off-by: Peter Zijlstra (Intel)
> ---
>  include/linux/sched.h   |    1
>  kernel/sched/core.c     |   22 +++++--
>  kernel/sched/fair.c     |  148 +++++++++++++++++++++++++++++++++++++++++++-----
>  kernel/sched/features.h |   12 +++
>  kernel/sched/sched.h    |    2
>  5 files changed, 167 insertions(+), 18 deletions(-)
>
> [..snip..]
>

--
Thanks and Regards,
Prateek