[PATCH v3] perf/core: Fix missing wakeup when waiting for context reference

LKML Archive mirror
 help / color / mirror / Atom feed

* [PATCH v3] perf/core: Fix missing wakeup when waiting for context reference
@ 2024-04-18 11:42 Haifeng Xu
  2024-04-18 15:46 ` Frederic Weisbecker
  2024-04-23 10:05 ` Mark Rutland
  0 siblings, 2 replies; 4+ messages in thread
From: Haifeng Xu @ 2024-04-18 11:42 UTC (permalink / raw
  To: peterz, mingo, frederic
  Cc: acme, mark.rutland, alexander.shishkin, jolsa, namhyung, irogers,
	adrian.hunter, linux-perf-users, linux-kernel, Haifeng Xu

In our production environment, we found many hung tasks which are
blocked for more than 18 hours. Their call traces are like this:

[346278.191038] __schedule+0x2d8/0x890
[346278.191046] schedule+0x4e/0xb0
[346278.191049] perf_event_free_task+0x220/0x270
[346278.191056] ? init_wait_var_entry+0x50/0x50
[346278.191060] copy_process+0x663/0x18d0
[346278.191068] kernel_clone+0x9d/0x3d0
[346278.191072] __do_sys_clone+0x5d/0x80
[346278.191076] __x64_sys_clone+0x25/0x30
[346278.191079] do_syscall_64+0x5c/0xc0
[346278.191083] ? syscall_exit_to_user_mode+0x27/0x50
[346278.191086] ? do_syscall_64+0x69/0xc0
[346278.191088] ? irqentry_exit_to_user_mode+0x9/0x20
[346278.191092] ? irqentry_exit+0x19/0x30
[346278.191095] ? exc_page_fault+0x89/0x160
[346278.191097] ? asm_exc_page_fault+0x8/0x30
[346278.191102] entry_SYSCALL_64_after_hwframe+0x44/0xae

The task was waiting for the refcount become to 1, but from the vmcore,
we found the refcount has already been 1. It seems that the task didn't
get woken up by perf_event_release_kernel() and got stuck forever. The
below scenario may cause the problem.

Thread A					Thread B
...						...
perf_event_free_task				perf_event_release_kernel
						   ...
						   acquire event->child_mutex
						   ...
						   get_ctx
   ...						   release event->child_mutex
   acquire ctx->mutex
   ...
   perf_free_event (acquire/release event->child_mutex)
   ...
   release ctx->mutex
   wait_var_event
						   acquire ctx->mutex
						   acquire event->child_mutex
						   # move existing events to free_list
						   release event->child_mutex
						   release ctx->mutex
						   put_ctx
...						...

In this case, all events of the ctx have been freed, so we couldn't
find the ctx in free_list and Thread A will miss the wakeup. It's thus
necessary to add a wakeup after dropping the reference.

Fixes: 1cf8dfe8a661 ("perf/core: Fix race between close() and fork()")
Signed-off-by: Haifeng Xu <haifeng.xu@shopee.com>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
---
Changes since v1:
- Add the fixed tag.
- Simplify v1's patch. (Frederic)

Changes since v2:
- Use Reviewed-by tag instead of Signed-off-by tag.
---
 kernel/events/core.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 4f0c45ab8d7d..15c35070db6a 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -5340,6 +5340,7 @@ int perf_event_release_kernel(struct perf_event *event)
 again:
 	mutex_lock(&event->child_mutex);
 	list_for_each_entry(child, &event->child_list, child_list) {
+		void *var = NULL;
 
 		/*
 		 * Cannot change, child events are not migrated, see the
@@ -5380,11 +5381,23 @@ int perf_event_release_kernel(struct perf_event *event)
 			 * this can't be the last reference.
 			 */
 			put_event(event);
+		} else {
+			var = &ctx->refcount;
 		}
 
 		mutex_unlock(&event->child_mutex);
 		mutex_unlock(&ctx->mutex);
 		put_ctx(ctx);
+
+		if (var) {
+			/*
+			 * If perf_event_free_task() has deleted all events from the
+			 * ctx while the child_mutex got released above, make sure to
+			 * notify about the preceding put_ctx().
+			 */
+			smp_mb(); /* pairs with wait_var_event() */
+			wake_up_var(var);
+		}
 		goto again;
 	}
 	mutex_unlock(&event->child_mutex);
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 4+ messages in thread

* Re: [PATCH v3] perf/core: Fix missing wakeup when waiting for context reference
  2024-04-18 11:42 [PATCH v3] perf/core: Fix missing wakeup when waiting for context reference Haifeng Xu
@ 2024-04-18 15:46 ` Frederic Weisbecker
  2024-04-23 10:05 ` Mark Rutland
  1 sibling, 0 replies; 4+ messages in thread
From: Frederic Weisbecker @ 2024-04-18 15:46 UTC (permalink / raw
  To: Haifeng Xu, mingo
  Cc: peterz, mingo, acme, mark.rutland, alexander.shishkin, jolsa,
	namhyung, irogers, adrian.hunter, linux-perf-users, linux-kernel

Le Thu, Apr 18, 2024 at 11:42:09AM +0000, Haifeng Xu a écrit :
> In our production environment, we found many hung tasks which are
> blocked for more than 18 hours. Their call traces are like this:
> 
> [346278.191038] __schedule+0x2d8/0x890
> [346278.191046] schedule+0x4e/0xb0
> [346278.191049] perf_event_free_task+0x220/0x270
> [346278.191056] ? init_wait_var_entry+0x50/0x50
> [346278.191060] copy_process+0x663/0x18d0
> [346278.191068] kernel_clone+0x9d/0x3d0
> [346278.191072] __do_sys_clone+0x5d/0x80
> [346278.191076] __x64_sys_clone+0x25/0x30
> [346278.191079] do_syscall_64+0x5c/0xc0
> [346278.191083] ? syscall_exit_to_user_mode+0x27/0x50
> [346278.191086] ? do_syscall_64+0x69/0xc0
> [346278.191088] ? irqentry_exit_to_user_mode+0x9/0x20
> [346278.191092] ? irqentry_exit+0x19/0x30
> [346278.191095] ? exc_page_fault+0x89/0x160
> [346278.191097] ? asm_exc_page_fault+0x8/0x30
> [346278.191102] entry_SYSCALL_64_after_hwframe+0x44/0xae
> 
> The task was waiting for the refcount become to 1, but from the vmcore,
> we found the refcount has already been 1. It seems that the task didn't
> get woken up by perf_event_release_kernel() and got stuck forever. The
> below scenario may cause the problem.
> 
> Thread A					Thread B
> ...						...
> perf_event_free_task				perf_event_release_kernel
> 						   ...
> 						   acquire event->child_mutex
> 						   ...
> 						   get_ctx
>    ...						   release event->child_mutex
>    acquire ctx->mutex
>    ...
>    perf_free_event (acquire/release event->child_mutex)
>    ...
>    release ctx->mutex
>    wait_var_event
> 						   acquire ctx->mutex
> 						   acquire event->child_mutex
> 						   # move existing events to free_list
> 						   release event->child_mutex
> 						   release ctx->mutex
> 						   put_ctx
> ...						...
> 
> In this case, all events of the ctx have been freed, so we couldn't
> find the ctx in free_list and Thread A will miss the wakeup. It's thus
> necessary to add a wakeup after dropping the reference.
> 
> Fixes: 1cf8dfe8a661 ("perf/core: Fix race between close() and fork()")
> Signed-off-by: Haifeng Xu <haifeng.xu@shopee.com>
> Reviewed-by: Frederic Weisbecker <frederic@kernel.org>

Thanks!

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH v3] perf/core: Fix missing wakeup when waiting for context reference
  2024-04-18 11:42 [PATCH v3] perf/core: Fix missing wakeup when waiting for context reference Haifeng Xu
  2024-04-18 15:46 ` Frederic Weisbecker
@ 2024-04-23 10:05 ` Mark Rutland
  2024-05-13  9:08   ` Haifeng Xu
  1 sibling, 1 reply; 4+ messages in thread
From: Mark Rutland @ 2024-04-23 10:05 UTC (permalink / raw
  To: Haifeng Xu
  Cc: peterz, mingo, frederic, acme, alexander.shishkin, jolsa,
	namhyung, irogers, adrian.hunter, linux-perf-users, linux-kernel

On Thu, Apr 18, 2024 at 11:42:09AM +0000, Haifeng Xu wrote:
> In our production environment, we found many hung tasks which are
> blocked for more than 18 hours. Their call traces are like this:
> 
> [346278.191038] __schedule+0x2d8/0x890
> [346278.191046] schedule+0x4e/0xb0
> [346278.191049] perf_event_free_task+0x220/0x270
> [346278.191056] ? init_wait_var_entry+0x50/0x50
> [346278.191060] copy_process+0x663/0x18d0
> [346278.191068] kernel_clone+0x9d/0x3d0
> [346278.191072] __do_sys_clone+0x5d/0x80
> [346278.191076] __x64_sys_clone+0x25/0x30
> [346278.191079] do_syscall_64+0x5c/0xc0
> [346278.191083] ? syscall_exit_to_user_mode+0x27/0x50
> [346278.191086] ? do_syscall_64+0x69/0xc0
> [346278.191088] ? irqentry_exit_to_user_mode+0x9/0x20
> [346278.191092] ? irqentry_exit+0x19/0x30
> [346278.191095] ? exc_page_fault+0x89/0x160
> [346278.191097] ? asm_exc_page_fault+0x8/0x30
> [346278.191102] entry_SYSCALL_64_after_hwframe+0x44/0xae
> 
> The task was waiting for the refcount become to 1, but from the vmcore,
> we found the refcount has already been 1. It seems that the task didn't
> get woken up by perf_event_release_kernel() and got stuck forever. The
> below scenario may cause the problem.
> 
> Thread A					Thread B
> ...						...
> perf_event_free_task				perf_event_release_kernel
> 						   ...
> 						   acquire event->child_mutex
> 						   ...
> 						   get_ctx
>    ...						   release event->child_mutex
>    acquire ctx->mutex
>    ...
>    perf_free_event (acquire/release event->child_mutex)
>    ...
>    release ctx->mutex
>    wait_var_event
> 						   acquire ctx->mutex
> 						   acquire event->child_mutex
> 						   # move existing events to free_list
> 						   release event->child_mutex
> 						   release ctx->mutex
> 						   put_ctx
> ...						...
> 
> In this case, all events of the ctx have been freed, so we couldn't
> find the ctx in free_list and Thread A will miss the wakeup. It's thus
> necessary to add a wakeup after dropping the reference.
> 
> Fixes: 1cf8dfe8a661 ("perf/core: Fix race between close() and fork()")
> Signed-off-by: Haifeng Xu <haifeng.xu@shopee.com>
> Reviewed-by: Frederic Weisbecker <frederic@kernel.org>

FWIW, this looks good to me, but I haven't yet been able to write a test to
exercise this: perf_event_free_task() is only called if
perf_event_init_context() fails or of copy_process() fails partway through, and
while it should be possible to make the latter fail consistently by messing
with cgroups, I haven't had the time to work all that out.

So I think there's a reliable DoS here, but I haven't had the time to go write
that myself. It would be nice if we actually had a test for this.

I reckon that in addition to the Fixes tag, this needs:

Cc: stable@vger.kernel.org

> ---
> Changes since v1:
> - Add the fixed tag.
> - Simplify v1's patch. (Frederic)
> 
> Changes since v2:
> - Use Reviewed-by tag instead of Signed-off-by tag.
> ---
>  kernel/events/core.c | 13 +++++++++++++
>  1 file changed, 13 insertions(+)
> 
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 4f0c45ab8d7d..15c35070db6a 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -5340,6 +5340,7 @@ int perf_event_release_kernel(struct perf_event *event)
>  again:
>  	mutex_lock(&event->child_mutex);
>  	list_for_each_entry(child, &event->child_list, child_list) {
> +		void *var = NULL;
>  
>  		/*
>  		 * Cannot change, child events are not migrated, see the
> @@ -5380,11 +5381,23 @@ int perf_event_release_kernel(struct perf_event *event)
>  			 * this can't be the last reference.
>  			 */
>  			put_event(event);
> +		} else {
> +			var = &ctx->refcount;
>  		}
>  
>  		mutex_unlock(&event->child_mutex);
>  		mutex_unlock(&ctx->mutex);
>  		put_ctx(ctx);
> +
> +		if (var) {
> +			/*
> +			 * If perf_event_free_task() has deleted all events from the
> +			 * ctx while the child_mutex got released above, make sure to
> +			 * notify about the preceding put_ctx().
> +			 */
> +			smp_mb(); /* pairs with wait_var_event() */
> +			wake_up_var(var);
> +		}
>  		goto again;
>  	}
>  	mutex_unlock(&event->child_mutex);

I was a bit worrited that we're doing the wakeup with the event->child_mutex
held; AFAICT that looks to be safe, but I'm not a scheduler expert.

FWIW:

Acked-by: Mark Rutland <mark.rutland@arm.com>

Mark.

> -- 
> 2.25.1
> 

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH v3] perf/core: Fix missing wakeup when waiting for context reference
  2024-04-23 10:05 ` Mark Rutland
@ 2024-05-13  9:08   ` Haifeng Xu
  0 siblings, 0 replies; 4+ messages in thread
From: Haifeng Xu @ 2024-05-13  9:08 UTC (permalink / raw
  To: Mark Rutland
  Cc: peterz, mingo, frederic, acme, alexander.shishkin, jolsa,
	namhyung, irogers, adrian.hunter, linux-perf-users, linux-kernel



On 2024/4/23 18:05, Mark Rutland wrote:
> On Thu, Apr 18, 2024 at 11:42:09AM +0000, Haifeng Xu wrote:
>> In our production environment, we found many hung tasks which are
>> blocked for more than 18 hours. Their call traces are like this:
>>
>> [346278.191038] __schedule+0x2d8/0x890
>> [346278.191046] schedule+0x4e/0xb0
>> [346278.191049] perf_event_free_task+0x220/0x270
>> [346278.191056] ? init_wait_var_entry+0x50/0x50
>> [346278.191060] copy_process+0x663/0x18d0
>> [346278.191068] kernel_clone+0x9d/0x3d0
>> [346278.191072] __do_sys_clone+0x5d/0x80
>> [346278.191076] __x64_sys_clone+0x25/0x30
>> [346278.191079] do_syscall_64+0x5c/0xc0
>> [346278.191083] ? syscall_exit_to_user_mode+0x27/0x50
>> [346278.191086] ? do_syscall_64+0x69/0xc0
>> [346278.191088] ? irqentry_exit_to_user_mode+0x9/0x20
>> [346278.191092] ? irqentry_exit+0x19/0x30
>> [346278.191095] ? exc_page_fault+0x89/0x160
>> [346278.191097] ? asm_exc_page_fault+0x8/0x30
>> [346278.191102] entry_SYSCALL_64_after_hwframe+0x44/0xae
>>
>> The task was waiting for the refcount become to 1, but from the vmcore,
>> we found the refcount has already been 1. It seems that the task didn't
>> get woken up by perf_event_release_kernel() and got stuck forever. The
>> below scenario may cause the problem.
>>
>> Thread A					Thread B
>> ...						...
>> perf_event_free_task				perf_event_release_kernel
>> 						   ...
>> 						   acquire event->child_mutex
>> 						   ...
>> 						   get_ctx
>>    ...						   release event->child_mutex
>>    acquire ctx->mutex
>>    ...
>>    perf_free_event (acquire/release event->child_mutex)
>>    ...
>>    release ctx->mutex
>>    wait_var_event
>> 						   acquire ctx->mutex
>> 						   acquire event->child_mutex
>> 						   # move existing events to free_list
>> 						   release event->child_mutex
>> 						   release ctx->mutex
>> 						   put_ctx
>> ...						...
>>
>> In this case, all events of the ctx have been freed, so we couldn't
>> find the ctx in free_list and Thread A will miss the wakeup. It's thus
>> necessary to add a wakeup after dropping the reference.
>>
>> Fixes: 1cf8dfe8a661 ("perf/core: Fix race between close() and fork()")
>> Signed-off-by: Haifeng Xu <haifeng.xu@shopee.com>
>> Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
> 
> FWIW, this looks good to me, but I haven't yet been able to write a test to
> exercise this: perf_event_free_task() is only called if
> perf_event_init_context() fails or of copy_process() fails partway through, and
> while it should be possible to make the latter fail consistently by messing
> with cgroups, I haven't had the time to work all that out.
> 

Hi, Mark.

This problem seems similar to this bug reported by syzbot.
https://lore.kernel.org/all/00000000000057102e058e722bba@google.com/T/#mbb1d50748ff3190738a9754bdff118e640fbb3a3

> So I think there's a reliable DoS here, but I haven't had the time to go write
> that myself. It would be nice if we actually had a test for this.
> 
> I reckon that in addition to the Fixes tag, this needs:
> 
> Cc: stable@vger.kernel.org
> 

Ok, I'll add this tag next version.

>> ---
>> Changes since v1:
>> - Add the fixed tag.
>> - Simplify v1's patch. (Frederic)
>>
>> Changes since v2:
>> - Use Reviewed-by tag instead of Signed-off-by tag.
>> ---
>>  kernel/events/core.c | 13 +++++++++++++
>>  1 file changed, 13 insertions(+)
>>
>> diff --git a/kernel/events/core.c b/kernel/events/core.c
>> index 4f0c45ab8d7d..15c35070db6a 100644
>> --- a/kernel/events/core.c
>> +++ b/kernel/events/core.c
>> @@ -5340,6 +5340,7 @@ int perf_event_release_kernel(struct perf_event *event)
>>  again:
>>  	mutex_lock(&event->child_mutex);
>>  	list_for_each_entry(child, &event->child_list, child_list) {
>> +		void *var = NULL;
>>  
>>  		/*
>>  		 * Cannot change, child events are not migrated, see the
>> @@ -5380,11 +5381,23 @@ int perf_event_release_kernel(struct perf_event *event)
>>  			 * this can't be the last reference.
>>  			 */
>>  			put_event(event);
>> +		} else {
>> +			var = &ctx->refcount;
>>  		}
>>  
>>  		mutex_unlock(&event->child_mutex);
>>  		mutex_unlock(&ctx->mutex);
>>  		put_ctx(ctx);
>> +
>> +		if (var) {
>> +			/*
>> +			 * If perf_event_free_task() has deleted all events from the
>> +			 * ctx while the child_mutex got released above, make sure to
>> +			 * notify about the preceding put_ctx().
>> +			 */
>> +			smp_mb(); /* pairs with wait_var_event() */
>> +			wake_up_var(var);
>> +		}
>>  		goto again;
>>  	}
>>  	mutex_unlock(&event->child_mutex);
> 
> I was a bit worrited that we're doing the wakeup with the event->child_mutex
> held; 

Actually the event->child_mutex has been released before doing the wakeup.


AFAICT that looks to be safe, but I'm not a scheduler expert.
> 
> FWIW:
> 
> Acked-by: Mark Rutland <mark.rutland@arm.com>
> 
> Mark.

Thanks!

> 
>> -- 
>> 2.25.1
>>

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2024-05-13  9:08 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-04-18 11:42 [PATCH v3] perf/core: Fix missing wakeup when waiting for context reference Haifeng Xu
2024-04-18 15:46 ` Frederic Weisbecker
2024-04-23 10:05 ` Mark Rutland
2024-05-13  9:08   ` Haifeng Xu

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).