Crash in crc32_le during kmemleak

LKML Archive mirror
 help / color / mirror / Atom feed

* Crash in crc32_le during kmemleak_scan()
@ 2015-05-13 13:15 vigneshr
  2015-05-14 11:21 ` Catalin Marinas
  2015-05-20 13:16 ` vigneshr
  0 siblings, 2 replies; 9+ messages in thread
From: vigneshr @ 2015-05-13 13:15 UTC (permalink / raw
  To: linux-kernel; +Cc: catalin.marinas, bernd.schubert

We are seeing a panic in crc32_le after kmemleak_scan(). I have pasted the
snippet of the crash log below. This is seen on 3.10 Kernel.

This issue was earlier discussed over this thread as well -
https://lkml.org/lkml/2014/2/6/287.


<4>[   70.732606] kmemleak: Cannot allocate a kmemleak_object structure
<6>[   70.739803] kmemleak: Kernel memory leak detector disabled
<6>[   70.740831] lowmemorykiller: Killing 'id.gms.wearable' (7004), adj 647,
<6>[   70.740831]    to free 29144kB on behalf of 'kswapd0' (95) because
<6>[   70.740831]    cache 216560kB is below limit 225000kB for
oom_score_adj 529
<6>[   70.740831]    Free memory is 5788kB above reserved.
<6>[   70.740831]    Free CMA is 107248kB
<6>[   70.740831]    Total reserve is 17856kB
<6>[   70.740831]    Total free pages is 113036kB
<6>[   70.740831]    Total file cache is 217632kB
<6>[   70.740831]    Slab Reclaimable is 17104kB
<6>[   70.740831]    Slab UnReclaimable is 117508kB
<6>[   70.740831]    Total Slab is 134612kB
<6>[   70.740831]    GFP mask is 0xd0
<6>[   70.768877] lowmemorykiller: Killing 'le.android.talk' (7955), adj 529,
<6>[   70.768877]    to free 39124kB on behalf of 'kswapd0' (95) because
<6>[   70.768877]    cache 211088kB is below limit 225000kB for
oom_score_adj 529
<6>[   70.768877]    Free memory is 4716kB above reserved.
<6>[   70.768877]    Free CMA is 108048kB
<6>[   70.768877]    Total reserve is 17856kB
<6>[   70.768877]    Total free pages is 112764kB
<6>[   70.768877]    Total file cache is 212304kB
<6>[   70.768877]    Slab Reclaimable is 17104kB
<6>[   70.768877]    Slab UnReclaimable is 117408kB
<6>[   70.768877]    Total Slab is 134512kB
<6>[   70.768877]    GFP mask is 0xd0
<1>[   70.935886] Unable to handle kernel paging request at virtual
address ffffff8001119000
<1>[   70.942768] pgd = ffffffc00007d000
<1>[   70.946150] [ffffff8001119000] *pgd=000000008c307003,
*pmd=00000000b2b96003, *pte=0000000000000000
<0>[   70.955094] Internal error: Oops: 96000007 [#1] PREEMPT SMP
<6>[   70.960648] Modules linked in: wlan(O)
<6>[   70.964383] CPU: 2 PID: 198 Comm: kmemleak Tainted: G           O
3.10.49-gcb3457b-00006-g72e5422 #1
<6>[   70.973497] task: ffffffc030b24b40 ti: ffffffc03019c000 task.ti:
ffffffc03019c000
<6>[   70.980966] PC is at crc32_le+0x78/0x140
<6>[   70.984873] LR is at kmemleak_scan+0x2d0/0x3c8
<6>[   70.989296] pc : [<ffffffc00044ea30>] lr : [<ffffffc0002f088c>]
pstate: 800001c5
<6>[   70.996671] sp : ffffffc03019fd60
<6>[   70.999972] x29: ffffffc03019fd70 x28: 0000000000000000
<6>[   71.005267] x27: 0000000000000000 x26: 0000000000000000
<6>[   71.010557] x25: ffffffc00168d6e8 x24: 0000000000000000
<6>[   71.028876] x23: 0000000000000140 x22: ffffffc0014e23d8
<6>[   71.034168] x21: ffffffc0014e23a0 x20: ffffffc0014e2338
<6>[   71.039464] x19: ffffffc002a02640 x18: 000000000b220dc1
<6>[   71.044757] x17: 0000000000000400 x16: 0000000000000800
<6>[   71.050053] x15: 0000000000000c00 x14: 0000000000001400
<6>[   71.064462] x13: 0000000000001800 x12: 0000000000001c00
<6>[   71.069756] x11: 0000000000001000 x10: ffffffc0014c2000
<6>[   71.075052] x9 : 0000000000000000 x8 : 0000000000000000
<6>[   71.080347] x7 : ffffff8001118ffc x6 : 00000000000012aa
<6>[   71.085642] x5 : 0000000000000000 x4 : ffffffc0014e23d8
<6>[   71.100052] x3 : 0000000000000000 x2 : 00000000000012aa
<6>[   71.105348] x1 : ffffff8001118ffc x0 : 0000000000000000
<6>[   71.115590] e930  97fffff053107e60 a94153f333103e80 d65f03c0a8c27bfd
d280000453003c00 eb02009f90003f65 3864682354000120 91000484913ea0a6
12001c634a000063
<6>[   71.138421] e970  4a4020607863d8c3 d65f03c017fffff7 d280000453003c00
eb02009fb0003f65 3864682354000120 910004849106a0a6 12001c634a000063
4a4020607863d8c3
<snip> <snip>
<6>[   71.468288] fde0  ffffffc03019fe10 ffffffc00023bb1c ffffffc03302fd30
ffffffc001691738 ffffffc000f61dd0 ffffffc00023bb00 0000000000000000
ffffffc000203ba0
<6>[   71.482001] fe20  ffffffc00023ba6c ffffffc03302fd30 0000000000000000
0000000000000000 0000000000000000 ffffffc000247048 ffffffc03019fea0
0000000000000000
<snip> <snip>
<6>[   71.495716]
<0>[   71.497201] Process kmemleak (pid: 198, stack limit =
0xffffffc03019c058)
<6>[   71.503968] Call trace:
<6>[   71.506402] [<ffffffc00044ea30>] crc32_le+0x78/0x140
<6>[   71.511356] [<ffffffc0002f0cc0>] kmemleak_scan_thread+0x74/0xc0 <6>[
  71.517249] [<ffffffc00023bb18>] kthread+0xac/0xb8
<0>[   71.522027] Code: d281800f d2810010 d2808011 b4000466 (b94004e3)
<4>[   71.528180] ---[ end trace 0387f1db53e12801 ]---
<0>[   73.779697] Rebooting in 5 seconds..
<5>[   78.782326] Going down for restart now


Analysis so far :

1. We see that this is during low memory conditions, when kmemleak_object
structure allocation fails (happens at around 70.732606 seconds after
boot).

2. Immediately as per code, we go ahead and call kmemleak_disable() which
is an irreversible operation and we see the message at 70.739803 seconds.

Now we see that kmemleak_scan() has caused the panic here at around
70.935886 seconds, even though disable was called earlier (0.2 seconds
before).

So the question was why it was getting triggered/running even after we
disable it.
So we started seeing if kmemleak_disable() bailed out and if yes, at what
stage via post mortem analysis.

  1766 static void kmemleak_disable(void)
  1767 {
  1768     /* atomically check whether it was already invoked */
  1769     if (cmpxchg(&kmemleak_error, 0, 1))
  1770         return;
  1771
  1772     /* stop any memory operation tracing */
  1773     kmemleak_enabled = 0;
  1774
  1775     /* check whether it is too early for a kernel thread */ 1776   
 if (kmemleak_initialized)
  1777         schedule_work(&cleanup_work);
  1778
  1779     pr_info("Kernel memory leak detector disabled\n");
  1780 }


1. From Line 1769, it would have bailed out if the kmemleak_error were
already set to 1 in which case , we would have already disabled it,
therefore this would have been 0 and then changed to 1.

2. From the post mortem raw dumps, we see that kmemleak_enabled was indeed
set to 0. Therefore we had successfully come here.

3. After the above, we see that we have scheduled a work - cleanup_work,
which does the following :

  1747 static void kmemleak_do_cleanup(struct work_struct *work)
  1748 {
  1749     mutex_lock(&scan_mutex);
  1750     stop_scan_thread();
  1751
  1752     if (!kmemleak_found_leaks)
  1753         __kmemleak_do_cleanup();
  1754     else
  1755         pr_info("Kmemleak disabled without freeing internal data. "
1756             "Reclaim the memory with \"echo clear >
/sys/kernel/debug/kmemleak\"\n");
  1757     mutex_unlock(&scan_mutex);
  1758 }

If it had gone inside stop_scan_thread() from line 1750 routine, we should
have stopped the kmemleak kthread and the variable scan_thread should have
been null :

a. Postmortem analysis says otherwise.
b. We see that kmemleak thread did not stop since it executed after 0.2
seconds after we disabled.
c. Additionally we saw that scan_thread variable is perfectly intact and
not NULL. (snippet pasted below)

    scan_thread = 0xFFFFFFC030B24B40 -> (
    state = 0,
    stack = 0xFFFFFFC03019C000,
    usage = (counter = 2),
    flags = 2129984,
    ptrace = 0,
    wake_entry = (next = 0x0),
    on_cpu = 1,
    on_rq = 1,
    prio = 130,
    static_prio = 130,
    normal_prio = 130,
    rt_priority = 0,
    sched_class = 0xFFFFFFC000C03280,
    se = (load = (weight = 110, inv_weight = 39045157), run_node =
(__rb_parent_
    rt = (run_list = (next = 0xFFFFFFC030B24D08, prev =
0xFFFFFFC030B24D08), tim
    ravg = (mark_start = 71528155165, sum = 8467458, demand = 10000000,
sum_hist
    sched_task_group = 0xFFFFFFC00D53F200,
    fpu_counter = 0,
    policy = 0,
    nr_cpus_allowed = 8,


Therefore, we checked the weird possibilty of the work function getting
stuck in acquiring scan_mutex lock which would explain the fact that
stop_scan_thread() did not execute and it
indeed turned out be that way.

  Serving work : (struct work_struct *)0xFFFFFFC0014E23B0
  work_func : \\vmlinux\kmemleak\kmemleak_do_cleanup
  serving worker : kworker/3:1 PID : 32 on CPU : 3

Stack of kworker/3:1 :

-000|__switch_to(prev = 0xFFFFFFC032A60AC0, next = 0xFFFFFFC0014E2360)
-001|context_switch(inline)
-001|__schedule()
-002|schedule()
-003|schedule_preempt_disabled()
-004|__mutex_lock_common(inline)
-004|__mutex_lock_slowpath(lock_count = 0xFFFFFFC0014E2358)
-005|current_thread_info(inline)
-005|mutex_set_owner(inline)
-005|mutex_lock(lock = 0xFFFFFFC0014E2358)
-006|kmemleak_do_cleanup(?)
-007|static_key_false(inline)
-007|trace_workqueue_execute_end(inline)
-007|process_one_work(worker = 0xFFFFFFC032A0D000, work =
0xFFFFFFC0014E23B0)
-008|worker_thread(__worker = 0xFFFFFFC032A0D000)
-009|kthread(_create = 0xFFFFFFC032E03C80)
-010|ret_from_fork(asm)
-->|exception
-011|create_kthread(inline)
-011|kthreadd(?)
---|end of frame


We can see that it went into sleep trying to acquire the mutex lock and it
never came back and last arrival backs this claim.

    sched_info = (
      pcount = 1179,
      run_delay = 209410806,
      last_arrival = 70739794902, ----------------> Last it got a chance
to execute at 70.73 seconds when we called disable and its stuck
there.


Now when we check who is the owner of this mutex lock, we see that
kmemleak scan thread holds it

    |__mutex_lock_slowpath(
    |    lock_count = 0xFFFFFFC0014E2358)
    |  lock = 0xFFFFFFC0014E2358 -> (
    |    count = (counter = -1),
    |    wait_lock = (rlock = (raw_lock = (owner = 4, next = 4), magic =
3735899
    |    wait_list = (next = 0xFFFFFFC032A6BCD0, prev =
0xFFFFFFC032A6BCD0),
    |    owner = 0xFFFFFFC030B24B40, --> Corresponds to task_struct
of kmemleak
    |    name = 0x0,
    |    magic = 0xFFFFFFC0014E2358)


We tried the following as well to stop the scan thread first and then
schedule work but it did not help :


diff --git a/mm/kmemleak.c b/mm/kmemleak.c
index 0bd522f..6105dfe 100644
--- a/mm/kmemleak.c
+++ b/mm/kmemleak.c
@@ -1460,7 +1460,6 @@ static void start_scan_thread(void)
static void stop_scan_thread(void)
{
        if (scan_thread) {
-               kthread_stop(scan_thread);
                scan_thread = NULL;
        }
}
@@ -1747,6 +1746,9 @@ static void kmemleak_disable(void)
        /* stop any memory operation tracing */
        atomic_set(&kmemleak_enabled, 0);

+       if(scan_thread)
+               kthread_stop(scan_thread);
+
        /* check whether it is too early for a kernel thread */



Therefore in summary, the cleanup_work scheduled after kmemleak_disable()
is stuck in acquiring scan_mutex and this mutex lock is held by kmemleak
scan thread.

In the above stated case we have seen the time difference between the
disable call and crash is close to 0.2 seconds, but we have other
instances where it took more time like close to a second or so :

===
    81.797655:   <6> kmemleak: Cannot allocate a kmemleak_object structure
81.830707:   <6> kmemleak: Kernel memory leak detector disabled
                                                <snip> <snip>
    83.015406:   <6> Unable to handle kernel paging request at virtual
address ffffff80013bd000
    83.022284:   <6> pgd = ffffffc00007d000
    83.025666:   <2> [ffffff80013bd000] *pgd=000000008c307003,
*pmd=00000000b24af003, *pte=0000000000000000
    83.034610:   <6> Internal error: Oops: 96000007 [#1] PREEMPT SMP
83.040163:   <2> Modules linked in: wlan(O)
    83.043899:   <6> CPU: 2 PID: 204 Comm: kmemleak Tainted: G           O
3.10.49-g7ed4272-00237-g8f9114d #1
    83.053011:   <6> task: ffffffc030b54080 ti: ffffffc030214000 task.ti:
ffffffc030214000
    83.060481:   <2> PC is at crc32_le+0x78/0x140
    83.064384:   <2> LR is at kmemleak_scan+0x2d0/0x3d4


====


    97.092890:   <6> kmemleak: Cannot allocate a kmemleak_object structure
97.098695:   <6> kmemleak: Kernel memory leak detector disabled
                                                <snip> <snip>
    98.149826:   <6> Unable to handle kernel paging request at virtual
address ffffff800038a000
    98.156704:   <6> pgd = ffffffc00007d000
    84.347644:   <2> msm_pm_qos_update_request: update request -1
    98.160088:   <4> [ffffff800038a000] *pgd=000000008c307003
    98.164862:   <2> , *pmd=00000000b2a36003, *pte=0000000000000000
98.164874:   <6> Internal error: Oops: 96000007 [#1] PREEMPT SMP
98.170418:   <2> Modules linked in:
    98.173462:   <6> CPU: 2 PID: 203 Comm: kmemleak Not tainted
3.10.49-g18d73a2 #1
    98.180317:   <6> task: ffffffc0304bd600 ti: ffffffc030204000 task.ti:
ffffffc030204000
    98.187784:   <2> PC is at crc32_le+0x78/0x140
    98.191688:   <2> LR is at kmemleak_scan+0x2d0/0x3d4

====

Any debug pointers/suggestions for the same would be really helpful.

Thanks and regards,
Vignesh Radhakrishnan

-- 
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux
Foundation Collaborative Project





^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: Crash in crc32_le during kmemleak_scan()
  2015-05-13 13:15 Crash in crc32_le during kmemleak_scan() vigneshr
@ 2015-05-14 11:21 ` Catalin Marinas
  2015-05-15  4:53   ` vigneshr
  2015-05-20 13:16 ` vigneshr
  1 sibling, 1 reply; 9+ messages in thread
From: Catalin Marinas @ 2015-05-14 11:21 UTC (permalink / raw
  To: vigneshr@codeaurora.org
  Cc: linux-kernel@vger.kernel.org, bernd.schubert@itwm.fraunhofer.de

On Wed, May 13, 2015 at 02:15:08PM +0100, vigneshr@codeaurora.org wrote:
> We are seeing a panic in crc32_le after kmemleak_scan(). I have pasted the
> snippet of the crash log below. This is seen on 3.10 Kernel.
> 
> This issue was earlier discussed over this thread as well -
> https://lkml.org/lkml/2014/2/6/287.

Thanks for raising this again and the in depth analysis.

> We tried the following as well to stop the scan thread first and then
> schedule work but it did not help :
> 
> 
> diff --git a/mm/kmemleak.c b/mm/kmemleak.c
> index 0bd522f..6105dfe 100644
> --- a/mm/kmemleak.c
> +++ b/mm/kmemleak.c
> @@ -1460,7 +1460,6 @@ static void start_scan_thread(void)
> static void stop_scan_thread(void)
> {
>         if (scan_thread) {
> -               kthread_stop(scan_thread);
>                 scan_thread = NULL;
>         }
> }
> @@ -1747,6 +1746,9 @@ static void kmemleak_disable(void)
>         /* stop any memory operation tracing */
>         atomic_set(&kmemleak_enabled, 0);
> 
> +       if(scan_thread)
> +               kthread_stop(scan_thread);
> +
>         /* check whether it is too early for a kernel thread */

kmemleak_disable() may be called in an atomic context, so calling
kthread_stop() here is not safe. We have a scan_should_stop() function
which checks for the kmemleak_enabled variable but it doesn't seem to be
enough.

Basically the object_list has some vmalloc'ed objects. Scanning such
objects is protected by the kmemleak_object.lock and the look-up by the
kmemleak_lock. What happens during kmemleak_disable() is that we set
kmemleak_enable to 0 and kmemleak_free() simply exits. When this happens
during a scan, objects in the object_list are freed/vunmap'ed but
kmemleak doesn't know about this until the clean-up completes (which, as
you found, may be blocked on the scanning to complete).

A patch I had but never managed to test it properly (as in reproducing
the low mem conditions during a scan) postpones the kmemleak disabling
until after the clean-up is finished. If it works for you, I'll add a
proper commit message:

-----8<-------------------------

diff --git a/mm/kmemleak.c b/mm/kmemleak.c
index 5405aff5a590..dcba05812678 100644
--- a/mm/kmemleak.c
+++ b/mm/kmemleak.c
@@ -521,6 +521,10 @@ static struct kmemleak_object *create_object(unsigned long ptr, size_t size,
 	struct kmemleak_object *object, *parent;
 	struct rb_node **link, *rb_parent;
 
+	/* stop further allocations while kmemleak is being disabled */
+	if (kmemleak_error)
+		return NULL;
+
 	object = kmem_cache_alloc(object_cache, gfp_kmemleak_mask(gfp));
 	if (!object) {
 		pr_warning("Cannot allocate a kmemleak_object structure\n");
@@ -741,6 +745,10 @@ static void add_scan_area(unsigned long ptr, size_t size, gfp_t gfp)
 	struct kmemleak_object *object;
 	struct kmemleak_scan_area *area;
 
+	/* stop further allocations while kmemleak is being disabled */
+	if (kmemleak_error)
+		return;
+
 	object = find_and_get_object(ptr, 1);
 	if (!object) {
 		kmemleak_warn("Adding scan area to unknown object at 0x%08lx\n",
@@ -1127,7 +1135,7 @@ static bool update_checksum(struct kmemleak_object *object)
  */
 static int scan_should_stop(void)
 {
-	if (!kmemleak_enabled)
+	if (kmemleak_error)
 		return 1;
 
 	/*
@@ -1755,6 +1763,10 @@ static void kmemleak_do_cleanup(struct work_struct *work)
 		pr_info("Kmemleak disabled without freeing internal data. "
 			"Reclaim the memory with \"echo clear > /sys/kernel/debug/kmemleak\"\n");
 	mutex_unlock(&scan_mutex);
+
+	/* stop any memory operation tracing */
+	kmemleak_enabled = 0;
+
 }
 
 static DECLARE_WORK(cleanup_work, kmemleak_do_cleanup);
@@ -1769,12 +1781,11 @@ static void kmemleak_disable(void)
 	if (cmpxchg(&kmemleak_error, 0, 1))
 		return;
 
-	/* stop any memory operation tracing */
-	kmemleak_enabled = 0;
-
 	/* check whether it is too early for a kernel thread */
 	if (kmemleak_initialized)
 		schedule_work(&cleanup_work);
+	else
+		kmemleak_enabled = 0;
 
 	pr_info("Kernel memory leak detector disabled\n");
 }

-- 
Catalin

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: Crash in crc32_le during kmemleak_scan()
  2015-05-14 11:21 ` Catalin Marinas
@ 2015-05-15  4:53   ` vigneshr
  0 siblings, 0 replies; 9+ messages in thread
From: vigneshr @ 2015-05-15  4:53 UTC (permalink / raw
  To: Catalin Marinas
  Cc: vigneshr@codeaurora.org, linux-kernel@vger.kernel.org,
	bernd.schubert@itwm.fraunhofer.de

>
> kmemleak_disable() may be called in an atomic context, so calling
> kthread_stop() here is not safe. We have a scan_should_stop() function
> which checks for the kmemleak_enabled variable but it doesn't seem to be
> enough.
>
> Basically the object_list has some vmalloc'ed objects. Scanning such
> objects is protected by the kmemleak_object.lock and the look-up by the
> kmemleak_lock. What happens during kmemleak_disable() is that we set
> kmemleak_enable to 0 and kmemleak_free() simply exits. When this happens
> during a scan, objects in the object_list are freed/vunmap'ed but
> kmemleak doesn't know about this until the clean-up completes (which, as
> you found, may be blocked on the scanning to complete).
>
> A patch I had but never managed to test it properly (as in reproducing
> the low mem conditions during a scan) postpones the kmemleak disabling
> until after the clean-up is finished. If it works for you, I'll add a
> proper commit message:
>
> -----8<-------------------------
>
> diff --git a/mm/kmemleak.c b/mm/kmemleak.c
> index 5405aff5a590..dcba05812678 100644
> --- a/mm/kmemleak.c
> +++ b/mm/kmemleak.c
> @@ -521,6 +521,10 @@ static struct kmemleak_object *create_object(unsigned
> long ptr, size_t size,
>  	struct kmemleak_object *object, *parent;
>  	struct rb_node **link, *rb_parent;
>
> +	/* stop further allocations while kmemleak is being disabled */
> +	if (kmemleak_error)
> +		return NULL;
> +
>  	object = kmem_cache_alloc(object_cache, gfp_kmemleak_mask(gfp));
>  	if (!object) {
>  		pr_warning("Cannot allocate a kmemleak_object structure\n");
> @@ -741,6 +745,10 @@ static void add_scan_area(unsigned long ptr, size_t
> size, gfp_t gfp)
>  	struct kmemleak_object *object;
>  	struct kmemleak_scan_area *area;
>
> +	/* stop further allocations while kmemleak is being disabled */
> +	if (kmemleak_error)
> +		return;
> +
>  	object = find_and_get_object(ptr, 1);
>  	if (!object) {
>  		kmemleak_warn("Adding scan area to unknown object at 0x%08lx\n",
> @@ -1127,7 +1135,7 @@ static bool update_checksum(struct kmemleak_object
> *object)
>   */
>  static int scan_should_stop(void)
>  {
> -	if (!kmemleak_enabled)
> +	if (kmemleak_error)
>  		return 1;
>
>  	/*
> @@ -1755,6 +1763,10 @@ static void kmemleak_do_cleanup(struct work_struct
> *work)
>  		pr_info("Kmemleak disabled without freeing internal data. "
>  			"Reclaim the memory with \"echo clear >
> /sys/kernel/debug/kmemleak\"\n");
>  	mutex_unlock(&scan_mutex);
> +
> +	/* stop any memory operation tracing */
> +	kmemleak_enabled = 0;
> +
>  }
>
>  static DECLARE_WORK(cleanup_work, kmemleak_do_cleanup);
> @@ -1769,12 +1781,11 @@ static void kmemleak_disable(void)
>  	if (cmpxchg(&kmemleak_error, 0, 1))
>  		return;
>
> -	/* stop any memory operation tracing */
> -	kmemleak_enabled = 0;
> -
>  	/* check whether it is too early for a kernel thread */
>  	if (kmemleak_initialized)
>  		schedule_work(&cleanup_work);
> +	else
> +		kmemleak_enabled = 0;
>
>  	pr_info("Kernel memory leak detector disabled\n");
>  }
>
> --
> Catalin
> --

Thank you for the explanation and the patch. I will give this a shot and
will get back with the results.

--
Thanks and regards,
Vignesh Radhakrishnan


QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member
of the Code Aurora Forum, hosted by The Linux Foundation.



^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Crash in crc32_le during kmemleak_scan()
  2015-05-13 13:15 Crash in crc32_le during kmemleak_scan() vigneshr
  2015-05-14 11:21 ` Catalin Marinas
@ 2015-05-20 13:16 ` vigneshr
  2015-05-21  5:45   ` vigneshr
  1 sibling, 1 reply; 9+ messages in thread
From: vigneshr @ 2015-05-20 13:16 UTC (permalink / raw
  To: vigneshr; +Cc: linux-kernel, catalin.marinas, bernd.schubert

Hi,

Sharing an update. We got back the test results. As such we did not
observe the bug, but we hit another side effect.

After kmemleak was disabled due to low memory

    96.825883:   <6> kmemleak: Cannot allocate a kmemleak_object structure
    96.825902:   <6> kmemleak: Cannot allocate a kmemleak_object structure
    96.825923:   <6> kmemleak: Kernel memory leak detector disabled


We saw the following warning appearing in the dmesg :

    97.023202:   <6> kmemleak: Trying to color unknown object at
0xffffffc01f0b2c80 as Grey
    97.029735:   <6> CPU: 0 PID: 49 Comm: kworker/u8:3 Tainted: G        W
 O 3.10.49-gde07ba9-00027-g9ce640d #1
    97.029751:   <2> Workqueue: diag_wq diag_read_smd_work_fn
    97.029758:   <2> Call trace:
    97.029769:   <2> [<ffffffc000206a2c>] dump_backtrace+0x0/0x270
    97.029777:   <2> [<ffffffc000206cac>] show_stack+0x10/0x1c
    97.029786:   <2> [<ffffffc000bf0eec>] dump_stack+0x1c/0x28
    97.029794:   <2> [<ffffffc0002f1da8>] paint_ptr+0x64/0xb8
    97.029801:   <2> [<ffffffc000be885c>] kmemleak_not_leak+0x64/0x98
    97.029808:   <2> [<ffffffc00056bd94>] diagmem_alloc+0x148/0x174
    97.029816:   <2> [<ffffffc00056b780>] diag_usb_write+0x80/0x180
    97.029823:   <2> [<ffffffc00056ac38>] diag_mux_write+0x38/0x4c
    97.029830:   <2> [<ffffffc0005682b0>]
diag_process_smd_read_data+0x270/0x2e4
    97.029837:   <2> [<ffffffc00056889c>] diag_smd_send_req+0x578/0x5b0
    97.029844:   <2> [<ffffffc0005688e0>] diag_read_smd_work_fn+0xc/0x18
    97.029852:   <2> [<ffffffc00023589c>] process_one_work+0x258/0x3d0
    97.029858:   <2> [<ffffffc0002368dc>] worker_thread+0x1f0/0x340
    97.029867:   <2> [<ffffffc00023bdc8>] kthread+0xac/0xb8

This should be fixed by the addition of the following to the patch that
was provided earlier :

--------------------- 8< --------------------------------


diff --git a/mm/kmemleak.c b/mm/kmemleak.c
index 5ec8b71..7d67c87 100644
--- a/mm/kmemleak.c
+++ b/mm/kmemleak.c
@@ -1019,7 +1019,7 @@ void __ref kmemleak_not_leak(const void *ptr)
 {
        pr_debug("%s(0x%p)\n", __func__, ptr);

-       if (kmemleak_enabled && ptr && !IS_ERR(ptr))
+       if (kmemleak_enabled && ptr && !IS_ERR(ptr) && !kmemleak_error)
                make_gray_object((unsigned long)ptr);
        else if (kmemleak_early_log)
                log_early(KMEMLEAK_NOT_LEAK, ptr, 0, 0);






--------------------- 8< --------------------------------



However, the more concerning thing was that we crashed later at around 100
seconds. I have pasted the snippet of the crash log below :



   100.946902:   <6> Unable to handle kernel NULL pointer dereference at
virtual address 00000000
   100.953953:   <6> pgd = ffffffc00007d000
   100.957338:   <2> [00000000] *pgd=000000008c307003,
*pmd=000000008c308003, *pte=00e000000b000407
   100.965587:   <6> Internal error: Oops: 96000006 [#1] PREEMPT SMP
   100.971139:   <2> Modules linked in: wlan(O)
   100.974876:   <6> CPU: 0 PID: 246 Comm: kworker/0:2 Tainted: G        W
 O 3.10.49-gde07ba9-00027-g9ce640d #1
   100.984254:   <2> Workqueue: events kmemleak_do_cleanup
   100.988936:   <6> task: ffffffc02e7d9580 ti: ffffffc02da0c000 task.ti:
ffffffc02da0c000
   100.996404:   <2> PC is at rb_erase+0x224/0x344
   101.000393:   <2> LR is at __delete_object+0x3c/0xc8
   101.004820:   <2> pc : [<ffffffc000444e48>] lr : [<ffffffc0002f1afc>]
pstate: 600001c5
   101.012198:   <2> sp : ffffffc02da0fcb0
   101.015497:   <2> x29: ffffffc02da0fcb0 x28: 0000000000000009
   101.020792:   <2> x27: ffffffc0015f9000 x26: 0000000000000000
   101.026086:   <2> x25: ffffffc034ad2a40 x24: ffffffc02da0c000
   101.031380:   <2> x23: ffffffc034ad7b00 x22: 0000000000000000
   101.036676:   <2> x21: 0000000000000140 x20: ffffffc0014e26b8
   101.041972:   <2> x19: ffffffc0236b03c0 x18: 0000000000000000
   101.047266:   <2> x17: 0000007f973e7f8c x16: ffffffc00030447c
   101.052561:   <2> x15: 0000000000000000 x14: 0000000066666667
   101.057857:   <2> x13: 000000000000000a x12: 0000007f7c95e9f0
   101.063152:   <2> x11: 731a976167c4c606 x10: 0000000000000013
   101.068447:   <2> x9 : 0000000000127500 x8 : 0000001780764fe3
   101.073743:   <2> x7 : ffffffc0236b1bc0 x6 : 0000000000000000
   101.079038:   <2> x5 : ffffffc02eeb6f41 x4 : 0000000000000000
   101.084333:   <2> x3 : 0000000000000000 x2 : ffffffc02eeb6f40
   101.089628:   <2> x1 : ffffffc0016e05b8 x0 : 0000000000000000


	<SNIP .................>

   101.454125:   <6> Process kworker/0:2 (pid: 246, stack limit =
0xffffffc02da0c058)
   101.461155:   <2> Call trace:
   101.463590:   <2> [<ffffffc000444e48>] rb_erase+0x224/0x344
   101.468624:   <2> [<ffffffc0002f1c6c>] delete_object_full+0x1c/0x34
   101.474353:   <2> [<ffffffc0002f1cbc>] __kmemleak_do_cleanup+0x38/0x54
   101.480342:   <2> [<ffffffc0002f1d10>] kmemleak_do_cleanup+0x38/0x6c
   101.486160:   <2> [<ffffffc00023589c>] process_one_work+0x258/0x3d0
   101.491887:   <2> [<ffffffc0002368dc>] worker_thread+0x1f0/0x340
   101.497359:   <2> [<ffffffc00023bdc8>] kthread+0xac/0xb8
   101.502132:   <6> Code: f9000046 14000040 f9400843 aa0303e0 (f9400064)


Analysis so far :

I haven't been able to figure out exactly how the dots are connected and
the relation to the change provided earlier but the following is what i
could put together :

When i checked the post mortem analysis i see the following in the
crashing stack :

-018|rb_erase(
    |    [locdesc] node = ?,
    |    [X1] root = 0xFFFFFFC0016E05B8)
    |  [X1] root = 0xFFFFFFC0016E05B8
    |  [locdesc] node = ?
    |  [X2] parent = 0xFFFFFFC02EEB6F40
    |  [X4] node = 0x0
    |  [X3] sibling = 0x0
-019|__delete_object(
    |    [X19] object = 0xFFFFFFC0236B03C0)
    |  [X21] flags = 0x0140
-020|delete_object_full(
    |  ?)
    |  [X19] object = 0xFFFFFFC0236B03C0
-021|__kmemleak_do_cleanup()
    |  [X19] object = 0xFFFFFFC0236B03C0 -> (
    |    [0xFFFFFFC0236B03C0] lock = ([0xFFFFFFC0236B03C0] rlock =
([0xFFFFFFC0236B03C0] raw_lock = ([0xFFFFFFC0236B03C0] owner
    |    [0xFFFFFFC0236B03D8] flags = 0x0,
    |    [0xFFFFFFC0236B03E0] object_list = ([NSD:0xFFFFFFC0236B03E0] next
= 0xFFFFFFC0236B1920, [NSD:0xFFFFFFC0236B03E8] prev = 0x00200200),
    |    [0xFFFFFFC0236B03F0] gray_list = ([0xFFFFFFC0236B03F0] next =
0x00100100, [0xFFFFFFC0236B03F8] prev = 0x00200200),
	 [0xFFFFFFC0236B0400] rb_node = (
      			[0xFFFFFFC0236B0400] __rb_parent_color = 0xFFFFFFC02EEB6F41,
			[0xFFFFFFC0236B0408] rb_right = 0x0,
			[0xFFFFFFC0236B0410] rb_left = 0x0


The node is becoming 0x0 here and is crashing in the following piece of
code in rb_erase:


  309         } else {
  310             sibling = parent->rb_left;
  311             if (rb_is_red(sibling)) {

and rb_is_red(sibling) resolves into :

96 #define rb_is_red(rb)      __rb_is_red((rb)->__rb_parent_color)



1. Here the parent ptr refers to Object pointer residing at
0xFFFFFFC0236B03C0 and we can see that rb_node->rb_left is becoming 0x0 (
this is from the values observed in the crashed stack)

		 [0xFFFFFFC0236B0400] rb_node = (
      			[0xFFFFFFC0236B0400] __rb_parent_color = 0xFFFFFFC02EEB6F41,
			[0xFFFFFFC0236B0408] rb_right = 0x0,
			[0xFFFFFFC0236B0410] rb_left = 0x0

2. Therefore the sibling will take the value 0x0

3. And then when rb->__rb_parent_color operation takes place, we crash
since its not able to resolve 0x0 (since rb=0x0 here).


So the question is how this variable ended up being this way.

1. I checked the object_list via list walkthrough  and it did not report
any corruptions.

2. However, the object in question from our call stack,
object->object_list pointers looks a bit amiss.

  object_list = ([NSD:0xFFFFFFC0236B03E0] next = 0xFFFFFFC0236B1920,
[NSD:0xFFFFFFC0236B03E8] prev = 0x00200200),

But its not clear to me how it ended up being in this state.


I will continue to debug what went wrong and share my analysis. Please let
me know if anything looks obvious to you that i can try out.

====================
Thanks and regards,
Vignesh Radhakrishnan


QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member
of the Code Aurora Forum, hosted by The Linux Foundation.






^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: Crash in crc32_le during kmemleak_scan()
  2015-05-20 13:16 ` vigneshr
@ 2015-05-21  5:45   ` vigneshr
  2015-05-22 16:51     ` Catalin Marinas
  0 siblings, 1 reply; 9+ messages in thread
From: vigneshr @ 2015-05-21  5:45 UTC (permalink / raw
  To: vigneshr; +Cc: vigneshr, linux-kernel, catalin.marinas, bernd.schubert

> Analysis so far :
>
> I haven't been able to figure out exactly how the dots are connected and
> the relation to the change provided earlier but the following is what i
> could put together :
>
> When i checked the post mortem analysis i see the following in the
> crashing stack :
>
> -018|rb_erase(
>     |    [locdesc] node = ?,
>     |    [X1] root = 0xFFFFFFC0016E05B8)
>     |  [X1] root = 0xFFFFFFC0016E05B8
>     |  [locdesc] node = ?
>     |  [X2] parent = 0xFFFFFFC02EEB6F40
>     |  [X4] node = 0x0
>     |  [X3] sibling = 0x0
> -019|__delete_object(
>     |    [X19] object = 0xFFFFFFC0236B03C0)
>     |  [X21] flags = 0x0140
> -020|delete_object_full(
>     |  ?)
>     |  [X19] object = 0xFFFFFFC0236B03C0
> -021|__kmemleak_do_cleanup()
>     |  [X19] object = 0xFFFFFFC0236B03C0 -> (
>     |    [0xFFFFFFC0236B03C0] lock = ([0xFFFFFFC0236B03C0] rlock =
> ([0xFFFFFFC0236B03C0] raw_lock = ([0xFFFFFFC0236B03C0] owner
>     |    [0xFFFFFFC0236B03D8] flags = 0x0,
>     |    [0xFFFFFFC0236B03E0] object_list = ([NSD:0xFFFFFFC0236B03E0] next
> = 0xFFFFFFC0236B1920, [NSD:0xFFFFFFC0236B03E8] prev = 0x00200200),
>     |    [0xFFFFFFC0236B03F0] gray_list = ([0xFFFFFFC0236B03F0] next =
> 0x00100100, [0xFFFFFFC0236B03F8] prev = 0x00200200),
> 	 [0xFFFFFFC0236B0400] rb_node = (
>       			[0xFFFFFFC0236B0400] __rb_parent_color = 0xFFFFFFC02EEB6F41,
> 			[0xFFFFFFC0236B0408] rb_right = 0x0,
> 			[0xFFFFFFC0236B0410] rb_left = 0x0
>
>
> The node is becoming 0x0 here and is crashing in the following piece of
> code in rb_erase:
>
>
>   309         } else {
>   310             sibling = parent->rb_left;
>   311             if (rb_is_red(sibling)) {
>
> and rb_is_red(sibling) resolves into :
>
> 96 #define rb_is_red(rb)      __rb_is_red((rb)->__rb_parent_color)
>
>
>
> 1. Here the parent ptr refers to Object pointer residing at
> 0xFFFFFFC0236B03C0 and we can see that rb_node->rb_left is becoming 0x0 (
> this is from the values observed in the crashed stack)
>
> 		 [0xFFFFFFC0236B0400] rb_node = (
>       			[0xFFFFFFC0236B0400] __rb_parent_color = 0xFFFFFFC02EEB6F41,
> 			[0xFFFFFFC0236B0408] rb_right = 0x0,
> 			[0xFFFFFFC0236B0410] rb_left = 0x0
>
> 2. Therefore the sibling will take the value 0x0
>
> 3. And then when rb->__rb_parent_color operation takes place, we crash
> since its not able to resolve 0x0 (since rb=0x0 here).
>
>
> So the question is how this variable ended up being this way.
>
> 1. I checked the object_list via list walkthrough  and it did not report
> any corruptions.
>
> 2. However, the object in question from our call stack,
> object->object_list pointers looks a bit amiss.
>
>   object_list = ([NSD:0xFFFFFFC0236B03E0] next = 0xFFFFFFC0236B1920,
> [NSD:0xFFFFFFC0236B03E8] prev = 0x00200200),
>
> But its not clear to me how it ended up being in this state.
>
>
> I will continue to debug what went wrong and share my analysis. Please let
> me know if anything looks obvious to you that i can try out.

I went through the bug again and i think the following is happening :

I just noticed that there could be possible race that's happening between 2
cores causing the aforementioned crash to occur :

Core 0 ____________________________ Core 1

rb_erase()                  |
__delete_object()           |
delete_object_full()        |
                            |
___________________________ | __delete_object()
                            |
___________________________ | delete_object_full()
                            |
___________________________ | kmemleak_free()-Call from some other ctxt.
                            |
                            |
__kmemleak_do_cleanup()     |
kmemleak_do_cleanup()       |
process_one_work()          |
worker_thread()             |
kthread()                   |
ret_from_fork()             |

Just after __delete_object() running on core 1 unlocks the writelock
(kmemleak_lock), from core 0 we take the same object and try to do rb_erase
again. This is causing it to crash in this fashion.

Therefore, together with the warning that i quoted in the earlier mail and
the above race, i think the following can be added to the patch provided
earlier:

---------------------------------- 8< -------------------------------------

diff --git a/mm/kmemleak.c b/mm/kmemleak.c
index 5ec8b71..4455bb8 100644
--- a/mm/kmemleak.c
+++ b/mm/kmemleak.c
@@ -959,7 +959,7 @@ void __ref kmemleak_free(const void *ptr)
 {
        pr_debug("%s(0x%p)\n", __func__, ptr);

-       if (kmemleak_enabled && ptr && !IS_ERR(ptr))
+       if (kmemleak_enabled && ptr && !IS_ERR(ptr) && !kmemleak_error)
                delete_object_full((unsigned long)ptr);
        else if (kmemleak_early_log)
                log_early(KMEMLEAK_FREE, ptr, 0, 0);
@@ -1019,7 +1019,7 @@ void __ref kmemleak_not_leak(const void *ptr)
 {
        pr_debug("%s(0x%p)\n", __func__, ptr);

-       if (kmemleak_enabled && ptr && !IS_ERR(ptr))
+       if (kmemleak_enabled && ptr && !IS_ERR(ptr) && !kmemleak_error)
                make_gray_object((unsigned long)ptr);
        else if (kmemleak_early_log)
                log_early(KMEMLEAK_NOT_LEAK, ptr, 0, 0);




---------------------------------- 8< -------------------------------------

Please let me know if the above analysis and the addition to the patch
looks good to you or if you a different opinion. I can try out testing
once more to see if it pops back if it looks good to you.

====================
Thanks and regards,
Vignesh Radhakrishnan


QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member
of the Code Aurora Forum, hosted by The Linux Foundation.
again.


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: Crash in crc32_le during kmemleak_scan()
  2015-05-21  5:45   ` vigneshr
@ 2015-05-22 16:51     ` Catalin Marinas
  2015-06-01  4:57       ` vigneshr
  0 siblings, 1 reply; 9+ messages in thread
From: Catalin Marinas @ 2015-05-22 16:51 UTC (permalink / raw
  To: vigneshr@codeaurora.org
  Cc: linux-kernel@vger.kernel.org, bernd.schubert@itwm.fraunhofer.de

Hi Vignesh,

Thanks for testing this.

On Thu, May 21, 2015 at 06:45:17AM +0100, vigneshr@codeaurora.org wrote:
> diff --git a/mm/kmemleak.c b/mm/kmemleak.c
> index 5ec8b71..4455bb8 100644
> --- a/mm/kmemleak.c
> +++ b/mm/kmemleak.c
> @@ -959,7 +959,7 @@ void __ref kmemleak_free(const void *ptr)
>  {
>         pr_debug("%s(0x%p)\n", __func__, ptr);
> 
> -       if (kmemleak_enabled && ptr && !IS_ERR(ptr))
> +       if (kmemleak_enabled && ptr && !IS_ERR(ptr) && !kmemleak_error)
>                 delete_object_full((unsigned long)ptr);
>         else if (kmemleak_early_log)
>                 log_early(KMEMLEAK_FREE, ptr, 0, 0);

That's the problem we try to avoid, if we block kmemleak_free on
kmemleak_error (that was the same as the kmemleak_enabled case before),
scanning may still be in progress for an object but the object unmapped
by something like vfree.

So for the error case, we want:

1. Allow object freeing during a memory scan
2. Block kmemleak_free() being entered once the scanning stops and the
   clean-up starts

What I missed is that the clean-up calls delete_object_full() and this
can race with a kmemleak_free() on the same object. The same could
probably happen if buggy kernel code would call kfree() on the same
object from different CPUs. Covering this case is more complicated, I
have to properly think of the locking.

But assuming that the callers are safe, we need to disable kmemleak
before the clean-up starts. We can safely set kmemleak_enabled to 0
after the scanning thread is stopped. So on top of my previous patch:

diff --git a/mm/kmemleak.c b/mm/kmemleak.c
index dcba05812678..52a38eed50e2 100644
--- a/mm/kmemleak.c
+++ b/mm/kmemleak.c
@@ -1757,16 +1757,15 @@ static void kmemleak_do_cleanup(struct work_struct *work)
 	mutex_lock(&scan_mutex);
 	stop_scan_thread();
 
+	/* stop any memory operation tracing */
+	kmemleak_enabled = 0;
+
 	if (!kmemleak_found_leaks)
 		__kmemleak_do_cleanup();
 	else
 		pr_info("Kmemleak disabled without freeing internal data. "
 			"Reclaim the memory with \"echo clear > /sys/kernel/debug/kmemleak\"\n");
 	mutex_unlock(&scan_mutex);
-
-	/* stop any memory operation tracing */
-	kmemleak_enabled = 0;
-
 }
 
 static DECLARE_WORK(cleanup_work, kmemleak_do_cleanup);

> @@ -1019,7 +1019,7 @@ void __ref kmemleak_not_leak(const void *ptr)
>  {
>         pr_debug("%s(0x%p)\n", __func__, ptr);
> 
> -       if (kmemleak_enabled && ptr && !IS_ERR(ptr))
> +       if (kmemleak_enabled && ptr && !IS_ERR(ptr) && !kmemleak_error)
>                 make_gray_object((unsigned long)ptr);
>         else if (kmemleak_early_log)
>                 log_early(KMEMLEAK_NOT_LEAK, ptr, 0, 0);

That's needed as well. Actually, all the kmemleak entry points apart
from kmemleak_free() need to bail out on kmemleak_error (e.g.
kmemleak_ignore).

So I think we need a separate kmemleak_free_enabled. Can you try the
patch below against mainline please (so revert the previous one)? I
haven't bothered with kmemleak_free_part() since this is only called
during early memboot allocations, so we don't have any scanning thread
running.

BTW, I'll be on holiday for a week, back on the 1st of June.

----8<------------------

diff --git a/mm/kmemleak.c b/mm/kmemleak.c
index 5405aff5a590..7913386ca506 100644
--- a/mm/kmemleak.c
+++ b/mm/kmemleak.c
@@ -194,6 +194,8 @@ static struct kmem_cache *scan_area_cache;
 
 /* set if tracing memory operations is enabled */
 static int kmemleak_enabled;
+/* same as above but only for the kmemleak_free() callback */
+static int kmemleak_free_enabled;
 /* set in the late_initcall if there were no errors */
 static int kmemleak_initialized;
 /* enables or disables early logging of the memory operations */
@@ -941,7 +943,7 @@ void __ref kmemleak_free(const void *ptr)
 {
 	pr_debug("%s(0x%p)\n", __func__, ptr);
 
-	if (kmemleak_enabled && ptr && !IS_ERR(ptr))
+	if (kmemleak_free_enabled && ptr && !IS_ERR(ptr))
 		delete_object_full((unsigned long)ptr);
 	else if (kmemleak_early_log)
 		log_early(KMEMLEAK_FREE, ptr, 0, 0);
@@ -981,7 +983,7 @@ void __ref kmemleak_free_percpu(const void __percpu *ptr)
 
 	pr_debug("%s(0x%p)\n", __func__, ptr);
 
-	if (kmemleak_enabled && ptr && !IS_ERR(ptr))
+	if (kmemleak_free_enabled && ptr && !IS_ERR(ptr))
 		for_each_possible_cpu(cpu)
 			delete_object_full((unsigned long)per_cpu_ptr(ptr,
 								      cpu));
@@ -1749,6 +1751,12 @@ static void kmemleak_do_cleanup(struct work_struct *work)
 	mutex_lock(&scan_mutex);
 	stop_scan_thread();
 
+	/*
+	 * Once the scan thread has stopped, it is safe to no longer track
+	 * object freeing.
+	 */
+	kmemleak_free_enabled = 0;
+
 	if (!kmemleak_found_leaks)
 		__kmemleak_do_cleanup();
 	else
@@ -1775,6 +1783,8 @@ static void kmemleak_disable(void)
 	/* check whether it is too early for a kernel thread */
 	if (kmemleak_initialized)
 		schedule_work(&cleanup_work);
+	else
+		kmemleak_free_enabled = 0;
 
 	pr_info("Kernel memory leak detector disabled\n");
 }
@@ -1839,8 +1849,10 @@ void __init kmemleak_init(void)
 	if (kmemleak_error) {
 		local_irq_restore(flags);
 		return;
-	} else
+	} else {
 		kmemleak_enabled = 1;
+		kmemleak_free_enabled = 1;
+	}
 	local_irq_restore(flags);
 
 	/*

-------------------8<-----------------

Thanks,

Catalin

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: Crash in crc32_le during kmemleak_scan()
  2015-05-22 16:51     ` Catalin Marinas
@ 2015-06-01  4:57       ` vigneshr
  2015-06-19 10:30         ` vigneshr
  0 siblings, 1 reply; 9+ messages in thread
From: vigneshr @ 2015-06-01  4:57 UTC (permalink / raw
  To: Catalin Marinas
  Cc: vigneshr@codeaurora.org, linux-kernel@vger.kernel.org,
	bernd.schubert@itwm.fraunhofer.de


> On Thu, May 21, 2015 at 06:45:17AM +0100, vigneshr@codeaurora.org wrote:
>> diff --git a/mm/kmemleak.c b/mm/kmemleak.c
>> index 5ec8b71..4455bb8 100644
>> --- a/mm/kmemleak.c
>> +++ b/mm/kmemleak.c
>> @@ -959,7 +959,7 @@ void __ref kmemleak_free(const void *ptr)
>>  {
>>         pr_debug("%s(0x%p)\n", __func__, ptr);
>>
>> -       if (kmemleak_enabled && ptr && !IS_ERR(ptr))
>> +       if (kmemleak_enabled && ptr && !IS_ERR(ptr) && !kmemleak_error)
>>                 delete_object_full((unsigned long)ptr);
>>         else if (kmemleak_early_log)
>>                 log_early(KMEMLEAK_FREE, ptr, 0, 0);
>
> That's the problem we try to avoid, if we block kmemleak_free on
> kmemleak_error (that was the same as the kmemleak_enabled case before),
> scanning may still be in progress for an object but the object unmapped
> by something like vfree.


OOps ! Ya my bad. Sorry about that.

> So for the error case, we want:
>
> 1. Allow object freeing during a memory scan
> 2. Block kmemleak_free() being entered once the scanning stops and the
>    clean-up starts
>
> What I missed is that the clean-up calls delete_object_full() and this
> can race with a kmemleak_free() on the same object. The same could
> probably happen if buggy kernel code would call kfree() on the same
> object from different CPUs. Covering this case is more complicated, I
> have to properly think of the locking.
>
> But assuming that the callers are safe, we need to disable kmemleak
> before the clean-up starts. We can safely set kmemleak_enabled to 0
> after the scanning thread is stopped. So on top of my previous patch:
>
> diff --git a/mm/kmemleak.c b/mm/kmemleak.c
> index dcba05812678..52a38eed50e2 100644
> --- a/mm/kmemleak.c
> +++ b/mm/kmemleak.c
> @@ -1757,16 +1757,15 @@ static void kmemleak_do_cleanup(struct work_struct
> *work)
>  	mutex_lock(&scan_mutex);
>  	stop_scan_thread();
>
> +	/* stop any memory operation tracing */
> +	kmemleak_enabled = 0;
> +
>  	if (!kmemleak_found_leaks)
>  		__kmemleak_do_cleanup();
>  	else
>  		pr_info("Kmemleak disabled without freeing internal data. "
>  			"Reclaim the memory with \"echo clear >
> /sys/kernel/debug/kmemleak\"\n");
>  	mutex_unlock(&scan_mutex);
> -
> -	/* stop any memory operation tracing */
> -	kmemleak_enabled = 0;
> -
>  }
>
>  static DECLARE_WORK(cleanup_work, kmemleak_do_cleanup);
>
>> @@ -1019,7 +1019,7 @@ void __ref kmemleak_not_leak(const void *ptr)
>>  {
>>         pr_debug("%s(0x%p)\n", __func__, ptr);
>>
>> -       if (kmemleak_enabled && ptr && !IS_ERR(ptr))
>> +       if (kmemleak_enabled && ptr && !IS_ERR(ptr) && !kmemleak_error)
>>                 make_gray_object((unsigned long)ptr);
>>         else if (kmemleak_early_log)
>>                 log_early(KMEMLEAK_NOT_LEAK, ptr, 0, 0);
>
> That's needed as well. Actually, all the kmemleak entry points apart
> from kmemleak_free() need to bail out on kmemleak_error (e.g.
> kmemleak_ignore).
>
> So I think we need a separate kmemleak_free_enabled. Can you try the
> patch below against mainline please (so revert the previous one)? I
> haven't bothered with kmemleak_free_part() since this is only called
> during early memboot allocations, so we don't have any scanning thread
> running.


Ah! Thanks for the explanation. This makes sense.

> BTW, I'll be on holiday for a week, back on the 1st of June.
>
> ----8<------------------
>
> diff --git a/mm/kmemleak.c b/mm/kmemleak.c
> index 5405aff5a590..7913386ca506 100644
> --- a/mm/kmemleak.c
> +++ b/mm/kmemleak.c
> @@ -194,6 +194,8 @@ static struct kmem_cache *scan_area_cache;
>
>  /* set if tracing memory operations is enabled */
>  static int kmemleak_enabled;
> +/* same as above but only for the kmemleak_free() callback */
> +static int kmemleak_free_enabled;
>  /* set in the late_initcall if there were no errors */
>  static int kmemleak_initialized;
>  /* enables or disables early logging of the memory operations */
> @@ -941,7 +943,7 @@ void __ref kmemleak_free(const void *ptr)
>  {
>  	pr_debug("%s(0x%p)\n", __func__, ptr);
>
> -	if (kmemleak_enabled && ptr && !IS_ERR(ptr))
> +	if (kmemleak_free_enabled && ptr && !IS_ERR(ptr))
>  		delete_object_full((unsigned long)ptr);
>  	else if (kmemleak_early_log)
>  		log_early(KMEMLEAK_FREE, ptr, 0, 0);
> @@ -981,7 +983,7 @@ void __ref kmemleak_free_percpu(const void __percpu
> *ptr)
>
>  	pr_debug("%s(0x%p)\n", __func__, ptr);
>
> -	if (kmemleak_enabled && ptr && !IS_ERR(ptr))
> +	if (kmemleak_free_enabled && ptr && !IS_ERR(ptr))
>  		for_each_possible_cpu(cpu)
>  			delete_object_full((unsigned long)per_cpu_ptr(ptr,
>  								      cpu));
> @@ -1749,6 +1751,12 @@ static void kmemleak_do_cleanup(struct work_struct
> *work)
>  	mutex_lock(&scan_mutex);
>  	stop_scan_thread();
>
> +	/*
> +	 * Once the scan thread has stopped, it is safe to no longer track
> +	 * object freeing.
> +	 */
> +	kmemleak_free_enabled = 0;
> +
>  	if (!kmemleak_found_leaks)
>  		__kmemleak_do_cleanup();
>  	else
> @@ -1775,6 +1783,8 @@ static void kmemleak_disable(void)
>  	/* check whether it is too early for a kernel thread */
>  	if (kmemleak_initialized)
>  		schedule_work(&cleanup_work);
> +	else
> +		kmemleak_free_enabled = 0;
>
>  	pr_info("Kernel memory leak detector disabled\n");
>  }
> @@ -1839,8 +1849,10 @@ void __init kmemleak_init(void)
>  	if (kmemleak_error) {
>  		local_irq_restore(flags);
>  		return;
> -	} else
> +	} else {
>  		kmemleak_enabled = 1;
> +		kmemleak_free_enabled = 1;
> +	}
>  	local_irq_restore(flags);
>
>  	/*
>
> -------------------8<-----------------
>

We have tested the patch provided above and it was clean report with no
crashes that were seen earlier. I guess we can go ahead with this one if
its okay with you.

Tested-by: Vignesh Radhakrishnan <vigneshr@codeaurora.org>

Thanks and regards,
Vignesh Radhakrishnan

-- 
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux
Foundation Collaborative Project




^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Crash in crc32_le during kmemleak_scan()
  2015-06-01  4:57       ` vigneshr
@ 2015-06-19 10:30         ` vigneshr
  2015-06-19 13:20           ` Catalin Marinas
  0 siblings, 1 reply; 9+ messages in thread
From: vigneshr @ 2015-06-19 10:30 UTC (permalink / raw
  To: vigneshr
  Cc: Catalin Marinas, vigneshr@codeaurora.org,
	linux-kernel@vger.kernel.org, bernd.schubert@itwm.fraunhofer.de


>>
>> diff --git a/mm/kmemleak.c b/mm/kmemleak.c
>> index 5405aff5a590..7913386ca506 100644
>> --- a/mm/kmemleak.c
>> +++ b/mm/kmemleak.c
>> @@ -194,6 +194,8 @@ static struct kmem_cache *scan_area_cache;
>>
>>  /* set if tracing memory operations is enabled */
>>  static int kmemleak_enabled;
>> +/* same as above but only for the kmemleak_free() callback */
>> +static int kmemleak_free_enabled;
>>  /* set in the late_initcall if there were no errors */
>>  static int kmemleak_initialized;
>>  /* enables or disables early logging of the memory operations */
>> @@ -941,7 +943,7 @@ void __ref kmemleak_free(const void *ptr)
>>  {
>>  	pr_debug("%s(0x%p)\n", __func__, ptr);
>>
>> -	if (kmemleak_enabled && ptr && !IS_ERR(ptr))
>> +	if (kmemleak_free_enabled && ptr && !IS_ERR(ptr))
>>  		delete_object_full((unsigned long)ptr);
>>  	else if (kmemleak_early_log)
>>  		log_early(KMEMLEAK_FREE, ptr, 0, 0);
>> @@ -981,7 +983,7 @@ void __ref kmemleak_free_percpu(const void __percpu
>> *ptr)
>>
>>  	pr_debug("%s(0x%p)\n", __func__, ptr);
>>
>> -	if (kmemleak_enabled && ptr && !IS_ERR(ptr))
>> +	if (kmemleak_free_enabled && ptr && !IS_ERR(ptr))
>>  		for_each_possible_cpu(cpu)
>>  			delete_object_full((unsigned long)per_cpu_ptr(ptr,
>>  								      cpu));
>> @@ -1749,6 +1751,12 @@ static void kmemleak_do_cleanup(struct
>> work_struct
>> *work)
>>  	mutex_lock(&scan_mutex);
>>  	stop_scan_thread();
>>
>> +	/*
>> +	 * Once the scan thread has stopped, it is safe to no longer track
>> +	 * object freeing.
>> +	 */
>> +	kmemleak_free_enabled = 0;
>> +
>>  	if (!kmemleak_found_leaks)
>>  		__kmemleak_do_cleanup();
>>  	else
>> @@ -1775,6 +1783,8 @@ static void kmemleak_disable(void)
>>  	/* check whether it is too early for a kernel thread */
>>  	if (kmemleak_initialized)
>>  		schedule_work(&cleanup_work);
>> +	else
>> +		kmemleak_free_enabled = 0;
>>
>>  	pr_info("Kernel memory leak detector disabled\n");
>>  }
>> @@ -1839,8 +1849,10 @@ void __init kmemleak_init(void)
>>  	if (kmemleak_error) {
>>  		local_irq_restore(flags);
>>  		return;
>> -	} else
>> +	} else {
>>  		kmemleak_enabled = 1;
>> +		kmemleak_free_enabled = 1;
>> +	}
>>  	local_irq_restore(flags);
>>
>>  	/*
>>
>> -------------------8<-----------------
>>
>
> We have tested the patch provided above and it was clean report with no
> crashes that were seen earlier. I guess we can go ahead with this one if
> its okay with you.
>
> Tested-by: Vignesh Radhakrishnan <vigneshr@codeaurora.org>
>

Hi,

Gentle reminder ping for the above.

-- 
Thanks and regards,
Vignesh Radhakrishnan

QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a
member of the Code Aurora Forum, hosted by The Linux Foundation.


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Crash in crc32_le during kmemleak_scan()
  2015-06-19 10:30         ` vigneshr
@ 2015-06-19 13:20           ` Catalin Marinas
  0 siblings, 0 replies; 9+ messages in thread
From: Catalin Marinas @ 2015-06-19 13:20 UTC (permalink / raw
  To: vigneshr@codeaurora.org
  Cc: linux-kernel@vger.kernel.org, bernd.schubert@itwm.fraunhofer.de

On Fri, Jun 19, 2015 at 11:30:50AM +0100, vigneshr@codeaurora.org wrote:
> >> diff --git a/mm/kmemleak.c b/mm/kmemleak.c
> >> index 5405aff5a590..7913386ca506 100644
> >> --- a/mm/kmemleak.c
> >> +++ b/mm/kmemleak.c
> >> @@ -194,6 +194,8 @@ static struct kmem_cache *scan_area_cache;
> >>
> >>  /* set if tracing memory operations is enabled */
> >>  static int kmemleak_enabled;
> >> +/* same as above but only for the kmemleak_free() callback */
> >> +static int kmemleak_free_enabled;
> >>  /* set in the late_initcall if there were no errors */
> >>  static int kmemleak_initialized;
> >>  /* enables or disables early logging of the memory operations */
> >> @@ -941,7 +943,7 @@ void __ref kmemleak_free(const void *ptr)
> >>  {
> >>  	pr_debug("%s(0x%p)\n", __func__, ptr);
> >>
> >> -	if (kmemleak_enabled && ptr && !IS_ERR(ptr))
> >> +	if (kmemleak_free_enabled && ptr && !IS_ERR(ptr))
> >>  		delete_object_full((unsigned long)ptr);
> >>  	else if (kmemleak_early_log)
> >>  		log_early(KMEMLEAK_FREE, ptr, 0, 0);
> >> @@ -981,7 +983,7 @@ void __ref kmemleak_free_percpu(const void __percpu
> >> *ptr)
> >>
> >>  	pr_debug("%s(0x%p)\n", __func__, ptr);
> >>
> >> -	if (kmemleak_enabled && ptr && !IS_ERR(ptr))
> >> +	if (kmemleak_free_enabled && ptr && !IS_ERR(ptr))
> >>  		for_each_possible_cpu(cpu)
> >>  			delete_object_full((unsigned long)per_cpu_ptr(ptr,
> >>  								      cpu));
> >> @@ -1749,6 +1751,12 @@ static void kmemleak_do_cleanup(struct
> >> work_struct
> >> *work)
> >>  	mutex_lock(&scan_mutex);
> >>  	stop_scan_thread();
> >>
> >> +	/*
> >> +	 * Once the scan thread has stopped, it is safe to no longer track
> >> +	 * object freeing.
> >> +	 */
> >> +	kmemleak_free_enabled = 0;
> >> +
> >>  	if (!kmemleak_found_leaks)
> >>  		__kmemleak_do_cleanup();
> >>  	else
> >> @@ -1775,6 +1783,8 @@ static void kmemleak_disable(void)
> >>  	/* check whether it is too early for a kernel thread */
> >>  	if (kmemleak_initialized)
> >>  		schedule_work(&cleanup_work);
> >> +	else
> >> +		kmemleak_free_enabled = 0;
> >>
> >>  	pr_info("Kernel memory leak detector disabled\n");
> >>  }
> >> @@ -1839,8 +1849,10 @@ void __init kmemleak_init(void)
> >>  	if (kmemleak_error) {
> >>  		local_irq_restore(flags);
> >>  		return;
> >> -	} else
> >> +	} else {
> >>  		kmemleak_enabled = 1;
> >> +		kmemleak_free_enabled = 1;
> >> +	}
> >>  	local_irq_restore(flags);
> >>
> >>  	/*
> >>
> >> -------------------8<-----------------
> >
> > We have tested the patch provided above and it was clean report with no
> > crashes that were seen earlier. I guess we can go ahead with this one if
> > its okay with you.
> >
> > Tested-by: Vignesh Radhakrishnan <vigneshr@codeaurora.org>
> 
> Gentle reminder ping for the above.

What's the ping for?

The patch is merged in linux-next via the mm tree as commit d613b66e1df8
("mm: kmemleak: allow safe memory scanning during kmemleak disabling").
It will probably go in during the upcoming merging window (so 4.2) and
it also has a Cc stable tag.

-- 
Catalin
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2015-06-19 13:20 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2015-05-13 13:15 Crash in crc32_le during kmemleak_scan() vigneshr
2015-05-14 11:21 ` Catalin Marinas
2015-05-15  4:53   ` vigneshr
2015-05-20 13:16 ` vigneshr
2015-05-21  5:45   ` vigneshr
2015-05-22 16:51     ` Catalin Marinas
2015-06-01  4:57       ` vigneshr
2015-06-19 10:30         ` vigneshr
2015-06-19 13:20           ` Catalin Marinas

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).