Suggestions on how to debug kernel crashes where printk and gdb both does not work

All the mail mirrored from lore.kernel.org
 help / color / mirror / Atom feed

* Suggestions on how to debug kernel crashes where printk and gdb both does not work
@ 2021-06-14 13:22 Dongliang Mu
  2021-06-14 13:34 ` Pavel Skripkin
  0 siblings, 1 reply; 12+ messages in thread
From: Dongliang Mu @ 2021-06-14 13:22 UTC (permalink / raw)
  To: alex.aring, David S. Miller, Jakub Kicinski, linux-kernel,
	linux-wpan, netdev, stefan, syzkaller-bugs,
	syzbot+b80c9959009a9325cdff, Dan Carpenter, Greg KH

Dear kernel developers,

I was trying to debug the crash - memory leak in hwsim_add_one [1]
recently. However, I encountered a disgusting issue: my breakpoint and
printk/pr_alert in the functions that will be surely executed do not
work. The stack trace is in the following. I wrote this email to ask
for some suggestions on how to debug such cases?

Thanks very much. Looking forward to your reply.

[<ffffffff82b09f12>] kmalloc include/linux/slab.h:556 [inline]
[<ffffffff82b09f12>] kzalloc include/linux/slab.h:686 [inline]
[<ffffffff82b09f12>] hwsim_alloc_edge.constprop.0+0x22/0x80
drivers/net/ieee802154/mac802154_hwsim.c:385
[<ffffffff82b0b0f3>] hwsim_subscribe_all_others
drivers/net/ieee802154/mac802154_hwsim.c:709 [inline]
[<ffffffff82b0b0f3>] hwsim_add_one+0x3b3/0x640
drivers/net/ieee802154/mac802154_hwsim.c:802
[<ffffffff82b0b3c4>] hwsim_probe+0x44/0xd0
drivers/net/ieee802154/mac802154_hwsim.c:848
[<ffffffff82628bf1>] platform_probe+0x81/0x120 drivers/base/platform.c:1447
[<ffffffff82625679>] really_probe+0x159/0x500 drivers/base/dd.c:576
[<ffffffff82625aab>] driver_probe_device+0x8b/0x100 drivers/base/dd.c:763
[<ffffffff82626325>] device_driver_attach+0x105/0x110 drivers/base/dd.c:1039
[<ffffffff826263d1>] __driver_attach drivers/base/dd.c:1117 [inline]
[<ffffffff826263d1>] __driver_attach+0xa1/0x140 drivers/base/dd.c:1070
[<ffffffff82622459>] bus_for_each_dev+0xa9/0x100 drivers/base/bus.c:305
[<ffffffff826244e0>] bus_add_driver+0x160/0x280 drivers/base/bus.c:622
[<ffffffff82627233>] driver_register+0xc3/0x150 drivers/base/driver.c:171
[<ffffffff874fa3dc>] hwsim_init_module+0xae/0x107
drivers/net/ieee802154/mac802154_hwsim.c:899
[<ffffffff81001083>] do_one_initcall+0x63/0x2e0 init/main.c:1249
[<ffffffff87489873>] do_initcall_level init/main.c:1322 [inline]
[<ffffffff87489873>] do_initcalls init/main.c:1338 [inline]
[<ffffffff87489873>] do_basic_setup init/main.c:1358 [inline]
[<ffffffff87489873>] kernel_init_freeable+0x1f4/0x26e init/main.c:1560
[<ffffffff84359255>] kernel_init+0xc/0x1a7 init/main.c:1447
[<ffffffff810022ef>] ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:294

[1] https://groups.google.com/g/syzkaller-bugs/c/iEEZ_UOvoEY/m/6sm4bt6EAgAJ

--
My best regards to you.

     No System Is Safe!
     Dongliang Mu

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Suggestions on how to debug kernel crashes where printk and gdb both does not work
  2021-06-14 13:22 Suggestions on how to debug kernel crashes where printk and gdb both does not work Dongliang Mu
@ 2021-06-14 13:34 ` Pavel Skripkin
  2021-06-14 14:19   ` Dongliang Mu
  0 siblings, 1 reply; 12+ messages in thread
From: Pavel Skripkin @ 2021-06-14 13:34 UTC (permalink / raw)
  To: Dongliang Mu
  Cc: alex.aring, David S. Miller, Jakub Kicinski, linux-kernel,
	linux-wpan, netdev, stefan, syzkaller-bugs,
	syzbot+b80c9959009a9325cdff, Dan Carpenter, Greg KH

On Mon, 14 Jun 2021 21:22:43 +0800
Dongliang Mu <mudongliangabcd@gmail.com> wrote:

> Dear kernel developers,
> 
> I was trying to debug the crash - memory leak in hwsim_add_one [1]
> recently. However, I encountered a disgusting issue: my breakpoint and
> printk/pr_alert in the functions that will be surely executed do not
> work. The stack trace is in the following. I wrote this email to ask
> for some suggestions on how to debug such cases?
> 
> Thanks very much. Looking forward to your reply.
> 

Hi, Dongliang!

This bug is not similar to others on the dashboard. I spent some time
debugging it a week ago. The main problem here, that memory
allocation happens in the boot time:

> [<ffffffff84359255>] kernel_init+0xc/0x1a7 init/main.c:1447

and reproducer simply tries to
free this data. You can use ftrace to look at it. Smth like this: 

$ echo 'hwsim_*' > $TRACE_DIR/set_ftrace_filter

would work.


With regards,
Pavel Skripkin

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Suggestions on how to debug kernel crashes where printk and gdb both does not work
  2021-06-14 13:34 ` Pavel Skripkin
@ 2021-06-14 14:19   ` Dongliang Mu
  2021-06-14 14:25     ` Pavel Skripkin
  0 siblings, 1 reply; 12+ messages in thread
From: Dongliang Mu @ 2021-06-14 14:19 UTC (permalink / raw)
  To: Pavel Skripkin
  Cc: alex.aring, David S. Miller, Jakub Kicinski, linux-kernel,
	linux-wpan, netdev, stefan, syzkaller-bugs,
	syzbot+b80c9959009a9325cdff, Dan Carpenter, Greg KH

On Mon, Jun 14, 2021 at 9:34 PM Pavel Skripkin <paskripkin@gmail.com> wrote:
>
> On Mon, 14 Jun 2021 21:22:43 +0800
> Dongliang Mu <mudongliangabcd@gmail.com> wrote:
>
> > Dear kernel developers,
> >
> > I was trying to debug the crash - memory leak in hwsim_add_one [1]
> > recently. However, I encountered a disgusting issue: my breakpoint and
> > printk/pr_alert in the functions that will be surely executed do not
> > work. The stack trace is in the following. I wrote this email to ask
> > for some suggestions on how to debug such cases?
> >
> > Thanks very much. Looking forward to your reply.
> >
>
> Hi, Dongliang!
>
> This bug is not similar to others on the dashboard. I spent some time
> debugging it a week ago. The main problem here, that memory
> allocation happens in the boot time:
>
> > [<ffffffff84359255>] kernel_init+0xc/0x1a7 init/main.c:1447
>

Oh, nice catch. No wonder why my debugging does not work. :(

> and reproducer simply tries to
> free this data. You can use ftrace to look at it. Smth like this:
>
> $ echo 'hwsim_*' > $TRACE_DIR/set_ftrace_filter

Thanks for your suggestion.

Do you have any conclusions about this case? If you have found out the
root cause and start writing patches, I will turn my focus to other
cases.

BTW, I only found another possible memory leak after some manual code
review [1]. However, it is not the root cause for this crash.

[1] https://lkml.org/lkml/2021/6/10/1297

>
> would work.
>
>
> With regards,
> Pavel Skripkin

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Suggestions on how to debug kernel crashes where printk and gdb both does not work
  2021-06-14 14:19   ` Dongliang Mu
@ 2021-06-14 14:25     ` Pavel Skripkin
  2021-06-14 14:40       ` Dongliang Mu
  0 siblings, 1 reply; 12+ messages in thread
From: Pavel Skripkin @ 2021-06-14 14:25 UTC (permalink / raw)
  To: Dongliang Mu
  Cc: alex.aring, David S. Miller, Jakub Kicinski, linux-kernel,
	linux-wpan, netdev, stefan, syzkaller-bugs,
	syzbot+b80c9959009a9325cdff, Dan Carpenter, Greg KH

On Mon, 14 Jun 2021 22:19:10 +0800
Dongliang Mu <mudongliangabcd@gmail.com> wrote:

> On Mon, Jun 14, 2021 at 9:34 PM Pavel Skripkin <paskripkin@gmail.com>
> wrote:
> >
> > On Mon, 14 Jun 2021 21:22:43 +0800
> > Dongliang Mu <mudongliangabcd@gmail.com> wrote:
> >
> > > Dear kernel developers,
> > >
> > > I was trying to debug the crash - memory leak in hwsim_add_one [1]
> > > recently. However, I encountered a disgusting issue: my
> > > breakpoint and printk/pr_alert in the functions that will be
> > > surely executed do not work. The stack trace is in the following.
> > > I wrote this email to ask for some suggestions on how to debug
> > > such cases?
> > >
> > > Thanks very much. Looking forward to your reply.
> > >
> >
> > Hi, Dongliang!
> >
> > This bug is not similar to others on the dashboard. I spent some
> > time debugging it a week ago. The main problem here, that memory
> > allocation happens in the boot time:
> >
> > > [<ffffffff84359255>] kernel_init+0xc/0x1a7 init/main.c:1447
> >
> 
> Oh, nice catch. No wonder why my debugging does not work. :(
> 
> > and reproducer simply tries to
> > free this data. You can use ftrace to look at it. Smth like this:
> >
> > $ echo 'hwsim_*' > $TRACE_DIR/set_ftrace_filter
> 
> Thanks for your suggestion.
> 
> Do you have any conclusions about this case? If you have found out the
> root cause and start writing patches, I will turn my focus to other
> cases.

No, I had some busy days and I have nothing about this bug for now.
I've just traced the reproducer execution and that's all :)

I guess, some error handling paths are broken, but Im not sure 
 

> 
> BTW, I only found another possible memory leak after some manual code
> review [1]. However, it is not the root cause for this crash.
> 
> [1] https://lkml.org/lkml/2021/6/10/1297
> 
> >
> > would work.
> >
> >
> > With regards,
> > Pavel Skripkin




With regards,
Pavel Skripkin

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Suggestions on how to debug kernel crashes where printk and gdb both does not work
  2021-06-14 14:25     ` Pavel Skripkin
@ 2021-06-14 14:40       ` Dongliang Mu
  2021-06-14 14:47         ` Pavel Skripkin
  0 siblings, 1 reply; 12+ messages in thread
From: Dongliang Mu @ 2021-06-14 14:40 UTC (permalink / raw)
  To: Pavel Skripkin
  Cc: alex.aring, David S. Miller, Jakub Kicinski, linux-kernel,
	linux-wpan, netdev, stefan, syzkaller-bugs,
	syzbot+b80c9959009a9325cdff, Dan Carpenter, Greg KH

On Mon, Jun 14, 2021 at 10:25 PM Pavel Skripkin <paskripkin@gmail.com> wrote:
>
> On Mon, 14 Jun 2021 22:19:10 +0800
> Dongliang Mu <mudongliangabcd@gmail.com> wrote:
>
> > On Mon, Jun 14, 2021 at 9:34 PM Pavel Skripkin <paskripkin@gmail.com>
> > wrote:
> > >
> > > On Mon, 14 Jun 2021 21:22:43 +0800
> > > Dongliang Mu <mudongliangabcd@gmail.com> wrote:
> > >
> > > > Dear kernel developers,
> > > >
> > > > I was trying to debug the crash - memory leak in hwsim_add_one [1]
> > > > recently. However, I encountered a disgusting issue: my
> > > > breakpoint and printk/pr_alert in the functions that will be
> > > > surely executed do not work. The stack trace is in the following.
> > > > I wrote this email to ask for some suggestions on how to debug
> > > > such cases?
> > > >
> > > > Thanks very much. Looking forward to your reply.
> > > >
> > >
> > > Hi, Dongliang!
> > >
> > > This bug is not similar to others on the dashboard. I spent some
> > > time debugging it a week ago. The main problem here, that memory
> > > allocation happens in the boot time:
> > >
> > > > [<ffffffff84359255>] kernel_init+0xc/0x1a7 init/main.c:1447
> > >
> >
> > Oh, nice catch. No wonder why my debugging does not work. :(
> >
> > > and reproducer simply tries to
> > > free this data. You can use ftrace to look at it. Smth like this:
> > >
> > > $ echo 'hwsim_*' > $TRACE_DIR/set_ftrace_filter
> >
> > Thanks for your suggestion.
> >
> > Do you have any conclusions about this case? If you have found out the
> > root cause and start writing patches, I will turn my focus to other
> > cases.
>
> No, I had some busy days and I have nothing about this bug for now.
> I've just traced the reproducer execution and that's all :)
>
> I guess, some error handling paths are broken, but Im not sure

In the beginning, I agreed with you. However, after I manually checked
functions: hwsim_probe (initialization) and  hwsim_remove (cleanup),
then things may be different. The cleanup looks correct to me. I would
like to debug but stuck with the debugging process.

And there is another issue: the cleanup function also does not output
anything or hit the breakpoint. I don't quite understand it since the
cleanup is not at the boot time.

Any idea?

>
>
> >
> > BTW, I only found another possible memory leak after some manual code
> > review [1]. However, it is not the root cause for this crash.
> >
> > [1] https://lkml.org/lkml/2021/6/10/1297
> >
> > >
> > > would work.
> > >
> > >
> > > With regards,
> > > Pavel Skripkin
>
>
>
>
> With regards,
> Pavel Skripkin

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Suggestions on how to debug kernel crashes where printk and gdb both does not work
  2021-06-14 14:40       ` Dongliang Mu
@ 2021-06-14 14:47         ` Pavel Skripkin
  2021-06-14 15:04           ` Dongliang Mu
  0 siblings, 1 reply; 12+ messages in thread
From: Pavel Skripkin @ 2021-06-14 14:47 UTC (permalink / raw)
  To: Dongliang Mu
  Cc: alex.aring, David S. Miller, Jakub Kicinski, linux-kernel,
	linux-wpan, netdev, stefan, syzkaller-bugs,
	syzbot+b80c9959009a9325cdff, Dan Carpenter, Greg KH

On Mon, 14 Jun 2021 22:40:55 +0800
Dongliang Mu <mudongliangabcd@gmail.com> wrote:

> On Mon, Jun 14, 2021 at 10:25 PM Pavel Skripkin
> <paskripkin@gmail.com> wrote:
> >
> > On Mon, 14 Jun 2021 22:19:10 +0800
> > Dongliang Mu <mudongliangabcd@gmail.com> wrote:
> >
> > > On Mon, Jun 14, 2021 at 9:34 PM Pavel Skripkin
> > > <paskripkin@gmail.com> wrote:
> > > >
> > > > On Mon, 14 Jun 2021 21:22:43 +0800
> > > > Dongliang Mu <mudongliangabcd@gmail.com> wrote:
> > > >
> > > > > Dear kernel developers,
> > > > >
> > > > > I was trying to debug the crash - memory leak in
> > > > > hwsim_add_one [1] recently. However, I encountered a
> > > > > disgusting issue: my breakpoint and printk/pr_alert in the
> > > > > functions that will be surely executed do not work. The stack
> > > > > trace is in the following. I wrote this email to ask for some
> > > > > suggestions on how to debug such cases?
> > > > >
> > > > > Thanks very much. Looking forward to your reply.
> > > > >
> > > >
> > > > Hi, Dongliang!
> > > >
> > > > This bug is not similar to others on the dashboard. I spent some
> > > > time debugging it a week ago. The main problem here, that memory
> > > > allocation happens in the boot time:
> > > >
> > > > > [<ffffffff84359255>] kernel_init+0xc/0x1a7 init/main.c:1447
> > > >
> > >
> > > Oh, nice catch. No wonder why my debugging does not work. :(
> > >
> > > > and reproducer simply tries to
> > > > free this data. You can use ftrace to look at it. Smth like
> > > > this:
> > > >
> > > > $ echo 'hwsim_*' > $TRACE_DIR/set_ftrace_filter
> > >
> > > Thanks for your suggestion.
> > >
> > > Do you have any conclusions about this case? If you have found
> > > out the root cause and start writing patches, I will turn my
> > > focus to other cases.
> >
> > No, I had some busy days and I have nothing about this bug for now.
> > I've just traced the reproducer execution and that's all :)
> >
> > I guess, some error handling paths are broken, but Im not sure
> 
> In the beginning, I agreed with you. However, after I manually checked
> functions: hwsim_probe (initialization) and  hwsim_remove (cleanup),
> then things may be different. The cleanup looks correct to me. I would
> like to debug but stuck with the debugging process.
> 
> And there is another issue: the cleanup function also does not output
> anything or hit the breakpoint. I don't quite understand it since the
> cleanup is not at the boot time.
> 
> Any idea?
> 

Output from ftrace (syzkaller repro):

root@syzkaller:~# cat /sys/kernel/tracing/trace
# tracer: function_graph
#
# CPU  DURATION                  FUNCTION CALLS
# |     |   |                     |   |   |   |
 1)               |  hwsim_del_radio_nl() {
 1)               |    hwsim_del() {
 1)               |      hwsim_edge_unsubscribe_me() {
 1) ! 310.041 us  |        hwsim_free_edge();
 1) ! 665.221 us  |      }
 1) * 52999.05 us |    }
 1) * 53035.38 us |  }

Cleanup function is not the case, I think :)



With regards,
Pavel Skripkin

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Suggestions on how to debug kernel crashes where printk and gdb both does not work
  2021-06-14 14:47         ` Pavel Skripkin
@ 2021-06-14 15:04           ` Dongliang Mu
  2021-06-14 20:30             ` Pavel Skripkin
  0 siblings, 1 reply; 12+ messages in thread
From: Dongliang Mu @ 2021-06-14 15:04 UTC (permalink / raw)
  To: Pavel Skripkin
  Cc: alex.aring, David S. Miller, Jakub Kicinski, linux-kernel,
	linux-wpan, netdev, stefan, syzkaller-bugs,
	syzbot+b80c9959009a9325cdff, Dan Carpenter, Greg KH

On Mon, Jun 14, 2021 at 10:47 PM Pavel Skripkin <paskripkin@gmail.com> wrote:
>
> On Mon, 14 Jun 2021 22:40:55 +0800
> Dongliang Mu <mudongliangabcd@gmail.com> wrote:
>
> > On Mon, Jun 14, 2021 at 10:25 PM Pavel Skripkin
> > <paskripkin@gmail.com> wrote:
> > >
> > > On Mon, 14 Jun 2021 22:19:10 +0800
> > > Dongliang Mu <mudongliangabcd@gmail.com> wrote:
> > >
> > > > On Mon, Jun 14, 2021 at 9:34 PM Pavel Skripkin
> > > > <paskripkin@gmail.com> wrote:
> > > > >
> > > > > On Mon, 14 Jun 2021 21:22:43 +0800
> > > > > Dongliang Mu <mudongliangabcd@gmail.com> wrote:
> > > > >
> > > > > > Dear kernel developers,
> > > > > >
> > > > > > I was trying to debug the crash - memory leak in
> > > > > > hwsim_add_one [1] recently. However, I encountered a
> > > > > > disgusting issue: my breakpoint and printk/pr_alert in the
> > > > > > functions that will be surely executed do not work. The stack
> > > > > > trace is in the following. I wrote this email to ask for some
> > > > > > suggestions on how to debug such cases?
> > > > > >
> > > > > > Thanks very much. Looking forward to your reply.
> > > > > >
> > > > >
> > > > > Hi, Dongliang!
> > > > >
> > > > > This bug is not similar to others on the dashboard. I spent some
> > > > > time debugging it a week ago. The main problem here, that memory
> > > > > allocation happens in the boot time:
> > > > >
> > > > > > [<ffffffff84359255>] kernel_init+0xc/0x1a7 init/main.c:1447
> > > > >
> > > >
> > > > Oh, nice catch. No wonder why my debugging does not work. :(
> > > >
> > > > > and reproducer simply tries to
> > > > > free this data. You can use ftrace to look at it. Smth like
> > > > > this:
> > > > >
> > > > > $ echo 'hwsim_*' > $TRACE_DIR/set_ftrace_filter
> > > >
> > > > Thanks for your suggestion.
> > > >
> > > > Do you have any conclusions about this case? If you have found
> > > > out the root cause and start writing patches, I will turn my
> > > > focus to other cases.
> > >
> > > No, I had some busy days and I have nothing about this bug for now.
> > > I've just traced the reproducer execution and that's all :)
> > >
> > > I guess, some error handling paths are broken, but Im not sure
> >
> > In the beginning, I agreed with you. However, after I manually checked
> > functions: hwsim_probe (initialization) and  hwsim_remove (cleanup),
> > then things may be different. The cleanup looks correct to me. I would
> > like to debug but stuck with the debugging process.
> >
> > And there is another issue: the cleanup function also does not output
> > anything or hit the breakpoint. I don't quite understand it since the
> > cleanup is not at the boot time.
> >
> > Any idea?
> >
>
> Output from ftrace (syzkaller repro):
>
> root@syzkaller:~# cat /sys/kernel/tracing/trace
> # tracer: function_graph
> #
> # CPU  DURATION                  FUNCTION CALLS
> # |     |   |                     |   |   |   |
>  1)               |  hwsim_del_radio_nl() {
>  1)               |    hwsim_del() {
>  1)               |      hwsim_edge_unsubscribe_me() {
>  1) ! 310.041 us  |        hwsim_free_edge();
>  1) ! 665.221 us  |      }
>  1) * 52999.05 us |    }
>  1) * 53035.38 us |  }
>
> Cleanup function is not the case, I think :)

It seems like I spot the incorrect cleanup function (hwsim_remove is
the right one is in my mind). Let me learn how to use ftrace to log
the executed functions and then discuss this case with you guys.

>
>
>
> With regards,
> Pavel Skripkin

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Suggestions on how to debug kernel crashes where printk and gdb both does not work
  2021-06-14 15:04           ` Dongliang Mu
@ 2021-06-14 20:30             ` Pavel Skripkin
  2021-06-15 10:37               ` Dongliang Mu
  0 siblings, 1 reply; 12+ messages in thread
From: Pavel Skripkin @ 2021-06-14 20:30 UTC (permalink / raw)
  To: Dongliang Mu
  Cc: alex.aring, David S. Miller, Jakub Kicinski, linux-kernel,
	linux-wpan, netdev, stefan, syzkaller-bugs,
	syzbot+b80c9959009a9325cdff, Dan Carpenter, Greg KH

On Mon, 14 Jun 2021 23:04:03 +0800
Dongliang Mu <mudongliangabcd@gmail.com> wrote:

> On Mon, Jun 14, 2021 at 10:47 PM Pavel Skripkin
> <paskripkin@gmail.com> wrote:
> >
> > On Mon, 14 Jun 2021 22:40:55 +0800
> > Dongliang Mu <mudongliangabcd@gmail.com> wrote:
> >
> > > On Mon, Jun 14, 2021 at 10:25 PM Pavel Skripkin
> > > <paskripkin@gmail.com> wrote:
> > > >
> > > > On Mon, 14 Jun 2021 22:19:10 +0800
> > > > Dongliang Mu <mudongliangabcd@gmail.com> wrote:
> > > >
> > > > > On Mon, Jun 14, 2021 at 9:34 PM Pavel Skripkin
> > > > > <paskripkin@gmail.com> wrote:
> > > > > >
> > > > > > On Mon, 14 Jun 2021 21:22:43 +0800
> > > > > > Dongliang Mu <mudongliangabcd@gmail.com> wrote:
> > > > > >
> > > > > > > Dear kernel developers,
> > > > > > >
> > > > > > > I was trying to debug the crash - memory leak in
> > > > > > > hwsim_add_one [1] recently. However, I encountered a
> > > > > > > disgusting issue: my breakpoint and printk/pr_alert in the
> > > > > > > functions that will be surely executed do not work. The
> > > > > > > stack trace is in the following. I wrote this email to
> > > > > > > ask for some suggestions on how to debug such cases?
> > > > > > >
> > > > > > > Thanks very much. Looking forward to your reply.
> > > > > > >
> > > > > >
> > > > > > Hi, Dongliang!
> > > > > >
> > > > > > This bug is not similar to others on the dashboard. I spent
> > > > > > some time debugging it a week ago. The main problem here,
> > > > > > that memory allocation happens in the boot time:
> > > > > >
> > > > > > > [<ffffffff84359255>] kernel_init+0xc/0x1a7
> > > > > > > init/main.c:1447
> > > > > >
> > > > >
> > > > > Oh, nice catch. No wonder why my debugging does not work. :(
> > > > >
> > > > > > and reproducer simply tries to
> > > > > > free this data. You can use ftrace to look at it. Smth like
> > > > > > this:
> > > > > >
> > > > > > $ echo 'hwsim_*' > $TRACE_DIR/set_ftrace_filter
> > > > >
> > > > > Thanks for your suggestion.
> > > > >
> > > > > Do you have any conclusions about this case? If you have found
> > > > > out the root cause and start writing patches, I will turn my
> > > > > focus to other cases.
> > > >
> > > > No, I had some busy days and I have nothing about this bug for
> > > > now. I've just traced the reproducer execution and that's all :)
> > > >
> > > > I guess, some error handling paths are broken, but Im not sure
> > >
> > > In the beginning, I agreed with you. However, after I manually
> > > checked functions: hwsim_probe (initialization) and  hwsim_remove
> > > (cleanup), then things may be different. The cleanup looks
> > > correct to me. I would like to debug but stuck with the debugging
> > > process.
> > >
> > > And there is another issue: the cleanup function also does not
> > > output anything or hit the breakpoint. I don't quite understand
> > > it since the cleanup is not at the boot time.
> > >
> > > Any idea?
> > >
> >
> > Output from ftrace (syzkaller repro):
> >
> > root@syzkaller:~# cat /sys/kernel/tracing/trace
> > # tracer: function_graph
> > #
> > # CPU  DURATION                  FUNCTION CALLS
> > # |     |   |                     |   |   |   |
> >  1)               |  hwsim_del_radio_nl() {
> >  1)               |    hwsim_del() {
> >  1)               |      hwsim_edge_unsubscribe_me() {
> >  1) ! 310.041 us  |        hwsim_free_edge();
> >  1) ! 665.221 us  |      }
> >  1) * 52999.05 us |    }
> >  1) * 53035.38 us |  }
> >
> > Cleanup function is not the case, I think :)
> 
> It seems like I spot the incorrect cleanup function (hwsim_remove is
> the right one is in my mind). Let me learn how to use ftrace to log
> the executed functions and then discuss this case with you guys.
> 

Hmmm, I think, there is a mess with lists.

I just want to share my debug results, I have no idea about the fix for
now.

In hwsim_probe() edge for phy->idx = 1 is allocated, then reproduces
sends a request to delete phy with idx == 0, so this check in
hwsim_edge_unsubscribe_me():

	if (e->endpoint->idx == phy->idx) { 
		... clean up code ...
	}

won't be passed and edge won't be freed (because it was allocated for
phy with idx == 1). Allocated edge for phy 1 becomes leaked after
hwsim_del(). I can't really see the code where phy with idx == 1 can
be deleted from list...

Maybe, it's kmemleak bug. Similar strange case was with this one
https://syzkaller.appspot.com/bug?id=3a325b8389fc41c1bc94de0f4ac437ed13cce584.
I find it strange, that I could reach leaked pointers after kmemleak reported a
leak. Im not familiar with kmemleak internals and I might be wrong 


With regards,
Pavel Skripkin

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Suggestions on how to debug kernel crashes where printk and gdb both does not work
  2021-06-14 20:30             ` Pavel Skripkin
@ 2021-06-15 10:37               ` Dongliang Mu
  2021-06-15 13:34                 ` Pavel Skripkin
  0 siblings, 1 reply; 12+ messages in thread
From: Dongliang Mu @ 2021-06-15 10:37 UTC (permalink / raw)
  To: Pavel Skripkin
  Cc: alex.aring, David S. Miller, Jakub Kicinski, linux-kernel,
	linux-wpan, netdev, stefan, syzkaller-bugs,
	syzbot+b80c9959009a9325cdff, Dan Carpenter, Greg KH

On Tue, Jun 15, 2021 at 4:30 AM Pavel Skripkin <paskripkin@gmail.com> wrote:
>
> On Mon, 14 Jun 2021 23:04:03 +0800
> Dongliang Mu <mudongliangabcd@gmail.com> wrote:
>
> > On Mon, Jun 14, 2021 at 10:47 PM Pavel Skripkin
> > <paskripkin@gmail.com> wrote:
> > >
> > > On Mon, 14 Jun 2021 22:40:55 +0800
> > > Dongliang Mu <mudongliangabcd@gmail.com> wrote:
> > >
> > > > On Mon, Jun 14, 2021 at 10:25 PM Pavel Skripkin
> > > > <paskripkin@gmail.com> wrote:
> > > > >
> > > > > On Mon, 14 Jun 2021 22:19:10 +0800
> > > > > Dongliang Mu <mudongliangabcd@gmail.com> wrote:
> > > > >
> > > > > > On Mon, Jun 14, 2021 at 9:34 PM Pavel Skripkin
> > > > > > <paskripkin@gmail.com> wrote:
> > > > > > >
> > > > > > > On Mon, 14 Jun 2021 21:22:43 +0800
> > > > > > > Dongliang Mu <mudongliangabcd@gmail.com> wrote:
> > > > > > >
> > > > > > > > Dear kernel developers,
> > > > > > > >
> > > > > > > > I was trying to debug the crash - memory leak in
> > > > > > > > hwsim_add_one [1] recently. However, I encountered a
> > > > > > > > disgusting issue: my breakpoint and printk/pr_alert in the
> > > > > > > > functions that will be surely executed do not work. The
> > > > > > > > stack trace is in the following. I wrote this email to
> > > > > > > > ask for some suggestions on how to debug such cases?
> > > > > > > >
> > > > > > > > Thanks very much. Looking forward to your reply.
> > > > > > > >
> > > > > > >
> > > > > > > Hi, Dongliang!
> > > > > > >
> > > > > > > This bug is not similar to others on the dashboard. I spent
> > > > > > > some time debugging it a week ago. The main problem here,
> > > > > > > that memory allocation happens in the boot time:
> > > > > > >
> > > > > > > > [<ffffffff84359255>] kernel_init+0xc/0x1a7
> > > > > > > > init/main.c:1447
> > > > > > >
> > > > > >
> > > > > > Oh, nice catch. No wonder why my debugging does not work. :(
> > > > > >
> > > > > > > and reproducer simply tries to
> > > > > > > free this data. You can use ftrace to look at it. Smth like
> > > > > > > this:
> > > > > > >
> > > > > > > $ echo 'hwsim_*' > $TRACE_DIR/set_ftrace_filter
> > > > > >
> > > > > > Thanks for your suggestion.
> > > > > >
> > > > > > Do you have any conclusions about this case? If you have found
> > > > > > out the root cause and start writing patches, I will turn my
> > > > > > focus to other cases.
> > > > >
> > > > > No, I had some busy days and I have nothing about this bug for
> > > > > now. I've just traced the reproducer execution and that's all :)
> > > > >
> > > > > I guess, some error handling paths are broken, but Im not sure
> > > >
> > > > In the beginning, I agreed with you. However, after I manually
> > > > checked functions: hwsim_probe (initialization) and  hwsim_remove
> > > > (cleanup), then things may be different. The cleanup looks
> > > > correct to me. I would like to debug but stuck with the debugging
> > > > process.
> > > >
> > > > And there is another issue: the cleanup function also does not
> > > > output anything or hit the breakpoint. I don't quite understand
> > > > it since the cleanup is not at the boot time.
> > > >
> > > > Any idea?
> > > >
> > >
> > > Output from ftrace (syzkaller repro):
> > >
> > > root@syzkaller:~# cat /sys/kernel/tracing/trace
> > > # tracer: function_graph
> > > #
> > > # CPU  DURATION                  FUNCTION CALLS
> > > # |     |   |                     |   |   |   |
> > >  1)               |  hwsim_del_radio_nl() {
> > >  1)               |    hwsim_del() {
> > >  1)               |      hwsim_edge_unsubscribe_me() {
> > >  1) ! 310.041 us  |        hwsim_free_edge();
> > >  1) ! 665.221 us  |      }
> > >  1) * 52999.05 us |    }
> > >  1) * 53035.38 us |  }
> > >
> > > Cleanup function is not the case, I think :)
> >
> > It seems like I spot the incorrect cleanup function (hwsim_remove is
> > the right one is in my mind). Let me learn how to use ftrace to log
> > the executed functions and then discuss this case with you guys.
> >
>
> Hmmm, I think, there is a mess with lists.
>
> I just want to share my debug results, I have no idea about the fix for
> now.
>
> In hwsim_probe() edge for phy->idx = 1 is allocated, then reproduces
> sends a request to delete phy with idx == 0, so this check in
> hwsim_edge_unsubscribe_me():
>
>         if (e->endpoint->idx == phy->idx) {
>                 ... clean up code ...
>         }
>
> won't be passed and edge won't be freed (because it was allocated for
> phy with idx == 1). Allocated edge for phy 1 becomes leaked after
> hwsim_del(). I can't really see the code where phy with idx == 1 can
> be deleted from list...

Thanks for sharing your debugging result.

              hwsim_phys
                       |
   ---------------------------------
   |                                      |
sub0 (edges)                 sub1 (edges)
   ----> e (idx = 1)               ----> e (idx = 0)

hwsim_del_radio_nl will call hwsim_del to delete phy (idx:1).
However, in this function, it only deletes the e in the edge list of sub1.
Then it deletes phy (i.e., sub0) from the hwsim_phys list. So it
leaves the e in the edge list of sub0 non-free.

I proposed a patch and test it successfully in the syzbot dashboard.

diff --git a/drivers/net/ieee802154/mac802154_hwsim.c
b/drivers/net/ieee802154/mac802154_hwsim.c
index da9135231c07..b05159cff33a 100644
--- a/drivers/net/ieee802154/mac802154_hwsim.c
+++ b/drivers/net/ieee802154/mac802154_hwsim.c
@@ -824,9 +824,16 @@ static int hwsim_add_one(struct genl_info *info,
struct device *dev,
 static void hwsim_del(struct hwsim_phy *phy)
 {
  struct hwsim_pib *pib;
+ struct hwsim_edge *e;

  hwsim_edge_unsubscribe_me(phy);

+ // remove the edges in the list
+ list_for_each_entry_rcu(e, &phy->edges, list) {
+ list_del_rcu(&e->list);
+ hwsim_free_edge(e);
+ }
+
  list_del(&phy->list);

  rcu_read_lock();

 I will send a patch later.


>
> Maybe, it's kmemleak bug. Similar strange case was with this one
> https://syzkaller.appspot.com/bug?id=3a325b8389fc41c1bc94de0f4ac437ed13cce584.
> I find it strange, that I could reach leaked pointers after kmemleak reported a
> leak. Im not familiar with kmemleak internals and I might be wrong
>
>
> With regards,
> Pavel Skripkin

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: Suggestions on how to debug kernel crashes where printk and gdb both does not work
  2021-06-15 10:37               ` Dongliang Mu
@ 2021-06-15 13:34                 ` Pavel Skripkin
  2021-06-16  2:02                   ` Dongliang Mu
  0 siblings, 1 reply; 12+ messages in thread
From: Pavel Skripkin @ 2021-06-15 13:34 UTC (permalink / raw)
  To: Dongliang Mu
  Cc: alex.aring, David S. Miller, Jakub Kicinski, linux-kernel,
	linux-wpan, netdev, stefan, syzkaller-bugs,
	syzbot+b80c9959009a9325cdff, Dan Carpenter, Greg KH

On Tue, 15 Jun 2021 18:37:14 +0800
Dongliang Mu <mudongliangabcd@gmail.com> wrote:

> On Tue, Jun 15, 2021 at 4:30 AM Pavel Skripkin <paskripkin@gmail.com>
> wrote:
> >
> > On Mon, 14 Jun 2021 23:04:03 +0800
> > Dongliang Mu <mudongliangabcd@gmail.com> wrote:
> >
> > > On Mon, Jun 14, 2021 at 10:47 PM Pavel Skripkin
> > > <paskripkin@gmail.com> wrote:
> > > >
> > > > On Mon, 14 Jun 2021 22:40:55 +0800
> > > > Dongliang Mu <mudongliangabcd@gmail.com> wrote:
> > > >
> > > > > On Mon, Jun 14, 2021 at 10:25 PM Pavel Skripkin
> > > > > <paskripkin@gmail.com> wrote:
> > > > > >
> > > > > > On Mon, 14 Jun 2021 22:19:10 +0800
> > > > > > Dongliang Mu <mudongliangabcd@gmail.com> wrote:
> > > > > >
> > > > > > > On Mon, Jun 14, 2021 at 9:34 PM Pavel Skripkin
> > > > > > > <paskripkin@gmail.com> wrote:
> > > > > > > >
> > > > > > > > On Mon, 14 Jun 2021 21:22:43 +0800
> > > > > > > > Dongliang Mu <mudongliangabcd@gmail.com> wrote:
> > > > > > > >
> > > > > > > > > Dear kernel developers,
> > > > > > > > >
> > > > > > > > > I was trying to debug the crash - memory leak in
> > > > > > > > > hwsim_add_one [1] recently. However, I encountered a
> > > > > > > > > disgusting issue: my breakpoint and printk/pr_alert
> > > > > > > > > in the functions that will be surely executed do not
> > > > > > > > > work. The stack trace is in the following. I wrote
> > > > > > > > > this email to ask for some suggestions on how to
> > > > > > > > > debug such cases?
> > > > > > > > >
> > > > > > > > > Thanks very much. Looking forward to your reply.
> > > > > > > > >
> > > > > > > >
> > > > > > > > Hi, Dongliang!
> > > > > > > >
> > > > > > > > This bug is not similar to others on the dashboard. I
> > > > > > > > spent some time debugging it a week ago. The main
> > > > > > > > problem here, that memory allocation happens in the
> > > > > > > > boot time:
> > > > > > > >
> > > > > > > > > [<ffffffff84359255>] kernel_init+0xc/0x1a7
> > > > > > > > > init/main.c:1447
> > > > > > > >
> > > > > > >
> > > > > > > Oh, nice catch. No wonder why my debugging does not work.
> > > > > > > :(
> > > > > > >
> > > > > > > > and reproducer simply tries to
> > > > > > > > free this data. You can use ftrace to look at it. Smth
> > > > > > > > like this:
> > > > > > > >
> > > > > > > > $ echo 'hwsim_*' > $TRACE_DIR/set_ftrace_filter
> > > > > > >
> > > > > > > Thanks for your suggestion.
> > > > > > >
> > > > > > > Do you have any conclusions about this case? If you have
> > > > > > > found out the root cause and start writing patches, I
> > > > > > > will turn my focus to other cases.
> > > > > >
> > > > > > No, I had some busy days and I have nothing about this bug
> > > > > > for now. I've just traced the reproducer execution and
> > > > > > that's all :)
> > > > > >
> > > > > > I guess, some error handling paths are broken, but Im not
> > > > > > sure
> > > > >
> > > > > In the beginning, I agreed with you. However, after I manually
> > > > > checked functions: hwsim_probe (initialization) and
> > > > > hwsim_remove (cleanup), then things may be different. The
> > > > > cleanup looks correct to me. I would like to debug but stuck
> > > > > with the debugging process.
> > > > >
> > > > > And there is another issue: the cleanup function also does not
> > > > > output anything or hit the breakpoint. I don't quite
> > > > > understand it since the cleanup is not at the boot time.
> > > > >
> > > > > Any idea?
> > > > >
> > > >
> > > > Output from ftrace (syzkaller repro):
> > > >
> > > > root@syzkaller:~# cat /sys/kernel/tracing/trace
> > > > # tracer: function_graph
> > > > #
> > > > # CPU  DURATION                  FUNCTION CALLS
> > > > # |     |   |                     |   |   |   |
> > > >  1)               |  hwsim_del_radio_nl() {
> > > >  1)               |    hwsim_del() {
> > > >  1)               |      hwsim_edge_unsubscribe_me() {
> > > >  1) ! 310.041 us  |        hwsim_free_edge();
> > > >  1) ! 665.221 us  |      }
> > > >  1) * 52999.05 us |    }
> > > >  1) * 53035.38 us |  }
> > > >
> > > > Cleanup function is not the case, I think :)
> > >
> > > It seems like I spot the incorrect cleanup function (hwsim_remove
> > > is the right one is in my mind). Let me learn how to use ftrace
> > > to log the executed functions and then discuss this case with you
> > > guys.
> > >
> >
> > Hmmm, I think, there is a mess with lists.
> >
> > I just want to share my debug results, I have no idea about the fix
> > for now.
> >
> > In hwsim_probe() edge for phy->idx = 1 is allocated, then reproduces
> > sends a request to delete phy with idx == 0, so this check in
> > hwsim_edge_unsubscribe_me():
> >
> >         if (e->endpoint->idx == phy->idx) {
> >                 ... clean up code ...
> >         }
> >
> > won't be passed and edge won't be freed (because it was allocated
> > for phy with idx == 1). Allocated edge for phy 1 becomes leaked
> > after hwsim_del(). I can't really see the code where phy with idx
> > == 1 can be deleted from list...
> 
> Thanks for sharing your debugging result.
> 
>               hwsim_phys
>                        |
>    ---------------------------------
>    |                                      |
> sub0 (edges)                 sub1 (edges)
>    ----> e (idx = 1)               ----> e (idx = 0)
> 
> hwsim_del_radio_nl will call hwsim_del to delete phy (idx:1).
> However, in this function, it only deletes the e in the edge list of
> sub1. Then it deletes phy (i.e., sub0) from the hwsim_phys list. So it
> leaves the e in the edge list of sub0 non-free.
> 
> I proposed a patch and test it successfully in the syzbot dashboard.
> 

Cool! I thougth about similar fix before going to bed, but I had really
busy morning today :)

> diff --git a/drivers/net/ieee802154/mac802154_hwsim.c
> b/drivers/net/ieee802154/mac802154_hwsim.c
> index da9135231c07..b05159cff33a 100644
> --- a/drivers/net/ieee802154/mac802154_hwsim.c
> +++ b/drivers/net/ieee802154/mac802154_hwsim.c
> @@ -824,9 +824,16 @@ static int hwsim_add_one(struct genl_info *info,
> struct device *dev,
>  static void hwsim_del(struct hwsim_phy *phy)
>  {
>   struct hwsim_pib *pib;
> + struct hwsim_edge *e;
> 
>   hwsim_edge_unsubscribe_me(phy);
> 
> + // remove the edges in the list
> + list_for_each_entry_rcu(e, &phy->edges, list) {
> + list_del_rcu(&e->list);
> + hwsim_free_edge(e);
> + }
> +

I think, rcu_read_lock() and rcu_read_unlock() are needed here (like in
hwsim_edge_unsubscribe_me()). Or you can delete this edges after deleting
phy node from global list, then, i guess, rcu locking won't be needed
here.

>   list_del(&phy->list);
> 
>   rcu_read_lock();
> 
>  I will send a patch later.
> 
> 
> >
> > Maybe, it's kmemleak bug. Similar strange case was with this one
> > https://syzkaller.appspot.com/bug?id=3a325b8389fc41c1bc94de0f4ac437ed13cce584.
> > I find it strange, that I could reach leaked pointers after
> > kmemleak reported a leak. Im not familiar with kmemleak internals
> > and I might be wrong
> >
> >
> > With regards,
> > Pavel Skripkin




With regards,
Pavel Skripkin

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Suggestions on how to debug kernel crashes where printk and gdb both does not work
  2021-06-15 13:34                 ` Pavel Skripkin
@ 2021-06-16  2:02                   ` Dongliang Mu
  2021-06-16  2:11                     ` Dongliang Mu
  0 siblings, 1 reply; 12+ messages in thread
From: Dongliang Mu @ 2021-06-16  2:02 UTC (permalink / raw)
  To: Pavel Skripkin
  Cc: alex.aring, David S. Miller, Jakub Kicinski, linux-kernel,
	linux-wpan, netdev, stefan, syzkaller-bugs,
	syzbot+b80c9959009a9325cdff, Dan Carpenter, Greg KH

On Tue, Jun 15, 2021 at 9:35 PM Pavel Skripkin <paskripkin@gmail.com> wrote:
>
> On Tue, 15 Jun 2021 18:37:14 +0800
> Dongliang Mu <mudongliangabcd@gmail.com> wrote:
>
> > On Tue, Jun 15, 2021 at 4:30 AM Pavel Skripkin <paskripkin@gmail.com>
> > wrote:
> > >
> > > On Mon, 14 Jun 2021 23:04:03 +0800
> > > Dongliang Mu <mudongliangabcd@gmail.com> wrote:
> > >
> > > > On Mon, Jun 14, 2021 at 10:47 PM Pavel Skripkin
> > > > <paskripkin@gmail.com> wrote:
> > > > >
> > > > > On Mon, 14 Jun 2021 22:40:55 +0800
> > > > > Dongliang Mu <mudongliangabcd@gmail.com> wrote:
> > > > >
> > > > > > On Mon, Jun 14, 2021 at 10:25 PM Pavel Skripkin
> > > > > > <paskripkin@gmail.com> wrote:
> > > > > > >
> > > > > > > On Mon, 14 Jun 2021 22:19:10 +0800
> > > > > > > Dongliang Mu <mudongliangabcd@gmail.com> wrote:
> > > > > > >
> > > > > > > > On Mon, Jun 14, 2021 at 9:34 PM Pavel Skripkin
> > > > > > > > <paskripkin@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > On Mon, 14 Jun 2021 21:22:43 +0800
> > > > > > > > > Dongliang Mu <mudongliangabcd@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > > Dear kernel developers,
> > > > > > > > > >
> > > > > > > > > > I was trying to debug the crash - memory leak in
> > > > > > > > > > hwsim_add_one [1] recently. However, I encountered a
> > > > > > > > > > disgusting issue: my breakpoint and printk/pr_alert
> > > > > > > > > > in the functions that will be surely executed do not
> > > > > > > > > > work. The stack trace is in the following. I wrote
> > > > > > > > > > this email to ask for some suggestions on how to
> > > > > > > > > > debug such cases?
> > > > > > > > > >
> > > > > > > > > > Thanks very much. Looking forward to your reply.
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > Hi, Dongliang!
> > > > > > > > >
> > > > > > > > > This bug is not similar to others on the dashboard. I
> > > > > > > > > spent some time debugging it a week ago. The main
> > > > > > > > > problem here, that memory allocation happens in the
> > > > > > > > > boot time:
> > > > > > > > >
> > > > > > > > > > [<ffffffff84359255>] kernel_init+0xc/0x1a7
> > > > > > > > > > init/main.c:1447
> > > > > > > > >
> > > > > > > >
> > > > > > > > Oh, nice catch. No wonder why my debugging does not work.
> > > > > > > > :(
> > > > > > > >
> > > > > > > > > and reproducer simply tries to
> > > > > > > > > free this data. You can use ftrace to look at it. Smth
> > > > > > > > > like this:
> > > > > > > > >
> > > > > > > > > $ echo 'hwsim_*' > $TRACE_DIR/set_ftrace_filter
> > > > > > > >
> > > > > > > > Thanks for your suggestion.
> > > > > > > >
> > > > > > > > Do you have any conclusions about this case? If you have
> > > > > > > > found out the root cause and start writing patches, I
> > > > > > > > will turn my focus to other cases.
> > > > > > >
> > > > > > > No, I had some busy days and I have nothing about this bug
> > > > > > > for now. I've just traced the reproducer execution and
> > > > > > > that's all :)
> > > > > > >
> > > > > > > I guess, some error handling paths are broken, but Im not
> > > > > > > sure
> > > > > >
> > > > > > In the beginning, I agreed with you. However, after I manually
> > > > > > checked functions: hwsim_probe (initialization) and
> > > > > > hwsim_remove (cleanup), then things may be different. The
> > > > > > cleanup looks correct to me. I would like to debug but stuck
> > > > > > with the debugging process.
> > > > > >
> > > > > > And there is another issue: the cleanup function also does not
> > > > > > output anything or hit the breakpoint. I don't quite
> > > > > > understand it since the cleanup is not at the boot time.
> > > > > >
> > > > > > Any idea?
> > > > > >
> > > > >
> > > > > Output from ftrace (syzkaller repro):
> > > > >
> > > > > root@syzkaller:~# cat /sys/kernel/tracing/trace
> > > > > # tracer: function_graph
> > > > > #
> > > > > # CPU  DURATION                  FUNCTION CALLS
> > > > > # |     |   |                     |   |   |   |
> > > > >  1)               |  hwsim_del_radio_nl() {
> > > > >  1)               |    hwsim_del() {
> > > > >  1)               |      hwsim_edge_unsubscribe_me() {
> > > > >  1) ! 310.041 us  |        hwsim_free_edge();
> > > > >  1) ! 665.221 us  |      }
> > > > >  1) * 52999.05 us |    }
> > > > >  1) * 53035.38 us |  }
> > > > >
> > > > > Cleanup function is not the case, I think :)
> > > >
> > > > It seems like I spot the incorrect cleanup function (hwsim_remove
> > > > is the right one is in my mind). Let me learn how to use ftrace
> > > > to log the executed functions and then discuss this case with you
> > > > guys.
> > > >
> > >
> > > Hmmm, I think, there is a mess with lists.
> > >
> > > I just want to share my debug results, I have no idea about the fix
> > > for now.
> > >
> > > In hwsim_probe() edge for phy->idx = 1 is allocated, then reproduces
> > > sends a request to delete phy with idx == 0, so this check in
> > > hwsim_edge_unsubscribe_me():
> > >
> > >         if (e->endpoint->idx == phy->idx) {
> > >                 ... clean up code ...
> > >         }
> > >
> > > won't be passed and edge won't be freed (because it was allocated
> > > for phy with idx == 1). Allocated edge for phy 1 becomes leaked
> > > after hwsim_del(). I can't really see the code where phy with idx
> > > == 1 can be deleted from list...
> >
> > Thanks for sharing your debugging result.
> >
> >               hwsim_phys
> >                        |
> >    ---------------------------------
> >    |                                      |
> > sub0 (edges)                 sub1 (edges)
> >    ----> e (idx = 1)               ----> e (idx = 0)
> >
> > hwsim_del_radio_nl will call hwsim_del to delete phy (idx:1).
> > However, in this function, it only deletes the e in the edge list of
> > sub1. Then it deletes phy (i.e., sub0) from the hwsim_phys list. So it
> > leaves the e in the edge list of sub0 non-free.
> >
> > I proposed a patch and test it successfully in the syzbot dashboard.
> >
>
> Cool! I thougth about similar fix before going to bed, but I had really
> busy morning today :)
>
> > diff --git a/drivers/net/ieee802154/mac802154_hwsim.c
> > b/drivers/net/ieee802154/mac802154_hwsim.c
> > index da9135231c07..b05159cff33a 100644
> > --- a/drivers/net/ieee802154/mac802154_hwsim.c
> > +++ b/drivers/net/ieee802154/mac802154_hwsim.c
> > @@ -824,9 +824,16 @@ static int hwsim_add_one(struct genl_info *info,
> > struct device *dev,
> >  static void hwsim_del(struct hwsim_phy *phy)
> >  {
> >   struct hwsim_pib *pib;
> > + struct hwsim_edge *e;
> >
> >   hwsim_edge_unsubscribe_me(phy);
> >
> > + // remove the edges in the list
> > + list_for_each_entry_rcu(e, &phy->edges, list) {
> > + list_del_rcu(&e->list);
> > + hwsim_free_edge(e);
> > + }
> > +
>
> I think, rcu_read_lock() and rcu_read_unlock() are needed here (like in
> hwsim_edge_unsubscribe_me()). Or you can delete this edges after deleting
> phy node from global list, then, i guess, rcu locking won't be needed
> here.

Yes, you're right. rcu_read_lock is needed here. However, from the
code below list_del(&phy->list), I think we'd better still add
rcu_read_lock for those statements.

How do you think about the following patch? BTW, I've sent a patch
with the prefix PATCH. Maybe we can discuss this patch there.

diff --git a/drivers/net/ieee802154/mac802154_hwsim.c
b/drivers/net/ieee802154/mac802154_hwsim.c
index da9135231c07..cf659361a3fb 100644
--- a/drivers/net/ieee802154/mac802154_hwsim.c
+++ b/drivers/net/ieee802154/mac802154_hwsim.c
@@ -824,12 +824,17 @@ static int hwsim_add_one(struct genl_info *info,
struct device *dev,
 static void hwsim_del(struct hwsim_phy *phy)
 {
        struct hwsim_pib *pib;
+       struct hwsim_edge *e;

        hwsim_edge_unsubscribe_me(phy);

        list_del(&phy->list);

        rcu_read_lock();
+       list_for_each_entry_rcu(e, &phy->edges, list) {
+               list_del_rcu(&e->list);
+               hwsim_free_edge(e);
+       }
        pib = rcu_dereference(phy->pib);
        rcu_read_unlock();

>
> >   list_del(&phy->list);
> >
> >   rcu_read_lock();
> >
> >  I will send a patch later.
> >
> >
> > >
> > > Maybe, it's kmemleak bug. Similar strange case was with this one
> > > https://syzkaller.appspot.com/bug?id=3a325b8389fc41c1bc94de0f4ac437ed13cce584.
> > > I find it strange, that I could reach leaked pointers after
> > > kmemleak reported a leak. Im not familiar with kmemleak internals
> > > and I might be wrong
> > >
> > >
> > > With regards,
> > > Pavel Skripkin
>
>
>
>
> With regards,
> Pavel Skripkin

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: Suggestions on how to debug kernel crashes where printk and gdb both does not work
  2021-06-16  2:02                   ` Dongliang Mu
@ 2021-06-16  2:11                     ` Dongliang Mu
  0 siblings, 0 replies; 12+ messages in thread
From: Dongliang Mu @ 2021-06-16  2:11 UTC (permalink / raw)
  To: Pavel Skripkin
  Cc: alex.aring, David S. Miller, Jakub Kicinski, linux-kernel,
	linux-wpan, netdev, stefan, syzkaller-bugs,
	syzbot+b80c9959009a9325cdff, Dan Carpenter, Greg KH

On Wed, Jun 16, 2021 at 10:02 AM Dongliang Mu <mudongliangabcd@gmail.com> wrote:
>
> On Tue, Jun 15, 2021 at 9:35 PM Pavel Skripkin <paskripkin@gmail.com> wrote:
> >
> > On Tue, 15 Jun 2021 18:37:14 +0800
> > Dongliang Mu <mudongliangabcd@gmail.com> wrote:
> >
> > > On Tue, Jun 15, 2021 at 4:30 AM Pavel Skripkin <paskripkin@gmail.com>
> > > wrote:
> > > >
> > > > On Mon, 14 Jun 2021 23:04:03 +0800
> > > > Dongliang Mu <mudongliangabcd@gmail.com> wrote:
> > > >
> > > > > On Mon, Jun 14, 2021 at 10:47 PM Pavel Skripkin
> > > > > <paskripkin@gmail.com> wrote:
> > > > > >
> > > > > > On Mon, 14 Jun 2021 22:40:55 +0800
> > > > > > Dongliang Mu <mudongliangabcd@gmail.com> wrote:
> > > > > >
> > > > > > > On Mon, Jun 14, 2021 at 10:25 PM Pavel Skripkin
> > > > > > > <paskripkin@gmail.com> wrote:
> > > > > > > >
> > > > > > > > On Mon, 14 Jun 2021 22:19:10 +0800
> > > > > > > > Dongliang Mu <mudongliangabcd@gmail.com> wrote:
> > > > > > > >
> > > > > > > > > On Mon, Jun 14, 2021 at 9:34 PM Pavel Skripkin
> > > > > > > > > <paskripkin@gmail.com> wrote:
> > > > > > > > > >
> > > > > > > > > > On Mon, 14 Jun 2021 21:22:43 +0800
> > > > > > > > > > Dongliang Mu <mudongliangabcd@gmail.com> wrote:
> > > > > > > > > >
> > > > > > > > > > > Dear kernel developers,
> > > > > > > > > > >
> > > > > > > > > > > I was trying to debug the crash - memory leak in
> > > > > > > > > > > hwsim_add_one [1] recently. However, I encountered a
> > > > > > > > > > > disgusting issue: my breakpoint and printk/pr_alert
> > > > > > > > > > > in the functions that will be surely executed do not
> > > > > > > > > > > work. The stack trace is in the following. I wrote
> > > > > > > > > > > this email to ask for some suggestions on how to
> > > > > > > > > > > debug such cases?
> > > > > > > > > > >
> > > > > > > > > > > Thanks very much. Looking forward to your reply.
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Hi, Dongliang!
> > > > > > > > > >
> > > > > > > > > > This bug is not similar to others on the dashboard. I
> > > > > > > > > > spent some time debugging it a week ago. The main
> > > > > > > > > > problem here, that memory allocation happens in the
> > > > > > > > > > boot time:
> > > > > > > > > >
> > > > > > > > > > > [<ffffffff84359255>] kernel_init+0xc/0x1a7
> > > > > > > > > > > init/main.c:1447
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > Oh, nice catch. No wonder why my debugging does not work.
> > > > > > > > > :(
> > > > > > > > >
> > > > > > > > > > and reproducer simply tries to
> > > > > > > > > > free this data. You can use ftrace to look at it. Smth
> > > > > > > > > > like this:
> > > > > > > > > >
> > > > > > > > > > $ echo 'hwsim_*' > $TRACE_DIR/set_ftrace_filter
> > > > > > > > >
> > > > > > > > > Thanks for your suggestion.
> > > > > > > > >
> > > > > > > > > Do you have any conclusions about this case? If you have
> > > > > > > > > found out the root cause and start writing patches, I
> > > > > > > > > will turn my focus to other cases.
> > > > > > > >
> > > > > > > > No, I had some busy days and I have nothing about this bug
> > > > > > > > for now. I've just traced the reproducer execution and
> > > > > > > > that's all :)
> > > > > > > >
> > > > > > > > I guess, some error handling paths are broken, but Im not
> > > > > > > > sure
> > > > > > >
> > > > > > > In the beginning, I agreed with you. However, after I manually
> > > > > > > checked functions: hwsim_probe (initialization) and
> > > > > > > hwsim_remove (cleanup), then things may be different. The
> > > > > > > cleanup looks correct to me. I would like to debug but stuck
> > > > > > > with the debugging process.
> > > > > > >
> > > > > > > And there is another issue: the cleanup function also does not
> > > > > > > output anything or hit the breakpoint. I don't quite
> > > > > > > understand it since the cleanup is not at the boot time.
> > > > > > >
> > > > > > > Any idea?
> > > > > > >
> > > > > >
> > > > > > Output from ftrace (syzkaller repro):
> > > > > >
> > > > > > root@syzkaller:~# cat /sys/kernel/tracing/trace
> > > > > > # tracer: function_graph
> > > > > > #
> > > > > > # CPU  DURATION                  FUNCTION CALLS
> > > > > > # |     |   |                     |   |   |   |
> > > > > >  1)               |  hwsim_del_radio_nl() {
> > > > > >  1)               |    hwsim_del() {
> > > > > >  1)               |      hwsim_edge_unsubscribe_me() {
> > > > > >  1) ! 310.041 us  |        hwsim_free_edge();
> > > > > >  1) ! 665.221 us  |      }
> > > > > >  1) * 52999.05 us |    }
> > > > > >  1) * 53035.38 us |  }
> > > > > >
> > > > > > Cleanup function is not the case, I think :)
> > > > >
> > > > > It seems like I spot the incorrect cleanup function (hwsim_remove
> > > > > is the right one is in my mind). Let me learn how to use ftrace
> > > > > to log the executed functions and then discuss this case with you
> > > > > guys.
> > > > >
> > > >
> > > > Hmmm, I think, there is a mess with lists.
> > > >
> > > > I just want to share my debug results, I have no idea about the fix
> > > > for now.
> > > >
> > > > In hwsim_probe() edge for phy->idx = 1 is allocated, then reproduces
> > > > sends a request to delete phy with idx == 0, so this check in
> > > > hwsim_edge_unsubscribe_me():
> > > >
> > > >         if (e->endpoint->idx == phy->idx) {
> > > >                 ... clean up code ...
> > > >         }
> > > >
> > > > won't be passed and edge won't be freed (because it was allocated
> > > > for phy with idx == 1). Allocated edge for phy 1 becomes leaked
> > > > after hwsim_del(). I can't really see the code where phy with idx
> > > > == 1 can be deleted from list...
> > >
> > > Thanks for sharing your debugging result.
> > >
> > >               hwsim_phys
> > >                        |
> > >    ---------------------------------
> > >    |                                      |
> > > sub0 (edges)                 sub1 (edges)
> > >    ----> e (idx = 1)               ----> e (idx = 0)
> > >
> > > hwsim_del_radio_nl will call hwsim_del to delete phy (idx:1).
> > > However, in this function, it only deletes the e in the edge list of
> > > sub1. Then it deletes phy (i.e., sub0) from the hwsim_phys list. So it
> > > leaves the e in the edge list of sub0 non-free.
> > >
> > > I proposed a patch and test it successfully in the syzbot dashboard.
> > >
> >
> > Cool! I thougth about similar fix before going to bed, but I had really
> > busy morning today :)
> >
> > > diff --git a/drivers/net/ieee802154/mac802154_hwsim.c
> > > b/drivers/net/ieee802154/mac802154_hwsim.c
> > > index da9135231c07..b05159cff33a 100644
> > > --- a/drivers/net/ieee802154/mac802154_hwsim.c
> > > +++ b/drivers/net/ieee802154/mac802154_hwsim.c
> > > @@ -824,9 +824,16 @@ static int hwsim_add_one(struct genl_info *info,
> > > struct device *dev,
> > >  static void hwsim_del(struct hwsim_phy *phy)
> > >  {
> > >   struct hwsim_pib *pib;
> > > + struct hwsim_edge *e;
> > >
> > >   hwsim_edge_unsubscribe_me(phy);
> > >
> > > + // remove the edges in the list
> > > + list_for_each_entry_rcu(e, &phy->edges, list) {
> > > + list_del_rcu(&e->list);
> > > + hwsim_free_edge(e);
> > > + }
> > > +
> >
> > I think, rcu_read_lock() and rcu_read_unlock() are needed here (like in
> > hwsim_edge_unsubscribe_me()). Or you can delete this edges after deleting
> > phy node from global list, then, i guess, rcu locking won't be needed
> > here.
>
> Yes, you're right. rcu_read_lock is needed here. However, from the
> code below list_del(&phy->list), I think we'd better still add
> rcu_read_lock for those statements.
>
> How do you think about the following patch? BTW, I've sent a patch
> with the prefix PATCH. Maybe we can discuss this patch there.
>
> diff --git a/drivers/net/ieee802154/mac802154_hwsim.c
> b/drivers/net/ieee802154/mac802154_hwsim.c
> index da9135231c07..cf659361a3fb 100644
> --- a/drivers/net/ieee802154/mac802154_hwsim.c
> +++ b/drivers/net/ieee802154/mac802154_hwsim.c
> @@ -824,12 +824,17 @@ static int hwsim_add_one(struct genl_info *info,
> struct device *dev,
>  static void hwsim_del(struct hwsim_phy *phy)
>  {
>         struct hwsim_pib *pib;
> +       struct hwsim_edge *e;
>
>         hwsim_edge_unsubscribe_me(phy);
>
>         list_del(&phy->list);
>
>         rcu_read_lock();
> +       list_for_each_entry_rcu(e, &phy->edges, list) {
> +               list_del_rcu(&e->list);
> +               hwsim_free_edge(e);
> +       }
>         pib = rcu_dereference(phy->pib);
>         rcu_read_unlock();
>

I have sent a v2 patch to the mailing list, please discuss the patch
in the corresponding thread [1].

[1] https://lkml.org/lkml/2021/6/15/1585.

> >
> > >   list_del(&phy->list);
> > >
> > >   rcu_read_lock();
> > >
> > >  I will send a patch later.
> > >
> > >
> > > >
> > > > Maybe, it's kmemleak bug. Similar strange case was with this one
> > > > https://syzkaller.appspot.com/bug?id=3a325b8389fc41c1bc94de0f4ac437ed13cce584.
> > > > I find it strange, that I could reach leaked pointers after
> > > > kmemleak reported a leak. Im not familiar with kmemleak internals
> > > > and I might be wrong
> > > >
> > > >
> > > > With regards,
> > > > Pavel Skripkin
> >
> >
> >
> >
> > With regards,
> > Pavel Skripkin

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2021-06-16  2:11 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-06-14 13:22 Suggestions on how to debug kernel crashes where printk and gdb both does not work Dongliang Mu
2021-06-14 13:34 ` Pavel Skripkin
2021-06-14 14:19   ` Dongliang Mu
2021-06-14 14:25     ` Pavel Skripkin
2021-06-14 14:40       ` Dongliang Mu
2021-06-14 14:47         ` Pavel Skripkin
2021-06-14 15:04           ` Dongliang Mu
2021-06-14 20:30             ` Pavel Skripkin
2021-06-15 10:37               ` Dongliang Mu
2021-06-15 13:34                 ` Pavel Skripkin
2021-06-16  2:02                   ` Dongliang Mu
2021-06-16  2:11                     ` Dongliang Mu

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.