From: Nilay Shroff <nilay@linux.ibm.com>
To: Sagi Grimberg <sagi@grimberg.me>,
	"linux-nvme@lists.infradead.org" <linux-nvme@lists.infradead.org>
Cc: Keith Busch <kbusch@kernel.org>, Christoph Hellwig <hch@lst.de>,
	"axboe@fb.com" <axboe@fb.com>, Gregory Joyce <gjoyce@ibm.com>,
	Srimannarayana Murthy Maram <msmurthy@imap.linux.ibm.com>
Subject: Re: [Bug Report] PCIe errinject and hot-unplug causes nvme driver hang
Date: Sun, 21 Apr 2024 22:26:54 +0530
Message-ID: <d33a5681-b195-4258-8eee-e0eae46ade5b@linux.ibm.com>
In-Reply-To: <579c82da-52a7-4425-81d7-480c676b8cbb@grimberg.me>



On 4/21/24 15:58, Sagi Grimberg wrote:
> 
> 
> On 18/04/2024 15:52, Nilay Shroff wrote:
>> Hi,
>>
>> We found that the nvme driver hangs when disk IO is ongoing and we inject a pcie error and then hot-unplug (not a physical but a logical unplug) the nvme disk.
>>
>> Notes and observations:
>> ======================
>> This is observed on the latest Linus kernel tree (v6.9-rc4); however, we believe this issue is also present on older kernels.
>>
>> Test details:
>> =============
>> Steps to reproduce this issue:
>>
>> 1. Run some disk IO using fio or any other tool
>> 2. While disk IO is running, inject a pci error
>> 3. Disable the slot where the nvme disk is attached (echo 0 > /sys/bus/pci/slots/<slot-no>/power)
>>
>> Kernel Logs:
>> ============
>> When we follow the steps described in the test details, we get the logs below:
>>
>> [  295.240811] nvme nvme1: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10
>> [  295.240837] nvme nvme1: Does your device have a faulty power saving mode enabled?
>> [  295.240845] nvme nvme1: Try "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off" and report a bug
>> [  490.381591] INFO: task bash:2510 blocked for more than 122 seconds.
>> [  490.381614]       Not tainted 6.9.0-rc4+ #8
>> [  490.381618] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> [  490.381623] task:bash            state:D stack:0     pid:2510  tgid:2510  ppid:2509   flags:0x00042080
>> [  490.381632] Call Trace:
>> [  490.381635] [c00000006748f510] [c00000006748f550] 0xc00000006748f550 (unreliable)
>> [  490.381644] [c00000006748f6c0] [c00000000001f3fc] __switch_to+0x13c/0x220
>> [  490.381654] [c00000006748f720] [c000000000fb87e0] __schedule+0x268/0x7c4
>> [  490.381663] [c00000006748f7f0] [c000000000fb8d7c] schedule+0x40/0x108
>> [  490.381669] [c00000006748f860] [c000000000808bb4] blk_mq_freeze_queue_wait+0xa4/0xec
>> [  490.381676] [c00000006748f8c0] [c00000000081eba8] del_gendisk+0x284/0x464
>> [  490.381683] [c00000006748f920] [c0080000064c74a4] nvme_ns_remove+0x138/0x2ac [nvme_core]
>> [  490.381697] [c00000006748f960] [c0080000064c7704] nvme_remove_namespaces+0xec/0x198 [nvme_core]
>> [  490.381710] [c00000006748f9d0] [c008000006704b70] nvme_remove+0x80/0x168 [nvme]
>> [  490.381752] [c00000006748fa10] [c00000000092a10c] pci_device_remove+0x6c/0x110
>> [  490.381776] [c00000006748fa50] [c000000000a4f504] device_remove+0x70/0xc4
>> [  490.381786] [c00000006748fa80] [c000000000a515d8] device_release_driver_internal+0x2a4/0x324
>> [  490.381801] [c00000006748fad0] [c00000000091b528] pci_stop_bus_device+0xb8/0x104
>> [  490.381817] [c00000006748fb10] [c00000000091b910] pci_stop_and_remove_bus_device+0x28/0x40
>> [  490.381826] [c00000006748fb40] [c000000000072620] pci_hp_remove_devices+0x90/0x128
>> [  490.381831] [c00000006748fbd0] [c008000004440504] disable_slot+0x40/0x90 [rpaphp]
>> [  490.381839] [c00000006748fc00] [c000000000956090] power_write_file+0xf8/0x19c
>> [  490.381846] [c00000006748fc80] [c00000000094b4f8] pci_slot_attr_store+0x40/0x5c
>> [  490.381851] [c00000006748fca0] [c0000000006e5dc4] sysfs_kf_write+0x64/0x78
>> [  490.381858] [c00000006748fcc0] [c0000000006e48d8] kernfs_fop_write_iter+0x1b0/0x290
>> [  490.381864] [c00000006748fd10] [c0000000005e0f4c] vfs_write+0x3b0/0x4f8
>> [  490.381871] [c00000006748fdc0] [c0000000005e13c0] ksys_write+0x84/0x140
>> [  490.381876] [c00000006748fe10] [c000000000030a84] system_call_exception+0x124/0x330
>> [  490.381882] [c00000006748fe50] [c00000000000cedc] system_call_vectored_common+0x15c/0x2ec
>>
>> NVMe controller state:
>> ======================
>> # cat /sys/class/nvme/nvme1/state
>> deleting (no IO)
>>
>> Process State:
>> ==============
>> # ps -aex
>>     [..]
>>     2510 pts/2    Ds+    0:00 -bash USER=root LOGNAME=root HOME=/root PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin SHELL=/bin/bash TERM=xterm-256colo
>>     2549 ?        Ds     0:14 fio --filename=/dev/nvme1n1 --direct=1 --rw=randrw --bs=4k --ioengine=psync --iodepth=256 --runtime=300 --numjobs=1 --time_based
>>     [..]
>>
>> Observation:
>> ============
>> As is apparent from the above logs, "disable-slot" (pid 2510) is waiting (uninterruptible sleep)
>> for the queue to be frozen because the in-flight IO(s) couldn't finish. Moreover, the IO(s) which were
>> in flight actually time out, but nvme_timeout() doesn't cancel those IOs; it only logs the error
>> "Try "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off" and report a bug" and returns BLK_EH_DONE.
>> As those in-flight IOs were not cancelled, the NVMe driver code which runs in the context of
>> "disable-slot" couldn't make forward progress, and the NVMe controller state remains "deleting (no IO)"
>> indefinitely. The only way we found to come out of this state is to reboot the system.
>>
>> Proposed fix:
>> ============
>> static void nvme_remove(struct pci_dev *pdev)
>> {
>>     struct nvme_dev *dev = pci_get_drvdata(pdev);
>>
>>     nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_DELETING);
>>     pci_set_drvdata(pdev, NULL);
>>
>>     if (!pci_device_is_present(pdev)) {
>>         nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_DEAD);
>>         nvme_dev_disable(dev, true);
>>     }
>>     flush_work(&dev->ctrl.reset_work);
>>     nvme_stop_ctrl(&dev->ctrl);
>>     nvme_remove_namespaces(&dev->ctrl); <== here the ctrl state is set to "deleting (no IO)"
>>          [..]
>> }
>>
>> As shown above, nvme_remove() invokes nvme_dev_disable(); however, it is only invoked if the
>> device has been physically removed. As nvme_dev_disable() helps cancel pending IOs, does it make
>> sense to unconditionally cancel pending IOs before moving on? Or are there any side effects if
>> we were to unconditionally invoke nvme_dev_disable() in the first place?
> 
> Shouldn't the correct place to handle the cancellation be nvme_error_detected(), given that the
> pci error is preventing the request from completing and the timeout handler from addressing it?
> 
If a platform supports pci-error-recovery then it may be possible to cancel pending IOs from
nvme_error_detected(). However, if the platform doesn't support pci error recovery then
nvme_error_detected() would never be called. In fact, the issue I reported above was
discovered on a platform which has pci-error-recovery disabled.
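
For reference, and if I read the code correctly, the frozen-channel path in the driver's
->error_detected() callback already disables the controller, and that disable is what cancels
the outstanding requests. Roughly (paraphrased from drivers/nvme/host/pci.c and trimmed, so
treat the details as approximate):

static pci_ers_result_t nvme_error_detected(struct pci_dev *pdev,
					    pci_channel_state_t state)
{
	struct nvme_dev *dev = pci_get_drvdata(pdev);

	switch (state) {
	...
	case pci_channel_io_frozen:
		dev_warn(dev->ctrl.device,
			 "frozen state error detected, reset controller\n");
		if (!nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_RESETTING)) {
			nvme_dev_disable(dev, true);
			return PCI_ERS_RESULT_DISCONNECT;
		}
		/* disabling the device is what cancels the outstanding IOs */
		nvme_dev_disable(dev, false);
		return PCI_ERS_RESULT_NEED_RESET;
	...
	}
}

But none of this runs when the platform never invokes the ->error_detected() callback,
which is exactly the case I hit.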

I also tested this scenario on a platform supporting pci error recovery. On this platform,
when I ran this test (PCI error injection followed by NVMe hot unplug), I found that the
pci-error-recovery thread races with the hot-unplug task. Please find below the dmesg logs
collected when this issue manifests:

INFO: task eehd:187 blocked for more than 122 seconds.
      Not tainted 6.9.0-rc4+ #8
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:eehd            state:D stack:0     pid:187   tgid:187   ppid:2      flags:0x00000000
Call Trace:
[c0000000089bf890] [c000000000fb027c] vsnprintf+0x3f8/0x578 (unreliable)
[c0000000089bfa40] [c00000000001f3fc] __switch_to+0x13c/0x220
[c0000000089bfaa0] [c000000000fb87e0] __schedule+0x268/0x7c4
[c0000000089bfb70] [c000000000fb8d7c] schedule+0x40/0x108
[c0000000089bfbe0] [c000000000fb93f8] schedule_preempt_disabled+0x20/0x30
[c0000000089bfc00] [c000000000fbbe84] __mutex_lock.constprop.0+0x5f4/0xc54
[c0000000089bfca0] [c000000000916380] pci_lock_rescan_remove+0x28/0x3c
[c0000000089bfcc0] [c00000000004fa4c] eeh_pe_report_edev+0x3c/0x52c
[c0000000089bfda0] [c00000000004ffdc] eeh_pe_report+0xa0/0x158
[c0000000089bfe40] [c000000000050490] eeh_handle_normal_event+0x310/0xa24
[c0000000089bff30] [c000000000051078] eeh_event_handler+0x118/0x19c
[c0000000089bff90] [c00000000018d29c] kthread+0x138/0x140
[c0000000089bffe0] [c00000000000dd58] start_kernel_thread+0x14/0x18

INFO: task bash:5420 blocked for more than 122 seconds.
      Not tainted 6.9.0-rc4+ #8
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:bash            state:D stack:0     pid:5420  tgid:5420  ppid:5419   flags:0x00042080
Call Trace:
[c000000054c67510] [c000000054c67550] 0xc000000054c67550 (unreliable)
[c000000054c676c0] [c00000000001f3fc] __switch_to+0x13c/0x220
[c000000054c67720] [c000000000fb87e0] __schedule+0x268/0x7c4
[c000000054c677f0] [c000000000fb8d7c] schedule+0x40/0x108
[c000000054c67860] [c000000000808bb4] blk_mq_freeze_queue_wait+0xa4/0xec
[c000000054c678c0] [c00000000081eba8] del_gendisk+0x284/0x464
[c000000054c67920] [c00800000b7174a4] nvme_ns_remove+0x138/0x2ac [nvme_core]
[c000000054c67960] [c00800000b717704] nvme_remove_namespaces+0xec/0x198 [nvme_core]
[c000000054c679d0] [c008000006294b70] nvme_remove+0x80/0x168 [nvme]
[c000000054c67a10] [c00000000092a10c] pci_device_remove+0x6c/0x110
[c000000054c67a50] [c000000000a4f504] device_remove+0x70/0xc4
[c000000054c67a80] [c000000000a515d8] device_release_driver_internal+0x2a4/0x324
[c000000054c67ad0] [c00000000091b528] pci_stop_bus_device+0xb8/0x104
[c000000054c67b10] [c00000000091b910] pci_stop_and_remove_bus_device+0x28/0x40
[c000000054c67b40] [c000000000072620] pci_hp_remove_devices+0x90/0x128
[c000000054c67bd0] [c008000004380504] disable_slot+0x40/0x90 [rpaphp]
[c000000054c67c00] [c000000000956090] power_write_file+0xf8/0x19c
[c000000054c67c80] [c00000000094b4f8] pci_slot_attr_store+0x40/0x5c
[c000000054c67ca0] [c0000000006e5dc4] sysfs_kf_write+0x64/0x78
[c000000054c67cc0] [c0000000006e48d8] kernfs_fop_write_iter+0x1b0/0x290
[c000000054c67d10] [c0000000005e0f4c] vfs_write+0x3b0/0x4f8
[c000000054c67dc0] [c0000000005e13c0] ksys_write+0x84/0x140
[c000000054c67e10] [c000000000030a84] system_call_exception+0x124/0x330
[c000000054c67e50] [c00000000000cedc] system_call_vectored_common+0x15c/0x2ec


As we can see above, the task eehd:187 (pci-error-recovery thread) is blocked waiting to
acquire the mutex (pci_rescan_remove_lock). The hot-unplug task (bash:5420)
is blocked waiting for the request queue to be frozen. The hot-unplug task first acquires
pci_rescan_remove_lock and then invokes the nvme_remove() method. So, in summary,
in the traces shown above, because the hot-unplug task is unable to proceed, it has also
blocked the pci-error-recovery thread.

On another note, if the pci-error-recovery thread could acquire pci_rescan_remove_lock
first, it would be able to recover the pci error, and hence the pending IOs could
finish. Later, when the hot-unplug task starts, it could make forward progress and
clean up all resources used by the nvme disk.

So does it make sense to unconditionally cancel the pending IOs from
nvme_remove() before it proceeds to remove the namespaces?
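
To make the proposal concrete, here is a rough, untested sketch of what I mean by cancelling
the pending IOs unconditionally: only the nvme_dev_disable() call is moved out of the
pci_device_is_present() check, everything else stays as it is today:

static void nvme_remove(struct pci_dev *pdev)
{
	struct nvme_dev *dev = pci_get_drvdata(pdev);

	nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_DELETING);
	pci_set_drvdata(pdev, NULL);

	if (!pci_device_is_present(pdev))
		nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_DEAD);

	/*
	 * Disable the device unconditionally so that any IOs still stuck
	 * in flight (which nvme_timeout() declined to cancel) are cancelled
	 * before del_gendisk() waits for the queues to be frozen.
	 */
	nvme_dev_disable(dev, true);

	flush_work(&dev->ctrl.reset_work);
	nvme_stop_ctrl(&dev->ctrl);
	nvme_remove_namespaces(&dev->ctrl);
	[..]
}

If there are side effects to disabling the device this early that I'm missing, I'd be happy
to hear about them.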

Thanks,
--Nilay

