All the mail mirrored from lore.kernel.org
 help / color / mirror / Atom feed
From: Yu Kuai <yukuai1@huaweicloud.com>
To: Changhui Zhong <czhong@redhat.com>, Yu Kuai <yukuai1@huaweicloud.com>
Cc: Ming Lei <ming.lei@redhat.com>,
	Linux Block Devices <linux-block@vger.kernel.org>,
	dm-devel@lists.linux.dev, Mike Snitzer <snitzer@kernel.org>,
	Mikulas Patocka <mpatocka@redhat.com>, Song Liu <song@kernel.org>,
	linux-raid@vger.kernel.org, Xiao Ni <xni@redhat.com>,
	"yukuai (C)" <yukuai3@huawei.com>
Subject: Re: [bug report] INFO: task mdX_resync:42168 blocked for more than 122 seconds
Date: Fri, 17 May 2024 10:49:21 +0800	[thread overview]
Message-ID: <1b35a177-670a-4d2f-0b68-6eda769af37d@huaweicloud.com> (raw)
In-Reply-To: <CAGVVp+XtThX7=bZm441VxyVd-wv_ycdqMU=19a2pa4wUkbkJ3g@mail.gmail.com>

Hi,

在 2024/05/17 10:25, Changhui Zhong 写道:
> On Thu, May 16, 2024 at 7:42 PM Yu Kuai <yukuai1@huaweicloud.com> wrote:
>>
>> Hi,
>>
>> 在 2024/05/16 19:21, Ming Lei 写道:
>>> Cc raid and dm list.
>>>
>>> On Thu, May 16, 2024 at 06:24:18PM +0800, Changhui Zhong wrote:
>>>> Hello,
>>>>
>>>> when create lvm raid1, the command hang on for a long time.
>>>> please help check it and let me know if you need any info/testing for
>>>> it, thanks.
>>
>> Is this a new test, or a new problem?
> 
> it is a new problem, I am not hit this issue on 6.9.0-rc4+

There is just one patch for raid1 applied since v6.9-rc4, and I think
it's not related. Perhaps can you try to bisect?

> 
>>>>
>>>> repo:https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git
>>>> branch:for-next
>>>> commit: 59ef8180748269837975c9656b586daa16bb9def
>>>>
>>>> reproducer:
>>>> dd if=/dev/zero bs=1M count=2000 of=file0.img
>>>> dd if=/dev/zero bs=1M count=2000 of=file1.img
>>>> dd if=/dev/zero bs=1M count=2000 of=file2.img
>>>> dd if=/dev/zero bs=1M count=2000 of=file4.img
>>>> losetup -fP --show file0.img
>>>> losetup -fP --show file1.img
>>>> losetup -fP --show file2.img
>>>> losetup -fP --show file3.img
>>
>> above dd creat file4, here is file3.
> 
> yeah,this is my spelling mistake, I created 4 files, file0/1/2/3
> 
>>
>>>> pvcreate -y  /dev/loop0 /dev/loop1 /dev/loop2 /dev/loop3
>>>> vgcreate  black_bird  /dev/loop0 /dev/loop1 /dev/loop2 /dev/loop3
>>>> lvcreate --type raid1 -m 3 -n non_synced_primary_raid_3legs_1   -L 1G
>>>> black_bird        /dev/loop0:0-300     /dev/loop1:0-300
>>>> /dev/loop2:0-300  /dev/loop3:0-300
>>
>> I don't understand what /dev/loopx:0-300 means, and I remove them, fix
>> the above file4 typo, test on a xfs filesystem, and I can't reporduce
>> the problem.
>>
> 
> I want to specify the space from disk blocks 0 to 300 of the loop
> device to create raid1,not all space of loop device,
> follow reproducer setps I can reproduced it 100%

Okay, I add the 0-300 and I still can't reporduce it, have no clue yet.
> 
>>>>
>>>>
>>>> console log:
>>>> May 21 21:57:41 dell-per640-04 journal: Create raid1
>>>> May 21 21:57:41 dell-per640-04 kernel: device-mapper: raid:
>>>> Superblocks created for new raid set
>>>> May 21 21:57:42 dell-per640-04 kernel: md/raid1:mdX: not clean --
>>>> starting background reconstruction
>>>> May 21 21:57:42 dell-per640-04 kernel: md/raid1:mdX: active with 4 out
>>>> of 4 mirrors
>>>> May 21 21:57:42 dell-per640-04 kernel: mdX: bitmap file is out of
>>>> date, doing full recovery
>>>> May 21 21:57:42 dell-per640-04 kernel: md: resync of RAID array mdX
>>>> May 21 21:57:42 dell-per640-04 systemd[1]: Started Device-mapper event daemon.
>>>> May 21 21:57:42 dell-per640-04 dmeventd[42170]: dmeventd ready for processing.
>>>> May 21 21:57:42 dell-per640-04 dmeventd[42170]: Monitoring RAID device
>>>> black_bird-non_synced_primary_raid_3legs_1 for events.
>>>> May 21 21:57:45 dell-per640-04 restraintd[1446]: *** Current Time: Tue
>>>> May 21 21:57:45 2024  Localwatchdog at: Tue May 21 22:56:45 2024
>>>> May 21 21:58:45 dell-per640-04 restraintd[1446]: *** Current Time: Tue
>>>> May 21 21:58:45 2024  Localwatchdog at: Tue May 21 22:56:45 2024
>>>> May 21 21:59:45 dell-per640-04 restraintd[1446]: *** Current Time: Tue
>>>> May 21 21:59:45 2024  Localwatchdog at: Tue May 21 22:56:45 2024
>>>> May 21 21:59:53 dell-per640-04 kernel: INFO: task mdX_resync:42168
>>>> blocked for more than 122 seconds.
>>>> May 21 21:59:53 dell-per640-04 kernel:      Not tainted 6.9.0+ #1
>>>> May 21 21:59:53 dell-per640-04 kernel: "echo 0 >
>>>> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>>> May 21 21:59:53 dell-per640-04 kernel: task:mdX_resync      state:D
>>>> stack:0     pid:42168 tgid:42168 ppid:2      flags:0x00004000
>>>> May 21 21:59:53 dell-per640-04 kernel: Call Trace:
>>>> May 21 21:59:53 dell-per640-04 kernel: <TASK>
>>>> May 21 21:59:53 dell-per640-04 kernel: __schedule+0x222/0x670
>>>> May 21 21:59:53 dell-per640-04 kernel: ? blk_mq_flush_plug_list+0x5/0x20
>>>> May 21 21:59:53 dell-per640-04 kernel: schedule+0x2c/0xb0
>>>> May 21 21:59:53 dell-per640-04 kernel: raise_barrier+0x107/0x200 [raid1]
>>
>> Unless this is a deadlock, raise_barrier() should be waiting for normal
>> IO that is issued to underlying disk to return. If you can reporduce the
>> problem, can you check IO from underlying loop disks?
>>
>> cat /sys/block/loopx/inflight
> 
> when this issue was triggered, the log I collected:
> 
> [root@storageqe-103 ~]# cat /sys/block/loop0/inflight
>         0        0
> [root@storageqe-103 ~]# cat /sys/block/loop1/inflight
>         0        0
> [root@storageqe-103 ~]# cat /sys/block/loop2/inflight
>         0        0
> [root@storageqe-103 ~]# cat /sys/block/loop3/inflight
>         0        0
> [root@storageqe-103 ~]#

Thanks for the test, this do look like a deadlock, beside
raise_barrier(), is there any other victim? I can't reporduce this,
and I have no clue yet. The possible next step might be bisect to
locate the blame commit first. Maybe related to dm-raid1.

Thanks,
Kuai

> 
> 
> and the command "lvs" hang on also,
> 
> [root@storageqe-103 ~]# lvs
> ^C  Interrupted...
>    Giving up waiting for lock.
>    Can't get lock for black_bird.
>    Cannot process volume group black_bird
>    LV   VG                 Attr       LSize    Pool Origin Data%  Meta%
>   Move Log Cpy%Sync Convert
>    home rhel_storageqe-103 -wi-ao---- <368.43g
>    root rhel_storageqe-103 -wi-ao----   70.00g
>    swap rhel_storageqe-103 -wi-ao----    7.70g
> [root@storageqe-103 ~]#
> 
> [ 1352.761630] INFO: task mdX_resync:1547 blocked for more than 1105 seconds.
> [ 1352.769336]       Not tainted 6.9.0+ #1
> [ 1352.773629] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [ 1352.782372] task:mdX_resync      state:D stack:0     pid:1547
> tgid:1547  ppid:2      flags:0x00004000
> [ 1352.782380] Call Trace:
> [ 1352.782382]  <TASK>
> [ 1352.782386]  __schedule+0x222/0x670
> [ 1352.782396]  schedule+0x2c/0xb0
> [ 1352.782402]  raise_barrier+0x107/0x200 [raid1]
> [ 1352.782415]  ? __pfx_autoremove_wake_function+0x10/0x10
> [ 1352.782423]  raid1_sync_request+0x12d/0xa50 [raid1]
> [ 1352.782435]  ? prepare_to_wait_event+0x5f/0x190
> [ 1352.782442]  md_do_sync+0x660/0x1040
> [ 1352.782449]  ? __pfx_autoremove_wake_function+0x10/0x10
> [ 1352.782457]  md_thread+0xad/0x160
> [ 1352.782462]  ? __pfx_md_thread+0x10/0x10
> [ 1352.782465]  kthread+0xdc/0x110
> [ 1352.782470]  ? __pfx_kthread+0x10/0x10
> [ 1352.782474]  ret_from_fork+0x2d/0x50
> [ 1352.782481]  ? __pfx_kthread+0x10/0x10
> [ 1352.782485]  ret_from_fork_asm+0x1a/0x30
> [ 1352.782491]  </TASK>
> 
> Thanks,
> Changhui
> 
>>
>> Thanks,
>> Kuai
>>
>>>> May 21 21:59:53 dell-per640-04 kernel: ?
>>>> __pfx_autoremove_wake_function+0x10/0x10
>>>> May 21 21:59:53 dell-per640-04 kernel: raid1_sync_request+0x12d/0xa50 [raid1]
>>>> May 21 21:59:53 dell-per640-04 kernel: ?
>>>> __pfx_raid1_sync_request+0x10/0x10 [raid1]
>>>> May 21 21:59:53 dell-per640-04 kernel: md_do_sync+0x660/0x1040
>>>> May 21 21:59:53 dell-per640-04 kernel: ?
>>>> __pfx_autoremove_wake_function+0x10/0x10
>>>> May 21 21:59:53 dell-per640-04 kernel: md_thread+0xad/0x160
>>>> May 21 21:59:53 dell-per640-04 kernel: ? __pfx_md_thread+0x10/0x10
>>>> May 21 21:59:53 dell-per640-04 kernel: kthread+0xdc/0x110
>>>> May 21 21:59:53 dell-per640-04 kernel: ? __pfx_kthread+0x10/0x10
>>>> May 21 21:59:53 dell-per640-04 kernel: ret_from_fork+0x2d/0x50
>>>> May 21 21:59:53 dell-per640-04 kernel: ? __pfx_kthread+0x10/0x10
>>>> May 21 21:59:53 dell-per640-04 kernel: ret_from_fork_asm+0x1a/0x30
>>>> May 21 21:59:53 dell-per640-04 kernel: </TASK>
>>>>
>>>>
>>>> --
>>>> Best Regards,
>>>>        Changhui
>>>>
>>>
>>
>>
> 
> .
> 


  reply	other threads:[~2024-05-17  2:49 UTC|newest]

Thread overview: 18+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-05-16 10:24 [bug report] INFO: task mdX_resync:42168 blocked for more than 122 seconds Changhui Zhong
2024-05-16 11:21 ` Ming Lei
2024-05-16 11:42   ` Yu Kuai
2024-05-17  2:25     ` Changhui Zhong
2024-05-17  2:49       ` Yu Kuai [this message]
2024-05-19  6:44         ` Changhui Zhong
2024-05-20  0:39           ` Changhui Zhong
2024-05-20  2:55             ` Yu Kuai
2024-05-20  7:27               ` Yu Kuai
2024-05-20 10:47                 ` Changhui Zhong
2024-05-20 10:38               ` Changhui Zhong
2024-05-21  1:08                 ` Yu Kuai
2024-05-21  4:28                   ` Changhui Zhong
2024-05-21  9:17                     ` Yu Kuai
2024-05-21 11:39                       ` Changhui Zhong
2024-05-21 20:03                         ` [PATCH -next] block: fix bio lost for plug enabeld bio based device Yu Kuai
2024-05-22  0:38                           ` Changhui Zhong
2024-05-22  1:38                           ` Jens Axboe

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1b35a177-670a-4d2f-0b68-6eda769af37d@huaweicloud.com \
    --to=yukuai1@huaweicloud.com \
    --cc=czhong@redhat.com \
    --cc=dm-devel@lists.linux.dev \
    --cc=linux-block@vger.kernel.org \
    --cc=linux-raid@vger.kernel.org \
    --cc=ming.lei@redhat.com \
    --cc=mpatocka@redhat.com \
    --cc=snitzer@kernel.org \
    --cc=song@kernel.org \
    --cc=xni@redhat.com \
    --cc=yukuai3@huawei.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.