Re: [bug report] INFO: task mdX_resync:42168 blocked for more than 122 seconds

From: Yu Kuai <yukuai1@huaweicloud.com>
To: Yu Kuai <yukuai1@huaweicloud.com>, Changhui Zhong <czhong@redhat.com>
Cc: Ming Lei <ming.lei@redhat.com>,
	Linux Block Devices <linux-block@vger.kernel.org>,
	dm-devel@lists.linux.dev, Mike Snitzer <snitzer@kernel.org>,
	Mikulas Patocka <mpatocka@redhat.com>, Song Liu <song@kernel.org>,
	linux-raid@vger.kernel.org, Xiao Ni <xni@redhat.com>,
	"yukuai (C)" <yukuai3@huawei.com>
Subject: Re: [bug report] INFO: task mdX_resync:42168 blocked for more than 122 seconds
Date: Mon, 20 May 2024 15:27:08 +0800
Message-ID: <ca29a4b1-4b4a-3b1c-4981-6e05e0bb24be@huaweicloud.com>
In-Reply-To: <f1c98dd1-a62c-6857-3773-e05b80e6a763@huaweicloud.com>

Hi,

On 2024/05/20 10:55, Yu Kuai wrote:
> Hi, Changhui
> 
> On 2024/05/20 8:39, Changhui Zhong wrote:
>> [czhong@vm linux-block]$ git bisect bad
>> 060406c61c7cb4bbd82a02d179decca9c9bb3443 is the first bad commit
>> commit 060406c61c7cb4bbd82a02d179decca9c9bb3443
>> Author: Yu Kuai <yukuai3@huawei.com>
>> Date:   Thu May 9 20:38:25 2024 +0800
>>
>>      block: add plug while submitting IO
>>
>>      So that if caller didn't use plug, for example, __blkdev_direct_IO_simple()
>>      and __blkdev_direct_IO_async(), block layer can still benefit from caching
>>      nsec time in the plug.
>>
>>      Signed-off-by: Yu Kuai <yukuai3@huawei.com>
>>      Link: https://lore.kernel.org/r/20240509123825.3225207-1-yukuai1@huaweicloud.com
>>      Signed-off-by: Jens Axboe <axboe@kernel.dk>
>>
>>   block/blk-core.c | 6 ++++++
>>   1 file changed, 6 insertions(+)
> 
> Thanks for the test!
> 
> I was surprised to see this commit blamed. After taking a look at the
> raid1 barrier code, I found that there are some known problems that were
> fixed in raid10 but are still unfixed in raid1. So I wonder whether this
> patch is just making an existing problem easier to reproduce.
> 
> I'll start cooking patches to sync the raid10 fixes to raid1. Meanwhile,
> can you change your script to test raid10 as well? If raid10 is fine,
> I'll give you these patches later to test raid1.

Hi,

Sorry to ask, but since I can't reproduce the problem, and code review
shows multiple potential problems, can you also try to reproduce the
problem with the following debug patch (it just adds some debug info, no
functional changes)? That way I can confirm the details of the problem.

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 113135e7b5f2..b35b847a9e8b 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -936,6 +936,45 @@ static void flush_pending_writes(struct r1conf *conf)
                 spin_unlock_irq(&conf->device_lock);
  }

+static bool waiting_barrier(struct r1conf *conf, int idx)
+{
+       int nr = atomic_read(&conf->nr_waiting[idx]);
+
+       if (nr) {
+               printk("%s: idx %d nr_waiting %d\n", __func__, idx, nr);
+               return true;
+       }
+
+       return false;
+}
+
+static bool waiting_pending(struct r1conf *conf, int idx)
+{
+       int nr;
+
+       if (test_bit(MD_RECOVERY_INTR, &conf->mddev->recovery))
+               return false;
+
+       if (conf->array_frozen) {
+               printk("%s: array is frozen\n", __func__);
+               return true;
+       }
+
+       nr = atomic_read(&conf->nr_pending[idx]);
+       if (nr) {
+               printk("%s: idx %d nr_pending %d\n", __func__, idx, nr);
+               return true;
+       }
+
+       nr = atomic_read(&conf->barrier[idx]);
+       if (nr >= RESYNC_DEPTH) {
+               printk("%s: idx %d barrier %d exceeds %d\n", __func__, idx, nr, RESYNC_DEPTH);
+               return true;
+       }
+
+       return false;
+}
+
  /* Barriers....
   * Sometimes we need to suspend IO while we do something else,
   * either some resync/recovery, or reconfigure the array.
@@ -967,8 +1006,7 @@ static int raise_barrier(struct r1conf *conf, sector_t sector_nr)
         spin_lock_irq(&conf->resync_lock);

         /* Wait until no block IO is waiting */
-       wait_event_lock_irq(conf->wait_barrier,
-                           !atomic_read(&conf->nr_waiting[idx]),
+       wait_event_lock_irq(conf->wait_barrier, !waiting_barrier(conf, idx),
                             conf->resync_lock);

         /* block any new IO from starting */
@@ -990,11 +1028,7 @@ static int raise_barrier(struct r1conf *conf, sector_t sector_nr)
          * C: while conf->barrier[idx] >= RESYNC_DEPTH, meaning reaches
          *    max resync count which allowed on current I/O barrier bucket.
          */
-       wait_event_lock_irq(conf->wait_barrier,
-                           (!conf->array_frozen &&
-                            !atomic_read(&conf->nr_pending[idx]) &&
-                            atomic_read(&conf->barrier[idx]) < RESYNC_DEPTH) ||
-                               test_bit(MD_RECOVERY_INTR, &conf->mddev->recovery),
+       wait_event_lock_irq(conf->wait_barrier, !waiting_pending(conf, idx),
                             conf->resync_lock);

         if (test_bit(MD_RECOVERY_INTR, &conf->mddev->recovery)) {
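
For anyone who wants to double-check the "no functional changes" claim for the second hunk, here is a minimal userspace sketch, not kernel code. It models the original wait condition and the new waiting_pending() helper as plain predicates and compares them over a small set of states. The struct barrier_state type and the function names ending in _model or _condition are hypothetical stand-ins introduced only for this illustration, the printk() calls are dropped, and RESYNC_DEPTH is assumed to be 32 (only the comparison against it matters here).

```c
#include <assert.h>
#include <stdbool.h>

/* Assumption: matches the md raid1 value; only the comparison matters. */
#define RESYNC_DEPTH 32

/* Hypothetical stand-in for the r1conf fields that the condition reads. */
struct barrier_state {
	bool interrupted;  /* models test_bit(MD_RECOVERY_INTR, ...) */
	bool array_frozen; /* models conf->array_frozen */
	int nr_pending;    /* models atomic_read(&conf->nr_pending[idx]) */
	int barrier;       /* models atomic_read(&conf->barrier[idx]) */
};

/* Condition as written before the debug patch: true means "stop waiting". */
bool original_wake_condition(const struct barrier_state *s)
{
	return (!s->array_frozen && !s->nr_pending &&
		s->barrier < RESYNC_DEPTH) || s->interrupted;
}

/* Helper from the debug patch, printk() removed: true means "keep waiting". */
bool waiting_pending_model(const struct barrier_state *s)
{
	if (s->interrupted)
		return false;
	if (s->array_frozen)
		return true;
	if (s->nr_pending)
		return true;
	if (s->barrier >= RESYNC_DEPTH)
		return true;
	return false;
}

/* Count sampled states where the old condition and !helper disagree. */
int count_mismatches(void)
{
	int mismatches = 0;

	for (int intr = 0; intr <= 1; intr++)
		for (int frozen = 0; frozen <= 1; frozen++)
			for (int pending = 0; pending <= 2; pending++)
				for (int b = 0; b <= RESYNC_DEPTH + 1; b++) {
					struct barrier_state s = {
						.interrupted = intr,
						.array_frozen = frozen,
						.nr_pending = pending,
						.barrier = b,
					};
					if (original_wake_condition(&s) !=
					    !waiting_pending_model(&s))
						mismatches++;
				}
	return mismatches;
}

/* Aborts if the two forms ever disagree on the sampled states. */
void check_equivalence(void)
{
	assert(count_mismatches() == 0);
}
```

Calling check_equivalence() from a small main() should return silently if the refactoring is a plain De Morgan rewrite of the old condition. The sketch only compares boolean results; it says nothing about locking, the timing of the atomic reads, or the cost of the added printk() calls in the real wait loop.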

Thanks,
Kuai

> 
> Thanks,
> Kuai
> 
> .
>

Thread overview: 18+ messages
2024-05-16 10:24 [bug report] INFO: task mdX_resync:42168 blocked for more than 122 seconds Changhui Zhong
2024-05-16 11:21 ` Ming Lei
2024-05-16 11:42   ` Yu Kuai
2024-05-17  2:25     ` Changhui Zhong
2024-05-17  2:49       ` Yu Kuai
2024-05-19  6:44         ` Changhui Zhong
2024-05-20  0:39           ` Changhui Zhong
2024-05-20  2:55             ` Yu Kuai
2024-05-20  7:27               ` Yu Kuai [this message]
2024-05-20 10:47                 ` Changhui Zhong
2024-05-20 10:38               ` Changhui Zhong
2024-05-21  1:08                 ` Yu Kuai
2024-05-21  4:28                   ` Changhui Zhong
2024-05-21  9:17                     ` Yu Kuai
2024-05-21 11:39                       ` Changhui Zhong
2024-05-21 20:03                         ` [PATCH -next] block: fix bio lost for plug enabeld bio based device Yu Kuai
2024-05-22  0:38                           ` Changhui Zhong
2024-05-22  1:38                           ` Jens Axboe
