All the mail mirrored from lore.kernel.org
 help / color / mirror / Atom feed
From: Dave Chinner <david@fromorbit.com>
To: Brian Foster <bfoster@redhat.com>
Cc: "Darrick J. Wong" <djwong@kernel.org>, linux-xfs@vger.kernel.org
Subject: Re: [PATCH 0/8 V2] xfs: log fixes for for-next
Date: Sat, 19 Jun 2021 08:15:01 +1000	[thread overview]
Message-ID: <20210618221501.GJ664593@dread.disaster.area> (raw)
In-Reply-To: <YMyav1+JiSlQbDFH@bfoster>

On Fri, Jun 18, 2021 at 09:08:15AM -0400, Brian Foster wrote:
> On Fri, Jun 18, 2021 at 09:43:08AM +1000, Dave Chinner wrote:
> > On Thu, Jun 17, 2021 at 12:05:19PM -0700, Darrick J. Wong wrote:
> > > On Thu, Jun 17, 2021 at 02:32:30PM -0400, Brian Foster wrote:
> > > > On Thu, Jun 17, 2021 at 06:26:09PM +1000, Dave Chinner wrote:
> > > > > Hi folks,
> > > > > 
> > > > > This is followup from the first set of log fixes for for-next that
> > > > > were posted here:
> > > > > 
> > > > > https://lore.kernel.org/linux-xfs/20210615175719.GD158209@locust/T/#mde2cf0bb7d2ac369815a7e9371f0303efc89f51b
> > > > > 
> > > > > The first two patches of this series are updates for those patches,
> > > > > change log below. The rest is the fix for the bigger issue we
> > > > > uncovered in investigating the generic/019 failures, being that
> > > > > we're triggering a zero-day bug in the way log recovery assigns LSNs
> > > > > to checkpoints.
> > > > > 
> > > > > The "simple" fix of using the same ordering code as the commit
> > > > > record for the start records in the CIL push turned into a lot of
> > > > > patches once I started cleaning it up, separating out all the
> > > > > different bits and finally realising all the things I needed to
> > > > > change to avoid unintentional logic/behavioural changes. Hence
> > > > > there's some code movement, some factoring, API changes to
> > > > > xlog_write(), changing where we attach callbacks to commit iclogs so
> > > > > they remain correctly ordered if there are multiple commit records
> > > > > in the one iclog and then, finally, strictly ordering the start
> > > > > records....
> > > > > 
> > > > > The original "simple fix" I tested last night ran almost a thousand
> > > > > cycles of generic/019 without a log hang or recovery failure of any
> > > > > kind. The refactored patchset has run a couple hundred cycles of
> > > > > g/019 and g/475 over the last few hours without a failure, so I'm
> > > > > posting this so we can get a review iteration done while I sleep so
> > > > > we can - hopefully - get this sorted out before the end of the week.
> > > > > 
> > > > 
> > > > My first spin of this included generic/019 and generic/475, ran for 18
> > > > or so iterations and 475 exploded with a stream of asserts followed by a
> > > > NULL pointer crash:
> > > > 
> > > > # grep -e Assertion -e BUG dmesg.out
> > > > ...
> > > > [ 7951.878058] XFS: Assertion failed: atomic_read(&buip->bui_refcount) > 0, file: fs/xfs/xfs_bmap_item.c, line: 57
> > > > [ 7952.261251] XFS: Assertion failed: atomic_read(&buip->bui_refcount) > 0, file: fs/xfs/xfs_bmap_item.c, line: 57
> > > > [ 7952.644444] XFS: Assertion failed: atomic_read(&buip->bui_refcount) > 0, file: fs/xfs/xfs_bmap_item.c, line: 57
> > > > [ 7953.027626] XFS: Assertion failed: atomic_read(&buip->bui_refcount) > 0, file: fs/xfs/xfs_bmap_item.c, line: 57
> > > > [ 7953.410804] BUG: kernel NULL pointer dereference, address: 000000000000031f
> > > > [ 7954.118973] BUG: unable to handle page fault for address: ffffa57ccf99fa98
> > > > 
> > > > I don't know if this is a regression, but I've not seen it before. I've
> > > > attempted to spin generic/475 since then to see if it reproduces again,
> > > > but so far I'm only running into some of the preexisting issues
> > > > associated with that test.
> > 
> > I've not seen anything like that. I can't see how the changes in the
> > patchset would affect BUI reference counting in any way. That seems
> > more like an underlying intent item shutdown reference count issue
> > to me (and we've had a *lot* of them in the past)....
> > 
> 
> I've not made sense of it either, but at the same time, I've not seen it
> in all my testing thus far up until targeting this series, and now I've
> seen it twice in as many test runs as my overnight run fell into some
> kind of similar haywire state. Unfortunately it seemed to be
> spinning/streaming assert output so I lost any record of the initial
> crash signature. It wouldn't surprise me if the fundamental problem is
> some older bug in another area of code, but it's hard to believe it's
> not at least related to this series somehow.
> 
> Also FYI, earlier iterations of generic/475 triggered a couple instances
> of the following assert failure before things broke down more severely:
> 
>  XFS: Assertion failed: *log_offset + *len <= iclog->ic_size || iclog->ic_state == XLOG_STATE_WANT_SYNC, file: fs/xfs/xfs_log.c, line: 2115

Yup, that's a bogus state check in the asssert. I already have a
patch to fix that - the async shutdown can change the iclog state
to XLOG_STATE_IOERROR at any time, so any iclog state assert outside of
the log->l_icloglock needs also to allow for XLOG_STATE_IOERROR as
a valid state.

This is one of the problems I was alluding to on #xfs when I said:

[18/6/21 14:42] <dchinner> I'm really not liking getting repeatedly
caught out by racing, unreferenced iclog state changes during
shutdown and having to handle them everywhere.

Patch, FYI, below.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

xfs: fix incorrect assert in xlog_write_single

From: Dave Chinner <dchinner@redhat.com>

generic/475 failed with this assert after a log shutdown:

[ 3953.166235] XFS: Assertion failed: *log_offset + *len <= iclog->ic_size || iclog->ic_state == XLOG_STATE_WANT_SYNC, file: fs/xfs/xfs_log.c, line: 2115

The problem is that after the log has shut down, the iclog state is
XLOG_STATE_IOERROR. The shutdown can change the iclog state at any
time while we are writing to it, so we need to add IOERROR to the
valid states here.

Note that we already have similar IOERROR state checks in asserts
in the xlog_write() code for this reason (e.g. in
xlog_write_get_more_iclog_space()) so this is just a case where the
IOERROR state check was missed. The IOERROR state will be processed
when we release the iclog, so just add the state into the assert and
let the iclog release code handle the error.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_log.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index 94b6bccb9de9..221c080df305 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -2113,7 +2113,8 @@ xlog_write_single(
 	int			index;
 
 	ASSERT(*log_offset + *len <= iclog->ic_size ||
-		iclog->ic_state == XLOG_STATE_WANT_SYNC);
+		iclog->ic_state == XLOG_STATE_WANT_SYNC ||
+		iclog->ic_state == XLOG_STATE_IOERROR);
 
 	ptr = iclog->ic_datap + *log_offset;
 	for (lv = log_vector;

  parent reply	other threads:[~2021-06-18 22:15 UTC|newest]

Thread overview: 50+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-06-17  8:26 [PATCH 0/8 V2] xfs: log fixes for for-next Dave Chinner
2021-06-17  8:26 ` [PATCH 1/8] xfs: add iclog state trace events Dave Chinner
2021-06-17 16:45   ` Darrick J. Wong
2021-06-18 14:09   ` Christoph Hellwig
2021-06-17  8:26 ` [PATCH 2/8] xfs: don't wait on future iclogs when pushing the CIL Dave Chinner
2021-06-17 17:49   ` Darrick J. Wong
2021-06-17 21:55     ` Dave Chinner
2021-06-17  8:26 ` [PATCH 3/8] xfs: move xlog_commit_record to xfs_log_cil.c Dave Chinner
2021-06-17 12:57   ` kernel test robot
2021-06-17 12:57     ` kernel test robot
2021-06-17 17:50   ` Darrick J. Wong
2021-06-17 21:56     ` Dave Chinner
2021-06-18 14:16   ` Christoph Hellwig
2021-06-17  8:26 ` [PATCH 4/8] xfs: pass a CIL context to xlog_write() Dave Chinner
2021-06-17 14:46   ` kernel test robot
2021-06-17 14:46     ` kernel test robot
2021-06-17 20:24   ` Darrick J. Wong
2021-06-17 22:03     ` Dave Chinner
2021-06-17 22:18       ` Darrick J. Wong
2021-06-18 14:23   ` Christoph Hellwig
2021-06-17  8:26 ` [PATCH 5/8] xfs: factor out log write ordering from xlog_cil_push_work() Dave Chinner
2021-06-17 19:59   ` Darrick J. Wong
2021-06-18 14:27     ` Christoph Hellwig
2021-06-18 22:34       ` Dave Chinner
2021-06-17  8:26 ` [PATCH 6/8] xfs: separate out setting CIL context LSNs from xlog_write Dave Chinner
2021-06-17 20:28   ` Darrick J. Wong
2021-06-17 22:10     ` Dave Chinner
2021-06-17  8:26 ` [PATCH 7/8] xfs: attached iclog callbacks in xlog_cil_set_ctx_write_state() Dave Chinner
2021-06-17 20:55   ` Darrick J. Wong
2021-06-17 22:20     ` Dave Chinner
2021-06-17  8:26 ` [PATCH 8/8] xfs: order CIL checkpoint start records Dave Chinner
2021-06-17 21:31   ` Darrick J. Wong
2021-06-17 22:49     ` Dave Chinner
2021-06-17 18:32 ` [PATCH 0/8 V2] xfs: log fixes for for-next Brian Foster
2021-06-17 19:05   ` Darrick J. Wong
2021-06-17 20:06     ` Brian Foster
2021-06-17 20:26       ` Darrick J. Wong
2021-06-17 23:31         ` Brian Foster
2021-06-17 23:43     ` Dave Chinner
2021-06-18 13:08       ` Brian Foster
2021-06-18 13:55         ` Christoph Hellwig
2021-06-18 14:02           ` Christoph Hellwig
2021-06-18 22:28           ` Dave Chinner
2021-06-18 22:15         ` Dave Chinner [this message]
2021-06-18 22:48 ` Dave Chinner
2021-06-19 20:22   ` Darrick J. Wong
2021-06-20 22:18     ` Dave Chinner
  -- strict thread matches above, loose matches on Subject: below --
2021-06-26 23:10 [PATCH 4/8] xfs: pass a CIL context to xlog_write() kernel test robot
2021-06-28  8:58 ` Dan Carpenter
2021-06-28  8:58 ` Dan Carpenter

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20210618221501.GJ664593@dread.disaster.area \
    --to=david@fromorbit.com \
    --cc=bfoster@redhat.com \
    --cc=djwong@kernel.org \
    --cc=linux-xfs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.