All the mail mirrored from lore.kernel.org
 help / color / mirror / Atom feed
From: Dave Chinner <david@fromorbit.com>
To: "Darrick J. Wong" <djwong@kernel.org>
Cc: linux-xfs@vger.kernel.org
Subject: Re: [PATCH 8/8] xfs: order CIL checkpoint start records
Date: Fri, 18 Jun 2021 08:49:12 +1000	[thread overview]
Message-ID: <20210617224912.GG664593@dread.disaster.area> (raw)
In-Reply-To: <20210617213143.GF158232@locust>

On Thu, Jun 17, 2021 at 02:31:43PM -0700, Darrick J. Wong wrote:
> On Thu, Jun 17, 2021 at 06:26:17PM +1000, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > Because log recovery depends on strictly ordered start records as
> > well as strictly ordered commit records.
> > 
> > This is a zero day bug in the way XFS writes pipelined transactions
> > to the journal which is exposed by commit facd77e4e38b ("xfs: CIL
> > work is serialised, not pipelined") which re-introduces explicit
> > concurrent commits back into the on-disk journal.
> > 
> > The XFS journal commit code has never ordered start records and we
> > have relied on strict commit record ordering for correct recovery
> > ordering of concurrently written transactions. Unfortunately, root
> > cause analysis uncovered the fact that log recovery uses the LSN of
> > the start record for transaction commit processing. Hence the
> > commits are processed in strict orderi by recovery, but the LSNs
> 
> s/orderi/order/ ?
> 
> > associated with the commits can be out of order and so recovery may
> > stamp incorrect LSNs into objects and/or misorder intents in the AIL
> > for later processing. This can result in log recovery failures
> > and/or on disk corruption, sometimes silent.
> > 
> > Because this is a long standing log recovery issue, we can't just
> > fix log recovery and call it good.
> 
> Could there be production filesystems out there that have this
> mismatched ordering of start lsn and commit lsn?  This still leaves the
> mystery of crashed customer filesystems containing btree blocks where
> 128 bytes in the middle clearly contain contents that are don't match or
> duplicate the rest of the block, as though someone forgot to replay a
> buffer vector or something.

Modulo bugs in delayed logging, I doubt there's any delayed logging
filesystems out there that have the problem. Older, non-delayed
logging filesystems are almost certain to see it, but they have much
smaller transactions and only EFIs to deal with so the corruption
risk is much, much, much lower.

> What would a fix to log recovery entail?  Not skipping recovered items
> if the start/commit sequencing is not the same?  Or am I not
> understanding the problem correctly?

I've been going back and forth on this trying to come up with a sane
solution, but I haven't come up with anything practical.

We could use the commit record LSN for recovery, but we write start
record LSNs into on-disk metadata when we flush it to disk and that
forces checkpoints that need recovery to use the same LSN in the
metadata it recovers and writes back as we use for runtime
writeback. Hence we then get problems with recovered filesystems not
having the same on-disk state as they would if the metadata was
written back from in-memory. i.e. two pieces of metadata in the same
atomic transaction could have different LSNs stamped in them
depending on whether they were written back at runtime or recovered
by log recovery at mount time...

And then my head explodes trying to work out what happens when we
have overlapping checkpoints and partial metadata writeback and
different LSN values for recovery vs writeback and recovery retries
after a failed recovery and <BOOM>

However, given that there are runtime integrity issues with out of
order start LSNs (log head can overwrite the log tail - I can give
more detail if you want), the only way out of this I can see is to
ensure that the start records are properly ordered at runtime to
avoid all the potential runtime issues that exist.  This also has
the nice "side effect" of avoiding the log recovery LSN ordering
problem.

IOWs, I'm not looking at this as log recovery bug that needs fixing.
Yes, there is a log recovery issue there (and has been forever), but
the more I think on this, the more I'm concerned about the potential
runtime impacts on data integrity correctness and potential
head-tail journal overwrite corruption. 

> > +	ctx->commit_lsn = lsn;
> > +	wake_up_all(&cil->xc_commit_wait);
> > +	spin_unlock(&cil->xc_push_lock);
> >  }
> >  
> >  /*
> > @@ -834,10 +849,16 @@ xlog_cil_set_ctx_write_state(
> >   * relies on the context LSN being zero until the log write has guaranteed the
> >   * LSN that the log write will start at via xlog_state_get_iclog_space().
> >   */
> > +enum {
> > +	_START_RECORD,
> > +	_COMMIT_RECORD,
> > +};
> 
> Stupid nit: If this enum had a name you could skip the default clause
> below because the compiler would typecheck the usage for you.

OK.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

  reply	other threads:[~2021-06-17 22:49 UTC|newest]

Thread overview: 50+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-06-17  8:26 [PATCH 0/8 V2] xfs: log fixes for for-next Dave Chinner
2021-06-17  8:26 ` [PATCH 1/8] xfs: add iclog state trace events Dave Chinner
2021-06-17 16:45   ` Darrick J. Wong
2021-06-18 14:09   ` Christoph Hellwig
2021-06-17  8:26 ` [PATCH 2/8] xfs: don't wait on future iclogs when pushing the CIL Dave Chinner
2021-06-17 17:49   ` Darrick J. Wong
2021-06-17 21:55     ` Dave Chinner
2021-06-17  8:26 ` [PATCH 3/8] xfs: move xlog_commit_record to xfs_log_cil.c Dave Chinner
2021-06-17 12:57   ` kernel test robot
2021-06-17 12:57     ` kernel test robot
2021-06-17 17:50   ` Darrick J. Wong
2021-06-17 21:56     ` Dave Chinner
2021-06-18 14:16   ` Christoph Hellwig
2021-06-17  8:26 ` [PATCH 4/8] xfs: pass a CIL context to xlog_write() Dave Chinner
2021-06-17 14:46   ` kernel test robot
2021-06-17 14:46     ` kernel test robot
2021-06-17 20:24   ` Darrick J. Wong
2021-06-17 22:03     ` Dave Chinner
2021-06-17 22:18       ` Darrick J. Wong
2021-06-18 14:23   ` Christoph Hellwig
2021-06-17  8:26 ` [PATCH 5/8] xfs: factor out log write ordering from xlog_cil_push_work() Dave Chinner
2021-06-17 19:59   ` Darrick J. Wong
2021-06-18 14:27     ` Christoph Hellwig
2021-06-18 22:34       ` Dave Chinner
2021-06-17  8:26 ` [PATCH 6/8] xfs: separate out setting CIL context LSNs from xlog_write Dave Chinner
2021-06-17 20:28   ` Darrick J. Wong
2021-06-17 22:10     ` Dave Chinner
2021-06-17  8:26 ` [PATCH 7/8] xfs: attached iclog callbacks in xlog_cil_set_ctx_write_state() Dave Chinner
2021-06-17 20:55   ` Darrick J. Wong
2021-06-17 22:20     ` Dave Chinner
2021-06-17  8:26 ` [PATCH 8/8] xfs: order CIL checkpoint start records Dave Chinner
2021-06-17 21:31   ` Darrick J. Wong
2021-06-17 22:49     ` Dave Chinner [this message]
2021-06-17 18:32 ` [PATCH 0/8 V2] xfs: log fixes for for-next Brian Foster
2021-06-17 19:05   ` Darrick J. Wong
2021-06-17 20:06     ` Brian Foster
2021-06-17 20:26       ` Darrick J. Wong
2021-06-17 23:31         ` Brian Foster
2021-06-17 23:43     ` Dave Chinner
2021-06-18 13:08       ` Brian Foster
2021-06-18 13:55         ` Christoph Hellwig
2021-06-18 14:02           ` Christoph Hellwig
2021-06-18 22:28           ` Dave Chinner
2021-06-18 22:15         ` Dave Chinner
2021-06-18 22:48 ` Dave Chinner
2021-06-19 20:22   ` Darrick J. Wong
2021-06-20 22:18     ` Dave Chinner
  -- strict thread matches above, loose matches on Subject: below --
2021-06-26 23:10 [PATCH 4/8] xfs: pass a CIL context to xlog_write() kernel test robot
2021-06-28  8:58 ` Dan Carpenter
2021-06-28  8:58 ` Dan Carpenter

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20210617224912.GG664593@dread.disaster.area \
    --to=david@fromorbit.com \
    --cc=djwong@kernel.org \
    --cc=linux-xfs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.