LKML Archive mirror
 help / color / mirror / Atom feed
From: Mike Snitzer <snitzer@kernel.org>
To: Dave Chinner <david@fromorbit.com>
Cc: Jens Axboe <axboe@kernel.dk>,
	Christoph Hellwig <hch@infradead.org>,
	Joe Thornber <thornber@redhat.com>,
	Stefan Hajnoczi <stefanha@redhat.com>,
	"Michael S. Tsirkin" <mst@redhat.com>,
	"Darrick J. Wong" <djwong@kernel.org>,
	Jason Wang <jasowang@redhat.com>,
	Bart Van Assche <bvanassche@google.com>,
	linux-kernel@vger.kernel.org, Joe Thornber <ejt@redhat.com>,
	linux-block@vger.kernel.org, dm-devel@redhat.com,
	Andreas Dilger <adilger.kernel@dilger.ca>,
	Sarthak Kukreti <sarthakkukreti@chromium.org>,
	linux-fsdevel@vger.kernel.org, Theodore Ts'o <tytso@mit.edu>,
	linux-ext4@vger.kernel.org, Brian Foster <bfoster@redhat.com>,
	Alasdair Kergon <agk@redhat.com>
Subject: Re: [PATCH v7 0/5] Introduce provisioning primitives
Date: Wed, 7 Jun 2023 19:50:25 -0400	[thread overview]
Message-ID: <ZIEXwTd17z0iYW4s@redhat.com> (raw)
In-Reply-To: <ZH/k9ss2Cg9HYrEV@dread.disaster.area>

On Tue, Jun 06 2023 at 10:01P -0400,
Dave Chinner <david@fromorbit.com> wrote:

> On Sat, Jun 03, 2023 at 11:57:48AM -0400, Mike Snitzer wrote:
> > On Fri, Jun 02 2023 at  8:52P -0400,
> > Dave Chinner <david@fromorbit.com> wrote:
> > 
> > > Mike, I think you might have misunderstood what I have been proposing.
> > > Possibly unintentionally, I didn't call it REQ_OP_PROVISION but
> > > that's what I intended - the operation does not contain data at all.
> > > It's an operation like REQ_OP_DISCARD or REQ_OP_WRITE_ZEROS - it
> > > contains a range of sectors that need to be provisioned (or
> > > discarded), and nothing else.
> > 
> > No, I understood that.
> > 
> > > The write IOs themselves are not tagged with anything special at all.
> > 
> > I know, but I've been looking at how to also handle the delalloc
> > usecase (and yes I know you feel it doesn't need handling, the issue
> > is XFS does deal nicely with ensuring it has space when it tracks its
> > allocations on "thick" storage
> 
> Oh, no it doesn't. It -works for most cases-, but that does not mean
> it provides any guarantees at all. We can still get ENOSPC for user
> data when delayed allocation reservations "run out".
> 
> This may be news to you, but the ephemeral XFS delayed allocation
> space reservation is not accurate. It contains a "fudge factor"
> called "indirect length". This is a "wet finger in the wind"
> estimation of how much new metadata will need to be allocated to
> index the physical allocations when they are made. It assumes large
> data extents are allocated, which is good enough for most cases, but
> it is no guarantee when there are no large data extents available to
> allocate (e.g. near ENOSPC!).
> 
> And therein lies the fundamental problem with ephemeral range
> reservations: at the time of reservation, we don't know how many
> individual physical LBA ranges the reserved data range is actually
> going to span.
> 
> As a result, XFS delalloc reservations are a "close-but-not-quite"
> reservation backed by a global reserve pool that can be dipped into
> if we run out of delalloc reservation. If the reserve pool is then
> fully depleted before all delalloc conversion completes, we'll still
> give ENOSPC. The pool is sized such that the vast majority of
> workloads will complete delalloc conversion successfully before the
> pool is depleted.
> 
> Hence XFS gives everyone the -appearance- that it deals nicely with
> ENOSPC conditions, but it never provides a -guarantee- that any
> accepted write will always succeed without ENOSPC.
> 
> IMO, using this "close-but-not-quite" reservation as the basis of
> space requirements for other layers to provide "won't ENOSPC"
> guarantees is fraught with problems. We already know that it is
> insufficient in important corner cases at the filesystem level, and
> we also know that lower layers trying to do ephemeral space
> reservations will have exactly the same problems providing a
> guarantee. And these are problems we've been unable to engineer
> around in the past, so the likelihood we can engineer around them
> now or in the future is also very unlikely.

Thanks for clarifying. Wasn't aware of XFS delalloc's "wet finger in
the air" ;)

So do you think it reasonable to require applications to fallocate
their data files? Unaware if users are aware to take that extra step.

> > -- so adding coordination between XFS
> > and dm-thin layers provides comparable safety.. that safety is an
> > expected norm).
> >
> > But rather than discuss in terms of data vs metadata, the distinction
> > is:
> > 1) LBA range reservation (normal case, your proposal)
> > 2) non-LBA reservation (absolute value, LBA range is known at later stage)
> > 
> > But I'm clearly going off script for dwelling on wanting to handle
> > both.
> 
> Right, because if we do 1) then we don't need 2). :)

Sure.

> > My looking at (ab)using REQ_META being set (use 1) vs not (use 2) was
> > a crude simplification for branching between the 2 approaches.
> > 
> > And I understand I made you nervous by expanding the scope to a much
> > more muddled/shitty interface. ;)
> 
> Nervous? No, I'm simply trying to make sure that everyone is on the
> same page. i.e. that if we water down the guarantee that 1) relies
> on, then it's not actually useful to filesystems at all.

Yeah, makes sense.
 
> > > Put simply: if we restrict REQ_OP_PROVISION guarantees to just
> > > REQ_META writes (or any other specific type of write operation) then
> > > it's simply not worth persuing at the filesystem level because the
> > > guarantees we actually need just aren't there and the complexity of
> > > discovering and handling those corner cases just isn't worth the
> > > effort.
> > 
> > Here is where I get to say: I think you misunderstood me (but it was
> > my fault for not being absolutely clear: I'm very much on the same
> > page as you and Joe; and your visions need to just be implemented
> > ASAP).
> 
> OK, good that we've clarified the misunderstandings on both sides
> quickly :)

Do you think you're OK to scope out, and/or implement, the XFS changes
if you use v7 of this patchset as the starting point? (v8 should just
be v7 minus the dm-thin.c and dm-snap.c changes).  The thinp
support in v7 will work enough to allow XFS to issue REQ_OP_PROVISION
and/or fallocate (via mkfs.xfs) to dm-thin devices.

And Joe and I can make independent progress on the dm-thin.c changes
needed to ensure the REQ_OP_PROVISION gaurantee you need.

Thanks,
Mike

  reply	other threads:[~2023-06-07 23:51 UTC|newest]

Thread overview: 52+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-05-18 22:33 [PATCH v7 0/5] Introduce provisioning primitives Sarthak Kukreti
2023-05-18 22:33 ` [PATCH v7 1/5] block: Don't invalidate pagecache for invalid falloc modes Sarthak Kukreti
2023-05-19  4:09   ` Christoph Hellwig
2023-05-19 15:17   ` Darrick J. Wong
2023-05-18 22:33 ` [PATCH v7 2/5] block: Introduce provisioning primitives Sarthak Kukreti
2023-05-19  4:18   ` Christoph Hellwig
2023-06-09 20:00   ` Mike Snitzer
2023-05-18 22:33 ` [PATCH v7 3/5] dm: Add block provisioning support Sarthak Kukreti
2023-05-18 22:33 ` [PATCH v7 4/5] dm-thin: Add REQ_OP_PROVISION support Sarthak Kukreti
2023-05-19 15:23   ` Mike Snitzer
2023-06-08 21:24     ` Mike Snitzer
2023-06-09  0:28       ` Mike Snitzer
2023-05-18 22:33 ` [PATCH v7 5/5] loop: Add support for provision requests Sarthak Kukreti
2023-05-22 16:37   ` [dm-devel] " Darrick J. Wong
2023-05-22 22:09     ` Sarthak Kukreti
2023-05-23  1:22       ` Darrick J. Wong
2023-10-07  1:29         ` Sarthak Kukreti
2023-05-19  4:09 ` [PATCH v7 0/5] Introduce provisioning primitives Christoph Hellwig
2023-05-19 14:41   ` Mike Snitzer
2023-05-19 23:07     ` Dave Chinner
2023-05-22 18:27       ` Mike Snitzer
2023-05-23 14:05         ` Brian Foster
2023-05-23 15:26           ` Mike Snitzer
2023-05-24  0:40             ` Dave Chinner
2023-05-24 20:02               ` Mike Snitzer
2023-05-25 11:39                 ` Dave Chinner
2023-05-25 16:00                   ` Mike Snitzer
2023-05-25 22:47                     ` Sarthak Kukreti
2023-05-26  1:36                       ` Dave Chinner
2023-05-26  2:35                         ` Sarthak Kukreti
2023-05-26 15:56                           ` Brian Foster
2023-05-25 16:19               ` Brian Foster
2023-05-26  9:37                 ` Dave Chinner
2023-05-26 15:47                   ` Brian Foster
     [not found]                   ` <CAJ0trDbspRaDKzTzTjFdPHdB9n0Q9unfu1cEk8giTWoNu3jP8g@mail.gmail.com>
2023-05-26 23:45                     ` Dave Chinner
     [not found]                       ` <CAJ0trDZJQwvAzngZLBJ1hB0XkQ1HRHQOdNQNTw9nK-U5i-0bLA@mail.gmail.com>
2023-05-30 14:02                         ` Mike Snitzer
     [not found]                           ` <CAJ0trDaUOevfiEpXasOESrLHTCcr=oz28ywJU+s+YOiuh7iWow@mail.gmail.com>
2023-05-30 15:28                             ` Mike Snitzer
2023-06-02 18:44                               ` Sarthak Kukreti
2023-06-02 21:50                                 ` Mike Snitzer
2023-06-03  0:52                                 ` Dave Chinner
2023-06-03 15:57                                   ` Mike Snitzer
2023-06-05 21:14                                     ` Sarthak Kukreti
2023-06-07  2:15                                       ` Dave Chinner
2023-06-07 23:27                                       ` Mike Snitzer
2023-06-09 20:31                                         ` Mike Snitzer
2023-06-09 21:54                                           ` Dave Chinner
2023-10-07  1:30                                           ` Sarthak Kukreti
2023-06-07  2:01                                     ` Dave Chinner
2023-06-07 23:50                                       ` Mike Snitzer [this message]
2023-06-09  3:32                                         ` Dave Chinner
2023-06-08  2:03                                   ` Martin K. Petersen
2023-06-09  0:10                                     ` Dave Chinner

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ZIEXwTd17z0iYW4s@redhat.com \
    --to=snitzer@kernel.org \
    --cc=adilger.kernel@dilger.ca \
    --cc=agk@redhat.com \
    --cc=axboe@kernel.dk \
    --cc=bfoster@redhat.com \
    --cc=bvanassche@google.com \
    --cc=david@fromorbit.com \
    --cc=djwong@kernel.org \
    --cc=dm-devel@redhat.com \
    --cc=ejt@redhat.com \
    --cc=hch@infradead.org \
    --cc=jasowang@redhat.com \
    --cc=linux-block@vger.kernel.org \
    --cc=linux-ext4@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mst@redhat.com \
    --cc=sarthakkukreti@chromium.org \
    --cc=stefanha@redhat.com \
    --cc=thornber@redhat.com \
    --cc=tytso@mit.edu \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).