From: Mike Snitzer <snitzer@kernel.org>
To: Dave Chinner <david@fromorbit.com>
Cc: Jens Axboe <axboe@kernel.dk>,
Christoph Hellwig <hch@infradead.org>,
Joe Thornber <thornber@redhat.com>,
Stefan Hajnoczi <stefanha@redhat.com>,
"Michael S. Tsirkin" <mst@redhat.com>,
"Darrick J. Wong" <djwong@kernel.org>,
Jason Wang <jasowang@redhat.com>,
Bart Van Assche <bvanassche@google.com>,
linux-kernel@vger.kernel.org, Joe Thornber <ejt@redhat.com>,
linux-block@vger.kernel.org, dm-devel@redhat.com,
Andreas Dilger <adilger.kernel@dilger.ca>,
Sarthak Kukreti <sarthakkukreti@chromium.org>,
linux-fsdevel@vger.kernel.org, Theodore Ts'o <tytso@mit.edu>,
linux-ext4@vger.kernel.org, Brian Foster <bfoster@redhat.com>,
Alasdair Kergon <agk@redhat.com>
Subject: Re: [PATCH v7 0/5] Introduce provisioning primitives
Date: Wed, 7 Jun 2023 19:50:25 -0400 [thread overview]
Message-ID: <ZIEXwTd17z0iYW4s@redhat.com> (raw)
In-Reply-To: <ZH/k9ss2Cg9HYrEV@dread.disaster.area>
On Tue, Jun 06 2023 at 10:01P -0400,
Dave Chinner <david@fromorbit.com> wrote:
> On Sat, Jun 03, 2023 at 11:57:48AM -0400, Mike Snitzer wrote:
> > On Fri, Jun 02 2023 at 8:52P -0400,
> > Dave Chinner <david@fromorbit.com> wrote:
> >
> > > Mike, I think you might have misunderstood what I have been proposing.
> > > Possibly unintentionally, I didn't call it REQ_OP_PROVISION but
> > > that's what I intended - the operation does not contain data at all.
> > > It's an operation like REQ_OP_DISCARD or REQ_OP_WRITE_ZEROES - it
> > > contains a range of sectors that need to be provisioned (or
> > > discarded), and nothing else.
> >
> > No, I understood that.
> >
> > > The write IOs themselves are not tagged with anything special at all.
> >
> > I know, but I've been looking at how to also handle the delalloc
> > use case (and yes, I know you feel it doesn't need handling; the
> > issue is that XFS does deal nicely with ensuring it has space when
> > it tracks its allocations on "thick" storage
>
> Oh, no it doesn't. It -works for most cases-, but that does not mean
> it provides any guarantees at all. We can still get ENOSPC for user
> data when delayed allocation reservations "run out".
>
> This may be news to you, but the ephemeral XFS delayed allocation
> space reservation is not accurate. It contains a "fudge factor"
> called "indirect length". This is a "wet finger in the wind"
> estimation of how much new metadata will need to be allocated to
> index the physical allocations when they are made. It assumes large
> data extents are allocated, which is good enough for most cases, but
> it is no guarantee when there are no large data extents available to
> allocate (e.g. near ENOSPC!).
>
> And therein lies the fundamental problem with ephemeral range
> reservations: at the time of reservation, we don't know how many
> individual physical LBA ranges the reserved data range is actually
> going to span.
>
> As a result, XFS delalloc reservations are a "close-but-not-quite"
> reservation backed by a global reserve pool that can be dipped into
> if we run out of delalloc reservation. If the reserve pool is then
> fully depleted before all delalloc conversion completes, we'll still
> give ENOSPC. The pool is sized such that the vast majority of
> workloads will complete delalloc conversion successfully before the
> pool is depleted.
>
> Hence XFS gives everyone the -appearance- that it deals nicely with
> ENOSPC conditions, but it never provides a -guarantee- that any
> accepted write will always succeed without ENOSPC.
>
> IMO, using this "close-but-not-quite" reservation as the basis of
> space requirements for other layers to provide "won't ENOSPC"
> guarantees is fraught with problems. We already know that it is
> insufficient in important corner cases at the filesystem level, and
> we also know that lower layers trying to do ephemeral space
> reservations will have exactly the same problems providing a
> guarantee. And these are problems we've been unable to engineer
> around in the past, so it is very unlikely that we can engineer
> around them now or in the future.
Thanks for clarifying. I wasn't aware of XFS delalloc's "wet finger in
the wind" estimate. ;)
So do you think it's reasonable to require applications to fallocate
their data files? I'm not sure users know to take that extra step.
> > -- so adding coordination between XFS
> > and dm-thin layers provides comparable safety... that safety is an
> > expected norm).
> >
> > But rather than discuss in terms of data vs metadata, the distinction
> > is:
> > 1) LBA range reservation (normal case, your proposal)
> > 2) non-LBA reservation (absolute value, LBA range is known at later stage)
> >
> > But I'm clearly going off script for dwelling on wanting to handle
> > both.
>
> Right, because if we do 1) then we don't need 2). :)
Sure.
> > My looking at (ab)using REQ_META being set (use 1) vs not (use 2) was
> > a crude simplification for branching between the 2 approaches.
> >
> > And I understand I made you nervous by expanding the scope to a much
> > more muddled/shitty interface. ;)
>
> Nervous? No, I'm simply trying to make sure that everyone is on the
> same page. i.e. that if we water down the guarantee that 1) relies
> on, then it's not actually useful to filesystems at all.
Yeah, makes sense.
> > > Put simply: if we restrict REQ_OP_PROVISION guarantees to just
> > > REQ_META writes (or any other specific type of write operation) then
> > > it's simply not worth pursuing at the filesystem level because the
> > > guarantees we actually need just aren't there and the complexity of
> > > discovering and handling those corner cases just isn't worth the
> > > effort.
> >
> > Here is where I get to say: I think you misunderstood me (but it was
> > my fault for not being absolutely clear: I'm very much on the same
> > page as you and Joe; and your visions need to just be implemented
> > ASAP).
>
> OK, good that we've clarified the misunderstandings on both sides
> quickly :)
Do you think you're OK to scope out, and/or implement, the XFS changes
using v7 of this patchset as the starting point? (v8 should just be
v7 minus the dm-thin.c and dm-snap.c changes.) The thinp support in
v7 works well enough to allow XFS to issue REQ_OP_PROVISION and/or
fallocate (via mkfs.xfs) to dm-thin devices.
And Joe and I can make independent progress on the dm-thin.c changes
needed to ensure the REQ_OP_PROVISION guarantee you need.
Thanks,
Mike