From: Dave Chinner <david@fromorbit.com>
To: Mike Snitzer <snitzer@kernel.org>
Cc: Christoph Hellwig <hch@infradead.org>,
	Sarthak Kukreti <sarthakkukreti@chromium.org>,
	dm-devel@redhat.com, linux-block@vger.kernel.org,
	linux-ext4@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-fsdevel@vger.kernel.org, Jens Axboe <axboe@kernel.dk>,
	"Michael S. Tsirkin" <mst@redhat.com>,
	Jason Wang <jasowang@redhat.com>,
	Stefan Hajnoczi <stefanha@redhat.com>,
	Alasdair Kergon <agk@redhat.com>,
	Brian Foster <bfoster@redhat.com>, Theodore Ts'o <tytso@mit.edu>,
	Andreas Dilger <adilger.kernel@dilger.ca>,
	Bart Van Assche <bvanassche@google.com>,
	"Darrick J. Wong" <djwong@kernel.org>
Subject: Re: [PATCH v7 0/5] Introduce provisioning primitives
Date: Sat, 20 May 2023 09:07:46 +1000
Message-ID: <ZGgBQhsbU9b0RiT1@dread.disaster.area>
In-Reply-To: <ZGeKm+jcBxzkMXQs@redhat.com>

On Fri, May 19, 2023 at 10:41:31AM -0400, Mike Snitzer wrote:
> On Fri, May 19 2023 at 12:09P -0400,
> Christoph Hellwig <hch@infradead.org> wrote:
> 
> > FYI, I really don't think this primitive is a good idea.  In the
> > concept of non-overwritable storage (NAND, SMR drives) the entire
> > concept of a one-shoot 'provisioning' that will guarantee later writes
> > are always possible is simply bogus.
> 
> Valid point for sure, such storage shouldn't advertise support (and
> will return -EOPNOTSUPP).
> 
> But the primitive still has utility for other classes of storage.

Yet the thing people are wanting us filesystem developers to use
this with is thinly provisioned storage that has snapshot
capability. That, by definition, is non-overwritable storage. These
are the use cases where people are asking filesystems to handle and
report errors gracefully when the sparse backing store runs out of
space.

e.g. journal writes after a snapshot is taken on a busy filesystem
are always an overwrite and this requires more space in the storage
device for the write to succeed. ENOSPC from the backing device for
journal IO is a -fatal error-. Hence if REQ_PROVISION doesn't
guarantee space for overwrites after snapshots, then it's not
actually useful for solving the real world use cases we actually
need device-level provisioning to solve.

It is not viable for filesystems to have to reprovision space for
in-place metadata overwrites after every snapshot - the filesystem
may not even know a snapshot has been taken! And it's not feasible
for filesystems to provision on demand before they modify metadata
because we don't know what metadata is going to need to be modified
before we start modifying metadata in transactions. If we get ENOSPC
from provisioning in the middle of a dirty transaction, it's all
over just the same as if we get ENOSPC during metadata writeback...

Hence what filesystems actually need is device provisioned space to
be -always over-writable- without ENOSPC occurring.  Ideally, if we
provision a range of the block device, the block device *must*
guarantee all future writes to that LBA range succeed. That
guarantee needs to stand until we discard or unmap the LBA range,
and for however many writes we do to that LBA range.
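
As a rough illustration of that guarantee, here is a toy userspace
model. It is a sketch under stated assumptions, not dm-thin code:
ThinPool and its methods are hypothetical names, and one LBA is
assumed to map to one pool block.

```python
import errno


class ThinPool:
    """Toy model of a thinly provisioned device (hypothetical, not dm-thin)."""

    def __init__(self, free_blocks):
        self.free = free_blocks   # unallocated blocks in the backing pool
        self.backed = set()       # LBAs that already own backing space

    def provision(self, start, count):
        """Reserve space for [start, start + count) or fail up front."""
        new = set(range(start, start + count)) - self.backed
        if len(new) > self.free:
            raise OSError(errno.ENOSPC, "backing pool exhausted")
        self.free -= len(new)
        self.backed |= new

    def write(self, lba):
        """Writes to backed LBAs need no new space, so they cannot ENOSPC."""
        if lba in self.backed:
            return
        # Unprovisioned LBA: allocate on demand, which may fail.
        if self.free == 0:
            raise OSError(errno.ENOSPC, "backing pool exhausted")
        self.free -= 1
        self.backed.add(lba)

    def discard(self, start, count):
        """Discard ends the guarantee and returns space to the pool."""
        gone = set(range(start, start + count)) & self.backed
        self.backed -= gone
        self.free += len(gone)
```

In this model a journal provisioned once can be overwritten any
number of times, and only a discard of the range gives the space
back to the pool.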

e.g. if the device takes a snapshot, it needs to reprovision the
potential COW ranges that overlap with the provisioned LBA range at
snapshot time, for instance by re-reserving the space from the
backing pool for the provisioned space so that if a COW occurs there
is space guaranteed for it to succeed.  If there isn't space in the
backing pool for the reprovisioning, then whatever operation triggers
the COW behaviour should fail with ENOSPC before doing anything
else....

Software devices like dm-thin/snapshot should really only need to
keep a persistent map of the provisioned space and refresh space
reservations for used space within that map whenever something that
triggers COW behaviour occurs. i.e. a snapshot needs to reset the
provisioned ranges back to "all ranges are freshly provisioned"
before the snapshot is started. If that space is not available in
the backing pool, then the snapshot attempt gets ENOSPC....
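
The accounting described above could look something like the toy
model below. This is only a sketch: SnapshotPool, cow_reserved and
the one-block-per-LBA accounting are assumptions made for
illustration, not how dm-thin tracks space.

```python
import errno


class SnapshotPool:
    """Toy accounting for COW reservations (hypothetical, not dm-thin)."""

    def __init__(self, free_blocks):
        self.free = free_blocks
        self.provisioned = set()    # persistent map of provisioned LBAs
        self.cow_reserved = set()   # provisioned LBAs with a COW block set aside

    def provision(self, start, count):
        new = set(range(start, start + count)) - self.provisioned
        if len(new) > self.free:
            raise OSError(errno.ENOSPC, "cannot provision")
        self.free -= len(new)
        self.provisioned |= new

    def snapshot(self):
        """Re-reserve one COW block per provisioned LBA before snapshotting."""
        need = self.provisioned - self.cow_reserved
        if len(need) > self.free:
            # Fail the snapshot itself, so later overwrites stay safe.
            raise OSError(errno.ENOSPC, "cannot reserve COW space")
        self.free -= len(need)
        self.cow_reserved |= need

    def write(self, lba):
        """An overwrite after a snapshot consumes the reserved COW block."""
        if lba not in self.provisioned:
            raise ValueError("toy model only handles provisioned LBAs")
        self.cow_reserved.discard(lba)
```

The point of the model is where the failure lands: when the pool
cannot cover the reservations, snapshot() raises ENOSPC and write()
to a provisioned LBA never does.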

That means filesystems only need to provision space for journals and
fixed metadata at mkfs time, and they only need to issue a
REQ_PROVISION bio when they first allocate overwrite-in-place
metadata. We already have online discard and/or fstrim for releasing
provisioned space via discards.

This will require some mods to filesystems like ext4 and XFS to
issue REQ_PROVISION and fail gracefully during metadata allocation.
However, doing so means that we can actually harden filesystems
against sparse block device ENOSPC errors by ensuring they will
never occur in critical filesystem structures....
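
A minimal sketch of what "fail gracefully during metadata
allocation" means on the filesystem side, assuming a hypothetical
FakeDevice that accepts provisioning requests and a hypothetical
alloc_metadata_block() helper (neither is ext4 or XFS code):

```python
import errno


class FakeDevice:
    """Stand-in for a block device that accepts provisioning requests."""

    def __init__(self, free_blocks):
        self.free = free_blocks

    def provision(self, count):
        if count > self.free:
            raise OSError(errno.ENOSPC, "backing pool exhausted")
        self.free -= count


def alloc_metadata_block(dev, free_list, transaction):
    """Provision first; only touch the transaction once space is guaranteed."""
    if not free_list:
        raise OSError(errno.ENOSPC, "filesystem full")
    dev.provision(1)            # may raise ENOSPC; nothing is dirty yet
    block = free_list.pop()
    transaction.append(block)   # later writeback of this block cannot ENOSPC
    return block
```

If provisioning fails here, the free list and the transaction are
untouched, so the caller sees an ordinary ENOSPC from allocation
instead of a fatal error in the middle of a dirty transaction.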

-Dave.
-- 
Dave Chinner
david@fromorbit.com
