* [PATCH v7 1/5] block: Don't invalidate pagecache for invalid falloc modes
2023-05-18 22:33 [PATCH v7 0/5] Introduce provisioning primitives Sarthak Kukreti
@ 2023-05-18 22:33 ` Sarthak Kukreti
2023-05-19 4:09 ` Christoph Hellwig
2023-05-19 15:17 ` Darrick J. Wong
2023-05-18 22:33 ` [PATCH v7 2/5] block: Introduce provisioning primitives Sarthak Kukreti
` (4 subsequent siblings)
5 siblings, 2 replies; 52+ messages in thread
From: Sarthak Kukreti @ 2023-05-18 22:33 UTC (permalink / raw)
To: dm-devel, linux-block, linux-ext4, linux-kernel, linux-fsdevel
Cc: Jens Axboe, Michael S. Tsirkin, Jason Wang, Stefan Hajnoczi,
Alasdair Kergon, Mike Snitzer, Christoph Hellwig, Brian Foster,
Theodore Ts'o, Andreas Dilger, Bart Van Assche,
Darrick J. Wong, stable
Only call truncate_bdev_range() if the fallocate mode is
supported. This fixes a bug where data in the pagecache
could be invalidated if fallocate() was called on the
block device with an invalid mode.
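The failure mode is easiest to see in the switch in the diff below: only the listed mode combinations may reach truncate_bdev_range(). A standalone sketch of that check (flag values mirror <linux/falloc.h>; the helper name is invented for illustration, not part of the patch):

```c
#include <assert.h>

/* Flag values as in <linux/falloc.h>. */
#define FALLOC_FL_KEEP_SIZE     0x01
#define FALLOC_FL_PUNCH_HOLE    0x02
#define FALLOC_FL_NO_HIDE_STALE 0x04
#define FALLOC_FL_ZERO_RANGE    0x10

/*
 * Hypothetical helper modelling which mode combinations
 * blkdev_fallocate() treats as valid de-allocate modes. Only these
 * may invalidate the page cache; anything else must be rejected
 * before truncate_bdev_range() runs.
 */
static int dealloc_mode_valid(int mode)
{
	switch (mode) {
	case FALLOC_FL_ZERO_RANGE:
	case FALLOC_FL_ZERO_RANGE | FALLOC_FL_KEEP_SIZE:
	case FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE:
	case FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE | FALLOC_FL_NO_HIDE_STALE:
		return 1;
	default:
		return 0;
	}
}
```

With the fix, an invalid combination such as FALLOC_FL_KEEP_SIZE alone returns -EOPNOTSUPP with the page cache left untouched.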
Fixes: 25f4c41415e5 ("block: implement (some of) fallocate for block devices")
Cc: stable@vger.kernel.org
Reported-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Sarthak Kukreti <sarthakkukreti@chromium.org>
---
block/fops.c | 21 ++++++++++++++++-----
1 file changed, 16 insertions(+), 5 deletions(-)
diff --git a/block/fops.c b/block/fops.c
index d2e6be4e3d1c..4c70fdc546e7 100644
--- a/block/fops.c
+++ b/block/fops.c
@@ -648,24 +648,35 @@ static long blkdev_fallocate(struct file *file, int mode, loff_t start,
filemap_invalidate_lock(inode->i_mapping);
- /* Invalidate the page cache, including dirty pages. */
- error = truncate_bdev_range(bdev, file->f_mode, start, end);
- if (error)
- goto fail;
-
+ /*
+ * Invalidate the page cache, including dirty pages, for valid
+ * de-allocate mode calls to fallocate().
+ */
switch (mode) {
case FALLOC_FL_ZERO_RANGE:
case FALLOC_FL_ZERO_RANGE | FALLOC_FL_KEEP_SIZE:
+ error = truncate_bdev_range(bdev, file->f_mode, start, end);
+ if (error)
+ goto fail;
+
error = blkdev_issue_zeroout(bdev, start >> SECTOR_SHIFT,
len >> SECTOR_SHIFT, GFP_KERNEL,
BLKDEV_ZERO_NOUNMAP);
break;
case FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE:
+ error = truncate_bdev_range(bdev, file->f_mode, start, end);
+ if (error)
+ goto fail;
+
error = blkdev_issue_zeroout(bdev, start >> SECTOR_SHIFT,
len >> SECTOR_SHIFT, GFP_KERNEL,
BLKDEV_ZERO_NOFALLBACK);
break;
case FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE | FALLOC_FL_NO_HIDE_STALE:
+ error = truncate_bdev_range(bdev, file->f_mode, start, end);
+ if (error)
+ goto fail;
+
error = blkdev_issue_discard(bdev, start >> SECTOR_SHIFT,
len >> SECTOR_SHIFT, GFP_KERNEL);
break;
--
2.40.1.698.g37aff9b760-goog
* Re: [PATCH v7 1/5] block: Don't invalidate pagecache for invalid falloc modes
2023-05-18 22:33 ` [PATCH v7 1/5] block: Don't invalidate pagecache for invalid falloc modes Sarthak Kukreti
@ 2023-05-19 4:09 ` Christoph Hellwig
2023-05-19 15:17 ` Darrick J. Wong
1 sibling, 0 replies; 52+ messages in thread
From: Christoph Hellwig @ 2023-05-19 4:09 UTC (permalink / raw)
To: Sarthak Kukreti
Cc: dm-devel, linux-block, linux-ext4, linux-kernel, linux-fsdevel,
Jens Axboe, Michael S. Tsirkin, Jason Wang, Stefan Hajnoczi,
Alasdair Kergon, Mike Snitzer, Christoph Hellwig, Brian Foster,
Theodore Ts'o, Andreas Dilger, Bart Van Assche,
Darrick J. Wong, stable
On Thu, May 18, 2023 at 03:33:22PM -0700, Sarthak Kukreti wrote:
> Only call truncate_bdev_range() if the fallocate mode is
> supported. This fixes a bug where data in the pagecache
> could be invalidated if the fallocate() was called on the
> block device with an invalid mode.
>
> Fixes: 25f4c41415e5 ("block: implement (some of) fallocate for block devices")
> Cc: stable@vger.kernel.org
> Reported-by: Darrick J. Wong <djwong@kernel.org>
> Signed-off-by: Sarthak Kukreti <sarthakkukreti@chromium.org>
Looks good:
Reviewed-by: Christoph Hellwig <hch@lst.de>
* Re: [PATCH v7 1/5] block: Don't invalidate pagecache for invalid falloc modes
2023-05-18 22:33 ` [PATCH v7 1/5] block: Don't invalidate pagecache for invalid falloc modes Sarthak Kukreti
2023-05-19 4:09 ` Christoph Hellwig
@ 2023-05-19 15:17 ` Darrick J. Wong
1 sibling, 0 replies; 52+ messages in thread
From: Darrick J. Wong @ 2023-05-19 15:17 UTC (permalink / raw)
To: Sarthak Kukreti
Cc: dm-devel, linux-block, linux-ext4, linux-kernel, linux-fsdevel,
Jens Axboe, Michael S. Tsirkin, Jason Wang, Stefan Hajnoczi,
Alasdair Kergon, Mike Snitzer, Christoph Hellwig, Brian Foster,
Theodore Ts'o, Andreas Dilger, Bart Van Assche, stable
On Thu, May 18, 2023 at 03:33:22PM -0700, Sarthak Kukreti wrote:
> Only call truncate_bdev_range() if the fallocate mode is
> supported. This fixes a bug where data in the pagecache
> could be invalidated if the fallocate() was called on the
> block device with an invalid mode.
>
> Fixes: 25f4c41415e5 ("block: implement (some of) fallocate for block devices")
> Cc: stable@vger.kernel.org
> Reported-by: Darrick J. Wong <djwong@kernel.org>
> Signed-off-by: Sarthak Kukreti <sarthakkukreti@chromium.org>
Thanks for fixing this,
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
--D
> ---
> block/fops.c | 21 ++++++++++++++++-----
> 1 file changed, 16 insertions(+), 5 deletions(-)
>
> diff --git a/block/fops.c b/block/fops.c
> index d2e6be4e3d1c..4c70fdc546e7 100644
> --- a/block/fops.c
> +++ b/block/fops.c
> @@ -648,24 +648,35 @@ static long blkdev_fallocate(struct file *file, int mode, loff_t start,
>
> filemap_invalidate_lock(inode->i_mapping);
>
> - /* Invalidate the page cache, including dirty pages. */
> - error = truncate_bdev_range(bdev, file->f_mode, start, end);
> - if (error)
> - goto fail;
> -
> + /*
> + * Invalidate the page cache, including dirty pages, for valid
> + * de-allocate mode calls to fallocate().
> + */
> switch (mode) {
> case FALLOC_FL_ZERO_RANGE:
> case FALLOC_FL_ZERO_RANGE | FALLOC_FL_KEEP_SIZE:
> + error = truncate_bdev_range(bdev, file->f_mode, start, end);
> + if (error)
> + goto fail;
> +
> error = blkdev_issue_zeroout(bdev, start >> SECTOR_SHIFT,
> len >> SECTOR_SHIFT, GFP_KERNEL,
> BLKDEV_ZERO_NOUNMAP);
> break;
> case FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE:
> + error = truncate_bdev_range(bdev, file->f_mode, start, end);
> + if (error)
> + goto fail;
> +
> error = blkdev_issue_zeroout(bdev, start >> SECTOR_SHIFT,
> len >> SECTOR_SHIFT, GFP_KERNEL,
> BLKDEV_ZERO_NOFALLBACK);
> break;
> case FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE | FALLOC_FL_NO_HIDE_STALE:
> + error = truncate_bdev_range(bdev, file->f_mode, start, end);
> + if (error)
> + goto fail;
> +
> error = blkdev_issue_discard(bdev, start >> SECTOR_SHIFT,
> len >> SECTOR_SHIFT, GFP_KERNEL);
> break;
> --
> 2.40.1.698.g37aff9b760-goog
>
* [PATCH v7 2/5] block: Introduce provisioning primitives
2023-05-18 22:33 [PATCH v7 0/5] Introduce provisioning primitives Sarthak Kukreti
2023-05-18 22:33 ` [PATCH v7 1/5] block: Don't invalidate pagecache for invalid falloc modes Sarthak Kukreti
@ 2023-05-18 22:33 ` Sarthak Kukreti
2023-05-19 4:18 ` Christoph Hellwig
2023-06-09 20:00 ` Mike Snitzer
2023-05-18 22:33 ` [PATCH v7 3/5] dm: Add block provisioning support Sarthak Kukreti
` (3 subsequent siblings)
5 siblings, 2 replies; 52+ messages in thread
From: Sarthak Kukreti @ 2023-05-18 22:33 UTC (permalink / raw)
To: dm-devel, linux-block, linux-ext4, linux-kernel, linux-fsdevel
Cc: Jens Axboe, Michael S. Tsirkin, Jason Wang, Stefan Hajnoczi,
Alasdair Kergon, Mike Snitzer, Christoph Hellwig, Brian Foster,
Theodore Ts'o, Andreas Dilger, Bart Van Assche,
Darrick J. Wong
Introduce block request REQ_OP_PROVISION. The intent of this request
is to ask the underlying storage to preallocate disk space for the given
block range. Block devices that support this capability will export
a provision limit within their request queues.

This patch also adds the capability to call fallocate() in mode 0
on block devices, which will send REQ_OP_PROVISION to the block
device for the specified range.
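As a rough model of the submission loop in blkdev_issue_provision() below, the requested range is carved into bios of at most max_provision_sectors each (helper name invented for this sketch; not in the patch):

```c
#include <assert.h>

typedef unsigned long long sector_t;

/*
 * Illustrative only: how many REQ_OP_PROVISION bios the loop in
 * blkdev_issue_provision() would submit for a range, given the
 * device's max_provision_sectors limit. A zero limit means the
 * device does not support provisioning (-EOPNOTSUPP in the real
 * code).
 */
static unsigned int provision_bio_count(sector_t nr_sects,
					unsigned int max_sectors)
{
	unsigned int n = 0;

	if (max_sectors == 0)
		return 0;
	while (nr_sects) {
		sector_t req = nr_sects < max_sectors ? nr_sects : max_sectors;

		nr_sects -= req;
		n++;
	}
	return n;
}
```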
Signed-off-by: Sarthak Kukreti <sarthakkukreti@chromium.org>
---
block/blk-core.c | 5 ++++
block/blk-lib.c | 51 +++++++++++++++++++++++++++++++++++++++
block/blk-merge.c | 18 ++++++++++++++
block/blk-settings.c | 19 +++++++++++++++
block/blk-sysfs.c | 9 +++++++
block/bounce.c | 1 +
block/fops.c | 10 +++++++-
include/linux/bio.h | 6 +++--
include/linux/blk_types.h | 5 +++-
include/linux/blkdev.h | 16 ++++++++++++
10 files changed, 136 insertions(+), 4 deletions(-)
diff --git a/block/blk-core.c b/block/blk-core.c
index 00c74330fa92..f515c6c2e030 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -123,6 +123,7 @@ static const char *const blk_op_name[] = {
REQ_OP_NAME(WRITE_ZEROES),
REQ_OP_NAME(DRV_IN),
REQ_OP_NAME(DRV_OUT),
+ REQ_OP_NAME(PROVISION)
};
#undef REQ_OP_NAME
@@ -792,6 +793,10 @@ void submit_bio_noacct(struct bio *bio)
if (!q->limits.max_write_zeroes_sectors)
goto not_supported;
break;
+ case REQ_OP_PROVISION:
+ if (!q->limits.max_provision_sectors)
+ goto not_supported;
+ break;
default:
break;
}
diff --git a/block/blk-lib.c b/block/blk-lib.c
index e59c3069e835..3cff5fb654f5 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -343,3 +343,54 @@ int blkdev_issue_secure_erase(struct block_device *bdev, sector_t sector,
return ret;
}
EXPORT_SYMBOL(blkdev_issue_secure_erase);
+
+/**
+ * blkdev_issue_provision - provision a block range
+ * @bdev: blockdev to write
+ * @sector: start sector
+ * @nr_sects: number of sectors to provision
+ * @gfp_mask: memory allocation flags (for bio_alloc)
+ *
+ * Description:
+ * Issues a provision request to the block device for the range of sectors.
+ * For thinly provisioned block devices, this acts as a signal for the
+ * underlying storage pool to allocate space for this block range.
+ */
+int blkdev_issue_provision(struct block_device *bdev, sector_t sector,
+ sector_t nr_sects, gfp_t gfp)
+{
+ sector_t bs_mask = (bdev_logical_block_size(bdev) >> 9) - 1;
+ unsigned int max_sectors = bdev_max_provision_sectors(bdev);
+ struct bio *bio = NULL;
+ struct blk_plug plug;
+ int ret = 0;
+
+ if (max_sectors == 0)
+ return -EOPNOTSUPP;
+ if ((sector | nr_sects) & bs_mask)
+ return -EINVAL;
+ if (bdev_read_only(bdev))
+ return -EPERM;
+
+ blk_start_plug(&plug);
+ for (;;) {
+ unsigned int req_sects = min_t(sector_t, nr_sects, max_sectors);
+
+ bio = blk_next_bio(bio, bdev, 0, REQ_OP_PROVISION, gfp);
+ bio->bi_iter.bi_sector = sector;
+ bio->bi_iter.bi_size = req_sects << SECTOR_SHIFT;
+
+ sector += req_sects;
+ nr_sects -= req_sects;
+ if (!nr_sects) {
+ ret = submit_bio_wait(bio);
+ bio_put(bio);
+ break;
+ }
+ cond_resched();
+ }
+ blk_finish_plug(&plug);
+
+ return ret;
+}
+EXPORT_SYMBOL(blkdev_issue_provision);
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 65e75efa9bd3..83e516d2121f 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -158,6 +158,21 @@ static struct bio *bio_split_write_zeroes(struct bio *bio,
return bio_split(bio, lim->max_write_zeroes_sectors, GFP_NOIO, bs);
}
+static struct bio *bio_split_provision(struct bio *bio,
+ const struct queue_limits *lim,
+ unsigned int *nsegs, struct bio_set *bs)
+{
+ *nsegs = 0;
+
+ if (!lim->max_provision_sectors)
+ return NULL;
+
+ if (bio_sectors(bio) <= lim->max_provision_sectors)
+ return NULL;
+
+ return bio_split(bio, lim->max_provision_sectors, GFP_NOIO, bs);
+}
+
/*
* Return the maximum number of sectors from the start of a bio that may be
* submitted as a single request to a block device. If enough sectors remain,
@@ -366,6 +381,9 @@ struct bio *__bio_split_to_limits(struct bio *bio,
case REQ_OP_WRITE_ZEROES:
split = bio_split_write_zeroes(bio, lim, nr_segs, bs);
break;
+ case REQ_OP_PROVISION:
+ split = bio_split_provision(bio, lim, nr_segs, bs);
+ break;
default:
split = bio_split_rw(bio, lim, nr_segs, bs,
get_max_io_size(bio, lim) << SECTOR_SHIFT);
diff --git a/block/blk-settings.c b/block/blk-settings.c
index 896b4654ab00..d303e6614c36 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -59,6 +59,7 @@ void blk_set_default_limits(struct queue_limits *lim)
lim->zoned = BLK_ZONED_NONE;
lim->zone_write_granularity = 0;
lim->dma_alignment = 511;
+ lim->max_provision_sectors = 0;
}
/**
@@ -82,6 +83,7 @@ void blk_set_stacking_limits(struct queue_limits *lim)
lim->max_dev_sectors = UINT_MAX;
lim->max_write_zeroes_sectors = UINT_MAX;
lim->max_zone_append_sectors = UINT_MAX;
+ lim->max_provision_sectors = UINT_MAX;
}
EXPORT_SYMBOL(blk_set_stacking_limits);
@@ -208,6 +210,20 @@ void blk_queue_max_write_zeroes_sectors(struct request_queue *q,
}
EXPORT_SYMBOL(blk_queue_max_write_zeroes_sectors);
+/**
+ * blk_queue_max_provision_sectors - set max sectors for a single provision
+ *
+ * @q: the request queue for the device
+ * @max_provision_sectors: maximum number of sectors to provision per command
+ **/
+
+void blk_queue_max_provision_sectors(struct request_queue *q,
+ unsigned int max_provision_sectors)
+{
+ q->limits.max_provision_sectors = max_provision_sectors;
+}
+EXPORT_SYMBOL(blk_queue_max_provision_sectors);
+
/**
* blk_queue_max_zone_append_sectors - set max sectors for a single zone append
* @q: the request queue for the device
@@ -578,6 +594,9 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
t->max_segment_size = min_not_zero(t->max_segment_size,
b->max_segment_size);
+ t->max_provision_sectors = min_not_zero(t->max_provision_sectors,
+ b->max_provision_sectors);
+
t->misaligned |= b->misaligned;
alignment = queue_limit_alignment_offset(b, start);
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index a64208583853..094f0a65bd78 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -212,6 +212,13 @@ static ssize_t queue_discard_zeroes_data_show(struct request_queue *q, char *pag
return queue_var_show(0, page);
}
+static ssize_t queue_provision_max_show(struct request_queue *q,
+ char *page)
+{
+ return sprintf(page, "%llu\n",
+ (unsigned long long)q->limits.max_provision_sectors << 9);
+}
+
static ssize_t queue_write_same_max_show(struct request_queue *q, char *page)
{
return queue_var_show(0, page);
@@ -580,6 +587,7 @@ QUEUE_RO_ENTRY(queue_discard_max_hw, "discard_max_hw_bytes");
QUEUE_RW_ENTRY(queue_discard_max, "discard_max_bytes");
QUEUE_RO_ENTRY(queue_discard_zeroes_data, "discard_zeroes_data");
+QUEUE_RO_ENTRY(queue_provision_max, "provision_max_bytes");
QUEUE_RO_ENTRY(queue_write_same_max, "write_same_max_bytes");
QUEUE_RO_ENTRY(queue_write_zeroes_max, "write_zeroes_max_bytes");
QUEUE_RO_ENTRY(queue_zone_append_max, "zone_append_max_bytes");
@@ -637,6 +645,7 @@ static struct attribute *queue_attrs[] = {
&queue_discard_max_entry.attr,
&queue_discard_max_hw_entry.attr,
&queue_discard_zeroes_data_entry.attr,
+ &queue_provision_max_entry.attr,
&queue_write_same_max_entry.attr,
&queue_write_zeroes_max_entry.attr,
&queue_zone_append_max_entry.attr,
diff --git a/block/bounce.c b/block/bounce.c
index 7cfcb242f9a1..ab9d8723ae64 100644
--- a/block/bounce.c
+++ b/block/bounce.c
@@ -176,6 +176,7 @@ static struct bio *bounce_clone_bio(struct bio *bio_src)
case REQ_OP_DISCARD:
case REQ_OP_SECURE_ERASE:
case REQ_OP_WRITE_ZEROES:
+ case REQ_OP_PROVISION:
break;
default:
bio_for_each_segment(bv, bio_src, iter)
diff --git a/block/fops.c b/block/fops.c
index 4c70fdc546e7..be2e41f160bf 100644
--- a/block/fops.c
+++ b/block/fops.c
@@ -613,7 +613,8 @@ static ssize_t blkdev_read_iter(struct kiocb *iocb, struct iov_iter *to)
#define BLKDEV_FALLOC_FL_SUPPORTED \
(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE | \
- FALLOC_FL_ZERO_RANGE | FALLOC_FL_NO_HIDE_STALE)
+ FALLOC_FL_ZERO_RANGE | FALLOC_FL_NO_HIDE_STALE | \
+ FALLOC_FL_UNSHARE_RANGE)
static long blkdev_fallocate(struct file *file, int mode, loff_t start,
loff_t len)
@@ -653,6 +654,13 @@ static long blkdev_fallocate(struct file *file, int mode, loff_t start,
* de-allocate mode calls to fallocate().
*/
switch (mode) {
+ case 0:
+ case FALLOC_FL_UNSHARE_RANGE:
+ case FALLOC_FL_KEEP_SIZE:
+ case FALLOC_FL_UNSHARE_RANGE | FALLOC_FL_KEEP_SIZE:
+ error = blkdev_issue_provision(bdev, start >> SECTOR_SHIFT,
+ len >> SECTOR_SHIFT, GFP_KERNEL);
+ break;
case FALLOC_FL_ZERO_RANGE:
case FALLOC_FL_ZERO_RANGE | FALLOC_FL_KEEP_SIZE:
error = truncate_bdev_range(bdev, file->f_mode, start, end);
diff --git a/include/linux/bio.h b/include/linux/bio.h
index b3e7529ff55e..de85d5faf221 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -57,7 +57,8 @@ static inline bool bio_has_data(struct bio *bio)
bio->bi_iter.bi_size &&
bio_op(bio) != REQ_OP_DISCARD &&
bio_op(bio) != REQ_OP_SECURE_ERASE &&
- bio_op(bio) != REQ_OP_WRITE_ZEROES)
+ bio_op(bio) != REQ_OP_WRITE_ZEROES &&
+ bio_op(bio) != REQ_OP_PROVISION)
return true;
return false;
@@ -67,7 +68,8 @@ static inline bool bio_no_advance_iter(const struct bio *bio)
{
return bio_op(bio) == REQ_OP_DISCARD ||
bio_op(bio) == REQ_OP_SECURE_ERASE ||
- bio_op(bio) == REQ_OP_WRITE_ZEROES;
+ bio_op(bio) == REQ_OP_WRITE_ZEROES ||
+ bio_op(bio) == REQ_OP_PROVISION;
}
static inline void *bio_data(struct bio *bio)
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 740afe80f297..b7bb0226fdee 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -390,7 +390,10 @@ enum req_op {
REQ_OP_DRV_IN = (__force blk_opf_t)34,
REQ_OP_DRV_OUT = (__force blk_opf_t)35,
- REQ_OP_LAST = (__force blk_opf_t)36,
+ /* request device to provision block */
+ REQ_OP_PROVISION = (__force blk_opf_t)37,
+
+ REQ_OP_LAST = (__force blk_opf_t)38,
};
enum req_flag_bits {
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index b441e633f4dd..462ce586d46f 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -294,6 +294,7 @@ struct queue_limits {
unsigned int discard_granularity;
unsigned int discard_alignment;
unsigned int zone_write_granularity;
+ unsigned int max_provision_sectors;
unsigned short max_segments;
unsigned short max_integrity_segments;
@@ -906,6 +907,8 @@ extern void blk_queue_max_discard_sectors(struct request_queue *q,
unsigned int max_discard_sectors);
extern void blk_queue_max_write_zeroes_sectors(struct request_queue *q,
unsigned int max_write_same_sectors);
+extern void blk_queue_max_provision_sectors(struct request_queue *q,
+ unsigned int max_provision_sectors);
extern void blk_queue_logical_block_size(struct request_queue *, unsigned int);
extern void blk_queue_max_zone_append_sectors(struct request_queue *q,
unsigned int max_zone_append_sectors);
@@ -1045,6 +1048,9 @@ int __blkdev_issue_discard(struct block_device *bdev, sector_t sector,
int blkdev_issue_secure_erase(struct block_device *bdev, sector_t sector,
sector_t nr_sects, gfp_t gfp);
+extern int blkdev_issue_provision(struct block_device *bdev, sector_t sector,
+ sector_t nr_sects, gfp_t gfp_mask);
+
#define BLKDEV_ZERO_NOUNMAP (1 << 0) /* do not free blocks */
#define BLKDEV_ZERO_NOFALLBACK (1 << 1) /* don't write explicit zeroes */
@@ -1124,6 +1130,11 @@ static inline unsigned short queue_max_discard_segments(const struct request_que
return q->limits.max_discard_segments;
}
+static inline unsigned short queue_max_provision_sectors(const struct request_queue *q)
+{
+ return q->limits.max_provision_sectors;
+}
+
static inline unsigned int queue_max_segment_size(const struct request_queue *q)
{
return q->limits.max_segment_size;
@@ -1266,6 +1277,11 @@ static inline bool bdev_nowait(struct block_device *bdev)
return test_bit(QUEUE_FLAG_NOWAIT, &bdev_get_queue(bdev)->queue_flags);
}
+static inline unsigned int bdev_max_provision_sectors(struct block_device *bdev)
+{
+ return bdev_get_queue(bdev)->limits.max_provision_sectors;
+}
+
static inline enum blk_zoned_model bdev_zoned_model(struct block_device *bdev)
{
return blk_queue_zoned_model(bdev_get_queue(bdev));
--
2.40.1.698.g37aff9b760-goog
* Re: [PATCH v7 2/5] block: Introduce provisioning primitives
2023-05-18 22:33 ` [PATCH v7 2/5] block: Introduce provisioning primitives Sarthak Kukreti
@ 2023-05-19 4:18 ` Christoph Hellwig
2023-06-09 20:00 ` Mike Snitzer
1 sibling, 0 replies; 52+ messages in thread
From: Christoph Hellwig @ 2023-05-19 4:18 UTC (permalink / raw)
To: Sarthak Kukreti
Cc: dm-devel, linux-block, linux-ext4, linux-kernel, linux-fsdevel,
Jens Axboe, Michael S. Tsirkin, Jason Wang, Stefan Hajnoczi,
Alasdair Kergon, Mike Snitzer, Christoph Hellwig, Brian Foster,
Theodore Ts'o, Andreas Dilger, Bart Van Assche,
Darrick J. Wong
On Thu, May 18, 2023 at 03:33:23PM -0700, Sarthak Kukreti wrote:
> +EXPORT_SYMBOL(blkdev_issue_provision);
IFF this gets in in some form, please make sure all exports for new
block layer functionality are EXPORT_SYMBOL_GPL, thanks.
* Re: [PATCH v7 2/5] block: Introduce provisioning primitives
2023-05-18 22:33 ` [PATCH v7 2/5] block: Introduce provisioning primitives Sarthak Kukreti
2023-05-19 4:18 ` Christoph Hellwig
@ 2023-06-09 20:00 ` Mike Snitzer
1 sibling, 0 replies; 52+ messages in thread
From: Mike Snitzer @ 2023-06-09 20:00 UTC (permalink / raw)
To: Sarthak Kukreti
Cc: dm-devel, linux-block, linux-ext4, linux-kernel, linux-fsdevel,
Jens Axboe, Michael S. Tsirkin, Jason Wang, Stefan Hajnoczi,
Alasdair Kergon, Christoph Hellwig, Brian Foster,
Theodore Ts'o, Andreas Dilger, Bart Van Assche,
Darrick J. Wong
On Thu, May 18 2023 at 6:33P -0400,
Sarthak Kukreti <sarthakkukreti@chromium.org> wrote:
> Introduce block request REQ_OP_PROVISION. The intent of this request
> is to request underlying storage to preallocate disk space for the given
> block range. Block devices that support this capability will export
> a provision limit within their request queues.
>
> This patch also adds the capability to call fallocate() in mode 0
> on block devices, which will send REQ_OP_PROVISION to the block
> device for the specified range.
>
> Signed-off-by: Sarthak Kukreti <sarthakkukreti@chromium.org>
> ---
...
> diff --git a/block/blk-settings.c b/block/blk-settings.c
> index 896b4654ab00..d303e6614c36 100644
> --- a/block/blk-settings.c
> +++ b/block/blk-settings.c
> @@ -59,6 +59,7 @@ void blk_set_default_limits(struct queue_limits *lim)
> lim->zoned = BLK_ZONED_NONE;
> lim->zone_write_granularity = 0;
> lim->dma_alignment = 511;
> + lim->max_provision_sectors = 0;
> }
>
> /**
> @@ -82,6 +83,7 @@ void blk_set_stacking_limits(struct queue_limits *lim)
> lim->max_dev_sectors = UINT_MAX;
> lim->max_write_zeroes_sectors = UINT_MAX;
> lim->max_zone_append_sectors = UINT_MAX;
> + lim->max_provision_sectors = UINT_MAX;
> }
> EXPORT_SYMBOL(blk_set_stacking_limits);
>
> @@ -578,6 +594,9 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
> t->max_segment_size = min_not_zero(t->max_segment_size,
> b->max_segment_size);
>
> + t->max_provision_sectors = min_not_zero(t->max_provision_sectors,
> + b->max_provision_sectors);
> +
This needs to use min() since max_provision_sectors also serves to
indicate if the device supports REQ_OP_PROVISION. Otherwise, if I set
max_provision_sectors to 0 on a dm thin-pool, blk_stack_limits()
will ignore my having set it to 0 (to disable) and it'll remain as
UINT_MAX (thanks to blk_set_stacking_limits).
Mike
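The difference Mike describes is easy to demonstrate with userspace stand-ins for the two kernel macros (a sketch, not kernel code):

```c
#include <assert.h>
#include <limits.h>

/* Userspace stand-ins for the kernel's min()/min_not_zero() macros. */
static unsigned int min_u(unsigned int a, unsigned int b)
{
	return a < b ? a : b;
}

static unsigned int min_not_zero_u(unsigned int a, unsigned int b)
{
	if (a == 0)
		return b;
	if (b == 0)
		return a;
	return min_u(a, b);
}

/*
 * A target that disables REQ_OP_PROVISION sets its limit to 0, while
 * the stacking default is UINT_MAX. min_not_zero() discards the 0,
 * so support stays advertised; plain min() preserves the 0.
 */
```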
* [PATCH v7 3/5] dm: Add block provisioning support
2023-05-18 22:33 [PATCH v7 0/5] Introduce provisioning primitives Sarthak Kukreti
2023-05-18 22:33 ` [PATCH v7 1/5] block: Don't invalidate pagecache for invalid falloc modes Sarthak Kukreti
2023-05-18 22:33 ` [PATCH v7 2/5] block: Introduce provisioning primitives Sarthak Kukreti
@ 2023-05-18 22:33 ` Sarthak Kukreti
2023-05-18 22:33 ` [PATCH v7 4/5] dm-thin: Add REQ_OP_PROVISION support Sarthak Kukreti
` (2 subsequent siblings)
5 siblings, 0 replies; 52+ messages in thread
From: Sarthak Kukreti @ 2023-05-18 22:33 UTC (permalink / raw)
To: dm-devel, linux-block, linux-ext4, linux-kernel, linux-fsdevel
Cc: Jens Axboe, Michael S. Tsirkin, Jason Wang, Stefan Hajnoczi,
Alasdair Kergon, Mike Snitzer, Christoph Hellwig, Brian Foster,
Theodore Ts'o, Andreas Dilger, Bart Van Assche,
Darrick J. Wong
Add block provisioning support for device-mapper targets.
dm-crypt, dm-snap and dm-linear will, by default, pass through
REQ_OP_PROVISION requests to the underlying device, if
supported.
Signed-off-by: Sarthak Kukreti <sarthakkukreti@chromium.org>
---
drivers/md/dm-crypt.c | 4 +++-
drivers/md/dm-linear.c | 1 +
drivers/md/dm-snap.c | 7 +++++++
drivers/md/dm-table.c | 23 +++++++++++++++++++++++
drivers/md/dm.c | 6 ++++++
include/linux/device-mapper.h | 17 +++++++++++++++++
6 files changed, 57 insertions(+), 1 deletion(-)
diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
index 8b47b913ee83..5a7c475ce6fc 100644
--- a/drivers/md/dm-crypt.c
+++ b/drivers/md/dm-crypt.c
@@ -3336,6 +3336,8 @@ static int crypt_ctr(struct dm_target *ti, unsigned int argc, char **argv)
cc->tag_pool_max_sectors <<= cc->sector_shift;
}
+ ti->num_provision_bios = 1;
+
ret = -ENOMEM;
cc->io_queue = alloc_workqueue("kcryptd_io/%s", WQ_MEM_RECLAIM, 1, devname);
if (!cc->io_queue) {
@@ -3390,7 +3392,7 @@ static int crypt_map(struct dm_target *ti, struct bio *bio)
* - for REQ_OP_DISCARD caller must use flush if IO ordering matters
*/
if (unlikely(bio->bi_opf & REQ_PREFLUSH ||
- bio_op(bio) == REQ_OP_DISCARD)) {
+ bio_op(bio) == REQ_OP_DISCARD || bio_op(bio) == REQ_OP_PROVISION)) {
bio_set_dev(bio, cc->dev->bdev);
if (bio_sectors(bio))
bio->bi_iter.bi_sector = cc->start +
diff --git a/drivers/md/dm-linear.c b/drivers/md/dm-linear.c
index f4448d520ee9..74ee27ca551a 100644
--- a/drivers/md/dm-linear.c
+++ b/drivers/md/dm-linear.c
@@ -62,6 +62,7 @@ static int linear_ctr(struct dm_target *ti, unsigned int argc, char **argv)
ti->num_discard_bios = 1;
ti->num_secure_erase_bios = 1;
ti->num_write_zeroes_bios = 1;
+ ti->num_provision_bios = 1;
ti->private = lc;
return 0;
diff --git a/drivers/md/dm-snap.c b/drivers/md/dm-snap.c
index 9c49f53760d0..0dfda50ac4e0 100644
--- a/drivers/md/dm-snap.c
+++ b/drivers/md/dm-snap.c
@@ -1358,6 +1358,7 @@ static int snapshot_ctr(struct dm_target *ti, unsigned int argc, char **argv)
if (s->discard_zeroes_cow)
ti->num_discard_bios = (s->discard_passdown_origin ? 2 : 1);
ti->per_io_data_size = sizeof(struct dm_snap_tracked_chunk);
+ ti->num_provision_bios = 1;
/* Add snapshot to the list of snapshots for this origin */
/* Exceptions aren't triggered till snapshot_resume() is called */
@@ -2003,6 +2004,11 @@ static int snapshot_map(struct dm_target *ti, struct bio *bio)
/* If the block is already remapped - use that, else remap it */
e = dm_lookup_exception(&s->complete, chunk);
if (e) {
+ if (unlikely(bio_op(bio) == REQ_OP_PROVISION)) {
+ bio_endio(bio);
+ r = DM_MAPIO_SUBMITTED;
+ goto out_unlock;
+ }
remap_exception(s, e, bio, chunk);
if (unlikely(bio_op(bio) == REQ_OP_DISCARD) &&
io_overlaps_chunk(s, bio)) {
@@ -2413,6 +2419,7 @@ static void snapshot_io_hints(struct dm_target *ti, struct queue_limits *limits)
/* All discards are split on chunk_size boundary */
limits->discard_granularity = snap->store->chunk_size;
limits->max_discard_sectors = snap->store->chunk_size;
+ limits->max_provision_sectors = snap->store->chunk_size;
up_read(&_origins_lock);
}
diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index 1398f1d6e83e..4b2998c1e1dc 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -1845,6 +1845,26 @@ static bool dm_table_supports_write_zeroes(struct dm_table *t)
return true;
}
+static int device_provision_capable(struct dm_target *ti, struct dm_dev *dev,
+ sector_t start, sector_t len, void *data)
+{
+ return bdev_max_provision_sectors(dev->bdev);
+}
+
+static bool dm_table_supports_provision(struct dm_table *t)
+{
+ for (unsigned int i = 0; i < t->num_targets; i++) {
+ struct dm_target *ti = dm_table_get_target(t, i);
+
+ if (ti->provision_supported ||
+ (ti->type->iterate_devices &&
+ ti->type->iterate_devices(ti, device_provision_capable, NULL)))
+ return true;
+ }
+
+ return false;
+}
+
static int device_not_nowait_capable(struct dm_target *ti, struct dm_dev *dev,
sector_t start, sector_t len, void *data)
{
@@ -1978,6 +1998,9 @@ int dm_table_set_restrictions(struct dm_table *t, struct request_queue *q,
if (!dm_table_supports_write_zeroes(t))
q->limits.max_write_zeroes_sectors = 0;
+ if (!dm_table_supports_provision(t))
+ q->limits.max_provision_sectors = 0;
+
dm_table_verify_integrity(t);
/*
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 3b694ba3a106..9b94121b8d38 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1609,6 +1609,7 @@ static bool is_abnormal_io(struct bio *bio)
case REQ_OP_DISCARD:
case REQ_OP_SECURE_ERASE:
case REQ_OP_WRITE_ZEROES:
+ case REQ_OP_PROVISION:
return true;
default:
break;
@@ -1641,6 +1642,11 @@ static blk_status_t __process_abnormal_io(struct clone_info *ci,
if (ti->max_write_zeroes_granularity)
max_granularity = limits->max_write_zeroes_sectors;
break;
+ case REQ_OP_PROVISION:
+ num_bios = ti->num_provision_bios;
+ if (ti->max_provision_granularity)
+ max_granularity = limits->max_provision_sectors;
+ break;
default:
break;
}
diff --git a/include/linux/device-mapper.h b/include/linux/device-mapper.h
index a52d2b9a6846..9981378457d2 100644
--- a/include/linux/device-mapper.h
+++ b/include/linux/device-mapper.h
@@ -334,6 +334,12 @@ struct dm_target {
*/
unsigned int num_write_zeroes_bios;
+ /*
+ * The number of PROVISION bios that will be submitted to the target.
+ * The bio number can be accessed with dm_bio_get_target_bio_nr.
+ */
+ unsigned int num_provision_bios;
+
/*
* The minimum number of extra bytes allocated in each io for the
* target to use.
@@ -358,6 +364,11 @@ struct dm_target {
*/
bool discards_supported:1;
+ /* Set if this target needs to receive provision requests regardless of
+ * whether or not its underlying devices have support.
+ */
+ bool provision_supported:1;
+
/*
* Set if this target requires that discards be split on
* 'max_discard_sectors' boundaries.
@@ -376,6 +387,12 @@ struct dm_target {
*/
bool max_write_zeroes_granularity:1;
+ /*
+ * Set if this target requires that provisions be split on
+ * 'max_provision_sectors' boundaries.
+ */
+ bool max_provision_granularity:1;
+
/*
* Set if we need to limit the number of in-flight bios when swapping.
*/
--
2.40.1.698.g37aff9b760-goog
* [PATCH v7 4/5] dm-thin: Add REQ_OP_PROVISION support
2023-05-18 22:33 [PATCH v7 0/5] Introduce provisioning primitives Sarthak Kukreti
` (2 preceding siblings ...)
2023-05-18 22:33 ` [PATCH v7 3/5] dm: Add block provisioning support Sarthak Kukreti
@ 2023-05-18 22:33 ` Sarthak Kukreti
2023-05-19 15:23 ` Mike Snitzer
2023-05-18 22:33 ` [PATCH v7 5/5] loop: Add support for provision requests Sarthak Kukreti
2023-05-19 4:09 ` [PATCH v7 0/5] Introduce provisioning primitives Christoph Hellwig
5 siblings, 1 reply; 52+ messages in thread
From: Sarthak Kukreti @ 2023-05-18 22:33 UTC (permalink / raw)
To: dm-devel, linux-block, linux-ext4, linux-kernel, linux-fsdevel
Cc: Jens Axboe, Michael S. Tsirkin, Jason Wang, Stefan Hajnoczi,
Alasdair Kergon, Mike Snitzer, Christoph Hellwig, Brian Foster,
Theodore Ts'o, Andreas Dilger, Bart Van Assche,
Darrick J. Wong
dm-thinpool uses the provision request to provision
blocks for a dm-thin device. dm-thinpool currently does not
pass through REQ_OP_PROVISION to underlying devices.
For shared blocks, provision requests will break sharing and copy the
contents of the entire block. Additionally, if 'skip_block_zeroing'
is not set, dm-thin will opt to zero out the entire range as a part
of provisioning.
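The behaviour described above can be summarised as a small decision table. This is an interpretive sketch of the commit message only; the enum and helper are invented for illustration and do not appear in the patch:

```c
#include <assert.h>

enum provision_action {
	PROV_NOOP,		/* block already provisioned, not shared */
	PROV_BREAK_SHARING,	/* copy contents of the whole block, then remap */
	PROV_ZERO,		/* zero the newly allocated block */
	PROV_MAP_ONLY		/* allocate without zeroing */
};

/*
 * Interpretive sketch (names invented; not from the patch) of the
 * dm-thin provisioning policy described in the commit message.
 */
static enum provision_action provision_action(int provisioned, int shared,
					      int skip_block_zeroing)
{
	if (provisioned && shared)
		return PROV_BREAK_SHARING;
	if (provisioned)
		return PROV_NOOP;
	return skip_block_zeroing ? PROV_MAP_ONLY : PROV_ZERO;
}
```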
Signed-off-by: Sarthak Kukreti <sarthakkukreti@chromium.org>
---
drivers/md/dm-thin.c | 74 +++++++++++++++++++++++++++++++++++++++++---
1 file changed, 70 insertions(+), 4 deletions(-)
diff --git a/drivers/md/dm-thin.c b/drivers/md/dm-thin.c
index 2b13c949bd72..f1b68b558cf0 100644
--- a/drivers/md/dm-thin.c
+++ b/drivers/md/dm-thin.c
@@ -274,6 +274,7 @@ struct pool {
process_bio_fn process_bio;
process_bio_fn process_discard;
+ process_bio_fn process_provision;
process_cell_fn process_cell;
process_cell_fn process_discard_cell;
@@ -913,7 +914,8 @@ static void __inc_remap_and_issue_cell(void *context,
struct bio *bio;
while ((bio = bio_list_pop(&cell->bios))) {
- if (op_is_flush(bio->bi_opf) || bio_op(bio) == REQ_OP_DISCARD)
+ if (op_is_flush(bio->bi_opf) || bio_op(bio) == REQ_OP_DISCARD ||
+ bio_op(bio) == REQ_OP_PROVISION)
bio_list_add(&info->defer_bios, bio);
else {
inc_all_io_entry(info->tc->pool, bio);
@@ -1245,8 +1247,8 @@ static int io_overlaps_block(struct pool *pool, struct bio *bio)
static int io_overwrites_block(struct pool *pool, struct bio *bio)
{
- return (bio_data_dir(bio) == WRITE) &&
- io_overlaps_block(pool, bio);
+ return (bio_data_dir(bio) == WRITE) && io_overlaps_block(pool, bio) &&
+ bio_op(bio) != REQ_OP_PROVISION;
}
static void save_and_set_endio(struct bio *bio, bio_end_io_t **save,
@@ -1394,6 +1396,9 @@ static void schedule_zero(struct thin_c *tc, dm_block_t virt_block,
m->data_block = data_block;
m->cell = cell;
+ if (bio && bio_op(bio) == REQ_OP_PROVISION)
+ m->bio = bio;
+
/*
* If the whole block of data is being overwritten or we are not
* zeroing pre-existing data, we can issue the bio immediately.
@@ -1953,6 +1958,51 @@ static void provision_block(struct thin_c *tc, struct bio *bio, dm_block_t block
}
}
+static void process_provision_bio(struct thin_c *tc, struct bio *bio)
+{
+ int r;
+ struct pool *pool = tc->pool;
+ dm_block_t block = get_bio_block(tc, bio);
+ struct dm_bio_prison_cell *cell;
+ struct dm_cell_key key;
+ struct dm_thin_lookup_result lookup_result;
+
+ /*
+ * If cell is already occupied, then the block is already
+ * being provisioned so we have nothing further to do here.
+ */
+ build_virtual_key(tc->td, block, &key);
+ if (bio_detain(pool, &key, bio, &cell))
+ return;
+
+ if (tc->requeue_mode) {
+ cell_requeue(pool, cell);
+ return;
+ }
+
+ r = dm_thin_find_block(tc->td, block, 1, &lookup_result);
+ switch (r) {
+ case 0:
+ if (lookup_result.shared) {
+ process_shared_bio(tc, bio, block, &lookup_result, cell);
+ } else {
+ bio_endio(bio);
+ cell_defer_no_holder(tc, cell);
+ }
+ break;
+ case -ENODATA:
+ provision_block(tc, bio, block, cell);
+ break;
+
+ default:
+ DMERR_LIMIT("%s: dm_thin_find_block() failed: error = %d",
+ __func__, r);
+ cell_defer_no_holder(tc, cell);
+ bio_io_error(bio);
+ break;
+ }
+}
+
static void process_cell(struct thin_c *tc, struct dm_bio_prison_cell *cell)
{
int r;
@@ -2228,6 +2278,8 @@ static void process_thin_deferred_bios(struct thin_c *tc)
if (bio_op(bio) == REQ_OP_DISCARD)
pool->process_discard(tc, bio);
+ else if (bio_op(bio) == REQ_OP_PROVISION)
+ pool->process_provision(tc, bio);
else
pool->process_bio(tc, bio);
@@ -2579,6 +2631,7 @@ static void set_pool_mode(struct pool *pool, enum pool_mode new_mode)
dm_pool_metadata_read_only(pool->pmd);
pool->process_bio = process_bio_fail;
pool->process_discard = process_bio_fail;
+ pool->process_provision = process_bio_fail;
pool->process_cell = process_cell_fail;
pool->process_discard_cell = process_cell_fail;
pool->process_prepared_mapping = process_prepared_mapping_fail;
@@ -2592,6 +2645,7 @@ static void set_pool_mode(struct pool *pool, enum pool_mode new_mode)
dm_pool_metadata_read_only(pool->pmd);
pool->process_bio = process_bio_read_only;
pool->process_discard = process_bio_success;
+ pool->process_provision = process_bio_fail;
pool->process_cell = process_cell_read_only;
pool->process_discard_cell = process_cell_success;
pool->process_prepared_mapping = process_prepared_mapping_fail;
@@ -2612,6 +2666,7 @@ static void set_pool_mode(struct pool *pool, enum pool_mode new_mode)
pool->out_of_data_space = true;
pool->process_bio = process_bio_read_only;
pool->process_discard = process_discard_bio;
+ pool->process_provision = process_bio_fail;
pool->process_cell = process_cell_read_only;
pool->process_prepared_mapping = process_prepared_mapping;
set_discard_callbacks(pool);
@@ -2628,6 +2683,7 @@ static void set_pool_mode(struct pool *pool, enum pool_mode new_mode)
dm_pool_metadata_read_write(pool->pmd);
pool->process_bio = process_bio;
pool->process_discard = process_discard_bio;
+ pool->process_provision = process_provision_bio;
pool->process_cell = process_cell;
pool->process_prepared_mapping = process_prepared_mapping;
set_discard_callbacks(pool);
@@ -2749,7 +2805,8 @@ static int thin_bio_map(struct dm_target *ti, struct bio *bio)
return DM_MAPIO_SUBMITTED;
}
- if (op_is_flush(bio->bi_opf) || bio_op(bio) == REQ_OP_DISCARD) {
+ if (op_is_flush(bio->bi_opf) || bio_op(bio) == REQ_OP_DISCARD ||
+ bio_op(bio) == REQ_OP_PROVISION) {
thin_defer_bio_with_throttle(tc, bio);
return DM_MAPIO_SUBMITTED;
}
@@ -3396,6 +3453,9 @@ static int pool_ctr(struct dm_target *ti, unsigned int argc, char **argv)
pt->adjusted_pf = pt->requested_pf = pf;
ti->num_flush_bios = 1;
ti->limit_swap_bios = true;
+ ti->num_provision_bios = 1;
+ ti->provision_supported = true;
+ ti->max_provision_granularity = true;
/*
* Only need to enable discards if the pool should pass
@@ -4094,6 +4154,8 @@ static void pool_io_hints(struct dm_target *ti, struct queue_limits *limits)
blk_limits_io_opt(limits, pool->sectors_per_block << SECTOR_SHIFT);
}
+ limits->max_provision_sectors = pool->sectors_per_block;
+
/*
* pt->adjusted_pf is a staging area for the actual features to use.
* They get transferred to the live pool in bind_control_target()
@@ -4288,6 +4350,10 @@ static int thin_ctr(struct dm_target *ti, unsigned int argc, char **argv)
ti->max_discard_granularity = true;
}
+ ti->num_provision_bios = 1;
+ ti->provision_supported = true;
+ ti->max_provision_granularity = true;
+
mutex_unlock(&dm_thin_pool_table.mutex);
spin_lock_irq(&tc->pool->lock);
--
2.40.1.698.g37aff9b760-goog
* Re: [PATCH v7 4/5] dm-thin: Add REQ_OP_PROVISION support
2023-05-18 22:33 ` [PATCH v7 4/5] dm-thin: Add REQ_OP_PROVISION support Sarthak Kukreti
@ 2023-05-19 15:23 ` Mike Snitzer
2023-06-08 21:24 ` Mike Snitzer
0 siblings, 1 reply; 52+ messages in thread
From: Mike Snitzer @ 2023-05-19 15:23 UTC (permalink / raw)
To: Sarthak Kukreti, Joe Thornber
Cc: dm-devel, linux-block, linux-ext4, linux-kernel, linux-fsdevel,
Jens Axboe, Theodore Ts'o, Michael S. Tsirkin,
Darrick J. Wong, Jason Wang, Bart Van Assche, Christoph Hellwig,
Andreas Dilger, Stefan Hajnoczi, Brian Foster, Alasdair Kergon
On Thu, May 18 2023 at 6:33P -0400,
Sarthak Kukreti <sarthakkukreti@chromium.org> wrote:
> dm-thinpool uses the provision request to provision
> blocks for a dm-thin device. dm-thinpool currently does not
> pass through REQ_OP_PROVISION to underlying devices.
>
> For shared blocks, provision requests will break sharing and copy the
> contents of the entire block. Additionally, if 'skip_block_zeroing'
> is not set, dm-thin will opt to zero out the entire range as a part
> of provisioning.
>
> Signed-off-by: Sarthak Kukreti <sarthakkukreti@chromium.org>
> ---
> drivers/md/dm-thin.c | 74 +++++++++++++++++++++++++++++++++++++++++---
> 1 file changed, 70 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/md/dm-thin.c b/drivers/md/dm-thin.c
> index 2b13c949bd72..f1b68b558cf0 100644
> --- a/drivers/md/dm-thin.c
> +++ b/drivers/md/dm-thin.c
> @@ -1245,8 +1247,8 @@ static int io_overlaps_block(struct pool *pool, struct bio *bio)
>
> static int io_overwrites_block(struct pool *pool, struct bio *bio)
> {
> - return (bio_data_dir(bio) == WRITE) &&
> - io_overlaps_block(pool, bio);
> + return (bio_data_dir(bio) == WRITE) && io_overlaps_block(pool, bio) &&
> + bio_op(bio) != REQ_OP_PROVISION;
> }
>
> static void save_and_set_endio(struct bio *bio, bio_end_io_t **save,
> @@ -1394,6 +1396,9 @@ static void schedule_zero(struct thin_c *tc, dm_block_t virt_block,
> m->data_block = data_block;
> m->cell = cell;
>
> + if (bio && bio_op(bio) == REQ_OP_PROVISION)
> + m->bio = bio;
> +
> /*
> * If the whole block of data is being overwritten or we are not
> * zeroing pre-existing data, we can issue the bio immediately.
This doesn't seem like the best way to address avoiding passdown of
provision bios (relying on process_prepared_mapping's implementation
that happens to do the right thing if m->bio set). Doing so cascades
into relying on complete_overwrite_bio() happening to _not_ actually
being specific to "overwrite" bios.
I don't have a better suggestion yet but will look closer. Just think
this needs to be formalized a bit more rather than it happening to
"just work".
Cc'ing Joe to see what he thinks too. This is something we can clean
up with a follow-on patch though, so not a show-stopper for this
series.
Mike
* Re: [PATCH v7 4/5] dm-thin: Add REQ_OP_PROVISION support
2023-05-19 15:23 ` Mike Snitzer
@ 2023-06-08 21:24 ` Mike Snitzer
2023-06-09 0:28 ` Mike Snitzer
0 siblings, 1 reply; 52+ messages in thread
From: Mike Snitzer @ 2023-06-08 21:24 UTC (permalink / raw)
To: Sarthak Kukreti, Joe Thornber, Brian Foster
Cc: dm-devel, linux-block, linux-ext4, linux-kernel, linux-fsdevel,
Jens Axboe, Theodore Ts'o, Michael S. Tsirkin,
Darrick J. Wong, Jason Wang, Bart Van Assche, Christoph Hellwig,
Andreas Dilger, Stefan Hajnoczi, Alasdair Kergon
On Fri, May 19 2023 at 11:23P -0400,
Mike Snitzer <snitzer@kernel.org> wrote:
> On Thu, May 18 2023 at 6:33P -0400,
> Sarthak Kukreti <sarthakkukreti@chromium.org> wrote:
>
> > dm-thinpool uses the provision request to provision
> > blocks for a dm-thin device. dm-thinpool currently does not
> > pass through REQ_OP_PROVISION to underlying devices.
> >
> > For shared blocks, provision requests will break sharing and copy the
> > contents of the entire block. Additionally, if 'skip_block_zeroing'
> > is not set, dm-thin will opt to zero out the entire range as a part
> > of provisioning.
> >
> > Signed-off-by: Sarthak Kukreti <sarthakkukreti@chromium.org>
> > ---
> > drivers/md/dm-thin.c | 74 +++++++++++++++++++++++++++++++++++++++++---
> > 1 file changed, 70 insertions(+), 4 deletions(-)
> >
> > diff --git a/drivers/md/dm-thin.c b/drivers/md/dm-thin.c
> > index 2b13c949bd72..f1b68b558cf0 100644
> > --- a/drivers/md/dm-thin.c
> > +++ b/drivers/md/dm-thin.c
> > @@ -1245,8 +1247,8 @@ static int io_overlaps_block(struct pool *pool, struct bio *bio)
> >
> > static int io_overwrites_block(struct pool *pool, struct bio *bio)
> > {
> > - return (bio_data_dir(bio) == WRITE) &&
> > - io_overlaps_block(pool, bio);
> > + return (bio_data_dir(bio) == WRITE) && io_overlaps_block(pool, bio) &&
> > + bio_op(bio) != REQ_OP_PROVISION;
> > }
> >
> > static void save_and_set_endio(struct bio *bio, bio_end_io_t **save,
> > @@ -1394,6 +1396,9 @@ static void schedule_zero(struct thin_c *tc, dm_block_t virt_block,
> > m->data_block = data_block;
> > m->cell = cell;
> >
> > + if (bio && bio_op(bio) == REQ_OP_PROVISION)
> > + m->bio = bio;
> > +
> > /*
> > * If the whole block of data is being overwritten or we are not
> > * zeroing pre-existing data, we can issue the bio immediately.
>
> This doesn't seem like the best way to address avoiding passdown of
> provision bios (relying on process_prepared_mapping's implementation
> that happens to do the right thing if m->bio set). Doing so cascades
> into relying on complete_overwrite_bio() happening to _not_ actually
> being specific to "overwrite" bios.
>
> I don't have a better suggestion yet but will look closer. Just think
> this needs to be formalized a bit more rather than it happening to
> "just work".
>
> Cc'ing Joe to see what he thinks too. This is something we can clean
> up with a follow-on patch though, so not a show-stopper for this
> series.
I haven't circled back to look close enough at this but
REQ_OP_PROVISION bios _are_ being passed down to the thin-pool's
underlying data device.
Brian Foster reported that if he issues a REQ_OP_PROVISION to a thin
device after a snapshot (to break sharing), it'll fail with
-EOPNOTSUPP (response is from bio being passed down to device that
doesn't support it). I was able to reproduce with:
# fallocate --offset 0 --length 1048576 /dev/test/thin
# lvcreate -n snap --snapshot test/thin
# fallocate --offset 0 --length 1048576 /dev/test/thin
fallocate: fallocate failed: Operation not supported
But reports success when retried:
# fallocate --offset 0 --length 1048576 /dev/test/thin
# echo $?
0
It's somewhat moot in that Joe will be reimplementing handling for
REQ_OP_PROVISION _but_ in the meantime it'd be nice to have a version
of this patch that doesn't error (due to passdown of REQ_OP_PROVISION)
when breaking sharing. Primarily so the XFS guys (Dave and Brian) can
make progress.
I'll take a closer look tomorrow but figured I'd let you know.
Mike
* Re: [PATCH v7 4/5] dm-thin: Add REQ_OP_PROVISION support
2023-06-08 21:24 ` Mike Snitzer
@ 2023-06-09 0:28 ` Mike Snitzer
0 siblings, 0 replies; 52+ messages in thread
From: Mike Snitzer @ 2023-06-09 0:28 UTC (permalink / raw)
To: Sarthak Kukreti, Joe Thornber, Brian Foster
Cc: dm-devel, linux-block, linux-ext4, linux-kernel, linux-fsdevel,
Jens Axboe, Theodore Ts'o, Michael S. Tsirkin,
Darrick J. Wong, Jason Wang, Bart Van Assche, Christoph Hellwig,
Andreas Dilger, Stefan Hajnoczi, Alasdair Kergon
On Thu, Jun 08 2023 at 5:24P -0400,
Mike Snitzer <snitzer@kernel.org> wrote:
> On Fri, May 19 2023 at 11:23P -0400,
> Mike Snitzer <snitzer@kernel.org> wrote:
>
> > On Thu, May 18 2023 at 6:33P -0400,
> > Sarthak Kukreti <sarthakkukreti@chromium.org> wrote:
> >
> > > dm-thinpool uses the provision request to provision
> > > blocks for a dm-thin device. dm-thinpool currently does not
> > > pass through REQ_OP_PROVISION to underlying devices.
> > >
> > > For shared blocks, provision requests will break sharing and copy the
> > > contents of the entire block. Additionally, if 'skip_block_zeroing'
> > > is not set, dm-thin will opt to zero out the entire range as a part
> > > of provisioning.
> > >
> > > Signed-off-by: Sarthak Kukreti <sarthakkukreti@chromium.org>
> > > ---
> > > drivers/md/dm-thin.c | 74 +++++++++++++++++++++++++++++++++++++++++---
> > > 1 file changed, 70 insertions(+), 4 deletions(-)
> > >
> > > diff --git a/drivers/md/dm-thin.c b/drivers/md/dm-thin.c
> > > index 2b13c949bd72..f1b68b558cf0 100644
> > > --- a/drivers/md/dm-thin.c
> > > +++ b/drivers/md/dm-thin.c
> > > @@ -1245,8 +1247,8 @@ static int io_overlaps_block(struct pool *pool, struct bio *bio)
> > >
> > > static int io_overwrites_block(struct pool *pool, struct bio *bio)
> > > {
> > > - return (bio_data_dir(bio) == WRITE) &&
> > > - io_overlaps_block(pool, bio);
> > > + return (bio_data_dir(bio) == WRITE) && io_overlaps_block(pool, bio) &&
> > > + bio_op(bio) != REQ_OP_PROVISION;
> > > }
> > >
> > > static void save_and_set_endio(struct bio *bio, bio_end_io_t **save,
> > > @@ -1394,6 +1396,9 @@ static void schedule_zero(struct thin_c *tc, dm_block_t virt_block,
> > > m->data_block = data_block;
> > > m->cell = cell;
> > >
> > > + if (bio && bio_op(bio) == REQ_OP_PROVISION)
> > > + m->bio = bio;
> > > +
> > > /*
> > > * If the whole block of data is being overwritten or we are not
> > > * zeroing pre-existing data, we can issue the bio immediately.
> >
> > This doesn't seem like the best way to address avoiding passdown of
> > provision bios (relying on process_prepared_mapping's implementation
> > that happens to do the right thing if m->bio set). Doing so cascades
> > into relying on complete_overwrite_bio() happening to _not_ actually
> > being specific to "overwrite" bios.
> >
> > I don't have a better suggestion yet but will look closer. Just think
> > this needs to be formalized a bit more rather than it happening to
> > "just work".
> >
> > Cc'ing Joe to see what he thinks too. This is something we can clean
> > up with a follow-on patch though, so not a show-stopper for this
> > series.
>
> I haven't circled back to look close enough at this but
> REQ_OP_PROVISION bios _are_ being passed down to the thin-pool's
> underlying data device.
>
> Brian Foster reported that if he issues a REQ_OP_PROVISION to a thin
> device after a snapshot (to break sharing), it'll fail with
> -EOPNOTSUPP (response is from bio being passed down to device that
> doesn't support it). I was able to reproduce with:
>
> # fallocate --offset 0 --length 1048576 /dev/test/thin
> # lvcreate -n snap --snapshot test/thin
>
> # fallocate --offset 0 --length 1048576 /dev/test/thin
> fallocate: fallocate failed: Operation not supported
>
> But reports success when retried:
> # fallocate --offset 0 --length 1048576 /dev/test/thin
> # echo $?
> 0
>
> It's somewhat moot in that Joe will be reimplementing handling for
> REQ_OP_PROVISION _but_ in the meantime it'd be nice to have a version
> of this patch that doesn't error (due to passdown of REQ_OP_PROVISION)
> when breaking sharing. Primarily so the XFS guys (Dave and Brian) can
> make progress.
>
> I'll take a closer look tomorrow but figured I'd let you know.
This fixes the issue for me (causes process_prepared_mapping to end
the bio without REQ_OP_PROVISION passdown).
But like I said above back on May 19: needs cleanup to make it less of
a hack for the REQ_OP_PROVISION case. At a minimum
complete_overwrite_bio() would need renaming.
diff --git a/drivers/md/dm-thin.c b/drivers/md/dm-thin.c
index 43a6702f9efe..434a3b37af2f 100644
--- a/drivers/md/dm-thin.c
+++ b/drivers/md/dm-thin.c
@@ -1324,6 +1324,9 @@ static void schedule_copy(struct thin_c *tc, dm_block_t virt_block,
m->data_block = data_dest;
m->cell = cell;
+ if (bio_op(bio) == REQ_OP_PROVISION)
+ m->bio = bio;
+
/*
* quiesce action + copy action + an extra reference held for the
* duration of this function (we may need to inc later for a
* [PATCH v7 5/5] loop: Add support for provision requests
2023-05-18 22:33 [PATCH v7 0/5] Introduce provisioning primitives Sarthak Kukreti
` (3 preceding siblings ...)
2023-05-18 22:33 ` [PATCH v7 4/5] dm-thin: Add REQ_OP_PROVISION support Sarthak Kukreti
@ 2023-05-18 22:33 ` Sarthak Kukreti
2023-05-22 16:37 ` [dm-devel] " Darrick J. Wong
2023-05-19 4:09 ` [PATCH v7 0/5] Introduce provisioning primitives Christoph Hellwig
5 siblings, 1 reply; 52+ messages in thread
From: Sarthak Kukreti @ 2023-05-18 22:33 UTC (permalink / raw)
To: dm-devel, linux-block, linux-ext4, linux-kernel, linux-fsdevel
Cc: Jens Axboe, Michael S. Tsirkin, Jason Wang, Stefan Hajnoczi,
Alasdair Kergon, Mike Snitzer, Christoph Hellwig, Brian Foster,
Theodore Ts'o, Andreas Dilger, Bart Van Assche,
Darrick J. Wong
Add support for provision requests to loopback devices.
Loop devices will configure provision support based on
whether the underlying block device/file can support
the provision request and upon receiving a provision bio,
will map it to the backing device/storage. For loop devices
over files, a REQ_OP_PROVISION request will translate to
an fallocate mode 0 call on the backing file.
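As a rough userspace illustration of what that mode-0 fallocate amounts to (a sketch only; `os.posix_fallocate()` stands in for the raw fallocate(2) call, which glibc issues where the filesystem supports it):

```python
import os
import tempfile

# Provisioning a range on a file-backed loop device boils down to a
# plain (mode 0) fallocate() on the backing file: space is allocated
# without overwriting existing data. Sketch only; the loop driver does
# this in-kernel via file->f_op->fallocate().
backing = tempfile.NamedTemporaryFile()
os.posix_fallocate(backing.fileno(), 0, 1 << 20)   # "provision" 1 MiB
provisioned = os.fstat(backing.fileno()).st_size
print(provisioned)  # → 1048576
backing.close()
```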
Signed-off-by: Sarthak Kukreti <sarthakkukreti@chromium.org>
---
drivers/block/loop.c | 34 +++++++++++++++++++++++++++++++---
1 file changed, 31 insertions(+), 3 deletions(-)
diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index bc31bb7072a2..7fe1a6629754 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -311,16 +311,20 @@ static int lo_fallocate(struct loop_device *lo, struct request *rq, loff_t pos,
{
/*
* We use fallocate to manipulate the space mappings used by the image
- * a.k.a. discard/zerorange.
+ * a.k.a. discard/provision/zerorange.
*/
struct file *file = lo->lo_backing_file;
int ret;
- mode |= FALLOC_FL_KEEP_SIZE;
+ if (mode & (FALLOC_FL_PUNCH_HOLE | FALLOC_FL_ZERO_RANGE) &&
+ !bdev_max_discard_sectors(lo->lo_device))
+ return -EOPNOTSUPP;
- if (!bdev_max_discard_sectors(lo->lo_device))
+ if (mode == 0 && !bdev_max_provision_sectors(lo->lo_device))
return -EOPNOTSUPP;
+ mode |= FALLOC_FL_KEEP_SIZE;
+
ret = file->f_op->fallocate(file, mode, pos, blk_rq_bytes(rq));
if (unlikely(ret && ret != -EINVAL && ret != -EOPNOTSUPP))
return -EIO;
@@ -488,6 +492,8 @@ static int do_req_filebacked(struct loop_device *lo, struct request *rq)
FALLOC_FL_PUNCH_HOLE);
case REQ_OP_DISCARD:
return lo_fallocate(lo, rq, pos, FALLOC_FL_PUNCH_HOLE);
+ case REQ_OP_PROVISION:
+ return lo_fallocate(lo, rq, pos, 0);
case REQ_OP_WRITE:
if (cmd->use_aio)
return lo_rw_aio(lo, cmd, pos, ITER_SOURCE);
@@ -754,6 +760,25 @@ static void loop_sysfs_exit(struct loop_device *lo)
&loop_attribute_group);
}
+static void loop_config_provision(struct loop_device *lo)
+{
+ struct file *file = lo->lo_backing_file;
+ struct inode *inode = file->f_mapping->host;
+
+ /*
+ * If the backing device is a block device, mirror its provisioning
+ * capability.
+ */
+ if (S_ISBLK(inode->i_mode)) {
+ blk_queue_max_provision_sectors(lo->lo_queue,
+ bdev_max_provision_sectors(I_BDEV(inode)));
+ } else if (file->f_op->fallocate) {
+ blk_queue_max_provision_sectors(lo->lo_queue, UINT_MAX >> 9);
+ } else {
+ blk_queue_max_provision_sectors(lo->lo_queue, 0);
+ }
+}
+
static void loop_config_discard(struct loop_device *lo)
{
struct file *file = lo->lo_backing_file;
@@ -1092,6 +1117,7 @@ static int loop_configure(struct loop_device *lo, fmode_t mode,
blk_queue_io_min(lo->lo_queue, bsize);
loop_config_discard(lo);
+ loop_config_provision(lo);
loop_update_rotational(lo);
loop_update_dio(lo);
loop_sysfs_init(lo);
@@ -1304,6 +1330,7 @@ loop_set_status(struct loop_device *lo, const struct loop_info64 *info)
}
loop_config_discard(lo);
+ loop_config_provision(lo);
/* update dio if lo_offset or transfer is changed */
__loop_update_dio(lo, lo->use_dio);
@@ -1830,6 +1857,7 @@ static blk_status_t loop_queue_rq(struct blk_mq_hw_ctx *hctx,
case REQ_OP_FLUSH:
case REQ_OP_DISCARD:
case REQ_OP_WRITE_ZEROES:
+ case REQ_OP_PROVISION:
cmd->use_aio = false;
break;
default:
--
2.40.1.698.g37aff9b760-goog
* Re: [dm-devel] [PATCH v7 5/5] loop: Add support for provision requests
2023-05-18 22:33 ` [PATCH v7 5/5] loop: Add support for provision requests Sarthak Kukreti
@ 2023-05-22 16:37 ` Darrick J. Wong
2023-05-22 22:09 ` Sarthak Kukreti
0 siblings, 1 reply; 52+ messages in thread
From: Darrick J. Wong @ 2023-05-22 16:37 UTC (permalink / raw)
To: Sarthak Kukreti
Cc: dm-devel, linux-block, linux-ext4, linux-kernel, linux-fsdevel,
Jens Axboe, Theodore Ts'o, Michael S. Tsirkin, Jason Wang,
Bart Van Assche, Mike Snitzer, Christoph Hellwig, Andreas Dilger,
Stefan Hajnoczi, Brian Foster, Alasdair Kergon
On Thu, May 18, 2023 at 03:33:26PM -0700, Sarthak Kukreti wrote:
> Add support for provision requests to loopback devices.
> Loop devices will configure provision support based on
> whether the underlying block device/file can support
> the provision request and upon receiving a provision bio,
> will map it to the backing device/storage. For loop devices
> over files, a REQ_OP_PROVISION request will translate to
> an fallocate mode 0 call on the backing file.
>
> Signed-off-by: Sarthak Kukreti <sarthakkukreti@chromium.org>
> ---
> drivers/block/loop.c | 34 +++++++++++++++++++++++++++++++---
> 1 file changed, 31 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/block/loop.c b/drivers/block/loop.c
> index bc31bb7072a2..7fe1a6629754 100644
> --- a/drivers/block/loop.c
> +++ b/drivers/block/loop.c
> @@ -311,16 +311,20 @@ static int lo_fallocate(struct loop_device *lo, struct request *rq, loff_t pos,
> {
> /*
> * We use fallocate to manipulate the space mappings used by the image
> - * a.k.a. discard/zerorange.
> + * a.k.a. discard/provision/zerorange.
> */
> struct file *file = lo->lo_backing_file;
> int ret;
>
> - mode |= FALLOC_FL_KEEP_SIZE;
> + if (mode & (FALLOC_FL_PUNCH_HOLE | FALLOC_FL_ZERO_RANGE) &&
> + !bdev_max_discard_sectors(lo->lo_device))
> + return -EOPNOTSUPP;
>
> - if (!bdev_max_discard_sectors(lo->lo_device))
> + if (mode == 0 && !bdev_max_provision_sectors(lo->lo_device))
> return -EOPNOTSUPP;
>
> + mode |= FALLOC_FL_KEEP_SIZE;
> +
> ret = file->f_op->fallocate(file, mode, pos, blk_rq_bytes(rq));
> if (unlikely(ret && ret != -EINVAL && ret != -EOPNOTSUPP))
> return -EIO;
> @@ -488,6 +492,8 @@ static int do_req_filebacked(struct loop_device *lo, struct request *rq)
> FALLOC_FL_PUNCH_HOLE);
> case REQ_OP_DISCARD:
> return lo_fallocate(lo, rq, pos, FALLOC_FL_PUNCH_HOLE);
> + case REQ_OP_PROVISION:
> + return lo_fallocate(lo, rq, pos, 0);
If someone calls fallocate(UNSHARE_RANGE) on a loop bdev, shouldn't
there be a way to pass that through to the fallocate call to the backing
file?
--D
> case REQ_OP_WRITE:
> if (cmd->use_aio)
> return lo_rw_aio(lo, cmd, pos, ITER_SOURCE);
> @@ -754,6 +760,25 @@ static void loop_sysfs_exit(struct loop_device *lo)
> &loop_attribute_group);
> }
>
> +static void loop_config_provision(struct loop_device *lo)
> +{
> + struct file *file = lo->lo_backing_file;
> + struct inode *inode = file->f_mapping->host;
> +
> + /*
> + * If the backing device is a block device, mirror its provisioning
> + * capability.
> + */
> + if (S_ISBLK(inode->i_mode)) {
> + blk_queue_max_provision_sectors(lo->lo_queue,
> + bdev_max_provision_sectors(I_BDEV(inode)));
> + } else if (file->f_op->fallocate) {
> + blk_queue_max_provision_sectors(lo->lo_queue, UINT_MAX >> 9);
> + } else {
> + blk_queue_max_provision_sectors(lo->lo_queue, 0);
> + }
> +}
> +
> static void loop_config_discard(struct loop_device *lo)
> {
> struct file *file = lo->lo_backing_file;
> @@ -1092,6 +1117,7 @@ static int loop_configure(struct loop_device *lo, fmode_t mode,
> blk_queue_io_min(lo->lo_queue, bsize);
>
> loop_config_discard(lo);
> + loop_config_provision(lo);
> loop_update_rotational(lo);
> loop_update_dio(lo);
> loop_sysfs_init(lo);
> @@ -1304,6 +1330,7 @@ loop_set_status(struct loop_device *lo, const struct loop_info64 *info)
> }
>
> loop_config_discard(lo);
> + loop_config_provision(lo);
>
> /* update dio if lo_offset or transfer is changed */
> __loop_update_dio(lo, lo->use_dio);
> @@ -1830,6 +1857,7 @@ static blk_status_t loop_queue_rq(struct blk_mq_hw_ctx *hctx,
> case REQ_OP_FLUSH:
> case REQ_OP_DISCARD:
> case REQ_OP_WRITE_ZEROES:
> + case REQ_OP_PROVISION:
> cmd->use_aio = false;
> break;
> default:
> --
> 2.40.1.698.g37aff9b760-goog
>
> --
> dm-devel mailing list
> dm-devel@redhat.com
> https://listman.redhat.com/mailman/listinfo/dm-devel
>
* [PATCH v7 5/5] loop: Add support for provision requests
2023-05-22 16:37 ` [dm-devel] " Darrick J. Wong
@ 2023-05-22 22:09 ` Sarthak Kukreti
2023-05-23 1:22 ` Darrick J. Wong
0 siblings, 1 reply; 52+ messages in thread
From: Sarthak Kukreti @ 2023-05-22 22:09 UTC (permalink / raw)
To: dm-devel, linux-block, linux-ext4, linux-kernel, linux-fsdevel
Cc: Jens Axboe, Michael S. Tsirkin, Jason Wang, Stefan Hajnoczi,
Alasdair Kergon, Mike Snitzer, Christoph Hellwig, Brian Foster,
Theodore Ts'o, Andreas Dilger, Bart Van Assche,
Darrick J. Wong, Sarthak Kukreti
On Mon, May 22, 2023 at 9:37 AM Darrick J. Wong <djwong@kernel.org> wrote:
>
> If someone calls fallocate(UNSHARE_RANGE) on a loop bdev, shouldn't
> there be a way to pass that through to the fallocate call to the backing
> file?
>
> --D
>
Yeah, I think we could add a REQ_UNSHARE bit (similar to REQ_NOUNMAP) to pass down the intent to the backing file (and possibly beyond...).
I took a stab at implementing it as a follow-up patch so that there's less review churn on the current series. If it looks good, I can add it to the end of the series (or incorporate this into the existing block and loop patches):
From: Sarthak Kukreti <sarthakkukreti@chromium.org>
Date: Mon, 22 May 2023 14:18:15 -0700
Subject: [PATCH] block: Pass unshare intent via REQ_OP_PROVISION
Allow REQ_OP_PROVISION to pass in an extra REQ_UNSHARE bit to
annotate unshare requests to underlying layers. Layers that support
FALLOC_FL_UNSHARE will be able to use this as an indicator of which
fallocate() mode to use.
Signed-off-by: Sarthak Kukreti <sarthakkukreti@chromium.org>
---
block/blk-lib.c | 6 +++++-
block/fops.c | 6 +++++-
drivers/block/loop.c | 35 +++++++++++++++++++++++++++++------
include/linux/blk_types.h | 3 +++
include/linux/blkdev.h | 3 ++-
5 files changed, 44 insertions(+), 9 deletions(-)
diff --git a/block/blk-lib.c b/block/blk-lib.c
index 3cff5fb654f5..bea6f5a700b3 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -350,6 +350,7 @@ EXPORT_SYMBOL(blkdev_issue_secure_erase);
* @sector: start sector
* @nr_sects: number of sectors to provision
* @gfp_mask: memory allocation flags (for bio_alloc)
+ * @flags: controls detailed behavior
*
* Description:
* Issues a provision request to the block device for the range of sectors.
@@ -357,7 +358,7 @@ EXPORT_SYMBOL(blkdev_issue_secure_erase);
* underlying storage pool to allocate space for this block range.
*/
int blkdev_issue_provision(struct block_device *bdev, sector_t sector,
- sector_t nr_sects, gfp_t gfp)
+ sector_t nr_sects, gfp_t gfp, unsigned flags)
{
sector_t bs_mask = (bdev_logical_block_size(bdev) >> 9) - 1;
unsigned int max_sectors = bdev_max_provision_sectors(bdev);
@@ -380,6 +381,9 @@ int blkdev_issue_provision(struct block_device *bdev, sector_t sector,
bio->bi_iter.bi_sector = sector;
bio->bi_iter.bi_size = req_sects << SECTOR_SHIFT;
+ if (flags & BLKDEV_UNSHARE_RANGE)
+ bio->bi_opf |= REQ_UNSHARE;
+
sector += req_sects;
nr_sects -= req_sects;
if (!nr_sects) {
diff --git a/block/fops.c b/block/fops.c
index be2e41f160bf..6848756f0557 100644
--- a/block/fops.c
+++ b/block/fops.c
@@ -659,7 +659,11 @@ static long blkdev_fallocate(struct file *file, int mode, loff_t start,
case FALLOC_FL_KEEP_SIZE:
case FALLOC_FL_UNSHARE_RANGE | FALLOC_FL_KEEP_SIZE:
error = blkdev_issue_provision(bdev, start >> SECTOR_SHIFT,
- len >> SECTOR_SHIFT, GFP_KERNEL);
+ len >> SECTOR_SHIFT, GFP_KERNEL,
+ (mode &
+ FALLOC_FL_UNSHARE_RANGE) ?
+ BLKDEV_UNSHARE_RANGE :
+ 0);
break;
case FALLOC_FL_ZERO_RANGE:
case FALLOC_FL_ZERO_RANGE | FALLOC_FL_KEEP_SIZE:
diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 7fe1a6629754..c844b145d666 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -306,6 +306,30 @@ static int lo_read_simple(struct loop_device *lo, struct request *rq,
return 0;
}
+static bool validate_fallocate_mode(struct loop_device *lo, int mode)
+{
+ bool ret = true;
+
+ switch (mode) {
+ case FALLOC_FL_PUNCH_HOLE:
+ case FALLOC_FL_ZERO_RANGE:
+ if (!bdev_max_discard_sectors(lo->lo_device))
+ ret = false;
+ break;
+ case 0:
+ case FALLOC_FL_UNSHARE_RANGE:
+ if (!bdev_max_provision_sectors(lo->lo_device))
+ ret = false;
+ break;
+
+ default:
+ ret = false;
+ }
+
+ return ret;
+}
+
+
static int lo_fallocate(struct loop_device *lo, struct request *rq, loff_t pos,
int mode)
{
@@ -316,11 +340,7 @@ static int lo_fallocate(struct loop_device *lo, struct request *rq, loff_t pos,
struct file *file = lo->lo_backing_file;
int ret;
- if (mode & (FALLOC_FL_PUNCH_HOLE | FALLOC_FL_ZERO_RANGE) &&
- !bdev_max_discard_sectors(lo->lo_device))
- return -EOPNOTSUPP;
-
- if (mode == 0 && !bdev_max_provision_sectors(lo->lo_device))
+ if (!validate_fallocate_mode(lo, mode))
return -EOPNOTSUPP;
mode |= FALLOC_FL_KEEP_SIZE;
@@ -493,7 +513,10 @@ static int do_req_filebacked(struct loop_device *lo, struct request *rq)
case REQ_OP_DISCARD:
return lo_fallocate(lo, rq, pos, FALLOC_FL_PUNCH_HOLE);
case REQ_OP_PROVISION:
- return lo_fallocate(lo, rq, pos, 0);
+ return lo_fallocate(lo, rq, pos,
+ (rq->cmd_flags & REQ_UNSHARE) ?
+ FALLOC_FL_UNSHARE_RANGE :
+ 0);
case REQ_OP_WRITE:
if (cmd->use_aio)
return lo_rw_aio(lo, cmd, pos, ITER_SOURCE);
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index b7bb0226fdee..1a536fd897cb 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -423,6 +423,8 @@ enum req_flag_bits {
*/
/* for REQ_OP_WRITE_ZEROES: */
__REQ_NOUNMAP, /* do not free blocks when zeroing */
+ /* for REQ_OP_PROVISION: */
+ __REQ_UNSHARE, /* unshare blocks */
__REQ_NR_BITS, /* stops here */
};
@@ -451,6 +453,7 @@ enum req_flag_bits {
#define REQ_FS_PRIVATE (__force blk_opf_t)(1ULL << __REQ_FS_PRIVATE)
#define REQ_NOUNMAP (__force blk_opf_t)(1ULL << __REQ_NOUNMAP)
+#define REQ_UNSHARE (__force blk_opf_t)(1ULL << __REQ_UNSHARE)
#define REQ_FAILFAST_MASK \
(REQ_FAILFAST_DEV | REQ_FAILFAST_TRANSPORT | REQ_FAILFAST_DRIVER)
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 462ce586d46f..60c09b0d3fc9 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1049,10 +1049,11 @@ int blkdev_issue_secure_erase(struct block_device *bdev, sector_t sector,
sector_t nr_sects, gfp_t gfp);
extern int blkdev_issue_provision(struct block_device *bdev, sector_t sector,
- sector_t nr_sects, gfp_t gfp_mask);
+ sector_t nr_sects, gfp_t gfp_mask, unsigned int flags);
#define BLKDEV_ZERO_NOUNMAP (1 << 0) /* do not free blocks */
#define BLKDEV_ZERO_NOFALLBACK (1 << 1) /* don't write explicit zeroes */
+#define BLKDEV_UNSHARE_RANGE (1 << 2) /* unshare range on provision */
extern int __blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
sector_t nr_sects, gfp_t gfp_mask, struct bio **biop,
--
2.39.2
^ permalink raw reply related [flat|nested] 52+ messages in thread
* Re: [PATCH v7 5/5] loop: Add support for provision requests
2023-05-22 22:09 ` Sarthak Kukreti
@ 2023-05-23 1:22 ` Darrick J. Wong
2023-10-07 1:29 ` Sarthak Kukreti
0 siblings, 1 reply; 52+ messages in thread
From: Darrick J. Wong @ 2023-05-23 1:22 UTC (permalink / raw)
To: Sarthak Kukreti
Cc: dm-devel, linux-block, linux-ext4, linux-kernel, linux-fsdevel,
Jens Axboe, Michael S. Tsirkin, Jason Wang, Stefan Hajnoczi,
Alasdair Kergon, Mike Snitzer, Christoph Hellwig, Brian Foster,
Theodore Ts'o, Andreas Dilger, Bart Van Assche
On Mon, May 22, 2023 at 03:09:55PM -0700, Sarthak Kukreti wrote:
> On Mon, May 22, 2023 at 9:37 AM Darrick J. Wong <djwong@kernel.org> wrote:
> >
> > If someone calls fallocate(UNSHARE_RANGE) on a loop bdev, shouldn't
> > there be a way to pass that through to the fallocate call to the backing
> > file?
> >
> > --D
> >
>
> Yeah, I think we could add a REQ_UNSHARE bit (similar to REQ_NOUNMAP) to pass down the intent to the backing file (and possibly beyond...).
>
I took a stab at implementing it as a follow-up patch so that there's
> less review churn on the current series. If it looks good, I can add
> it to the end of the series (or incorporate this into the existing
> block and loop patches):
It looks like a reasonable addition to the end of the series, assuming
that filling holes in thinp devices is cheap but unsharing snapshot
blocks is not.
> From: Sarthak Kukreti <sarthakkukreti@chromium.org>
> Date: Mon, 22 May 2023 14:18:15 -0700
> Subject: [PATCH] block: Pass unshare intent via REQ_OP_PROVISION
>
> Allow REQ_OP_PROVISION to pass in an extra REQ_UNSHARE bit to
> annotate unshare requests to underlying layers. Layers that support
> FALLOC_FL_UNSHARE will be able to use this as an indicator of which
> fallocate() mode to use.
> Signed-off-by: Sarthak Kukreti <sarthakkukreti@chromium.org>
> ---
> block/blk-lib.c | 6 +++++-
> block/fops.c | 6 +++++-
> drivers/block/loop.c | 35 +++++++++++++++++++++++++++++------
> include/linux/blk_types.h | 3 +++
> include/linux/blkdev.h | 3 ++-
> 5 files changed, 44 insertions(+), 9 deletions(-)
>
> diff --git a/block/blk-lib.c b/block/blk-lib.c
> index 3cff5fb654f5..bea6f5a700b3 100644
> --- a/block/blk-lib.c
> +++ b/block/blk-lib.c
> @@ -350,6 +350,7 @@ EXPORT_SYMBOL(blkdev_issue_secure_erase);
> * @sector: start sector
> * @nr_sects: number of sectors to provision
> * @gfp_mask: memory allocation flags (for bio_alloc)
> + * @flags: controls detailed behavior
> *
> * Description:
> * Issues a provision request to the block device for the range of sectors.
> @@ -357,7 +358,7 @@ EXPORT_SYMBOL(blkdev_issue_secure_erase);
> * underlying storage pool to allocate space for this block range.
> */
> int blkdev_issue_provision(struct block_device *bdev, sector_t sector,
> - sector_t nr_sects, gfp_t gfp)
> + sector_t nr_sects, gfp_t gfp, unsigned flags)
> {
> sector_t bs_mask = (bdev_logical_block_size(bdev) >> 9) - 1;
> unsigned int max_sectors = bdev_max_provision_sectors(bdev);
> @@ -380,6 +381,9 @@ int blkdev_issue_provision(struct block_device *bdev, sector_t sector,
> bio->bi_iter.bi_sector = sector;
> bio->bi_iter.bi_size = req_sects << SECTOR_SHIFT;
>
> + if (flags & BLKDEV_UNSHARE_RANGE)
This is a provisioning flag, shouldn't this be ...
BLKDEV_PROVISION_UNSHARE or something?
> + bio->bi_opf |= REQ_UNSHARE;
> +
> sector += req_sects;
> nr_sects -= req_sects;
> if (!nr_sects) {
> diff --git a/block/fops.c b/block/fops.c
> index be2e41f160bf..6848756f0557 100644
> --- a/block/fops.c
> +++ b/block/fops.c
> @@ -659,7 +659,11 @@ static long blkdev_fallocate(struct file *file, int mode, loff_t start,
> case FALLOC_FL_KEEP_SIZE:
> case FALLOC_FL_UNSHARE_RANGE | FALLOC_FL_KEEP_SIZE:
> error = blkdev_issue_provision(bdev, start >> SECTOR_SHIFT,
> - len >> SECTOR_SHIFT, GFP_KERNEL);
> + len >> SECTOR_SHIFT, GFP_KERNEL,
> + (mode &
> + FALLOC_FL_UNSHARE_RANGE) ?
> + BLKDEV_UNSHARE_RANGE :
> + 0);
You might want to do something about the six level indent here;
Linus hates that.
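A flatter formulation would hoist the flag computation out of the argument list. The sketch below is illustrative only, not the code that landed in v8; the `FALLOC_FL_*` values mirror the uapi fallocate flags and `BLKDEV_UNSHARE_RANGE` matches the patch above, but they are hard-coded here so the snippet stands alone outside the kernel tree:

```c
#include <assert.h>

/* Stand-ins for the kernel constants, so this sketch is self-contained. */
#define FALLOC_FL_KEEP_SIZE     0x01
#define FALLOC_FL_UNSHARE_RANGE 0x40
#define BLKDEV_UNSHARE_RANGE    (1 << 2)

/*
 * Compute the provision flags in a small helper (or a local variable)
 * before the blkdev_issue_provision() call, instead of nesting a
 * ternary six levels deep inside the argument list.
 */
static unsigned int blkdev_fallocate_provision_flags(int mode)
{
	return (mode & FALLOC_FL_UNSHARE_RANGE) ? BLKDEV_UNSHARE_RANGE : 0;
}
```

The call site then stays on two readable lines: compute `flags`, then pass it as the last argument.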
--D
> break;
> case FALLOC_FL_ZERO_RANGE:
> case FALLOC_FL_ZERO_RANGE | FALLOC_FL_KEEP_SIZE:
> diff --git a/drivers/block/loop.c b/drivers/block/loop.c
> index 7fe1a6629754..c844b145d666 100644
> --- a/drivers/block/loop.c
> +++ b/drivers/block/loop.c
> @@ -306,6 +306,30 @@ static int lo_read_simple(struct loop_device *lo, struct request *rq,
> return 0;
> }
>
> +static bool validate_fallocate_mode(struct loop_device *lo, int mode)
> +{
> + bool ret = true;
> +
> + switch (mode) {
> + case FALLOC_FL_PUNCH_HOLE:
> + case FALLOC_FL_ZERO_RANGE:
> + if (!bdev_max_discard_sectors(lo->lo_device))
> + ret = false;
> + break;
> + case 0:
> + case FALLOC_FL_UNSHARE_RANGE:
> + if (!bdev_max_provision_sectors(lo->lo_device))
> + ret = false;
> + break;
> +
> + default:
> + ret = false;
> + }
> +
> + return ret;
> +}
> +
> +
> static int lo_fallocate(struct loop_device *lo, struct request *rq, loff_t pos,
> int mode)
> {
> @@ -316,11 +340,7 @@ static int lo_fallocate(struct loop_device *lo, struct request *rq, loff_t pos,
> struct file *file = lo->lo_backing_file;
> int ret;
>
> - if (mode & (FALLOC_FL_PUNCH_HOLE | FALLOC_FL_ZERO_RANGE) &&
> - !bdev_max_discard_sectors(lo->lo_device))
> - return -EOPNOTSUPP;
> -
> - if (mode == 0 && !bdev_max_provision_sectors(lo->lo_device))
> + if (!validate_fallocate_mode(lo, mode))
> return -EOPNOTSUPP;
>
> mode |= FALLOC_FL_KEEP_SIZE;
> @@ -493,7 +513,10 @@ static int do_req_filebacked(struct loop_device *lo, struct request *rq)
> case REQ_OP_DISCARD:
> return lo_fallocate(lo, rq, pos, FALLOC_FL_PUNCH_HOLE);
> case REQ_OP_PROVISION:
> - return lo_fallocate(lo, rq, pos, 0);
> + return lo_fallocate(lo, rq, pos,
> + (rq->cmd_flags & REQ_UNSHARE) ?
> + FALLOC_FL_UNSHARE_RANGE :
> + 0);
> case REQ_OP_WRITE:
> if (cmd->use_aio)
> return lo_rw_aio(lo, cmd, pos, ITER_SOURCE);
> diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
> index b7bb0226fdee..1a536fd897cb 100644
> --- a/include/linux/blk_types.h
> +++ b/include/linux/blk_types.h
> @@ -423,6 +423,8 @@ enum req_flag_bits {
> */
> /* for REQ_OP_WRITE_ZEROES: */
> __REQ_NOUNMAP, /* do not free blocks when zeroing */
> + /* for REQ_OP_PROVISION: */
> + __REQ_UNSHARE, /* unshare blocks */
>
> __REQ_NR_BITS, /* stops here */
> };
> @@ -451,6 +453,7 @@ enum req_flag_bits {
> #define REQ_FS_PRIVATE (__force blk_opf_t)(1ULL << __REQ_FS_PRIVATE)
>
> #define REQ_NOUNMAP (__force blk_opf_t)(1ULL << __REQ_NOUNMAP)
> +#define REQ_UNSHARE (__force blk_opf_t)(1ULL << __REQ_UNSHARE)
>
> #define REQ_FAILFAST_MASK \
> (REQ_FAILFAST_DEV | REQ_FAILFAST_TRANSPORT | REQ_FAILFAST_DRIVER)
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 462ce586d46f..60c09b0d3fc9 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -1049,10 +1049,11 @@ int blkdev_issue_secure_erase(struct block_device *bdev, sector_t sector,
> sector_t nr_sects, gfp_t gfp);
>
> extern int blkdev_issue_provision(struct block_device *bdev, sector_t sector,
> - sector_t nr_sects, gfp_t gfp_mask);
> + sector_t nr_sects, gfp_t gfp_mask, unsigned int flags);
>
> #define BLKDEV_ZERO_NOUNMAP (1 << 0) /* do not free blocks */
> #define BLKDEV_ZERO_NOFALLBACK (1 << 1) /* don't write explicit zeroes */
> +#define BLKDEV_UNSHARE_RANGE (1 << 2) /* unshare range on provision */
>
> extern int __blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
> sector_t nr_sects, gfp_t gfp_mask, struct bio **biop,
> --
> 2.39.2
>

* Re: [PATCH v7 5/5] loop: Add support for provision requests
2023-05-23 1:22 ` Darrick J. Wong
@ 2023-10-07 1:29 ` Sarthak Kukreti
0 siblings, 0 replies; 52+ messages in thread
From: Sarthak Kukreti @ 2023-10-07 1:29 UTC (permalink / raw)
To: Darrick J. Wong
Cc: dm-devel, linux-block, linux-ext4, linux-kernel, linux-fsdevel,
Jens Axboe, Michael S. Tsirkin, Jason Wang, Stefan Hajnoczi,
Alasdair Kergon, Mike Snitzer, Christoph Hellwig, Brian Foster,
Theodore Ts'o, Andreas Dilger, Bart Van Assche
On Mon, May 22, 2023 at 6:22 PM Darrick J. Wong <djwong@kernel.org> wrote:
>
> On Mon, May 22, 2023 at 03:09:55PM -0700, Sarthak Kukreti wrote:
> > On Mon, May 22, 2023 at 9:37 AM Darrick J. Wong <djwong@kernel.org> wrote:
> > >
> > > If someone calls fallocate(UNSHARE_RANGE) on a loop bdev, shouldn't
> > > there be a way to pass that through to the fallocate call to the backing
> > > file?
> > >
> > > --D
> > >
> >
> > Yeah, I think we could add a REQ_UNSHARE bit (similar to REQ_NOUNMAP) to pass down the intent to the backing file (and possibly beyond...).
> >
> > I took a stab at implementing it as a follow-up patch so that there's
> > less review churn on the current series. If it looks good, I can add
> > it to the end of the series (or incorporate this into the existing
> > block and loop patches):
>
> It looks like a reasonable addition to the end of the series, assuming
> that filling holes in thinp devices is cheap but unsharing snapshot
> blocks is not.
>
> > From: Sarthak Kukreti <sarthakkukreti@chromium.org>
> > Date: Mon, 22 May 2023 14:18:15 -0700
> > Subject: [PATCH] block: Pass unshare intent via REQ_OP_PROVISION
> >
> > Allow REQ_OP_PROVISION to pass in an extra REQ_UNSHARE bit to
> > annotate unshare requests to underlying layers. Layers that support
> > FALLOC_FL_UNSHARE will be able to use this as an indicator of which
> > fallocate() mode to use.
>
> > Signed-off-by: Sarthak Kukreti <sarthakkukreti@chromium.org>
> > ---
> > block/blk-lib.c | 6 +++++-
> > block/fops.c | 6 +++++-
> > drivers/block/loop.c | 35 +++++++++++++++++++++++++++++------
> > include/linux/blk_types.h | 3 +++
> > include/linux/blkdev.h | 3 ++-
> > 5 files changed, 44 insertions(+), 9 deletions(-)
> >
> > diff --git a/block/blk-lib.c b/block/blk-lib.c
> > index 3cff5fb654f5..bea6f5a700b3 100644
> > --- a/block/blk-lib.c
> > +++ b/block/blk-lib.c
> > @@ -350,6 +350,7 @@ EXPORT_SYMBOL(blkdev_issue_secure_erase);
> > * @sector: start sector
> > * @nr_sects: number of sectors to provision
> > * @gfp_mask: memory allocation flags (for bio_alloc)
> > + * @flags: controls detailed behavior
> > *
> > * Description:
> > * Issues a provision request to the block device for the range of sectors.
> > @@ -357,7 +358,7 @@ EXPORT_SYMBOL(blkdev_issue_secure_erase);
> > * underlying storage pool to allocate space for this block range.
> > */
> > int blkdev_issue_provision(struct block_device *bdev, sector_t sector,
> > - sector_t nr_sects, gfp_t gfp)
> > + sector_t nr_sects, gfp_t gfp, unsigned flags)
> > {
> > sector_t bs_mask = (bdev_logical_block_size(bdev) >> 9) - 1;
> > unsigned int max_sectors = bdev_max_provision_sectors(bdev);
> > @@ -380,6 +381,9 @@ int blkdev_issue_provision(struct block_device *bdev, sector_t sector,
> > bio->bi_iter.bi_sector = sector;
> > bio->bi_iter.bi_size = req_sects << SECTOR_SHIFT;
> >
> > + if (flags & BLKDEV_UNSHARE_RANGE)
>
> This is a provisioning flag, shouldn't this be ...
> BLKDEV_PROVISION_UNSHARE or something?
>
Done in v8, thanks!
> > + bio->bi_opf |= REQ_UNSHARE;
> > +
> > sector += req_sects;
> > nr_sects -= req_sects;
> > if (!nr_sects) {
> > diff --git a/block/fops.c b/block/fops.c
> > index be2e41f160bf..6848756f0557 100644
> > --- a/block/fops.c
> > +++ b/block/fops.c
> > @@ -659,7 +659,11 @@ static long blkdev_fallocate(struct file *file, int mode, loff_t start,
> > case FALLOC_FL_KEEP_SIZE:
> > case FALLOC_FL_UNSHARE_RANGE | FALLOC_FL_KEEP_SIZE:
> > error = blkdev_issue_provision(bdev, start >> SECTOR_SHIFT,
> > - len >> SECTOR_SHIFT, GFP_KERNEL);
> > + len >> SECTOR_SHIFT, GFP_KERNEL,
> > + (mode &
> > + FALLOC_FL_UNSHARE_RANGE) ?
> > + BLKDEV_UNSHARE_RANGE :
> > + 0);
>
> You might want to do something about the six level indent here;
> Linus hates that.
>
Thanks for pointing it out, I switched it up a bit in v8 but it still
looks a bit weird to me.
Best
Sarthak
> --D
>
> > break;
> > case FALLOC_FL_ZERO_RANGE:
> > case FALLOC_FL_ZERO_RANGE | FALLOC_FL_KEEP_SIZE:
> > diff --git a/drivers/block/loop.c b/drivers/block/loop.c
> > index 7fe1a6629754..c844b145d666 100644
> > --- a/drivers/block/loop.c
> > +++ b/drivers/block/loop.c
> > @@ -306,6 +306,30 @@ static int lo_read_simple(struct loop_device *lo, struct request *rq,
> > return 0;
> > }
> >
> > +static bool validate_fallocate_mode(struct loop_device *lo, int mode)
> > +{
> > + bool ret = true;
> > +
> > + switch (mode) {
> > + case FALLOC_FL_PUNCH_HOLE:
> > + case FALLOC_FL_ZERO_RANGE:
> > + if (!bdev_max_discard_sectors(lo->lo_device))
> > + ret = false;
> > + break;
> > + case 0:
> > + case FALLOC_FL_UNSHARE_RANGE:
> > + if (!bdev_max_provision_sectors(lo->lo_device))
> > + ret = false;
> > + break;
> > +
> > + default:
> > + ret = false;
> > + }
> > +
> > + return ret;
> > +}
> > +
> > +
> > static int lo_fallocate(struct loop_device *lo, struct request *rq, loff_t pos,
> > int mode)
> > {
> > @@ -316,11 +340,7 @@ static int lo_fallocate(struct loop_device *lo, struct request *rq, loff_t pos,
> > struct file *file = lo->lo_backing_file;
> > int ret;
> >
> > - if (mode & (FALLOC_FL_PUNCH_HOLE | FALLOC_FL_ZERO_RANGE) &&
> > - !bdev_max_discard_sectors(lo->lo_device))
> > - return -EOPNOTSUPP;
> > -
> > - if (mode == 0 && !bdev_max_provision_sectors(lo->lo_device))
> > + if (!validate_fallocate_mode(lo, mode))
> > return -EOPNOTSUPP;
> >
> > mode |= FALLOC_FL_KEEP_SIZE;
> > @@ -493,7 +513,10 @@ static int do_req_filebacked(struct loop_device *lo, struct request *rq)
> > case REQ_OP_DISCARD:
> > return lo_fallocate(lo, rq, pos, FALLOC_FL_PUNCH_HOLE);
> > case REQ_OP_PROVISION:
> > - return lo_fallocate(lo, rq, pos, 0);
> > + return lo_fallocate(lo, rq, pos,
> > + (rq->cmd_flags & REQ_UNSHARE) ?
> > + FALLOC_FL_UNSHARE_RANGE :
> > + 0);
> > case REQ_OP_WRITE:
> > if (cmd->use_aio)
> > return lo_rw_aio(lo, cmd, pos, ITER_SOURCE);
> > diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
> > index b7bb0226fdee..1a536fd897cb 100644
> > --- a/include/linux/blk_types.h
> > +++ b/include/linux/blk_types.h
> > @@ -423,6 +423,8 @@ enum req_flag_bits {
> > */
> > /* for REQ_OP_WRITE_ZEROES: */
> > __REQ_NOUNMAP, /* do not free blocks when zeroing */
> > + /* for REQ_OP_PROVISION: */
> > + __REQ_UNSHARE, /* unshare blocks */
> >
> > __REQ_NR_BITS, /* stops here */
> > };
> > @@ -451,6 +453,7 @@ enum req_flag_bits {
> > #define REQ_FS_PRIVATE (__force blk_opf_t)(1ULL << __REQ_FS_PRIVATE)
> >
> > #define REQ_NOUNMAP (__force blk_opf_t)(1ULL << __REQ_NOUNMAP)
> > +#define REQ_UNSHARE (__force blk_opf_t)(1ULL << __REQ_UNSHARE)
> >
> > #define REQ_FAILFAST_MASK \
> > (REQ_FAILFAST_DEV | REQ_FAILFAST_TRANSPORT | REQ_FAILFAST_DRIVER)
> > diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> > index 462ce586d46f..60c09b0d3fc9 100644
> > --- a/include/linux/blkdev.h
> > +++ b/include/linux/blkdev.h
> > @@ -1049,10 +1049,11 @@ int blkdev_issue_secure_erase(struct block_device *bdev, sector_t sector,
> > sector_t nr_sects, gfp_t gfp);
> >
> > extern int blkdev_issue_provision(struct block_device *bdev, sector_t sector,
> > - sector_t nr_sects, gfp_t gfp_mask);
> > + sector_t nr_sects, gfp_t gfp_mask, unsigned int flags);
> >
> > #define BLKDEV_ZERO_NOUNMAP (1 << 0) /* do not free blocks */
> > #define BLKDEV_ZERO_NOFALLBACK (1 << 1) /* don't write explicit zeroes */
> > +#define BLKDEV_UNSHARE_RANGE (1 << 2) /* unshare range on provision */
> >
> > extern int __blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
> > sector_t nr_sects, gfp_t gfp_mask, struct bio **biop,
> > --
> > 2.39.2
> >
* Re: [PATCH v7 0/5] Introduce provisioning primitives
2023-05-18 22:33 [PATCH v7 0/5] Introduce provisioning primitives Sarthak Kukreti
` (4 preceding siblings ...)
2023-05-18 22:33 ` [PATCH v7 5/5] loop: Add support for provision requests Sarthak Kukreti
@ 2023-05-19 4:09 ` Christoph Hellwig
2023-05-19 14:41 ` Mike Snitzer
5 siblings, 1 reply; 52+ messages in thread
From: Christoph Hellwig @ 2023-05-19 4:09 UTC (permalink / raw)
To: Sarthak Kukreti
Cc: dm-devel, linux-block, linux-ext4, linux-kernel, linux-fsdevel,
Jens Axboe, Michael S. Tsirkin, Jason Wang, Stefan Hajnoczi,
Alasdair Kergon, Mike Snitzer, Christoph Hellwig, Brian Foster,
Theodore Ts'o, Andreas Dilger, Bart Van Assche,
Darrick J. Wong
FYI, I really don't think this primitive is a good idea. In the
context of non-overwritable storage (NAND, SMR drives), the entire
concept of a one-shot 'provisioning' that will guarantee later writes
are always possible is simply bogus.
* Re: [PATCH v7 0/5] Introduce provisioning primitives
2023-05-19 4:09 ` [PATCH v7 0/5] Introduce provisioning primitives Christoph Hellwig
@ 2023-05-19 14:41 ` Mike Snitzer
2023-05-19 23:07 ` Dave Chinner
0 siblings, 1 reply; 52+ messages in thread
From: Mike Snitzer @ 2023-05-19 14:41 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Sarthak Kukreti, dm-devel, linux-block, linux-ext4, linux-kernel,
linux-fsdevel, Jens Axboe, Michael S. Tsirkin, Jason Wang,
Stefan Hajnoczi, Alasdair Kergon, Brian Foster, Theodore Ts'o,
Andreas Dilger, Bart Van Assche, Darrick J. Wong
On Fri, May 19 2023 at 12:09P -0400,
Christoph Hellwig <hch@infradead.org> wrote:
> FYI, I really don't think this primitive is a good idea. In the
> context of non-overwritable storage (NAND, SMR drives), the entire
> concept of a one-shot 'provisioning' that will guarantee later writes
> are always possible is simply bogus.
Valid point for sure, such storage shouldn't advertise support (and
will return -EOPNOTSUPP).
But the primitive still has utility for other classes of storage.
* Re: [PATCH v7 0/5] Introduce provisioning primitives
2023-05-19 14:41 ` Mike Snitzer
@ 2023-05-19 23:07 ` Dave Chinner
2023-05-22 18:27 ` Mike Snitzer
0 siblings, 1 reply; 52+ messages in thread
From: Dave Chinner @ 2023-05-19 23:07 UTC (permalink / raw)
To: Mike Snitzer
Cc: Christoph Hellwig, Sarthak Kukreti, dm-devel, linux-block,
linux-ext4, linux-kernel, linux-fsdevel, Jens Axboe,
Michael S. Tsirkin, Jason Wang, Stefan Hajnoczi, Alasdair Kergon,
Brian Foster, Theodore Ts'o, Andreas Dilger, Bart Van Assche,
Darrick J. Wong
On Fri, May 19, 2023 at 10:41:31AM -0400, Mike Snitzer wrote:
> On Fri, May 19 2023 at 12:09P -0400,
> Christoph Hellwig <hch@infradead.org> wrote:
>
> > FYI, I really don't think this primitive is a good idea. In the
> > context of non-overwritable storage (NAND, SMR drives), the entire
> > concept of a one-shot 'provisioning' that will guarantee later writes
> > are always possible is simply bogus.
>
> Valid point for sure, such storage shouldn't advertise support (and
> will return -EOPNOTSUPP).
>
> But the primitive still has utility for other classes of storage.
Yet the thing people are wanting us filesystem developers to use
this with is thinly provisioned storage that has snapshot
capability. That, by definition, is non-overwritable storage. These
are the use cases people are asking filesystems to gracefully handle
and report errors when the sparse backing store runs out of space.
e.g. journal writes after a snapshot is taken on a busy filesystem
are always an overwrite and this requires more space in the storage
device for the write to succeed. ENOSPC from the backing device for
journal IO is a -fatal error-. Hence if REQ_PROVISION doesn't
guarantee space for overwrites after snapshots, then it's not
actually useful for solving the real world use cases we actually
need device-level provisioning to solve.
It is not viable for filesystems to have to reprovision space for
in-place metadata overwrites after every snapshot - the filesystem
may not even know a snapshot has been taken! And it's not feasible
for filesystems to provision on demand before they modify metadata
because we don't know what metadata is going to need to be modified
before we start modifying metadata in transactions. If we get ENOSPC
from provisioning in the middle of a dirty transaction, it's all
over just the same as if we get ENOSPC during metadata writeback...
Hence what filesystems actually need is device provisioned space to
be -always over-writable- without ENOSPC occurring. Ideally, if we
provision a range of the block device, the block device *must*
guarantee all future writes to that LBA range succeeds. That
guarantee needs to stand until we discard or unmap the LBA range,
and for however many writes we do to that LBA range.
e.g. If the device takes a snapshot, it needs to reprovision the
potential COW ranges that overlap with the provisioned LBA range at
snapshot time. e.g. by re-reserving the space from the backing pool
for the provisioned space so if a COW occurs there is space
guaranteed for it to succeed. If there isn't space in the backing
pool for the reprovisioning, then whatever operation that triggers
the COW behaviour should fail with ENOSPC before doing anything
else....
Software devices like dm-thin/snapshot should really only need to
keep a persistent map of the provisioned space and refresh space
reservations for used space within that map whenever something that
triggers COW behaviour occurs. i.e. a snapshot needs to reset the
provisioned ranges back to "all ranges are freshly provisioned"
before the snapshot is started. If that space is not available in
the backing pool, then the snapshot attempt gets ENOSPC....
That means filesystems only need to provision space for journals and
fixed metadata at mkfs time, and they only need issue a
REQ_PROVISION bio when they first allocate over-write in place
metadata. We already have online discard and/or fstrim for releasing
provisioned space via discards.
This will require some mods to filesystems like ext4 and XFS to
issue REQ_PROVISION and fail gracefully during metadata allocation.
However, doing so means that we can actually harden filesystems
against sparse block device ENOSPC errors by ensuring they will
never occur in critical filesystem structures....
-Dave.
--
Dave Chinner
david@fromorbit.com
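The reservation semantics Dave describes — provisioned space must stay over-writable, and a snapshot must re-reserve COW space for every provisioned block up front or fail with ENOSPC before doing anything else — can be captured in a toy model. This is purely an illustrative sketch of the accounting; the structure and function names are invented and bear no relation to dm-thinp's actual metadata:

```c
#include <assert.h>
#include <errno.h>

/*
 * Toy model: a thin pool tracks free backing-store blocks and the
 * total blocks carrying an overwrite guarantee (i.e. provisioned).
 */
struct toy_pool {
	long free_blocks;        /* unallocated backing-store blocks */
	long provisioned_blocks; /* blocks with an overwrite guarantee */
};

/* REQ_PROVISION: move blocks from the free pool to guaranteed space. */
static int toy_provision(struct toy_pool *p, long nr)
{
	if (p->free_blocks < nr)
		return -ENOSPC;
	p->free_blocks -= nr;
	p->provisioned_blocks += nr;
	return 0;
}

/*
 * Snapshot: reserve a COW copy for every provisioned block, so later
 * overwrites of guaranteed space can never hit ENOSPC. If the pool
 * cannot cover the reservation, the snapshot itself fails up front.
 */
static int toy_snapshot(struct toy_pool *p)
{
	if (p->free_blocks < p->provisioned_blocks)
		return -ENOSPC;
	p->free_blocks -= p->provisioned_blocks;
	return 0;
}
```

The key property the model exhibits: a snapshot can fail with ENOSPC, but a write to provisioned space never can, because its COW space was reserved when the snapshot was taken.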
* Re: [PATCH v7 0/5] Introduce provisioning primitives
2023-05-19 23:07 ` Dave Chinner
@ 2023-05-22 18:27 ` Mike Snitzer
2023-05-23 14:05 ` Brian Foster
0 siblings, 1 reply; 52+ messages in thread
From: Mike Snitzer @ 2023-05-22 18:27 UTC (permalink / raw)
To: Dave Chinner, Joe Thornber
Cc: Jens Axboe, linux-block, Theodore Ts'o, Stefan Hajnoczi,
Michael S. Tsirkin, Darrick J. Wong, Jason Wang, Bart Van Assche,
linux-kernel, Christoph Hellwig, dm-devel, Andreas Dilger,
Sarthak Kukreti, linux-fsdevel, linux-ext4, Brian Foster,
Alasdair Kergon
On Fri, May 19 2023 at 7:07P -0400,
Dave Chinner <david@fromorbit.com> wrote:
> On Fri, May 19, 2023 at 10:41:31AM -0400, Mike Snitzer wrote:
> > On Fri, May 19 2023 at 12:09P -0400,
> > Christoph Hellwig <hch@infradead.org> wrote:
> >
> > > FYI, I really don't think this primitive is a good idea. In the
> > > context of non-overwritable storage (NAND, SMR drives), the entire
> > > concept of a one-shot 'provisioning' that will guarantee later writes
> > > are always possible is simply bogus.
> >
> > Valid point for sure, such storage shouldn't advertise support (and
> > will return -EOPNOTSUPP).
> >
> > But the primitive still has utility for other classes of storage.
>
> Yet the thing people are wanting us filesystem developers to use
> this with is thinly provisioned storage that has snapshot
> capability. That, by definition, is non-overwritable storage. These
> are the use cases people are asking filesystems to gracefully handle
> and report errors when the sparse backing store runs out of space.
DM thinp falls into this category but as you detailed it can be made
to work reliably. To carry that forward we need to first establish
the REQ_PROVISION primitive (with this series).
Follow-on associated dm-thinp enhancements can then serve as reference
for how to take advantage of XFS's ability to operate reliably on
thinly provisioned storage.
> e.g. journal writes after a snapshot is taken on a busy filesystem
> are always an overwrite and this requires more space in the storage
> device for the write to succeed. ENOSPC from the backing device for
> journal IO is a -fatal error-. Hence if REQ_PROVISION doesn't
> guarantee space for overwrites after snapshots, then it's not
> actually useful for solving the real world use cases we actually
> need device-level provisioning to solve.
>
> It is not viable for filesystems to have to reprovision space for
> in-place metadata overwrites after every snapshot - the filesystem
> may not even know a snapshot has been taken! And it's not feasible
> for filesystems to provision on demand before they modify metadata
> because we don't know what metadata is going to need to be modified
> before we start modifying metadata in transactions. If we get ENOSPC
> from provisioning in the middle of a dirty transaction, it's all
> over just the same as if we get ENOSPC during metadata writeback...
>
> Hence what filesystems actually need is device provisioned space to
> be -always over-writable- without ENOSPC occurring. Ideally, if we
> provision a range of the block device, the block device *must*
> guarantee all future writes to that LBA range succeeds. That
> guarantee needs to stand until we discard or unmap the LBA range,
> and for however many writes we do to that LBA range.
>
> e.g. If the device takes a snapshot, it needs to reprovision the
> potential COW ranges that overlap with the provisioned LBA range at
> snapshot time. e.g. by re-reserving the space from the backing pool
> for the provisioned space so if a COW occurs there is space
> guaranteed for it to succeed. If there isn't space in the backing
> pool for the reprovisioning, then whatever operation that triggers
> the COW behaviour should fail with ENOSPC before doing anything
> else....
Happy to implement this in dm-thinp. Each thin block will need a bit
to say if the block must be REQ_PROVISION'd at time of snapshot (and
the resulting block will need the same bit set).
Walking all blocks of a thin device and triggering REQ_PROVISION for
each will obviously make thin snapshot creation take more time.
I think this approach is better than having a dedicated bitmap hooked
off each thin device's metadata (with bitmap being copied and walked
at the time of snapshot). But we'll see... I'll get with Joe to
discuss further.
> Software devices like dm-thin/snapshot should really only need to
> keep a persistent map of the provisioned space and refresh space
> reservations for used space within that map whenever something that
> triggers COW behaviour occurs. i.e. a snapshot needs to reset the
> provisioned ranges back to "all ranges are freshly provisioned"
> before the snapshot is started. If that space is not available in
> the backing pool, then the snapshot attempt gets ENOSPC....
>
> That means filesystems only need to provision space for journals and
> fixed metadata at mkfs time, and they only need issue a
> REQ_PROVISION bio when they first allocate over-write in place
> metadata. We already have online discard and/or fstrim for releasing
> provisioned space via discards.
>
> This will require some mods to filesystems like ext4 and XFS to
> issue REQ_PROVISION and fail gracefully during metadata allocation.
> However, doing so means that we can actually harden filesystems
> against sparse block device ENOSPC errors by ensuring they will
> never occur in critical filesystem structures....
Yes, let's finally _do_ this! ;)
Mike
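The per-block approach Mike sketches above — a bit on each thin block marking it as provisioned, with snapshot creation walking the mappings and re-reserving COW space for every flagged block — can be outlined as follows. The structures here are invented for illustration and are not dm-thinp's actual metadata format:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy mapping entry: one bit says "must re-provision at snapshot time". */
struct toy_block {
	bool provisioned;
};

/*
 * Walk the thin device's mappings, count blocks carrying the
 * provisioned bit, and reserve one COW block from the pool for each.
 * The snapshot fails with ENOSPC before taking effect if the pool
 * cannot cover the whole reservation.
 */
static int toy_snapshot_reserve(const struct toy_block *blocks, size_t nr,
				long *pool_free)
{
	long need = 0;

	for (size_t i = 0; i < nr; i++)
		if (blocks[i].provisioned)
			need++;
	if (*pool_free < need)
		return -ENOSPC;
	*pool_free -= need;
	return 0;
}
```

This walk is what makes snapshot creation take longer, as noted above: the cost is linear in the number of mapped blocks, which is the trade-off against keeping a separate per-device bitmap that would be copied at snapshot time.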
* Re: [PATCH v7 0/5] Introduce provisioning primitives
2023-05-22 18:27 ` Mike Snitzer
@ 2023-05-23 14:05 ` Brian Foster
2023-05-23 15:26 ` Mike Snitzer
0 siblings, 1 reply; 52+ messages in thread
From: Brian Foster @ 2023-05-23 14:05 UTC (permalink / raw)
To: Mike Snitzer
Cc: Dave Chinner, Joe Thornber, Jens Axboe, linux-block,
Theodore Ts'o, Stefan Hajnoczi, Michael S. Tsirkin,
Darrick J. Wong, Jason Wang, Bart Van Assche, linux-kernel,
Christoph Hellwig, dm-devel, Andreas Dilger, Sarthak Kukreti,
linux-fsdevel, linux-ext4, Alasdair Kergon
On Mon, May 22, 2023 at 02:27:57PM -0400, Mike Snitzer wrote:
> On Fri, May 19 2023 at 7:07P -0400,
> Dave Chinner <david@fromorbit.com> wrote:
>
> > On Fri, May 19, 2023 at 10:41:31AM -0400, Mike Snitzer wrote:
> > > On Fri, May 19 2023 at 12:09P -0400,
> > > Christoph Hellwig <hch@infradead.org> wrote:
> > >
> > > > FYI, I really don't think this primitive is a good idea. In the
> > > > context of non-overwritable storage (NAND, SMR drives), the entire
> > > > concept of a one-shot 'provisioning' that will guarantee later writes
> > > > are always possible is simply bogus.
> > >
> > > Valid point for sure, such storage shouldn't advertise support (and
> > > will return -EOPNOTSUPP).
> > >
> > > But the primitive still has utility for other classes of storage.
> >
> > Yet the thing people are wanting us filesystem developers to use
> > this with is thinly provisioned storage that has snapshot
> > capability. That, by definition, is non-overwritable storage. These
> > are the use cases people are asking filesystems to gracefully handle
> > and report errors when the sparse backing store runs out of space.
>
> DM thinp falls into this category but as you detailed it can be made
> to work reliably. To carry that forward we need to first establish
> the REQ_PROVISION primitive (with this series).
>
> Follow-on associated dm-thinp enhancements can then serve as reference
> for how to take advantage of XFS's ability to operate reliably of
> thinly provisioned storage.
>
> > e.g. journal writes after a snapshot is taken on a busy filesystem
> > are always an overwrite and this requires more space in the storage
> > device for the write to succeed. ENOSPC from the backing device for
> > journal IO is a -fatal error-. Hence if REQ_PROVISION doesn't
> > guarantee space for overwrites after snapshots, then it's not
> > actually useful for solving the real world use cases we actually
> > need device-level provisioning to solve.
> >
> > It is not viable for filesystems to have to reprovision space for
> > in-place metadata overwrites after every snapshot - the filesystem
> > may not even know a snapshot has been taken! And it's not feasible
> > for filesystems to provision on demand before they modify metadata
> > because we don't know what metadata is going to need to be modified
> > before we start modifying metadata in transactions. If we get ENOSPC
> > from provisioning in the middle of a dirty transaction, it's all
> > over just the same as if we get ENOSPC during metadata writeback...
> >
> > Hence what filesystems actually need is device provisioned space to
> > be -always over-writable- without ENOSPC occurring. Ideally, if we
> > provision a range of the block device, the block device *must*
> > guarantee all future writes to that LBA range succeeds. That
> > guarantee needs to stand until we discard or unmap the LBA range,
> > and for however many writes we do to that LBA range.
> >
> > e.g. If the device takes a snapshot, it needs to reprovision the
> > potential COW ranges that overlap with the provisioned LBA range at
> > snapshot time. e.g. by re-reserving the space from the backing pool
> > for the provisioned space so if a COW occurs there is space
> > guaranteed for it to succeed. If there isn't space in the backing
> > pool for the reprovisioning, then whatever operation that triggers
> > the COW behaviour should fail with ENOSPC before doing anything
> > else....
>
> Happy to implement this in dm-thinp. Each thin block will need a bit
> to say if the block must be REQ_PROVISION'd at time of snapshot (and
> the resulting block will need the same bit set).
>
> Walking all blocks of a thin device and triggering REQ_PROVISION for
> each will obviously make thin snapshot creation take more time.
>
> I think this approach is better than having a dedicated bitmap hooked
> off each thin device's metadata (with bitmap being copied and walked
> at the time of snapshot). But we'll see... I'll get with Joe to
> discuss further.
>
Hi Mike,
If you recall our most recent discussions on this topic, I was thinking
about the prospect of reserving the entire volume at mount time as an
initial solution to this problem. When looking through some of the old
reservation bits we prototyped years ago, it occurred to me that we have
enough mechanism to actually prototype this.
So FYI, I have some hacky prototype code that essentially has the
filesystem at mount time tell dm it's using the volume and expects all
further writes to succeed. dm-thin acquires reservation for the entire
range of the volume for which writes would require block allocation
(i.e., holes and shared dm blocks) or otherwise warns that the fs cannot
be "safely" mounted.
The reservation pool associates with the thin volume (not the
filesystem), so if a snapshot is requested from dm, the snapshot request
locates the snapshot origin and if it's currently active, increases the
reservation pool to account for outstanding blocks that are about to
become shared, or otherwise fails the snapshot with -ENOSPC. (I suspect
discard needs similar treatment, but I hadn't got to that yet). If the
fs is not active, there is nothing to protect and so the snapshot
proceeds as normal.
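The mount-time and snapshot-time checks described above can be modelled
roughly as follows (illustrative userspace C with hypothetical names; the
actual prototype is dm-thin code and looks nothing like this):

```c
#include <assert.h>

/* Toy model of the reservation scheme; none of these names are real
 * dm-thin interfaces. */
struct thin_pool {
	long free;	/* unreserved free blocks in the backing pool */
	long reserved;	/* blocks reserved on behalf of active volumes */
};

/* "Mount": reserve every block a write could newly allocate
 * (holes and currently shared dm blocks). */
static int reserve_on_mount(struct thin_pool *p, long holes, long shared)
{
	long need = holes + shared;

	if (need > p->free)
		return -1;	/* fs cannot be "safely" mounted */
	p->free -= need;
	p->reserved += need;
	return 0;
}

/* Snapshot of an active origin: its exclusively-owned blocks are about
 * to become shared, so grow the reservation up front or fail -ENOSPC. */
static int reserve_on_snapshot(struct thin_pool *p, long exclusive)
{
	if (exclusive > p->free)
		return -1;	/* snapshot refused with -ENOSPC */
	p->free -= exclusive;
	p->reserved += exclusive;
	return 0;
}
```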
This seems to work on my simple, initial tests for protecting actively
mounted filesystems from dm-thin -ENOSPC. This definitely needs a sanity
check from dm-thin folks, however, because I don't know enough about the
broader subsystem to reason about whether it's sufficiently correct. I
just managed to beat the older prototype code into submission to get it
to do what I wanted on simple experiments.
Thoughts on something like this? I think the main advantage is that it
significantly reduces the requirements on the fs to track individual
allocations. It's basically an on/off switch from the fs perspective,
doesn't require any explicit provisioning whatsoever (though it can be
done to improve things in the future) and in fact could probably be tied
to thin volume activation to be made completely filesystem agnostic.
Another advantage is that it requires no on-disk changes, no breaking
COWs up front during snapshots, etc.
The disadvantages are that it's space inefficient wrt thin pool free
space, but IIUC this is essentially what userspace management layers
(such as Stratis) are doing today, they just put restrictions up front
at volume configuration/creation time instead of at runtime. There also
needs to be some kind of interface between the fs and dm. I suppose we
could co-opt provision and discard primitives with a "reservation"
modifier flag to get around that in a simple way, but that sounds
potentially ugly. TBH, the more I think about this the more I think it
makes sense to reserve on volume activation (with some caveats to allow
a read-only mode, explicit bypass, etc.) and then let the
cross-subsystem interface be dictated by granularity improvements...
... since I also happen to think there is a potentially interesting
development path to make this sort of reserve pool configurable in terms
of size and active/inactive state, which would allow the fs to use an
emergency pool scheme for managing metadata provisioning and not have to
track and provision individual metadata buffers at all (dealing with
user data is much easier to provision explicitly). So the space
inefficiency thing is potentially just a tradeoff for simplicity, and
filesystems that want more granularity for better behavior could achieve
that with more work. Filesystems that don't would be free to rely on the
simple/basic mechanism provided by dm-thin and still have basic -ENOSPC
protection with very minimal changes.
That's getting too far into the weeds on the future bits, though. This
is essentially 99% a dm-thin approach, so I'm mainly curious if there's
sufficient interest in this sort of "reserve mode" approach to try and
clean it up further and have dm guys look at it, or if you guys see any
obvious issues in what it does that makes it potentially problematic, or
if you would just prefer to go down the path described above...
Brian
> > Software devices like dm-thin/snapshot should really only need to
> > keep a persistent map of the provisioned space and refresh space
> > reservations for used space within that map whenever something that
> > triggers COW behaviour occurs. i.e. a snapshot needs to reset the
> > provisioned ranges back to "all ranges are freshly provisioned"
> > before the snapshot is started. If that space is not available in
> > the backing pool, then the snapshot attempt gets ENOSPC....
> >
> > That means filesystems only need to provision space for journals and
> > fixed metadata at mkfs time, and they only need issue a
> > REQ_PROVISION bio when they first allocate over-write in place
> > metadata. We already have online discard and/or fstrim for releasing
> > provisioned space via discards.
> >
> > This will require some mods to filesystems like ext4 and XFS to
> > issue REQ_PROVISION and fail gracefully during metadata allocation.
> > However, doing so means that we can actually harden filesystems
> > against sparse block device ENOSPC errors by ensuring they will
> > never occur in critical filesystem structures....
>
> Yes, let's finally _do_ this! ;)
>
> Mike
>
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH v7 0/5] Introduce provisioning primitives
2023-05-23 14:05 ` Brian Foster
@ 2023-05-23 15:26 ` Mike Snitzer
2023-05-24 0:40 ` Dave Chinner
0 siblings, 1 reply; 52+ messages in thread
From: Mike Snitzer @ 2023-05-23 15:26 UTC (permalink / raw)
To: Brian Foster
Cc: Jens Axboe, Christoph Hellwig, Theodore Ts'o, Sarthak Kukreti,
dm-devel, Michael S. Tsirkin, Darrick J. Wong, Jason Wang,
Bart Van Assche, Dave Chinner, linux-kernel, linux-block,
Joe Thornber, Andreas Dilger, Stefan Hajnoczi, linux-fsdevel,
linux-ext4, Alasdair Kergon
On Tue, May 23 2023 at 10:05P -0400,
Brian Foster <bfoster@redhat.com> wrote:
> On Mon, May 22, 2023 at 02:27:57PM -0400, Mike Snitzer wrote:
> > On Fri, May 19 2023 at 7:07P -0400,
> > Dave Chinner <david@fromorbit.com> wrote:
> >
...
> > > e.g. If the device takes a snapshot, it needs to reprovision the
> > > potential COW ranges that overlap with the provisioned LBA range at
> > > snapshot time. e.g. by re-reserving the space from the backing pool
> > > for the provisioned space so if a COW occurs there is space
> > > guaranteed for it to succeed. If there isn't space in the backing
> > > pool for the reprovisioning, then whatever operation that triggers
> > > the COW behaviour should fail with ENOSPC before doing anything
> > > else....
> >
> > Happy to implement this in dm-thinp. Each thin block will need a bit
> > to say if the block must be REQ_PROVISION'd at time of snapshot (and
> > the resulting block will need the same bit set).
> >
> > Walking all blocks of a thin device and triggering REQ_PROVISION for
> > each will obviously make thin snapshot creation take more time.
> >
> > I think this approach is better than having a dedicated bitmap hooked
> > off each thin device's metadata (with bitmap being copied and walked
> > at the time of snapshot). But we'll see... I'll get with Joe to
> > discuss further.
> >
>
> Hi Mike,
>
> If you recall our most recent discussions on this topic, I was thinking
> about the prospect of reserving the entire volume at mount time as an
> initial solution to this problem. When looking through some of the old
> reservation bits we prototyped years ago, it occurred to me that we have
> enough mechanism to actually prototype this.
>
> So FYI, I have some hacky prototype code that essentially has the
> filesystem at mount time tell dm it's using the volume and expects all
> further writes to succeed. dm-thin acquires reservation for the entire
> range of the volume for which writes would require block allocation
> (i.e., holes and shared dm blocks) or otherwise warns that the fs cannot
> be "safely" mounted.
>
> The reservation pool associates with the thin volume (not the
> filesystem), so if a snapshot is requested from dm, the snapshot request
> locates the snapshot origin and if it's currently active, increases the
> reservation pool to account for outstanding blocks that are about to
> become shared, or otherwise fails the snapshot with -ENOSPC. (I suspect
> discard needs similar treatment, but I hadn't got to that yet). If the
> fs is not active, there is nothing to protect and so the snapshot
> proceeds as normal.
>
> This seems to work on my simple, initial tests for protecting actively
> mounted filesystems from dm-thin -ENOSPC. This definitely needs a sanity
> check from dm-thin folks, however, because I don't know enough about the
> broader subsystem to reason about whether it's sufficiently correct. I
> just managed to beat the older prototype code into submission to get it
> to do what I wanted on simple experiments.
Feel free to share what you have.
But my initial gut on the approach is: why even use thin provisioning
at all if you're just going to reserve the entire logical address
space of each thin device?
> Thoughts on something like this? I think the main advantage is that it
> significantly reduces the requirements on the fs to track individual
> allocations. It's basically an on/off switch from the fs perspective,
> doesn't require any explicit provisioning whatsoever (though it can be
> done to improve things in the future) and in fact could probably be tied
> to thin volume activation to be made completely filesystem agnostic.
> Another advantage is that it requires no on-disk changes, no breaking
> COWs up front during snapshots, etc.
I'm just really unclear on the details without seeing it.
You shared a roll-up of the code we did from years ago so I can kind
of imagine the nature of the changes. I'm concerned about snapshots,
and the implicit need to compound the reservation for each snapshot.
> The disadvantages are that it's space inefficient wrt thin pool free
> space, but IIUC this is essentially what userspace management layers
> (such as Stratis) are doing today, they just put restrictions up front
> at volume configuration/creation time instead of at runtime. There also
> needs to be some kind of interface between the fs and dm. I suppose we
> could co-opt provision and discard primitives with a "reservation"
> modifier flag to get around that in a simple way, but that sounds
> potentially ugly. TBH, the more I think about this the more I think it
> makes sense to reserve on volume activation (with some caveats to allow
> a read-only mode, explicit bypass, etc.) and then let the
> cross-subsystem interface be dictated by granularity improvements...
It just feels imprecise to the point of being both excessive and
nebulous.
Thin devices, and snapshots of them, can be active without associated
filesystem mounts being active. It just takes a single origin volume
to be mounted, with a snapshot active, to force thin blocks' sharing
to be broken.
> ... since I also happen to think there is a potentially interesting
> development path to make this sort of reserve pool configurable in terms
> of size and active/inactive state, which would allow the fs to use an
> emergency pool scheme for managing metadata provisioning and not have to
> track and provision individual metadata buffers at all (dealing with
> user data is much easier to provision explicitly). So the space
> inefficiency thing is potentially just a tradeoff for simplicity, and
> filesystems that want more granularity for better behavior could achieve
> that with more work. Filesystems that don't would be free to rely on the
> simple/basic mechanism provided by dm-thin and still have basic -ENOSPC
> protection with very minimal changes.
>
> That's getting too far into the weeds on the future bits, though. This
> is essentially 99% a dm-thin approach, so I'm mainly curious if there's
> sufficient interest in this sort of "reserve mode" approach to try and
> clean it up further and have dm guys look at it, or if you guys see any
> obvious issues in what it does that makes it potentially problematic, or
> if you would just prefer to go down the path described above...
The model that Dave detailed, which builds on REQ_PROVISION and is
sticky (by provisioning same blocks for snapshot) seems more useful to
me because it is quite precise. That said, it doesn't account for
hard requirements that _all_ blocks will always succeed. I'm really
not sure we need to go to your extreme (even though Stratis has; the
difference is they did so as a crude means to an end because the
existing filesystem code can easily get caught out by -ENOSPC at
exactly the wrong time).
Mike
> > > Software devices like dm-thin/snapshot should really only need to
> > > keep a persistent map of the provisioned space and refresh space
> > > reservations for used space within that map whenever something that
> > > triggers COW behaviour occurs. i.e. a snapshot needs to reset the
> > > provisioned ranges back to "all ranges are freshly provisioned"
> > > before the snapshot is started. If that space is not available in
> > > the backing pool, then the snapshot attempt gets ENOSPC....
> > >
> > > That means filesystems only need to provision space for journals and
> > > fixed metadata at mkfs time, and they only need issue a
> > > REQ_PROVISION bio when they first allocate over-write in place
> > > metadata. We already have online discard and/or fstrim for releasing
> > > provisioned space via discards.
> > >
> > > This will require some mods to filesystems like ext4 and XFS to
> > > issue REQ_PROVISION and fail gracefully during metadata allocation.
> > > However, doing so means that we can actually harden filesystems
> > > against sparse block device ENOSPC errors by ensuring they will
> > > never occur in critical filesystem structures....
> >
> > Yes, let's finally _do_ this! ;)
> >
> > Mike
> >
>
> --
> dm-devel mailing list
> dm-devel@redhat.com
> https://listman.redhat.com/mailman/listinfo/dm-devel
>
* Re: [PATCH v7 0/5] Introduce provisioning primitives
2023-05-23 15:26 ` Mike Snitzer
@ 2023-05-24 0:40 ` Dave Chinner
2023-05-24 20:02 ` Mike Snitzer
2023-05-25 16:19 ` Brian Foster
0 siblings, 2 replies; 52+ messages in thread
From: Dave Chinner @ 2023-05-24 0:40 UTC (permalink / raw)
To: Mike Snitzer
Cc: Brian Foster, Jens Axboe, Christoph Hellwig, Theodore Ts'o,
Sarthak Kukreti, dm-devel, Michael S. Tsirkin, Darrick J. Wong,
Jason Wang, Bart Van Assche, linux-kernel, linux-block,
Joe Thornber, Andreas Dilger, Stefan Hajnoczi, linux-fsdevel,
linux-ext4, Alasdair Kergon
On Tue, May 23, 2023 at 11:26:18AM -0400, Mike Snitzer wrote:
> On Tue, May 23 2023 at 10:05P -0400, Brian Foster <bfoster@redhat.com> wrote:
> > On Mon, May 22, 2023 at 02:27:57PM -0400, Mike Snitzer wrote:
> > ... since I also happen to think there is a potentially interesting
> > development path to make this sort of reserve pool configurable in terms
> > of size and active/inactive state, which would allow the fs to use an
> > emergency pool scheme for managing metadata provisioning and not have to
> > track and provision individual metadata buffers at all (dealing with
> > user data is much easier to provision explicitly). So the space
> > inefficiency thing is potentially just a tradeoff for simplicity, and
> > filesystems that want more granularity for better behavior could achieve
> > that with more work. Filesystems that don't would be free to rely on the
> > simple/basic mechanism provided by dm-thin and still have basic -ENOSPC
> > protection with very minimal changes.
> >
> > That's getting too far into the weeds on the future bits, though. This
> > is essentially 99% a dm-thin approach, so I'm mainly curious if there's
> > sufficient interest in this sort of "reserve mode" approach to try and
> > clean it up further and have dm guys look at it, or if you guys see any
> > obvious issues in what it does that makes it potentially problematic, or
> > if you would just prefer to go down the path described above...
>
> The model that Dave detailed, which builds on REQ_PROVISION and is
> sticky (by provisioning same blocks for snapshot) seems more useful to
> me because it is quite precise. That said, it doesn't account for
> hard requirements that _all_ blocks will always succeed.
Hmmm. Maybe I'm misunderstanding the "reserve pool" context here,
but I don't think we'd ever need a hard guarantee from the block
device that every write bio issued from the filesystem will succeed
without ENOSPC.
If the block device can provide a guarantee that a provisioned LBA
range is always writable, then everything else is a filesystem level
optimisation problem and we don't have to involve the block device
in any way. All we need is a flag we can read out of the bdev at
mount time to determine if the filesystem should be operating with
LBA provisioning enabled...
e.g. If we need to "pre-provision" a chunk of the LBA space for
filesystem metadata, we can do that ahead of time and track the
pre-provisioned range(s) in the filesystem itself.
In XFS, that could be as simple as having small chunks of each AG
reserved to metadata (e.g. start with the first 100MB) and limiting
all metadata allocation free space searches to that specific block
range. When we run low on that space, we pre-provision another 100MB
chunk and then allocate all metadata out of that new range. If we
start getting ENOSPC from pre-provisioning, then we reduce the size of
the regions and log low space warnings to userspace. If we can't
pre-provision any space at all and we've completely run out, we
simply declare ENOSPC for all incoming operations that require
metadata allocation until pre-provisioning succeeds again.
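That fallback policy could be sketched like this (illustrative userspace
C, not XFS code; the "device" is modelled as a budget of provisionable
bytes, and provision() merely stands in for issuing REQ_PROVISION):

```c
#include <assert.h>

#define CHUNK_MAX (100L << 20)	/* preferred 100MB pre-provision chunk */
#define CHUNK_MIN (1L << 20)	/* smallest region worth trying */

/* Stand-in for issuing REQ_PROVISION to the block device. */
static int provision(long *device_free, long len)
{
	if (len > *device_free)
		return -1;	/* device ENOSPC */
	*device_free -= len;
	return 0;
}

/* Try progressively smaller chunks; returns bytes provisioned, or 0 if
 * even the minimum failed, at which point the fs declares ENOSPC for
 * metadata allocation until pre-provisioning succeeds again. */
static long preprovision_metadata(long *device_free)
{
	long len;

	for (len = CHUNK_MAX; len >= CHUNK_MIN; len /= 2)
		if (provision(device_free, len) == 0)
			return len;	/* warn userspace if len < CHUNK_MAX */
	return 0;
}
```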
This is built entirely on the premise that once proactive backing
device provisioning fails, the backing device is at ENOSPC and we
have to wait for that situation to go away before allowing new data
to be ingested. Hence the block device really doesn't need to know
anything about what the filesystem is doing and vice versa - The
block dev just says "yes" or "no" and the filesystem handles
everything else.
It's worth noting that XFS already has a coarse-grained
implementation of preferred regions for metadata storage. It will
currently not use those metadata-preferred regions for user data
unless all the remaining user data space is full. Hence I'm pretty
sure that a pre-provisioning enhancement like this can be done
entirely in-memory without requiring any new on-disk state to be
added.
Sure, if we crash and remount, then we might choose a different LBA
region for pre-provisioning. But that's not really a huge deal as we
could also run an internal background post-mount fstrim operation to
remove any unused pre-provisioning that was left over from when the
system went down.
Further, managing shared pool exhaustion doesn't require a
reservation pool in the backing device and for the filesystems to
request space from it. Filesystems already have their own reserve
pools via pre-provisioning. If we want the filesystems to be able to
release that space back to the shared pool (e.g. because the shared
backing pool is critically short on space) then all we need is an
extension to FITRIM to tell the filesystem to also release internal
pre-provisioned reserves.
Then the backing pool admin (person or automated daemon!) can simply
issue a trim on all the filesystems in the pool and space will be
returned. Then filesystems will ask for new pre-provisioned space
when they next need to ingest modifications, and the backing pool
can manage the new pre-provisioning space requests directly....
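For one filesystem of that admin sweep, today's FITRIM ioctl already
covers the "release free space" half; the "also release pre-provisioned
reserves" behaviour would need a new flag, and struct fstrim_range has no
flags field today, so that extension is hypothetical and not shown here:

```c
#include <fcntl.h>
#include <limits.h>
#include <linux/fs.h>		/* FITRIM, struct fstrim_range */
#include <sys/ioctl.h>
#include <unistd.h>

/* Trim one mounted filesystem, returning the ioctl result
 * (0 on success, -1 on error). */
static int trim_fs(const char *mountpoint)
{
	struct fstrim_range range = {
		.start = 0,
		.len = ULLONG_MAX,	/* trim the whole filesystem */
		.minlen = 0,
	};
	int fd, ret;

	fd = open(mountpoint, O_RDONLY);
	if (fd < 0)
		return -1;
	ret = ioctl(fd, FITRIM, &range);  /* fs returns free space to pool */
	close(fd);
	return ret;
}
```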
Hence I think if we get the basic REQ_PROVISION overwrite-in-place
guarantees defined and implemented as previously outlined, then we
don't need any special coordination between the fs and block devices
to avoid fatal ENOSPC issues with sparse and/or snapshot capable
block devices...
As a bonus, if we can implement the guarantees in dm-thin/-snapshot
and have a filesystem make use of it, then we also have a reference
implementation to point at device vendors and standards
associations....
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [PATCH v7 0/5] Introduce provisioning primitives
2023-05-24 0:40 ` Dave Chinner
@ 2023-05-24 20:02 ` Mike Snitzer
2023-05-25 11:39 ` Dave Chinner
2023-05-25 16:19 ` Brian Foster
1 sibling, 1 reply; 52+ messages in thread
From: Mike Snitzer @ 2023-05-24 20:02 UTC (permalink / raw)
To: Dave Chinner
Cc: Jens Axboe, linux-block, Theodore Ts'o, Stefan Hajnoczi,
Michael S. Tsirkin, Darrick J. Wong, Brian Foster,
Bart Van Assche, linux-kernel, Joe Thornber, Christoph Hellwig,
dm-devel, Andreas Dilger, Sarthak Kukreti, linux-fsdevel,
linux-ext4, Jason Wang, Alasdair Kergon
On Tue, May 23 2023 at 8:40P -0400,
Dave Chinner <david@fromorbit.com> wrote:
> On Tue, May 23, 2023 at 11:26:18AM -0400, Mike Snitzer wrote:
> > On Tue, May 23 2023 at 10:05P -0400, Brian Foster <bfoster@redhat.com> wrote:
> > > On Mon, May 22, 2023 at 02:27:57PM -0400, Mike Snitzer wrote:
> > > ... since I also happen to think there is a potentially interesting
> > > development path to make this sort of reserve pool configurable in terms
> > > of size and active/inactive state, which would allow the fs to use an
> > > emergency pool scheme for managing metadata provisioning and not have to
> > > track and provision individual metadata buffers at all (dealing with
> > > user data is much easier to provision explicitly). So the space
> > > inefficiency thing is potentially just a tradeoff for simplicity, and
> > > filesystems that want more granularity for better behavior could achieve
> > > that with more work. Filesystems that don't would be free to rely on the
> > > simple/basic mechanism provided by dm-thin and still have basic -ENOSPC
> > > protection with very minimal changes.
> > >
> > > That's getting too far into the weeds on the future bits, though. This
> > > is essentially 99% a dm-thin approach, so I'm mainly curious if there's
> > > sufficient interest in this sort of "reserve mode" approach to try and
> > > clean it up further and have dm guys look at it, or if you guys see any
> > > obvious issues in what it does that makes it potentially problematic, or
> > > if you would just prefer to go down the path described above...
> >
> > The model that Dave detailed, which builds on REQ_PROVISION and is
> > sticky (by provisioning same blocks for snapshot) seems more useful to
> > me because it is quite precise. That said, it doesn't account for
> > hard requirements that _all_ blocks will always succeed.
>
> Hmmm. Maybe I'm misunderstanding the "reserve pool" context here,
> but I don't think we'd ever need a hard guarantee from the block
> device that every write bio issued from the filesystem will succeed
> without ENOSPC.
>
> If the block device can provide a guarantee that a provisioned LBA
> range is always writable, then everything else is a filesystem level
> optimisation problem and we don't have to involve the block device
> in any way. All we need is a flag we can read out of the bdev at
> mount time to determine if the filesystem should be operating with
> LBA provisioning enabled...
>
> e.g. If we need to "pre-provision" a chunk of the LBA space for
> filesystem metadata, we can do that ahead of time and track the
> pre-provisioned range(s) in the filesystem itself.
>
> In XFS, that could be as simple as having small chunks of each AG
> reserved to metadata (e.g. start with the first 100MB) and limiting
> all metadata allocation free space searches to that specific block
> range. When we run low on that space, we pre-provision another 100MB
> chunk and then allocate all metadata out of that new range. If we
> start getting ENOSPC from pre-provisioning, then we reduce the size of
> the regions and log low space warnings to userspace. If we can't
> pre-provision any space at all and we've completely run out, we
> simply declare ENOSPC for all incoming operations that require
> metadata allocation until pre-provisioning succeeds again.
This is basically saying the same thing but:
It could be that the LBA space is fragmented and so falling back to
the smallest region size (that matches the thinp block size) would be
the last resort? Then if/when thinp cannot even service allocating a
new free thin block, dm-thinp will transition to out-of-data-space
mode.
> This is built entirely on the premise that once proactive backing
> device provisioning fails, the backing device is at ENOSPC and we
> have to wait for that situation to go away before allowing new data
> to be ingested. Hence the block device really doesn't need to know
> anything about what the filesystem is doing and vice versa - The
> block dev just says "yes" or "no" and the filesystem handles
> everything else.
Yes.
> It's worth noting that XFS already has a coarse-grained
> implementation of preferred regions for metadata storage. It will
> currently not use those metadata-preferred regions for user data
> unless all the remaining user data space is full. Hence I'm pretty
> sure that a pre-provisioning enhancement like this can be done
> entirely in-memory without requiring any new on-disk state to be
> added.
>
> Sure, if we crash and remount, then we might choose a different LBA
> region for pre-provisioning. But that's not really a huge deal as we
> could also run an internal background post-mount fstrim operation to
> remove any unused pre-provisioning that was left over from when the
> system went down.
This would be the FITRIM with extension you mention below? Which is a
filesystem interface detail? So dm-thinp would _not_ need to have new
state that tracks "provisioned but unused" blocks? Nor would the block
layer need an extra discard flag for a new class of "provisioned"
blocks.
If XFS tracked this "provisioned but unused" state, dm-thinp could
just discard the block like it's told. Would be nice to avoid dm-thinp
needing to track "provisioned but unused".
That said, dm-thinp does still need to know if a block was provisioned
(given our previous designed discussion, to allow proper guarantees
from this interface at snapshot time) so that XFS and other
filesystems don't need to re-provision areas they already
pre-provisioned.
However, it may be that if thinp did track "provisioned but unused"
it'd be useful to allow snapshots to share provisioned blocks that
were never used. Meaning, we could then avoid "breaking sharing" at
snapshot-time for "provisioned but unused" blocks. But allowing this
"optimization" undercuts the gaurantee that XFS needs for thinp
storage that allows snapshots... SO, I think I answered my own
question: thinp doesnt need to track "provisioned but unused" blocks
but we must always ensure snapshots inherit provisoned blocks ;)
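That invariant, re-reserve space for provisioned blocks at snapshot time
and have the snapshot inherit the provisioned bit, could be modelled as
(illustrative userspace C; these are stand-ins, not real dm-thinp
metadata structures):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct thin_block {
	bool mapped;		/* backing space currently allocated */
	bool provisioned;	/* REQ_PROVISION'd: writes must never ENOSPC */
};

struct backing_pool {
	long free_blocks;
};

/* At snapshot time, re-reserve space for a future COW of every
 * provisioned origin block, and make the snapshot inherit the
 * provisioned bit. Fail with -1 (ENOSPC) before establishing any
 * sharing if the pool cannot cover the reservation. */
static int snapshot_inherit(struct backing_pool *p,
			    const struct thin_block *origin,
			    struct thin_block *snap, size_t nblocks)
{
	long needed = 0;
	size_t i;

	for (i = 0; i < nblocks; i++)
		if (origin[i].provisioned)
			needed++;	/* one block reserved per future COW */
	if (needed > p->free_blocks)
		return -1;		/* whole snapshot refused up front */
	p->free_blocks -= needed;
	for (i = 0; i < nblocks; i++)
		snap[i] = origin[i];	/* provisioned bit is inherited */
	return 0;
}
```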
> Further, managing shared pool exhaustion doesn't require a
> reservation pool in the backing device and for the filesystems to
> request space from it. Filesystems already have their own reserve
> pools via pre-provisioning. If we want the filesystems to be able to
> release that space back to the shared pool (e.g. because the shared
> backing pool is critically short on space) then all we need is an
> extension to FITRIM to tell the filesystem to also release internal
> pre-provisioned reserves.
So by default FITRIM will _not_ discard provisioned blocks. Only if
a flag is used will it result in discarding provisioned blocks.
My dwelling on this is just double-checking that the
> Then the backing pool admin (person or automated daemon!) can simply
> issue a trim on all the filesystems in the pool and space will be
> returned. Then filesystems will ask for new pre-provisioned space
> when they next need to ingest modifications, and the backing pool
> can manage the new pre-provisioning space requests directly....
>
> Hence I think if we get the basic REQ_PROVISION overwrite-in-place
> guarantees defined and implemented as previously outlined, then we
> don't need any special coordination between the fs and block devices
> to avoid fatal ENOSPC issues with sparse and/or snapshot capable
> block devices...
>
> As a bonus, if we can implement the guarantees in dm-thin/-snapshot
> and have a filesystem make use of it, then we also have a reference
> implementation to point at device vendors and standards
> associations....
Yeap.
Thanks,
Mike
* Re: [PATCH v7 0/5] Introduce provisioning primitives
2023-05-24 20:02 ` Mike Snitzer
@ 2023-05-25 11:39 ` Dave Chinner
2023-05-25 16:00 ` Mike Snitzer
0 siblings, 1 reply; 52+ messages in thread
From: Dave Chinner @ 2023-05-25 11:39 UTC (permalink / raw)
To: Mike Snitzer
Cc: Jens Axboe, linux-block, Theodore Ts'o, Stefan Hajnoczi,
Michael S. Tsirkin, Darrick J. Wong, Brian Foster,
Bart Van Assche, linux-kernel, Joe Thornber, Christoph Hellwig,
dm-devel, Andreas Dilger, Sarthak Kukreti, linux-fsdevel,
linux-ext4, Jason Wang, Alasdair Kergon
On Wed, May 24, 2023 at 04:02:49PM -0400, Mike Snitzer wrote:
> On Tue, May 23 2023 at 8:40P -0400,
> Dave Chinner <david@fromorbit.com> wrote:
>
> > On Tue, May 23, 2023 at 11:26:18AM -0400, Mike Snitzer wrote:
> > > On Tue, May 23 2023 at 10:05P -0400, Brian Foster <bfoster@redhat.com> wrote:
> > > > On Mon, May 22, 2023 at 02:27:57PM -0400, Mike Snitzer wrote:
> > > > ... since I also happen to think there is a potentially interesting
> > > > development path to make this sort of reserve pool configurable in terms
> > > > of size and active/inactive state, which would allow the fs to use an
> > > > emergency pool scheme for managing metadata provisioning and not have to
> > > > track and provision individual metadata buffers at all (dealing with
> > > > user data is much easier to provision explicitly). So the space
> > > > inefficiency thing is potentially just a tradeoff for simplicity, and
> > > > filesystems that want more granularity for better behavior could achieve
> > > > that with more work. Filesystems that don't would be free to rely on the
> > > > simple/basic mechanism provided by dm-thin and still have basic -ENOSPC
> > > > protection with very minimal changes.
> > > >
> > > > That's getting too far into the weeds on the future bits, though. This
> > > > is essentially 99% a dm-thin approach, so I'm mainly curious if there's
> > > > sufficient interest in this sort of "reserve mode" approach to try and
> > > > clean it up further and have dm guys look at it, or if you guys see any
> > > > obvious issues in what it does that makes it potentially problematic, or
> > > > if you would just prefer to go down the path described above...
> > >
> > > The model that Dave detailed, which builds on REQ_PROVISION and is
> > > sticky (by provisioning same blocks for snapshot) seems more useful to
> > > me because it is quite precise. That said, it doesn't account for
> > > hard requirements that _all_ blocks will always succeed.
> >
> > Hmmm. Maybe I'm misunderstanding the "reserve pool" context here,
> > but I don't think we'd ever need a hard guarantee from the block
> > device that every write bio issued from the filesystem will succeed
> > without ENOSPC.
> >
> > If the block device can provide a guarantee that a provisioned LBA
> > range is always writable, then everything else is a filesystem level
> > optimisation problem and we don't have to involve the block device
> > in any way. All we need is a flag we can read out of the bdev at
> > mount time to determine if the filesystem should be operating with
> > LBA provisioning enabled...
> >
> > e.g. If we need to "pre-provision" a chunk of the LBA space for
> > filesystem metadata, we can do that ahead of time and track the
> > pre-provisioned range(s) in the filesystem itself.
> >
> > In XFS, that could be as simple as having small chunks of each AG
> > reserved to metadata (e.g. start with the first 100MB) and limiting
> > all metadata allocation free space searches to that specific block
> > range. When we run low on that space, we pre-provision another 100MB
> > chunk and then allocate all metadata out of that new range. If we
> > start getting ENOSPC to pre-provisioning, then we reduce the size of
> > the regions and log low space warnings to userspace. If we can't
> > pre-provision any space at all and we've completely run out, we
> > simply declare ENOSPC for all incoming operations that require
> > metadata allocation until pre-provisioning succeeds again.
>
> This is basically saying the same thing but:
>
> It could be that the LBA space is fragmented and so falling back to
> the smallest region size (that matches the thinp block size) would be
> the last resort? Then if/when thinp cannot even service allocating a
> new free thin block, dm-thinp will transition to out-of-data-space
> mode.
Yes, something of that sort, though we'd probably give up if we
can't get at least megabyte scale reservations - a single
modification in XFS can modify many structures and require
allocation of a lot of new metadata, so the filesystem cut-off
for metadata provisioning failure would be much larger than the
dm-thinp region size....
> > This is built entirely on the premise that once proactive backing
> > device provisioning fails, the backing device is at ENOSPC and we
> > have to wait for that situation to go away before allowing new data
> > to be ingested. Hence the block device really doesn't need to know
> > anything about what the filesystem is doing and vice versa - The
> > block dev just says "yes" or "no" and the filesystem handles
> > everything else.
>
> Yes.
>
> > It's worth noting that XFS already has a coarse-grained
> > implementation of preferred regions for metadata storage. It will
> > currently not use those metadata-preferred regions for user data
> > unless all the remaining user data space is full. Hence I'm pretty
> > sure that a pre-provisioning enhancement like this can be done
> > entirely in-memory without requiring any new on-disk state to be
> > added.
> >
> > Sure, if we crash and remount, then we might choose a different LBA
> > region for pre-provisioning. But that's not really a huge deal as we
> > could also run an internal background post-mount fstrim operation to
> > remove any unused pre-provisioning that was left over from when the
> > system went down.
>
> This would be the FITRIM with extension you mention below? Which is a
> filesystem interface detail?
No. We might reuse some of the internal infrastructure we use to
implement FITRIM, but that's about it. It's just something kinda
like FITRIM but with different constraints determined by the
filesystem rather than the user...
As it is, I'm not sure we'd even need it - a periodic userspace
FITRIM would achieve the same result, so leaked provisioned spaces
would get cleaned up eventually without the filesystem having to do
anything specific...
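For reference, the periodic userspace FITRIM mentioned above needs nothing beyond the existing FITRIM ioctl; a minimal sketch (FITRIM and struct fstrim_range are the stock ABI from linux/fs.h; the helper name is made up, and the call needs CAP_SYS_ADMIN plus a filesystem that implements FITRIM):

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>	/* FITRIM, struct fstrim_range */

/* Ask the filesystem mounted at 'path' to discard all unused space.
 * Returns the number of bytes trimmed, or -1 on error. */
long long fstrim_all(const char *path)
{
	int fd = open(path, O_RDONLY | O_DIRECTORY);
	if (fd < 0)
		return -1;

	struct fstrim_range range = {
		.start  = 0,
		.len    = (__u64)-1,	/* whole filesystem */
		.minlen = 0,		/* no minimum extent size */
	};

	long long ret = -1;
	if (ioctl(fd, FITRIM, &range) == 0)
		ret = (long long)range.len; /* kernel writes back bytes trimmed */

	close(fd);
	return ret;
}
```

Run from cron (or just use fstrim(8), which does exactly this), leaked provisioned space would get discarded eventually with no new kernel interface.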
> So dm-thinp would _not_ need to have new
> state that tracks "provisioned but unused" block?
No idea - that's your domain. :)
dm-snapshot, for certain, will need to track provisioned regions
because it has to guarantee that overwrites to provisioned space in
the origin device will always succeed. Hence it needs to know how
much space breaking sharing in provisioned regions after a snapshot
has been taken will be required...
> Nor would the block
> layer need an extra discard flag for a new class of "provisioned"
> blocks.
Right, I don't see that the discard operations need to care whether
the underlying storage is provisioned. dm-thinp and dm-snapshot can
treat REQ_OP_DISCARD as "this range is no longer in use" and do
whatever they want with them.
> If XFS tracked this "provisioned but unused" state, dm-thinp could
> just discard the block like it's told. Would be nice to avoid dm-thinp
> needing to track "provisioned but unused".
>
> That said, dm-thinp does still need to know if a block was provisioned
> (given our previous design discussion, to allow proper guarantees
> from this interface at snapshot time) so that XFS and other
> filesystems don't need to re-provision areas they already
> pre-provisioned.
Right.
I've simply assumed that dm-thinp would need to track entire
provisioned regions - used or unused - so it knows which writes to
empty or shared regions have a reservation to allow allocation to
succeed when the backing pool is otherwise empty.....
> However, it may be that if thinp did track "provisioned but unused"
> it'd be useful to allow snapshots to share provisioned blocks that
> were never used. Meaning, we could then avoid "breaking sharing" at
> snapshot-time for "provisioned but unused" blocks. But allowing this
> "optimization" undercuts the guarantee that XFS needs for thinp
> storage that allows snapshots... SO, I think I answered my own
> question: thinp doesn't need to track "provisioned but unused" blocks
> but we must always ensure snapshots inherit provisioned blocks ;)
Sounds like a potential optimisation, but I haven't thought through
a potential snapshot device implementation that far to comment
sanely. I stopped once I got to the point where accounting tricks
could be used to guarantee space is available for breaking sharing
of used provisioned space after a snapshot was taken....
> > Further, managing shared pool exhaustion doesn't require a
> > reservation pool in the backing device and for the filesystems to
> > request space from it. Filesystems already have their own reserve
> > pools via pre-provisioning. If we want the filesystems to be able to
> > release that space back to the shared pool (e.g. because the shared
> > backing pool is critically short on space) then all we need is an
> > extension to FITRIM to tell the filesystem to also release internal
> > pre-provisioned reserves.
>
> So by default FITRIM will _not_ discard provisioned blocks. Only if
> a flag is used will it result in discarding provisioned blocks.
No. FITRIM results in discard of any unused free space in the
filesystem that matches the criteria set by the user. We don't care
if free space was once provisioned used space - we'll issue a
discard for the range regardless. The "special" FITRIM extension I
mentioned is to get filesystem metadata provisioning released;
that's completely separate to user data provisioning through
fallocate() which FITRIM will always discard if it has been freed...
IOWs, normal behaviour will be that a FITRIM ends up discarding a
mix of unprovisioned and provisioned space. Nobody will be able to
predict what mix the device is going to get at any point in time.
Also, if we turn on online discard, the block device is going to get
a constant stream of discard operations that will also be a mix of
provisioned and unprovisioned space that is no longer in use by the
filesystem.
I suspect that you need to stop trying to double guess what
operations the filesystem will use provisioning for, what it will
send discards for and when it will send discards for them. Just
assume the device will receive a constant stream of both
REQ_PROVISION and REQ_OP_DISCARD (for both provisioned and
unprovisioned regions) operations whenever the filesystem is active
on a thinp device.....
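On the user side, that REQ_PROVISION stream would typically originate from ordinary fallocate(2) calls; a userspace sketch (fallocate is the real syscall, but the helper name and the assumption that a given stack passes the allocation down as REQ_PROVISION are illustrative):

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Preallocate [off, off + len) with mode 0, growing i_size if needed.
 * On a stack with the REQ_PROVISION series applied, the filesystem or
 * block device could translate this allocation into provision requests
 * against the underlying thin device. Returns 0 on success, -1 on error. */
int provision_range(int fd, off_t off, off_t len)
{
	return fallocate(fd, 0, off, len);
}
```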
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [PATCH v7 0/5] Introduce provisioning primitives
2023-05-25 11:39 ` Dave Chinner
@ 2023-05-25 16:00 ` Mike Snitzer
2023-05-25 22:47 ` Sarthak Kukreti
0 siblings, 1 reply; 52+ messages in thread
From: Mike Snitzer @ 2023-05-25 16:00 UTC (permalink / raw)
To: Dave Chinner, Joe Thornber
Cc: Jens Axboe, linux-block, Theodore Ts'o, Stefan Hajnoczi,
Michael S. Tsirkin, Darrick J. Wong, Brian Foster,
Bart Van Assche, linux-kernel, Joe Thornber, Christoph Hellwig,
dm-devel, Andreas Dilger, Sarthak Kukreti, linux-fsdevel,
linux-ext4, Jason Wang, Alasdair Kergon
On Thu, May 25 2023 at 7:39P -0400,
Dave Chinner <david@fromorbit.com> wrote:
> On Wed, May 24, 2023 at 04:02:49PM -0400, Mike Snitzer wrote:
> > On Tue, May 23 2023 at 8:40P -0400,
> > Dave Chinner <david@fromorbit.com> wrote:
> >
> > > On Tue, May 23, 2023 at 11:26:18AM -0400, Mike Snitzer wrote:
> > > > On Tue, May 23 2023 at 10:05P -0400, Brian Foster <bfoster@redhat.com> wrote:
> > > > > On Mon, May 22, 2023 at 02:27:57PM -0400, Mike Snitzer wrote:
> > > > > ... since I also happen to think there is a potentially interesting
> > > > > development path to make this sort of reserve pool configurable in terms
> > > > > of size and active/inactive state, which would allow the fs to use an
> > > > > emergency pool scheme for managing metadata provisioning and not have to
> > > > > track and provision individual metadata buffers at all (dealing with
> > > > > user data is much easier to provision explicitly). So the space
> > > > > inefficiency thing is potentially just a tradeoff for simplicity, and
> > > > > filesystems that want more granularity for better behavior could achieve
> > > > > that with more work. Filesystems that don't would be free to rely on the
> > > > > simple/basic mechanism provided by dm-thin and still have basic -ENOSPC
> > > > > protection with very minimal changes.
> > > > >
> > > > > That's getting too far into the weeds on the future bits, though. This
> > > > > is essentially 99% a dm-thin approach, so I'm mainly curious if there's
> > > > > sufficient interest in this sort of "reserve mode" approach to try and
> > > > > clean it up further and have dm guys look at it, or if you guys see any
> > > > > obvious issues in what it does that makes it potentially problematic, or
> > > > > if you would just prefer to go down the path described above...
> > > >
> > > > The model that Dave detailed, which builds on REQ_PROVISION and is
> > > > sticky (by provisioning same blocks for snapshot) seems more useful to
> > > > me because it is quite precise. That said, it doesn't account for
> > > > hard requirements that _all_ blocks will always succeed.
> > >
> > > Hmmm. Maybe I'm misunderstanding the "reserve pool" context here,
> > > but I don't think we'd ever need a hard guarantee from the block
> > > device that every write bio issued from the filesystem will succeed
> > > without ENOSPC.
> > >
> > > If the block device can provide a guarantee that a provisioned LBA
> > > range is always writable, then everything else is a filesystem level
> > > optimisation problem and we don't have to involve the block device
> > > in any way. All we need is a flag we can read out of the bdev at
> > > mount time to determine if the filesystem should be operating with
> > > LBA provisioning enabled...
> > >
> > > e.g. If we need to "pre-provision" a chunk of the LBA space for
> > > filesystem metadata, we can do that ahead of time and track the
> > > pre-provisioned range(s) in the filesystem itself.
> > >
> > > In XFS, that could be as simple as having small chunks of each AG
> > > reserved to metadata (e.g. start with the first 100MB) and limiting
> > > all metadata allocation free space searches to that specific block
> > > range. When we run low on that space, we pre-provision another 100MB
> > > chunk and then allocate all metadata out of that new range. If we
> > > start getting ENOSPC to pre-provisioning, then we reduce the size of
> > > the regions and log low space warnings to userspace. If we can't
> > > pre-provision any space at all and we've completely run out, we
> > > simply declare ENOSPC for all incoming operations that require
> > > metadata allocation until pre-provisioning succeeds again.
> >
> > This is basically saying the same thing but:
> >
> > It could be that the LBA space is fragmented and so falling back to
> > the smallest region size (that matches the thinp block size) would be
> > the last resort? Then if/when thinp cannot even service allocating a
> > new free thin block, dm-thinp will transition to out-of-data-space
> > mode.
>
> Yes, something of that sort, though we'd probably give up if we
> can't get at least megabyte scale reservations - a single
> modification in XFS can modify many structures and require
> allocation of a lot of new metadata, so the fileystem cut-off would
> for metadata provisioning failure would be much larger than the
> dm-thinp region size....
>
> > > This is built entirely on the premise that once proactive backing
> > > device provisioning fails, the backing device is at ENOSPC and we
> > > have to wait for that situation to go away before allowing new data
> > > to be ingested. Hence the block device really doesn't need to know
> > > anything about what the filesystem is doing and vice versa - The
> > > block dev just says "yes" or "no" and the filesystem handles
> > > everything else.
> >
> > Yes.
> >
> > > It's worth noting that XFS already has a coarse-grained
> > > implementation of preferred regions for metadata storage. It will
> > > currently not use those metadata-preferred regions for user data
> > > unless all the remaining user data space is full. Hence I'm pretty
> > > sure that a pre-provisioning enhancement like this can be done
> > > entirely in-memory without requiring any new on-disk state to be
> > > added.
> > >
> > > Sure, if we crash and remount, then we might choose a different LBA
> > > region for pre-provisioning. But that's not really a huge deal as we
> > > could also run an internal background post-mount fstrim operation to
> > > remove any unused pre-provisioning that was left over from when the
> > > system went down.
> >
> > This would be the FITRIM with extension you mention below? Which is a
> > filesystem interface detail?
>
> No. We might reuse some of the internal infrastructure we use to
> implement FITRIM, but that's about it. It's just something kinda
> like FITRIM but with different constraints determined by the
> filesystem rather than the user...
>
> As it is, I'm not sure we'd even need it - a periodic userspace
> FITRIM would achieve the same result, so leaked provisioned spaces
> would get cleaned up eventually without the filesystem having to do
> anything specific...
>
> > So dm-thinp would _not_ need to have new
> > state that tracks "provisioned but unused" block?
>
> No idea - that's your domain. :)
>
> dm-snapshot, for certain, will need to track provisioned regions
> because it has to guarantee that overwrites to provisioned space in
> the origin device will always succeed. Hence it needs to know how
> much space breaking sharing in provisioned regions after a snapshot
> has been taken will be required...
dm-thinp offers its own, much more scalable snapshot support (it
doesn't use the old dm-snapshot N-way copy-out target).
dm-snapshot isn't going to be modified to support this level of
hardening (dm-snapshot is basically in "maintenance only" now).
But I understand your meaning: what you said is 100% applicable to
dm-thinp's snapshot implementation and needs to be accounted for in
thinp's metadata (inherent 'provisioned' flag).
> > Nor would the block
> > layer need an extra discard flag for a new class of "provisioned"
> > blocks.
>
> Right, I don't see that the discard operations need to care whether
> the underlying storage is provisioned. dm-thinp and dm-snapshot can
> treat REQ_OP_DISCARD as "this range is no longer in use" and do
> whatever they want with them.
>
> > If XFS tracked this "provisioned but unused" state, dm-thinp could
> > just discard the block like it's told. Would be nice to avoid dm-thinp
> > needing to track "provisioned but unused".
> >
> > That said, dm-thinp does still need to know if a block was provisioned
> > (given our previous design discussion, to allow proper guarantees
> > from this interface at snapshot time) so that XFS and other
> > filesystems don't need to re-provision areas they already
> > pre-provisioned.
>
> Right.
>
> I've simply assumed that dm-thinp would need to track entire
> provisioned regions - used or unused - so it knows which writes to
> empty or shared regions have a reservation to allow allocation to
> succeed when the backing pool is otherwise empty.....
>
> > However, it may be that if thinp did track "provisioned but unused"
> > it'd be useful to allow snapshots to share provisioned blocks that
> > were never used. Meaning, we could then avoid "breaking sharing" at
> > snapshot-time for "provisioned but unused" blocks. But allowing this
> > "optimization" undercuts the guarantee that XFS needs for thinp
> > storage that allows snapshots... SO, I think I answered my own
> > question: thinp doesn't need to track "provisioned but unused" blocks
> > but we must always ensure snapshots inherit provisioned blocks ;)
>
> Sounds like a potential optimisation, but I haven't thought through
> a potential snapshot device implementation that far to comment
> sanely. I stopped once I got to the point where accounting tricks
> could be used to guarantee space is available for breaking sharing
> of used provisioned space after a snapshot was taken....
>
> > > Further, managing shared pool exhaustion doesn't require a
> > > reservation pool in the backing device and for the filesystems to
> > > request space from it. Filesystems already have their own reserve
> > > pools via pre-provisioning. If we want the filesystems to be able to
> > > release that space back to the shared pool (e.g. because the shared
> > > backing pool is critically short on space) then all we need is an
> > > extension to FITRIM to tell the filesystem to also release internal
> > > pre-provisioned reserves.
> >
> > So by default FITRIM will _not_ discard provisioned blocks. Only if
> > a flag is used will it result in discarding provisioned blocks.
>
> No. FITRIM results in discard of any unused free space in the
> filesystem that matches the criteria set by the user. We don't care
> if free space was once provisioned used space - we'll issue a
> discard for the range regardless. The "special" FITRIM extension I
> mentioned is to get filesystem metadata provisioning released;
> that's completely separate to user data provisioning through
> fallocate() which FITRIM will always discard if it has been freed...
>
> IOWs, normal behaviour will be that a FITRIM ends up discarding a
> mix of unprovisioned and provisioned space. Nobody will be able to
> predict what mix the device is going to get at any point in time.
> Also, if we turn on online discard, the block device is going to get
> a constant stream of discard operations that will also be a mix of
> provisioned and unprovisioned space that is no longer in use by the
> filesystem.
>
> I suspect that you need to stop trying to double guess what
> operations the filesystem will use provisioning for, what it will
> send discards for and when it will send discards for them. Just
> assume the device will receive a constant stream of both
> REQ_PROVISION and REQ_OP_DISCARD (for both provisioned and
> unprovisioned regions) operations whenever the filesystem is active
> on a thinp device.....
Yeah, I was getting tripped up in the weeds a bit. It's pretty
straightforward (and like I said at the start of our subthread here:
this follow-on work, to inherit the provisioned flag, can build on
this REQ_PROVISION patchset).
All said, I've now gotten this sub-thread on Joe Thornber's radar and
we've started discussing. We'll be discussing with more focus
tomorrow.
Mike
* Re: [PATCH v7 0/5] Introduce provisioning primitives
2023-05-25 16:00 ` Mike Snitzer
@ 2023-05-25 22:47 ` Sarthak Kukreti
2023-05-26 1:36 ` Dave Chinner
0 siblings, 1 reply; 52+ messages in thread
From: Sarthak Kukreti @ 2023-05-25 22:47 UTC (permalink / raw)
To: Mike Snitzer
Cc: Dave Chinner, Joe Thornber, Jens Axboe, linux-block,
Theodore Ts'o, Stefan Hajnoczi, Michael S. Tsirkin,
Darrick J. Wong, Brian Foster, Bart Van Assche, linux-kernel,
Christoph Hellwig, dm-devel, Andreas Dilger, linux-fsdevel,
linux-ext4, Jason Wang, Alasdair Kergon
On Thu, May 25, 2023 at 9:00 AM Mike Snitzer <snitzer@kernel.org> wrote:
>
> On Thu, May 25 2023 at 7:39P -0400,
> Dave Chinner <david@fromorbit.com> wrote:
>
> > On Wed, May 24, 2023 at 04:02:49PM -0400, Mike Snitzer wrote:
> > > On Tue, May 23 2023 at 8:40P -0400,
> > > Dave Chinner <david@fromorbit.com> wrote:
> > >
> > > > On Tue, May 23, 2023 at 11:26:18AM -0400, Mike Snitzer wrote:
> > > > > On Tue, May 23 2023 at 10:05P -0400, Brian Foster <bfoster@redhat.com> wrote:
> > > > > > On Mon, May 22, 2023 at 02:27:57PM -0400, Mike Snitzer wrote:
> > > > > > ... since I also happen to think there is a potentially interesting
> > > > > > development path to make this sort of reserve pool configurable in terms
> > > > > > of size and active/inactive state, which would allow the fs to use an
> > > > > > emergency pool scheme for managing metadata provisioning and not have to
> > > > > > track and provision individual metadata buffers at all (dealing with
> > > > > > user data is much easier to provision explicitly). So the space
> > > > > > inefficiency thing is potentially just a tradeoff for simplicity, and
> > > > > > filesystems that want more granularity for better behavior could achieve
> > > > > > that with more work. Filesystems that don't would be free to rely on the
> > > > > > simple/basic mechanism provided by dm-thin and still have basic -ENOSPC
> > > > > > protection with very minimal changes.
> > > > > >
> > > > > > That's getting too far into the weeds on the future bits, though. This
> > > > > > is essentially 99% a dm-thin approach, so I'm mainly curious if there's
> > > > > > sufficient interest in this sort of "reserve mode" approach to try and
> > > > > > clean it up further and have dm guys look at it, or if you guys see any
> > > > > > obvious issues in what it does that makes it potentially problematic, or
> > > > > > if you would just prefer to go down the path described above...
> > > > >
> > > > > The model that Dave detailed, which builds on REQ_PROVISION and is
> > > > > sticky (by provisioning same blocks for snapshot) seems more useful to
> > > > > me because it is quite precise. That said, it doesn't account for
> > > > > hard requirements that _all_ blocks will always succeed.
> > > >
> > > > Hmmm. Maybe I'm misunderstanding the "reserve pool" context here,
> > > > but I don't think we'd ever need a hard guarantee from the block
> > > > device that every write bio issued from the filesystem will succeed
> > > > without ENOSPC.
> > > >
> > > > If the block device can provide a guarantee that a provisioned LBA
> > > > range is always writable, then everything else is a filesystem level
> > > > optimisation problem and we don't have to involve the block device
> > > > in any way. All we need is a flag we can read out of the bdev at
> > > > mount time to determine if the filesystem should be operating with
> > > > LBA provisioning enabled...
> > > >
> > > > e.g. If we need to "pre-provision" a chunk of the LBA space for
> > > > filesystem metadata, we can do that ahead of time and track the
> > > > pre-provisioned range(s) in the filesystem itself.
> > > >
> > > > In XFS, that could be as simple as having small chunks of each AG
> > > > reserved to metadata (e.g. start with the first 100MB) and limiting
> > > > all metadata allocation free space searches to that specific block
> > > > range. When we run low on that space, we pre-provision another 100MB
> > > > chunk and then allocate all metadata out of that new range. If we
> > > > start getting ENOSPC to pre-provisioning, then we reduce the size of
> > > > the regions and log low space warnings to userspace. If we can't
> > > > pre-provision any space at all and we've completely run out, we
> > > > simply declare ENOSPC for all incoming operations that require
> > > > metadata allocation until pre-provisioning succeeds again.
> > >
> > > This is basically saying the same thing but:
> > >
> > > It could be that the LBA space is fragmented and so falling back to
> > > the smallest region size (that matches the thinp block size) would be
> > > the last resort? Then if/when thinp cannot even service allocating a
> > > new free thin block, dm-thinp will transition to out-of-data-space
> > > mode.
> >
> > Yes, something of that sort, though we'd probably give up if we
> > can't get at least megabyte scale reservations - a single
> > modification in XFS can modify many structures and require
> > allocation of a lot of new metadata, so the filesystem cut-off
> > for metadata provisioning failure would be much larger than the
> > dm-thinp region size....
> >
> > > > This is built entirely on the premise that once proactive backing
> > > > device provisioning fails, the backing device is at ENOSPC and we
> > > > have to wait for that situation to go away before allowing new data
> > > > to be ingested. Hence the block device really doesn't need to know
> > > > anything about what the filesystem is doing and vice versa - The
> > > > block dev just says "yes" or "no" and the filesystem handles
> > > > everything else.
> > >
> > > Yes.
> > >
> > > > It's worth noting that XFS already has a coarse-grained
> > > > implementation of preferred regions for metadata storage. It will
> > > > currently not use those metadata-preferred regions for user data
> > > > unless all the remaining user data space is full. Hence I'm pretty
> > > > sure that a pre-provisioning enhancement like this can be done
> > > > entirely in-memory without requiring any new on-disk state to be
> > > > added.
> > > >
> > > > Sure, if we crash and remount, then we might choose a different LBA
> > > > region for pre-provisioning. But that's not really a huge deal as we
> > > > could also run an internal background post-mount fstrim operation to
> > > > remove any unused pre-provisioning that was left over from when the
> > > > system went down.
> > >
> > > This would be the FITRIM with extension you mention below? Which is a
> > > filesystem interface detail?
> >
> > No. We might reuse some of the internal infrastructure we use to
> > implement FITRIM, but that's about it. It's just something kinda
> > like FITRIM but with different constraints determined by the
> > filesystem rather than the user...
> >
> > As it is, I'm not sure we'd even need it - a periodic userspace
> > FITRIM would achieve the same result, so leaked provisioned spaces
> > would get cleaned up eventually without the filesystem having to do
> > anything specific...
> >
> > > So dm-thinp would _not_ need to have new
> > > state that tracks "provisioned but unused" block?
> >
> > No idea - that's your domain. :)
> >
> > dm-snapshot, for certain, will need to track provisioned regions
> > because it has to guarantee that overwrites to provisioned space in
> > the origin device will always succeed. Hence it needs to know how
> > much space breaking sharing in provisioned regions after a snapshot
> > has been taken will be required...
>
> dm-thinp offers its own much more scalable snapshot support (doesn't
> use old dm-snapshot N-way copyout target).
>
> dm-snapshot isn't going to be modified to support this level of
> hardening (dm-snapshot is basically in "maintenance only" now).
>
> But I understand your meaning: what you said is 100% applicable to
> dm-thinp's snapshot implementation and needs to be accounted for in
> thinp's metadata (inherent 'provisioned' flag).
>
A bit orthogonal: would dm-thinp need to differentiate between
user-triggered provision requests (e.g. from fallocate()) vs
fs-triggered requests? I would lean towards user-provisioned areas not
getting dedup'd on snapshot creation, but that would entail tracking
the state of the original request and possibly a provision request
flag (REQ_PROVISION_DEDUP_ON_SNAPSHOT) or an inverse flag
(REQ_PROVISION_NODEDUP). Possibly too convoluted...
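To make the inverse-flag idea concrete, a sketch (entirely hypothetical: REQ_PROVISION_NODEDUP exists in no kernel tree, and the bit values are invented for illustration):

```c
#include <assert.h>

/* Hypothetical flags sketching the idea above: a provision request
 * could carry a modifier meaning "never share this range at snapshot
 * time". Neither the NODEDUP name nor these bit values exist in any
 * kernel tree; they are illustration only. */
#define REQ_PROVISION		(1u << 0)
#define REQ_PROVISION_NODEDUP	(1u << 1)

/* Would the thin device be allowed to share this provisioned block
 * with a snapshot instead of breaking sharing up front? */
static int may_share_at_snapshot(unsigned int flags)
{
	return !(flags & REQ_PROVISION_NODEDUP);
}
```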
> > > Nor would the block
> > > layer need an extra discard flag for a new class of "provisioned"
> > > blocks.
> >
> > Right, I don't see that the discard operations need to care whether
> > the underlying storage is provisioned. dm-thinp and dm-snapshot can
> > treat REQ_OP_DISCARD as "this range is no longer in use" and do
> > whatever they want with them.
> >
> > > If XFS tracked this "provisioned but unused" state, dm-thinp could
> > > just discard the block like it's told. Would be nice to avoid dm-thinp
> > > needing to track "provisioned but unused".
> > >
> > > That said, dm-thinp does still need to know if a block was provisioned
> > > (given our previous design discussion, to allow proper guarantees
> > > from this interface at snapshot time) so that XFS and other
> > > filesystems don't need to re-provision areas they already
> > > pre-provisioned.
> >
> > Right.
> >
> > I've simply assumed that dm-thinp would need to track entire
> > provisioned regions - used or unused - so it knows which writes to
> > empty or shared regions have a reservation to allow allocation to
> > succeed when the backing pool is otherwise empty.....
> >
> > > However, it may be that if thinp did track "provisioned but unused"
> > > it'd be useful to allow snapshots to share provisioned blocks that
> > > were never used. Meaning, we could then avoid "breaking sharing" at
> > > snapshot-time for "provisioned but unused" blocks. But allowing this
> > > "optimization" undercuts the guarantee that XFS needs for thinp
> > > storage that allows snapshots... SO, I think I answered my own
> > > question: thinp doesn't need to track "provisioned but unused" blocks
> > > but we must always ensure snapshots inherit provisioned blocks ;)
> >
> > Sounds like a potential optimisation, but I haven't thought through
> > a potential snapshot device implementation that far to comment
> > sanely. I stopped once I got to the point where accounting tricks
> > could be used to guarantee space is available for breaking sharing
> > of used provisioned space after a snapshot was taken....
> >
> > > > Further, managing shared pool exhaustion doesn't require a
> > > > reservation pool in the backing device and for the filesystems to
> > > > request space from it. Filesystems already have their own reserve
> > > > pools via pre-provisioning. If we want the filesystems to be able to
> > > > release that space back to the shared pool (e.g. because the shared
> > > > backing pool is critically short on space) then all we need is an
> > > > extension to FITRIM to tell the filesystem to also release internal
> > > > pre-provisioned reserves.
> > >
> > > So by default FITRIM will _not_ discard provisioned blocks. Only if
> > > a flag is used will it result in discarding provisioned blocks.
> >
> > No. FITRIM results in discard of any unused free space in the
> > filesystem that matches the criteria set by the user. We don't care
> > if free space was once provisioned used space - we'll issue a
> > discard for the range regardless. The "special" FITRIM extension I
> > mentioned is to get filesystem metadata provisioning released;
> > that's completely separate to user data provisioning through
> > fallocate() which FITRIM will always discard if it has been freed...
> >
> > IOWs, normal behaviour will be that a FITRIM ends up discarding a
> > mix of unprovisioned and provisioned space. Nobody will be able to
> > predict what mix the device is going to get at any point in time.
> > Also, if we turn on online discard, the block device is going to get
> > a constant stream of discard operations that will also be a mix of
> > provisioned and unprovisioned space that is no longer in use by the
> > filesystem.
> >
> > I suspect that you need to stop trying to second-guess what
> > operations the filesystem will use provisioning for, what it will
> > send discards for and when it will send discards for them... Just
> > assume the device will receive a constant stream of both
> > REQ_PROVISION and REQ_OP_DISCARD (for both provisioned and
> > unprovisioned regions) operations whenever the filesystem is active
> > on a thinp device.....
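[Editorial note: the FITRIM stream Dave describes is driven from userspace via the FITRIM ioctl. Below is a minimal sketch of that call; the ioctl and struct fstrim_range are the real <linux/fs.h> ABI, while the "release internal reserves" extension he mentions does not exist and appears only as a comment.]

```c
#include <errno.h>
#include <linux/fs.h>
#include <stdint.h>
#include <sys/ioctl.h>

/* Ask the filesystem behind 'fd' (an open fd on the mount point) to
 * discard all free extents of at least 'minlen' bytes. Returns 0 on
 * success, -1 with errno set on failure. */
static int fstrim_all(int fd, uint64_t minlen)
{
	struct fstrim_range range = {
		.start = 0,
		.len = UINT64_MAX,	/* cover the whole filesystem */
		.minlen = minlen,
	};
	/* A future extension might add a flag here telling the fs to
	 * also drop its internal pre-provisioned reserves. */
	return ioctl(fd, FITRIM, &range);
}
```

fstrim(8) does essentially this; on a thinp device each resulting discard may hit a mix of provisioned and unprovisioned space, exactly as described above.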
>
> Yeah, I was getting tripped up in the weeds a bit. It's pretty
> straightforward (and like I said at the start of our subthread here:
> this follow-on work, to inherit provisioned flag, can build on this
> REQ_PROVISION patchset).
>
> All said, I've now gotten this sub-thread on Joe Thornber's radar and
> we've started discussing. We'll be discussing with more focus
> tomorrow.
>
From the perspective of this patch series, I'll wait for more feedback
before sending out v8 (which would be the above patches and the
follow-on patch to pass through FALLOC_FL_UNSHARE_RANGE [1]).
[1] https://listman.redhat.com/archives/dm-devel/2023-May/054188.html
Thanks!
Sarthak
> Mike
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH v7 0/5] Introduce provisioning primitives
2023-05-25 22:47 ` Sarthak Kukreti
@ 2023-05-26 1:36 ` Dave Chinner
2023-05-26 2:35 ` Sarthak Kukreti
0 siblings, 1 reply; 52+ messages in thread
From: Dave Chinner @ 2023-05-26 1:36 UTC (permalink / raw)
To: Sarthak Kukreti
Cc: Mike Snitzer, Joe Thornber, Jens Axboe, linux-block,
Theodore Ts'o, Stefan Hajnoczi, Michael S. Tsirkin,
Darrick J. Wong, Brian Foster, Bart Van Assche, linux-kernel,
Christoph Hellwig, dm-devel, Andreas Dilger, linux-fsdevel,
linux-ext4, Jason Wang, Alasdair Kergon
On Thu, May 25, 2023 at 03:47:21PM -0700, Sarthak Kukreti wrote:
> On Thu, May 25, 2023 at 9:00 AM Mike Snitzer <snitzer@kernel.org> wrote:
> > On Thu, May 25 2023 at 7:39P -0400,
> > Dave Chinner <david@fromorbit.com> wrote:
> > > On Wed, May 24, 2023 at 04:02:49PM -0400, Mike Snitzer wrote:
> > > > On Tue, May 23 2023 at 8:40P -0400,
> > > > Dave Chinner <david@fromorbit.com> wrote:
> > > > > It's worth noting that XFS already has a coarse-grained
> > > > > implementation of preferred regions for metadata storage. It will
> > > > > currently not use those metadata-preferred regions for user data
> > > > > unless all the remaining user data space is full. Hence I'm pretty
> > > > > sure that a pre-provisioning enhancement like this can be done
> > > > > entirely in-memory without requiring any new on-disk state to be
> > > > > added.
> > > > >
> > > > > Sure, if we crash and remount, then we might choose a different LBA
> > > > > region for pre-provisioning. But that's not really a huge deal as we
> > > > > could also run an internal background post-mount fstrim operation to
> > > > > remove any unused pre-provisioning that was left over from when the
> > > > > system went down.
> > > >
> > > > This would be the FITRIM with extension you mention below? Which is a
> > > > filesystem interface detail?
> > >
> > > No. We might reuse some of the internal infrastructure we use to
> > > implement FITRIM, but that's about it. It's just something kinda
> > > like FITRIM but with different constraints determined by the
> > > filesystem rather than the user...
> > >
> > > As it is, I'm not sure we'd even need it - a periodic userspace
> > > FITRIM would achieve the same result, so leaked provisioned spaces
> > > would get cleaned up eventually without the filesystem having to do
> > > anything specific...
> > >
> > > > So dm-thinp would _not_ need to have new
> > > > state that tracks "provisioned but unused" block?
> > >
> > > No idea - that's your domain. :)
> > >
> > > dm-snapshot, for certain, will need to track provisioned regions
> > > because it has to guarantee that overwrites to provisioned space in
> > > the origin device will always succeed. Hence it needs to know how
> > > much space breaking sharing in provisioned regions after a snapshot
> > > has been taken will be required...
> >
> > dm-thinp offers its own much more scalable snapshot support (doesn't
> > use old dm-snapshot N-way copyout target).
> >
> > dm-snapshot isn't going to be modified to support this level of
> > hardening (dm-snapshot is basically in "maintenance only" now).
Ah, of course. Sorry for the confusion, I was kinda using
dm-snapshot as shorthand for "dm-thinp + snapshots".
> > But I understand your meaning: what you said is 100% applicable to
> > dm-thinp's snapshot implementation and needs to be accounted for in
> > thinp's metadata (inherent 'provisioned' flag).
*nod*
> A bit orthogonal: would dm-thinp need to differentiate between
> user-triggered provision requests (e.g. from fallocate()) vs
> fs-triggered requests?
Why? How is the guarantee the block device has to provide to
provisioned areas different for user vs filesystem internal
provisioned space?
> I would lean towards user provisioned areas not
> getting dedup'd on snapshot creation,
<twitch>
Snapshotting is a clone operation, not a dedupe operation.
Yes, the end result of both is that you have a block shared between
multiple indexes that needs COW on the next overwrite, but the two
operations that get to that point are very different...
</pedantic mode disengaged>
> but that would entail tracking
> the state of the original request and possibly a provision request
> flag (REQ_PROVISION_DEDUP_ON_SNAPSHOT) or an inverse flag
> (REQ_PROVISION_NODEDUP). Possibly too convoluted...
Let's not try to add everyone's favourite pony to this interface
before we've even got it off the ground.
It's the simple precision of the API, the lack of cross-layer
communication requirements and the ability to implement and optimise
the independent layers independently that makes this a very
appealing solution.
We need to start with getting the simple stuff working and prove the
concept. Then once we can observe the behaviour of a working system
we can start working on optimising individual layers for efficiency
and performance....
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [PATCH v7 0/5] Introduce provisioning primitives
2023-05-26 1:36 ` Dave Chinner
@ 2023-05-26 2:35 ` Sarthak Kukreti
2023-05-26 15:56 ` Brian Foster
0 siblings, 1 reply; 52+ messages in thread
From: Sarthak Kukreti @ 2023-05-26 2:35 UTC (permalink / raw)
To: Dave Chinner
Cc: Mike Snitzer, Joe Thornber, Jens Axboe, linux-block,
Theodore Ts'o, Stefan Hajnoczi, Michael S. Tsirkin,
Darrick J. Wong, Brian Foster, Bart Van Assche, linux-kernel,
Christoph Hellwig, dm-devel, Andreas Dilger, linux-fsdevel,
linux-ext4, Jason Wang, Alasdair Kergon
On Thu, May 25, 2023 at 6:36 PM Dave Chinner <david@fromorbit.com> wrote:
>
> On Thu, May 25, 2023 at 03:47:21PM -0700, Sarthak Kukreti wrote:
> > On Thu, May 25, 2023 at 9:00 AM Mike Snitzer <snitzer@kernel.org> wrote:
> > > On Thu, May 25 2023 at 7:39P -0400,
> > > Dave Chinner <david@fromorbit.com> wrote:
> > > > On Wed, May 24, 2023 at 04:02:49PM -0400, Mike Snitzer wrote:
> > > > > On Tue, May 23 2023 at 8:40P -0400,
> > > > > Dave Chinner <david@fromorbit.com> wrote:
> > > > > > It's worth noting that XFS already has a coarse-grained
> > > > > > implementation of preferred regions for metadata storage. It will
> > > > > > currently not use those metadata-preferred regions for user data
> > > > > > unless all the remaining user data space is full. Hence I'm pretty
> > > > > > sure that a pre-provisioning enhancement like this can be done
> > > > > > entirely in-memory without requiring any new on-disk state to be
> > > > > > added.
> > > > > >
> > > > > > Sure, if we crash and remount, then we might choose a different LBA
> > > > > > region for pre-provisioning. But that's not really a huge deal as we
> > > > > > could also run an internal background post-mount fstrim operation to
> > > > > > remove any unused pre-provisioning that was left over from when the
> > > > > > system went down.
> > > > >
> > > > > This would be the FITRIM with extension you mention below? Which is a
> > > > > filesystem interface detail?
> > > >
> > > > No. We might reuse some of the internal infrastructure we use to
> > > > implement FITRIM, but that's about it. It's just something kinda
> > > > like FITRIM but with different constraints determined by the
> > > > filesystem rather than the user...
> > > >
> > > > As it is, I'm not sure we'd even need it - a periodic userspace
> > > > FITRIM would achieve the same result, so leaked provisioned spaces
> > > > would get cleaned up eventually without the filesystem having to do
> > > > anything specific...
> > > >
> > > > > So dm-thinp would _not_ need to have new
> > > > > state that tracks "provisioned but unused" block?
> > > >
> > > > No idea - that's your domain. :)
> > > >
> > > > dm-snapshot, for certain, will need to track provisioned regions
> > > > because it has to guarantee that overwrites to provisioned space in
> > > > the origin device will always succeed. Hence it needs to know how
> > > > much space breaking sharing in provisioned regions after a snapshot
> > > > has been taken will be required...
> > >
> > > dm-thinp offers its own much more scalable snapshot support (doesn't
> > > use old dm-snapshot N-way copyout target).
> > >
> > > dm-snapshot isn't going to be modified to support this level of
> > > hardening (dm-snapshot is basically in "maintenance only" now).
>
> Ah, of course. Sorry for the confusion, I was kinda using
> dm-snapshot as shorthand for "dm-thinp + snapshots".
>
> > > But I understand your meaning: what you said is 100% applicable to
> > > dm-thinp's snapshot implementation and needs to be accounted for in
> > > thinp's metadata (inherent 'provisioned' flag).
>
> *nod*
>
> > A bit orthogonal: would dm-thinp need to differentiate between
> > user-triggered provision requests (e.g. from fallocate()) vs
> > fs-triggered requests?
>
> Why? How is the guarantee the block device has to provide to
> provisioned areas different for user vs filesystem internal
> provisioned space?
>
After thinking this through, I stand corrected. I was primarily
concerned with how this would balloon thin snapshot sizes if users
potentially provision a large chunk of the filesystem but that's
putting the cart way before the horse.
Best
Sarthak
> > I would lean towards user provisioned areas not
> > getting dedup'd on snapshot creation,
>
> <twitch>
>
> Snapshotting is a clone operation, not a dedupe operation.
>
> Yes, the end result of both is that you have a block shared between
> multiple indexes that needs COW on the next overwrite, but the two
> operations that get to that point are very different...
>
> </pedantic mode disengaged>
>
> > but that would entail tracking
> > the state of the original request and possibly a provision request
> > flag (REQ_PROVISION_DEDUP_ON_SNAPSHOT) or an inverse flag
> > (REQ_PROVISION_NODEDUP). Possibly too convoluted...
>
> Let's not try to add everyone's favourite pony to this interface
> before we've even got it off the ground.
>
> It's the simple precision of the API, the lack of cross-layer
> communication requirements and the ability to implement and optimise
> the independent layers independently that makes this a very
> appealing solution.
>
> We need to start with getting the simple stuff working and prove the
> concept. Then once we can observe the behaviour of a working system
> we can start working on optimising individual layers for efficiency
> and performance....
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com
* Re: [PATCH v7 0/5] Introduce provisioning primitives
2023-05-26 2:35 ` Sarthak Kukreti
@ 2023-05-26 15:56 ` Brian Foster
0 siblings, 0 replies; 52+ messages in thread
From: Brian Foster @ 2023-05-26 15:56 UTC (permalink / raw)
To: Sarthak Kukreti
Cc: Dave Chinner, Mike Snitzer, Joe Thornber, Jens Axboe, linux-block,
Theodore Ts'o, Stefan Hajnoczi, Michael S. Tsirkin,
Darrick J. Wong, Bart Van Assche, linux-kernel, Christoph Hellwig,
dm-devel, Andreas Dilger, linux-fsdevel, linux-ext4, Jason Wang,
Alasdair Kergon
On Thu, May 25, 2023 at 07:35:14PM -0700, Sarthak Kukreti wrote:
> On Thu, May 25, 2023 at 6:36 PM Dave Chinner <david@fromorbit.com> wrote:
> >
> > On Thu, May 25, 2023 at 03:47:21PM -0700, Sarthak Kukreti wrote:
> > > On Thu, May 25, 2023 at 9:00 AM Mike Snitzer <snitzer@kernel.org> wrote:
> > > > On Thu, May 25 2023 at 7:39P -0400,
> > > > Dave Chinner <david@fromorbit.com> wrote:
> > > > > On Wed, May 24, 2023 at 04:02:49PM -0400, Mike Snitzer wrote:
> > > > > > On Tue, May 23 2023 at 8:40P -0400,
> > > > > > Dave Chinner <david@fromorbit.com> wrote:
> > > > > > > It's worth noting that XFS already has a coarse-grained
> > > > > > > implementation of preferred regions for metadata storage. It will
> > > > > > > currently not use those metadata-preferred regions for user data
> > > > > > > unless all the remaining user data space is full. Hence I'm pretty
> > > > > > > sure that a pre-provisioning enhancement like this can be done
> > > > > > > entirely in-memory without requiring any new on-disk state to be
> > > > > > > added.
> > > > > > >
> > > > > > > Sure, if we crash and remount, then we might choose a different LBA
> > > > > > > region for pre-provisioning. But that's not really a huge deal as we
> > > > > > > could also run an internal background post-mount fstrim operation to
> > > > > > > remove any unused pre-provisioning that was left over from when the
> > > > > > > system went down.
> > > > > >
> > > > > > This would be the FITRIM with extension you mention below? Which is a
> > > > > > filesystem interface detail?
> > > > >
> > > > > No. We might reuse some of the internal infrastructure we use to
> > > > > implement FITRIM, but that's about it. It's just something kinda
> > > > > like FITRIM but with different constraints determined by the
> > > > > filesystem rather than the user...
> > > > >
> > > > > As it is, I'm not sure we'd even need it - a periodic userspace
> > > > > FITRIM would achieve the same result, so leaked provisioned spaces
> > > > > would get cleaned up eventually without the filesystem having to do
> > > > > anything specific...
> > > > >
> > > > > > So dm-thinp would _not_ need to have new
> > > > > > state that tracks "provisioned but unused" block?
> > > > >
> > > > > No idea - that's your domain. :)
> > > > >
> > > > > dm-snapshot, for certain, will need to track provisioned regions
> > > > > because it has to guarantee that overwrites to provisioned space in
> > > > > the origin device will always succeed. Hence it needs to know how
> > > > > much space breaking sharing in provisioned regions after a snapshot
> > > > > has been taken will be required...
> > > >
> > > > dm-thinp offers its own much more scalable snapshot support (doesn't
> > > > use old dm-snapshot N-way copyout target).
> > > >
> > > > dm-snapshot isn't going to be modified to support this level of
> > > > hardening (dm-snapshot is basically in "maintenance only" now).
> >
> > Ah, of course. Sorry for the confusion, I was kinda using
> > dm-snapshot as shorthand for "dm-thinp + snapshots".
> >
> > > > But I understand your meaning: what you said is 100% applicable to
> > > > dm-thinp's snapshot implementation and needs to be accounted for in
> > > > thinp's metadata (inherent 'provisioned' flag).
> >
> > *nod*
> >
> > > A bit orthogonal: would dm-thinp need to differentiate between
> > > user-triggered provision requests (e.g. from fallocate()) vs
> > > fs-triggered requests?
> >
> > Why? How is the guarantee the block device has to provide to
> > provisioned areas different for user vs filesystem internal
> > provisioned space?
> >
> After thinking this through, I stand corrected. I was primarily
> concerned with how this would balloon thin snapshot sizes if users
> potentially provision a large chunk of the filesystem but that's
> putting the cart way before the horse.
>
I think that's a legitimate concern. At some point to provide full
-ENOSPC protection the filesystem needs to provision space before it
writes to it, whether it be data or metadata, right? At what point does
that turn into a case where pretty much everything the fs wrote is
provisioned, and therefore a snapshot is just a full copy operation?
That might be Ok I guess, but if that's an eventuality then what's the
need to track provision state at dm-thin block level? Using some kind of
flag you mention below could be a good way to qualify which blocks you'd
want to copy vs. which to share on snapshot and perhaps mitigate that
problem.
> Best
> Sarthak
>
> > > I would lean towards user provisioned areas not
> > > getting dedup'd on snapshot creation,
> >
> > <twitch>
> >
> > Snapshotting is a clone operation, not a dedupe operation.
> >
> > Yes, the end result of both is that you have a block shared between
> > multiple indexes that needs COW on the next overwrite, but the two
> > operations that get to that point are very different...
> >
> > </pedantic mode disengaged>
> >
> > > but that would entail tracking
> > > the state of the original request and possibly a provision request
> > > flag (REQ_PROVISION_DEDUP_ON_SNAPSHOT) or an inverse flag
> > > (REQ_PROVISION_NODEDUP). Possibly too convoluted...
> >
> > Let's not try to add everyone's favourite pony to this interface
> > before we've even got it off the ground.
> >
> > It's the simple precision of the API, the lack of cross-layer
> > communication requirements and the ability to implement and optimise
> > the independent layers independently that makes this a very
> > appealing solution.
> >
> > We need to start with getting the simple stuff working and prove the
> > concept. Then once we can observe the behaviour of a working system
> > we can start working on optimising individual layers for efficiency
> > and performance....
> >
I think to prove the concept may not necessarily require changes to
dm-thin at all. If you want to guarantee preexisting metadata block
writeability, just scan through and provision all metadata blocks at
mount time. Hit the log, AG bufs, IIRC XFS already has btree walking
code that can be used for btrees and associated metadata, etc. Maybe
online scrub has something even better to hook into temporarily for this
sort of thing?
Mount performance would obviously be bad, but that doesn't matter for
the purposes of a prototype. The goal should really be that once
mounted, you have established expected writeability invariants and have
the ability to test for reliable prevention of -ENOSPC errors from
dm-thin from that point forward. If that ultimately works, then refine
the ideal implementation from there and ask dm to do whatever
writeability tracking and whatnot.
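[Editorial note: the mount-time pass Brian proposes can be reduced to a small accounting sketch. The extent list and the dev_free counter below are illustrative stand-ins; real code would walk XFS's btrees, log and AG buffers and issue REQ_PROVISION bios rather than decrement a counter.]

```c
#include <stddef.h>
#include <stdint.h>

struct extent { uint64_t start, len; };	/* LBA range of a metadata object */

/* Walk every metadata extent found at mount and provision it before any
 * write is allowed. '*dev_free' models the thin pool granting (or
 * refusing) REQ_PROVISION; a nonzero return would fail or degrade the
 * mount, since writeability cannot be guaranteed. A real implementation
 * would also roll back partial provisioning on failure. */
static int provision_metadata(const struct extent *ext, size_t n,
			      uint64_t *dev_free)
{
	for (size_t i = 0; i < n; i++) {
		if (ext[i].len > *dev_free)
			return -1;	/* thin pool exhausted */
		*dev_free -= ext[i].len;
	}
	return 0;
}
```

Once this pass completes, the writeability invariant holds for all preexisting metadata, which is all the prototype needs to start testing -ENOSPC prevention.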
FWIW, that may also help deal with things like the fact that xfs_repair
can basically relocate the entire set of filesystem metadata to
completely different ranges of free space, completely breaking any
writeability guarantees tracked by previous provisions of those ranges.
Brian
> > Cheers,
> >
> > Dave.
> > --
> > Dave Chinner
> > david@fromorbit.com
>
* Re: [PATCH v7 0/5] Introduce provisioning primitives
2023-05-24 0:40 ` Dave Chinner
2023-05-24 20:02 ` Mike Snitzer
@ 2023-05-25 16:19 ` Brian Foster
2023-05-26 9:37 ` Dave Chinner
1 sibling, 1 reply; 52+ messages in thread
From: Brian Foster @ 2023-05-25 16:19 UTC (permalink / raw)
To: Dave Chinner
Cc: Mike Snitzer, Jens Axboe, Christoph Hellwig, Theodore Ts'o,
Sarthak Kukreti, dm-devel, Michael S. Tsirkin, Darrick J. Wong,
Jason Wang, Bart Van Assche, linux-kernel, linux-block,
Joe Thornber, Andreas Dilger, Stefan Hajnoczi, linux-fsdevel,
linux-ext4, Alasdair Kergon
On Wed, May 24, 2023 at 10:40:34AM +1000, Dave Chinner wrote:
> On Tue, May 23, 2023 at 11:26:18AM -0400, Mike Snitzer wrote:
> > On Tue, May 23 2023 at 10:05P -0400, Brian Foster <bfoster@redhat.com> wrote:
> > > On Mon, May 22, 2023 at 02:27:57PM -0400, Mike Snitzer wrote:
> > > ... since I also happen to think there is a potentially interesting
> > > development path to make this sort of reserve pool configurable in terms
> > > of size and active/inactive state, which would allow the fs to use an
> > > emergency pool scheme for managing metadata provisioning and not have to
> > > track and provision individual metadata buffers at all (dealing with
> > > user data is much easier to provision explicitly). So the space
> > > inefficiency thing is potentially just a tradeoff for simplicity, and
> > > filesystems that want more granularity for better behavior could achieve
> > > that with more work. Filesystems that don't would be free to rely on the
> > > simple/basic mechanism provided by dm-thin and still have basic -ENOSPC
> > > protection with very minimal changes.
> > >
> > > That's getting too far into the weeds on the future bits, though. This
> > > is essentially 99% a dm-thin approach, so I'm mainly curious if there's
> > > sufficient interest in this sort of "reserve mode" approach to try and
> > > clean it up further and have dm guys look at it, or if you guys see any
> > > obvious issues in what it does that makes it potentially problematic, or
> > > if you would just prefer to go down the path described above...
> >
> > The model that Dave detailed, which builds on REQ_PROVISION and is
> > sticky (by provisioning same blocks for snapshot) seems more useful to
> > me because it is quite precise. That said, it doesn't account for
> > hard requirements that _all_ blocks will always succeed.
>
> Hmmm. Maybe I'm misunderstanding the "reserve pool" context here,
> but I don't think we'd ever need a hard guarantee from the block
> device that every write bio issued from the filesystem will succeed
> without ENOSPC.
>
The bigger picture goal that I didn't get into in my previous mail is
the "full device" reservation model is intended to be a simple, crude
reference implementation that can be enabled for any arbitrary thin
volume consumer (filesystem or application). The idea is to build that
on a simple enough reservation mechanism that any such consumer could
override it based on its own operational model. The goal is to guarantee
that a particular filesystem never receives -ENOSPC from dm-thin on
writes, but the first phase of implementing that is to simply guarantee
every block is writeable.
As a specific filesystem is able to more explicitly provision its own
allocations in a way that it can guarantee to return -ENOSPC from
dm-thin up front (rather than at write bio time), it can reduce the need
for the amount of reservation required, ultimately to zero if that
filesystem provides the ability to pre-provision all of its writes to
storage in some way or another.
I think for filesystems with complex metadata management like XFS, it's
not very realistic to expect explicit 1-1 provisioning for all metadata
changes on a per-transaction basis in the same way that can (fairly
easily) be done for data, which means a pool mechanism is probably still
needed for the metadata class of writes. Therefore, my expectation for
something like XFS is that it grows the ability to explicitly provision
data writes up front (we solved this part years ago), and then uses a
much smaller pool of reservation for the purpose of dealing with
metadata.
I think what you describe below around preprovisioned perag metadata
ranges is interesting because it _very closely_ maps conceptually to
what I envisioned the evolution of the reserve pool scheme to end up
looking like, but just implemented rather differently to use
reservations instead of specific LBA ranges.
Let me try to connect the dots and identify the differences/tradeoffs...
> If the block device can provide a guarantee that a provisioned LBA
> range is always writable, then everything else is a filesystem level
> optimisation problem and we don't have to involve the block device
> in any way. All we need is a flag we can read out of the bdev at
> mount time to determine if the filesystem should be operating with
> LBA provisioning enabled...
>
> e.g. If we need to "pre-provision" a chunk of the LBA space for
> filesystem metadata, we can do that ahead of time and track the
> pre-provisioned range(s) in the filesystem itself.
>
> In XFS, That could be as simple as having small chunks of each AG
> reserved to metadata (e.g. start with the first 100MB) and limiting
> all metadata allocation free space searches to that specific block
> range. When we run low on that space, we pre-provision another 100MB
> chunk and then allocate all metadata out of that new range. If we
> start getting ENOSPC from pre-provisioning, then we reduce the size of
> the regions and log low space warnings to userspace. If we can't
> pre-provision any space at all and we've completely run out, we
> simply declare ENOSPC for all incoming operations that require
> metadata allocation until pre-provisioning succeeds again.
>
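[Editorial note: Dave's per-AG scheme above reduces to a small piece of accounting. The sketch below is an illustrative userspace model, not XFS or dm code; the backing_free counter stands in for the thin pool honouring REQ_PROVISION, and the chunk-shrinking loop mirrors "reduce the size of the regions" as the pool fills up.]

```c
#include <stdbool.h>
#include <stdint.h>

#define META_CHUNK (100ull * 1024 * 1024)	/* 100MB pre-provision unit */

struct ag_meta_pool {
	uint64_t provisioned_free;	/* provisioned bytes not yet allocated */
	bool	 enospc;		/* pre-provisioning failed; reject ops */
};

/* Allocate 'len' bytes of metadata space (len <= META_CHUNK assumed),
 * pre-provisioning a fresh chunk when the current one runs low.
 * Returns false to signal ENOSPC for the incoming operation. */
static bool ag_meta_alloc(struct ag_meta_pool *ag, uint64_t len,
			  uint64_t *backing_free)
{
	while (ag->provisioned_free < len) {
		uint64_t chunk = META_CHUNK;

		/* Shrink the pre-provision request as the pool fills up. */
		while (chunk >= len && chunk > *backing_free)
			chunk /= 2;
		if (chunk < len) {
			ag->enospc = true;	/* fail until space returns */
			return false;
		}
		*backing_free -= chunk;
		ag->provisioned_free += chunk;
		ag->enospc = false;
	}
	ag->provisioned_free -= len;
	return true;
}
```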
The more interesting aspect of this is not so much how space is
provisioned and allocated, but how the filesystem is going to consume
that space in a way that guarantees -ENOSPC is provided up front before
userspace is allowed to make modifications. You didn't really touch on
that here, so I'm going to assume we'd have something like a perag
counter of how many free blocks currently live in preprovisioned ranges,
and then an fs-wide total somewhere so a transaction has the ability to
consume these blocks at trans reservation time, the fs knows when to
preprovision more space (or go into -ENOSPC mode), etc.
Some accounting of that nature is necessary here in order to prevent the
filesystem from ever writing to unprovisioned space. So what I was
envisioning is rather than explicitly preprovision a physical range of
each AG and tracking all that, just reserve that number of arbitrarily
located blocks from dm for each AG.
The initial perag reservations can be populated at mount time,
replenished as needed in a very similar way as what you describe, and
100% released back to the thin pool at unmount time. On top of that,
there's no need to track physical preprovisioned ranges at all. Not just
for allocation purposes, but also to avoid things like having to protect
background trims from preprovisioned ranges of free space dedicated for
metadata, etc.
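The counter-based alternative can be sketched as follows. This is a hedged illustration of the reservation mechanism described above: thin_pool, fs_reserve and the three operations are hypothetical names modelling the dm-thin reservation interface, not an existing API.

```c
#include <stdbool.h>
#include <stdint.h>

struct thin_pool  { uint64_t free_blocks; };	/* shared thin pool */
struct fs_reserve { uint64_t reserved; };	/* blocks held by one volume */

/* Grow the filesystem's reservation (at mount, or when a transaction
 * finds the reserve low). Fails up front, like -ENOSPC, rather than
 * at write-bio time. */
static bool fs_reserve_grow(struct fs_reserve *r, struct thin_pool *p,
			    uint64_t blocks)
{
	if (p->free_blocks < blocks)
		return false;
	p->free_blocks -= blocks;
	r->reserved += blocks;
	return true;
}

/* A metadata write consumes one reserved block; when this fails the
 * filesystem enters simulated -ENOSPC mode. */
static bool fs_reserve_consume(struct fs_reserve *r)
{
	if (r->reserved == 0)
		return false;
	r->reserved--;
	return true;
}

/* At unmount, everything unused goes straight back to the shared pool;
 * no physical LBA ranges were ever pinned. */
static void fs_reserve_release(struct fs_reserve *r, struct thin_pool *p)
{
	p->free_blocks += r->reserved;
	r->reserved = 0;
}
```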
> This is built entirely on the premise that once proactive backing
> device provisioning fails, the backing device is at ENOSPC and we
> have to wait for that situation to go away before allowing new data
> to be ingested. Hence the block device really doesn't need to know
> anything about what the filesystem is doing and vice versa - The
> block dev just says "yes" or "no" and the filesystem handles
> everything else.
>
Yup, everything you describe about going into a simulated -ENOSPC mode
would work exactly the same. The primary difference is that the internal
provisioned space accounting in the filesystem is backed by dynamic
reservation in dm, rather than physically provisioned LBA ranges.
> It's worth noting that XFS already has a coarse-grained
> implementation of preferred regions for metadata storage. It will
> currently not use those metadata-preferred regions for user data
> unless all the remaining user data space is full. Hence I'm pretty
> sure that a pre-provisioning enhancment like this can be done
> entirely in-memory without requiring any new on-disk state to be
> added.
>
> Sure, if we crash and remount, then we might chose a different LBA
> region for pre-provisioning. But that's not really a huge deal as we
> could also run an internal background post-mount fstrim operation to
> remove any unused pre-provisioning that was left over from when the
> system went down.
>
None of this is really needed..
> Further, managing shared pool exhaustion doesn't require a
> reservation pool in the backing device and for the filesystems to
> request space from it. Filesystems already have their own reserve
> pools via pre-provisioning. If we want the filesystems to be able to
> release that space back to the shared pool (e.g. because the shared
> backing pool is critically short on space) then all we need is an
> extension to FITRIM to tell the filesystem to also release internal
> pre-provisioned reserves.
>
> Then the backing pool admin (person or automated daemon!) can simply
> issue a trim on all the filesystems in the pool and space will be
> returned. Then filesystems will ask for new pre-provisioned space
> when they next need to ingest modifications, and the backing pool
> can manage the new pre-provisioning space requests directly....
>
This is written as to imply that the reservation pool is some big
complex thing, which makes me think there is some
confusion/miscommunication. It's basically just an in-memory counter of
space that is allocated out of a shared thin pool and is held in a
specific thin volume while it is currently in use. The counter on the
volume is managed indirectly by filesystem requests and/or direct
operations on the volume (like dm snapshots).
Sure, you could replace the counter and reservation interface with
explicitly provisioned/trimmed LBA ranges that the fs can manage to
provide -ENOSPC guarantees, but then the fs has to do those various
things you've mentioned:
- Provision those ranges in the fs and change allocation behavior
accordingly.
- Do the background post-crash fitrim preprovision clean up thing.
- Distinguish between trims that are intended to return preprovisioned
space vs. those that come from userspace.
- Have some daemon or whatever (?) responsible for communicating the
need for trims in the fs to return space back to the pool.
Then this still depends on changing how dm thin snapshots work and needs
a way to deal with delayed allocation to actually guarantee -ENOSPC
protection..?
> Hence I think if we get the basic REQ_PROVISION overwrite-in-place
> guarantees defined and implemented as previously outlined, then we
> don't need any special coordination between the fs and block devices
> to avoid fatal ENOSPC issues with sparse and/or snapshot capable
> block devices...
>
This all sounds like a good amount of coordination and unnecessary
complexity to me. What I was thinking as a next phase (i.e. after
initial phase full device reservation) approach for a filesystem like
XFS would be something like this.
- Support a mount option for a configurable size metadata reservation
pool (with sane/conservative default).
- The pool is populated at mount time, else the fs goes right into
simulated -ENOSPC mode.
- Thin pool reservation consumption is controlled by a flag on write
bios that is managed by the fs (flag polarity TBD).
- All fs data writes are explicitly reserved up front in the write path.
Delalloc maps to explicit reservation, overwrites are easy and just
involve an explicit provision.
- Metadata writes are not reserved or provisioned at all. They allocate
out of the thin pool on write (if needed), just as they do today. On
an -ENOSPC metadata write error, the fs goes into simulated -ENOSPC mode
and allows outstanding metadata writes to now use the bio flag to
consume emergency reservation.
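The accounting implied by the scheme above can be sketched as a small userspace model: an explicit up-front reservation for data writes, and metadata writes that allocate normally until failure, at which point the fs flips into simulated -ENOSPC mode and flagged writes dip into the emergency reservation. All names and the block-level granularity here are invented for illustration; this is not dm-thin or XFS code.

```c
/*
 * Illustrative userspace model of the phased reservation scheme
 * described above. Invented names; not dm-thin or XFS code.
 */
#include <assert.h>
#include <stdbool.h>

struct thin_pool {
	long long free_blocks;		/* unallocated blocks in the pool */
	long long reserved_blocks;	/* blocks held back for this volume */
};

/* Mount time (or data write path): take an explicit up-front reservation. */
static bool pool_reserve(struct thin_pool *p, long long nr)
{
	if (p->free_blocks - p->reserved_blocks < nr)
		return false;
	p->reserved_blocks += nr;
	return true;
}

/*
 * Metadata write: allocate from the general pool as today. On failure,
 * flip the fs into simulated -ENOSPC mode and let outstanding writes
 * consume the emergency reservation (the flagged-bio path in the text).
 */
static bool metadata_write(struct thin_pool *p, long long nr, bool *enospc_mode)
{
	if (p->free_blocks - p->reserved_blocks >= nr) {
		p->free_blocks -= nr;
		return true;
	}
	*enospc_mode = true;
	if (p->reserved_blocks >= nr) {
		p->reserved_blocks -= nr;
		p->free_blocks -= nr;
		return true;
	}
	return false;
}
```

Note how reliability scales directly with the reservation size, which is exactly the tradeoff discussed next.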
So this means that metadata -ENOSPC protection is only as reliable as
the size of the specified pool. This is by design, so the filesystem
still does not have to track provisioning, allocation or overwrites of
its own metadata usage. Users with metadata heavy workloads or who
happen to be sensitive to -ENOSPC errors can be more aggressive with
pool size, while other users might be able to get away with a smaller
pool. Users who are super paranoid and want perfection can continue to
reserve the entire device and pay for the extra storage.
Users who are not sure can test their workload in an appropriate
environment, collect some data/metrics on maximum outstanding dirty
metadata, and then use that as a baseline/minimum pool size for reliable
behavior going forward. This is also where something like Stratis can
come in to generate this sort of information, make recommendations or
implement heuristics (based on things like fs size, amount of RAM, for
e.g.) to provide sane defaults based on use case. I.e., this is
initially exposed as a userspace/tuning issue instead of a
filesystem/dm-thin hard guarantee.
Finally, if you really want to get to that last step of maximally
efficient and safe provisioning in the fs, implement a
'thinreserve=adaptive' mode in the fs that alters the acquisition and
consumption of dm-thin reserved blocks to be adaptive in nature and
promises to do its own usage throttling against outstanding
reservation. I think this is the mode that most closely resembles your
preprovisioned range mechanism.
For example, adaptive mode could add the logic/complexity where you do
the per-ag provision thing (just using reservation instead of physical
ranges), change the transaction path to attempt to increase the
reservation pool or go into -ENOSPC mode, and flag all writes to be
satisfied from the reserve pool (because you've done the
provision/reservation up front).
At this point the "reserve pool" concept is very different and pretty
much managed entirely by the filesystem. It's just a counter of the set
of blocks the fs is anticipating to write to in the near term, but it's
built on the same underlying reservation mechanism used differently by
other filesystems. So something like ext4 could elide the need for an
adaptive mode, implement the moderate data/metadata pool mechanism and
rely on userspace tools or qualified administrators to do the sizing
correctly, while simultaneously using the same underlying mechanism that
XFS is using for finer grained provisioning.
> As a bonus, if we can implement the guarantees in dm-thin/-snapshot
> and have a filesystem make use of it, then we also have a reference
> implementation to point at device vendors and standards
> associations....
>
I think that's a ways ahead yet.. :P
Thoughts on any of the above? Does that describe enough of the big
picture? (Mike, I hope this at least addresses the whole "why even do
this?" question). I am deliberately trying to work through a progression
that starts simple and generic but actually 100% solves the problem
(even if in a dumb way), then iterates to something that addresses the
biggest drawback with the reference implementation with minimal changes
required to individual filesystems (i.e. metadata pool sizing), and
finally ends up allowing any particular filesystem to refine from there
to achieve maximal efficiency based on its own cost/benefit analysis.
Another way to look at it is... step 1 is to implement the
'thinreserve=full' mount option, which can be trivially implemented by
any filesystem with a couple function calls. Step two is to implement
'thinsreserve=N' support, which consists of a standard iomap
provisioning implementation for data and a bio tagging/error handling
approach that is still pretty simple for most filesystems to implement.
Finally, 'thinreserve=adaptive' is the filesystems best effort to
guarantee -ENOSPC safety with maximal space efficiency.
One general tradeoff with using reservations vs. preprovisioning is that
the latter can just use the provision/trim primitives to alloc/free LBA
ranges. My thought on that is those primitives could possibly be
modified to do the same sort of things with reservation as for physical
allocations. That seems fairly easy to do with bio op flags/modifiers,
though one thing I'm not sure about is how to submit a provision bio to
request a certain amount of location-agnostic blocks. I'd have to
investigate that more.
So in summary, I can sort of see how this problem could be solved with
this combination of physically preprovisioned ranges and changes to
dm-thin snapshot behavior and whatnot (I'm still missing how this is
supposed to handle delalloc, mostly), but I think that involves more
complexity and customization work than is really necessary. Either way,
this is a distinctly different approach to what I was thinking of
morphing the prototype bits into. So to me the relevant question is does
something like the progression that is outlined above for a block
reservation approach seem a reasonable path to ultimately be able to
accomplish the same sort of results in the fs? If so, then I'm happy to
try and push things in that direction to at least try to prove it out.
If not, then I'm also happy to just leave it alone.. ;)
Brian
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com
>
* Re: [PATCH v7 0/5] Introduce provisioning primitives
2023-05-25 16:19 ` Brian Foster
@ 2023-05-26 9:37 ` Dave Chinner
2023-05-26 15:47 ` Brian Foster
[not found] ` <CAJ0trDbspRaDKzTzTjFdPHdB9n0Q9unfu1cEk8giTWoNu3jP8g@mail.gmail.com>
0 siblings, 2 replies; 52+ messages in thread
From: Dave Chinner @ 2023-05-26 9:37 UTC (permalink / raw)
To: Brian Foster
Cc: Mike Snitzer, Jens Axboe, Christoph Hellwig, Theodore Ts'o,
Sarthak Kukreti, dm-devel, Michael S. Tsirkin, Darrick J. Wong,
Jason Wang, Bart Van Assche, linux-kernel, linux-block,
Joe Thornber, Andreas Dilger, Stefan Hajnoczi, linux-fsdevel,
linux-ext4, Alasdair Kergon
On Thu, May 25, 2023 at 12:19:47PM -0400, Brian Foster wrote:
> On Wed, May 24, 2023 at 10:40:34AM +1000, Dave Chinner wrote:
> > On Tue, May 23, 2023 at 11:26:18AM -0400, Mike Snitzer wrote:
> > > On Tue, May 23 2023 at 10:05P -0400, Brian Foster <bfoster@redhat.com> wrote:
> > > > On Mon, May 22, 2023 at 02:27:57PM -0400, Mike Snitzer wrote:
> > > > ... since I also happen to think there is a potentially interesting
> > > > development path to make this sort of reserve pool configurable in terms
> > > > of size and active/inactive state, which would allow the fs to use an
> > > > emergency pool scheme for managing metadata provisioning and not have to
> > > > track and provision individual metadata buffers at all (dealing with
> > > > user data is much easier to provision explicitly). So the space
> > > > inefficiency thing is potentially just a tradeoff for simplicity, and
> > > > filesystems that want more granularity for better behavior could achieve
> > > > that with more work. Filesystems that don't would be free to rely on the
> > > > simple/basic mechanism provided by dm-thin and still have basic -ENOSPC
> > > > protection with very minimal changes.
> > > >
> > > > That's getting too far into the weeds on the future bits, though. This
> > > > is essentially 99% a dm-thin approach, so I'm mainly curious if there's
> > > > sufficient interest in this sort of "reserve mode" approach to try and
> > > > clean it up further and have dm guys look at it, or if you guys see any
> > > > obvious issues in what it does that makes it potentially problematic, or
> > > > if you would just prefer to go down the path described above...
> > >
> > > The model that Dave detailed, which builds on REQ_PROVISION and is
> > > sticky (by provisioning same blocks for snapshot) seems more useful to
> > > me because it is quite precise. That said, it doesn't account for
> > > hard requirements that _all_ blocks will always succeed.
> >
> > Hmmm. Maybe I'm misunderstanding the "reserve pool" context here,
> > but I don't think we'd ever need a hard guarantee from the block
> > device that every write bio issued from the filesystem will succeed
> > without ENOSPC.
> >
>
> The bigger picture goal that I didn't get into in my previous mail is
> the "full device" reservation model is intended to be a simple, crude
> reference implementation that can be enabled for any arbitrary thin
> volume consumer (filesystem or application). The idea is to build that
> on a simple enough reservation mechanism that any such consumer could
> override it based on its own operational model. The goal is to guarantee
> that a particular filesystem never receives -ENOSPC from dm-thin on
> writes, but the first phase of implementing that is to simply guarantee
> every block is writeable.
>
> As a specific filesystem is able to more explicitly provision its own
> allocations in a way that it can guarantee to return -ENOSPC from
> dm-thin up front (rather than at write bio time), it can reduce the need
> for the amount of reservation required, ultimately to zero if that
> filesystem provides the ability to pre-provision all of its writes to
> storage in some way or another.
>
> I think for filesystems with complex metadata management like XFS, it's
> not very realistic to expect explicit 1-1 provisioning for all metadata
> changes on a per-transaction basis in the same way that can (fairly
> easily) be done for data, which means a pool mechanism is probably still
> needed for the metadata class of writes.
I'm trying to avoid the need for 1-1 provisioning and the need for an
accounting-based reservation pool approach. I've tried the
reservation pool thing several times, and they've all collapsed
under the complexity of behaviour guarantees under worst case write
amplification situations.
The whole point of the LBA provisioning approach is that it
completely avoids the need to care about write amplification because
the underlying device guarantees any write to a LBA that is
provisioned will succeed. It takes care of the write amplification
problem for us, and we can make it even easier for the backing
device by aligning LBA range provision requests to device region
sizes.
> > If the block device can provide a guarantee that a provisioned LBA
> > range is always writable, then everything else is a filesystem level
> > optimisation problem and we don't have to involve the block device
> > in any way. All we need is a flag we can read out of the bdev at
> > mount time to determine if the filesystem should be operating with
> > LBA provisioning enabled...
> >
> > e.g. If we need to "pre-provision" a chunk of the LBA space for
> > filesystem metadata, we can do that ahead of time and track the
> > pre-provisioned range(s) in the filesystem itself.
> >
> > In XFS, That could be as simple as having small chunks of each AG
> > reserved to metadata (e.g. start with the first 100MB) and limiting
> > all metadata allocation free space searches to that specific block
> > range. When we run low on that space, we pre-provision another 100MB
> > chunk and then allocate all metadata out of that new range. If we
> > start getting ENOSPC to pre-provisioning, then we reduce the size of
> > the regions and log low space warnings to userspace. If we can't
> > pre-provision any space at all and we've completely run out, we
> > simply declare ENOSPC for all incoming operations that require
> > metadata allocation until pre-provisioning succeeds again.
> >
>
> The more interesting aspect of this is not so much how space is
> provisioned and allocated, but how the filesystem is going to consume
> that space in a way that guarantees -ENOSPC is provided up front before
> userspace is allowed to make modifications.
Yeah, that's trivial with REQ_PROVISION.
If, at transaction reservation time, we don't have enough
provisioned metadata space available for the potential allocations
we'll need to make, we kick off provisioning work and wait for more to
become available. If that fails and none is available, we'll get an
enospc error right there, same as if the filesystem itself has no
blocks available for allocation.
This is no different to, say, having xfs_create() fail reservation
because ENOSPC, then calling xfs_flush_inodes() to kick off an inode
cache walk to trim away all the unused post-eof allocations in
memory to free up some space we can use. When that completes,
we try the reservation again.
There's no new behaviours we need to introduce here - it's just
replication of existing behaviours and infrastructure.
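The "fail reservation, kick background work, retry" pattern referred to here can be sketched as follows; the function names and the -28 (-ENOSPC) literal are invented for the example, and the real XFS paths are asynchronous and far more involved.

```c
/*
 * Sketch of the existing "fail reservation, kick background work,
 * retry once" pattern, transplanted to provisioned metadata space.
 * Illustrative only.
 */
#include <assert.h>

static int reserve_prov_space(long long *prov_avail, long long need,
			      void (*kick_provisioning)(long long *))
{
	if (*prov_avail >= need) {
		*prov_avail -= need;
		return 0;
	}
	/* Kick work to REQ_PROVISION more ranges, then retry once. */
	kick_provisioning(prov_avail);
	if (*prov_avail >= need) {
		*prov_avail -= need;
		return 0;
	}
	return -28;	/* -ENOSPC: nothing more could be provisioned */
}

/* Stubs standing in for the background provisioning worker. */
static void kick_adds_100(long long *a) { *a += 100; }
static void kick_noop(long long *a) { (void)a; }
```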
> You didn't really touch on
> that here, so I'm going to assume we'd have something like a perag
> counter of how many free blocks currently live in preprovisioned ranges,
> and then an fs-wide total somewhere so a transaction has the ability to
> consume these blocks at trans reservation time, the fs knows when to
> preprovision more space (or go into -ENOSPC mode), etc.
Sure, something like that. Those are all implementation details,
not really that complex to implement, and largely replication of
reservation infrastructure we already have.
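The accounting Brian describes could look roughly like the model below: an fs-wide count of free blocks inside provisioned ranges that transactions draw from at reservation time, with a low-water mark that triggers further provisioning. The names and return-value convention are invented (the per-AG breakdown is omitted for brevity); none of this is existing XFS code.

```c
/*
 * Model of the provisioned-space accounting described above.
 * Invented names; not existing XFS code.
 */
#include <assert.h>

struct fs_prov {
	long long total_free;		/* free blocks in provisioned ranges */
	long long low_threshold;	/* below this, provision another chunk */
};

/*
 * Transaction reservation: 0 on success, 1 on success-but-replenish-soon,
 * -1 when the caller must provision more space or go into -ENOSPC mode.
 */
static int trans_reserve_prov(struct fs_prov *fs, long long nr)
{
	if (fs->total_free < nr)
		return -1;
	fs->total_free -= nr;
	return fs->total_free < fs->low_threshold ? 1 : 0;
}
```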
> Some accounting of that nature is necessary here in order to prevent the
> filesystem from ever writing to unprovisioned space. So what I was
> envisioning is rather than explicitly preprovision a physical range of
> each AG and tracking all that, just reserve that number of arbitrarily
> located blocks from dm for each AG.
>
> The initial perag reservations can be populated at mount time,
> replenished as needed in a very similar way as what you describe, and
> 100% released back to the thin pool at unmount time. On top of that,
> there's no need to track physical preprovisioned ranges at all. Not just
> for allocation purposes, but also to avoid things like having to protect
> background trims from preprovisioned ranges of free space dedicated for
> metadata, etc.
That's all well and good, but reading further down the email the
breadth and depth of changes to filesystem and block device
behaviour to enable this are ... significant.
> > Further, managing shared pool exhaustion doesn't require a
> > reservation pool in the backing device and for the filesystems to
> > request space from it. Filesystems already have their own reserve
> > pools via pre-provisioning. If we want the filesystems to be able to
> > release that space back to the shared pool (e.g. because the shared
> > backing pool is critically short on space) then all we need is an
> > extension to FITRIM to tell the filesystem to also release internal
> > pre-provisioned reserves.
> >
> > Then the backing pool admin (person or automated daemon!) can simply
> > issue a trim on all the filesystems in the pool and space will be
> > returned. Then filesystems will ask for new pre-provisioned space
> > when they next need to ingest modifications, and the backing pool
> > can manage the new pre-provisioning space requests directly....
> >
>
> This is written as to imply that the reservation pool is some big
> complex thing, which makes me think there is some
> confusion/miscommunication.
No confusion, I'm just sceptical that it will work given my
experience trying to implement reservation based solutions multiple
different ways over the past decade. They've all failed because
they collapse under either the complexity explosion or space
overhead required to handle the worst case behavioural scenarios.
At one point I calculated the worst case reservation needed to ensure
log recovery will always succeed, ignoring write amplification,
was about 16x the size of the log. If I took write amplification for
dm-thinp having 64kB blocks and each inode hitting a different
cluster in its own dm-thinp block, that write amplification hit 64x.
So for recovering a 2GB log, if dm-thinp doesn't have a reserve of
well over 100GB of pool space, there is no guarantee that log
recovery will -always- succeed.
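For concreteness, the arithmetic behind those figures works out as below; the 16x and 64x multipliers are taken from the text as given, not derived here.

```c
/*
 * Back-of-envelope check of the worst-case reservation figures quoted
 * above: 2GB log, 16x base multiplier, 64x with write amplification.
 */
#include <assert.h>

static long long worst_case_reserve(long long log_bytes, long long mult)
{
	return log_bytes * mult;
}
```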
It's worst case numbers like this which made me conclude that
reservation based approaches cannot provide guarantees that ENOSPC
will never occur. The numbers are just too large when you start
considering journals that can hold a million dirty objects,
intent chains that might require modifying hundreds of metadata
blocks across a dozen transactions before they complete, etc.
OTOH, REQ_PROVISION makes this "log recovery needs new space to be
allocated" problem go away entirely. It provides a mechanism that
ensures log recovery does not consume any new space in the backing
pool as all the overwrites it performs are to previously provisioned
metadata.....
This is just one of the many reasons why I think the REQ_PROVISION
method is far better than reservations - it solves problems that
pure runtime reservations can't.
> It's basically just an in memory counter of
> space that is allocated out of a shared thin pool and is held in a
> specific thin volume while it is currently in use. The counter on the
> volume is managed indirectly by filesystem requests and/or direct
> operations on the volume (like dm snapshots).
>
> Sure, you could replace the counter and reservation interface with
> explicitly provisioned/trimmed LBA ranges that the fs can manage to
> provide -ENOSPC guarantees, but then the fs has to do those various
> things you've mentioned:
>
> - Provision those ranges in the fs and change allocation behavior
> accordingly.
This is relatively simple - most of the allocator functionality is
already there.
> - Do the background post-crash fitrim preprovision clean up thing.
We've already decided this is not needed.
> - Distinguish between trims that are intended to return preprovisioned
> space vs. those that come from userspace.
It's about ten lines of code in xfs_trim_extents() to do this. i.e.
the free space tree walk simply skips over free extents in the
metadata provisioned region based on a flag value.
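A hypothetical sketch of that flag-based skip, modelled in userspace: a free-extent walk that never hands metadata-provisioned extents to discard. The types, flag name, and callback are invented for illustration; the real xfs_trim_extents() is structured quite differently.

```c
/*
 * Sketch of a free-space walk that skips extents flagged as
 * metadata-provisioned so a userspace-initiated trim never discards
 * them. Invented types and flag; illustrative only.
 */
#include <assert.h>

#define FEX_METADATA_PROVISIONED	0x1

struct free_extent {
	unsigned long long start, len;
	unsigned int flags;
};

static unsigned long long trimmed;	/* blocks handed to discard */

static int discard_extent(unsigned long long start, unsigned long long len)
{
	(void)start;
	trimmed += len;
	return 0;
}

static int trim_free_extents(const struct free_extent *ext, int nr)
{
	int i, err;

	for (i = 0; i < nr; i++) {
		/* Keep free space reserved for metadata provisioning. */
		if (ext[i].flags & FEX_METADATA_PROVISIONED)
			continue;
		err = discard_extent(ext[i].start, ext[i].len);
		if (err)
			return err;
	}
	return 0;
}
```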
> - Have some daemon or whatever (?) responsible for communicating the
> need for trims in the fs to return space back to the pool.
Systems are already configured to run periodic fstrim passes to do
this via systemd units. And I'm pretty sure dm-thinp has a low space
notification to userspace (via dbus?) that is already used by
userspace agents to handle "near ENOSPC" events automatically.
> Then this still depends on changing how dm thin snapshots work and needs
> a way to deal with delayed allocation to actually guarantee -ENOSPC
> protection..?
I think you misunderstand: I'm not proposing to use REQ_PROVISION
for writes the filesystem does not guarantee will succeed. Never
have, I think it makes no sense at all. If the filesystem
can return ENOSPC for an unprovisioned user data write, then the
block device can too.
> > Hence I think if we get the basic REQ_PROVISION overwrite-in-place
> > guarantees defined and implemented as previously outlined, then we
> > don't need any special coordination between the fs and block devices
> > to avoid fatal ENOSPC issues with sparse and/or snapshot capable
> > block devices...
> >
>
> This all sounds like a good amount of coordination and unnecessary
> complexity to me. What I was thinking of as a next-phase approach
> (i.e. after the initial full-device-reservation phase) for a filesystem
> like XFS would be something like this.
>
> - Support a mount option for a configurable size metadata reservation
> pool (with sane/conservative default).
I want this all to work without the user having to be aware that
their filesystem is running on a sparse device.
> - The pool is populated at mount time, else the fs goes right into
> simulated -ENOSPC mode.
What are the rules of this mode?
Hmmmm.
Log recovery needs to be able to allocate new metadata (i.e. in
intent replay), so I'm guessing reservation is needed before log
recovery? But if pool reservation fails, how do we then safely
perform log recovery given the filesystem is in ENOSPC mode?
> - Thin pool reservation consumption is controlled by a flag on write
> bios that is managed by the fs (flag polarity TBD).
So we still need a bio flag to communicate "this IO consumes
reservation".
What are the semantics of this flag? What happens on submission
error? e.g. the bio is failed before it gets to the layer that
consumes it - how does the filesystem know that reservation was
consumed or not at completion?
How do we know when to set it for user data writes?
What happens if the device receives a bio with this flag but there
is no reservation remaining? e.g. the filesystem or device
accounting have got out of whack?
Hmmm. On that note, what about write amplification? Or should I call
it "reservation amplification". i.e. a 4kB bio with a "consume
reservation" flag might trigger a dm-region COW or allocation and
require 512kB of dm-thinp pool space to be allocated. How much
reservation actually gets consumed, and how do we reconcile the
differences in physical consumption vs reservation consumption?
> - All fs data writes are explicitly reserved up front in the write path.
> Delalloc maps to explicit reservation, overwrites are easy and just
> involve an explicit provision.
This is the first you've mentioned an "explicit provision"
operation. Is this like REQ_PROVISION, or something else?
This seems to imply that the ->iomap_begin method has to do
explicit provisioning callouts when we get a write that lands in an
IOMAP_MAPPED extent? Or something else?
Can you describe this mechanism in more detail?
> - Metadata writes are not reserved or provisioned at all. They allocate
> out of the thin pool on write (if needed), just as they do today. On
> an -ENOSPC metadata write error, the fs goes into simulated -ENOSPC mode
> and allows outstanding metadata writes to now use the bio flag to
> consume emergency reservation.
Okay. We need two pools in the backing device? The normal free space
pool, and an emergency reservation pool?
Without reading further, this implies that the filesystem is
reliant on the emergency reservation pool being large enough that
it can write any dirty metadata it has outstanding without ENOSPC
occurring. How does the size of this emergency pool get configured?
> So this means that metadata -ENOSPC protection is only as reliable as
> the size of the specified pool. This is by design, so the filesystem
> still does not have to track provisioning, allocation or overwrites of
> its own metadata usage. Users with metadata heavy workloads or who
> happen to be sensitive to -ENOSPC errors can be more aggressive with
> pool size, while other users might be able to get away with a smaller
> pool. Users who are super paranoid and want perfection can continue to
> reserve the entire device and pay for the extra storage.
Oh. Hand tuning. :(
> Users who are not sure can test their workload in an appropriate
> environment, collect some data/metrics on maximum outstanding dirty
> metadata, and then use that as a baseline/minimum pool size for reliable
> behavior going forward. This is also where something like Stratis can
> come in to generate this sort of information, make recommendations or
> implement heuristics (based on things like fs size, amount of RAM, for
> e.g.) to provide sane defaults based on use case. I.e., this is
> initially exposed as a userspace/tuning issue instead of a
> filesystem/dm-thin hard guarantee.
Which are the same things people have been complaining about for years.
> Finally, if you really want to get to that last step of maximally
> efficient and safe provisioning in the fs, implement a
> 'thinreserve=adaptive' mode in the fs that alters the acquisition and
> consumption of dm-thin reserved blocks to be adaptive in nature and
> promises to do its own usage throttling against outstanding
> reservation. I think this is the mode that most closely resembles your
> preprovisioned range mechanism.
>
> For example, adaptive mode could add the logic/complexity where you do
> the per-ag provision thing (just using reservation instead of physical
> ranges), change the transaction path to attempt to increase the
> reservation pool or go into -ENOSPC mode, and flag all writes to be
> satisfied from the reserve pool (because you've done the
> provision/reservation up front).
Ok, so why not just go straight to this model using REQ_PROVISION?
If we then want to move to a different "accounting only" model for
provisioning, we just change REQ_PROVISION?
But I still see the problem of write amplification accounting being
unsolved by the "filesystem accounting only" approach advocated
here. We have no idea when the backing device has snapshots taken,
we have no idea when a filesystem write IO actually consumes more
thinp blocks than filesystem blocks, etc. How does the filesystem
level reservation pool address these problems?
> Thoughts on any of the above?
I'd say it went wrong at the requirements stage, resulting in an
overly complex, over-engineered solution.
> One general tradeoff with using reservations vs. preprovisioning is that
> the latter can just use the provision/trim primitives to alloc/free LBA
> ranges. My thought on that is those primitives could possibly be
> modified to do the same sort of things with reservation as for physical
> allocations. That seems fairly easy to do with bio op flags/modifiers,
> though one thing I'm not sure about is how to submit a provision bio to
> request a certain amount of location-agnostic blocks. I'd have to
> investigate that more.
Sure, if the constrained LBA space aspect of the REQ_PROVISION
implementation causes issues, then we see if we can optimise away
the fixed LBA space requirement.
--
Dave Chinner
david@fromorbit.com
* Re: [PATCH v7 0/5] Introduce provisioning primitives
2023-05-26 9:37 ` Dave Chinner
@ 2023-05-26 15:47 ` Brian Foster
[not found] ` <CAJ0trDbspRaDKzTzTjFdPHdB9n0Q9unfu1cEk8giTWoNu3jP8g@mail.gmail.com>
1 sibling, 0 replies; 52+ messages in thread
From: Brian Foster @ 2023-05-26 15:47 UTC (permalink / raw)
To: Dave Chinner
Cc: Mike Snitzer, Jens Axboe, Christoph Hellwig, Theodore Ts'o,
Sarthak Kukreti, dm-devel, Michael S. Tsirkin, Darrick J. Wong,
Jason Wang, Bart Van Assche, linux-kernel, linux-block,
Joe Thornber, Andreas Dilger, Stefan Hajnoczi, linux-fsdevel,
linux-ext4, Alasdair Kergon
On Fri, May 26, 2023 at 07:37:43PM +1000, Dave Chinner wrote:
> On Thu, May 25, 2023 at 12:19:47PM -0400, Brian Foster wrote:
> > On Wed, May 24, 2023 at 10:40:34AM +1000, Dave Chinner wrote:
> > > On Tue, May 23, 2023 at 11:26:18AM -0400, Mike Snitzer wrote:
> > > > On Tue, May 23 2023 at 10:05P -0400, Brian Foster <bfoster@redhat.com> wrote:
> > > > > On Mon, May 22, 2023 at 02:27:57PM -0400, Mike Snitzer wrote:
> > > > > ... since I also happen to think there is a potentially interesting
> > > > > development path to make this sort of reserve pool configurable in terms
> > > > > of size and active/inactive state, which would allow the fs to use an
> > > > > emergency pool scheme for managing metadata provisioning and not have to
> > > > > track and provision individual metadata buffers at all (dealing with
> > > > > user data is much easier to provision explicitly). So the space
> > > > > inefficiency thing is potentially just a tradeoff for simplicity, and
> > > > > filesystems that want more granularity for better behavior could achieve
> > > > > that with more work. Filesystems that don't would be free to rely on the
> > > > > simple/basic mechanism provided by dm-thin and still have basic -ENOSPC
> > > > > protection with very minimal changes.
> > > > >
> > > > > That's getting too far into the weeds on the future bits, though. This
> > > > > is essentially 99% a dm-thin approach, so I'm mainly curious if there's
> > > > > sufficient interest in this sort of "reserve mode" approach to try and
> > > > > clean it up further and have dm guys look at it, or if you guys see any
> > > > > obvious issues in what it does that makes it potentially problematic, or
> > > > > if you would just prefer to go down the path described above...
> > > >
> > > > The model that Dave detailed, which builds on REQ_PROVISION and is
> > > > sticky (by provisioning same blocks for snapshot) seems more useful to
> > > > me because it is quite precise. That said, it doesn't account for
> > > > hard requirements that _all_ blocks will always succeed.
> > >
> > > Hmmm. Maybe I'm misunderstanding the "reserve pool" context here,
> > > but I don't think we'd ever need a hard guarantee from the block
> > > device that every write bio issued from the filesystem will succeed
> > > without ENOSPC.
> > >
> >
> > The bigger picture goal that I didn't get into in my previous mail is
> > the "full device" reservation model is intended to be a simple, crude
> > reference implementation that can be enabled for any arbitrary thin
> > volume consumer (filesystem or application). The idea is to build that
> > on a simple enough reservation mechanism that any such consumer could
> > override it based on its own operational model. The goal is to guarantee
> > that a particular filesystem never receives -ENOSPC from dm-thin on
> > writes, but the first phase of implementing that is to simply guarantee
> > every block is writeable.
> >
> > As a specific filesystem is able to more explicitly provision its own
> > allocations in a way that it can guarantee to return -ENOSPC from
> > dm-thin up front (rather than at write bio time), it can reduce the need
> > for the amount of reservation required, ultimately to zero if that
> > filesystem provides the ability to pre-provision all of its writes to
> > storage in some way or another.
> >
> > I think for filesystems with complex metadata management like XFS, it's
> > not very realistic to expect explicit 1-1 provisioning for all metadata
> > changes on a per-transaction basis in the same way that can (fairly
> > easily) be done for data, which means a pool mechanism is probably still
> > needed for the metadata class of writes.
>
> I'm trying to avoid the need for 1-1 provisioning and the need for an
> accounting-based reservation pool approach. I've tried the
> reservation pool thing several times, and they've all collapsed
> under the complexity of behaviour guarantees under worst case write
> amplification situations.
>
> The whole point of the LBA provisioning approach is that it
> completely avoids the need to care about write amplification because
> the underlying device guarantees any write to a LBA that is
> provisioned will succeed. It takes care of the write amplification
> problem for us, and we can make it even easier for the backing
> device by aligning LBA range provision requests to device region
> sizes.
>
> > > If the block device can provide a guarantee that a provisioned LBA
> > > range is always writable, then everything else is a filesystem level
> > > optimisation problem and we don't have to involve the block device
> > > in any way. All we need is a flag we can read out of the bdev at
> > > mount time to determine if the filesystem should be operating with
> > > LBA provisioning enabled...
> > >
> > > e.g. If we need to "pre-provision" a chunk of the LBA space for
> > > filesystem metadata, we can do that ahead of time and track the
> > > pre-provisioned range(s) in the filesystem itself.
> > >
> > > In XFS, That could be as simple as having small chunks of each AG
> > > reserved to metadata (e.g. start with the first 100MB) and limiting
> > > all metadata allocation free space searches to that specific block
> > > range. When we run low on that space, we pre-provision another 100MB
> > > chunk and then allocate all metadata out of that new range. If we
> > > start getting ENOSPC to pre-provisioning, then we reduce the size of
> > > the regions and log low space warnings to userspace. If we can't
> > > pre-provision any space at all and we've completely run out, we
> > > simply declare ENOSPC for all incoming operations that require
> > > metadata allocation until pre-provisioning succeeds again.
> > >
> >
> > The more interesting aspect of this is not so much how space is
> > provisioned and allocated, but how the filesystem is going to consume
> > that space in a way that guarantees -ENOSPC is provided up front before
> > userspace is allowed to make modifications.
>
> Yeah, that's trivial with REQ_PROVISION.
>
> If, at transaction reservation time, we don't have enough
> provisioned metadata space available for the potential allocations
> we'll need to make, we kick off provisioning work and wait for more
> to become available. If that fails and none is available, we'll get
> an enospc error right there, same as if the filesystem itself has no
> blocks available for allocation.
>
> This is no different to, say, having xfs_create() fail reservation
> because of ENOSPC, then calling xfs_flush_inodes() to kick off an inode
> cache walk to trim away all the unused post-eof allocations in
> memory to free up some space we can use. When that completes,
> we try the reservation again.
>
> There's no new behaviours we need to introduce here - it's just
> replication of existing behaviours and infrastructure.
>
Yes, this is just context. What I'm trying to say is the semantics for
this aspect would be the same irrespective of "guaranteed writeable
space" being implemented as a reservation or preprovisioned LBA range.
I.e., it's a limited resource that has to be managed in a way to provide
specific user visible behavior.
> > You didn't really touch on
> > that here, so I'm going to assume we'd have something like a perag
> > counter of how many free blocks currently live in preprovisioned ranges,
> > and then an fs-wide total somewhere so a transaction has the ability to
> > consume these blocks at trans reservation time, the fs knows when to
> > preprovision more space (or go into -ENOSPC mode), etc.
>
> Sure, something like that. Those are all implementation details, and
> not really that complex to implement and is largely replication of
> reservation infrastructure we already have.
>
Ack.
> > Some accounting of that nature is necessary here in order to prevent the
> > filesystem from ever writing to unprovisioned space. So what I was
> > envisioning is rather than explicitly preprovision a physical range of
> > each AG and tracking all that, just reserve that number of arbitrarily
> > located blocks from dm for each AG.
> >
> > The initial perag reservations can be populated at mount time,
> > replenished as needed in a very similar way as what you describe, and
> > 100% released back to the thin pool at unmount time. On top of that,
> > there's no need to track physical preprovisioned ranges at all. Not just
> > for allocation purposes, but also to avoid things like having to protect
> > background trims from preprovisioned ranges of free space dedicated for
> > metadata, etc.
>
> That's all well and good, but reading further down the email the
> breadth and depth of changes to filesystem and block device
> behaviour to enable this are ... significant.
>
> > > Further, managing shared pool exhaustion doesn't require a
> > > reservation pool in the backing device and for the filesystems to
> > > request space from it. Filesystems already have their own reserve
> > > pools via pre-provisioning. If we want the filesystems to be able to
> > > release that space back to the shared pool (e.g. because the shared
> > > backing pool is critically short on space) then all we need is an
> > > extension to FITRIM to tell the filesystem to also release internal
> > > pre-provisioned reserves.
> > >
> > > Then the backing pool admin (person or automated daemon!) can simply
> > > issue a trim on all the filesystems in the pool and space will be
> > > returned. Then filesystems will ask for new pre-provisioned space
> > > when they next need to ingest modifications, and the backing pool
> > > can manage the new pre-provisioning space requests directly....
> > >
> >
> > This is written as to imply that the reservation pool is some big
> > complex thing, which makes me think there is some
> > confusion/miscommunication.
>
> No confusion, I'm just sceptical that it will work given my
> experience trying to implement reservation based solutions multiple
> different ways over the past decade. They've all failed because
> they collapse under either the complexity explosion or space
> overhead required to handle the worst case behavioural scenarios.
>
> At one point I calculated the worst case reservation needed to ensure
> log recovery will always succeed, ignoring write amplification,
> was about 16x the size of the log. If I took write amplification for
> dm-thinp having 64kB blocks and each inode hitting a different
> cluster in its own dm-thinp block, that write amplification hit 64x.
>
Ok. Can you give some examples of operations that lead to this worst
case behavior? It sounds like you're talking about inode chunk intent
initialization or some such, but I'd like to be sure I understand.
> So for recovering a 2GB log, if dm-thinp doesn't have a reserve of
> well over 100GB of pool space, there is no guarantee that log
> recovery will -always- succeed.
>
> It's worst case numbers like this which made me conclude that
> reservation based approaches cannot provide guarantees that ENOSPC
> will never occur. The numbers are just too large when you start
> considering journals that can hold a million dirty objects,
> intent chains that might require modifying hundreds of metadata
> blocks across a dozen transactions before they complete, etc.
>
> OTOH, REQ_PROVISION makes this "log recovery needs new space to be
> allocated" problem go away entirely. It provides a mechanism that
> ensures log recovery does not consume any new space in the backing
> pool as all the overwrites it performs are to previously provisioned
> metadata.....
>
Ah, I see. So this relies on the change in behavior to dm-thin
snapshots to preserve overwritability of previously provisioned
metadata, indirectly managing log recovery -> metadata write
amplification. This is useful context and helps me understand the
intent of that suggestion. I also think it calls out some of the
disconnect..
> This is just one of the many reasons why I think the REQ_PROVISION
> method is far better than reservations - it solves problems that
> pure runtime reservations can't.
>
> > It's basically just an in memory counter of
> > space that is allocated out of a shared thin pool and is held in a
> > specific thin volume while it is currently in use. The counter on the
> > volume is managed indirectly by filesystem requests and/or direct
> > operations on the volume (like dm snapshots).
> >
> > Sure, you could replace the counter and reservation interface with
> > explicitly provisioned/trimmed LBA ranges that the fs can manage to
> > provide -ENOSPC guarantees, but then the fs has to do those various
> > things you've mentioned:
> >
> > - Provision those ranges in the fs and change allocation behavior
> > accordingly.
>
> This is relatively simple - most of the allocator functionality is
> already there.
>
> > - Do the background post-crash fitrim preprovision clean up thing.
>
> We've already decided this is not needed.
>
> > - Distinguish between trims that are intended to return preprovisioned
> > space vs. those that come from userspace.
>
> It's about ten lines of code in xfs_trim_extents() to do this. i.e.
> the free space tree walk simply skips over free extents in the
> metadata provisioned region based on a flag value.
>
> > - Have some daemon or whatever (?) responsible for communicating the
> > need for trims in the fs to return space back to the pool.
>
> Systems are already configured to run periodic fstrim passes to do
> this via systemd units. And I'm pretty sure dm-thinp has a low space
> notification to userspace (via dbus?) that is already used by
> userspace agents to handle "near ENOSPC" events automatically.
>
Yeah, Ok. To be clear, I'm not trying to suggest any of these particular
things are complex or that the whole thing is intractable or anything
like that. I'm pretty sure I understand how this can all be made to
work, at least for metadata. But I am saying this makes a bunch of
customization changes that could lead to a very XFS centric approach,
may be more work than necessary, and hasn't been described in a way
that explains how it actually solves (or helps solve) the broader
-ENOSPC problem.
> > Then this still depends on changing how dm thin snapshots work and needs
> > a way to deal with delayed allocation to actually guarantee -ENOSPC
> > protection..?
>
> I think you misunderstand: I'm not proposing to use REQ_PROVISION
> for writes the filesystem does not guarantee will succeed. Never
> have, I think it makes no sense at all. If the filesystem
> can return ENOSPC for an unprovisioned user data write, then the
> block device can too.
>
Well, yes.. that's why I was asking. :) I still don't parse what you're
saying here.
Is this intended to prevent -ENOSPC from dm-thin for data writes, or is
that an exercise for the reader?
> > > Hence I think if we get the basic REQ_PROVISION overwrite-in-place
> > > guarantees defined and implemented as previously outlined, then we
> > > don't need any special coordination between the fs and block devices
> > > to avoid fatal ENOSPC issues with sparse and/or snapshot capable
> > > block devices...
> > >
> >
> > This all sounds like a good amount of coordination and unnecessary
> > complexity to me. What I was thinking as a next phase (i.e. after
> > initial phase full device reservation) approach for a filesystem like
> > XFS would be something like this.
> >
> > - Support a mount option for a configurable size metadata reservation
> > pool (with sane/conservative default).
>
> I want this all to work without the user having to be aware that
> their filesystem is running on a sparse device.
>
Reservation (or provision) support can certainly be autodetected.
> > - The pool is populated at mount time, else the fs goes right into
> > simulated -ENOSPC mode.
>
> What are the rules of this mode?
>
Earlier you mentioned that the filesystem would "declare -ENOSPC" when
preprovisioning starts to fail. The rules here would be exactly the
same.
> Hmmmm.
>
> Log recovery needs to be able to allocate new metadata (i.e. in
> intent replay), so I'm guessing reservation is needed before log
> recovery? But if pool reservation fails, how do we then safely
> perform log recovery given the filesystem is in ENOSPC mode?
>
> > - Thin pool reservation consumption is controlled by a flag on write
> > bios that is managed by the fs (flag polarity TBD).
>
> So we still need a bio flag to communicate "this IO consumes
> reservation".
>
> What are the semantics of this flag? What happens on submission
> error? e.g. the bio is failed before it gets to the layer that
> consumes it - how does the filesystem know that reservation was
> consumed or not at completion?
>
The semantics are to use reservation or not. If the flag is set, it's
implied that reservation exists, and if it doesn't, that's a bug. If
reservation is not enabled, then dm-thin processes those writes exactly
as it does today (allocates out of the pool when necessary, fails
otherwise).
> How do we know when to set it for user data writes?
>
User data is always reserved before writes are accepted, so reservation
is always enabled for user data write bios.
> What happens if the device receives a bio with this flag but there
> is no reservation remaining? e.g. the filesystem or device
> accounting have got out of whack?
>
That's a bug. Almost the same as if the filesystem were to allow a
delalloc write that can't ultimately allocate because in-core counters
become inconsistent with free space btrees (not like that hasn't
happened before ;).
Let's not get too into the weeds here, as if to imply coding errors
translate to design flaws. I'd say it probably should warn and fall
back to normal pool allocation.
> Hmmm. On that note, what about write amplification? Or should I call
> it "reservation amplification". i.e. a 4kB bio with a "consume
> reservation" flag might trigger a dm-region COW or allocation and
> require 512kB of dm-thinp pool space to be allocated. How much
> reservation actually gets consumed, and how do we reconcile the
> differences in physical consumption vs reservation consumption?
>
The filesystem has no concept of the amount of reservation. Only whether
outstanding writes have been reserved or not. In this specific reference
implementation, all data writes are reserved and metadata writes are not
(until an -ENOSPC error happens and the fs decides to "declare -ENOSPC"
based on underlying volume state).
> > - All fs data writes are explicitly reserved up front in the write path.
> > Delalloc maps to explicit reservation, overwrites are easy and just
> > involve an explicit provision.
>
> This is the first you've mentioned an "explicit provision"
> operation. Is this like REQ_PROVISION, or something else?
>
> This seems to imply that the ->iomap_begin method has to do
> explicit provisioning callouts when we get a write that lands in an
> IOMAP_MAPPED extent? Or something else?
>
> Can you describe this mechanism in more detail?
>
So something I've hacked up from the older prototype is to essentially
implement a simple form of a REQ_PROVISION|REQ_RESERVE type operation.
You can think of it like REQ_PROVISION as implemented by this series,
but it doesn't actually do the COW breaking and allocation and whatnot
right away. Instead, it reserves however many blocks out of the pool
might be required to guarantee subsequent writes to the specified region
are guaranteed not to fail with -ENOSPC.
(Note that the prototype isn't currently using REQ_PROVISION. It's just
a function call at the moment. I'm just explaining the concept.)
So the idea for user data in general is something like:
iomap looks up an extent, does a "reserve provision" over it based on
the size of the write, etc. If that succeeds, then the write can proceed
to dirty pages with a guarantee that dm-thin will not -ENOSPC at
writeback time.
If the extent is a hole, then delalloc translates to location agnostic
reservation that is eventually translated to a "reserve provision" at
filesystem allocation time. Note that this does introduce an aspect of
reservation amplification due to block size differences, but this was
already addressed by the older prototype. The same 'flush inodes on
-ENOSPC' mechanism you refer to above provides a feedback mechanism to
allow outstanding reservations to flush and prevent any premature error
problems.
And that can be optimized further in various ways. For example, to
simply map outstanding delalloc extents in the fs and do the "reserve
provision" across the ultimate LBA ranges to release overprovisioned
reserves, while still deferring writeback to later. A shrinker could be
used to allow the thin pool to signal lower space conditions to active
volumes to smooth out behavior rather than waiting for an actual
-ENOSPC, etc. etc.
> > - Metadata writes are not reserved or provisioned at all. They allocate
> > out of the thin pool on write (if needed), just as they do today. On
> > an -ENOSPC metadata write error, the fs goes into simulated -ENOSPC mode
> > and allows outstanding metadata writes to now use the bio flag to
> > consume emergency reservation.
>
> Okay. We need two pools in the backing device? The normal free space
> pool, and an emergency reservation pool?
>
Err not exactly.. it's really just selective use of the bio flag that
allows reserve consumption. Always enabled on data, never on metadata,
until -ENOSPC error and the fs decides to open reserves for metadata and
attempt to allow a full quiesce.
> Without reading further, this implies that the filesystem is
> reliant on the emergency reservation pool being large enough that
> it can write any dirty metadata it has outstanding without ENOSPC
> occuring. How does the size of this emergency pool get configured?
>
> > So this means that metadata -ENOSPC protection is only as reliable as
> > the size of the specified pool. This is by design, so the filesystem
> > still does not have to track provisioning, allocation or overwrites of
> > its own metadata usage. Users with metadata heavy workloads or who
> > happen to be sensitive to -ENOSPC errors can be more aggressive with
> > pool size, while other users might be able to get away with a smaller
> > pool. Users who are super paranoid and want perfection can continue to
> > reserve the entire device and pay for the extra storage.
>
> Oh. Hand tuning. :(
>
Yes and no.. from the fs perspective it's hand tuning. From a user
perspective it can be if desired, I guess, but it's really intended to
be done by the management software that already exists with intent to
help manage this problem. To put it another way, any complex user of
thin provisioning who is concerned about this problem is already doing
some degree of tuning here to try and prevent it.
Also, an alternative to what I describe above could be for a filesystem
to implement thinreserve=N mode with a throttle to best ensure -ENOSPC
reliability at the cost of performance. It's still a tunable, but maybe
easier to turn into a heuristic. Not sure, just a random thought.
> > Users who are not sure can test their workload in an appropriate
> > environment, collect some data/metrics on maximum outstanding dirty
> > metadata, and then use that as a baseline/minimum pool size for reliable
> > behavior going forward. This is also where something like Stratis can
> > come in to generate this sort of information, make recommendations or
> > implement heuristics (based on things like fs size, amount of RAM, for
> > e.g.) to provide sane defaults based on use case. I.e., this is
> > initially exposed as a userspace/tuning issue instead of a
> > filesystem/dm-thin hard guarantee.
>
> Which are the same things people have been complaining about for years.
>
Priorities and progress. I don't think the fact that this isn't the
absolute perfect, most easily usable, completely efficient solution
right off the bat is a very good argument that it's not a feasible
approach in general. This is why I'm trying to describe the intended
progression here. Users are already dealing with this sort of thing
through Stratis, and this is a step to at least try to make things
better.
And really, the same could be said for preprovisioning until/unless it's
able to fully guarantee prevention of -ENOSPC errors for data and
metadata. That is exactly the same sort of thing "people have been
complaining about for years."
> > Finally, if you really want to get to that last step of maximally
> > efficient and safe provisioning in the fs, implement a
> > 'thinreserve=adaptive' mode in the fs that alters the acquisition and
> > consumption of dm-thin reserved blocks to be adaptive in nature and
> > promises to do it's own usage throttling against outstanding
> > reservation. I think this is the mode that most closely resembles your
> > preprovisioned range mechanism.
> >
> > For example, adaptive mode could add the logic/complexity where you do
> > the per-ag provision thing (just using reservation instead of physical
> > ranges), change the transaction path to attempt to increase the
> > reservation pool or go into -ENOSPC mode, and flag all writes to be
> > satisfied from the reserve pool (because you've done the
> > provision/reservation up front).
>
> Ok, so why not just go straight to this model using REQ_PROVISION?
>
A couple reasons.. one is I've also been trying to hack around this
problem for a while and have yet to see a full solution that can't
either be completely broken or just driven to impracticality in terms
of performance or space consumption.
We had an offline discussion with some of the Stratis and dm folks
fairly recently where they explained what Stratis is doing now to
mitigate this. Almost everybody agrees this approach stinks, which is
why I expected a similar first reaction to my "phase 1 full reservation"
model. The thought process there, however, is that rather than continue
to try and hack up various invasive solutions to provide such a simple
user visible behavior and ultimately not making progress, why not take
what they're doing and users are apparently already using and work to
make it better?
IOW, when thinking about the prospect of "hand tuning" above, I think
the more appropriate way to look at it is not that we're providing some
full end-user solution right off the bat. Instead, we're taking Stratis'
already existing "no overprovision" mode and starting to improve on it.
Step one lifts it into the kernel to make it dynamic (runtime
reservation vs provision time no overprovision enforcement), next steps
focus on making it more efficient while preserving the -ENOSPC safety
guarantee. No end user really needs to interact with it directly
until/unless filesystems grow the super-smart ability to do everything
automagically.
So it could very well be that this all remains an experimental feature
until "adaptive mode" can be made to work, but at least we have the
ability to solve the problem incrementally and without permanent
changes. The ability to just rip it out without having made any
permanent changes to filesystems or thin pool metadata, should it just
happen to fail spectacularly, is also a feature IMO. It could also be
the case that the simple sizing mechanism works well enough, and
Stratis is able to make good enough recommendations that most users
are satisfied, and there's really no need for the levels of complexity
we're talking about for adaptive mode (or preprovisioning) at all.
> If we then want to move to a different "accounting only" model for
> provisioning, we just change REQ_PROVISION?
>
AFAIU this relies on some permanent changes to dm that are not
necessarily trivial to undo..? If so, I think it's actually wiser to
move in the opposite direction. If reservation proves too broad and
expensive due to things like amplification, then move toward physical
provisioning and permanent snapshot changes to address that problem.
The benefit of this is that the reservation approach solves the
fundamental problem from the start, even if the implementation is
ultimately too impractical to be useful, so this mitigates the risk of
getting too far down the road with permanent changes to disk formats and
whatnot only to find the solution doesn't ultimately work.
> But I still see the problem of write amplification accounting being
> unsolved by the "filesystem accounting only" approach advocated
> here. We have no idea when the backing device has snapshots taken,
> we have no idea when a filesystem write IO actually consumes more
> thinp blocks than filesystem blocks, etc. How does the filesystem
> level reservation pool address these problems?
>
This is a fair concern in general, but as mentioned above, I think still
highlights a misunderstanding with the reserve metadata pool idea.
The key point I think is that metadata writes are not actually reserved.
The reserve pool exists solely to handle the -ENOSPC mode transition,
not to continuously supply ongoing metadata transactions. This means
write amplification is less of a concern: every successful FSB sized
write allocates DMB (device mapper block) sized blocks out of the thin
pool, further reducing the odds that subsequent writes to any
overlapping FSB blocks will ever require reservation for the current
active cycle of the log.
A snapshot could happen at any point, but dm-thin snapshots already call
into freeze infrastructure, which already quiesces the log. After that
point the game starts all over: all overwrites require allocation, and
the pool has to be sized large enough to accommodate the filesystem
being able to quiesce in the event of -ENOSPC, particularly if
snapshots are in use.
So I'm not saying write amplification is not a problem. I think things like
crashing a filesystem, doing a snapshot, then running recovery could
exacerbate this problem, for example. But that's another corner case
that I don't necessarily think discredits the idea. For example, if XFS
really wanted to, it could add another pass to log recovery to do
"reserve provisions" on affected metadata before recovering anything, or
just scan and reserve provision the entire metadata allocated portion of
the fs, or refuse to proceed and require full reservation for any time
the log is dirty, etc. etc.
I think this is something that could use some examples (re: my earlier
question) to help work through whether the pool approach is sane, or if
the size would just always be too big. If not, you could still decide
that the configurable pool approach just doesn't work at all for XFS,
but track the outstanding metadata block usage somewhere (or estimate
based on existing ag buffer btree block counters), reserve that number
of blocks at mount time, and use that to guarantee all metadata block
overwrites will always succeed in the exact same way a preprovision
scheme would. A snapshot while mounted would bump the volume side
reservation appropriately or fail.
I suspect that trades off mount time performance for better snapshot
behavior, but again is just another handwavy option on the table for
consideration that doesn't preclude other fs' from possibly doing
something more simple.
Brian
> > Thoughts on any of the above?
>
> I'd say it went wrong at the requirements stage, resulting in an
> overly complex, over-engineered solution.
>
> > One general tradeoff with using reservations vs. preprovisioning is
> > that the latter can just use the provision/trim primitives to alloc/free LBA
> > ranges. My thought on that is those primitives could possibly be
> > modified to do the same sort of things with reservation as for physical
> > allocations. That seems fairly easy to do with bio op flags/modifiers,
> though one thing I'm not sure about is how to submit a provision bio to
> request a certain amount of location-agnostic blocks. I'd have to
> > investigate that more.
>
> Sure, if the constrained LBA space aspect of the REQ_PROVISION
> implementation causes issues, then we see if we can optimise away
> the fixed LBA space requirement.
>
>
> --
> Dave Chinner
> david@fromorbit.com
>
^ permalink raw reply [flat|nested] 52+ messages in thread
[parent not found: <CAJ0trDbspRaDKzTzTjFdPHdB9n0Q9unfu1cEk8giTWoNu3jP8g@mail.gmail.com>]
* Re: [PATCH v7 0/5] Introduce provisioning primitives
[not found] ` <CAJ0trDbspRaDKzTzTjFdPHdB9n0Q9unfu1cEk8giTWoNu3jP8g@mail.gmail.com>
@ 2023-05-26 23:45 ` Dave Chinner
[not found] ` <CAJ0trDZJQwvAzngZLBJ1hB0XkQ1HRHQOdNQNTw9nK-U5i-0bLA@mail.gmail.com>
0 siblings, 1 reply; 52+ messages in thread
From: Dave Chinner @ 2023-05-26 23:45 UTC (permalink / raw)
To: Joe Thornber
Cc: Brian Foster, Mike Snitzer, Jens Axboe, Christoph Hellwig,
Theodore Ts'o, Sarthak Kukreti, dm-devel, Michael S. Tsirkin,
Darrick J. Wong, Jason Wang, Bart Van Assche, linux-kernel,
linux-block, Joe Thornber, Andreas Dilger, Stefan Hajnoczi,
linux-fsdevel, linux-ext4, Alasdair Kergon
On Fri, May 26, 2023 at 12:04:02PM +0100, Joe Thornber wrote:
> Here's my take:
>
> I don't see why the filesystem cares if thinp is doing a reservation or
> provisioning under the hood. All that matters is that a future write
> to that region will be honoured (barring device failure etc.).
>
> I agree that the reservation/force mapped status needs to be inherited
> by snapshots.
>
>
> One of the few strengths of thinp is the performance of taking a snapshot.
> Most snapshots created are never activated. Many other snapshots are
> only alive for a brief period, and used read-only. eg, blk-archive
> (https://github.com/jthornber/blk-archive) uses snapshots to do very
> fast incremental backups. As such I'm strongly against any scheme that
> requires provisioning as part of the snapshot operation.
>
> Hank and I are in the middle of the range tree work which requires a
> metadata change. So now is a convenient time to piggyback other
> metadata changes to support reservations.
>
>
> Given the above this is what I suggest:
>
> 1) We have an api (ioctl, bio flag, whatever) that lets you
> reserve/guarantee a region:
>
> int reserve_region(dev, sector_t begin, sector_t end);
A C-based interface is not sufficient because the layer that must do
provisioning is not guaranteed to be directly under the filesystem.
We must be able to propagate the request down to the layers that
need to provision storage, and that includes hardware devices.
e.g. dm-thin would have to issue REQ_PROVISION on the LBA ranges it
allocates in its backing device to guarantee that the provisioned
LBA range it allocates is also fully provisioned by the storage
below it....
> This api should be used minimally, eg, critical FS metadata only.
Keep in mind that "critical FS metadata" in this context is any
metadata which could cause the filesystem to hang or enter a global
error state if an unexpected ENOSPC error occurs during a metadata
write IO.
Which, in pretty much every journalling filesystem, equates to all
metadata in the filesystem. For a typical root filesystem, that
might be in the range of 1-200MB (depending on journal size).
For larger filesystems with lots of files in them, it will be in the
range of GBs of space.
Plan for having to support tens of GBs of provisioned space in
filesystems, not tens of MBs....
[snip]
> Now this is a lot of work. As well as the kernel changes we'll need to
> update the userland tools: thin_check, thin_ls, thin_metadata_unpack,
> thin_rmap, thin_delta, thin_metadata_pack, thin_repair, thin_trim,
> thin_dump, thin_metadata_size, thin_restore. Are we confident that we
> have buy in from the FS teams that this will be widely adopted? Are users
> asking for this? I really don't want to do 6 months of work for nothing.
I think there's a 2-3 solid days of coding to fully implement
REQ_PROVISION support in XFS, including userspace tool support.
Maybe a couple of weeks more to flush the bugs out before it's
largely ready to go.
So if there's buy in from the block layer and DM people for
REQ_PROVISION as described, then I'll definitely have XFS support
ready for you to test whenever dm-thinp is ready to go.
I can't speak for other filesystems, I suspect the only one we care
about is ext4. btrfs and f2fs don't need dm-thinp and there aren't
any other filesystems that are used in production on top of
dm-thinp, so I think only XFS and ext4 matter at this point in time.
I suspect that ext4 would be fairly easy to add support for as well.
ext4 has a lot more fixed-place metadata than XFS, so much more
of its metadata is covered by mkfs-time provisioning. Limiting
dynamic metadata to specific fully provisioned block groups and
provisioning new block groups for metadata when they are near full
would be equivalent to how I plan to provision metadata space in
XFS. Hence the implementation for ext4 looks to be broadly similar
in scope and complexity as XFS....
-Dave.
--
Dave Chinner
david@fromorbit.com