From: Hannes Reinecke <hare@suse.de>
To: Damien Le Moal <dlemoal@kernel.org>,
linux-block@vger.kernel.org, Jens Axboe <axboe@kernel.dk>,
linux-scsi@vger.kernel.org,
"Martin K . Petersen" <martin.petersen@oracle.com>,
dm-devel@lists.linux.dev, Mike Snitzer <snitzer@redhat.com>,
linux-nvme@lists.infradead.org, Keith Busch <kbusch@kernel.org>,
Christoph Hellwig <hch@lst.de>
Subject: Re: [PATCH v5 07/28] block: Introduce zone write plugging
Date: Wed, 3 Apr 2024 17:28:06 +0200
Message-ID: <5c89baf3-68f1-4e91-8fa9-efa183297ca8@suse.de>
In-Reply-To: <20240403084247.856481-8-dlemoal@kernel.org>
On 4/3/24 10:42, Damien Le Moal wrote:
> Zone write plugging implements a per-zone "plug" for write operations
> to control the submission and execution order of write operations to
> sequential write required zones of a zoned block device. Per-zone
> plugging guarantees that at any time there is at most one write
> request per zone being executed. This mechanism is intended to replace
> zone write locking which implements a similar per-zone write throttling
> at the scheduler level, but is implemented only by mq-deadline.
>
> Unlike zone write locking which operates on requests, zone write
> plugging operates on BIOs. A zone write plug is simply a BIO list that
> is atomically manipulated using a spinlock and a kblockd submission
> work. A write BIO to a zone is "plugged" to delay its execution if a
> write BIO for the same zone was already issued, that is, if a write
> request for the same zone is being executed. The next plugged BIO is
> unplugged and issued once the write request completes.
>
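To make the plugging rule above concrete, here is a minimal single-threaded
userspace sketch of it. This is only a model under stated assumptions: all
model_* names are hypothetical and not part of the patch, and the spinlock
and kblockd work used by the real code are left out entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model; none of these names are kernel APIs. */
struct model_bio {
	int id;
	struct model_bio *next;
};

struct model_zone_wplug {
	bool write_in_flight;    /* a write to this zone is being executed */
	struct model_bio *head;  /* FIFO list of plugged BIOs */
	struct model_bio *tail;
};

/*
 * Returns false if the caller may issue the BIO right away, true if the
 * BIO was plugged (queued) because a write to the zone is in flight.
 */
bool model_plug_bio(struct model_zone_wplug *zwp, struct model_bio *bio)
{
	if (!zwp->write_in_flight) {
		zwp->write_in_flight = true;
		return false;
	}

	bio->next = NULL;
	if (zwp->tail)
		zwp->tail->next = bio;
	else
		zwp->head = bio;
	zwp->tail = bio;
	return true;
}

/*
 * Called when the in-flight write completes. Returns the next plugged BIO
 * to issue, or NULL if the list is empty and the zone becomes idle.
 */
struct model_bio *model_write_done(struct model_zone_wplug *zwp)
{
	struct model_bio *next = zwp->head;

	assert(zwp->write_in_flight);
	if (!next) {
		zwp->write_in_flight = false;
		return NULL;
	}

	zwp->head = next->next;
	if (!zwp->head)
		zwp->tail = NULL;
	next->next = NULL;
	/* write_in_flight stays true: the returned BIO is now executing. */
	return next;
}
```

With three writes to the same zone, the first is issued immediately and the
other two are handed back one at a time, in submission order, as each
completion arrives.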
> This mechanism has the following benefits:
> - Untangle zone write ordering from block IO schedulers. This allows
> removing the restriction on using mq-deadline for writing to zoned
> block devices. Any block IO scheduler, including "none" can be used.
> - Zone write plugging operates on BIOs instead of requests. Plugged
> BIOs waiting for execution thus do not hold scheduling tags and thus
> are not preventing other BIOs from executing (reads or writes to
> other zones). Depending on the workload, this can significantly
> improve the device use (higher queue depth operation) and
> performance.
> - Both blk-mq (request based) zoned devices and BIO-based zoned devices
> (e.g. device mapper) can use zone write plugging. It is mandatory
> for the former but optional for the latter. BIO-based drivers can
> use zone write plugging to implement write ordering guarantees, or
> the drivers can implement their own if needed.
> - The code is less invasive in the block layer and is mostly limited to
> blk-zoned.c with some small changes in blk-mq.c, blk-merge.c and
> bio.c.
>
> Zone write plugging is implemented using struct blk_zone_wplug. This
> structure includes a spinlock, a BIO list and a work structure to
> handle the submission of plugged BIOs. Zone write plug structures are
> managed using a per-disk hash table.
>
> Plugging of zone write BIOs is done using the function
> blk_zone_write_plug_bio() which returns false if a BIO execution does
> not need to be delayed and true otherwise. This function is called
> from blk_mq_submit_bio() after a BIO is split to avoid large BIOs
> spanning multiple zones which would cause mishandling of zone write
> plugs. This change enables zone write plugging by default for any mq
> request-based block device. BIO-based device drivers can also use zone
> write plugging by explicitly calling blk_zone_write_plug_bio() in their
> ->submit_bio method. For such devices, the driver must ensure that a
> BIO passed to blk_zone_write_plug_bio() is already split and not
> straddling zone boundaries.
>
> Only write and write zeroes BIOs are plugged. Zone write plugging does
> not introduce any significant overhead for other operations. A BIO that
> is being handled through zone write plugging is flagged using the new
> BIO flag BIO_ZONE_WRITE_PLUGGING. A request handling a BIO flagged with
> this new flag is flagged with the new RQF_ZONE_WRITE_PLUGGING flag.
> The completion of flagged BIOs and requests respectively triggers calls
> to the functions blk_zone_write_bio_endio() and
> blk_zone_write_complete_request(). The latter function is used to
> trigger submission of the next plugged BIO using the zone plug work.
> blk_zone_write_bio_endio() does the same for BIO-based devices.
> This ensures that at any time, at most one request (blk-mq devices) or
> one BIO (BIO-based devices) is being executed for any zone. The
> handling of zone write plugs using a per-zone plug spinlock maximizes
> parallelism and device usage by allowing multiple zones to be written
> simultaneously without lock contention.
>
> Zone write plugging ignores flush BIOs without data. However, any flush
> BIO that has data is always plugged so that the write part of the flush
> sequence is serialized with other regular writes.
>
> Given that any BIO handled through zone write plugging will be the only
> BIO in flight for the target zone when it is executed, the unplugging
> and submission of a BIO will have no chance of successfully merging with
> plugged requests or requests in the scheduler. To overcome this
> potential performance degradation, blk_mq_submit_bio() calls the
> function blk_zone_write_plug_attempt_merge() to try to merge other
> plugged BIOs with the one just unplugged and submitted. Successful
> merging is signaled using blk_zone_write_plug_bio_merged(), called from
> bio_attempt_back_merge(). Furthermore, to avoid recalculating the number
> of segments of plugged BIOs to attempt merging, the number of segments
> of a plugged BIO is saved using the new struct bio field
> __bi_nr_segments. To avoid growing the size of struct bio, this field is
> added as a union with the bio_cookie field. This is safe to do as
> polling is always disabled for plugged BIOs.
>
> When BIOs are plugged in a zone write plug, the device request queue
> usage counter is always incremented. This reference is kept and reused
> for blk-mq devices when the plugged BIO is unplugged and submitted
> again using submit_bio_noacct_nocheck(). For this case, the unplugged
> BIO is already flagged with BIO_ZONE_WRITE_PLUGGING and
> blk_mq_submit_bio() proceeds directly to allocating a new request for
> the BIO, re-using the usage reference count taken when the BIO was
> plugged. This extra reference count is dropped in
> blk_zone_write_plug_attempt_merge() for any plugged BIO that is
> successfully merged. Given that BIO-based devices will not take this
> path, the extra reference is dropped after a plugged BIO is unplugged
> and submitted.
>
> Zone write plugs are dynamically allocated and managed using a hash
> table (an array of struct hlist_head) with RCU protection.
> A zone write plug is allocated when a write BIO is received for the
> zone and not freed until the zone is fully written, reset or finished.
> To detect when a zone write plug can be freed, the write state of each
> zone is tracked using a write pointer offset which corresponds to the
> offset of a zone write pointer relative to the zone start. Write
> operations always increment this write pointer offset. Zone reset
> operations set it to 0 and zone finish operations set it to the zone
> size.
>
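The write pointer offset rules above can be sketched in userspace as
follows. The model_* names are hypothetical. Rejecting a write that does not
start at the write pointer is my reading of the "unaligned write" case
mentioned later in the commit message, and treating offset 0 and offset ==
zone size as the two states in which a plug may be freed is an assumption
drawn from the "fully written, reset or finished" wording, not the patch's
actual code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the per-zone write state; not the kernel struct. */
struct model_wp_state {
	unsigned int zone_size;  /* zone size in sectors */
	unsigned int wp_offset;  /* write pointer offset from the zone start */
};

/*
 * A write must start at the current write pointer offset and fit in the
 * zone. On success the offset advances by the number of sectors written.
 */
bool model_write(struct model_wp_state *z, unsigned int offset,
		 unsigned int nr_sectors)
{
	assert(z->wp_offset <= z->zone_size);
	if (offset != z->wp_offset ||
	    nr_sectors > z->zone_size - z->wp_offset)
		return false;  /* unaligned or overflowing write */

	z->wp_offset += nr_sectors;
	return true;
}

/* Zone reset: the write pointer goes back to the zone start. */
void model_reset(struct model_wp_state *z)
{
	z->wp_offset = 0;
}

/* Zone finish: the write pointer jumps to the end of the zone. */
void model_finish(struct model_wp_state *z)
{
	z->wp_offset = z->zone_size;
}

/* Assumption: a plug is only needed while the zone is partially written. */
bool model_plug_can_be_freed(const struct model_wp_state *z)
{
	return z->wp_offset == 0 || z->wp_offset == z->zone_size;
}
```

A partially written zone keeps its plug. A reset or a finish moves the
offset to one of the two end states, at which point the plug no longer
carries any information that cannot be rebuilt.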
> If a write error happens, the wp_offset value of a zone write plug may
> become incorrect and out of sync with the device managed write pointer.
> This is handled using the zone write plug flag BLK_ZONE_WPLUG_ERROR.
> The function blk_zone_wplug_handle_error() is called from the new disk
> zone write plug work when this flag is set. This function executes a
> report zone to update the zone write pointer offset to the current
> value as indicated by the device. The disk zone write plug work is
> scheduled whenever a BIO flagged with BIO_ZONE_WRITE_PLUGGING completes
> with an error or when bio_zone_wplug_prepare_bio() detects an unaligned
> write. Once scheduled, the disk zone write plug work keeps running
> until all zone errors are handled.
>
> To match the new data structures used for zoned disks, the function
> disk_free_zone_bitmaps() is renamed to the more generic
> disk_free_zone_resources(). The function disk_init_zone_resources() is
> also introduced to initialize zone write plug resources when a gendisk
> is allocated.
>
> In order to guarantee that the user can simultaneously write up to a
> number of zones equal to a device max active zone limit or max open zone
> limit, zone write plugs are allocated using a mempool sized to the
> maximum of these 2 device limits. For a device that does not have
> active and open zone limits, 128 is used as the default mempool size.
>
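The mempool sizing rule above boils down to a small computation, sketched
here in userspace. The function and macro names are hypothetical, and
representing "no limit" as 0 is an assumption of this sketch.

```c
#include <assert.h>

/* Default pool size from the commit message for devices without limits. */
#define MODEL_DEFAULT_POOL_SIZE 128u

/*
 * Hypothetical helper: size the pool to the larger of the two limits.
 * A limit of 0 is assumed to mean "the device reports no such limit".
 */
unsigned int model_wplug_pool_size(unsigned int max_open_zones,
				   unsigned int max_active_zones)
{
	unsigned int size = max_open_zones > max_active_zones ?
			    max_open_zones : max_active_zones;

	return size ? size : MODEL_DEFAULT_POOL_SIZE;
}

/* A few illustrative cases, with made-up limit values. */
void model_pool_size_examples(void)
{
	assert(model_wplug_pool_size(0, 0) == MODEL_DEFAULT_POOL_SIZE);
	assert(model_wplug_pool_size(14, 0) == 14);
	assert(model_wplug_pool_size(8, 32) == 32);
}
```

Taking the maximum rather than the minimum is what lets the user write to as
many zones at once as the more permissive of the two limits allows.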
> If a change to the device active and open zone limits is detected, the
> disk mempool is resized when blk_revalidate_disk_zones() is executed.
>
> This commit contains contributions from Christoph Hellwig <hch@lst.de>.
>
> Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
> ---
> block/bio.c | 6 +
> block/blk-merge.c | 11 +
> block/blk-mq.c | 32 +-
> block/blk-zoned.c | 1090 ++++++++++++++++++++++++++++++++++++-
> block/blk.h | 47 +-
> block/genhd.c | 3 +-
> include/linux/blk-mq.h | 2 +
> include/linux/blk_types.h | 8 +-
> include/linux/blkdev.h | 12 +
> 9 files changed, 1200 insertions(+), 11 deletions(-)
>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich