DM-Devel Archive mirror
 help / color / mirror / Atom feed
From: Hannes Reinecke <hare@suse.de>
To: Damien Le Moal <dlemoal@kernel.org>,
	linux-block@vger.kernel.org, Jens Axboe <axboe@kernel.dk>,
	linux-scsi@vger.kernel.org,
	"Martin K . Petersen" <martin.petersen@oracle.com>,
	dm-devel@lists.linux.dev, Mike Snitzer <snitzer@redhat.com>,
	linux-nvme@lists.infradead.org, Keith Busch <kbusch@kernel.org>,
	Christoph Hellwig <hch@lst.de>
Subject: Re: [PATCH v5 07/28] block: Introduce zone write plugging
Date: Wed, 3 Apr 2024 17:28:06 +0200	[thread overview]
Message-ID: <5c89baf3-68f1-4e91-8fa9-efa183297ca8@suse.de> (raw)
In-Reply-To: <20240403084247.856481-8-dlemoal@kernel.org>

On 4/3/24 10:42, Damien Le Moal wrote:
> Zone write plugging implements a per-zone "plug" for write operations
> to control the submission and execution order of write operations to
> sequential write required zones of a zoned block device. Per-zone
> plugging guarantees that at any time there is at most only one write
> request per zone being executed. This mechanism is intended to replace
> zone write locking which implements a similar per-zone write throttling
> at the scheduler level, but is implemented only by mq-deadline.
> 
> Unlike zone write locking which operates on requests, zone write
> plugging operates on BIOs. A zone write plug is simply a BIO list that
> is atomically manipulated using a spinlock and a kblockd submission
> work. A write BIO to a zone is "plugged" to delay its execution if a
> write BIO for the same zone was already issued, that is, if a write
> request for the same zone is being executed. The next plugged BIO is
> unplugged and issued once the write request completes.
> 
> This mechanism allows to:
>   - Untangle zone write ordering from block IO schedulers. This allows
>     removing the restriction on using mq-deadline for writing to zoned
>     block devices. Any block IO scheduler, including "none" can be used.
>   - Zone write plugging operates on BIOs instead of requests. Plugged
>     BIOs waiting for execution thus do not hold scheduling tags and thus
>     are not preventing other BIOs from executing (reads or writes to
>     other zones). Depending on the workload, this can significantly
>     improve the device use (higher queue depth operation) and
>     performance.
>   - Both blk-mq (request based) zoned devices and BIO-based zoned devices
>     (e.g.  device mapper) can use zone write plugging. It is mandatory
>     for the former but optional for the latter. BIO-based drivers can
>     use zone write plugging to implement write ordering guarantees, or
>     the drivers can implement their own if needed.
>   - The code is less invasive in the block layer and is mostly limited to
>     blk-zoned.c with some small changes in blk-mq.c, blk-merge.c and
>     bio.c.
> 
> Zone write plugging is implemented using struct blk_zone_wplug. This
> structure includes a spinlock, a BIO list and a work structure to
> handle the submission of plugged BIOs. Zone write plugs structures are
> managed using a per-disk hash table.
> 
> Plugging of zone write BIOs is done using the function
> blk_zone_write_plug_bio() which returns false if a BIO execution does
> not need to be delayed and true otherwise. This function is called
> from blk_mq_submit_bio() after a BIO is split to avoid large BIOs
> spanning multiple zones which would cause mishandling of zone write
> plugs. This ichange enables by default zone write plugging for any mq
> request-based block device. BIO-based device drivers can also use zone
> write plugging by expliclty calling blk_zone_write_plug_bio() in their
> ->submit_bio method. For such devices, the driver must ensure that a
> BIO passed to blk_zone_write_plug_bio() is already split and not
> straddling zone boundaries.
> 
> Only write and write zeroes BIOs are plugged. Zone write plugging does
> not introduce any significant overhead for other operations. A BIO that
> is being handled through zone write plugging is flagged using the new
> BIO flag BIO_ZONE_WRITE_PLUGGING. A request handling a BIO flagged with
> this new flag is flagged with the new RQF_ZONE_WRITE_PLUGGING flag.
> The completion of BIOs and requests flagged trigger respectively calls
> to the functions blk_zone_write_bio_endio() and
> blk_zone_write_complete_request(). The latter function is used to
> trigger submission of the next plugged BIO using the zone plug work.
> blk_zone_write_bio_endio() does the same for BIO-based devices.
> This ensures that at any time, at most one request (blk-mq devices) or
> one BIO (BIO-based devices) is being executed for any zone. The
> handling of zone write plugs using a per-zone plug spinlock maximizes
> parallelism and device usage by allowing multiple zones to be writen
> simultaneously without lock contention.
> 
> Zone write plugging ignores flush BIOs without data. Hovever, any flush
> BIO that has data is always plugged so that the write part of the flush
> sequence is serialized with other regular writes.
> 
> Given that any BIO handled through zone write plugging will be the only
> BIO in flight for the target zone when it is executed, the unplugging
> and submission of a BIO will have no chance of successfully merging with
> plugged requests or requests in the scheduler. To overcome this
> potential performance degradation, blk_mq_submit_bio() calls the
> function blk_zone_write_plug_attempt_merge() to try to merge other
> plugged BIOs with the one just unplugged and submitted. Successful
> merging is signaled using blk_zone_write_plug_bio_merged(), called from
> bio_attempt_back_merge(). Furthermore, to avoid recalculating the number
> of segments of plugged BIOs to attempt merging, the number of segments
> of a plugged BIO is saved using the new struct bio field
> __bi_nr_segments. To avoid growing the size of struct bio, this field is
> added as a union with the bio_cookie field. This is safe to do as
> polling is always disabled for plugged BIOs.
> 
> When BIOs are plugged in a zone write plug, the device request queue
> usage counter is always incremented. This reference is kept and reused
> for blk-mq devices when the plugged BIO is unplugged and submitted
> again using submit_bio_noacct_nocheck(). For this case, the unplugged
> BIO is already flagged with BIO_ZONE_WRITE_PLUGGING and
> blk_mq_submit_bio() proceeds directly to allocating a new request for
> the BIO, re-using the usage reference count taken when the BIO was
> plugged. This extra reference count is dropped in
> blk_zone_write_plug_attempt_merge() for any plugged BIO that is
> successfully merged. Given that BIO-based devices will not take this
> path, the extra reference is dropped after a plugged BIO is unplugged
> and submitted.
> 
> Zone write plugs are dynamically allocated and managed using a hash
> table (an array of struct hlist_head) with RCU protection.
> A zone write plug is allocated when a write BIO is received for the
> zone and not freed until the zone is fully written, reset or finished.
> To detect when a zone write plug can be freed, the write state of each
> zone is tracked using a write pointer offset which corresponds to the
> offset of a zone write pointer relative to the zone start. Write
> operations always increment this write pointer offset. Zone reset
> operations set it to 0 and zone finish operations set it to the zone
> size.
> 
> If a write error happens, the wp_offset value of a zone write plug may
> become incorrect and out of sync with the device managed write pointer.
> This is handled using the zone write plug flag BLK_ZONE_WPLUG_ERROR.
> The function blk_zone_wplug_handle_error() is called from the new disk
> zone write plug work when this flag is set. This function executes a
> report zone to update the zone write pointer offset to the current
> value as indicated by the device. The disk zone write plug work is
> scheduled whenever a BIO flagged with BIO_ZONE_WRITE_PLUGGING completes
> with an error or when bio_zone_wplug_prepare_bio() detects an unaligned
> write. Once scheduled, the disk zone write plugs work keeps running
> until all zone errors are handled.
> 
> To match the new data structures used for zoned disks, the function
> disk_free_zone_bitmaps() is renamed to the more generic
> disk_free_zone_resources(). The function disk_init_zone_resources() is
> also introduced to initialize zone write plugs resources when a gendisk
> is allocated.
> 
> In order to guarantee that the user can simultaneously write up to a
> number of zones equal to a device max active zone limit or max open zone
> limit, zone write plugs are allocated using a mempool sized to the
> maximum of these 2 device limits. For a device that does not have
> active and open zone limits, 128 is used as the default mempool size.
> 
> If a change to the device active and open zone limits is detected, the
> disk mempool is resized when blk_revalidate_disk_zones() is executed.
> 
> This commit contains contributions from Christoph Hellwig <hch@lst.de>.
> 
> Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
> ---
>   block/bio.c               |    6 +
>   block/blk-merge.c         |   11 +
>   block/blk-mq.c            |   32 +-
>   block/blk-zoned.c         | 1090 ++++++++++++++++++++++++++++++++++++-
>   block/blk.h               |   47 +-
>   block/genhd.c             |    3 +-
>   include/linux/blk-mq.h    |    2 +
>   include/linux/blk_types.h |    8 +-
>   include/linux/blkdev.h    |   12 +
>   9 files changed, 1200 insertions(+), 11 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich


  parent reply	other threads:[~2024-04-03 15:28 UTC|newest]

Thread overview: 43+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-04-03  8:42 [PATCH v5 00/28] Zone write plugging Damien Le Moal
2024-04-03  8:42 ` [PATCH v5 01/28] block: Restore sector of flush requests Damien Le Moal
2024-04-03  8:42 ` [PATCH v5 02/28] block: Remove req_bio_endio() Damien Le Moal
2024-04-03  8:42 ` [PATCH v5 03/28] block: Introduce blk_zone_update_request_bio() Damien Le Moal
2024-04-03  8:42 ` [PATCH v5 04/28] block: Introduce bio_straddles_zones() and bio_offset_from_zone_start() Damien Le Moal
2024-04-03  8:42 ` [PATCH v5 05/28] block: Allow using bio_attempt_back_merge() internally Damien Le Moal
2024-04-03  8:42 ` [PATCH v5 06/28] block: Remember zone capacity when revalidating zones Damien Le Moal
2024-04-04 17:35   ` Bart Van Assche
2024-04-03  8:42 ` [PATCH v5 07/28] block: Introduce zone write plugging Damien Le Moal
2024-04-03  9:44   ` Christoph Hellwig
2024-04-03 15:28   ` Hannes Reinecke [this message]
2024-04-04 18:31   ` Bart Van Assche
2024-04-04 23:07     ` Damien Le Moal
2024-04-04 23:18     ` Damien Le Moal
2024-04-04 23:42       ` Bart Van Assche
2024-04-05  0:02         ` Damien Le Moal
2024-04-03  8:42 ` [PATCH v5 08/28] block: Fake max open zones limit when there is no limit Damien Le Moal
2024-04-03  9:43   ` Christoph Hellwig
2024-04-03 15:28   ` Hannes Reinecke
2024-04-04 19:04   ` Bart Van Assche
2024-04-03  8:42 ` [PATCH v5 09/28] block: Allow zero value of max_zone_append_sectors queue limit Damien Le Moal
2024-04-03  8:42 ` [PATCH v5 10/28] block: Implement zone append emulation Damien Le Moal
2024-04-03  8:42 ` [PATCH v5 11/28] block: Allow BIO-based drivers to use blk_revalidate_disk_zones() Damien Le Moal
2024-04-04 19:10   ` Bart Van Assche
2024-04-03  8:42 ` [PATCH v5 12/28] dm: Use the block layer zone append emulation Damien Le Moal
2024-04-03  8:42 ` [PATCH v5 13/28] scsi: sd: " Damien Le Moal
2024-04-03  8:42 ` [PATCH v5 14/28] ublk_drv: Do not request ELEVATOR_F_ZBD_SEQ_WRITE elevator feature Damien Le Moal
2024-04-03  8:42 ` [PATCH v5 15/28] null_blk: " Damien Le Moal
2024-04-03  8:42 ` [PATCH v5 16/28] null_blk: Introduce zone_append_max_sectors attribute Damien Le Moal
2024-04-04 19:19   ` Bart Van Assche
2024-04-03  8:42 ` [PATCH v5 17/28] null_blk: Introduce fua attribute Damien Le Moal
2024-04-03  8:42 ` [PATCH v5 18/28] nvmet: zns: Do not reference the gendisk conv_zones_bitmap Damien Le Moal
2024-04-03  8:42 ` [PATCH v5 19/28] block: Remove BLK_STS_ZONE_RESOURCE Damien Le Moal
2024-04-03  8:42 ` [PATCH v5 20/28] block: Simplify blk_revalidate_disk_zones() interface Damien Le Moal
2024-04-03  8:42 ` [PATCH v5 21/28] block: mq-deadline: Remove support for zone write locking Damien Le Moal
2024-04-03  8:42 ` [PATCH v5 22/28] block: Remove elevator required features Damien Le Moal
2024-04-03  8:42 ` [PATCH v5 23/28] block: Do not check zone type in blk_check_zone_append() Damien Le Moal
2024-04-03  8:42 ` [PATCH v5 24/28] block: Move zone related debugfs attribute to blk-zoned.c Damien Le Moal
2024-04-03  8:42 ` [PATCH v5 25/28] block: Replace zone_wlock debugfs entry with zone_wplugs entry Damien Le Moal
2024-04-03  8:42 ` [PATCH v5 26/28] block: Remove zone write locking Damien Le Moal
2024-04-03  8:42 ` [PATCH v5 27/28] block: Do not force select mq-deadline with CONFIG_BLK_DEV_ZONED Damien Le Moal
2024-04-03  8:42 ` [PATCH v5 28/28] block: Do not special-case plugging of zone write operations Damien Le Moal
2024-04-04 13:40 ` [PATCH v5 00/28] Zone write plugging Hans Holmberg

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=5c89baf3-68f1-4e91-8fa9-efa183297ca8@suse.de \
    --to=hare@suse.de \
    --cc=axboe@kernel.dk \
    --cc=dlemoal@kernel.org \
    --cc=dm-devel@lists.linux.dev \
    --cc=hch@lst.de \
    --cc=kbusch@kernel.org \
    --cc=linux-block@vger.kernel.org \
    --cc=linux-nvme@lists.infradead.org \
    --cc=linux-scsi@vger.kernel.org \
    --cc=martin.petersen@oracle.com \
    --cc=snitzer@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).