All the mail mirrored from lore.kernel.org
From: Vadim Fedorenko <vadim.fedorenko@linux.dev>
To: David Howells <dhowells@redhat.com>,
	Christian Brauner <christian@brauner.io>,
	Jeff Layton <jlayton@kernel.org>,
	Gao Xiang <hsiangkao@linux.alibaba.com>,
	Dominique Martinet <asmadeus@codewreck.org>
Cc: Matthew Wilcox <willy@infradead.org>,
	Steve French <smfrench@gmail.com>,
	Marc Dionne <marc.dionne@auristor.com>,
	Paulo Alcantara <pc@manguebit.com>,
	Shyam Prasad N <sprasad@microsoft.com>,
	Tom Talpey <tom@talpey.com>,
	Eric Van Hensbergen <ericvh@kernel.org>,
	Ilya Dryomov <idryomov@gmail.com>,
	netfs@lists.linux.dev, linux-cachefs@redhat.com,
	linux-afs@lists.infradead.org, linux-cifs@vger.kernel.org,
	linux-nfs@vger.kernel.org, ceph-devel@vger.kernel.org,
	v9fs@lists.linux.dev, linux-erofs@lists.ozlabs.org,
	linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
	netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
	Latchesar Ionkov <lucho@ionkov.net>,
	Christian Schoenebeck <linux_oss@crudebyte.com>
Subject: Re: [PATCH 19/26] netfs: New writeback implementation
Date: Fri, 29 Mar 2024 18:03:21 -0700
Message-ID: <b1dd93cb-2025-44c5-87db-17302564c040@linux.dev>
In-Reply-To: <20240328163424.2781320-20-dhowells@redhat.com>

On 28/03/2024 16:34, David Howells wrote:
> The current netfslib writeback implementation creates writeback requests of
> contiguous folio data and then separately tiles subrequests over the space
> twice, once for the server and once for the cache.  This creates a few
> issues:
> 
>   (1) Every time there's a discontiguity or a change between writing to only
>       one destination or writing to both, it must create a new request.
>       This makes it harder to do vectored writes.
> 
>   (2) The folios don't have the writeback mark removed until the end of the
>       request - and a request could be hundreds of megabytes.
> 
>   (3) In future, I want to support a larger cache granularity, which will
>       require aggregation of some folios that contain unmodified data (which
>       only need to go to the cache) and some which contain modifications
>       (which need to be uploaded and stored to the cache) - but, currently,
>       these are treated as discontiguous.
> 
> There's also a move to get everyone to use writeback_iter() to extract
> writable folios from the pagecache.  That said, currently writeback_iter()
> has some issues that make it less than ideal:
> 
>   (1) there's no way to cancel the iteration, even if you find a "temporary"
>       error that means the current folio and all subsequent folios are going
>       to fail;
> 
>   (2) there's no way to filter the folios being written back - something
>       that will impact Ceph with its ordered snap system;
> 
>   (3) and if you get a folio you can't immediately deal with (say you need
>       to flush the preceding writes), you are left with a folio hanging in
>       the locked state for the duration, when really we should unlock it and
>       relock it later.
> 
> In this new implementation, I use writeback_iter() to pump folios,
> progressively creating two parallel, but separate streams and cleaning up
> the finished folios as the subrequests complete.  Either or both streams
> can contain gaps, and the subrequests in each stream can be of variable
> size, don't need to align with each other and don't need to align with the
> folios.
> 
> Indeed, subrequests can cross folio boundaries, may cover several folios or
> a folio may be spanned by multiple subrequests, e.g.:
> 
>           +---+---+-----+-----+---+----------+
> Folios:  |   |   |     |     |   |          |
>           +---+---+-----+-----+---+----------+
> 
>             +------+------+     +----+----+
> Upload:    |      |      |.....|    |    |
>             +------+------+     +----+----+
> 
>           +------+------+------+------+------+
> Cache:   |      |      |      |      |      |
>           +------+------+------+------+------+
> 
> The progressive subrequest construction permits the algorithm to be
> preparing both the next upload to the server and the next write to the
> cache whilst the previous ones are already in progress.  Throttling can be
> applied to control the rate of production of subrequests - and, in any
> case, we probably want to write them to the server in ascending order,
> particularly if the file will be extended.
> 
> Content crypto can also be prepared at the same time as the subrequests and
> run asynchronously, with the prepped requests being stalled until the
> crypto catches up with them.  This might also be useful for transport
> crypto, but that happens at a lower layer, so probably would be harder to
> pull off.
> 
> The algorithm is split into three parts:
> 
>   (1) The issuer.  This walks through the data, packaging it up, encrypting
>       it and creating subrequests.  The part of this that generates
>       subrequests only deals with file positions and spans and so is usable
>       for DIO/unbuffered writes as well as buffered writes.
> 
>   (2) The collector. This asynchronously collects completed subrequests,
>       unlocks folios, frees crypto buffers and performs any retries.  This
>       runs in a work queue so that the issuer can return to the caller for
>       writeback (so that the VM can have its kswapd thread back) or async
>       writes.
> 
>   (3) The retryer.  This pauses the issuer, waits for all outstanding
>       subrequests to complete and then goes through the failed subrequests
>       to reissue them.  This may involve reprepping them (with cifs, the
>       credits must be renegotiated, and a subrequest may need splitting),
>       and doing RMW for content crypto if there's a conflicting change on
>       the server.
> 
> [!] Note that some of the functions are prefixed with "new_" to avoid
> clashes with existing functions.  These will be renamed in a later patch
> that cuts over to the new algorithm.
> 
> Signed-off-by: David Howells <dhowells@redhat.com>
> cc: Jeff Layton <jlayton@kernel.org>
> cc: Eric Van Hensbergen <ericvh@kernel.org>
> cc: Latchesar Ionkov <lucho@ionkov.net>
> cc: Dominique Martinet <asmadeus@codewreck.org>
> cc: Christian Schoenebeck <linux_oss@crudebyte.com>
> cc: Marc Dionne <marc.dionne@auristor.com>
> cc: v9fs@lists.linux.dev
> cc: linux-afs@lists.infradead.org
> cc: netfs@lists.linux.dev
> cc: linux-fsdevel@vger.kernel.org

[..snip..]
> +/*
> + * Begin a write operation for writing through the pagecache.
> + */
> +struct netfs_io_request *new_netfs_begin_writethrough(struct kiocb *iocb, size_t len)
> +{
> +	struct netfs_io_request *wreq = NULL;
> +	struct netfs_inode *ictx = netfs_inode(file_inode(iocb->ki_filp));
> +
> +	mutex_lock(&ictx->wb_lock);
> +
> +	wreq = netfs_create_write_req(iocb->ki_filp->f_mapping, iocb->ki_filp,
> +				      iocb->ki_pos, NETFS_WRITETHROUGH);
> +	if (IS_ERR(wreq))
> +		mutex_unlock(&ictx->wb_lock);
> +
> +	wreq->io_streams[0].avail = true;

If IS_ERR(wreq) is true, execution falls through after the unlock and this
dereference of an error pointer is invalid.

> +	trace_netfs_write(wreq, netfs_write_trace_writethrough);

Not sure we still need the trace call on the error path either.

> +	return wreq;
> +}
> +
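
To make the point above concrete, here is a minimal userspace sketch of the
early-return shape I have in mind. It is not the kernel code: every fake_*
name and begin_writethrough() are hypothetical stand-ins, the ERR_PTR() /
IS_ERR() / PTR_ERR() helpers are simplified re-implementations for
illustration, and the lock is modelled as a counter so the behaviour can be
checked. As in the quoted function, the lock stays held on success.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

#define MAX_ERRNO 4095

/* Simplified stand-ins for the kernel's error-pointer helpers. */
static inline void *ERR_PTR(long err) { return (void *)(intptr_t)err; }
static inline long PTR_ERR(const void *p) { return (long)(intptr_t)p; }
static inline bool IS_ERR(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)-MAX_ERRNO;
}

/* Hypothetical stand-ins for the request, stream and inode types. */
struct fake_stream { bool avail; };
struct fake_request { struct fake_stream io_streams[2]; };
struct fake_inode { int wb_lock_depth; };	/* models wb_lock as a counter */

static void fake_lock(struct fake_inode *ictx) { ictx->wb_lock_depth++; }
static void fake_unlock(struct fake_inode *ictx) { ictx->wb_lock_depth--; }

/* Hypothetical allocator: returns an error pointer when asked to fail. */
static struct fake_request *fake_create_write_req(bool fail)
{
	struct fake_request *req;

	if (fail)
		return ERR_PTR(-ENOMEM);
	req = calloc(1, sizeof(*req));
	return req ? req : ERR_PTR(-ENOMEM);
}

static struct fake_request *begin_writethrough(struct fake_inode *ictx,
					       bool fail)
{
	struct fake_request *wreq;

	fake_lock(ictx);

	wreq = fake_create_write_req(fail);
	if (IS_ERR(wreq)) {
		/* Drop the lock and return before wreq is ever touched. */
		fake_unlock(ictx);
		return wreq;
	}

	/* Only reached with a valid request; the lock stays held. */
	wreq->io_streams[0].avail = true;
	return wreq;
}
```

With this shape the error pointer is handed straight back to the caller, so
neither the io_streams[] store nor any later tracing sees an invalid wreq.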

[..snip..]



